Files And Persistence

Motivation

Most computer activities take place over short periods - a few minutes at most. Mathematicians, however, are prone to computations that can take hours, days, weeks, or even, at the extreme end, years to run.

These computations cannot be relied upon to complete cleanly in one session with no oversight. For this sort of thing, we need persistence - the ability to periodically save the state of our computation, and resume it.

If your computation loop is simple - for example, searching all the integers up to some value $n$ - then you can print relevant information to the terminal with print. As we covered in Bash, this can be easily turned into a file with the File Write Operator >.

Accepting a parameter for where to start the search - read in Python from the sys.argv list - gives a simple, fast, and easy way to save and resume your state. With a bit of Bash knowledge, you can feed that parameter in from the tail of the output file.
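As a minimal sketch of this pattern, assume a hypothetical script called search.py whose test is_interesting is a stand-in for whatever you are really searching for (here, perfect squares). None of these names come from the course materials.

```python
import math
import sys

def is_interesting(k):
    # Hypothetical stand-in for the real test: is k a perfect square?
    return math.isqrt(k) ** 2 == k

def search(start, stop):
    """Print every interesting integer in [start, stop) and return them."""
    found = []
    for k in range(start, stop):
        if is_interesting(k):
            print(k, flush=True)  # flush so a redirected log stays current
            found.append(k)
    return found

def parse_start(argv):
    # The first command-line argument, if it is a number, is where to resume.
    if len(argv) > 1 and argv[1].isdigit():
        return int(argv[1])
    return 1

if __name__ == "__main__":
    search(parse_start(sys.argv), 100)
```

Running python search.py >> results.txt appends each hit to a file. After an interruption, something like python search.py $(( $(tail -n 1 results.txt) + 1 )) >> results.txt would resume just past the last value recorded.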

However, the real world is often more complicated and Python allows us some tools to deal with that.

File Objects

Reading

In Python, files are opened with the open function - but must be closed cleanly when you are done.

Fortunately, in pretty much all contexts, you can use another piece of Python syntax - called a context manager - to handle the closing for you.

Say we have a file exampledata.csv. This is a comma separated value file - essentially, a list of lists of values.

exampledata.csv

1,20,5
2,19
3,18

In Python, we can get the contents of the file as a string with:

with open("exampledata.csv") as file:
    data = file.read()

The with context management statement allows us to create a context in which the file is open and accessible - and cleanly close it whenever it exits, even if there was an error.
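To see the "even if there was an error" part in action, here is a small sketch that raises an exception inside the block and then checks that the file was closed anyway. The temporary directory and the simulated failure are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

# Create a small throwaway copy of the example file.
path = os.path.join(tempfile.mkdtemp(), "exampledata.csv")
with open(path, "w") as file:
    file.write("1,20,5\n2,19\n3,18\n")

try:
    with open(path) as file:
        first_line = file.readline()
        raise ValueError("simulated failure while processing")
except ValueError:
    pass

# The name `file` still refers to the file object, but it has been closed.
closed_after_error = file.closed
print(first_line.strip(), closed_after_error)
```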

Note that we pull the data out as a string inside the context block. Once the file is closed, we can parse that string into numbers using string methods and list comprehensions:

with open("exampledata.csv") as file:
    data = file.read()
lines = data.splitlines()
# skip blank lines, which would make int('') fail
int_data = [[int(i) for i in line.split(',')] for line in lines if line]

File Readers

This works well for simple files - but often you will want to use a storage system that can handle the parsing in and out for you.

The three you will use are the modules:

module	use case
`csv`	importing simple data from other programs; storing tabular data; simple numbers/strings
`json`	importing complex data; storing hierarchical data; simple objects;
`pickle`	Complex objects stored (and restored)

We will touch on the others at the end, but for now, we will try the csv module:

import csv
with open("exampledata.csv") as file:
    reader = csv.reader(file)
    data = list(reader)

Note that csv.reader creates an iterator - so you can read the rows one at a time, much as you would loop over a range. An iterator can only be consumed once, so we convert the output to a list to make it reusable.

This gets the csv values as strings - and handles strange cases, such as delimiter escapes (for example, commas within data.)
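A short sketch of why this matters: a field containing a comma breaks a naive split, while csv.reader respects the quoting. The sample text is made up for illustration, and io.StringIO stands in for an open file.

```python
import csv
import io

# One header row and one data row whose first field contains a comma
# and whose second field contains escaped double quotes.
raw = 'name,comment\n"Noether, Emmy","said ""hello"""\n'

rows = list(csv.reader(io.StringIO(raw)))

# Splitting on commas by hand cuts the quoted field in two.
naive = raw.splitlines()[1].split(",")

print(rows[1])
print(naive)
```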

We may still have to go through and parse the data, since each row is now a list of strings:

int_data = [[int(i) for i in row] for row in data]

Note that the name csv is somewhat misleading. The csv.reader function can accept an extra parameter - for example, csv.reader(file, delimiter='\t') - to use other delimiters. The csv module is your go-to resource for handling any tabular data.
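For instance, the same example data stored with tabs instead of commas could be read like this. The tab-separated text is an assumed variant of exampledata.csv, held in an io.StringIO so the sketch needs no file on disk.

```python
import csv
import io

# The example data again, but tab-separated.
tsv_text = "1\t20\t5\n2\t19\n3\t18\n"

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
int_data = [[int(value) for value in row] for row in reader]

print(int_data)
```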

Writing

Now that we have handled reading files, let's talk about writing new files.

The syntax is almost identical, with the optional parameter 'w' for write:

with open("exampleoutput.txt",'w') as f:
    f.write("An output for the future.")

The context manager will automatically close the file for you, and finish the write.

There are other modes you can pass to open - most notably 'a' - which appends to the end of a file instead of overwriting it, allowing you to update it with new lines.
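A minimal sketch of append mode used as a progress log. The file name progress.log and the helper record are hypothetical, and the file lives in a temporary directory only to keep the example tidy.

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "progress.log")

def record(message):
    # 'a' creates the file if it is missing and always writes at the end.
    with open(log_path, "a") as f:
        f.write(message + "\n")

record("checked up to 1000")
record("checked up to 2000")

with open(log_path) as f:
    lines = f.read().splitlines()

print(lines)
```

Because each call reopens and closes the file, every line reaches the disk before the computation moves on, which is what you want if the process may be interrupted.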

Writers

Just as with reading, for machine-readable files, you can use writers.

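csv implements csv.writer. Opening the file with newline='' lets the csv module control the line endings itself, which avoids stray blank rows on some platforms:

with open("example.csv",'w',newline='') as f: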
    writer = csv.writer(f)
    writer.writerow([1,2,3])
    writer.writerows([["text",5],
                      [1,2,3,4]])

Advanced CSV Read/Writing with csv.DictReader and csv.DictWriter

When I store or read CSVs, it is often tabular data with some semantic meaning to the columns - a name, an ID, perhaps a grade. To keep track of which column is which, I use the more advanced:

csv.DictReader

For example, if we had some student data:

studentdata.csv

"last_name","first_name","id"
"Bond","James",007
"Fleming","Ian",000

I could parse this with:

with open('studentdata.csv') as csvfile:
    students = list(csv.DictReader(csvfile, quoting=csv.QUOTE_NONNUMERIC))

to read in the values and parse them. This creates a list of dictionaries - I could iterate over the students, and access student['first_name'] to get the first name. With QUOTE_NONNUMERIC, unquoted fields such as the id are converted to floats, so 007 is read as 7.0.

If I wanted to write them again, I could use:

with open('studentdata.csv','w',newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=students[0].keys(), quoting=csv.QUOTE_NONNUMERIC)
    writer.writeheader()
    writer.writerows(students)

to store the values.
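Putting the read and the write together, the following sketch round-trips the student data through csv.DictReader and csv.DictWriter. It writes to a temporary directory rather than next to your script; that choice, and the output file name, are assumptions for the sake of a self-contained example.

```python
import csv
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, "studentdata.csv")
out_path = os.path.join(folder, "studentdata_copy.csv")

# Recreate the sample file from the text.
with open(path, "w", newline="") as csvfile:
    csvfile.write('"last_name","first_name","id"\n'
                  '"Bond","James",007\n'
                  '"Fleming","Ian",000\n')

with open(path, newline="") as csvfile:
    students = list(csv.DictReader(csvfile, quoting=csv.QUOTE_NONNUMERIC))

# Each row is a dictionary keyed by the header names.
first_names = [student["first_name"] for student in students]
ids = [student["id"] for student in students]  # unquoted fields become floats

with open(out_path, "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(students[0].keys()),
                            quoting=csv.QUOTE_NONNUMERIC)
    writer.writeheader()
    writer.writerows(students)

with open(out_path, newline="") as csvfile:
    reloaded = list(csv.DictReader(csvfile, quoting=csv.QUOTE_NONNUMERIC))
```

Note that the leading zeros of the ids do not survive the conversion to float. If they matter, quoting the id column in the file keeps it as a string.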

Storing Dictionaries with JSON

Hierarchical data is remarkably useful in practice. The prevailing standard is called JSON - the JavaScript Object Notation. While we won't be talking about JavaScript much in this course, the relevant part is that - just as (almost) everything in Python is an object - (almost) everything in JavaScript is a dictionary.

For dictionaries containing trees of lists, strings, floats, and other simple types (with string keys), you can store and load them almost trivially with:

import json

example_dict = {'1':'algebra',
                '2':{
                  'betamax':['coin',3,4]
                }}

with open("example.json","w") as f:
    json.dump(example_dict,f)

with open("example.json") as f:
    reloaded_example_dict = json.load(f)

assert reloaded_example_dict == example_dict

If your object can't be converted easily to a list of integers or floating point numbers, chances are it can be converted to and from a dictionary. In that case, JSON is also an excellent option.
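One way to do that conversion is to give the class a pair of methods, one producing a dictionary and one rebuilding the object from it. The Polynomial class and the method names to_dict and from_dict below are hypothetical, chosen only to illustrate the idea. The last lines also show why the string-keys caveat matters.

```python
import json

class Polynomial:
    """Hypothetical example class: integer coefficients, lowest degree first."""

    def __init__(self, coefficients, variable="x"):
        self.coefficients = list(coefficients)
        self.variable = variable

    def to_dict(self):
        # Only lists, strings and numbers, so JSON can store it directly.
        return {"coefficients": self.coefficients, "variable": self.variable}

    @classmethod
    def from_dict(cls, d):
        return cls(d["coefficients"], d["variable"])

p = Polynomial([1, 0, -3])
text = json.dumps(p.to_dict())              # object -> dict -> JSON string
q = Polynomial.from_dict(json.loads(text))  # JSON string -> dict -> object

# JSON keys are always strings: an integer key comes back as '1'.
int_keyed = json.loads(json.dumps({1: "algebra"}))
```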

Saving Complicated Objects

String Conversion of More Complex Objects

Philosophically, every state in the computation of mathematics can be reduced to integers - and it is often good to think about how that is.

However, there is often friction if you have to go back and forth all the time in practice.

Let's say that you are working on Elliptic Curves over Finite Fields. You want to store a specific Elliptic Curve, but you don't have an obvious way to store the object.

We have talked in the class section about the __repr__ magic method. It is your responsibility to figure out some valid way to represent your objects in terms of basic Python types - likely you had to do that when constructing them.

You can get the representation of an object with repr(). For most standard objects - and all objects where that representation makes sense - evaluating that string will recreate the Python object.

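You can evaluate a string with eval(). Note, however, that this runs whatever Python code is in the string - so if the string comes from user input, or any other source you do not control, it can do all kinds of Python things, including destructive ones. Only call eval() on strings you trust.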
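As a sketch of the round trip, here is a deliberately tiny, hypothetical EllipticCurve class (not the one from any particular library) whose __repr__ is a valid constructor call. The final line uses ast.literal_eval, a standard library function that only accepts plain literals, as a safer alternative when the stored data is just numbers, strings, tuples, lists, and dictionaries.

```python
import ast

class EllipticCurve:
    """Hypothetical minimal class: y^2 = x^3 + a*x + b over the integers mod p."""

    def __init__(self, a, b, p):
        self.a, self.b, self.p = a, b, p

    def __repr__(self):
        # Written so that evaluating the string calls the constructor again.
        return f"EllipticCurve({self.a!r}, {self.b!r}, {self.p!r})"

    def __eq__(self, other):
        return (isinstance(other, EllipticCurve)
                and (self.a, self.b, self.p) == (other.a, other.b, other.p))

E = EllipticCurve(2, 3, 97)
text = repr(E)          # a string that could be written to a file
restored = eval(text)   # acceptable only because we produced `text` ourselves

# For plain literals, literal_eval parses the text without running code.
params = ast.literal_eval("(2, 3, 97)")
```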

Pickling

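The pickle module should, in many ways, be your last resort for storing objects. Essentially json and repr taken to extremes, it turns Python objects with picklable content - most things, with exceptions such as lambdas and open file handles - into a sequence of bytes that represents the object. Functions and classes are stored by reference to their definition, not by value, so the code that defines them must still be importable when you load.

import pickle

# T is whatever object you want to save
with open("trianglepickle.pickle","wb") as f:
    pickle.dump(T,f)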

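This does not rely on any __repr__ methods. However, loading a pickle is just as unsafe as calling eval on a repr - unpickling data from an untrusted source can run arbitrary code - and the file is much harder to read. Still, it can't be beat for quickly dumping state into files.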

Pickled objects can be loaded just like json:

with open("trianglepickle.pickle","rb") as f:
    T = pickle.load(f)
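A self-contained sketch of the full round trip follows. Since T is not defined in this section, the dictionary below is a hypothetical stand-in, deliberately built from types that JSON cannot represent directly (tuples, a set, a complex number), and the file goes into a temporary directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the state being saved.
T = {
    "vertices": [(0, 0), (4, 0), (0, 3)],
    "visited": {2, 3, 5},
    "root": complex(1, 2),
}

path = os.path.join(tempfile.mkdtemp(), "trianglepickle.pickle")

with open(path, "wb") as f:   # pickles are bytes, hence the 'b'
    pickle.dump(T, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# Unlike JSON, the original types come back unchanged.
print(type(restored["vertices"][0]).__name__, type(restored["visited"]).__name__)
```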

Timestamps

Computations take time, and data becomes obsolete. Because of these invariants, there is a helpful concept to keep track of: time.

One thing every programmer learns some day is: NEVER DEAL WITH TIME WITHOUT HELP.

I won't get into all the whys of this - they range from unusual nuances of our culture and orbit (leap days and leap seconds) to ancient mistakes (Windows setting system time to local time) to modern cruelties (daylight saving time).

It is not an exaggeration to say billions of dollars have been lost by corporations improperly rolling their own time. Don't be like them. Use datetime.

datetime is a standard library module which gives access to the (confusingly named) datetime and timedelta classes, representing a point in time - and a difference between points in time - respectively.
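A brief sketch of how the two classes interact: subtracting two datetime objects gives a timedelta, and adding a timedelta to a datetime gives another datetime. The specific dates are made up for illustration.

```python
from datetime import datetime, timedelta

# Illustrative start and end times of a long computation.
started = datetime(2024, 1, 31, 22, 30)
finished = datetime(2024, 2, 2, 1, 15)

elapsed = finished - started                 # datetime - datetime -> timedelta
deadline = started + timedelta(hours=48)     # datetime + timedelta -> datetime

print(elapsed, elapsed.total_seconds(), deadline)
```

The month boundary is handled for you, which is exactly the kind of detail that is easy to get wrong by hand.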

We just need to get now, so we will call the class method datetime.now(), which builds a datetime for us:

from datetime import datetime

now = datetime.now()

This gets us now, as a datetime.

You can get a lot of human-relevant information from the attributes. Looking through the help, some relevant ones are:

now.year
now.month
now.day

and so on.

You can get a default string of the time by formatting it as a string:

str(now)

Or specify your own format using the versatile Date Formatting Strings, such as:

now.strftime("%A %B %d, %Y")

However, the standard for computer-readable time is called a timestamp (occasionally, Unix timestamp, to distinguish from worse ideas.)

Roughly speaking, in Python, this is given as the number of seconds since the Unix Epoch. To find out when that is, we can ask for timestamp 0 in UTC (without the tz argument, the answer is shifted into your local time zone):

from datetime import datetime, timezone
datetime.fromtimestamp(0, tz=timezone.utc).strftime("%A %B %d, %Y")

To get a timestamp for now as a floating point number, simply run:

datetime.now().timestamp()
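Tying this back to persistence, a timestamp is easy to store next to saved state, so that on resuming you can tell how old a checkpoint is. This is only a sketch: the file name checkpoint.json, the helper names, and the shape of the state dictionary are all assumptions.

```python
import json
import os
import tempfile
from datetime import datetime

checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

def save_checkpoint(path, state):
    # Store the state together with the moment it was saved.
    record = {"saved_at": datetime.now().timestamp(), "state": state}
    with open(path, "w") as f:
        json.dump(record, f)

def load_checkpoint(path):
    with open(path) as f:
        record = json.load(f)
    # Turn the float back into a datetime in local time.
    saved_at = datetime.fromtimestamp(record["saved_at"])
    return saved_at, record["state"]

save_checkpoint(checkpoint_path, {"next_n": 1001, "found": [1, 4, 9]})
saved_at, state = load_checkpoint(checkpoint_path)

age = datetime.now() - saved_at   # a timedelta
print(state, age.total_seconds())
```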

Worksheet

Today's worksheet takes you through some file processing examples, with real-world situations where you might read, write, and append to files.