Pandas: Pseudodatabases for Pseudopeople


Motivation

Real-world data is, quite often, not neatly tabular. It comes with metadata - like titles - or related data of different types. One datum - like a student record - may match up with several gradebooks.

Keeping track of these relationships, titles, and metadata can quickly get out of hand. Pandas (the software) is a bridge - metaphorically, and if you configure it correctly, literally - from Numpy and MatPlotLib to the power of Relational Databases.

Common reasons to use Pandas include:

  • You have complex data spread across multiple interrelated tables, and want to extract a clean table of relevant data
  • You have files with absurdly many rows (think: tens of gigabytes), and want to access/store them efficiently while doing simple (and complex, built from simple) operations on the data
  • You want your data fields to have names and validation
  • Your data has inherent relationships which you want to keep track of

These are all great reasons to learn SQL. However, Pandas will do in a pinch.

Other reasons to use Pandas include:

  • You want to quickly turn a CSV (or other tabular file(s)) into a labeled MatPlotLib plot, and don't want to think about the details.
  • You want an interface from a more complex SQL database to MatPlotLib or Numpy.
  • You love Excel and its eccentricities, but it just can't handle large files.
  • You want to get into SQL, but need to dip your toes in the relational data ocean before you dive in.
  • You really like Pandas (the animal), and are willing to restrict yourself just to see that word a bunch

The pandas.DataFrame object

The primary object in Pandas (the software) is called a DataFrame.

Pandas DataFrames are wrappers around numpy arrays, which contain the tabular data the DataFrame represents - strings or other datatypes, but most often (and automatically, for numeric data) numpy floats.

This is very similar to the structure of an Excel .xls file - a metadata wrapper around simple tabular data. Excel documents are a good way to think about DataFrame collections if you are used to Excel-like interfaces; many of the developers thought of them that way, so it will lead you down the intended path.

DataFrames are typically created in one of two ways: from files, or from existing Numpy data.

From Data

We can create a simple DataFrame with the constructor:

import pandas
pandas.DataFrame([[1,2],[3,4]])

Note the prettyprinting, and the automatic column and row names - indices from the array.

We can name the columns:

import pandas
df = pandas.DataFrame([[1,2],[3,4]])
df.columns=["a","b"]
df

Or tell pandas the names at construction:

import pandas
df = pandas.DataFrame([[1,2],[3,4]],columns=["a","b"])

The main advantage of named columns is the ability to insert values by name.

# Note: DataFrame.append was removed in pandas 2.0; a concat-based
# alternative is sketched after the Series example below.
df = df.append({"a":5,"c":"Bananas"},ignore_index=True)
df.at[0,"d"]=2
df.values

Immediately, some things should jump out at you:

  1. Append does not change the original DataFrame - it creates a new one. DataFrames are not row-mutable.
  2. We defined a new column value - DataFrames are column-mutable, and columns can be created or destroyed.
  3. We only defined one existing column's value - Pandas allows for default blank ("NaN" or Not A Number) values. (This is a behavior inherited all the way from the low-level floats, for numbers.)
  4. We told it to ignore_index. This auto-increments an index - our new row is now row 2.

We could specify a particular index by creating the row as a Series object:

df.append(pandas.Series(data={'a':10}, name='New Label'))
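A heads-up for newer versions: DataFrame.append was removed in pandas 2.0. On current releases, pandas.concat does the same job - a minimal sketch of the first insertion above, under that assumption:

new_row = pandas.DataFrame([{"a":5,"c":"Bananas"}])
df = pandas.concat([df, new_row], ignore_index=True)
df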

Row indices - which you will hear me refer to as primary keys or often just keys for short - do not have to be integers. They can also be strings. (Actually, they can be all manner of hashables. In general, don't do that to yourself - except perhaps for carefully crafted datetimes.)

You can set the index to a specific column of existing values with:

df.set_index("a")

You can even set it to multiple indices - what I will refer to as a Compound Key - by passing a list:

df = df.set_index(["a","b"])

Accessing Single Values

Indices allow you to address specific table positions - to read and change values, or add new values/columns to existing rows.

df.at[(1,2),"d"] = 1
df
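Reading works the same way - a quick sketch, assuming df still carries the compound ("a","b") key from above:

df.at[(1,2),"d"]    # the single value we just set
df.loc[(1,2)]       # the whole row keyed by (1,2)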

From a File

The most common way, however, to get DataFrame objects is from files. Pandas can handle a variety of common tabular files - mainly csv, tsv, xls (which, much like a DataFrame, is essentially tabular data wrapped in metadata) and json. It can read others - even correctly formatted html tables (which are relatively rare) - but these are by far the most used.

I am partial to csv files. My wife is partial to tsv, because she deals with data that contains a lot more commas (that would have to be escaped). The delimiter doesn't matter much - you can use whatever simple character doesn't show up in your data much, so long as you tell Pandas - but we will cover csv and json in this course.

The other readers have their eccentricities, but you can likely figure them out.

File Structure

CSV

Say I had a csv file:

example.csv

Name,Student ID,GPA
Alice, 001, 4.0
Bob, 002, 3.7
Eve, 005, 2.3

We can quickly get it in Pandas with:

df = pandas.read_csv("example.csv")

Watch out for whitespace!

Notice that it immediately treats the first line as a "header" - unless we ask it not to with the header= parameter (a row number to use a different line as the header, or None to fall back to integer column names) - but uses the row number as the default index.
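For instance, a quick sketch of the header=None case on the same file:

# The "Name,Student ID,GPA" line is now just another data row,
# and the columns are simply named 0, 1, 2.
df = pandas.read_csv("example.csv", header=None)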

We can set an index with a parameter:

df = pandas.read_csv("example.csv",index_col="Student ID")

There are actually lots of useful parameters - to ignore rows or columns with skiprows (a blacklist) and usecols (a whitelist), to take the data in bite-size chunks with chunksize, and countless others you can read about in the docs.
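For example, a small sketch combining two of these on example.csv (the particular rows and columns are chosen purely for illustration):

# Keep only the Name and GPA columns, and skip Bob's line
# (line 2 of the file, counting the header as line 0)
df = pandas.read_csv("example.csv", usecols=["Name","GPA"], skiprows=[2])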

JSON

We mentioned JSON briefly, but it is a very common and useful data format for hierarchical data that is sometimes abused for tabular data.

Pandas can handle JSON input in pretty much all the ways it can handle dictionaries.

To see how Pandas wants a JSON file, however, we can do a quick trick:

print(df.to_json())

And now we know what to put in a json file to read back our data.

We could save it directly:

df.to_json("example.json")
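Reading it back in is symmetric - a quick sketch of the round trip, assuming the file written above (dtypes and indices can shift slightly on the way back):

df2 = pandas.read_json("example.json")
df2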

Big(ish) Data with Pandas

Pandas is increasingly popular in the Big(ish) Data community for its ability to handle limited views of large arrays. Pretend your 100 gigabyte file is in csv (and pray it isn't in json); then you can load it in with:

for df in pandas.read_csv("example.csv",chunksize=10):
    # Do stuff to df
    pass

This creates an iterator of smaller DataFrames - in this case, 10 lines long each - so you can parse the data in bite-size chunks.

This allows you to work on datasets that are larger than you can hold in random access memory, but not larger than you can hold in a single file. Hence... Big(ish).
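As a small sketch of the pattern, here is the mean of the GPA column from our earlier example.csv, computed chunk by chunk without ever holding the whole file in memory (with a genuinely large file you would use a much bigger chunksize):

total = 0.0
count = 0
for chunk in pandas.read_csv("example.csv", chunksize=10):
    total += chunk["GPA"].sum()
    count += len(chunk)
print(total / count)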

Using Data

Immediately we have our data loaded and named, ready to be used. We can typecast it as a numpy array and handle it in Numpy if we want - or let our numpy functions typecast it for us - but we will see that there are a couple snappy ways to get Pandas to do that work for us.
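A quick sketch of that typecast, assuming df is the gradebook we just read from example.csv:

import numpy

# Grab one numeric column and hand it over to numpy
arr = numpy.asarray(df["GPA"])    # or df["GPA"].to_numpy()
numpy.mean(arr)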

Cleaning and Validating Data

The first step for any real-world data analyst is to clean and validate their data. Real-world data is messy - no matter how good the specification, eventually, lines or entire institutions make it into your ecosystem with bad (or just missing) data.

Some surveys have suggested that nearly all data analysts spend the majority of their time cleaning and formatting data. From experience, I would say this is incredibly conservative - the actual data analysis is the cherry on top of a large pile of bad data.

Most of the advanced validation tools are a bit cumbersome to learn, so come up with sanity checks on your own - then apply them to your columns before interacting with the data.

A simple validator that is part of core Pandas, however, is df.isnull().

This returns a DataFrame containing a truth array - like in numpy - and like in numpy, we can use numpy.all to validate it. Or, we can use some data processors to analyze it with the metadata intact:

Processing Data

We can get an overwhelming list of available methods with dir(df) or dir(pandas), but it is easier to refer to the official documentation for a prettier view.

Let's go back to a simple example, and count our dataframe's null values with:

df = pandas.DataFrame([{"a":1},
                  {"a":2,"b":3},
                  {"b":4}])
df.isnull().sum()

And we can find the rows which misbehave by operating on the other axis:

df.isnull().any(axis=1)

You can actually slice your dataframe by this value like in Numpy -

df[df.isnull().any(axis=1)]

to see the data and whether it is fixable, but the simplest thing to do is to delete it.

You could slice it, of course, but you could also do:

df = df.dropna()

Remember that DataFrames are row immutable - so you have to assign a new DataFrame.

Plotting

All of the matplotlib functionality is available to you, but there are some common plots that Pandas wants to make for you.

To see a list, you can do dir(df.plot), or read the documentation.

A very simple example is a histogram:

df = pandas.DataFrame([{"a":1},
    {"a":2,"b":3},
    {"a":1,"b":5},
    {"b":4}])
df.hist()

which throws up a histogram of the values, or

df.plot.line(x="a",y="b")

which chooses some default matplotlib settings, throws out unplottable data, and renders you a quick plot.
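And since these are ordinary matplotlib objects underneath, you can keep tweaking them by hand - a small sketch (the title here is just illustrative):

import matplotlib.pyplot as plt

ax = df.plot.line(x="a", y="b")    # Pandas hands back the matplotlib Axes
ax.set_title("b versus a")
plt.show()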

Relational Data

The relational data model is very powerful, and will be the subject of several days later in the course. For now, a simple example will suffice.

Suppose we have three tables:

users.csv

User ID,Name,Address
001,Alice,107 Secret St.
002,Bob,108 Secret St.
003,Eve,

books.csv

ISBN,Catalog Number,Title
9781470410483,0,The Joy Of Factoring
9781470410483,1,The Joy Of Factoring
9780538497817,0,Stewart's Calculus
9780538497817,1,Stewart's Calculus

checkouts.csv

User ID,ISBN,Catalog Number,Date
002,9781470410483,1,1594181696
003,9780538497817,1,1594187234
001,9781470410483,0,1593652411
003,9780538497817,0,1593653501

Now let us read them in as DataFrames:

users = pandas.read_csv("users.csv",index_col="User ID")
books = pandas.read_csv("books.csv",index_col=["ISBN","Catalog Number"])
checkouts = pandas.read_csv("checkouts.csv",index_col=["User ID","ISBN","Catalog Number"])

Say we wanted a list of who had checked out which book. We can use a Join to match up the index columns:

checkouts.join(users)

We can even stack joins:

checkouts.join(users).join(books)

to get all the juicy data on who checked out what.

There are a large number of other relational data operations, but joins are the bread and butter of a relational data model, and a great way to keep more complicated metadata.
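One worth knowing about is merge, the column-based cousin of join: it matches rows on named columns instead of on the index. A quick sketch on the same files, read back in without setting an index:

users = pandas.read_csv("users.csv")
checkouts = pandas.read_csv("checkouts.csv")

# Match rows wherever the "User ID" columns agree
pandas.merge(checkouts, users, on="User ID")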

Worksheet

Today's worksheet has you return to the world of gradebooks, to read and parse the examples. With Pandas, searching for a particular student or their grades is a breeze.
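For a taste, here is a sketch of that kind of lookup, using the earlier example.csv (your gradebook files will have their own column names):

grades = pandas.read_csv("example.csv")
grades[grades["Name"] == "Alice"]    # every row for one student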
