Managing Multi-File, Related Data with Pandas

Let's go back to our classic example, and the studentdata.zip

The Story So Far

Your technical wizardry, and inability to keep it a secret, has caused you great stress.

This new semester, you have been tasked with computing everyone's GPA. Fortunately, you have new tools at your disposal.

If you want to dive in, and feel ready, do so - the rest of the worksheet will be a step-by-step walkthrough for one way to do this.

Graded Assignment: Create a function which uses Pandas to write a .csv file with headers, and two columns: The name (first, space, then last) and the GPA, added up over all .txt gradebooks in a target directory.

  • Consider any missing value a 0.
  • HWs are 30%, Quizzes are 30%, Exams are 40%
  • Ignore the precomputed grade column - computed columns are often full of lies
  • You don't have to drop any HWs or Quizzes, but if you want to make that an optional parameter for the challenge, go ahead!
  • Compute the data, and upload your .py and .csv to Brightspace.

Step 1

Problem solving - especially when writing a program, but in all areas - is about breaking the problem down into manageable steps. The first step will be calculating grades for a single class. We will start by computing HW grades:

Read in one of the gradebooks (read the documentation for read_csv to see how to skip the first row and set the header). Set the index to the Student ID - we don't want this to change.

It's easy to set missing values to 0:

df = df.fillna(0)

Now let's get the column names for our categories as separate lists.

You can use a list comprehension filter to select all the columns that start with a given value:

hw_columns = [i for i in df.columns if i.startswith("HW")]

This lets you slice the original dataframe with your list of HWs, to get a dataframe with just the HW columns. Use the pandas statistical functions - make sure you use the horizontal axis:

hw_means = df[hw_columns].mean(axis=1)

Calculate the other means, then sum them up - just like numpy arrays, you can multiply by scalars to turn a score out of 20, 10, or 100 into their appropriate percentage contribution.

The result will be a pandas.Series object with the student ids and student grades. Congratulations!

Step 2

Now we want to convert these from grades to GPAs.

We can do that with buckets, but let's practice using a mapping:

from itertools import product

# Some pythonic construction, because I couldn't be bothered to parse the table
grade_cutoffs = [i+j for i,j in product(range(90,50,-10),[7,3,0])]+[0]
gpas = [4.33,4]+[i+j for i,j in product(range(3,0,-1),[.6,.3,0])] + [0.6,0]
grade_map = list(zip(grade_cutoffs, gpas))

def grade_to_grade_point(grade):
    for i,j in grade_map:
       if grade>i:
          return j
    raise ValueError("{grade} not positive")

grade_points = grade_percentages.map(grade_to_grade_point)

Remember to name your grade points with:

grade_points.rename("Grade Points")

Wrap up everything you did to get this far into a function, and call it on some more examples - to get grade points from provided grade files.

Step 2

Now we need to average them.

We want to get the grades from all the files, without confusion. Use your python skills to iterate over all the .txt files in the directory. Get a list of all the grade points for each course (as a pandas.Series, like we constructed above).

Pass this list to pandas.DataFrame to create a new DataFrame where the columns are students and the rows are their grades. You can see that, by using indexes, pandas can automatically line up the matching data!

From here, a simple .mean of the columns gets us each student's GPA.

Step 3

Now we need to get a list of names and GPAs. For that, read in the student metadata from students.csv.

All we need is the first name, last name, and student ID, so filter those with the usecols parameter.

We can then get a series with the appropriate data using python object arithmetic:

student_names = student_metadata['first_name']+' '+student_metadata['last_name']

Set a name for this computed data.

Step 4

Join them together into a single dataframe. We want the students to be rows, so take the .transpose().

Then write it to a file with to_csv. Read the documentation to find out how to avoid writing the index column!

Department of Mathematics, Purdue University
150 N. University Street, West Lafayette, IN 47907-2067
Phone: (765) 494-1901 - FAX: (765) 494-0548
Contact the Webmaster for technical and content concerns about this webpage.
Copyright© 2018, Purdue University, all rights reserved.
West Lafayette, IN 47907 USA, 765-494-4600
An equal access/equal opportunity university
Accessibility issues? Contact the Web Editor (webeditor@math.purdue.edu).