NumPy Data Mocking

The Story So Far

Suppose you are teaching a course on programming for mathematicians, and you want to use a relatable example such as calculating grades.

However, federal law - primarily the Family Educational Rights and Privacy Act (FERPA) - restricts your ability to use real-life data without full anonymization, and university policy may require you to get clearance for uses such as this.

So you want to create some believable fake data. Your teaching experience has given you a fairly good grasp of what kinds of assignments math classes have - quizzes, homeworks, and exams - and what kinds of grades to expect - roughly normally distributed around something in the 7-10 range, depending on difficulty.

Your goal for this exercise is to construct a function which creates a reasonable gradebook for an imagined course.

Your Assignment: Upload a module containing a Python function generate_gradebook(filename, student_count, quiz_count, hw_count, exam_count) which uses NumPy to create a convincing gradebook CSV file:

A header line containing:

A column for student ids

Columns for a provided number of homeworks, quizzes, and exams: by default, a random number of homeworks and quizzes between 3 and 12 each, and a random number of exams between 1 and 5.

A provided number of rows containing:

a randomly generated student id, in a roughly Purdue-like range: 9-digit numbers starting with 100

grades for each homework, quiz, and exam, with a maximum score of 10 for quizzes, 20 for homeworks, and 100 for exams. For believability, each student's scores should be normally distributed around an average percentage unique to that student - and the students' average percentages should themselves be normally distributed, so that you get a reasonable grading curve.
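The default counts and the student ids can be sketched as follows. This is only one possible reading of the requirements: it assumes the ranges are inclusive, and it assumes ids should be unique, which the description does not state. The seed and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed fixed only so the sketch is repeatable

# Default counts. integers() excludes the upper bound unless endpoint=True.
# Assumption: "between 3 and 12" and "between 1 and 5" are inclusive ranges.
hw_count = rng.integers(3, 12, endpoint=True)
quiz_count = rng.integers(3, 12, endpoint=True)
exam_count = rng.integers(1, 5, endpoint=True)

# 9-digit ids starting with 100, i.e. 100000000 through 100999999.
# Sampling the last six digits without replacement keeps the ids unique
# (an assumption; the assignment only asks for random ids).
student_count = 5
student_ids = 100_000_000 + rng.choice(1_000_000, size=student_count, replace=False)
```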

If this description is enough for you to get started, feel free. The rest of this document consists of hints to achieve it, and an optional challenge at the end for those who really want to learn the NumPy way.

One Student At A Time

Solving problems is all about breaking them down into manageable chunks. Let's imagine a much simpler problem: What is a student's grade on a single homework?

In this case, we can take a page out of the statistician's book - people believe in normal distributions, regardless of evidence, so try a normal distribution with an appropriate mean value.

To convert the percentage to a score, you need to multiply it by the assignment's maximum score, round the result, and clip it to the appropriate range.
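As a minimal sketch of a single homework grade, the mean and spread below are made-up values, not numbers from the assignment:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # illustrative seed

max_score = 20    # a homework
mean_pct = 0.85   # assumed class average, as a fraction
spread = 0.10     # assumed standard deviation

# Draw one percentage, scale it to points, round, then clip to [0, max_score].
pct = rng.normal(mean_pct, spread)
score = np.clip(np.round(pct * max_score), 0, max_score)
```

Clipping matters because a normal distribution has no upper limit, so an unclipped draw can exceed full marks.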

The Pythonic way is to construct your larger gradebook from this using list comprehensions. However, try to do as much of it as possible with larger arrays.

Remember that NumPy functions - like clip and even multiplication - can take arrays as arguments.
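For instance, both the multiplication and the upper bound passed to clip can be arrays, so each element gets its own limit. The percentages here are invented to show both kinds of clipping:

```python
import numpy as np

pcts = np.array([0.5, 0.97, 1.2, -0.1])   # made-up percentages
max_scores = np.array([10, 10, 20, 100])  # a different maximum per element

raw = pcts * max_scores  # elementwise product, approximately [5, 9.7, 24, -10]

# The upper bound is an array, so each score is clipped to its own maximum.
scores = np.clip(np.round(raw), 0, max_scores)
print(scores)  # [ 5. 10. 20.  0.]
```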

When generating all of a student's scores, drawing each one from the class-wide distribution doesn't look right. Instead, generate an average for that student, and then use that as the mean of a new normal distribution for all of that student's grades.
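A sketch of this two-stage draw for one student's homeworks might look like the following. All of the means and spreads are assumed values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # illustrative seed

class_mean, class_spread = 0.80, 0.08  # assumed distribution of student averages
student_spread = 0.07                  # assumed variation within one student
hw_count = 6

# Stage 1: one average for this student, drawn from the class-wide distribution.
student_avg = rng.normal(class_mean, class_spread)

# Stage 2: every homework percentage is drawn around that personal average.
hw_pcts = rng.normal(student_avg, student_spread, size=hw_count)
hw_scores = np.clip(np.round(hw_pcts * 20), 0, 20)
```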

The NumPy Way

Try to achieve this with no flow control outside of the arrays - just the tools for assembling matrices.

The first step is generating a $1 \times n$ matrix of student averages.

Then use numpy.outer with a vector of ones to construct a 2-dimensional matrix whose columns are all copies of the averages, with one column per assignment.
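One way to do this is to take the outer product of the averages with a vector of ones. The counts and distribution parameters below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # illustrative seed

student_count, assignment_count = 4, 3  # placeholder sizes
averages = rng.normal(0.80, 0.08, size=student_count)

# np.outer(a, b)[i, j] == a[i] * b[j]; with b all ones, every column
# is a copy of the averages: one row per student, one column per assignment.
means = np.outer(averages, np.ones(assignment_count))
print(means.shape)  # (4, 3)
```

Note that numpy.outer flattens its inputs, so it does not matter whether the averages are stored as a flat vector or as a $1 \times n$ matrix.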

Calling numpy.random.normal with this matrix as the means will generate a full 2-D matrix of percentages, one per student per assignment.
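The sketch below passes a matrix as the mean. It uses the newer Generator interface (np.random.default_rng), and the legacy numpy.random.normal accepts an array-valued mean in the same way. The spread of 0.07 is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # illustrative seed

averages = rng.normal(0.80, 0.08, size=4)  # placeholder student averages
means = np.outer(averages, np.ones(3))     # 4 students x 3 assignments

# With an array as the mean, one sample is drawn per element,
# each centered on the corresponding entry of `means`.
pcts = rng.normal(means, 0.07)
print(pcts.shape)  # (4, 3)
```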

Then you can multiply column-wise by a $1 \times m$ matrix of max scores (one entry per assignment), round, and clip with that same matrix to generate a matrix of grades.
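A possible sketch of this step relies on broadcasting: a row of max scores lines up with the columns of the percentage matrix. The counts and the stand-in percentages are assumptions, and the column order (homeworks, quizzes, exams) is just one choice:

```python
import numpy as np

rng = np.random.default_rng(seed=5)  # illustrative seed

hw_count, quiz_count, exam_count = 2, 2, 1  # placeholder counts

# One max score per column: 20 for homeworks, 10 for quizzes, 100 for exams.
max_scores = np.repeat([20, 10, 100], [hw_count, quiz_count, exam_count])

# Stand-in percentages for 4 students; in the real function these would
# come from the per-student normal draws.
pcts = rng.normal(0.80, 0.10, size=(4, max_scores.size))

# Broadcasting applies max_scores to every row, for both the
# multiplication and the upper clipping bound.
grades = np.clip(np.round(pcts * max_scores), 0, max_scores).astype(int)
```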

Join that array with the generated student ids, and - after writing the header - you have a full gradebook.
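The final assembly could be sketched with numpy.column_stack and numpy.savetxt. The column names, the temporary file path, and the stand-in grades are all hypothetical. The real function would write to the filename it is given:

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(seed=6)  # illustrative seed

# Stand-in data: 4 unique ids and a 4 x 3 matrix of integer grades.
student_ids = 100_000_000 + rng.choice(1_000_000, size=4, replace=False)
grades = rng.integers(0, 11, size=(4, 3))

# Put the ids in the first column, followed by the grade columns.
table = np.column_stack([student_ids, grades])

# Hypothetical column names; the assignment does not fix the header text.
header = ",".join(["student_id", "hw1", "quiz1", "exam1"])

# comments="" stops savetxt from prefixing the header line with "# ".
filename = os.path.join(tempfile.mkdtemp(), "gradebook.csv")
np.savetxt(filename, table, fmt="%d", delimiter=",", header=header, comments="")
```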