NumPy: A Different Way to Think About Numbers


Motivation

Python is versatile - live, interpreted, duck typed with free passing and modification of objects - but that versatility comes at a real performance cost.

However, most of what you do in Python is cheap enough that the overhead doesn't matter - and through modules like NumPy, we can outsource the expensive numerical work to lower level libraries while keeping access to powerful numerical computation tools.

NumPy is the tool of choice for numerical floating point computation in Python, and is the technology underpinning many other Python packages. Underpinning NumPy are the LAPACK and BLAS Fortran libraries, the same libraries that power nearly all serious numerical computation - including MATLAB, Maple, and Mathematica. This means that you can get equivalent (or sometimes greater) linear algebra power, for free, within a real, modern programming language.

The Core of NumPy

NumPy revolves around a few core objects, most notably, the "n-dimensional array" ndarray.

In exchange for the dramatic performance gains over raw Python, and for some nice utilities, ndarrays come with some restrictions. They are:

  • Rectangular: you cannot construct an ndarray with different row or column lengths.
  • Homogeneous: you cannot mix datatypes within a single array.
  • Computational: NumPy arrays support batch computations on all elements.
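A quick sketch of the first two restrictions in action (the particular array contents here are arbitrary):

```python
import numpy

# Homogeneous: mixing ints and floats silently upcasts everything to float.
mixed = numpy.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Rectangular: rows of unequal length cannot form a 2-d ndarray.
# (Recent NumPy raises ValueError; older versions fall back to an object array.)
try:
    ragged = numpy.array([[1, 2], [3]])
    print("older NumPy: fell back to dtype", ragged.dtype)
except ValueError:
    print("ragged rows are rejected")
```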

NumPy Datatypes

Raw Python integers are arbitrary precision - they grow to fill your need, and keep track of precise arithmetic. As objects, they contain a variety of useful defined behaviors and methods.

This is antithetical to fast and parallel computing.

NumPy has datatypes akin to lower level programming languages like C - because they are thin wrappers around those lower level types. This makes them less versatile individually, but significantly faster.

The full list is too long to explain here, but some relevant types are:

type Description
np.bool_ Boolean
np.int64 integer - any exact integer in $[-2^{63},2^{63})$.
np.uint64 Unsigned integer - basically int64 but shifted to the nonnegative range $[0,2^{64})$.
np.float32 "single" precision number - think of it like scientific notation. 32 bits - 1 for the sign, 8 for the exponent (the decimal point position), and 23 for the mantissa (the digits). However - as you may have learned in a science class - arithmetic accumulates error by varying amounts, making floating point numbers less precise the more you use them.
np.float64 "double" precision - twice the bits, a significantly larger range, and (depending on what you are doing) roughly twice the significant digits of arithmetic you can get away with.
np.complex64 A pair of float32s that support complex arithmetic.

Choosing an appropriate precision comes down to your mathematical skill, and estimations of the errors involved. At the most extreme, it can be the subject of entire Computer Science research careers and graduate courses.

We won't be covering the basics here - for most purposes, you can choose a larger type than your largest integer/need for precision, and be fine. This is particularly true for number theorists - who know what precision they need (it's all of it) and the rough size of their integers (if we're lucky, it will be roughly logarithms of things) - and for people who do not care much about accuracy.
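As one concrete illustration of precision running out: a float32 carries roughly 7 significant decimal digits, so adding 1 to $10^8$ is lost entirely, while a float64 has digits to spare:

```python
import numpy

# float32: ~7 significant digits, so 1e8 + 1 rounds back down to 1e8.
a = numpy.float32(1e8)
print(a + numpy.float32(1) == a)   # True - the 1 falls below the representable gap

# float64: ~16 significant digits, so the same sum is exact.
b = numpy.float64(1e8)
print(b + numpy.float64(1) == b)   # False
```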

NumPy has a certain amount of "automatic escalation" of types when doing scalar arithmetic or joining arrays, but not for most array arithmetic.
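A minimal sketch of that escalation (the values are arbitrary):

```python
import numpy

a = numpy.arange(3, dtype=numpy.int32)

# Scalar arithmetic with a Python float escalates the result to float64.
print((a + 0.5).dtype)   # float64

# Joining arrays promotes to a type that can hold both.
joined = numpy.concatenate([a, numpy.zeros(2, dtype=numpy.float64)])
print(joined.dtype)      # float64

# But array arithmetic between two int32 arrays stays int32.
print((a + a).dtype)     # int32
```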

Defining new ndarrays

The ndarray type underpins a lot of the functionality you will use, but most people never call its constructor directly. Instead, they use the arrays generated by other functions - most commonly:

constructor use case
numpy.array(content) arrays from lists (or lists of lists)
numpy.zeros(dimensions) zero-filled arrays
numpy.ones(dimensions) arrays filled with ones
numpy.arange(lower,upper,step) array of appropriately sized numpy objects - by default, ints - stepping up by step (default 1) through the half-open interval $[\text{lower},\text{upper})$.

together with combining commands. For example,

import numpy
numpy.array([8,6,7,5,3,0,9],numpy.int32)                                        # array([8, 6, 7, 5, 3, 0, 9], dtype=int32)
numpy.concatenate([numpy.zeros([5],numpy.int32),numpy.ones([2],numpy.int32)])   # array([0, 0, 0, 0, 0, 1, 1], dtype=int32)

Today we will focus on one-dimensional arrays. (Nearly) everything about them generalizes to higher dimensions - and we will get to that in time - but there is plenty to learn before we do.

Arrays as Parallel Computations

In my opinion, the best way to think of NumPy arrays is as a kind of parallelization of your computations.

For example, say you had a list of integers. You have to cube them - add 1 - multiply by 2 - and take modulus 5.

In traditional python, you could do all that in a loop or list comprehension:

[(i**3+1)*2%5 for i in range(100)]

But the secret - that theoretical computer science won't tell you - is that doing a sequence of simple computations one number at a time is significantly slower than doing each simple computation across many numbers at once. As a result, the following type of computation can be significantly faster:

import numpy
(numpy.arange(100)**3 +1)*2 %5

In a test run, the pure Python took 7 times as long - over 35 microseconds, compared to less than 5 for NumPy. And that is on a relatively small list.
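Those timings came from one particular machine; you can reproduce the comparison with the standard library's timeit - the exact numbers will vary, but the check that both versions agree will not:

```python
import timeit

import numpy

# Time 1000 calls of each version of the computation from above.
pure = timeit.timeit(lambda: [(i**3 + 1)*2 % 5 for i in range(100)], number=1000)
vect = timeit.timeit(lambda: (numpy.arange(100)**3 + 1)*2 % 5, number=1000)

print(f"pure Python: {pure*1000:.1f} us/call, NumPy: {vect*1000:.1f} us/call")

# Sanity check: both versions compute the same values.
assert list((numpy.arange(100)**3 + 1)*2 % 5) == [(i**3 + 1)*2 % 5 for i in range(100)]
```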

Note that operator syntax in NumPy - at least as regards operations with scalars - is the same as it would be on the individual values; the operation is simply applied elementwise.

This applies to more than just arithmetic - for example,

numpy.arange(20)%5 == numpy.arange(20)%3

will give you an array of truth values. This is somewhat surprising - we are used to those types always returning a single bool - but remember that numpy is inherently parallel.

This array of truth values can be used for a variety of purposes - for example, to filter the same array, or other arrays, using "slice notation":

a = numpy.arange(20)
a[a%5 == a%3]   # array([ 0,  1,  2, 15, 16, 17])

To check for true equality - in every instance - you can use the python builtin all, or its more ndarray friendly cousin, numpy.all:

numpy.all(numpy.arange(20)%5 == numpy.arange(20)%3)   # False

Note that most functions on arrays in NumPy are defined twice (a slight violation of the zen of python) - once as an operator in numpy, and once as a method. For example,

(numpy.arange(20)%5 == numpy.arange(20)%3).all()

Is the same.

all has a cousin, any - equivalent to not(all(not(...))), for you logicians.

As a note, depending on your specific use case, the parallel versions of any and all can sometimes be slower than their sequential counterparts, because a sequential version can terminate early as soon as it finds a True or False value, respectively.
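A sketch of the contrast (the array size and the value searched for are arbitrary):

```python
import numpy

a = numpy.arange(10**4)

# numpy.any builds the entire boolean array a == 3, then scans it.
print(numpy.any(a == 3))       # True

# The builtin any over a generator stops at the first match - but it pays
# Python-level overhead per element until it gets there.
print(any(x == 3 for x in a))  # True
```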

With Array Operators

We have seen, up to now, that manipulating arrays in raw Python - especially multidimensional ones - is a headache.

Worry no more, because numpy is here to save the day. Arrays of the same shape can be added or subtracted with ease:

numpy.array([1,2,3])+numpy.array([4,5,6])   # array([5, 7, 9])

What gets more surprising - to mathematicians, at least - is that they support the other operators as well:

numpy.array([1,2,3])*numpy.array([4,5,6])   # array([ 4, 10, 18]) - elementwise, not a dot product

In fact, for any operator, applying a scalar is equivalent to applying a constant array:

base_array = numpy.arange(5)
assert (5*base_array == (5*numpy.ones(base_array.shape))*base_array).all()

This versatile syntax allows you to construct very complicated conditional arithmetic as a series of simple, relatively fast array operations - combining the values only when you need to.
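As an illustration (the rule here - halve evens, send odds to 3n+1 - is just an arbitrary example of branching arithmetic):

```python
import numpy

n = numpy.arange(1, 8)
even = (n % 2 == 0)

# Each branch is computed for every element; the boolean masks,
# used as 0/1 factors, pick which branch counts where.
step = even * (n // 2) + (~even) * (3*n + 1)
print(step)   # [ 4  1 10  2 16  3 22]
```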

Type Conversion

If you want to convert an array from one dtype to another, you can use its .astype(type) method. However, this is a C-style cast - for example, converting from floating point numbers to ints truncates toward zero (the floor, for nonnegative numbers).

numpy.arange(0,1,.1).astype(numpy.int32)   # all ten entries become 0

This biases statistical functions downward fairly heavily - so it should be avoided.

Rounding

You can often get better results with an intermediate call to .round(). This rounds numbers to a specific number of decimal places - by default, 0 - without changing type:

numpy.arange(0,1,.1).round()
numpy.arange(0,1,.1).round().astype(numpy.int32)

Note that it rounds 0.5 down. This is not an error, but a deliberate statistical choice - given large quantities of random decimal numbers, such as financial data, rounding half-values consistently up or down biases things like averages.

If we simulate the traditional rounding by taking the floor of n+1/2, then averages can come out as:

numpy.arange(.5,10,.5).mean() # 5.0
numpy.arange(.5,10,.5).round().mean() # 5.0
numpy.floor(numpy.arange(.5,10,1)+.5).mean() # 5.5

The mathematical rounding function biases the rounded numbers slightly upwards.

As a result, both numpy and Python round half-values to the nearest even integer -

numpy.arange(.5,5,1).round() # array([0.,2.,2.,4.,4.])

(some languages choose to solve this by rounding randomly. Perl can die for all I care - and fortunately, is dying.)

Randomness

With traditional python, we could easily get a list of n random integers with a quick list comprehension:

from random import randint
hws = [randint(0,20) for i in range(10)]

But NumPy is all about arrays, so we can do the same - faster - with:

randints = numpy.random.default_rng().integers
randints(0,20+1,size=10)   # high is exclusive, hence the +1 to match randint(0,20)

(default_rng is recent syntax, introduced in NumPy 1.17. You may see code using functions such as numpy.random.randint; those are legacy functions, which are no longer being updated.)
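As a side note, default_rng accepts an optional seed, which makes runs reproducible - handy for debugging. A minimal sketch (the seed value 42 is arbitrary):

```python
import numpy

# The same seed produces the same stream of values.
a = numpy.random.default_rng(42).integers(0, 21, size=5)
b = numpy.random.default_rng(42).integers(0, 21, size=5)
print(a)
print(numpy.array_equal(a, b))   # True
```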

NumPy's random functions - like a lot of NumPy - can take arrays as well as scalars (even as a mix):

randints(numpy.arange(2,5),numpy.arange(4,10,2))
randints(numpy.arange(1,10),20)

So far we have chosen evenly from integers in a range. However, numpy generators contain pseudorandom functions for a wide variety of distributions.

A few of note:

Generator default type distribution
integers numpy.int64 uniform on a range of integers
normal numpy.float64 normal (Gaussian)
multinomial array of numpy.int64 counts multinomial ($n$ weighted choices)
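A quick sketch of the latter two generators (the parameter values are arbitrary):

```python
import numpy

rng = numpy.random.default_rng()

# normal: mean 0 and standard deviation 1 by default; returns float64.
sample = rng.normal(size=5)
print(sample.dtype)   # float64

# multinomial: 10 trials over three weighted outcomes; the counts always sum to 10.
counts = rng.multinomial(10, [0.2, 0.3, 0.5])
print(counts, counts.sum())
```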

Slicing and Views

NumPy supports a versatile syntax for slicing - briefly mentioned earlier. Unlike Python slices, these don't always create new arrays - sometimes they create views, what a linear algebraist would call submatrices.

This is significantly faster, and allows for a variety of useful tricks, but can surprise you if you are not ready for it.

For example:

a=numpy.arange(10)
b = a[5:]
print(b) # [5,6,7,8,9]
b *= 2
print(a) # [0,1,2,3,4,10,12,14,16,18]
a //= 2
print(a) # [0,0,1,1,2,5,6,7,8,9]
print(b) # [5,6,7,8,9]
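If you want an independent array rather than a view, ask for one explicitly with .copy():

```python
import numpy

a = numpy.arange(10)
b = a[5:].copy()   # an independent array, not a view into a

b *= 2
print(a)   # unchanged: [0 1 2 3 4 5 6 7 8 9]
print(b)   # [10 12 14 16 18]
```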

You can slice on boolean conditions as well, as shown earlier, but this does create a copy:

a = numpy.arange(10)
b = a[a%3==1]
print(b)   # [1 4 7]
b *= 2
print(a)   # unchanged - b was a copy

So to apply operations to arrays conditionally, you can use numpy.where:

a = numpy.arange(10)
a = numpy.where(a%3 == 1,a+1,a)
print(a) # [0,2,2,3,5,5,6,8,8,9]
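Alternatively, boolean indexing works on the left-hand side of an assignment, modifying the selected elements in place - this avoids materializing both full branch arrays the way numpy.where does:

```python
import numpy

a = numpy.arange(10)
a[a % 3 == 1] += 1   # add 1 only where the condition holds
print(a)   # [0 2 2 3 5 5 6 8 8 9]
```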

Some conditionals are so common that they have their own methods. For example, clipping values to a minimum and/or maximum can be as easy as:

print(numpy.arange(-4,5).clip(-2,3)) # [-2,-2,-2,-1,0,1,2,3,3]

Worksheet

Today's worksheet takes you through one of my recent NumPy projects that came up in my teaching - and the process of going from Pythonic to NumPythonic.

Department of Mathematics, Purdue University
Copyright© 2018, Purdue University, all rights reserved.