This notebook can be found on my GitHub: https://github.com/tonygallen/JPUG

Machine Learning with Julia

(a tutorial by someone who knows nothing about Machine Learning)

Options?

TensorFlow.jl

Cons:

  • Does not support Windows :(

Pros:

  • Doesn't bother supporting Windows :)

Flux.jl

Pros:

  • Julia to its core (100% julia stack)
  • Very easy to read ("If Python is executable pseudocode, Julia is executable math")
  • Lightweight, hackable
  • Creator is funny

Cons:

  • ?

I should mention Knet.jl seems like a good option as well.

Note: If you want GPU support and have an appropriate NVIDIA GPU, check out CuArrays.jl (https://fluxml.ai/Flux.jl/stable/gpu/). I do not use it in this tutorial.

Getting Started

In [1]:
#using Pkg
#Pkg.add("Flux")
using Flux

Gradients (Automatic Differentiation)

The soul of Machine Learning

In [2]:
using Flux.Tracker

# executable math
f(x) = x^2+1

# f'(x) = 2x
df(x) = gradient(f,x,nest=true)[1] # gradient returns a tuple, [1] gets the first coordinate
Out[2]:
df (generic function with 1 method)
In [3]:
df(4)
Out[3]:
8.0 (tracked)
In [4]:
# f''(x) = 2
ddf(x) = gradient(df,x,nest=true)[1]
ddf(0)
Out[4]:
2.0 (tracked)
In [5]:
h(x) = -cos(x)^cos(x)

# h'(x) = tan(x)cos(x)^(cos(x)+1)(log(cos(x))+1) obviously
dh(x)=gradient(h,x)[1]
dh(pi/4)
Out[5]:
0.36161922410769803 (tracked)
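
If you don't trust the "obviously", here is a rough sanity check with central finite differences in plain Julia (no Flux needed). The helper `central_diff` and the names `f1`, `h1`, `dh_exact` are made up for this sketch; they are not part of any package.

```julia
# Central finite difference: (g(x + dx) - g(x - dx)) / (2 * dx).
# `central_diff` is a hypothetical helper, not a Flux function.
central_diff(g, x; dx = 1e-5) = (g(x + dx) - g(x - dx)) / (2 * dx)

f1(x) = x^2 + 1
h1(x) = -cos(x)^cos(x)

# Closed-form derivative of h1, copied from the comment in the cell above
dh_exact(x) = tan(x) * cos(x)^(cos(x) + 1) * (log(cos(x)) + 1)

fd_f = central_diff(f1, 4.0)      # should land close to 2 * 4
fd_h = central_diff(h1, pi / 4)   # should land close to dh_exact(pi / 4)

println(fd_f)
println(fd_h, " vs ", dh_exact(pi / 4))
```

Finite differences are only approximate, so expect agreement to several digits rather than exact equality.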

But in ML, the functions are over something like $\mathbb{R}^{bajillion}$.

So for functions of multiple variables:

In [6]:
f(x,y,z) = x^2 + y^2 + z^2

#grad(f) = (2x,2y,2z)
gradient(f,1,2,3)
Out[6]:
(2.0 (tracked), 4.0 (tracked), 6.0 (tracked))

And if we have a bunch of different parameters:

In [7]:
# Quick Example to introduce Params(): Linear Regression

# random initial parameters
W = rand(5,10)
b = rand(5)

fhat(x) = W*x + b 

function loss(x,y)
    yhat = fhat(x) # our prediction for y
    return sum((y-yhat).^2)
end
Out[7]:
loss (generic function with 1 method)
In [8]:
x = rand(10)
y = rand(5)

loss(x,y) # big loss with random parameters
Out[8]:
27.542824876021225
In [9]:
# I have 50+ parameters, how do I pass them all at once?

W = param(W)
b = param(b)

grads = gradient(() -> loss(x, y), Params([W, b]))
Out[9]:
Grads(...)
In [20]:
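# gradient descent

alpha = 0.01 # learning rate, step size, etc.
grads = gradient(() -> loss(x, y), Params([W, b])) # recompute at the current W and b on every run
gW = grads[W]
gb = grads[b]

Tracker.update!(W,-alpha*gW) # essentially W = W - alpha * gW, applied in place (plus some bookkeeping I don't fully understand)
Tracker.update!(b,-alpha*gb);

loss(x,y)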
Out[20]:
0.7906016300425174 (tracked)

Run this several times and watch the loss go down!
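
To see what "run this several times" amounts to, here is a minimal sketch of the same linear regression with the gradient worked out by hand, so it runs without Flux. The names `sq_loss` and `descend` and the fixed random seed are my own choices for the illustration. For loss = sum((y - (W*x + b)).^2) and residual r = y - (W*x + b), the gradients are -2 * r * x' for W and -2 * r for b.

```julia
using Random
Random.seed!(0)   # fixed seed so the run is repeatable

W0 = rand(5, 10); b0 = rand(5)
xs = rand(10);    ys = rand(5)

# Same squared-error loss as above, written as a function of the parameters
sq_loss(W, b) = sum((ys .- (W * xs .+ b)) .^ 2)

# `descend` is a hypothetical helper: `steps` rounds of plain gradient descent
function descend(W, b; alpha = 0.01, steps = 20)
    for _ in 1:steps
        r  = ys .- (W * xs .+ b)     # residual y - yhat
        gW = (-2 .* r) * xs'         # d(loss)/dW, a 5x10 matrix
        gb = -2 .* r                 # d(loss)/db
        W  = W .- alpha .* gW        # step against the gradient
        b  = b .- alpha .* gb
    end
    return W, b
end

W1, b1 = descend(W0, b0)
println("loss before: ", sq_loss(W0, b0))
println("loss after:  ", sq_loss(W1, b1))
```

The gradient is recomputed inside the loop on every step, which is the part that matters when repeating the update.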

Enough about gradients. There are other packages if you're interested in doing more: ForwardDiff.jl for forward-mode, Calculus.jl for symbolic/finite differences, and Zygote.

Building Basic Models

Naive Approach

Just repeat the above linear regression example and compose.

In [21]:
function sigmoid(x)
    return 1/(1+exp(-x))
end

W1 = param(rand(7, 10))
b1 = param(rand(7))
layer1(x) = W1 * x .+ b1

W2 = param(rand(5, 7))
b2 = param(rand(5))
layer2(x) = W2 * x .+ b2

model(x) = layer2(sigmoid.(layer1(x)))
Out[21]:
model (generic function with 1 method)

Using Flux's Dense() and Chain()!

In [22]:
# Dense(in, out, activation) builds a layer object; call the layer like a function
dense1 = Dense(10, 7, sigmoid)
dense2 = Dense(7, 5)

model(x) = dense2(dense1(x))

# or equivalently
model2 = Chain(dense1, dense2)

# cool thing about Chain is that it supports indexing
model2[1]; # the first layer, dense1
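
For intuition, here is a rough sketch of what a dense layer and a chain boil down to, written in plain Julia. `ToyDense`, `toy_chain` and `logistic` are made-up stand-ins for this illustration, not Flux's actual types, and the random initialization is an arbitrary choice.

```julia
# A toy dense layer: weights, bias and an activation function.
# `ToyDense` is hypothetical and only mimics the idea behind Flux's Dense.
struct ToyDense{F}
    W::Matrix{Float64}
    b::Vector{Float64}
    act::F
end

# Convenience constructor: random weights, zero bias (arbitrary initialization)
ToyDense(nin::Int, nout::Int, act = identity) =
    ToyDense(randn(nout, nin), zeros(nout), act)

# Make the layer callable: apply the activation elementwise to W*x + b
(d::ToyDense)(x) = d.act.(d.W * x .+ d.b)

# A toy chain: feed x through each layer in order
toy_chain(layers...) = x -> foldl((acc, layer) -> layer(acc), layers; init = x)

logistic(z) = 1 / (1 + exp(-z))

net = toy_chain(ToyDense(10, 7, logistic), ToyDense(7, 5))
out = net(rand(10))
println(length(out))   # one output per row of the last layer
```

The point is that a layer is an object holding parameters that you call on an input, and a chain is just function composition over a list of layers.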

Training Networks

To train a model, we need 3 things:
  • Training Data
  • An objective function (e.g. loss function)
  • An optimizer (e.g. gradient descent)

Flux has the common objective functions and optimizers built in.

In [ ]:
# To train a model, call something like

train!(objective, parameters, data, optimizer, cb = () -> println("still training..."))

# cb stands for callback. It's useful for updating you about training (e.g. what the loss is currently)
# By default, it is called after every batch. Use Flux.throttle() to change this

This only trains for 1 epoch. To train for more, use the @epochs macro, e.g.

In [ ]:
using Flux: @epochs
@epochs 5 train!(...)
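
Roughly speaking, a training call walks over the data, takes one optimizer step per batch, and fires the callback. Here is a minimal sketch of that loop in plain Julia for a one-parameter model y ≈ w * x. `toy_train!` is a hypothetical stand-in I wrote for illustration; it is not how Flux implements train!, and the gradient is computed by hand instead of by automatic differentiation.

```julia
# `toy_train!` is hypothetical: one gradient descent step per (x, y) pair,
# with the callback `cb` called after every step.
function toy_train!(w::Vector{Float64}, data; alpha = 0.05, cb = () -> nothing)
    for (x, y) in data
        grad = 2 * (w[1] * x - y) * x   # derivative of (w*x - y)^2 with respect to w
        w[1] -= alpha * grad            # gradient descent step
        cb()                            # report progress after each batch
    end
    return w
end

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples of y = 2x
w = [0.0]                                     # single parameter, stored in a vector so it can be mutated

for epoch in 1:5                              # five passes over the data, i.e. five epochs
    toy_train!(w, data; cb = () -> println("w = ", w[1]))
end
```

One call to `toy_train!` is one epoch; the outer loop plays the role of repeating the training call.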

Flux offers a lot more I don't feel qualified to talk about

  • Recurrent models
  • Normalization & Regularization
  • More on optimization
  • GPU Support ( CuArrays.jl )
  • Saving and Loading Models ( BSON.jl )

Example: MNIST (what else would I do?)

In [23]:
using Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: repeated
#using CuArrays if you want to use GPU
In [24]:
imgs = Flux.Data.MNIST.images()
labels = Flux.Data.MNIST.labels();

The training data consists of 60,000 images of handwritten digits like this:

In [25]:
imgs[27454] # pick a number 1-60000
Out[25]:
[image output: a 28×28 handwritten digit]

The goal is to learn how to identify them:

In [26]:
labels[27454]
Out[26]:
7
In [27]:
## Boring Preprocessing
X = hcat(float.(reshape.(imgs, :))...) #stack all the images
Y = onehotbatch(labels, 0:9) # just a common way to encode categorical variables
Out[27]:
10×60000 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
 false   true  false  false  false  …  false  false  false  false  false
 false  false  false   true  false     false  false  false  false  false
 false  false  false  false  false     false  false  false  false  false
 false  false  false  false  false     false   true  false  false  false
 false  false   true  false  false     false  false  false  false  false
  true  false  false  false  false  …  false  false   true  false  false
 false  false  false  false  false     false  false  false   true  false
 false  false  false  false  false     false  false  false  false  false
 false  false  false  false  false      true  false  false  false   true
 false  false  false  false   true     false  false  false  false  false
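
In case one-hot encoding is new to you, here is a small plain-Julia sketch of the idea: each label becomes a column with a single `true` in the row for its class, and decoding just finds that row again. The helpers `onehot_sketch`, `onehot_batch_sketch` and `onecold_sketch` are made up for this illustration and are not Flux's implementations.

```julia
# One column per label, one row per class; hypothetical helpers, not Flux's
onehot_sketch(label, classes) = [label == c for c in classes]
onehot_batch_sketch(labels, classes) =
    reduce(hcat, [onehot_sketch(l, classes) for l in labels])

# Decoding: index of the largest entry (the single `true`)
onecold_sketch(v) = argmax(v)

Ysmall = onehot_batch_sketch([5, 0, 4], 0:9)
println(size(Ysmall))                       # (10, 3)
println(onecold_sketch(Ysmall[:, 1]) - 1)   # minus 1 because the classes start at 0
```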

Let's create our model!

In [28]:
# Our model, just like before, chaining dense layers
# Go from 28^2 dimensional space (images are 28x28) to 10 dimensional space (labels are 0-9)

m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax)

# softmax just converts the output to a probability distribution
Out[28]:
Chain(Dense(784, 32, NNlib.relu), Dense(32, 10), NNlib.softmax)
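
To make the softmax comment concrete, here is a minimal plain-Julia version: exponentiate every entry and divide by the total, so the result is positive and sums to 1. `softmax_sketch` is my own name for it; subtracting the maximum first is a common trick to avoid overflow and does not change the result.

```julia
# Hypothetical stand-in for softmax on a single vector
function softmax_sketch(z)
    e = exp.(z .- maximum(z))   # shift by the max for numerical stability
    return e ./ sum(e)          # normalize so the entries sum to 1
end

p = softmax_sketch([1.0, 2.0, 3.0])
println(p)
println(sum(p))
```

The largest input still gets the largest probability, which is why decoding with the index of the maximum works on the model's output.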

Now, to choose our objective function (loss) and optimizer!

In [29]:
loss(x, y) = crossentropy(m(x), y) 
opt = ADAM(); # popular stochastic gradient descent variant

accuracy(x, y) = mean(onecold(m(x)) .== onecold(y)) # cute way to find average of correct guesses

dataset = repeated((X,Y),200) # repeat the data set 200 times, as opposed to @epochs 200 ...
evalcb = () -> @show(loss(X, Y)) # callback to show loss
Out[29]:
#5 (generic function with 1 method)
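
Here is a small sketch of what the loss and the accuracy measure compute, on a made-up 3-class, 2-sample example in plain Julia. I am assuming the usual definition of cross entropy averaged over the columns (samples); `crossentropy_sketch`, `predict_rows` and `accuracy_sketch` are hypothetical names, not Flux functions.

```julia
using Statistics

# Average over samples (columns) of -sum(y .* log(yhat)); assumed form, not copied from Flux
crossentropy_sketch(yhat, y) = -sum(y .* log.(yhat)) / size(y, 2)

# Predicted class of each column = row index of its largest entry
predict_rows(m) = [argmax(m[:, j]) for j in 1:size(m, 2)]

# Fraction of columns where the prediction matches the truth
accuracy_sketch(yhat, y) = mean(predict_rows(yhat) .== predict_rows(y))

yhat = [0.7 0.2;
        0.2 0.5;
        0.1 0.3]          # model output: each column is a probability distribution
y    = [1 0;
        0 0;
        0 1]              # one-hot truth: classes 1 and 3

println(crossentropy_sketch(yhat, y))
println(accuracy_sketch(yhat, y))   # first column is right, second is wrong
```

Only the probability assigned to the true class enters the loss, so the loss shrinks as the model becomes more confident in the right answer.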

Time to train!

In [30]:
Flux.train!(loss, params(m), dataset, opt, cb = throttle(evalcb, 10)); #took me ~5 minutes to train on CPU
loss(X, Y) = 2.3259583f0 (tracked)
loss(X, Y) = 1.6830894f0 (tracked)
loss(X, Y) = 1.1227762f0 (tracked)
loss(X, Y) = 0.7927527f0 (tracked)
loss(X, Y) = 0.6152953f0 (tracked)
loss(X, Y) = 0.51356655f0 (tracked)
loss(X, Y) = 0.44959342f0 (tracked)
loss(X, Y) = 0.4059622f0 (tracked)
loss(X, Y) = 0.3741082f0 (tracked)
loss(X, Y) = 0.3512681f0 (tracked)
loss(X, Y) = 0.33128205f0 (tracked)
loss(X, Y) = 0.31474704f0 (tracked)
loss(X, Y) = 0.3016968f0 (tracked)
loss(X, Y) = 0.28936785f0 (tracked)
loss(X, Y) = 0.27849576f0 (tracked)
loss(X, Y) = 0.2688136f0 (tracked)

10,000 images were set aside to test our model. Let's look at one of them.

In [31]:
Flux.Data.MNIST.images(:test)[5287] # give me a number 1-10000
Out[31]:
[image output: a handwritten digit from the test set]
In [32]:
# Same preprocessing
test_X = hcat(float.(reshape.(Flux.Data.MNIST.images(:test), :))...)
test_Y = onehotbatch(Flux.Data.MNIST.labels(:test), 0:9);

m(test_X[:,5287]) # Note the 7th entry (corresponding to the digit 6) is nearly 1
Out[32]:
Tracked 10-element Array{Float32,1}:
 2.0658601f-6   
 7.2631983f-6   
 0.00012641726f0
 2.52413f-6     
 0.00013809346f0
 0.00012591082f0
 0.9995042f0    
 4.4408864f-8   
 8.1187456f-5   
 1.2359584f-5   
In [33]:
# decode
onecold(m(test_X[:,5287])) - 1 # minus 1 since the digits start from 0, but indexing in Julia starts at 1
Out[33]:
6

Overall, here's how our model does:

In [34]:
# Training set accuracy
accuracy(X, Y)
Out[34]:
0.92735
In [35]:
# Test set accuracy
accuracy(test_X, test_Y)
Out[35]:
0.924