Deep Learning from Scratch to GPU - 16 - Momentum

April 29, 2019

Please share this post in your communities. Without your help, it will stay burried under tons of corporate-pushed, AI and blog farm generated slop, and very few people will know that this exists.

These books fund my work! Please check them out.

Today we are going to implement momentum, a ubiquitous learning optimization technique. What's more, we'll do it without any performance penalty.

If you haven't yet, read my introduction to this series in Deep Learning in Clojure from Scratch to GPU - Part 0 - Why Bother?.

The previous article, Part 15, is here: Weight Decay.

To run the code, you need a Clojure project with Neanderthal () included as a dependency. If you're in a hurry, you can clone Neanderthal Hello World project.

Don't forget to read at least some introduction from Neural Networks and Deep Learning, start up the REPL from your favorite Clojure development environment, and let's continue with the tutorial.

First, the usual requires of the namespaces we're going to use. I'm basing this post on the code we developed so far, which is not repeated here.

(require '[uncomplicate.commons.core
           :refer [with-release let-release Releaseable release]]
         '[uncomplicate.clojurecuda.core :as cuda
           :refer [current-context default-stream synchronize!]]
         '[uncomplicate.clojurecl.core :as opencl
           :refer [*context* *command-queue* finish!]]
         '[uncomplicate.neanderthal
           [core :refer [mrows ncols dim raw view view-ge vctr copy row
                         entry! axpy! copy! scal! mv! transfer! transfer
                         mm! rk! view-ge vctr ge trans nrm2 zero axpby! amax]]
           [math :refer [sqr]]
           [vect-math :refer [tanh! linear-frac! sqr! mul! cosh! inv!]]
           [native :refer [fge native-float]]
           [cuda :refer [cuda-float]]
           [opencl :refer [opencl-float]]])
(import 'clojure.lang.IFn)

The challenge

You may remember this figure from the time when we discussed the basics of gradient descent.

The algorithm is "walking" down the slope with the step size controlled by the learning rate, and once it steps into a local minimum, there's nowhere to go. It misses deeper valleys behind the tiniest boulder. With larger steps, it can jump over the boulders and hills. It can jump over valleys too, however, and hop from hill to hill. If we could only find a way to keep the steps as fine grained as necessary, but jump over these boulders…

There is a way, an intuitive one. Keeping up with the analogy of a snow ball rolling down the mountain, does the ball stop in front of every rock? It depends. If the rock is small enough, or the ball has picked up speed, it can even make a record-winning ski jump!

Momentum

The solution, thus, is to track velocity of this virtual "ball" we are implementing, and take it into account when we update the weights with the gradient. If the gradient has been pointing at the same direction for a while, we won't let an occasional single direction switch to revert the general movement, but just to slow it down. Eventually, this slowdown can add up, and the direction of the update can change, or the ball might have rolled over to the other side of the hill and continue toward deeper valleys.

This optimization technique is called momentum.

The following figure shows how momentum keeps the ball rolling upwards for some time (but, in this particular case, not enough to climb the hill and cross over).

In practice, momentum is one of the tried and true learning optimization techniques that usually helps in speeding up the convergence towards good solutions.

Implementing momentum

Similarly to how we implemented weight decay, we will look at basic equations of backpropagation, compare them to how velocity is tracked, and see if we can fit it into the existing implementation. If we can do that, we can implement momentum at no additional computational cost!

The \(\nabla{C_0}\) has been calculated as usual, and stored in the matrix v in the FullyConnectedTraining type.

To implement momentum, we have to do two things. First we update velocity with the gradient, and then we add the updated velocity to weights. The updated velocity then sits in place waiting for the next cycle. When updating velocity, we assume that the old values should be taken into account less and less. Therefore, the old velocity is multiplied by mu, which is expected to be a number between 0 and 1.

\(v \rightarrow v' = \mu v - \eta \nabla{C}\) (1)
\(w \rightarrow w' = w + v'\) (2)

You'll notice that our existing implementation already covers the second part. The weight update already adds v to w (including the weight decay from the previous article).

(axpby! 1.0 v (inc (* eta-avg (double lambda))) w)

The current implementation just erases the old values of v when updating gradients, in effect calculating \(v \rightarrow v' = 0 v - \eta \nabla{C}\). The only change we need to make is to multiply v by mu instead of with zero!

We change this expression:

(mm! eta-avg z (trans a-1) 0.0 v)

Into this:

(mm! eta-avg z (trans a-1) mu v)

We have already had the implementation of momentum. We just need to switch it on! Note that this implementation does not clash with the existing implementation of weight decay. Way to go, Clojure :)

The update to FullyConnectedtraining contains the addition of mu to the second argument of the backward method, and changing 0.0 to mu in the appropriate mm! call.

(deftype FullyConnectedTraining [v w b a-1 z a ones activ-fn first?]
  Releaseable
  (release [_]
    (release v)
    (release w)
    (release b)
    (release a-1)
    (release z)
    (release a)
    (release ones)
    (release activ-fn))
  Parameters
  (weights [_] w)
  (bias [_] b)
  Transfer
  (input [_] a-1)
  (output [_] a)
  (ones [_] ones)
  Backprop
  (forward [_]
    (activ activ-fn (rk! -1.0 b ones (mm! 1.0 w a-1 0.0 z)) a))
  (backward [_ [eta lambda mu]] ;; We need mu in addition to eta and lambda
    (let [eta-avg (- (/ (double eta) (dim ones)))]
      (mul! (prime activ-fn z) a)
      (mm! eta-avg z (trans a-1) mu v) ;; mu instead of 0.0
      (when-not first? (mm! 1.0 (trans w) z 0.0 a-1))
      (mv! eta-avg z ones 1.0 b)
      (axpby! 1.0 v (inc (* eta-avg (double lambda))) w))))

(defn training-layer
  ([inference-layer input ones-vctr first?]
   (let-release [w (view (weights inference-layer))
                 v (zero w) ;; Has to be initialized to zero
                 b (view (bias inference-layer))
                 a-1 (view input)
                 z (ge w (mrows w) (dim ones-vctr))
                 a (raw z)
                 o (view ones-vctr)]
     (->FullyConnectedTraining v w b a-1 z a o ((activation-fn inference-layer) z) first?)))
  ([inference-layer input ones-vctr]
   (training-layer inference-layer input ones-vctr true))
  ([inference-layer previous-backprop]
   (training-layer inference-layer
                   (output previous-backprop)
                   (ones previous-backprop)
                   false)))

Gradient descent with momentum

Cool. Let's see whether momentum helps in the simple artificial example we used in Learning a Regression.

First, we will generate simulated training data.

(def x-train (fmap! rand-uniform (ge native-float 4 10000)))

x-train

#RealGEMatrix[float, mxn:4x10000, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       0.06    0.94    ⁙       0.19    0.96
    →       0.51    0.21    ⁙       0.22    0.46
    →       0.91    0.77    ⁙       0.05    0.99
    →       0.41    0.96    ⁙       0.43    0.63
    ┗                                               ┛

(def y-train (ge native-float 1 10000 (map my-fn (cols x-train))))

y-train

#RealGEMatrix[float, mxn:1x10000, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       1.82    3.36    ⁙       1.41    2.88
    ┗                                               ┛

We create the network with the same old API we've used before.

(def inference (init! (inference-network
                       native-float 4
                       [(fully-connected 32 sigmoid)
                        (fully-connected 16 tanh)
                        (fully-connected 1 linear)])))

(def training (training-network inference x-train))

We'll also need additional data to test the inference.

(def x-test (fmap! rand-uniform (ge native-float 4 10000)))

x-test

#RealGEMatrix[float, mxn:4x10000, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       0.79    0.06    ⁙       0.72    0.33
    →       0.63    0.79    ⁙       0.87    0.96
    →       0.32    0.07    ⁙       0.16    0.06
    →       0.81    0.16    ⁙       0.53    0.55
    ┗                                               ┛

(def y-test (ge native-float 1 10000 (map my-fn (cols x-test))))

y-test

#RealGEMatrix[float, mxn:1x10000, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.48    0.86    ⁙       1.74    1.25
    ┗                                               ┛

We start with a small number of epochs.

(sgd training y-train quadratic-cost! 50 [0.05 0.1 0.1])

0.10884985870917153

The network learned to approximate the function, with the cost of 0.1. Let's see if the cost during inference is similar.

(quadratic-cost! (axpy! -1 y-test (inference x-test)))

0.10732509712443061

The cost is practically the same, which means that the network generalizes well. This is expected, since the data used in training and testing comes from the same simulated process, and the data points are uniformly distributed on the domain.

Now the question is whether momentum can help in fitting this function faster with fewer epochs. Let's see how this works with 1000 epochs and \(\mu = 0.1\)

(sgd training y-train quadratic-cost! 1000 [0.05 0.1 0.1])

0.009486114623502363

This seems effective., You can run this same example with \(\mu = 0.0\) for comparison.

Let's test how it does the inference.

(quadratic-cost! (axpy! -1 y-test (inference x-test)))

0.009206578117888102

It still generalizes well in this example, as expected.

Can a larger number of epochs find a better solution?

(sgd training y-train quadratic-cost! 4000 [0.01 0.01 0.1])

0.004036112503858931

It seems that momentum helps even with this simple case, in which the network can learn well even without any optimization techniques.

Comparison with vanilla gradient descent

I'll create yet another small network and directly compare vanilla gradient descent, gradient descent with weight decay, and gradient descent with weight decay and momentum, in regard to the learning speed.

The network I'm using has two hidden layers with 8 neurons each. I'll measure the cost after every 50 epochs. When you run this code at home, don't forget to re-create the network after each sgd run. Otherwise, the subsequent runs will pick up from the weights that are already well learned.

(def inference-88 (init! (inference-network
                          native-float 4
                          [(fully-connected 8 sigmoid)
                           (fully-connected 8 tanh)
                           (fully-connected 1 linear)])))

(def training-88 (training-network inference-88 x-train))

(sgd training-88 y-train quadratic-cost!
     (repeat 8 [50 [0.2 0 0]]))

(sgd training-88 y-train quadratic-cost!
     (repeat 8 [50 [0.2 0.02 0]]))

(sgd training-88 y-train quadratic-cost!
     (repeat 8 [50 [0.2 0.02 0.01]]))

I summarized the results I've got on my machine in the following table.

epoch	[0.2 0 0]	[0.2 0.02 0]	[0.2 0.02 0.01]
50	0.112	0.111	0.106
100	0.107	0.111	0.101
150	0.101	0.109	0.094
200	0.097	0.107	0.082
250	0.091	0.105	0.059
300	0.084	0.103	0.027
350	0.075	0.100	0.008
400	0.062	0.094	0.005

We can see that the network slowly progressed when vanilla gradient descent has been used. Regularization with weight decay caused a slight slowdown of the process. This might be just a coincidence and the effect of this specific coefficient of \(\lambda = 0.02\) but this is expected since weight decay usually helps with generalization versus overfitting.

The interesting results are in the third column. Despite being "penalized" by weight decay, the network still learned much faster when momentum was turned on, and the final cost after 400 epochs was an order of magnitude lower. The network would need to be trained without momentum for much longer to reach that result.

Microbenchmarks

I'll compare the running time on the original 4-32-16-1 network.

CPU Intel i7-4790k (2013)

First, the CPU performance for the 4000 epochs run.

(time (sgd training y-train quadratic-cost! 4000 [0.2 0.1 0.4]))

"Elapsed time: 7598.796603 msecs"
0.0026728047445931337

Nvidia GTX 1080Ti (2017)

Next, a good Nvidia GPU.

(cuda/with-default
  (with-release [factory (cuda-float (current-context) default-stream)]
    (with-release [cu-x-train (ge factory 4 10000)
                   cu-y-train (ge factory 1 10000)
                   inference (init! (inference-network
                                     factory 4
                                     [(fully-connected 32 sigmoid)
                                      (fully-connected 16 tanh)
                                      (fully-connected 1 linear)]))
                   training (training-network inference cu-x-train)]
      (transfer! x-train cu-x-train)
      (transfer! y-train cu-y-train)
      (time
         (sgd training cu-y-train quadratic-cost! 4000 [0.2 0.1 0.4])))))

"Elapsed time: 1395.654362 msecs"
0.0029641628056095216

As expected, it was only 4 times faster, since the network size is not enough to challenge the GPU hardware.

Since 4 times is still something, I'll spare half a minute and run this network for 100,000 epochs to see whether 0.003 is really a limit for this network's leaning.

"Elapsed time: 33750.170093 msecs"
5.8388337286891813E-5

After 100,000 epochs, the network learned so well, that the cost is only 0.00006! I've tried to up it up to 1,000,000 epochs, but the cost didn't get any lower. Maybe a billion epoch could help, but that is something that you may try as homework.

AMD Radeon R9 290X (2013)

I'll drop the OpenCL code here without the report.

(opencl/with-default
  (with-release [factory (opencl-float *context* *command-queue*)]
    (with-release [cl-x-train (ge factory 4 10000)
                   cl-y-train (ge factory 1 10000)
                   inference (init! (inference-network
                                     factory 4
                                     [(fully-connected 32 sigmoid)
                                      (fully-connected 16 tanh)
                                      (fully-connected 1 linear)]))
                   training (training-network inference cl-x-train)]
      (transfer! x-train cl-x-train)
      (transfer! y-train cl-y-train)
      (finish!)
      (time
         (last (sgd training cl-y-train quadratic-cost! (repeat 4000 [1 [0.2 0.1 0.4]])))))))

Donations

If you feel that you can afford to help, and wish to donate, I even created a special Starbucks for two tier at patreon.com/draganrocks. You can do something even cooler: adopt a pet function.

The next article

In this article, we've seen another useful leaning optimization technique: moment. We got the implementation for free with the existing infrastructure. Since it does not incur any performance penalty, and it clearly helps, it is reasonable to turn it on by default unless you know better.

As the Clojurists Together funding period is nearing its end, I think it is a good idea to wrap up what we have got by now in the next article.

If you've followed this series and liked it, I have good news: this journey will continue in the form of a nicely typeset book! The drafts will be available soon, and I'll actively work on the book so expect frequent updates. Subscribe to the mailing list below to get (infrequent) notifications.

Thank you

Clojurists Together financially supported writing this series. Big thanks to all Clojurians who contribute, and thank you for reading and discussing this series.