Deep Learning from Scratch to GPU - 15 - Weight Decay

April 23, 2019

Please share this post in your communities. Without your help, it will stay burried under tons of corporate-pushed, AI and blog farm generated slop, and very few people will know that this exists.

These books fund my work! Please check them out.

In this article we explore a simple but useful technique for keeping weights from growing too big. Weight Decay is useful as a regularization technique that improves generalization, and can help with improving even the basic learning on the technical level.

If you haven't yet, read my introduction to this series in Deep Learning in Clojure from Scratch to GPU - Part 0 - Why Bother?.

The previous article, Part 14, is here: Learning a Regression.

To run the code, you need a Clojure project with Neanderthal () included as a dependency. If you're in a hurry, you can clone Neanderthal Hello World project.

Don't forget to read at least some introduction from Neural Networks and Deep Learning, start up the REPL from your favorite Clojure development environment, and let's continue with the tutorial.

First, the usual requires of the namespaces we're going to use.

(require '[uncomplicate.commons.core
           :refer [with-release let-release Releaseable release]]
         '[uncomplicate.clojurecuda.core :as cuda
           :refer [current-context default-stream synchronize!]]
         '[uncomplicate.clojurecl.core :as opencl
           :refer [*context* *command-queue* finish!]]
         '[uncomplicate.neanderthal
           [core :refer [mrows ncols dim raw view view-ge vctr copy row
                         entry! axpy! copy! scal! mv! transfer! transfer
                         mm! rk! view-ge vctr ge trans nrm2 zero axpby! amax]]
           [math :refer [sqr]]
           [vect-math :refer [tanh! linear-frac! sqr! mul! cosh! inv!]]
           [native :refer [fge native-float]]
           [cuda :refer [cuda-float]]
           [opencl :refer [opencl-float]]])
(import 'clojure.lang.IFn)

Runaway weights

When we discussed weight initialization, we saw that large weights can saturate the activation function. We solved that problem by initializing weights with an appropriate normal distribution, centered around zero, with a small standard deviation. Most weights start much smaller than 1, with an occasional larger value.

If we don't control the weight size during the learning process, one or several of them can become large. A large weight combined with a large learning rate might lead to even larger weights, and quickly enough, matrix multiplication of these large matrices can exhaust the capacity of floating point numbers. Our matrices become full of NaN. The network can't recover from that state.

Let's see this on the regression task we examined in the previous article.

First, we generate some training data.

(def x-train (fmap! rand-uniform (ge native-float 4 10000)))

x-train

#RealGEMatrix[float, mxn:4x10000, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       0.42    0.64    ⁙       0.96    0.88
    →       0.00    0.59    ⁙       0.29    0.71
    →       0.46    0.34    ⁙       0.50    0.49
    →       0.45    0.77    ⁙       0.32    0.74
    ┗                                               ┛

(def y-train (ge native-float 1 10000 (map my-fn (cols x-train))))

y-train

#RealGEMatrix[float, mxn:1x10000, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.04    2.34    ⁙       2.34    2.53
    ┗                                               ┛

Next, we create a neural network and a training engine.

(def inference (init! (inference-network
                       native-float 4
                       [(fully-connected 32 sigmoid)
                        (fully-connected 16 tanh)
                        (fully-connected 1 linear)])))

(def training (training-network inference x-train))

Here's what happens with a large learning rate, 0.5, even if I train the network for only 5 epochs.

(sgd training y-train quadratic-cost! 5 0.3)

7.1628803287245795

The cost is suspiciously large. Call it a couple times more if necessary.

(sgd training y-train quadratic-cost! 15 0.3)

1481239.1710476684

Note that the cost skyrocketed beyond one million!

Let's see the weight values.

(weights (first (layers inference)))

#RealGEMatrix[float, mxn:32x4, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ┓
    →      73.71   91.31   59.00   50.57
    →     -25.35  -26.44  -25.18  -21.80
    →       ⁙       ⁙       ⁙       ⁙
    →     720.67  742.89  744.95  814.44
    →     -89.35  -89.37  -89.73  -85.80
    ┗                                       ┛

These numbers are not that big, but this is only because we do not have space to display all entries of the weight matrix. We can use Neanderthal's amax function to find out the largest absolute values of weights at each layer.

(map amax (map weights (layers inference)))

(950.06134 753.5348 393.80255)

The largest weight for the first layer is 950! This is an indication that the learning rate is too big. The learning process is hopping from mountain slope to mountain slope, not being fine-grained enough to step into the valleys.

Try to run this for hundreds of epochs, and you'll see weight matrices full of NaN values.

Obviously, a smaller learning rate would help here, but in general, we do not know what is the appropriate learning rate. We would like to make the learning process more robust.

Weight decay (L2 Regularization)

Weight decay is a form of regularization. Regularization is one of the techniques for decreasing overfitting and improving generalization. This particular example can not demonstrate overfitting, since the function we're learning can give us lots of useful training data, so I will discuss overfitting when we build up this infrastructure a bit more, and tackle tougher problems.

On the technical level, weight decay reduces overfitting by controlling the growth of weight values, which we need to make this example work at all.

As we discussed in an earlier article, Learning and Backpropagation, the learning algorithm we are using is centered around the quest to minimize the cost function. So far, we have been using the quadratic cost function, so I'll base the discussion on the same formula. The base cost function and the weight regularization addition in general can be based on some other formulas, too.

Here's the existing cost function that we are minimizing.

\(C(w,b) \equiv \frac{1}{2n} \sum\limits_x \lVert \mathbf{y}_x - \mathbf{a}_x \lVert ^2\)

The idea of weight decay is that we add a component that depends on weights to this function.

For example, something like this:

\(C(w,b) \equiv \frac{1}{2n} \sum\limits_x \lVert \mathbf{y}_x - \mathbf{a}_x \lVert ^2 + \frac{\lambda}{2n}\sum\limits_w w^2\)

More generally, if we denote the vanilla cost function with \(C_0\), and the weight component as \(C_w\), the regularized cost function \(C\) is computed as \(C = C_0 + C_w\). When \(C_w = \frac{\lambda}{2n}\sum\limits_w w^2\) we are doing the L² Regularization.

The value of \(C\) is always larger than the value of \(C_0\), since we are assuming that there are at least some weights that are different from zero, and we are summing their squares. Intuitively, the \(C_w\) will be smaller when the absolute values of weights are smaller, so this component favors small weights, ideally zeros. On the other hand the \(C_0\) component favors weights that minimize the base cost, which are typically weights that are different than zero. The complete cost function balances these two tendencies, weighted by the coefficient \(\lambda\).

The question is now how this change in the cost function reflects on the backpropagation algorithm and the derivative. Fortunately, the derivative of a sum is the sum of the derivatives. This is wonderful news. Our backpropagation implementation can stay the same. The gradient with respect to weights is now:

\(\nabla{C} \equiv (\frac{\partial{C}}{\partial{w_1}},\frac{\partial{C}}{\partial{w_2}},\dots) = \nabla{C_0} + \frac{\lambda}{n} w\)

The gradient with respect to biases stays the same, since it does not depend on weights.

Implementing weight decay

Let's implement this change.

After \(\nabla{C_0}\) has been calculated, it is stored in the matrix v in the FullyConnectedTraining type.

After we do a few calculations with the old value of w, at the end of the backward method, we updated w by adding v to it.

(axpy! 1.0 v w)

The new update should be done with \(\nabla{C} = \nabla{C_0} + \frac{\lambda}{n} w\). Weights should be updated with \(\nabla{C_0}\) (v), as before, and with weights scaled by \(\frac{\eta\lambda}{n}\).

\(w \rightarrow w - \eta \nabla{C_0} - \frac{\eta\lambda}{n} w = (1 - \frac{\eta\lambda}{n}) w - \eta \nabla{C_0}\).

We can not implement this with axpy!, since it does not support coefficient scaling for the second matrix, only for the first. Fortunately, the function axpby! does support coefficient scaling for both matrix operands.

In terms of axpby!, the old implementation was:

(axpby! 1.0 v 1.0 w)

Instead of with 1.0, weight decay scales w with the coefficient \(1 - \frac{\eta\lambda}{n}\). We have already computed \(-\frac{\eta}{n}\) and stored it in the eta-avg. The updated implementation just adds lambda into the picture.

(axpby! 1.0 v (inc (* eta-avg (double lambda))) w)

Modifying one line of code implemented weight decay for us? We didn't even have to introduce new memory objects, protocols, functions, calls? That's right, none of that! Isn't Clojure amazing?

Here's the improved implementation of FullyConnectedTraining.

(deftype FullyConnectedTraining [v w b a-1 z a ones activ-fn first?]
  Releaseable
  (release [_]
    (release v)
    (release w)
    (release b)
    (release a-1)
    (release z)
    (release a)
    (release ones)
    (release activ-fn))
  Parameters
  (weights [_] w)
  (bias [_] b)
  Transfer
  (input [_] a-1)
  (output [_] a)
  (ones [_] ones)
  Backprop
  (forward [_]
    (activ activ-fn (rk! -1.0 b ones (mm! 1.0 w a-1 0.0 z)) a))
  (backward [_ [eta lambda]] ;; We need lambda in addition to eta
    (let [eta-avg (- (/ (double eta) (dim ones)))]
      (mul! (prime activ-fn z) a)
      (mm! eta-avg z (trans a-1) 0.0 v)
      (when-not first? (mm! 1.0 (trans w) z 0.0 a-1))
      (mv! eta-avg z ones 1.0 b)
      (axpby! 1.0 v (inc (* eta-avg (double lambda))) w))))

We need to re-evaluate the constructor so that it picks up the new deftype. Otherwise, training-layer doesn't require any change. I just modified the construction of v to be zero instead of raw, although it doesn't have any effect on weight decay, since it will be important in the next article.

(defn training-layer
  ([inference-layer input ones-vctr first?]
   (let-release [w (view (weights inference-layer))
                 v (zero w)
                 b (view (bias inference-layer))
                 a-1 (view input)
                 z (ge w (mrows w) (dim ones-vctr))
                 a (raw z)
                 o (view ones-vctr)]
     (->FullyConnectedTraining v w b a-1 z a o ((activation-fn inference-layer) z) first?)))
  ([inference-layer input ones-vctr]
   (training-layer inference-layer input ones-vctr true))
  ([inference-layer previous-backprop]
   (training-layer inference-layer
                   (output previous-backprop)
                   (ones previous-backprop)
                   false)))

Weight decay keeps the weights in check

The network is constructed in the same way as before, as the few changes that we have made were internal implementation details.

(def inference-l2 (init! (inference-network
                           native-float 4
                           [(fully-connected 32 sigmoid)
                            (fully-connected 16 tanh)
                            (fully-connected 1 linear)])))
(def training-l2 (training-network inference-l2 x-train))

The sgd function signature stayed unchanged, but the semantics is a bit different. Instead of only the learning rate eta, it accepts a Clojure vector that contains at least eta and lambda.

I'll test it with the same small number of epochs, 5, and large learning rate 0.3. I chose a large-ish lambda of 0.9 by considering the formula \(1 - \frac{\eta\lambda}{n}\) that takes in the consideration the batch size.

(sgd training-l2 y-train quadratic-cost! 5 [0.3 0.9])

1.7236447462826152

Following the previous example, I'll run it further.

(sgd training-l2 y-train quadratic-cost! 15 [0.3 0.9])

0.1106800062466209

It works. The cost function stayed small. Let's see how the weights are doing.

(map amax (map weights (layers inference-l2)))

(0.6338241 0.19060016 0.60441613)

The weights are quite small, in a very safe zone between 0 and 1.

This contained the weight growth. However, when you try this, it may prove that the weights were randomly initialized in a way that the they still grow. For the sake of this exercise, re-create the network and try again. This illustrates an important point with neural networks and machine learning in general: the hyper-parameters are brittle, and the algorithm depends on how well they were chosen. The problem: there is no guaranteed way to choose them other than experience, luck, and a bit of trial and error.

(sgd training-l2 y-train quadratic-cost! 150 [0.3 0.9])

0.07075539496243

Another 150 epochs later, the cost has shrunk.

(map amax (map weights (layers inference-l2)))

(0.67901236 0.47297716 0.7296158)

The weights are still small.

Let's try an additional thousand.

(sgd training-l2 y-train quadratic-cost! 1000 [0.3 0.9])

0.004285580638051033

The cost improved a lot. But it didn't have to. When you try this, the network might have found a cosy place, and all other nice states that it finds are not much better.

(map amax (map weights (layers inference-l2)))

(1.0893276 1.3150218 1.2590612)

The weights are somewhat bigger, but still small. As expected, training the network for a longer time discovered that some connections have more influence. However, consider that these are the largest weights in the whole network, and this small increase has been compensated by a decrease in other weights. On balance, the whole matrix stays in a safe zone where matrix multiplication does not lead to runaway weight increase.

Just to be sure, I'll run this for additional 1000 epochs.

(sgd training-l2 y-train quadratic-cost! 1000 [0.3 0.9])

0.003962985249223311

Cost decreased further.

(map amax (map weights (layers inference-l2)))

(1.0714844 1.3105394 1.2935557)

But the weights remained small. This stuff works!

Overfitting and Generalization

We have trained the network, and we know the cost when doing the inference on the training data. This is a good indicator, but the real measurement of how the network works is the cost on the test data that network has never seen during the training.

What typically happens is that a network does much better with training data than with the test data. If the network works well with the test data, and other data that is has never seen before, we say that it generalizes well. One common problem arises when we train the network too much, so that it does exceptionally well on the training data, but poorly on the test data. That would be the case of overfitting.

This time, I'll use the same number of data points in the testing data set, as I had used for training.

(def x-test (fmap! rand-uniform (ge native-float 4 10000)))

x-test

: nil#RealGEMatrix[float, mxn:4x10000, layout:column, offset:0]
:    ▥       ↓       ↓       ↓       ↓       ↓       ┓
:    →       0.08    0.08    ⁙       0.18    0.75
:    →       0.04    0.13    ⁙       0.39    0.12
:    →       0.18    0.24    ⁙       0.71    0.09
:    →       0.79    0.40    ⁙       0.69    0.26
:    ┗                                               ┛

(def y-test (ge native-float 1 10000 (map my-fn (cols x-test))))

y-test

: nil#RealGEMatrix[float, mxn:1x10000, layout:column, offset:0]
:    ▥       ↓       ↓       ↓       ↓       ↓       ┓
:    →       1.89    1.47    ⁙       2.20    1.83
:    ┗                                               ┛                                             ┛

Let's check how the infered data looks like.

(inference-l2 x-test)

: nil#RealGEMatrix[float, mxn:1x10000, layout:column, offset:0]
:    ▥       ↓       ↓       ↓       ↓       ↓       ┓
:    →       1.91    1.55    ⁙       2.16    1.85
:    ┗                                               ┛

Compare that to the actual values of the function that generated the data.

y-test

: nil#RealGEMatrix[float, mxn:1x10000, layout:column, offset:0]
:    ▥       ↓       ↓       ↓       ↓       ↓       ┓
:    →       1.89    1.47    ⁙       2.20    1.83
:    ┗                                               ┛

Finally, let's formally compare the network's answers with the known outputs of the known function it approximates, by calculating the cost that we used for training. Note that the network was trained on other data points and it has never seen these particular data points from the test set.

(quadratic-cost! (axpy! -1 y-test (inference-l2 x-test)))

0.004082752853777902

In this particular example, the train and test data is generated from the same simulation process, so we were not able to demonstrate overfitting. You might decide that the network works well or poorly for your needs, but it generalizes well. The performance was the same on the test and train data sets.

This particular network generalizes well with or without regularization. However, although we did not need weight decay to improve generalization, it helped us as a technique for controlling weight values, which proved very useful!

Microbenchmarks

As usual, we will test how well our implementation works in terms of performance. This is something that I'd recommend doing regularly while developing high performance software. It can help you spot performance regressions and learn new stuff before it becomes too complex to track down.

Intel i7-4790k CPU (2013)

(time (sgd training-l2 y-train quadratic-cost! 1000 [0.1 0.01]))

"Elapsed time: 5940.058072 msecs"
0.0045072531225305735

Not so bad for an old CPU.

Nvidia GTX 1080Ti (2017)

(cuda/with-default
  (with-release [factory (cuda-float (current-context) default-stream)]
    (with-release [cu-x-train (ge factory 4 10000)
                   cu-y-train (ge factory 1 10000)
                   inference (init! (inference-network
                                     factory 4
                                     [(fully-connected 32 sigmoid)
                                      (fully-connected 16 tanh)
                                      (fully-connected 1 linear)]))
                   training (training-network inference cu-x-train)]
      (transfer! x-train cu-x-train)
      (transfer! y-train cu-y-train)
      (time
         (sgd training cu-y-train quadratic-cost! 4000 [0.1 0.01])))))

"Elapsed time: 1495.782635 msecs"
0.0038985687214939846

As expected after previous measurements, the engine backed by cuBLAS is faster than the CPU engine even for these small layers.

AMD Radeon R9 290X (2013)

(opencl/with-default
  (with-release [factory (opencl-float *context* *command-queue*)]
    (with-release [cl-x-train (ge factory 4 10000)
                   cl-y-train (ge factory 1 10000)
                   inference (init! (inference-network
                                     factory 4
                                     [(fully-connected 32 sigmoid)
                                      (fully-connected 16 tanh)
                                      (fully-connected 1 linear)]))
                   training (training-network inference cl-x-train)]
      (transfer! x-train cl-x-train)
      (transfer! y-train cl-y-train)
      (finish!)
      (time
         (last (sgd training cl-y-train quadratic-cost! (repeat 4000 [1 [0.1 0.01]])))))))

"Elapsed time: 27597.814035 msecs"
0.003629327010316956

Unfortunately, the current problems in the OpenCL-based engine backend with varying sizes during matrix multiplications, rules out OpenCL for smaller matrices. Since that backend is open-source, I expect this problem to be solved reasonably soon.

Donations

If you feel that you can afford to help, and wish to donate, I even created a special Starbucks for two tier at patreon.com/draganrocks. You can do something even cooler: adopt a pet function.

The next article

I pretty much like how we have demonstrated a quite important regularization technique in this article. Although we haven't applied it for what it's mainly used - improving generalization - we have applied it to something that is ever easier to see. We used weight decay to demonstrate how to control the size of the weights in a domain that ensures stable numeric computation.

The best part is that the implementation of this feature was trivial, thanks to Clojure's elegance and the spartan way of carefully crafting the implementation with optimal matrix operations guided by Neanderthal.

In the next article, we will see a technique that helps the network to learn faster, measured in epochs: momentum. Will we be able to achieve that with another simple change? We'll see. Stay tuned…

Thank you

Clojurists Together financially supported writing this series. Big thanks to all Clojurians who contribute, and thank you for reading and discussing this series.