Deep Learning from Scratch to GPU - 10 - The Backward Pass (CUDA, OpenCL, Nvidia, AMD, Intel)

You can adopt a pet function! Support my work on my Patreon page, and access my dedicated discussion server. Can't afford to donate? Ask for a free invite.

March 25, 2019

Please share: .

New books available for subscription.

We complete the basic implementation of the backward pass of backpropagation and gradient descent in this article.

If you haven't yet, read my introduction to this series in Deep Learning in Clojure from Scratch to GPU - Part 0 - Why Bother?.

The previous article, Part 9, is here: The Activation and its Derivative.

To run the code, you need a Clojure project with Neanderthal () included as a dependency. If you're in a hurry, you can clone Neanderthal Hello World project.

Don't forget to read at least some introduction from Neural Networks and Deep Learning, start up the REPL from your favorite Clojure development environment, and let's continue with the tutorial.

The relevant equations

Two equations from Part 7 (which introduced gradient descent and backpropagation) of the four which are relevant, do a similar thing.

\(\delta^L = \nabla_a C \odot \sigma ' (z^L)\). (1)

\(\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma ' (z^L)\). (2)

In both of these formulas, we compute the Haddamard product of the matrix that comes from the next layer (\(\nabla_a C\) and \((w^{l+1})^T \delta^{l+1}\)) and the matrix that is located in the layer at hand (\(\sigma ' (z^L)\)).

In the last article we handled the \(\sigma'\) part of the equation, and now we'll handle the part that comes from the next layer.

After that, we'll round the whole pass by dealing with the remaining two equations using the tricks we have developed for the first two.

\(\nabla_b C^l = \delta^l\) (3)

\(\nabla_w C^l = a^{l-1} \delta^l\) (4)

TL;DR

Translating these four equations to Clojure and Neanderthal is straightforward. The key issue that we will analyze in this article is how to do it with minimal use of memory.

This is what we would like to get.

...
(backward [_]
  (mul! (prime activ-fn z) a) ;; From the last article (1 and 2)
  (mm! 1.0 (trans w) z 0.0 a-1) ;; (2)
  (mv! (/ -1.0 (dim ones)) z ones 1.0 b) ;; (3)
  (mm! (/ -1.0 (dim ones)) a-1 z 1.0 w)) ;; (4)

Note that this is not possible without introducing extra memory. This is similar to the river crossing puzzle. In the code i've shown, there is a serious bug! The second line overwrites a-1 that we kept from the forward pass. However, we need that old value of a-1 in the fourth line.

The signal traveling backwards

In the previous article, we computed the Haddamard product of the gradient of C with respect to \(a\) and the activation derivative. To calculate that signal, we needed to access the next layer. That would put us in a position of accessing layers in both directions, which would require complex layer initialization. Fortunately, we remembered that the next layer already has a view to a (a-1 in the next layer).

The dimensions of \(((w^{l+1})^T \delta^{l+1})\) and a match, so why not just delegate to the next layer to compute that and just deliver the result through a-1? Now, while we receive this data from the next layer, in this layer, we should do that calculation and send the result to its predecessor via a-1.

...
(backward [_]
  (mul! (prime activ-fn z) a) ;; From the last article (1 and 2)
  (mm! 1.0 (trans w) z 0.0 a-1)) ;; (2)

In the first line, we compute \(\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma ' (z^L)\). In the second we calculate \(((w^{l})^T \delta^{l})\) and put it into the previous layer's a (this layer's a-1). Note that the first line puts \(\delta^l\) in z, since we do not need the old value of z after we have computed \(\sigma ' (z^L)\) in the previous line.

For the second line, we use the standard in-place matrix multiplication, and store the result in a-1, so the previous layer can access it in its backward pass.

The bias update

In a nice struck of luck, the rate of change of the cost in respect to bias has already been calculated, since it is equal to the error itself!

\(\nabla_b C^l = \delta^l\) (3)

What's left is to update the bias. Hey, but bias is a vector, while the error is a matrix… How do I subtract a matrix from a vector? Do I need some sort of broadcast? No!

The error is a matrix only because we are processing the whole batch of samples at once, instead of one by one. We want to shrink many vector updates into one. Broadcast would mean "expanding" the vector to fit a matrix. Here, we are going in the opposite direction. The way to do it here is to average the updates that each column carries.

The entries of the resulting vector should each be the average of its respective row.

One naive way to implement this is to write a loop (map, fmap, or a low level loop/recur) and call sum on each row. I guess that this idea popped up immediately in the minds of most readers.

There is, of course, an easier and faster way to do it with matrix operations. Recall what the matrix-vector product does. It multiplies a matrix with a vector and the result is a vector.

This "structure" matches the problem we are dealing with. But, how does that help with summing the numbers? Luckily, each entry in the resulting vector is a dot product of a matrix row and the other vector, \(e = \sum\limits_{j=n} r_j x_j\). Now, imagine that \(x_j\) is always one, and we have our sum: \(e = \sum\limits_{j=n} r_j\)

Cool, if we only had a vector of ones, that would be one call to mv!. Maybe you forgot, it was half a dozen articles ago, but we needed a vector of ones to implement broadcasting in the forward pass. That vector is still here! Problem solved; the third equation can be added to our implementation.

...
(backward [_]
  (mul! (prime activ-fn z) a) ;; From the last article (1 and 2)
  (mm! 1.0 (trans w) z 0.0 a-1) ;; (2)
  (mv! (/ -1.0 (dim ones)) z ones 1.0 b)) ;; (3)

Note that mv! does not only compute the matrix-vector product, but can add it to the resulting vector. Fine, we have just saved us some memory, since we can fuse the calculation of the change of bias, and its subtraction from the bias into one operation that does not need a place to store the intermediate result.

b has been updated :)

The (broken) weights update

Now, to the fourth equation. That should be easier than the bias update, since the dimensions of the matrices match. The formula says \(\nabla_w C^l = a^{l-1} \delta^l\). We have the reference to a-1 and \(\delta^l\) has been safely kept in z since the first line. We multiply these two, and subtract its average from w in one fused operation.

...
(backward [_]
  (mul! (prime activ-fn z) a) ;; From the last article (1 and 2)
  (mm! 1.0 (trans w) z 0.0 a-1) ;; (2)
  (mv! (/ -1.0 (dim ones)) z ones 1.0 b) ;; (3)
  (mm! (/ -1.0 (dim ones)) a-1 z 1.0 w)) ;; (4)

The whole backward pass in four lines of code; very nice! (in Borat's voice).

However, there is a huge bug in this implementation. We have destroyed the correct value of a-1 in the second line, so the fourth line computes a completely different formula, \(((w^{l})^T \delta^{l}) \delta^l\), which is useless for our purpose.

The correct weights update

First, I'd try to find a way to rearrange the equations so that the last value is computed before a-1 has been overwritten. Unfortunately, this is not possible, since the second line depends on the old value of w, which is being updated by the fourth line.

This brings us to the river crossing puzzle. We have a fox, goose, and a bag of beans, or wolf, goat, and cabbage, or missionaries and cannibals, and have to help them cross the river in a boat just too small to carry them all at once. While these puzzles have solutions, this one doesn't. I couldn't find a way to rearrange these elegant equations without introducing one more memory location to store one of the intermediate results.

Having decided that I need one, I have to choose whether I'll have one for storing \(\delta^l\), or one for storing \(\nabla_w C^l\).

One criterion would be to choose the smaller one. Since a-1 and w share one dimension, the number of neurons in the previous layer, the difference is in the batch size vs. this layer's number of neurons. The batch size might tend to be larger, but that doesn't have to be the case, so this criterion does not give a definitive answer.

I have a better criterion: introduce a new matrix instance that I might (re)use later. We do not have such need now, but I'll cheat, and predict that we will be able to (re)use an extra memory of the dimensions of w to implement momentum.

Knowing this, I'll introduce v to FullyConnectedTraining. The correct backward method implementation is now:

...
(backward [_]
  (mul! (prime activ-fn z) a) ;; From the last article (1 and 2)
  (mm! (/ -1.0 (dim ones)) a-1 z 0.0 v) ;; (4)
  (mm! 1.0 (trans w) z 0.0 a-1) ;; (2)
  (mv! (/ -1.0 (dim ones)) z ones 1.0 b) ;; (3)
  (axpy! 1.0 v w))

Note that we haven't implemented the learning rate \(\eta\) yet, but it's a simple addition of a scalar argument to the backward method, and replacing 1.0 with eta.

If you carefully examined this code, you'll notice that in the mm! in the second line the matrix dimensions of a-1 multiplied by z, do not match w. That's because in w we chose dimensions to be \(output \times input\), since it fits nicely in most formulas. For this reason, in this formula, we have to transpose the computation, following its properties \((AB)^T = B^T A^T\).

...
(backward [_]
  (mul! (prime activ-fn z) a) ;; From the last article (1 and 2)
  (mm! (/ -1.0 (dim ones)) z (trans a-1) 0.0 v) ;; (4)
  (mm! 1.0 (trans w) z 0.0 a-1) ;; (2)
  (mv! (/ -1.0 (dim ones)) z ones 1.0 b) ;; (3)
  (axpy! 1.0 v w))

The updated training deftype

We don't need any changes in the inference layer code, but it's nice to have the reference here, and we also need this code if we want this article to be self-sufficient.

(require '[uncomplicate.commons.core
           :refer [with-release let-release Releaseable release]]
         '[uncomplicate.clojurecuda.core :as cuda
           :refer [current-context default-stream synchronize!]]
         '[uncomplicate.clojurecl.core :as opencl
           :refer [*context* *command-queue* finish!]]
         '[uncomplicate.neanderthal
           [core :refer [mrows ncols dim raw view view-ge vctr copy row
                         entry! axpy! copy! scal! mv! transfer! transfer
                         mm! rk! view-ge vctr ge trans nrm2]]
           [math :refer [sqr]]
           [vect-math :refer [tanh! linear-frac! sqr! mul! cosh! inv!]]
           [native :refer [fge native-float]]
           [cuda :refer [cuda-float]]
           [opencl :refer [opencl-float]]])
(import 'clojure.lang.IFn)
(defprotocol Parameters
  (weights [this])
  (bias [this]))

(defprotocol ActivationProvider
  (activation-fn [this]))
(deftype FullyConnectedInference [w b activ-fn]
  Releaseable
  (release [_]
    (release w)
    (release b)
    (release activ-fn))
  Parameters
  (weights [_] w)
  (bias [_] b)
  ActivationProvider
  (activation-fn [_] (activ-fn))
  IFn
  (invoke [_ x ones a]
    (activ-fn (rk! -1.0 b ones (mm! 1.0 w x 0.0 a)))))
(defn fully-connected [factory activ-fn in-dim out-dim]
  (let-release [w (ge factory out-dim in-dim)
                bias (vctr factory out-dim)]
    (->FullyConnectedInference w bias activ-fn)))
(defprotocol Backprop
  (forward [this])
  (backward [this eta]))

(defprotocol Transfer
  (input [this])
  (output [this])
  (ones [this]))

(defprotocol Activation
  (activ [_ z a!])
  (prime [_ z!]))

Note the new field v and the completed backward method.

(deftype FullyConnectedTraining [v w b a-1 z a ones activ-fn]
  Releaseable
  (release [_]
    (release v)
    (release w)
    (release b)
    (release a-1)
    (release z)
    (release a)
    (release ones)
    (release activ-fn))
  Parameters
  (weights [_] w)
  (bias [_] b)
  Transfer
  (input [_] a-1)
  (output [_] a)
  (ones [_] ones)
  Backprop
  (forward [_]
    (activ activ-fn (rk! -1.0 b ones (mm! 1.0 w a-1 0.0 z)) a))
  (backward [_ eta]
    (let [eta-avg (- (/ (double eta) (dim ones)))]
      (mul! (prime activ-fn z) a) ;; From the last article (1 and 2)
      (mm! eta-avg z (trans a-1) 0.0 v) ;; (4)
      (mm! 1.0 (trans w) z 0.0 a-1) ;; (2)
      (mv! eta-avg z ones 1.0 b) ;; (3)
      (axpy! 1.0 v w))))

The constructor's signature is unchanged. It internally creates an extra reference that matches the dimensions of weights.

(defn training-layer
  ([inference-layer input ones-vctr]
   (let-release [w (view (weights inference-layer))
                 v (raw w)
                 b (view (bias inference-layer))
                 a-1 (view input)
                 z (ge w (mrows w) (dim ones-vctr))
                 a (raw z)
                 o (view ones-vctr)]
     (->FullyConnectedTraining v w b a-1 z a o ((activation-fn inference-layer) z))))
  ([inference-layer previous-backprop]
   (training-layer inference-layer
                   (output previous-backprop)
                   (ones previous-backprop))))

Let's check whether it works at all, without worrying whether the result is correct. We'll need to actually learn something concrete to be able to check that, and we'll do that soon, in a dedicated article. The only addition is the learning rate argument in the backward method.

But, first, we'd have to carry over the activation implementations from the last chapter.

(deftype SigmoidActivation [work]
  Releaseable
  (release [_]
    (release work))
  Activation
  (activ [_ z a!]
    (linear-frac! 0.5 (tanh! (scal! 0.5 (copy! z a!))) 0.5))
  (prime [this z!]
    (linear-frac! 0.5 (tanh! (scal! 0.5 z!)) 0.5)
    (mul! z! (linear-frac! -1.0 z! 1.0 work))))
(defn sigmoid
  ([]
   (fn [z]
     (let-release [work (raw z)]
       (->SigmoidActivation work))))
  ([z!]
   (linear-frac! 0.5 (tanh! (scal! 0.5 z!)) 0.5)))
(deftype TanhActivation []
  Activation
  (activ [_ z a!]
    (tanh! z a!))
  (prime [this z!]
    (sqr! (inv! (cosh! z!)))))
(defn tanh
  ([]
   (fn [_] (->TanhActivation)))
  ([z!]
   (tanh! z!)))

Checking whether the structure fits nicely:

(with-release [x (ge native-float 2 2 [0.3 0.9 0.3 0.9])
               ones (vctr native-float 1 1)
               layer-1 (fully-connected native-float tanh 2 4)
               layer-2 (fully-connected native-float sigmoid 4 1)
               training-layer-1 (training-layer layer-1 x ones)
               training-layer-2 (training-layer layer-2 training-layer-1)]
  (transfer! [0.3 0.1 0.9 0.0 0.6 2.0 3.7 1.0] (weights layer-1))
  (transfer! [0.7 0.2 1.1 2] (bias layer-1))
  (transfer! [0.75 0.15 0.22 0.33] (weights layer-2))
  (transfer! [0.3] (bias layer-2))
  (forward training-layer-1)
  (forward training-layer-2)
  (backward training-layer-2 0.05)
  (backward training-layer-1 0.05)
  (transfer (output training-layer-2)))
nil#RealGEMatrix[float, mxn:1x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       0.44    0.44
   ┗                       ┛

The output is as expected. Nice job. I hope we continue making this work.

Micro benchmark

Let's check how much the newly added code affects performance. Since we had a few matrix multiplications and a few other operations, this should be noticeably slower than the forward pass, but not more than a few times slower.

Intel 4790k (2013)

(with-release [factory native-float]
  (with-release [x (ge factory 10000 10000)
                 ones (entry! (vctr factory 10000) 1)
                 layer-1 (fully-connected factory tanh 10000 5000)
                 layer-2 (fully-connected factory sigmoid 5000 1000)
                 layer-3 (fully-connected factory sigmoid 1000 10)
                 training-layer-1 (training-layer layer-1 x ones)
                 training-layer-2 (training-layer layer-2 training-layer-1)
                 training-layer-3 (training-layer layer-3 training-layer-2)]
    (time
     (do
       (forward training-layer-1)
       (forward training-layer-2)
       (forward training-layer-3)
       (backward training-layer-3 0.05)
       (backward training-layer-2 0.05)
       (backward training-layer-1 0.05)))
    true))
"Elapsed time: 7498.412034 msecs"

This is quite fast even on an old desktop CPU! Note that we used single floating point precision in CPU computations, which is twice as fast as the double precision engine that we used in the previous parts when we measured CPU speed.

Nvidia GTX 1080 Ti (2017)

(cuda/with-default
  (with-release [factory (cuda-float (current-context) default-stream)]
    (with-release [x (ge factory 10000 10000)
                   ones (entry! (vctr factory 10000) 1)
                   layer-1 (fully-connected factory tanh 10000 5000)
                   layer-2 (fully-connected factory sigmoid 5000 1000)
                   layer-3 (fully-connected factory sigmoid 1000 10)
                   training-layer-1 (training-layer layer-1 x ones)
                   training-layer-2 (training-layer layer-2 training-layer-1)
                   training-layer-3 (training-layer layer-3 training-layer-2)]
      (time
       (do
         (forward training-layer-1)
         (forward training-layer-2)
         (forward training-layer-3)
         (backward training-layer-3 0.05)
         (backward training-layer-2 0.05)
         (backward training-layer-1 0.05)
         (synchronize!))))))
"Elapsed time: 433.340259 msecs"

The result is as we expected. The forward pass does one matrix multiplication and takes 124 milliseconds, and the backward pass, which does two matrix multiplications, takes roughly twice as that at a bit less than 300 milliseconds.

Compared to CPU, this is around 20 times faster. What happened to the 50 times speedup we got earlier? The thing is that we compared the default on the GPU (single precision) with the default on the CPU (double precision). This is a kind of apples vs. oranges. Let's just say that GPU has an ace in the sleeve to make up for that; keep reading the following articles!

AMD R9 290X (2013)

(opencl/with-default
  (with-release [factory (opencl-float *context* *command-queue*)]
    (with-release [x (ge factory 10000 10000)
                   ones (entry! (vctr factory 10000) 1)
                   layer-1 (fully-connected factory tanh 10000 5000)
                   layer-2 (fully-connected factory sigmoid 5000 1000)
                   layer-3 (fully-connected factory sigmoid 1000 10)
                   training-layer-1 (training-layer layer-1 x ones)
                   training-layer-2 (training-layer layer-2 training-layer-1)
                   training-layer-3 (training-layer layer-3 training-layer-2)]
      (forward training-layer-1)
      (finish!)
      (time
       (do
         (forward training-layer-1)
         (forward training-layer-2)
         (forward training-layer-3)
         (backward training-layer-3 0.05)
         (backward training-layer-2 0.05)
         (backward training-layer-1 0.05)
         (finish!))))))
"Elapsed time: 18836.38278 msecs"

When I'm lucky, I get this timing, and when I'm not, I get an OpenCL error: CL_MEM_OBJECT_ALLOCATION_FAILURE. Something is suspicious here! This is not 3 times slower than the forward pass alone, but 50 times slower. We expected around 1 second, and got 18 seconds. Is our implementation bad? It might be, but it hasn't caused any issues with the CUDA engine, so I doubt.

Keep the network size in check!

I suspect that the issue is that we are overstretching the resources in the older AMD's GPU. R9 290X was a beast in its heyday, but that was 6 years ago. That in itself would not be a problem, but at that time 4GB of memory was a lot. It's not any more; GTX 1080Ti has 11GB.

We have created a network with (artificially) large layers. Let's count a bit.

  • The input (x) takes 10K × 10K elements. That's 100M.
  • Vector of ones: 10K.
  • Inference layer 1 (w and b): 10K × 5K + 5K = 50M + 5K.
  • Inference layer 2: 5K × 1K + 1K = 5M + 1K.
  • Inference layer 3: 1K × 10 + 10 = 10K + 10.
  • Training layer 1: (z, a, and v): 50M × 3 = 150M
  • Training layer 2: (z, a, v, and work): 5M × 4 = 20M
  • Training layer 3: 10K × 4 = 40K

If I haven't missed something that adds up to 325,057,010 entries. Since we used single precision floating point numbers, that would take 4 bytes each, so around 1.3GB. That alone should not be a problem for R9 270X's 4GB, of which more than 3GB is available for data.

If we think about it, of course it is not the problem. After all, the network did compute! It just took unexpectedly long time to do so. What I suspect is the case, is that a large structure strained on some other memory resource. My bet is that there's been some overcrowding that forced the registers to spill out to global GPU DRAM at some point.

I use this GPU as my major display, and I run a few dozen Chromium tabs with some YouTube video playing while I'm doing this, and this might eat up some resources. Ultimately, these things depend on the hardware. I expect that the new cards are more resilient to such corner cases.

I even found out where the problem was: in the first layer.

Consider making it just a bit smaller; instead of 10000, we'll set it up with 7200 neurons.

(opencl/with-default
  (with-release [factory (opencl-float *context* *command-queue*)]
    (with-release [x (ge factory 10000 7200)
                   ones (entry! (vctr factory 7200) 1)
                   layer-1 (fully-connected factory tanh 10000 5000)
                   layer-2 (fully-connected factory sigmoid 5000 1000)
                   layer-3 (fully-connected factory sigmoid 1000 10)
                   training-layer-1 (training-layer layer-1 x ones)
                   training-layer-2 (training-layer layer-2 training-layer-1)
                   training-layer-3 (training-layer layer-3 training-layer-2)]
      (forward training-layer-1)
      (finish!)
      (time
       (do
         (forward training-layer-1)
         (forward training-layer-2)
         (forward training-layer-3)
         (backward training-layer-3 0.05)
         (backward training-layer-2 0.05)
         (backward training-layer-1 0.05)
         (finish!))))))
"Elapsed time: 828.808597 msecs"

This is as we expected! No slowdown here.

The moral of the story is that you have to measure things for the particular use case and the hardware that you'll run it on. No magical framework can make a square peg fit into a smaller round hole. It may even be the opposite; they can make pegs much larger than necessary. When we know our tools, and invest some time in understanding the methods that we use, we'll at least be able to pick the right trade-offs.

Donations

If you feel that you can afford to help, and wish to donate, I even created a special Starbucks for two tier at patreon.com/draganrocks. You can do something even cooler: adopt a pet function.

The next article

It seems to me we have completed a basic implementation of the backpropagation based gradient descent algorithm. Let's think about how to wrap it in a user-friendly neural networks API. In the next article, we will take care of it, and then try to learn a simple function with a simple neural network, since I'm curious to find out whether this implementation is correct.

Have in mind that we have only tested the forward pass, and not the backward pass. Once we make sure that the basic implementation works with a simple example, we can try to learn something more interesting, and also to think about how to improve the quality of learning with a few optimizations to the algorithm itself.

Stay tuned for the next article.

Thank you

Clojurists Together financially supported writing this series. Big thanks to all Clojurians who contribute, and thank you for reading and discussing this series.

Deep Learning from Scratch to GPU - 10 - The Backward Pass (CUDA, OpenCL, Nvidia, AMD, Intel) - March 25, 2019 - Dragan Djuric