Deep Learning from Scratch to GPU - 9 - The Activation and its Derivative

Need help with your custom Clojure software? I'm open to (selected) contract work.

March 20, 2019


These books fund my work! Please check them out.

We implement the key part of the backward pass, the computation of the error of a layer. Along the way, we set up the infrastructure for the complete implementation of backpropagation.

If you haven't yet, read my introduction to this series in Deep Learning in Clojure from Scratch to GPU - Part 0 - Why Bother?.

The previous article, Part 8, is here: The Forward Pass.

To run the code, you need a Clojure project with Neanderthal included as a dependency. If you're in a hurry, you can clone the Neanderthal Hello World project.

Don't forget to read at least some introduction from Neural Networks and Deep Learning, start up the REPL from your favorite Clojure development environment, and let's continue with the tutorial.

The relevant equations

Two of the four equations from Part 6 are relevant for this article, and they do a similar thing. In both formulas, we compute the Hadamard product of the matrix that comes from the next layer (\(\nabla_a C\) or \((w^{l+1})^T \delta^{l+1}\)) and the matrix located in the layer at hand (\(\sigma ' (z^l)\)).

\(\delta^L = \nabla_a C \odot \sigma ' (z^L)\).

\(\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma ' (z^l)\).

First, we will handle \(\nabla_a C\) and \((w^{l+1})^T \delta^{l+1}\).
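
As a quick aside, the Hadamard product is nothing exotic in code: Neanderthal's mul! (from the vect-math namespace that we require a bit further down) multiplies two matrices entry by entry. Here is a throwaway REPL sketch, just to make the notation concrete.

(let [x (fge 2 2 [1 2 3 4])
      y (fge 2 2 [10 20 30 40])]
  ;; mul! multiplies x and y element-wise and overwrites x with the result,
  ;; so this returns a 2x2 matrix with entries 10, 40, 90, and 160.
  (mul! x y))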

TL;DR

This is the key line of code that we will add to our implementation of the training layer. Both equations boil down to this tiny one-liner.

...
(backward [_]
  (mul! (prime activ-fn z) a))
...

In this article, we'll start developing the backward method. The end result will be just a few rather dense lines of Clojure code, in the Neanderthal fashion that you have probably gotten used to by now. It would have been a bit too much if I had thrown it all at you at once, so I am going to explain the considerations behind each step in detail.

The training layer

Again for reference, I repeat the inference and training layer code that we developed in the last article.

(require '[uncomplicate.commons.core
           :refer [with-release let-release Releaseable release]]
         '[uncomplicate.clojurecuda.core :as cuda
           :refer [current-context default-stream synchronize!]]
         '[uncomplicate.clojurecl.core :as opencl
           :refer [*context* *command-queue* finish!]]
         '[uncomplicate.neanderthal
           [core :refer [mrows dim raw view entry! axpy! copy! scal! transfer!
                         transfer mm! rk! view-ge vctr ge]]
           [vect-math :refer [tanh! linear-frac! sqr! mul! cosh! inv!]]
           [native :refer [fge native-float]]
           [cuda :refer [cuda-float]]
           [opencl :refer [opencl-float]]])
(import 'clojure.lang.IFn)

(defprotocol Parameters
  (weights [this])
  (bias [this]))

(defprotocol ActivationProvider
  (activation-fn [this]))

(deftype FullyConnectedInference [w b activ-fn]
  Releaseable
  (release [_]
    (release w)
    (release b))
  Parameters
  (weights [_] w)
  (bias [_] b)
  ActivationProvider
  (activation-fn [_] (activ-fn))
  IFn
  (invoke [_ x ones a]
    (activ-fn (rk! -1.0 b ones (mm! 1.0 w x 0.0 a)))))

(defn fully-connected [factory activ-fn in-dim out-dim]
  (let-release [w (ge factory out-dim in-dim)
                bias (vctr factory out-dim)]
    (->FullyConnectedInference w bias activ-fn)))

(defprotocol Backprop
  (forward [this])
  (backward [this]))

(defprotocol Transfer
  (input [this])
  (output [this])
  (ones [this]))

(deftype FullyConnectedTraining [w b a-1 z a ones-vctr activ-fn]
  Releaseable
  (release [_]
    (release w)
    (release b)
    (release a-1)
    (release z)
    (release a)
    (release ones-vctr))
  Parameters
  (weights [_] w)
  (bias [_] b)
  Transfer
  (input [_] a-1)
  (output [_] a)
  (ones [_] ones-vctr)
  Backprop
  (forward [_]
    (activ-fn (rk! -1.0 b ones-vctr (mm! 1.0 w a-1 0.0 z)) a))
  (backward [_]
    (throw (ex-info "TODO" {}))))

(defn training-layer
  ([inference-layer input ones-vctr]
   (let-release [w (view (weights inference-layer))
                 b (view (bias inference-layer))
                 a-1 (view input)
                 z (ge w (mrows w) (dim ones-vctr))
                 a (raw z)
                 o (view ones-vctr)]
     (->FullyConnectedTraining w b a-1 z a o (activation-fn inference-layer))))
  ([inference-layer previous-backprop]
   (training-layer inference-layer
                   (output previous-backprop)
                   (ones previous-backprop))))

The signal traveling backwards

From the definition of the Hadamard product, it is obvious that the dimensions of whatever comes from the next layer (computed in the previous backward step) are the same as the dimensions of z.

Also note that the next layer holds a view of a: it is the a-1 matrix in the next layer. The next layer uses the a-1 data in the forward pass and does not need it afterwards.

It is a rather nice coincidence that we can use this shared reference to receive whatever the next layer computes, \(\nabla_a C\) or \((w^{l+1})^T \delta^{l+1}\).

The first part is solved; we will let the next layer worry about how to compute \(\nabla_a C\) or \(((w^{l+1})^T \delta^{l+1})\) and receive it through a in the current layer when its turn comes during the backward pass.
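
If you want to convince yourself that a view really shares the underlying memory, here is a quick throwaway REPL sketch (not part of the layer code): writing through the view is immediately visible in the original, which is exactly how the next layer's a-1 will see what we write into this layer's a.

(let [a (fge 2 3 [1 2 3 4 5 6])
      shared (view a)]
  ;; overwrite every entry through the view...
  (entry! shared 0.0)
  ;; ...and all entries of a are now 0.0
  a)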

Computing the derivative of the activation

At first glance, this step seems trivial. We have the activation function \(\sigma\), we wisely saved the values of the linear outputs \(z^l\), and we just need to find the function that is the derivative \(\sigma '\) and apply it.

As you probably recall from the article Bias and Activation Function, although we could use any function that is polite enough to be differentiable in the whole domain that interests us, we typically choose one from a very narrow selection: ReLU, sigmoid, hyperbolic tangent, and a few variants of these.

The key reason to stick to a somewhat standard selection is that we can then have their derivatives ready at hand. The reason to use these particular functions is that their derivatives are also easy-ish to implement.

So, given the activation function \(\sigma\), the first task is to get hold of the function that implements its derivative. We can do this in a few ways in Clojure:

  • (1) Offload it to the user: when the layer is constructed, in addition to activation, the user would have to provide activation-prime.
  • (2) Define a new type (using deftype) that can calculate the function and its prime and, if necessary, pre-allocate the working memory.
  • (3) Provide a zero-argument implementation in the activation function itself, which returns the prime function.

I don't like the first approach, since it puts an unnecessary burden on the user. Besides, they typically will not choose any silly function as activation, but one from the handful that are tried and tested. Therefore, I'd opt for one of the other two. Let me try to implement the last option.

I'll cover sigmoid and hyperbolic tangent; you should be able to work out the details of a few more as an exercise.

The function returns its own prime

In the previous article, we implemented sigmoid-prim!.

(defn sigmoid-prim!
  ([x!]
   (with-release [x-raw (raw x!)]
     (sigmoid-prim! x! x-raw)))
  ([x! prim!]
   (let [x (sigmoid! x!)]
     (mul! (linear-frac! -1.0 x 1.0 prim!) x))))

Now, we just improve sigmoid! a bit.

(defn sigmoid!
  ([] sigmoid-prim!) ;; Only this, and that's it? Yes!
  ([x]
   (linear-frac! 0.5 (tanh! (scal! 0.5 x)) 0.5))
  ([x y]
   (linear-frac! 0.5 (tanh! (scal! 0.5 (copy! x y))) 0.5)))
(def activation-prim (sigmoid!))
#'user/activation-prim

Yep, works.

During the backward pass, we will call this function with z, and it will return the derivative. Then, we will multiply that derivative with whatever comes from the next layer (that was computed previously when going backwards) and get some matrix that we'll use when updating weights.

But, wait a minute. The one-argument sigmoid-prim! creates a new instance of the whole resulting matrix. I'm afraid that will slow us down a bit (or a lot) and take a toll on memory. Luckily, we have the two-argument variant, so we should provide memory that is ready to accept the result.

Now, the key thing is that, after this calculation, I won't need the \(z^l\) data any more. What I would really like to do is simply overwrite it with the value of the derivative! Should I just call (activation-prim z z)? That is technically possible, but there lies a trap! Recall from the last article that, in this particular implementation of the sigmoid activation, x! and prim! have to be different buffers: prim! is overwritten with \(1 - \sigma(x)\) while x! still has to hold \(\sigma(x)\) for the final multiplication.

Can we reuse a? Well, we have already reserved it for the signal coming from the next layer in the backward pass. Sorry! Hmm… We have a-1 sitting around idly. We won't need it until we use it to pass the signal to the previous layer. Nice, but its dimensions are different from a's.

So, should we add a new field to the FullyConnectedTraining type? This could work. However, who guarantees that some other activation function won't need two or more temporary working matrices? Some functions (tanh-prim!, for example) won't need any additional memory, so in that case the field would be a waste.

The Sigmoid activation function as a deftype

It turns out that the second approach, a dedicated activation function type, might be worth trying.

First, we take advantage of the fact that when we are calling the activation function, we need to keep the input unchanged (the function is called with two arguments), but when we are calling the function's derivative, we want to overwrite the input (the function is called with one argument). The activ and prime methods reflect that.

(defprotocol Activation
  (activ [_ z a!])
  (prime [_ z!]))

This implementation is molded after the way we use it in the inference and training layer implementations. The inference layer calls the plain function, while the training layer calls the Activation methods.

(deftype SigmoidActivation [work]
  Releaseable
  (release [_]
    (release work))
  Activation
  (activ [_ z a!]
    (linear-frac! 0.5 (tanh! (scal! 0.5 (copy! z a!))) 0.5))
  (prime [this z!]
    (linear-frac! 0.5 (tanh! (scal! 0.5 z!)) 0.5)
    (mul! z! (linear-frac! -1.0 z! 1.0 work))))

Note that someone has to take care of the work lifecycle. That would be SigmoidActivation. In turn, the layer that references the activation object will take care of calling its release method. We will add a call to (release activ-fn) in the release implementation of the training layer, where we intend to use this type.

Using the same reasoning, we can simplify the implementation of the function used in the inference layer by providing only a one-argument version, since we don't need to keep \(z^l\) there. The reason we are not reusing the SigmoidActivation type is that its work matrix's dimensions, including the batch size, have to be known in advance, while the inference layer supports any batch size.

(defn sigmoid
  ([]
   (fn [z]
     (let-release [work (raw z)]
       (->SigmoidActivation work))))
  ([z!]
   (linear-frac! 0.5 (tanh! (scal! 0.5 z!)) 0.5)))

The plain sigmoid function has a zero-argument arity that returns a constructor of the sigmoid activation type. The intended use is this: in the inference layer we use the plain function, while during the construction of the training layer we call the zero-argument arity to obtain the constructor of the variant suitable for the training algorithm.

(let [z (fge 2 3 [-0.6 0 0.2 0.5 0.7 1])
      a (raw z)]
  (with-release [activation ((sigmoid) z)]
    {:function (activ activation z a)
     :derivative (prime activation z)}))
{:function #RealGEMatrix[float, mxn:2x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       0.35    0.55    0.67
   →       0.50    0.62    0.73
   ┗                               ┛
, :derivative #RealGEMatrix[float, mxn:2x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       0.23    0.25    0.22
   →       0.25    0.24    0.20
   ┗                               ┛
}

This looks nice to me, while being memory-savvy, too.

Hyperbolic tangent as a deftype

Now, tanh!.

You can guess that the derivative of this fine function is straightforward.

\(\tanh'(x) = \frac{1}{\cosh^2(x)}\)
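
If you want to check that yourself, it follows from the quotient rule and the identity \(\cosh^2(x) - \sinh^2(x) = 1\):

\(\tanh'(x) = \frac{\cosh^2(x) - \sinh^2(x)}{\cosh^2(x)} = \frac{1}{\cosh^2(x)} = 1 - \tanh^2(x)\)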

We don't even need any working memory, so the implementation is even easier than with the sigmoid.

(deftype TanhActivation []
  Activation
  (activ [_ z a!]
    (tanh! z a!))
  (prime [this z!]
    (sqr! (inv! (cosh! z!)))))
(defn tanh
  ([]
   (fn [_] (->TanhActivation)))
  ([z!]
   (tanh! z!)))
#'user/tanh
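
As a warm-up for the exercise of working out a few more activations, here is a minimal sketch of the simplest possible case, a linear (identity) activation, just to show the shape of the exercise; we won't need it later. The identity simply copies z to the output, and its derivative is 1 everywhere, so prime just fills z! with ones.

(deftype LinearActivation []
  Activation
  (activ [_ z a!]
    ;; identity: the output is the input, copied into the output buffer
    (copy! z a!))
  (prime [this z!]
    ;; the derivative of the identity is 1 everywhere
    (entry! z! 1.0)))

(defn linear
  ([]
   (fn [_] (->LinearActivation)))
  ([z!]
   z!))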

The updated training deftype

(deftype FullyConnectedTraining [w b a-1 z a ones-vctr activ-fn]
  Releaseable
  (release [_]
    (release w)
    (release b)
    (release a-1)
    (release z)
    (release a)
    (release ones-vctr)
    (release activ-fn))
  Parameters
  (weights [_] w)
  (bias [_] b)
  Transfer
  (input [_] a-1)
  (output [_] a)
  (ones [_] ones-vctr)
  Backprop
  (forward [_]
    (activ activ-fn (rk! -1.0 b ones-vctr (mm! 1.0 w a-1 0.0 z)) a))
  (backward [_]
    (mul! (prime activ-fn z) a) ;; z now contains delta^l
    "TODO"))
(defn training-layer
  ([inference-layer input ones-vctr]
   (let-release [w (view (weights inference-layer))
                 b (view (bias inference-layer))
                 a-1 (view input)
                 z (ge w (mrows w) (dim ones-vctr))
                 a (raw z)
                 o (view ones-vctr)]
     (->FullyConnectedTraining w b a-1 z a o ((activation-fn inference-layer) z))))
  ([inference-layer previous-backprop]
   (training-layer inference-layer
                   (output previous-backprop)
                   (ones previous-backprop))))

Let's check it with our old friend, an arbitrary network with two hidden layers.

(with-release [x (ge native-float 2 2 [0.3 0.9 0.3 0.9])
               ones (vctr native-float 1 1)
               layer-1 (fully-connected native-float tanh 2 4)
               layer-2 (fully-connected native-float sigmoid 4 1)
               training-layer-1 (training-layer layer-1 x ones)
               training-layer-2 (training-layer layer-2 training-layer-1)]
  (transfer! [0.3 0.1 0.9 0.0 0.6 2.0 3.7 1.0] (weights layer-1))
  (transfer! [0.7 0.2 1.1 2] (bias layer-1))
  (transfer! [0.75 0.15 0.22 0.33] (weights layer-2))
  (transfer! [0.3] (bias layer-2))
  (forward training-layer-1)
  (forward training-layer-2)
  (backward training-layer-2)
  (backward training-layer-1)
  (transfer (output training-layer-2)))
#RealGEMatrix[float, mxn:1x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       0.44    0.44
   ┗                       ┛

The output is as expected. Nice job. I hope we continue making this work.

Micro benchmark

Let's see how fast the network is. Does the backward pass, at least the part that we have implemented by now, affect performance?

Nvidia GTX 1080 Ti (2017)

(cuda/with-default
  (with-release [factory (cuda-float (current-context) default-stream)]
    (with-release [x (ge factory 10000 10000)
                   ones (entry! (vctr factory 10000) 1)
                   layer-1 (fully-connected factory tanh 10000 5000)
                   layer-2 (fully-connected factory sigmoid 5000 1000)
                   layer-3 (fully-connected factory sigmoid 1000 10)
                   training-layer-1 (training-layer layer-1 x ones)
                   training-layer-2 (training-layer layer-2 training-layer-1)
                   training-layer-3 (training-layer layer-3 training-layer-2)]
      (time
       (do
         (forward training-layer-1)
         (forward training-layer-2)
         (forward training-layer-3)
         (backward training-layer-3)
         (backward training-layer-2)
         (backward training-layer-1)
         (synchronize!))))))
"Elapsed time: 130.929826 msecs"

Not much of a slowdown here compared to the less complete implementation from the last article. That is expected, since the Hadamard product, sigmoid, and hyperbolic tangent all scale linearly with the number of matrix entries (\(O(n)\)), which is negligible in comparison with the matrix multiplications, which are \(O(n^3)\).
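
To put rough numbers on this, using the dimensions from the benchmark above: for the first layer alone, the matrix multiplication does on the order of \(5000 \times 10000 \times 10000 = 5 \times 10^{11}\) multiply-adds, while the element-wise activation and its derivative touch only the \(5000 \times 10000 = 5 \times 10^{7}\) entries of \(z\), four orders of magnitude less work.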

AMD R9 290X (2013)

(opencl/with-default
  (with-release [factory (opencl-float *context* *command-queue*)]
    (with-release [x (ge factory 10000 10000)
                   ones (entry! (vctr factory 10000) 1)
                   layer-1 (fully-connected factory tanh 10000 5000)
                   layer-2 (fully-connected factory sigmoid 5000 1000)
                   layer-3 (fully-connected factory sigmoid 1000 10)
                   training-layer-1 (training-layer layer-1 x ones)
                   training-layer-2 (training-layer layer-2 training-layer-1)
                   training-layer-3 (training-layer layer-3 training-layer-2)]
      (forward training-layer-1)
      (finish!)
      (time
       (do
         (forward training-layer-1)
         (forward training-layer-2)
         (forward training-layer-3)
         (backward training-layer-3)
         (backward training-layer-2)
         (backward training-layer-1)
         (finish!))))))
"Elapsed time: 342.650642 msecs"

Compared to the 337 ms we got before, there is no noticeable slowdown in the OpenCL implementation either.

Donations

If you feel that you can afford to help, and wish to donate, I even created a special Starbucks for two tier at patreon.com/draganrocks. You can do something even cooler: adopt a pet function.

The next article

So, finally, the implementation of the learning algorithm unwinds. The majority of the structure is in place. What's left for the (now straightforward) backward pass is computing the remaining gradients, \(\nabla_b C^l\) and \(\nabla_w C^l\), propagating \(\nabla_{a^{l-1}} C\) to the previous layer, and updating the weights and biases.

After that, we should take care of wrapping this whole infrastructure into a nice and user-friendly neural networks API. We don't expect our users to assemble all these layers by hand, right?

Then, we can play with some real-world examples, and see which optimizations we can add to our implementation so that the learning itself becomes more robust.

We will? Yes! But why didn't I show you the result first? Seeing what we are aiming at would be a terrific motivator! Perhaps, but I am not writing this series to serve as easy entertainment. I want you to think about the details. I want you to imagine what you would like the end result to be. Besides, I am writing this implementation as we go :) I have yet to decide what the final result is going to be, and your suggestions are welcome!

Thank you

Clojurists Together financially supported writing this series. Big thanks to all Clojurians who contribute, and thank you for reading and discussing this series.
