Deep Learning from Scratch to GPU - 14 - Learning a Regression

April 15, 2019

Please share this post in your communities. Without your help, it will stay burried under tons of corporate-pushed, AI and blog farm generated slop, and very few people will know that this exists.

These books fund my work! Please check them out.

A great moment has arrived. We are going to apply our neural networks implementation to a regression problem. The network is going to learn a known function, which enables us to see how well it learns, and why it doesn't do a great job. We are also going to get some hints for improvements. But, hey, it works!

If you haven't yet, read my introduction to this series in Deep Learning in Clojure from Scratch to GPU - Part 0 - Why Bother?.

The previous article, Part 13, is here: Initializing Weights.

To run the code, you need a Clojure project with Neanderthal () included as a dependency. If you're in a hurry, you can clone Neanderthal Hello World project.

Don't forget to read at least some introduction from Neural Networks and Deep Learning, start up the REPL from your favorite Clojure development environment, and let's continue with the tutorial.

Neural networks approximate functions

When we step over the hype, what neural networks do is they approximate functions. Neural networks learn to do that based on a lot of examples taken out from the function's output. The real use case is to learn to approximate functions that would be impossible to explicitly implement in practice.

Their key ability is to approximate unknown functions. When I know a set of rules that transfer inputs to outputs, I can think of many ways of implementing that in programming languages, and all these ways are more efficient than using neural networks. If I don't know the process that transfers the inputs to the outputs, I can't program it explicitly. If the process is known, but the rules are numerous, and it is not feasible to elicit them, that could be hard to implement, too.

A typical example familiar to programmers would be an expert system. Expert systems were a promising area of AI several decades ago. The idea is to find an expert for a certain area, help her define the rules she is using when making some decisions, program there rules in fancy DSLs with if/then/else flavor, and profit. It turns out that expert time is expensive. Also, experts use lots of intuition; some rules work, but not always, with lots of 'however's. On top of that, the probabilistic nature of life kicks in, so you can't completely rely even on the rules that you can define.

Let's say that I would like to analyze traffic of a website to defend against malicious visitors. I consulted with an expert, and he told me most of the known ways of detecting these. I implement some filters: if the user is from this range of IP addresses, if he uses a web browser with a certain user agent, if he comes via a proxy, if… Lots of rules are possible, and they would filter a lot of unwanted traffic. They would also filter some wanted traffic. But, most importantly, the attackers also know a lots of detection rules, so they would adapt to pass these rules.

The approach that neural networks take is implicit. If I feed past data to the network, and label good and bad visitors, without saying why the good are good and the bad are bad the network can figure out how to recognize them on its own. Even better, it can figure out how to recognize traffic that it has never seen. Of course, it may not do it perfectly, but it can learn this sort of stuff well enough.

To summarize, if I have lots of input/output examples, I can train a neural network to approximate the unknown function that produced these examples.

We are going to do something less ambitious than the website traffic filtering in this article. We are going to train a neural network on a known function. It is obvious that neural networks are not the right tool for that job, since it is much easier and precise to code the function in Clojure right away. It is a great first example, though, since it makes it possible to easily see what the network is doing.

Generating artificial data

Since we are simulating the data, we know the exact function that produces it. I will use the function \(f(\mathbf{x}) = sin(x_0) + cos(x_1) + tanh(x_2) + {x_3}^2\) as an easy example.

(require '[uncomplicate.neanderthal.real :as real]
         '[uncomplicate.neanderthal.math :as math]
         '[uncomplicate.neanderthal.core :refer [col view-vctr cols]])

The implementation in Clojure is straightforward. The function takes a Neanderthal vector as an input, and calculates a number according to the formula shown above.

(defn my-fn ^double [xs]
  (+ (math/sin (real/entry xs 0))
     (math/cos (real/entry xs 1))
     (math/tanh (real/entry xs 2))
     (math/sqr (real/entry xs 3))))

Here's the function in action. I give it a vector, an it returns the resulting number.

(my-fn (vctr native-float [0.3 0.1 0.9 0.33]))

nil2.1157222504237048

Now I need lots of data that comes from this function. I would like if this data represented some domain well, so I am going to generate a lot of random vectors. Since I already have a usable rand-uniform function that I wrote in the previous article, I'll use it in combination with fmap! to populate a matrix with 10000 examples. If you're wondering how I jumped from talking about vectors to populating a matrix, remember that the columns of a matrix are vectors. It is more efficient to work with a bunch of vectors stuffed in a matrix, than to keep them in a Clojure sequence.

(fmap! rand-uniform (ge native-float 4 10000))

nil#RealGEMatrix[float, mxn:4x10000, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       0.69    0.60    ⁙       0.76    0.25
   →       0.49    0.08    ⁙       0.05    0.32
   →       0.35    0.31    ⁙       0.29    0.83
   →       0.03    0.50    ⁙       0.26    0.49
   ┗                                               ┛

The columns of this matrix are the inputs that we are feeding to the function that generates the "fake" data.

(map my-fn (cols (fmap! rand-uniform (ge native-float 4 10))))

nil(2.1075943923498737 2.1200940934733445 2.28641391448637 2.3282734715382625 2.6057539083298047 2.692321322967708 1.5812371635412812 2.4879016728441905 2.4969644208941015 2.0235401608373755)

These are precise results calculated by the real, known, function my-fn with the data provided in the matrix that we had generated. The nice bonus of using a known function is that we can always generate more data, either for learning or for testing. Do not forget that in the real world, if the function is known, there is no need to learn its approximation from data.

Learning to approximate

In the previous example, I have created 10 observations for a function. There are data analysis methods that can learn something from such a small sample, but neural networks require more. I'll just assume that 10 thousand observations is enough in this case, but this is only an arbitrary chosen size. I hope this is going to be enough.

(def x-train (fmap! rand-uniform (ge native-float 4 10000)))

x-train

#RealGEMatrix[float, mxn:4x10000, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       0.70    0.76    ⁙       0.84    0.79
    →       0.06    0.49    ⁙       0.39    0.24
    →       0.44    0.72    ⁙       0.18    0.65
    →       0.02    0.28    ⁙       0.85    0.61
    ┗                                               ┛

(def y-train (ge native-float 1 10000 (map my-fn (cols x-train))))

y-train

#RealGEMatrix[float, mxn:1x10000, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.05    2.27    ⁙       2.57    2.63
    ┗                                               ┛

I have generated 10 thousand observations and I know the exact value of the function at these particular points, y-train. I now create a neural network consisting of five fully connected layers. The input layer is the data matrix. Since each input vector has four dimensions, the input layer will have four neurons. The hidden layers have 16, 64, and 8 neurons, respectively. The activation functions are sigmoid, tanh, and tanh, respectively. I have chosen these completely arbitrarily. Not only that this is not the optimal choice, but I don't even have a way to tell what the optimal architecture for any random function would be. That's one of the catches in neural networks, and many other AI methods. Since the network should approximate a real-valued function the output will be one-dimensional.

(def inference (init! (inference-network
                       native-float 4
                       [(fully-connected 16 sigmoid)
                        (fully-connected 64 tanh)
                        (fully-connected 8 tanh)
                        (fully-connected 1 sigmoid)])))

(def training (training-network inference x-train))

Let's check whether the network seems to be properly initialized before we run the learning algorithm.

(weights (first (layers inference)))

#RealGEMatrix[float, mxn:16x4, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ┓
    →      -0.04    0.14    0.27   -0.03
    →      -0.00    0.38    0.22   -0.25
    →       ⁙       ⁙       ⁙       ⁙
    →      -0.26   -0.31   -0.12    0.31
    →      -0.33   -0.25    0.23   -0.11
    ┗                                       ┛

(weights (second (layers inference)))

#RealGEMatrix[float, mxn:64x16, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →      -0.02    0.04    ⁙       0.13   -0.04
    →       0.02    0.05    ⁙      -0.10   -0.01
    →       ⁙       ⁙       ⁙       ⁙       ⁙
    →       0.02    0.04    ⁙      -0.11    0.09
    →       0.06    0.10    ⁙      -0.02   -0.01
    ┗                                               ┛

(weights (last (layers inference)))

#RealGEMatrix[float, mxn:1x8, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       0.28    0.11    ⁙       0.03    0.04
    ┗                                               ┛

These matrices contain relatively small numbers around zero. This looks right.

(sgd training y-train quadratic-cost! [[1 0.05] [1000 0.03] [100 0.01]])

(1.4530567312903702 0.6852846934110391 0.6851082117576618)

I run the algorithm for 1101 epochs with a few different learning rates. I've chosen these values completely arbitrarily, and I do not know whether they are a good, bad, or a mediocre choice. To see whether the track I'm on is any good, I have to rely on the value of the cost function. What I can see from the result, the starting value was 1.32, and it decreased to 0.68. This is not necessarily great, but it is a good sign.

The cost itself only shows me that the algorithm goes in the right direction, we need a more direct way to see how well it works. What's more familiar to a programmer than a unit test? Since I know the real function, it is easy to compare its results with the results from the neural networks. I won't make it formal yet. I'll just generate a few random observations that we will use for testing only. It is extremely important that the network does not see these observations during training.

(def x-test (fmap! rand-uniform (ge native-float 4 5)))

(def y-test (ge native-float 1 5 (map my-fn (cols x-test))))

Let's see what the network says.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       1.00    1.00    1.00    1.00    1.00
    ┗                                               ┛

Whoops. This doesn't look very useful. The network should have returned something close to y-test values:

y-test

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.28    2.15    3.38    2.52    2.19
    ┗                                               ┛

Regression requires linear output

The network returned a vector of ones because the output activation is the sigmoid function. Since the expected values are larger than 1, the output is saturated. Sigmoid is often spotted in the output layer of neural networks in various tutorials. That's because most tutorials start with classification examples, and often deal with classification of photos. There, the network usually has as many output neurons as there are categories that the output gets classified into, and it is expected that one neuron has a value close to one, while the others are closer to zero. Here, however, we are doing a different kind of task: regression.

In our case, there is only one neuron in the output, and it should directly return the value of the approximation. We do not want to mess up with the signal at the output, and do not need to do any activation there. Since we still need to fit that functionality into the existing architecture, we create a kind of do-nothing activation, similar to Clojure's identity function. The derivative of this linear function is a constant one.

(deftype LinearActivation []
  Activation
  (activ [_ z a!]
    (copy! z a!))
  (prime [this z!]
    (entry! z! 1)))

(defn linear
  ([]
   (fn [_] (->LinearActivation)))
  ([z!]
    z!))

We fix the network and repeat the process.

(def inference (init! (inference-network
                       native-float 4
                       [(fully-connected 16 sigmoid)
                        (fully-connected 64 tanh)
                        (fully-connected 8 tanh)
                        (fully-connected 1 linear)])))

(def training (training-network inference x-train))

Checking the inference on the untrained network, we, unsurprisingly, get useless answers.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       0.51    0.51    0.51    0.51    0.51
    ┗                                               ┛

One epoch later, we see that the cost is quite high.

(sgd training y-train quadratic-cost! 1 0.05)

1.3255932671876625

We repeat the inference, only to see that the network didn't learn much, but it has changed.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       0.80    0.80    0.80    0.80    0.80
    ┗                                               ┛

Another epoch, and the cost decreased.

(sgd training y-train quadratic-cost! 1 0.05)

0.9166161838265136

As expected, the inference is still bad.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       1.05    1.05    1.05    1.05    1.05
    ┗                                               ┛

Is something like 10 epochs enough to see some improvement?

(sgd training y-train quadratic-cost! 10 0.05)

0.11156441768060976

Hooray, now the loss decreased 10 times! How's the inference doing?

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.02    2.02    2.02    2.02    2.02
    ┗                                               ┛

It doesn't seem to be any better. Let's do a 100 epochs more.

(sgd training y-train quadratic-cost! 100 0.05)

0.10893812269722111

The loss doesn't seem to go much lower. The inference is still bad.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.07    2.07    2.07    2.07    2.07
    ┗                                               ┛

Maybe the learning rate is too big. Let's decrease it a bit.

(sgd training y-train quadratic-cost! 100 0.03)

0.10892282792763508

The loss seems to stay at the same level, and the inference hasn't improved.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.07    2.07    2.07    2.07    2.07
    ┗                                               ┛

I'll try with 1000 epochs, and yet lower learning rate.

(sgd training y-train quadratic-cost! 1000 0.01)

0.10887324749642284

It hasn't helped at all.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.07    2.07    2.07    2.07    2.07
    ┗                                               ┛

Maybe I need to vary the learning rate a bit. Let's try that.

(sgd training y-train quadratic-cost! [[100 0.03][100 0.01][100 0.005][100 0.001]])

(0.10885866925871306 0.10885377446494822 0.10885131820990937 0.10885081984270364)

We can see that, as the learning progresses, the cost stays roughly the same, which means that the network just strolls around, but can't progress much.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.07    2.07    2.07    2.07    2.07
    ┗                                               ┛

Before throwing the towel, let's remember that the task that we are doing here is not classification, when it is enough that the network learns to discriminate between a few, or several, discrete categories. Here we are doing regression, which is more difficult, since the network has to learn to approximate the actual real value of the function. Maybe I need to give it a more time. Let's see what it can do with 40000 epochs.

(time (sgd training y-train quadratic-cost! 40000 0.05))

"Elapsed time: 116679.184528 msecs"
0.002820095415000833

Now the cost is significantly lower. Let's hope that it can be directly seen when we test the inference.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.23    2.12    3.28    2.41    2.12
    ┗                                               ┛

Right! Much closer to the real values. We can never expect to get the exact floating point values that the real function is returning, especially not with the test observations that the network hasn't seen during the learning phase, but the difference is within an acceptable range.

(axpy! -1 y-test (inference x-test))

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →      -0.05   -0.03   -0.10   -0.11   -0.07
    ┗                                               ┛

If we wanted to improve the approximation, we should probably train the network for longer. However, do not assume that more training leads to better approximation. As the learning progresses, the network will generally decrease the cost, but after some time, some local optimum is reached, and the cost may oscillate, or even start to increase. There is no guarantee when or if the network will reach some optimal state.

Fortunately, we do not even want to decrease the cost too much. In practice, that might indicate overfitting. The network that is optimized for the training data too much, might work poorly on the data that it hasn't seen during the learning process, and this is exactly the data that we want it to work well with.

These are high-level things to worry about. For now, it is enough to see that our network works, and to get a feeling of how difficult the task of training is. We needed a huge number of epochs to get acceptable results, and may need even more to get something good. And we have a really tight implementation, without much resource waste. Imagine how long it would take with something that was less optimized.

GPU

On the CPU, this particular network took 2 minutes to learn something useful. Since GPU can be order(s) of magnitude faster, they should take few seconds. Right?

Nvidia GPU with CUDA

Let's try with our CUDA-based implementation.

(cuda/with-default
  (with-release [factory (cuda-float (current-context) default-stream)]
    (with-release [cu-x-train (ge factory 4 10000)
                   cu-y-train (ge factory 1 10000)
                   inference (init! (inference-network
                                     factory 4
                                     [(fully-connected 16 sigmoid)
                                      (fully-connected 64 tanh)
                                      (fully-connected 8 tanh)
                                      (fully-connected 1 linear)]))
                   training (training-network inference cu-x-train)]
      (transfer! x-train cu-x-train)
      (transfer! y-train cu-y-train)
      (time
         (sgd training cu-y-train quadratic-cost! 40000 0.05)))))

"Elapsed time: 23251.819897 msecs"
0.0028752992499551057

23 seconds! Faster than on the CPU, but as much as we hoped. It's worth remembering that the size of the task have to be demanding to see these orders of magnitudes in speedup. With relatively small matrices, it's good that the GPU engine wasn't even slower than the CPU!

AMD GPU with OpenCL OpenCL

You probably remember that our implementation for older AMD hardware and drivers had some performance issues.

(opencl/with-default
  (with-release [factory (opencl-float *context* *command-queue*)]
    (with-release [cl-x-train (ge factory 4 10000)
                   cl-y-train (ge factory 1 10000)
                   inference (init! (inference-network
                                     factory 4
                                     [(fully-connected 16 sigmoid)
                                      (fully-connected 64 tanh)
                                      (fully-connected 8 tanh)
                                      (fully-connected 1 linear)]))
                   training (training-network inference cl-x-train)]
      (transfer! x-train cl-x-train)
      (transfer! y-train cl-y-train)
      (finish!)
      (time
         (last (sgd training cl-y-train quadratic-cost! (repeat 40000 [1 0.05])))))))

"Elapsed time: 559357.909222 msecs"
0.002766931535136348

This is terrible. It is even slower than the CPU. The reason is that the engine is optimized for large matrices, and can not find a way to saturate hardware potentials it has with such small chunks of data that it's being fed.

Smaller is often better

We have seen that the brute force can help with efficiency, but eventually hits the wall. Since the architecture of the network was arbitrary, maybe we can get better results with a smaller network. As I am interested in experimenting with different sizes, I am only interested in how the cost behaves. I'll use the engine that seems to be the fastest with this task.

Let's try a 4-8-16-4-1 network. Since it is smaller, I hope that a smaller number of epochs would be enough.

(cuda/with-default
  (with-release [factory (cuda-float (current-context) default-stream)]
    (with-release [cu-x-train (ge factory 4 10000)
                   cu-y-train (ge factory 1 10000)
                   inference (init! (inference-network
                                     factory 4
                                     [(fully-connected 8 sigmoid)
                                      (fully-connected 16 tanh)
                                      (fully-connected 4 tanh)
                                      (fully-connected 1 linear)]))
                   training (training-network inference cu-x-train)]
      (transfer! x-train cu-x-train)
      (transfer! y-train cu-y-train)
      (time
         (sgd training cu-y-train quadratic-cost! 4000 0.05)))))

"Elapsed time: 2049.335159 msecs"
0.004670825056062358

4000 epochs took 2 seconds, and the error seems low enough to indicate that the network leaned something.

Does it learn more in 40000 epochs? Not that much, it seems.

"Elapsed time: 20982.76644 msecs"
0.003014854083318096

From this, I conclude that although the GPU didn't took less time to compute these smaller layers, the learning algorithm itself got better results since smaller space can be explored with fewer steps.

Let's try the same code with an even smaller network: 2 hidden layers with 8 and 4 neurons.

"Elapsed time: 1527.71754 msecs"
0.1007463554173708

What about changing the structure so the first hidden layer has 4 and the second layer has 8 neurons?

"Elapsed time: 1539.367704 msecs"
0.004476395398100112

Lucky me, this 4-8 network can learn with similar cost as the larger 8-16-4, or the much larger 16-64-8 networks.

Let's try this small network with OpenCL.

(opencl/with-default
  (with-release [factory (opencl-float *context* *command-queue*)]
    (with-release [cl-x-train (ge factory 4 10000)
                   cl-y-train (ge factory 1 10000)
                   inference (init! (inference-network
                                     factory 4
                                     [(fully-connected 4 sigmoid)
                                      (fully-connected 8 tanh)
                                      (fully-connected 1 linear)]))
                   training (training-network inference cl-x-train)]
      (transfer! x-train cl-x-train)
      (transfer! y-train cl-y-train)
      (finish!)
      (time
         (last (sgd training cl-y-train quadratic-cost! (repeat 4000 [1 0.05])))))))

"Elapsed time: 27948.357329 msecs"
0.09556360326748764

I expect the CPU engine to work particularly well with these small networks, since it doesn't rely on parallelization that much.

(def inference-881 (init! (inference-network
                           native-float 4
                           [(fully-connected 8 sigmoid)
                            (fully-connected 8 tanh)
                            (fully-connected 1 linear)])))
(def training-881 (training-network inference-881 x-train))

(time (sgd training-881 y-train quadratic-cost! 4000 0.05))

"Elapsed time: 2843.762284 msecs"
0.004605395143619535

Let's test the inference of this 4-8 network trained with 4000 epochs.

(inference-881 x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.29    2.17    3.08    2.40    2.16
    ┗                                               ┛

y-test

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       2.28    2.15    3.38    2.52    2.19
    ┗                                               ┛

Finally, let's compare the network's answers with the known outputs of the known function it approximates.

(axpy! -1 y-test (inference-881 x-test))

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
    ▥       ↓       ↓       ↓       ↓       ↓       ┓
    →       0.00    0.02   -0.30   -0.12   -0.03
    ┗                                               ┛

It seems acceptable, at least for these baby steps.

Donations

If you feel that you can afford to help, and wish to donate, I even created a special Starbucks for two tier at patreon.com/draganrocks. You can do something even cooler: adopt a pet function.

The next article

After quite a few articles we created an implementation that is complete enough to be able to be used for demo purposes. Depending on whether you expected some miracle, or you had already known how difficult are the problems that machine learning is applied to, you might be under-impressed or you are jumping from joy right now.

The bottom line is that our implementation works quite efficiently, but is not as effective as we would like it to be. Since I can be quite confident that the code we wrote is quite tight, it's time to see whether we can improve the algorithm itself.

As we identified that the algorithm is fragile in regards to actual weight values, it makes sense to implement a way to keep them in check, so they stay in a zone where the gradient descent can progress well. We have also seen that the algorithm can spend a lot of time chasing local optimums. It may be a good idea if we can improve it to jump over pebbles and boulders in search for the valleys, and stop only in front of mountain peaks.

We will implement a few obvious improvements to the stochastic gradient descent that could be said to work well universally, and then we will revisit this same example for comparison. After we address the low hanging fruit, we may even try to learn some useful things from some real data, instead of simulation.

Thank you

Clojurists Together financially supported writing this series. Big thanks to all Clojurians who contribute, and thank you for reading and discussing this series.