Recurrent Networks Hello World in Clojure with new Deep Diamond RNN support on CPU and GPU


August 26, 2022



I've been busy lately working on major new features in Deep Diamond, one of which is support for Recurrent Neural Networks (RNN). It hasn't been an easy ride, but I can finally show you some results! Big thanks to everyone who's helping me with this by buying my books (or subscribing to the upcoming editions), and to Clojurists Together, who generously funded my work on this in the past year.

I know that most of you probably have no more than a passing familiarity with deep learning, let alone recurrent neural networks, so I'll show a very simple, Hello World level example that anyone interested in machine learning and programming can understand and try.

So, enough talk, let's get down to business.

What are we doing

We are demonstrating a hammer: recurrent neural networks. Just kidding; we would like to create a (software) device that can learn to predict the next data point in a series. Depending on the data, this can be done in a number of ways (even by convolutional neural networks (CNN) that Deep Diamond already supported), one of which is RNN. So, we are creating a recurrent network, and training it with a set of data for this task.

An example of data that fits this task would be the temperature at some place, stock prices, or any other (possibly infinite) sequence of numbers in one or more dimensions that has an ordinal relation, that is, an abstract notion of time attached to it. We are then trying to forecast one or more values in the future (the temperature the next day, the closing price of a stock, etc.).
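To put it in symbols (nothing Deep Diamond specific, just the textbook formulation of one-step-ahead forecasting): given the last \(k\) observed values, we want to learn a function \(f\) such that \(\hat{x}_{t+1} = f(x_{t-k+1}, \ldots, x_{t})\). In this post, \(k = 5\).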

Since a Hello World has to be dead simple, and a real example would involve too many opaque numbers, we'll solve the artificially trivial task of teaching our network to predict the next number in obvious series such as 1, 2, 3, 4, 5. Of course, in real life we rarely need a bazooka like an RNN for that exact task, but if this is your first contact with time series prediction with deep learning, I guess it's what works best.

Let's get down to code

First, we'll require a bunch of Clojure namespaces that we need.

(require '[uncomplicate.commons.core :refer [with-release let-release release]]
         '[uncomplicate.neanderthal
           [core :refer [ge dim amax submatrix subvector mrows trans transfer transfer! view-vctr
                         native view-ge cols mv! rk! raw col row nrm2 scal! ncols rows axpby!]]
           [native :refer [fge native-float fv iv]]]
         '[uncomplicate.diamond
           [tensor :refer [*diamond-factory* tensor offset! connector transformer
                           desc revert shape input output view-tz batcher]]
           [dnn :refer [rnn abbreviate infer sum activation inner-product dense
                        network init! train cost train-shuffle ending]]
           [native :refer []]]) ;; loading this namespace registers the CPU backend

All the AI in the world would be useless to us if we hadn't measured some data from the real world to feed it. Suppose (for the sake of Hello World) that I "measured" a narrow range of whole numbers by generating them out of thin air. Does this data tell us anything about the stock market? Of course not, and please don't expect any model that we train on this data to magically know anything that could not be learned from the data it has seen.

(def simple-sequence (range -100 100))
(-100 -99 -98 -97 -96 -95 -94 -93 -92 -91 -90 -89 -88 -87 -86 -85 ...)

Now, I'll create a blueprint for an arbitrary recurrent neural network (RNN). This network has 3 recurrent layers (Gated Recurrent Unit cells), an abbreviation to one timestep, and two dense layers at the end. Please note that I used a completely arbitrary architecture. Neither this layer structure nor the number of hidden neurons is optimal for this data; we are not even sure it's any good. If anything, it's probably huge overkill. I chose it simply to show you how easy it is to construct with Deep Diamond, which will practically do everything on its own if you specify the bare minimum, that is, "what you want". And we hope it'll at least learn to work well in the end, as non-optimal as it is.

(def net-bp (network (desc [5 32 1] :float :tnc)
                     [(rnn [128] :gru)
                      (rnn 2)
                      (abbreviate)
                      (dense [128] :relu)
                      (dense [1] :linear)]))

There is no room here to explain what an RNN is and how it works internally, other than to point out that recurrent layers can handle sequential relations in their input by "memorizing" the signals that pass through them. The upcoming version of my book Deep Learning for Programmers (2.0) discusses RNNs in more detail.

Formatting the input data

The input of this network differs from fully connected or convolutional networks by explicitly modeling the time dimension, the "t" in the :tnc format. Technically, you can feed it any 3D tensor that matches its [5 32 1] shape, but for that data to make sense in context, it has to actually be arranged as 5 timesteps of a minibatch of 32 samples of 1-dimensional data.

We do have 1-dimensional data, but how do we fit our (range -100 100) sequence to this input? We have more than 5 timesteps (we have 200), and we are far from 32 samples, since we only have one! We could try to just cram the sequence in as-is by doing (transfer! simple-sequence (input net)), but this would be the "garbage in" part of "garbage in, garbage out". No. The solution is, as always in machine learning, to actually think about what our data represents, and to match it with our knowledge of how the model intends to process its input.

What we need is a bunch of 5-long sequences, such as [1 2 3 4 5], and the output that we would deem correct. In this case, the goal is to teach the network to output 6 for this input (or a number sufficiently close to it). So, the training data should be input sequences such as [3 4 5 6 7] and [-12 -11 -10 -9 -8], and target outputs such as [8] and [-7]. I hope you see how a bunch of these sequences, 195 of them, and their respective target outputs can be extracted from simple-sequence.
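Before we pack anything into tensors, here is the windowing idea in plain Clojure, using the stock partition function. This is just an illustration; it is not the layout the network consumes.

;; 195 overlapping windows; the first 5 numbers of each window are an
;; input sequence, and the 6th is its target output.
(def windows (partition 6 1 simple-sequence))
(count windows)
195
(first windows)
(-100 -99 -98 -97 -96 -95)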

The following function employs some stock Neanderthal matrix functions to process the data and pack it into a pair of input and target output tensors (which we'll later bind as train-data). I don't have space to explain each step here, and it is not trivial, but this is fairly standard vector/matrix/tensor stuff, well explained in both books from my Interactive Programming for Artificial Intelligence book series.

(defn split-series [fact s ^long t]
  (let [n (- (ncols s) t)                 ;; number of t-long samples we can cut out
        c (mrows s)]                      ;; number of channels (here: 1)
    (let-release [x-tz (tensor fact [t n c] :float :tnc)  ;; inputs: t timesteps, n samples, c channels
                  y-tz (tensor fact [n c] :float :nc)     ;; targets: one value per sample
                  x-ge (trans (view-ge (view-vctr x-tz) (* n c) t))  ;; row j <-> timestep j
                  s-vctr (view-vctr s)]
      ;; targets are the values that immediately follow each t-long window
      (transfer! (submatrix s 0 t c n) (view-ge (view-vctr y-tz) c n))
      ;; timestep j of sample i is the (i+j)-th value of the series
      (dotimes [j t]
        (transfer! (subvector s-vctr (* j c) (* c n)) (row x-ge j)))
      [x-tz y-tz])))

Here's what the output looks like on an even simpler example: 2-step sample sequences produced from a 5-element-long full sequence.

(def dummy (fge 1 5 (range 5)))
#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       0.00    1.00    2.00    3.00    4.00
   ┗                                               ┛
(def dummy-split (split-series *diamond-factory* dummy 2))
[{:shape [2 3 1], :data-type :float, :layout [3 1 1]} (0.0 1.0 2.0 1.0 2.0 3.0)
 {:shape [3 1], :data-type :float, :layout [1 1]} (2.0 3.0 4.0)]

This split produces 3 samples for training; each sample has 2 entries, and for each sample there is a desired output. The tensor printout does not show dimensions, which would be hard to make sense of anyway due to the large dimensionality and enormous number of entries in any tensor of practical use. We can extract a matrix view in cases where it makes sense.

(view-ge (view-vctr (dummy-split 0)) 3 2)
#RealGEMatrix[float, mxn:3x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       0.00    1.00
   →       1.00    2.00
   →       2.00    3.00
   ┗                       ┛

So, the inputs are arranged in rows: [0 1], [1 2], and [2 3]. That's because the tensor's default layout is :tnc, meaning that the innermost grouping is channels (\(C=1\)), then (mini)batch size (\(N=3\)), and then time (\(T=2\)).
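If the layout still feels abstract, here is a tiny hypothetical helper (not part of Deep Diamond, just arithmetic) that spells out where entry (t, n, c) of a \(T\times N\times C\) :tnc tensor lands in the flat data.

;; Entry (t, n, c) of a T x N x C :tnc tensor sits at flat index t*N*C + n*C + c.
(defn tnc-index [N C t n c]
  (+ (* t N C) (* n C) c))

;; With T=2, N=3, C=1: timestep 1 of sample 2 is the last (6th) entry, 3.0.
(tnc-index 3 1 1 2 0)
5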

Ok, so, finally, we transform our own data so that the network can learn from it.

(def full-series (fge 1 200 simple-sequence))
#RealGEMatrix[float, mxn:1x200, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →    -100.00  -99.00    ⁙      98.00   99.00
   ┗                                               ┛
(def train-data (split-series *diamond-factory* full-series 5))
[{:shape [5 195 1], :data-type :float, :layout [195 1 1]} (-100.0 -99.0 -98.0 -97.0 -96.0 -95.0 -94.0 -93.0 -92.0 -91.0 -90.0 -89.0 -88.0 -87.0 -86.0 -85.0)
 {:shape [195 1], :data-type :float, :layout [1 1]} (-95.0 -94.0 -93.0 -92.0 -91.0 -90.0 -89.0 -88.0 -87.0 -86.0 -85.0 -84.0 -83.0 -82.0 -81.0 -80.0)]

The printouts of long tensors show only a subset of the content of the tensor.
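If you do want to see everything, you can always pull out a matrix view, just as we did with dummy-split. For example, this sketch (reusing functions we already required) exposes all 195 target values, from -95.0 to 99.0, as a 1x195 matrix:

(view-ge (view-vctr (train-data 1)) 1 195)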

The actual network

The blueprint that we created at the beginning of the article can be split up for later reuse. Please note that it's not some super-opaque magical compiler: the network architecture is defined as a simple Clojure vector of plain Clojure functions!

(def architecture [(rnn [128] :gru)
                   (rnn 2)
                   (abbreviate)
                   (dense [128] :relu)
                   (dense [1] :linear)])
[#function[uncomplicate.diamond.dnn/rnn/fn--43604]
 #function[uncomplicate.diamond.dnn/rnn/fn--43604]
 #function[uncomplicate.diamond.dnn/abbreviate/fn--43609]
 #function[uncomplicate.diamond.dnn/fully-connected/fn--43493]
 #function[uncomplicate.diamond.dnn/fully-connected/fn--43493]]

This architecture is independent from the specific input dimensions. We create a blueprint by specifying the dimensions and the architecture.

(def net-bp (network (desc [5 32 1] :float :tnc)
                     architecture))

This blueprint is independent of the learning algorithm. It is a function that creates the actual network; in this case, I will use gradient descent with adaptive moments (:adam). Before we start learning, we initialize the network with random weights, automatically chosen by Deep Diamond according to good practices, but you are free to use another initialization function that is more to your liking. Everything in Deep Diamond is modular and implemented in the Clojure fashion of "assemble your own if you wish". If you are content with my choices, then everything is automatic!

(def net (init! (net-bp :adam)))
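If you'd rather train with plain stochastic gradient descent, the same blueprint accepts :sgd instead of :adam (both options are mentioned again in the inference section below). A minimal sketch; we'll stick with the :adam network net for the rest of the post.

(def net-sgd (init! (net-bp :sgd)))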

Here's the network printout, in case you're curious.

=======================================================================
#SequentialNetwork[train, input:[5 32 1], layers:5, workspace:1]
-----------------------------------------------------------------------
#Adam[topology::gru, shape:[5 32 128], activation: :identity]
 parameters: [{:shape [1 1 129 3 128], :data-type :float, :layout [49536 49536 384 128 1]} (2.1676771211787127E-5 -5.19975537827122E-6 5.7575953178456984E-6 2.380384103162214E-5 -2.894530371122528E-5 -2.4803119913485716E-7 -1.6729745766497217E-5 -2.902976348195807E-6 4.530028309090994E-5 5.464621608552989E-6 -5.7127394939016085E-6 1.593221713847015E-5 -1.1793279554694891E-5 -7.507987476174094E-8 -1.1438844921940472E-5 5.965541276964359E-6) {:shape [1 1 3 128], :data-type :float, :layout [384 384 128 1]} (0.0035437364131212234 -0.001584756188094616 0.0013698132243007421 -1.2431709910742939E-4 1.7048766312655061E-4 0.002398068318143487 0.00372034078463912 -0.0021170536056160927 8.799560600891709E-4 -0.0023348634131252766 8.223060867749155E-4 -1.1841517698485404E-4 -0.004045043606311083 9.459585417062044E-4 -0.0037800846621394157 -0.002804018324241042)]
-----------------------------------------------------------------------
#Adam[topology::gru, shape:[5 32 128], activation: :identity]
 parameters: [{:shape [2 1 256 3 128], :data-type :float, :layout [98304 98304 384 128 1]} (2.276021717761978E-7 3.804875632340554E-6 1.058972247847123E-5 2.8063212198503606E-7 1.8309128790860996E-5 -9.416969078301918E-6 -7.987003414200444E-8 1.0779360309243202E-5 2.222931470896583E-6 -1.2704362234217115E-5 1.5140467439778149E-5 -2.0407780539244413E-5 -1.5779027080498054E-6 -1.0661310625437181E-5 -3.834541985270334E-6 -1.0737002412497532E-5) {:shape [2 1 3 128], :data-type :float, :layout [384 384 128 1]} (8.910018368624151E-4 0.003690000157803297 -4.8378523206338286E-4 -0.0034620240330696106 0.0031146425753831863 6.610305863432586E-4 0.00204548635520041 0.001728582545183599 -0.0027434946969151497 0.007643579971045256 0.00425624568015337 -0.00295245717279613 1.8937387721962295E-5 -0.0027048818301409483 -0.0012806318700313568 -0.0028582836966961622)]
-----------------------------------------------------------------------
{:shape [32 128], :topology :abbreviate}
#Adam[topology::dense, shape:[32 128], activation: :relu]
 parameters: [{:shape [128 128], :data-type :float, :layout [8192 1024]} (-0.004568892996758223 -0.004355841316282749 0.005139997228980064 -0.005750759970396757 -0.004274052567780018 0.005599929019808769 -0.003351157531142235 -0.008299742825329304 -0.0031023912597447634 0.0013810413656756282 0.002118719043210149 0.0023180078715085983 -0.005323362536728382 -0.013326002284884453 7.393552223220468E-4 -0.013735005632042885) {:shape [128], :data-type :float, :layout [1]} (0.5049463510513306 -0.47153180837631226 -1.3509427309036255 -1.3017100095748901 -0.39814749360084534 0.8303372263908386 0.6530964970588684 -0.22249580919742584 -1.326366901397705 0.16360507905483246 -0.022157425060868263 2.0535836219787598 1.8076190948486328 -0.0799550786614418 -1.6791125535964966 -0.7451670169830322)]
-----------------------------------------------------------------------
#Adam[topology::dense, shape:[32 1], activation: :linear]
 parameters: [{:shape [1 128], :data-type :float, :layout [1 1]} (0.003621552372351289 -0.004416522569954395 2.548426273278892E-4 0.006995228119194508 0.0013199367094784975 -0.0018220240017399192 0.009454095736145973 0.003091101534664631 -0.01203352864831686 -0.014204473234713078 -0.007159397471696138 0.0039085038006305695 0.0029486482962965965 -0.009481357410550117 0.009158425033092499 -0.004999339580535889) {:shape [1], :data-type :float, :layout [1]} (-0.09112684428691864)]
=======================================================================

Training, finally!

Hey, I promised you a Hello World, and I've been beating around the bush for half an hour formatting the data. And we haven't even touched the biggest obstacle: actually training the network. Is it hard? Yes, but not for the user. Now it's time for Deep Diamond to beat the hell out of your CPU or GPU. But at least it will do this on its own :)

Since this is a Hello World, we'll start with 50 epochs and see how it's going.

(time (train-shuffle net (train-data 0) (train-data 1) :quadratic 50 [0.005]))
"Elapsed time: 769.986155 msecs"
3238.298712769756

Hmmmm. The cost of 3000 and change does not look very good. Would more training help?

(time (train-shuffle net (train-data 0) (train-data 1) :quadratic 50 [0.005]))
"Elapsed time: 744.456035 msecs"
708.0862449675478

Still bad, but it's improving! For the sake of experimenting, I've run this ten(ish) times, and the cost has been steadily decreasing, to the point that it looks good now.

(time (train-shuffle net (train-data 0) (train-data 1) :quadratic 50 [0.005]))
"Elapsed time: 720.427332 msecs"
0.13662090173881225
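By the way, there's no need to re-evaluate the same form by hand: train-shuffle returns the cost, so a small loop can do the repetition. Just a convenience sketch:

;; Run a few more 50-epoch rounds and print the cost after each one.
(doseq [i (range 5)]
  (println "round" i "cost:"
           (train-shuffle net (train-data 0) (train-data 1) :quadratic 50 [0.005])))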

We don't have to stick to 50-epoch runs. Let's do 500 at once.

(time (train-shuffle net (train-data 0) (train-data 1) :quadratic 500 [0.005]))
"Elapsed time: 6325.427605 msecs"
0.008672867995529465

This looks nice. A thousand epochs might seem like a lot, but considering the large network size, the recurrent layers, and the scarcity of the training data, it might actually be appropriate. On the other hand, Deep Diamond did it blazingly fast, in a mere dozen seconds! In the world of machine learning, this is nothing.

Using the network for inference

Now that we have our super useful network, we can stop and think: how do we actually use it? First, we need data that can be interpreted as a "question". Our network takes a sequence of five numbers and answers with one number. Obviously, these should be provided as tensors of appropriate dimensions.

(def question (tensor [5 1 1] :float :tnc))
(transfer! [1 2 3 4 5] question)

However, invoking inference causes a complaint from the network.

(infer net question)
Execution error (ExceptionInfo) at uncomplicate.commons.utils/dragan-says-ex (utils.clj:105).
Dragan says: Requested subtensor is outside of bounds.

Our network's input requires a minibatch of 32 samples. The infer function can handle more than that, by doing the inference in minibatches of 32, but it can't handle fewer samples.

One of the solutions is to create another blueprint with the same structure, and transfer learned weights to the new network.

(def net1-bp (network (desc [5 1 1] :float :tnc)
                      architecture))

While we're at it, we don't need to create a complex network capable of learning (:adam or :sgd). We can create a much cheaper inference network that takes fewer resources.

(def net1-infer (net1-bp))
(transfer! net net1-infer)

Now, finally, give us the answer, network!

(infer net1-infer question)
{:shape [1 1], :data-type :float, :layout [1 1]} (6.016137599945068)

So, when asked what the next element in the sequence [1.0 2.0 3.0 4.0 5.0] is (remember, we specified data type :float), our network answers [6.0161], which is close enough to the actual target value that we can mark it as correct. But it's not a great achievement, since our network already saw this sequence in training. A hash map would have done a much better job of guessing this. Let's try a previously unseen sequence.

(transfer! [10 12 14 16 18] question)
(infer net1-infer question)
{:shape [1 1], :data-type :float, :layout [1 1]} (17.750324249267578)

Not that far off, but one would expect 19.0. What went wrong? We trained our network on ducks \((x+1)\) and asked it about geese \((x+2)\). What about griffons?

(transfer! [10 1 100 16 34] question)
(infer net1-infer question)
{:shape [1 1], :data-type :float, :layout [1 1]} (-6.679039001464844)

Now the answer does not make any sense, but would you be able to come up with a better answer?

All right, let's try a sequence that is similar to the ones we used in training, but outside the range of data that the network has seen.

(transfer! [1000 1001 1002 1003 1004] question)
(infer net1-infer question)
{:shape [1 1], :data-type :float, :layout [1 1]} (96.41997528076172)

Nope, not much success, but we shouldn't have expected any. The network cannot answer questions outside its domain of expertise.

Let's try previously unseen data, but inside the domain that we used for training (floats from -100.0 to 100.0).

(transfer! [37.4 38.4 39.4 40.4 41.4] question)
(infer net1-infer question)
{:shape [1 1], :data-type :float, :layout [1 1]} (42.39200210571289)

This is actually pretty close!

What about GPU?

Sure!
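One extra require is needed for the CUDA backend. The namespace path below is my assumption of where cudnn-factory lives; please check the Deep Diamond documentation for your version.

;; Assumed namespace for the CUDA backend factory; verify for your version.
(require '[uncomplicate.diamond.internal.cudnn.factory :refer [cudnn-factory]])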

(def nvidia (cudnn-factory))

(def gpu-net-bp (network nvidia
                         (desc [5 32 1] :float :tnc)
                         architecture))

(def gpu-net (init! (gpu-net-bp :adam)))

(def gpu-train-data (split-series nvidia full-series 5))

Let's hit it with 1000 epochs right away.

(time (train-shuffle gpu-net (gpu-train-data 0) (gpu-train-data 1) :quadratic 1000 [0.005]))
"Elapsed time: 5734.688665 msecs"
4.561427662266859E-4
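And, assuming transfer! can move the learned parameters from the GPU network into a CPU network (the same mechanism we used for net1-infer above), we can reuse the GPU-trained weights for cheap CPU inference. A sketch:

;; Copy the GPU-trained parameters into a CPU inference network
;; and ask it the same kind of question as before.
(def gpu-trained-infer (net1-bp))
(transfer! gpu-net gpu-trained-infer)
(transfer! [37.4 38.4 39.4 40.4 41.4] question)
(infer gpu-trained-infer question)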