Billions of Random Numbers in a Blink of an Eye


June 12, 2019



I'm happy to announce that the new release of Neanderthal can generate random vectors and matrices on the CPU and GPU out of the box! To show off the performance, I'll demonstrate how much you can do in a matter of milliseconds, or whatever a blink of an eye takes.

To run this code on your own machine, you need a Clojure project with Neanderthal included as a dependency. If you're in a hurry, you can clone the Neanderthal Hello World project. With a bit of fiddling, you can also use this from your Java projects.
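
For reference, the dependency coordinate on Clojars is uncomplicate/neanderthal. Here is a minimal project.clj sketch; the version string is a placeholder, so use whatever the current release is:

(defproject hello-random "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.0"]
                 ;; placeholder version; use the latest Neanderthal release
                 [uncomplicate/neanderthal "x.y.z"]])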

A random Hello World

This blog article is, as usual, automatically generated from live code. Therefore, I have to show you the necessary imports. I hope that you are following along and evaluating the code in your own REPL, but if you are just reading, you can skip the following expression.

(require '[uncomplicate.commons.core :refer [with-release let-release]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.clojurecuda.core :refer [with-default synchronize!]]
         '[uncomplicate.clojurecl.core :refer [finish!] :as ocl]
         '[uncomplicate.neanderthal
           [native :refer [fv fge native-float]]
           [core :refer [submatrix native]]
           [math :refer [sqrt log sin pi sqr]]
           [cuda :refer [cuv cuge] :as cuda]
           [opencl :refer [clv clge] :as opencl]
           [random :refer [rand-normal! rand-uniform! rng-state]]]
         '[criterium.core :refer [quick-bench]])

Let's make things simple at first. I have the vector x in the main memory.

(def x (fv 5))

I would like this vector to hold some random numbers. Why I need this is not important at this moment; maybe I just need to fill it with numbers for testing or demonstration, or I am creating a simulation that requires random stuff. But, I need a random vector.

I can get it simply by calling rand-uniform!

(rand-uniform! x)
nil#RealBlockVector[float, n:5, offset: 0, stride:1]
[   0.88    0.62    0.22    0.72    0.16 ]

Does it work on more complex structures, such as a GE matrix? Of course!

(def a (fge 4 3))
(rand-normal! 33 2.5 a)
nil#RealGEMatrix[float, mxn:4x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →      33.90   36.38   34.29
   →      31.09   32.56   33.44
   →      36.96   37.39   32.83
   →      33.46   36.02   34.23
   ┗                               ┛

What is rand-normal? Well, sometimes you need your numbers to be uniformly distributed, but sometimes you need bell-shaped data. rand-uniform generates uniformly distributed random numbers, \(U(0,1)\) by default, while rand-normal generates data from the normal (Gaussian) distribution, \(N(0,1)\) by default.
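
Both also accept distribution parameters in front of the destination, as in the matrix example above. A minimal sketch (the 33 and 2.5 are the mean and standard deviation, exactly as in that example):

(rand-uniform! (fv 4))        ;; U(0,1) by default
(rand-normal! (fv 4))         ;; N(0,1) by default
(rand-normal! 33 2.5 (fv 4))  ;; normal distribution with mean 33 and standard deviation 2.5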

How long does it take to generate a billion random numbers?

I have a dilemma now. Should I first write about the motivation for having this feature, or should I show off the performance first? I choose to show off the performance, since I assume that you are reading this because you are already interested in working with random numbers, and you know how convoluted it can get. The motivation is covered in a later section.

How long does it take an eye to blink?

I promised to show you a lot of work done in a blink of an eye, but how long is that? According to Wikipedia, a blink takes an average of 100-150 milliseconds, or 100-400 milliseconds, depending on the source of the research.

I'll show you that it is not important which of these we choose as the reference, since we can go even faster and assume that a blink of an eye is whatever it takes for a frame in an old movie to switch, which was 1/24 of a second in the good old days of analog theaters. That is about 41.67 milliseconds.

On the CPU: a 5-year-old Intel i7-4790K

First, I'll create a 1000-dimension vector in the main memory and measure the speed of generating random uniformly distributed entries.

(def x1K (fv 1000))
(quick-bench (rand-uniform! x1K))
Evaluation count : 1420254 in 6 samples of 236709 calls.
             Execution time mean : 422.105972 ns
    Execution time std-deviation : 1.347492 ns
   Execution time lower quantile : 420.687916 ns ( 2.5%)
   Execution time upper quantile : 423.978301 ns (97.5%)
                   Overhead used : 1.369142 ns

It takes 422 nanoseconds per 1000 entries, which is 0.42 nanoseconds per entry.

Let's create a larger vector, one with a million entries, and try that.

(def x1M (fv 1000000))
(quick-bench (rand-uniform! x1M))
Evaluation count : 1890 in 6 samples of 315 calls.
             Execution time mean : 319.874178 µs
    Execution time std-deviation : 4.399502 µs
   Execution time lower quantile : 317.444041 µs ( 2.5%)
   Execution time upper quantile : 327.527853 µs (97.5%)
                   Overhead used : 1.369142 ns

Even faster per entry, 0.32 nanoseconds! This looks promising.

The next step is a billion entries. I'll create two vectors of 500 million entries each, which together make a billion. The reason I'm doing this instead of using one vector is that each entry takes 4 bytes, so a billion entries requires 4 GB of memory. Java (and Intel MKL) buffers are indexed with 32-bit integers, and the largest such integer is 2147483647.

So, generating 1 billion random numbers on the CPU takes…

(def x500M (fv 500000000))
(def y500M (fv 500000000))
(quick-bench
 (do (rand-uniform! x500M)
     (rand-uniform! y500M)))
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 347.586774 ms
    Execution time std-deviation : 5.861722 ms
   Execution time lower quantile : 341.168665 ms ( 2.5%)
   Execution time upper quantile : 356.432065 ms (97.5%)
                   Overhead used : 1.369142 ns

347 milliseconds. Even that is a billion random numbers in a blink of an eye, if we choose the slowest eyes according to Wikipedia. But this is Clojure and Neanderthal, so why not try the GPU? Of course!

But before that, let's also mention that we have only demonstrated uniformly distributed random numbers. We often need normally distributed numbers, which are a bit more difficult to generate, and rand-normal is expected to be slower. Let's see…

(quick-bench
 (do (rand-normal! x500M)
     (rand-normal! y500M)))
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 1.870732 sec
    Execution time std-deviation : 4.685253 ms
   Execution time lower quantile : 1.865571 sec ( 2.5%)
   Execution time upper quantile : 1.876130 sec (97.5%)
                   Overhead used : 1.369142 ns

Almost 2 seconds. Although this is extremely fast (see the motivation section below), it is slower than a blink of an eye!

On the GPU: Nvidia GTX 1080Ti

I hope that I don't need to convince you that using GPU computing with Neanderthal is almost effortless (read about this in many dedicated articles on this blog).

Other than creating a GPU context, everything else is polymorphic and transparent, and uses the same functions as the CPU examples, so we're going to the benchmarks right away.

(with-default
  (cuda/with-default-engine
    (with-release [x500M (cuv 500000000)
                   y500M (cuv 500000000)]
      (time (do (rand-uniform! x500M)
                (rand-uniform! y500M)
                (synchronize!))))))
"Elapsed time: 39.319381 msecs"

That's right! The most demanding goal has been met right away! Neanderthal generated 1 billion random numbers in 39 milliseconds, which is even faster than a motion picture frame switch. Certainly faster than a blink of an eye.

Fantastic! But what about normally distributed numbers? They are more demanding…

(with-default
  (cuda/with-default-engine
    (with-release [x500M (cuv 500000000)
                   y500M (cuv 500000000)]
      (time (do (rand-normal! x500M)
                (rand-normal! y500M)
                (synchronize!))))))
"Elapsed time: 34.108762 msecs"

Even better! 34 milliseconds for 1 billion random numbers from the normal distribution. That's something!

On the GPU: AMD Vega 64

Now, I like supporting open source, and CUDA is a proprietary platform, etc. etc. Of course, Neanderthal's code is open-source, but the CUDA platform itself is owned and controlled by Nvidia.

Here's my AMD GPU, with the open-source ROCm drivers that AMD provides.

(ocl/with-default
  (opencl/with-default-engine
    (with-release [x500M (clv 500000000)
                   y500M (clv 500000000)]
      (time (do (rand-uniform! x500M)
                (rand-uniform! y500M)
                (finish!))))))
"Elapsed time: 37.707126 msecs"

Supercool! 37 milliseconds on a completely open GPU platform!

(ocl/with-default
  (opencl/with-default-engine
    (with-release [x500M (clv 500000000)
                   y500M (clv 500000000)]
      (time (do (rand-normal! x500M)
                (rand-normal! y500M)
                (finish!))))))
"Elapsed time: 37.836357 msecs"

Of course, by now you expect that rand-normal has the same performance…

I declare victory!

Donations

If you feel that you can afford to help fund the development of these nice open-source libraries, please check patreon.com/draganrocks and pledge a monthly donation of your choice. You can do something even cooler: adopt a pet function.

Motivation: Why do we need this at all?

Almost every major platform and programming language has a random function. Isn't that enough?

There are three issues with the default random function:

  • (1) the random numbers it generates are of poor quality;
  • (2) it is slow (if you need lots of them);
  • (3) it only generates uniformly distributed numbers (of poor quality).

While you can solve the third problem by writing a function that transforms what rand gives you into normally distributed numbers, as I described in the article Deep Learning from Scratch to GPU - 13 - Initializing Weights, the first two issues are much trickier!

Built-in RNG

Here's how we use the built-in rand function.

(rand)
nil0.13648137992696296

Nice and simple! That part is good.

And, if we need a sequence of these, it's still nice.

(map rand (range 5))
nil(0.0 0.1955719758267589 1.0403752058141191 2.0357222475376053 1.7017948352691565)

There's a bug here. Note how the numbers progressively increase. This is because the first argument of rand controls the upper bound. If we want the whole sequence to have the same fixed bound, or the default 1, we need to create a lambda function in place.

(map (fn [_] (rand)) (range 5))
nil(0.027135352571733606 0.5705001188754536 0.3043893027311386 0.9307741452527688 0.42642378633952305)

What if we need to fill a more complex structure with random numbers? We need to program it ourselves, or hope that the library at hand has polymorphic support for mapping.

Fortunately, Neanderthal does support fmap!, one of Fluokitten's polymorphic alternatives to Clojure's map.

(fmap! (fn [_] (rand)) (fge 2 3))
nil#RealGEMatrix[float, mxn:2x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       0.48    0.59    0.33
   →       0.13    0.49    0.32
   ┗                               ┛

Unfortunately, if we need these numbers to be normally distributed, we have to create a specialized function. I use the result discussed in the earlier article, Deep Learning from Scratch to GPU - 13 - Initializing Weights.

;; One half of the Box-Muller transform: two uniform samples in, one normally distributed sample out.
(defn rand-gaussian ^double [^double _]
  (double (* (sqrt (* -2.0 (log (double (rand)))))
             (sin (* 2.0 pi (double (rand)))))))
(fmap! rand-gaussian (fge 2 3))
nil#RealGEMatrix[float, mxn:2x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       1.33   -0.66    1.04
   →       1.23   -0.97   -0.36
   ┗                               ┛

While this is good enough for lots of purposes, one important technical obstacle is that rand works only on the CPU! If we need to initialize random vectors on the GPU, rand is of no help.

As I demonstrated in Deep Learning from Scratch to GPU - 13 - Initializing Weights, generating random numbers on the CPU and transferring them to the GPU takes much more time than computing the main algorithm that processes these random numbers on the GPU. That is ugly and not good enough.
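
For comparison, here is a sketch of that slow path with the tools we already have: fill a vector in main memory using the built-in rand, then copy it to the GPU. Note that transfer! lives in uncomplicate.neanderthal.core and is not among the imports above, and the vector is kept small just to show the shape of the code.

(with-default
  (cuda/with-default-engine
    (with-release [cpu-x (fv 1000)
                   gpu-x (cuv 1000)]
      ;; generate on the CPU with the built-in rand...
      (fmap! (fn [_] (rand)) cpu-x)
      ;; ...then pay for the copy from main memory to the GPU
      (transfer! cpu-x gpu-x)
      (synchronize!))))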

Built-in RNG is of poor quality

This is actually an issue that I can't demonstrate easily. If you are not already aware of it, it probably didn't matter much in your applications: you didn't need that many random numbers, so you couldn't discover that the numbers the built-in random functions generate are not that well distributed.

Now, of course, the "random" numbers that the computer generates are not really random. They are "pseudorandom", meaning that there is a deterministic algorithm that generates them. This is why when you provide the same seed, you get the same sequence of "random" numbers.
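
You can see this with nothing more than the JDK's java.util.Random (just an illustration, unrelated to Neanderthal's own generators): two generators created with the same seed produce identical sequences.

(let [r1 (java.util.Random. 42)
      r2 (java.util.Random. 42)]
  ;; both seeded with 42, so the two sequences are identical
  [(vec (repeatedly 3 #(.nextDouble r1)))
   (vec (repeatedly 3 #(.nextDouble r2)))])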

Producing real random numbers on a computer is difficult, and often requires specialized hardware, or some other source of randomness from the real world. This is very important for cryptography, and is very difficult and slow. I am not talking about that level of randomness here.

Fortunately, for almost everything outside cryptography, various pseudorandom numbers are good enough. However, while the built-in RNG is good enough if you need to create a list of 42 random numbers for testing purposes, it is not good enough if you need to explore a distribution with MCMC.

I won't bother you again other than stating that Neanderthal uses Philox and/or ARS5 RNGs, which are much, much better than the stuff you get from the built-in rand.

Built-in RNG is too slow

This is easy to demonstrate.

(quick-bench (rand))
Evaluation count : 35329860 in 6 samples of 5888310 calls.
             Execution time mean : 15.789613 ns
    Execution time std-deviation : 0.114434 ns
   Execution time lower quantile : 15.674532 ns ( 2.5%)
   Execution time upper quantile : 15.964610 ns (97.5%)
                   Overhead used : 1.364400 ns

While 15 ns does not seem too slow, compare it to the 0.32 ns that we got on the CPU, and the 0.034 ns that we got on the GPU.

What about the normal distribution?

(quick-bench (rand-gaussian Double/NaN))
Evaluation count : 8797110 in 6 samples of 1466185 calls.
             Execution time mean : 67.180339 ns
    Execution time std-deviation : 0.511289 ns
   Execution time lower quantile : 66.824600 ns ( 2.5%)
   Execution time upper quantile : 67.927071 ns (97.5%)
                   Overhead used : 1.364400 ns

A billion numbers should then take 15 or 67 seconds (for uniform and normal, respectively), but we should also consider that these numbers have to be written to memory, so let's benchmark it.

(time
 (do (fmap! rand-gaussian x500M)
     (fmap! rand-gaussian y500M)))
"Elapsed time: 69576.897458 msecs"

Now, I admit that rand is fast enough for most stuff, and that the main reasons why rand is not good enough are that the numbers it generates are poor, and that it does not work on the GPU.

The fact that Neanderthal can do this RNG 50 times faster on the CPU, and 200 times faster on the GPU, is just icing on the cake.

Gratis features

In addition to performance, Neanderthal functions transparently adapt to the structure at hand. Not only does it support matrices, it also supports tricky substructures.

For example, note that calling rand-uniform! on a submatrix of the matrix a from the rand-normal example above will not touch the data of the main matrix outside that submatrix.

(def sub-a (submatrix a 0 1 4 2))
(rand-uniform! 0 2 sub-a)
nil#RealGEMatrix[float, mxn:4x2, layout:column, offset:4]
   ▥       ↓       ↓       ┓
   →       1.78    0.09
   →       0.32    0.32
   →       1.86    1.94
   →       1.43    0.95
   ┗                       ┛
a
nil#RealGEMatrix[float, mxn:4x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →      33.90    1.78    0.09
   →      31.09    0.32    0.32
   →      36.96    1.86    1.94
   →      33.46    1.43    0.95
   ┗                               ┛

Of course, this also works on the GPU.

(with-default
  (cuda/with-default-engine
    (with-release [a (cuge 5 4)
                   sub-a (submatrix a 1 1 3 2)]
      (rand-normal! 55 3 sub-a)
      (native a))))
nil#RealGEMatrix[float, mxn:5x4, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ┓
   →       0.00    0.00    0.00    0.00
   →       0.00   57.59   63.07    0.00
   →       0.00   53.33   55.46    0.00
   →       0.00   52.48   58.21    0.00
   →       0.00    0.00    0.00    0.00
   ┗                                       ┛

If you need the code to be reproducible, and to get the same random numbers each time you run the tests, you can provide a specific seed.

(rand-uniform! (rng-state native-float 11) 2 3 (fv 3))
nil#RealBlockVector[float, n:3, offset: 0, stride:1]
[   2.73    2.40    2.89 ]
(rand-uniform! (rng-state native-float 11) 2 3 (fv 3))
nil#RealBlockVector[float, n:3, offset: 0, stride:1]
[   2.73    2.40    2.89 ]

The Book, The Books

If you find this useful, take a look at the series of books on programming applied to artificial intelligence.

The book series is titled "Interactive Programming for Artificial Intelligence".

The first book, "Deep Learning for Programmers: An Interactive Tutorial with CUDA, OpenCL, MKL-DNN, Java, and Clojure", is already being written, and the drafts are available as a beautifully typeset PDF. Subscribe now and read the drafts at https://aiprobook.com/deep-learning-for-programmers

More books about high performance computing, CUDA, OpenCL, and Linear Algebra, are planned!

Billions of Random Numbers in a Blink of an Eye - June 12, 2019 - Dragan Djuric