CuPy accelerates NumPy on the GPU? Hold my Cider, here's Clojure!

Need help with your custom Clojure software? I'm open to (selected) contract work.

April 9, 2020

Please share: Twitter.

These books fund my work! Please check them out.

Everyone knows that Python's NumPy is fast due to its C++ based native backends. Power users even know that Nvidia provides a drop-in replacement, CuPy, that brings the power of GPU acceleration. Everyone knew that there was no airplane faster than Messerschmitt Me 163, but that was good information in 1944. Everyone knows that Java is slow and Clojure is not a great choice for number crunching, but as soon as I dig anywhere with my software, I discover that the folk wisdom is terribly wrong, and that Clojure kicks ass on CPU and especially on the GPU.

Here's summary, or, if you prefer, a TL;DR: I compared NumPy's corrcoef function with a simple 7-liner that I wrote as a demonstration for my book Numerical Linear Algebra for Programmers. The times for computing corrcoef for a \(1000\times{100000}\) matrix, on the CPU and GPU, are in the following table (lower is better).

	Python & NumPy/CuPy	Clojure & Neanderthal
CPU i7 4790K	946 ms	711 ms (544 ms)
GPU Nvidia GTX 1080Ti	680 ms	30 ms

My implemetation in Clojure is powered by Neanderthal () my free open source library.

Correlation Coefficient

Here's why corrcoef is a good choice for this comparison:

It's a statistical function, commonly used in the wild (data analysis, etc.).
Its implementation requires all levels of BLAS 1, 2 & 3 functions of linear, quadratic, and cubic asymptotic complexity.
- Matrix Multiplication (Blas 3)
- Elementwise math functions (linear)
- Outer product (Blas 2)
It's still simple enough that we can compare apples to apples.

NumPy (optimized)

First, the spec: I run this on an (old-ish but still powerful) i7-4790K desktop with 32GB of RAM. OS is Arch Linux, with the latest Intel MKL and OpenBLAS installed globally. NumPy is provided by pip (Conda on Arch is a mess, and I leave Docker, Kubernetes, and whatnot, to other people). Sadly, intel-numpy doesn't seem to be maintained that well, and it refuses to install through Arch's pip. That leaves NumPy with OpenBLAS, which should not be an issue, since OpenBLAS is very fast, and I expect it to be within a few percent of Intel's MKL. (You might cry foul, but, hey, grab NumPy & MKL and check for yourself).

I enjoy working in Clojure's REPL, so I'll use it for calling NumPy via libpython-clj. You may object again, since that might be a cause of NumPy's troubles, but I claim it's not, and I do it on purpose, to motivate you to fire up your favorite Python dev tool and try to prove me wrong.

We create a NumPy array.

(def x (-> (numpy/linspace 0 2 100000000 :dtype "float32")
           (numpy/reshape [1000 100000])))

I made sure that we use single-precison floats, so we can do a fair comparison with the GPU, which supports fast computations with float32, but is crippled for float64. I reshaped it to \(1000\times{100000}\), a size that you might not need every day, but is nothing unusually big. It represents a data set with \(1000\) variables and \(100,000\) observations.

Next, we call corrcoef to check whether it does what we expect it to.

(numpy/corrcoef x)

[[1.         1.         1.         ... 1.         1.         1.        ]
 [1.         1.         1.         ... 1.         1.         1.        ]
 [1.         1.         1.         ... 1.         1.         1.        ]
 ...
 [1.         1.         1.         ... 1.         1.         0.99999999]
 [1.         1.         1.         ... 1.         1.         1.        ]
 [1.         1.         1.         ... 0.99999999 1.         1.        ]]

We measure a wall clock time. I don't see a need for a more sophisticated benchmark here, involving samples, standard deviations, etc. since this is a long-running computation implemented by a native backend. There's no caches and JITs to warm here, just brutal number crunching. Try it many times, or put it in a loop, and see. The timings will vary, naturally, but within a few percent. Or, if you really want to be sure, please do whatever advanced benchmark you can think of and let me know of the results.

(time (numpy/corrcoef x))

"Elapsed time: 946.143499 msecs"

Just to be clear, this is a great result! This is fast, and is much faster than it would be with interpreted Python, hand-written C++, or pure Java. Nothing is wrong or deficient here. This is good enough.

Clojure & Neanderthal

However, with Clojure & Neanderthal ():

We can do even better!
We can implement it in a simple & elegant way.
We understand 100% of the solution instead of relying on black box implementation that we're scared to look at.

Here's the full implementation of the function that does the same thing as Python's corrcoef. It's taken from Numerical Linear Algebra for Programmers, as one of many examples that teach the full path from math theory to applied number crunching. I tried to make the book as practical as possible, so I demonstrate each and every step, so you can learn by doing, not only by looking at formulas. What's more, the stuff we implement is a top performer (as you'll see now)!

We use functions provided by Neanderthal (), which is free and open-source. This is the full implementation, not some pseudocode. You can find the full discussion in the book.

(defn matrix-correlation! [a!]
  (with-release [ones (entry! (vctr a! (ncols a!)) 1.0)
                 work (mv a! ones)]
    (let [a-mean (rk! (/ -1.0 (dim ones)) work ones a!)]
      (let-release [sigma (mm a-mean (trans a-mean))]
        (sqrt! (copy! (dia sigma) work))
        (with-release [sigma-x*y (rk work work)]
          (div! sigma sigma-x*y))))))

(defn matrix-correlation [a]
  (with-release [a-copy (copy a)]
    (matrix-correlation! a-copy)))

Here's matrix-correlation in action with a random \(1000\times{100000}\) matrix.

(def y (rand-uniform! (fge 1000 100000)))
(time (matrix-correlation y))

The result makes sense.

=>
#RealGEMatrix[float, mxn:1000x1000, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       1.00   -0.00    ⁙       0.00   -0.00
   →      -0.00    1.00    ⁙       0.01   -0.00
   →       ⁙       ⁙       ⁙       ⁙       ⁙
   →       0.00    0.01    ⁙       1.00   -0.01
   →      -0.00   -0.00    ⁙      -0.01    1.00
   ┗                                               ┛

But the result is not what interests us. We would like to know how fast we got it.

"Elapsed time: 710.890478 msecs"

That is even faster than NumPy, and by quite a margin!

But, wait, we can go even faster! If we're willing to sacrifice and overwrite the original data, we can use the destructive version (note the suffix "!").

(time (matrix-correlation! y))

"Elapsed time: 543.826297 msecs"

That's pretty nice result for some textbook code!

GPU acceleration with Nvidia GTX 1080Ti

NumPy does not support GPU acceleration, but Nvidia provides CuPy, which mimics NumPy's API on the GPU. It's Nvidia's stuff, so it should offer the full power of CUDA, right? It doesn't seem to be so.

We create the same array on the GPU (note cupy instead of numpy).

(def gpu-x (-> (cupy/linspace 0 2 100000000 :dtype "float32")
               (cupy/reshape  [1000 100000])))

I'm following Nvidia's official guide, and I've made sure that we wait for the actual computation to finish before measuring time. A common mistake in GPU benchmarks that are too good to be true is when people just queue asynchronous calls and don't wait for the GPU to actually do the job.

(time (do (cupy/corrcoef gpu-x)
          (py. Stream/null synchronize)))

"Elapsed time: 679.769357 msecs"

That is faster than NumPy on the CPU, but not by much. And it is even slower than Neanderthal on the CPU.

The first instinct is to check whether we mistakenly use float64, which is notoriously crippled to be many times slower than float32 on Nvidia's consumer GPU cards.

(py/get-attr gpu-x "dtype")

float32

Nope, we work with float32.

The next issue might be that our array is not on the GPU, but on the CPU, and has to be transferred to the GPU memory, which is rather slow. I followed the suggestion from Nvidia's CuPy documentation, and the array should be on the GPU, but let's check just in case.

(py/get-attr gpu-x "device")

<CUDA Device 0>

Yes, the array is on the GPU.

But, maybe there is something missing. Maybe, despite all this info, the computation takes place on the CPU? I put this computation in a loop and checked nvidia-smi. It shows 100% GPU utilization, so I'm pretty sure that this runs on the GPU, with data that already sits in GPU memory, as it should.

Underwhelming results for CuPy.

How Clojure deals with that? Clojure does not need special drop-in replacement library. Neanderthal supports CPU, Nvidia GPU with CUDA, and AMD GPU with OpenCL engines. All Neanderthal's functions are polymorphic and automatically adapt to whatever hardware you throw at them.

(def gpu-y (rand-uniform! (cuge 1000 100000)))

(time (do (matrix-correlation gpu-y)
          (synchronize!)))

"Elapsed time: 30.677678 msecs"

Yes, that's the acceleration we should expected from the GPU!

In case you have any comments and questions, please contact me. I would love to improve any part of this article, if possible!

Read part 2 of this article with more comparison here.