CUDA and cuBLAS GPU matrices in Clojure

May 20, 2017


The new Neanderthal 0.11 comes with the new CUDA engine! The high-performance Clojure matrix library now supports all three major choices you'd want for crunching those billions of numbers: the CPU, Nvidia GPUs with CUDA, and AMD or Nvidia GPUs (or other accelerators) with OpenCL. Let's see why this new stuff is important (and it really is!).

CUDA/cuBLAS based GPU engine

I prefer free software, but many times I need access to the most powerful stuff. I've added a new engine to Neanderthal that gives us the full speed of Nvidia's CUDA-based cuBLAS library. Since Neanderthal's architecture is already stable, this is transparent to the user. If you've ever written code for the native CPU engine or the OpenCL GPU engine, you already know how to use it!

The only thing you need to take care of is basic engine configuration, but even that only if the defaults are not the right thing for you. If you want to use the first available Nvidia GPU in your machine, here's the hello world:

(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.neanderthal
           [core :refer [asum mm! copy]]
           [cuda :refer [cuv cuge with-default-engine]]])

(with-default-engine
  (with-release [gpu-x (cuv 1 -2 5)]
    (asum gpu-x)))

You can study the Hello World example project on GitHub, which shows a basic Leiningen project setup based on Neanderthal, with examples of CPU, OpenCL, and CUDA code.
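To illustrate how little changes between engines, here is a minimal sketch of the same hello world on the CPU. It assumes the `dv` (double-precision vector) constructor from `uncomplicate.neanderthal.native` and a working native CPU setup; note that `cuv` creates single-precision vectors, so the precision differs. The name `cpu-result` is only for illustration.

```clojure
(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.neanderthal
           [core :refer [asum]]
           [native :refer [dv]]])

;; Same shape as the CUDA hello world, but the vector lives in main memory,
;; so no engine configuration wrapper is needed (assumption: default native setup).
(def cpu-result
  (with-release [cpu-x (dv 1 -2 5)]
    ;; asum is the sum of absolute values: |1| + |-2| + |5|
    (asum cpu-x)))
```

Only the constructor and the engine setup differ; the call to `asum` is the same one used with the GPU vector.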

Even faster!

Neanderthal was already optimized for top CPU speed, and for even more speed on both AMD's and Nvidia's GPUs. You could write very concise Clojure code using the vector and matrix API, and also easily combine it with customized GPU code in ClojureCL.

One major thing was still left untapped though: Nvidia's CUDA-based libraries, which require Nvidia's proprietary, closed-source CUDA technology that is also tied to Nvidia hardware. Bad. Saint IGNUcius does not approve of this sinful technology and I am ashamed to indulge in this blasphemy. On the other hand, it gives us access to a ridiculously well optimized set of libraries, ranging from linear algebra to deep learning, that Nvidia offers at no charge. How fast is it? Let's see…

In the OpenCL engine tutorial, I did a basic exploration of the capabilities of the OpenCL-based Neanderthal engine, which is based on Cedric Nugteren's excellent open-source library CLBlast. It is amazingly fast. For example, it computes a matrix product involving three 8192 × 8192 matrices (\(C_{8192\times8192} = \alpha A_{8192\times8192} \cdot B_{8192\times8192} + \beta C_{8192\times8192}\)) in 220 ms on an Nvidia GTX 1080.
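To make the operation concrete, here is a naive pure-Clojure sketch of the general scaled product \(C = \alpha A \cdot B + \beta C\), the form that the five-argument `mm!` call in the benchmark further down computes. This is only a reference for the arithmetic on nested row-major vectors; the function name `naive-mm` is hypothetical and has nothing to do with how CLBlast or cuBLAS are implemented.

```clojure
(defn naive-mm
  "Returns alpha * A * B + beta * C for matrices given as nested row-major
  vectors: A is m x k, B is k x n, C is m x n."
  [alpha a b beta c]
  (let [m (count a)
        k (count b)
        n (count (first b))]
    (vec (for [i (range m)]
           (vec (for [j (range n)]
                  (+ (* beta (get-in c [i j]))
                     ;; dot product of row i of A and column j of B:
                     ;; k multiplications and roughly k additions
                     (* alpha
                        (reduce + (for [l (range k)]
                                    (* (get-in a [i l])
                                       (get-in b [l j]))))))))))))

(naive-mm 3 [[1 2] [3 4]] [[5 6] [7 8]] 2 [[1 1] [1 1]])
;; => [[59 68] [131 152]]
```

Each of the m × n output entries costs about 2k floating point operations, which is where the operation count discussed next comes from.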

Theoretically, multiplying an \(m \times k\) matrix by a \(k \times n\) matrix requires \(2\times m \times k \times n\) floating point operations (FLOPs). (This does not even count memory operations, but that's the problem of the implementer.) \((2 \times 8192^3) \div 0.220\) is 4.99 TFLOPS (\(10^{12}\) floating point operations per second). This boils down to 5 TFLOPS out of the 8.228 that the card is theoretically capable of. That's 60% utilization, which is quite impressive for a one-man team working part-time! Now I'm trying the same stuff on the same hardware with the CUDA-based engine:

(require '[uncomplicate.clojurecuda.core :refer [synchronize!]])

(with-default-engine
  (let [cnt 8192]
    (with-release [gpu-a (cuge cnt cnt (range (* cnt cnt)))
                   gpu-b (copy gpu-a)
                   gpu-c (copy gpu-a)]
      (time (do (mm! 3 gpu-a gpu-b 2 gpu-c)
                (synchronize!) ;; Wait for the asynchronous mm! to finish
                gpu-c)))))

#CUGEMatrix[float, mxn:8192x8192, order:column, offset:0, ld:8192]

141 ms! \((2 \times 8192^3) \div 0.141\) is 7.798 TFLOPS, almost at the specification maximum of the hardware! Nvidia did a really good job here, and the performance difference is, according to Cedric's experiments, even larger for smaller or non-square matrices. In Clojure land, we now have the choice between a great free OpenCL backend and an impressive proprietary Nvidia backend with unbeatable speed!
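The throughput arithmetic in this section can be reproduced in plain Clojure. The sketch below uses the \(2 \times m \times k \times n\) operation count and the two timings quoted above; the helper names `gemm-flops` and `tflops` are hypothetical, and 8.228 is the theoretical peak figure for the card quoted earlier.

```clojure
(defn gemm-flops
  "Approximate floating point operation count for an (m x k) by (k x n) product."
  [m k n]
  (* 2.0 m k n))

(defn tflops
  "Achieved throughput in TFLOPS (10^12 operations per second)."
  [m k n seconds]
  (/ (gemm-flops m k n) seconds 1e12))

(def opencl-tflops (tflops 8192 8192 8192 0.220)) ; a bit under 5.0
(def cuda-tflops   (tflops 8192 8192 8192 0.141)) ; roughly 7.8

;; Fraction of the quoted 8.228 TFLOPS theoretical peak
(def opencl-utilization (/ opencl-tflops 8.228)) ; roughly 0.61
(def cuda-utilization   (/ cuda-tflops 8.228))   ; roughly 0.95
```

The same helpers work for non-square shapes: pass the actual `m`, `k`, and `n` along with the measured time.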

Access to the Nvidia CUDA ecosystem

So, is all that is left for you to write a few lines of high-level Clojure calls to these GPU operations and easily create a top-notch machine learning cash machine? Well… not so fast. Yes, linear algebra operations can take us a long way, but there is always some custom stuff that needs to be written. Sometimes it is because we cannot express what we need with standard operations, and sometimes because we need to optimize for our special use case.

Neanderthal has us covered: we can still use its data structures and automatic transfer methods to manage our data, and combine them with our custom CUDA kernels with the help of ClojureCUDA. That's where the real power is.

And that's not all. Nvidia ships several libraries for domains beyond linear algebra. Neanderthal helps with connecting to those in much the same way as it helps with our custom ClojureCUDA-based stuff: by taking the data management task off our shoulders. Expect some of those Nvidia libraries to be integrated into the Clojure ecosystem as additional Neanderthal engines, or as separate Uncomplicate libraries.

I expect to integrate at least cuSPARSE and cuSOLVER into Neanderthal, and I'm eyeing cuDNN as an engine for a possible deep learning library. That shouldn't stop you from creating those before I do!

Wrap it up

I've just remembered that I had skipped the announcement post for ClojureCUDA last month when it was released. So, check out that library also :) I had in mind a couple of additional things to say, but this post is getting long, so, until the next time, please check out the new Neanderthal 0.11.0, ClojureCUDA 0.2.0, play with the examples, experiment, and write about what you discovered!