Neanderthal vs ND4J - vol 4 - Fast Vector Broadcasting in Java, CPU and CUDA

July 25, 2018

In Part 1, 2, and 3, of this series, we examined general dense matrix multiplications. I demonstrated just one important feature of Neanderthal () (the library I develop), and any other matrix library. GEMM and its sugared wrappers are probably the most useful operation in this space. However, any curious programmer will think, "Why the hell everybody only cares and talk about GEMM? There are plenty of other functions that I need. Being good at GEMM is just one microbenchmark."

I agree. What's more, now I'll show to you that with other functions, Neanderthal's performance stays strong, even for the things you have to implement yourself. GEMM is usually well optimized in all decent libraries. The further away we move from GEMM, libraries start to be sloppy.

I wanted to cover several vectorized operations on large arrays in this article: broadcasting, typical mappings, reductions, and math functions. When I finished writing, the article ended up a lot longer than I hoped. That's why here we are only going to cover broadcasting.

If you're impatient to hear the results of other operations, here is the summary: Neanderthal is faster for gigantic arrays, several times to an order of magnitude faster for large-ish arrays, and a couple orders of magnitude faster for tiny arrays. The exact difference varies across operations, of course, but whereas with GEMM Neanderthal was from just a bit faster to a several times faster, now the performance multipliers are much larger. That's the theme of the next article, though.

But, broadcasting is more interesting since it doesn't come as a built-in in Neanderthal, so we can learn a lot trying to write an implementation that would be on par with the optimized one that comes with Nd4j.

First, the imports.

(require '[uncomplicate.commons.core :refer [with-release release]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
             :refer [init context device current-context current-context! synchronize!]]
           [core :refer [dot raw row col entry! mrows rows asum axpy! vctr rk!]]
           [native :refer [fge fv]]
           [cuda :refer [cuge cuv set-engine!]]]
         '[criterium.core :refer [quick-bench]])
(import org.nd4j.linalg.factory.Nd4j

Vector broadcasting

Broadcasting is a staple operation in deep learning. It figures prominently in Nd4J's documentation, and is one of the first top features that programmers ask when evaluating matrix libraries. So, Nd4j has it in the form of addiRowVector and addiColumnVector methods. Don't mistakenly use non-destructive addRowVector and addColumnVector, if you don't want to pay 5-10× performance penalty. Nd4j does not handle large allocations very gracefully.

Broadcasting is not a linear algebra operation, and is a bit loosely defined mathematically, and I deliberately refused to add it to Neanderthal. I think it belongs to a machine learning library built on top of Neanderthal. What a handicap for Neanderthal! Shouldn't such grave omission be a deal breaker?

Instead of crying, I'll demonstrate how awesome Neanderthal is: I'll implement broadcasting with a Neanderthal one-liner, and that one-liner will be competitive with Nd4j's optimized broadcasting!

Here's how Nd4j's broadcasting works: we have a matrix a and want to add a lower-dimension structure, vector x, to it, but since the dimensions do not match, we would like to "upgrade" it by "broadcasting" it to each row (or column) of a.

(let [a (Nd4j/ones 2 3)
      x (Nd4j/ones 3)]
  (.addiRowVector ^INDArray a ^INDArray x))
#object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x7e850238 "[[    2.0000,    2.0000,    2.0000], \n [    2.0000,    2.0000,    2.0000]]"]

Let's measure Nd4j's performance for various sizes:

(defn bench-nd4j-addiRowVector [^long m ^long n]
  (let [a (Nd4j/ones m n)
        x (Nd4j/ones n)]
     (do (.addiRowVector ^INDArray a ^INDArray x)

Here's how to call it

;;(bench-nd4j-addiRowVector 100 100)

And the results for some representative dimensions:

;; 10x15: 1.687800 µs
;; 100x100: 7.704195 µs
;; 1000x1000: 67.836326 µs
;; 100x10000: 67.930071 µs
;; 10000x100: 83.789454 µs
;; 10000x10000: 29.833284 ms

A very naive broadcasting implementation in Neanderthal

Even if I didn't know anything about linear algebra, I could employ a stock tool from the programmer's toolbox, iteration. I'll naively ask Neanderthal's matrix a to give me a sequence of its rows, which are vectors, and add x to those vectors one by one.

(defn naive-add-row-vector! [a x]
  (doseq [rw (rows a)]
    (axpy! x rw))

Let's test it briefly:

(let [a (fge 2 3 [1 2 3 4 5 6])
      x (entry! (fv 3) 1)]
  (naive-add-row-vector! a x))
#RealGEMatrix[float, mxn:2x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       2.00    4.00    6.00
   →       3.00    5.00    7.00
   ┗                               ┛

As the phrase goes, you can "convince yourself that this is equivalent functionality" to Nd4j's. Neanderthal has somewhat prettier printing, but that is not of major concern right now.

Performance of a very naive Neanderthal broadcasting

Since we started with the most naive implementation, I expected Nd4j to take a giant lead here. I have been so naive! Neanderthal was not that much behind!

(defn bench-neanderthal-naive-add-row-vector [m n]
  (with-release [a (entry! (fge m n {:layout :row}) 1)
                 x (entry! (fv n) 1)]
     (do (naive-add-row-vector! a x)
;;(bench-neanderthal-naive-add-row-vector 100 100)

Here are the results:

;; 10x15: 1.379596 µs
;; 100x100: 11.467549 µs
;; 1000x1000: 216.736544 µs
;; 100x10000: 140.238912 µs
;; 10000x100: 1.163332 ms
;; 10000x10000: 31.391642 ms

A less naive broadcasting in Neanderthal

But why stop here? We can improve this naive implementation and make it less naive, to show you that with a good tool in your toolbox, and a little knowledge of textbook math you can make wonders.

So, what caught my eye in the previous measurements, is that cases when the matrix is "flat" require a lot of looping iterations, each calling native operation and wasting its time by not giving it enough work to do. What I would prefer is, of course, to call fewer more demanding operations, preferably just one :)

Take a look at the diagram on the Nd4j user guide that explains what broadcasting does, and recall a bit of basic linear algebra. If you imagine the vector to be broadcast to be a \(1 \times{} n\) matrix, and imagine a \(m \times{} 1\) matrix of ones, you may recognize that their product places the elements exactly at the places where they need to end up.

This might sound a bit messy. Whaaat? I'd have to convert the vector to a matrix and whatnot? Although it would be easy in Neanderthal, and it would be hidden inside a function, there is something even simpler. Yep, whatever I end up needing for these kinds of tasks, Neanderthal just end up already having a convenient function to use!

In this case, it is the rk! function, which can multiply each combination of elements of two orthogonal vectors resulting in a matrix that can be summed element-by-element with another matrix.

Here is a less naive broadcasting implementation in Neanderthal. This time, a two-liner:

(defn less-naive-add-row-vector! [a x]
  (with-release [y (entry! (raw (col a 0)) 1)]
    (rk! y x a)))
(let [a (fge 2 3 [1 2 3 4 5 6])
      x (entry! (fv 3) 1)]
  (less-naive-add-row-vector! a x))
#RealGEMatrix[float, mxn:2x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       2.00    4.00    6.00
   →       3.00    5.00    7.00
   ┗                               ┛

It works correctly; what about the performance?

Performance of a less naive Neanderthal broadcasting

(defn bench-neanderthal-less-naive-add-row-vector [m n]
  (with-release [a (entry! (fge m n {:layout :row}) 1)
                 x (entry! (fv n) 1)]
     (do (less-naive-add-row-vector! a x)
;; 10x15: 506.063330 ns
;; 100x100: 1.768599 µs
;; 1000x1000: 61.031548 µs
;; 100x10000: 155.047204 µs
;; 10000x100: 66.700431 µs
;; 10000x10000: 26.320592 ms

Now Neanderthal even took the lead!

With small-ish matrices it is 3-4 × faster, while with large and huge matrices it is on par with Nd4J. The only slip is with "flat" matrices, it is almost 3× slower here, and that's because it has to create a huge vector of 10000 ones for each invocation of the function. That's something that can be improved, but this is a makeshift implementation anyway. In "real" code, I use optimized broadcasting (sorry, I hasn't released that library as open source), and it works faster.

When we have large arrays and need speed, the real saver is GPU computing anyway. So why not try that instead?

Broadcasting on the GPU

Both libraries support CUDA, so let's see how those compare on Nvidia's GPUs. Neanderthal also supports OpenCL, so it can run on AMD's and Intel's cards, but Nd4j doesn't, so I won't be able to do any comparisons on my fine AMD GPUs. The most powerful GPU in my desktop is currently Nvidia GTX 1080 Ti, if you need the reference for the following numbers.

Nd4J can't use CPU and GPU concurrently (or at least that's what the documentation suggests). I had to kill my REPL session, change the Nd4j dependency to use CUDA backend, and start new JVM. On top of that, Nd4j's documentation does not talk about how to control the actual GPU device regarding explicit stream synchronization, which is the key point in this benchmark since GPU operations are asynchronous. With Neanderthal, I could call ClojureCL's synchronize, but with Nd4j I had to call a small summing to force it to complete the whole broadcasting instead of measuring just the work queuing into the stream. I wanted this comparison to be fair, so I did the same thing in Neanderthal, although it is not necessary, and on top of that Neanderthal's broadcasting is already being synchronized by the creation of the vector of ones (which is also not necessary but I wanted this blog to be simple, so I won't cry over these minor points).

Nd4j broadcasting on the GPU

(defn bench-nd4j-addiRowVector-cuda [^long m ^long n]
  (let [a (Nd4j/ones m n)
        x (Nd4j/ones n)]
     (do (.addiRowVector ^INDArray a ^INDArray x)
         (.sumNumber ^INDArray x)))))

;;(bench-nd4j-addiRowVector-cuda 10000 10000)
;; 10x15: 241.076577 µs
;; 100x100: 138.564805 µs
;; 1000x1000: 223.194717 µs
;; 100x10000: 289.878918 µs
;; 10000x100: 211.342593 µs
;; 10000x10000: 7.895556 ms

Except for the largest dimension, Nd4j's broadcasting runs not only slower than Nd4j's CPU implementation, not only slower that a less naive Neanderthal broadcasting on the CPU; it is slower than the most naive broadcasting implementation on the CPU.

That's not only Nd4j's fault, since broadcasting is simply not demanding enough to squeeze the juice out of the GPU. I mentioned that more as a warning to inexperienced readers to not expect the magic words GPU and CUDA to magically speed up the code.

A less naive Neanderthal broadcasting on the GPU

Let's see Neanderthal.

(defn bench-neanderthal-less-naive-add-row-vector-cuda [m n]
  (with-release [a (entry! (cuge m n) 1)
                 x (entry! (cuv n) 1)]
     (do (less-naive-add-row-vector! a x)
         (asum x)))))

(current-context! (context (device 0)))
;;(bench-neanderthal-less-naive-add-row-vector-cuda 10000 10000)
(release (current-context))
;; 10x15: 52.111330 µs
;; 100x100: 52.281520 µs
;; 1000x1000: 81.112859 µs
;; 100x10000: 80.279229 µs
;; 10000x100: 74.671313 µs
;; 10000x10000: 2.601956 ms

As I mentioned, on top of Nd4j's summing synchronization handicap, it has its own array creating and populating with ones handicap. Interestingly, a less naive broadcasting implementation turned out to be 3 × faster than Nd4j's CUDA code! Most of that is in GPU communication.

Without asum, and with explicit ClojureCL synchronize!, all these 50s turn into 28 microseconds. That's exactly why those not-so-demanding operations are not (much) faster on the GPU: the actual computation is just a small part of the time. GPU's have huge throughput, but have much higher latency than the CPU code, even when there is not any data transfer.

The real puzzle here is where Nd4j wasted these 150 microseconds that it is slower than Neanderthal in most cases, or why it stashed 5 milliseconds compared to Neanderthal in that 100M matrix…


I hope this post has helped you learn a few things about linear algebra, and gain some confidence in the code you write compared to the magical libraries. Usually, optimized library code is much better than what you'd write, but don't underestimate your own abilities. Also, although cuBLAS and MKL have amazing computational routines, higher level libraries are not always examples of how to use them properly; don't trust them blindly.

You can find Neanderthal's source code here:

Documentation can be reached via Neanderthal's homepage, while there's a bunch of tutorials at the blog you're reading right now. If you need more theory-oriented textbook style linear algebra code, you can start with Clojure Linear Algebra Refresher (1) - Vector Spaces, while if you remember your matrix-fu, you might need some numerical computation tutorials; start with Clojure Numerics - Part 1 - Use Matrices Efficiently.

Neanderthal vs ND4J - vol 4 - Fast Vector Broadcasting in Java, CPU and CUDA - July 25, 2018 - Dragan Djuric