Neanderthal vs ND4J - vol 1 - Native performance, Java and CPU

June 22, 2018

TLDR; Neanderthal () is considerably faster than ND4J. There is more to that, of course. It's not so much about the results but about the process. Read on.

I have to be a good driver to get the power out of a Lamborghini (that I do not have ;), and a good driver can get an enjoyable ride out of any decent car. The same is with high-performance libraries. Simply throwing a powerful library blindly into a project wouldn't get me far, and might occasionally crash me into a wall. Only when I understand how this stuff works, I can hope to make everything fit together, as anyone doing deep learning today will agree :)

Today, Java has a choice of a few well maintained libraries that can use Intel's MKL for fast native operations. MKL is basically as fast as you can get today on the CPU. The reasoning is that, since two libraries use MKL, they should squeeze the same amount of juice out of it. It's not quite like that.

What are we comparing here

Long time ago, when I started developing Neanderthal, the folk wisdom was that FFI in Java has huge overhead, so pure Java libraries are faster with small and medium sized matrices, while you can hope to get some speedup from using native bindings (such as JBlas) only when matrices are very large. Neanderthal showed this popular opinion to not be true; Neanderthal, a Clojure library backed by custom bindings to MKL, is much faster than pure Java libraries even for rather small matrices.

Roughly at the same time, the folks from Skymind started developing Deeplearning4J and ND4J. They took a similar route, and also got some quite good results, because they did many things right. Somehow, though, it is difficult to find the actual performance numbers. Most folks who use DL4J and ND4J take for granted that ND4J is as fast as possible, and I don't blame them, for ND4J is much faster than any previous Java matrix library. Except Neanderthal, of course :)

I developed Neanderthal to be able to meet some demanding computation requirements. I tirelessly optimized not only the speed, but API details that can't be found in other libraries. During that process, I compared the results not only to Java wrappers, but to native execution itself, to make sure the overhead is tiny, ideally non-existent. Whenever I measured ND4J, Neanderthal was faster. In some cases a little faster, in some cases much faster.

That's why I was surprised to hear that Adam Gibson and his team at Skymind did some benchmarks where they found out that ND4J was faster than Neanderthal. Having measured Neanderthal extensively vs MKL itself, and knowing that the overhead is really, really small, I simply could not see how that would be possible, since that would mean that ND4J backed by MKL must be faster than MKL itself. I was intrigued.

Adam pointed me to a code repository with some basic matrix multiplication benchmarks that they tried. Unfortunately, the code only contains ND4J calls, no Neanderthal calls. The results are not available, but the author, Paul Dubs, reports this: "The result was that for matrices smaller than 128x128 Neanderthal won, for larger ND4J."

In this experiment, I'll concentrate on replicating the tests that the guys from Skymind pointed out, comparing Neanderthal and ND4J brutal in-place matrix multiplication. We'll leave other issues such as the API ergonomics, GPU computing, solving linear equations for future posts. Here I am measuring the approaches: given that both libraries use MKL, I assume that the raw computation speed is the same, and ascribe any differences to the overhead that the library uses in keeping data around and calling appropriate operations.

The setup

I created a project that anyone can use on their own machine to replicate what I'm doing here in the Clojure REPL. The whole code is contained in this post, so you can type it exactly as-is into your REPL. I purposefully have not created a binary that you run blindly and read the report, since the purpose of this blog is to help people learn this stuff. I hope you'll be satisfied with what you learn by typing this stuff out yourself. The lein project is here.

The only thing you have to have on your machine for this to work is MKL. Neanderhtal website has the information and I guess ND4J folks also wrote something similar. I assume MKL is available on your machine.

The measurements are done with the Criterium library. It warms up the JVM, runs the code many times, and generates a nice report with statistics.

After starting the REPL, the first thing we do is require Neanderthal's functions that we'll use.

(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.neanderthal
           [core :refer [mm!]]
           [native :refer [dge fge]]]
         '[criterium.core :refer [quick-bench]])

Then, I'll import the ND4J stuff:

(import org.nd4j.linalg.factory.Nd4j
        org.nd4j.linalg.api.ndarray.INDArray
        org.nd4j.linalg.cpu.nativecpu.NDArray
        java.util.SplittableRandom)

Since I am not an ND4J user, the first concern is to check whether I am using the right stuff. As a reference, I use the code from the Paul's project that Adam pointed to me.

Here's how he created a matrix:

(def m1 (Nd4j/rand 4 4))

I am checking the implementation class of the matrix I got:

(class m1)
org.nd4j.linalg.cpu.nativecpu.NDArray

Now, there is one important thing that people often get wrong in such comparisons: floating point precision. Single precision operations are generally twice (or even more on the GPU) as fast as double precision FP operations. That's why I want to make sure that the ND4J matrices that I'll use in this code use single precision floats, so they are as fast as possible. It is not clear to me how to directly order ND4J factories to use single or double precision based backends. Fortunately, the default seems to be single precision:

(class (.data ^NDArray m1))
org.nd4j.linalg.api.buffer.FloatBuffer

since this is FloatBuffer, I assume that it is single precision, because there is also DoubleBuffer in ND4J, which I assume uses doubles.

Another thing that people get wrong is that they use out-of place operations. Since those copy data, they obviously introduce much overhead. In ND4J, mmul is out of place, while mmuli is in-place matrix multiplication. Again, I use Paul's code as a reference, and he used mmuli, so I'm doing the same here.

(let [m2 (Nd4j/rand 4 4)
      result (Nd4j/createUninitialized (.shape ^INDArray m2))]
  (.mmuli ^INDArray m1 ^INDArray m2 ^INDArray result))
#object[org.nd4j.linalg.cpu.nativecpu.NDArray 0xee3e53b "[[    0.6359,    0.7100,    0.7950,    0.5909], \n [    0.6098,    0.6178,    0.7594,    0.6282], \n [    1.5101,    1.4069,    1.7199,    1.7073], \n [    0.4605,    0.6056,    0.1510,    0.3166]]"]

The result seems consistent, so I'll assume that I'm using ND4J correctly.

The microbenchmark code

Now, the third thing that a careless coder might get wrong is to measure apples and oranges. For example, we might measure both the time to create or read matrices and the actual computations. Here, we want to make sure to prepare matrices before we begin any measurement. Note the quick-bench call. It is called only after m1, m2, and result have been instantiated and populated. Of course, in some applications the creation time can also be relevant. But usually it is not. Usually, we should populate, load, or initialize the data once, and then call many high-speed operations on it. But even if it is, it is not a topic of this benchmark. So, load the data in a whatever way, and then measure the matrix multiplication speed is what we do here.

ND4J:

(defn bench-nd4j-mmuli-float [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)
        result (Nd4j/createUninitialized m n)]
    (quick-bench
     (do (.mmuli ^INDArray m1 ^INDArray m2 ^INDArray result)
         true))))

The equivalent Neanderthal code:

(let [splittable-random (SplittableRandom.)]
  (defn random ^double [^double _]
    (.nextDouble ^SplittableRandom splittable-random)))

(defn bench-neanderthal-mm!-float [^long m ^long k ^long n]
  (with-release [m1 (fmap! random (fge m k))
                 m2 (fmap! random (fge k n))
                 result (fge m n)]
    (quick-bench
     (do (mm! 1.0 m1 m2 0.0 result)
         true))))

Let's run it

I'll call these with the usual dimensions first. From small to large, and see how Neanderthal and ND4J compare. First we'll try the powers of 2. You know, the usual fastest cases.

4x4

(bench-nd4j-mmuli-float 4 4 4)
Evaluation count : 105222 in 6 samples of 17537 calls.
             Execution time mean : 5.942944 µs
    Execution time std-deviation : 157.029961 ns
   Execution time lower quantile : 5.771839 µs ( 2.5%)
   Execution time upper quantile : 6.125343 µs (97.5%)
                   Overhead used : 1.364399 ns
(bench-neanderthal-mm!-float 4 4 4)
Evaluation count : 2425998 in 6 samples of 404333 calls.
             Execution time mean : 249.515958 ns
    Execution time std-deviation : 5.380322 ns
   Execution time lower quantile : 242.310660 ns ( 2.5%)
   Execution time upper quantile : 253.920545 ns (97.5%)
                   Overhead used : 1.364399 ns

As expected, Neanderthal is 24 times faster.

128x128

In Paul's experiments, this was the dimension where ND4J catches up.

(bench-nd4j-mmuli-float 128 128 128)
Evaluation count : 10548 in 6 samples of 1758 calls.
             Execution time mean : 70.099357 µs
    Execution time std-deviation : 5.436201 µs
   Execution time lower quantile : 62.897093 µs ( 2.5%)
   Execution time upper quantile : 75.951324 µs (97.5%)
                   Overhead used : 1.364399 ns
(bench-neanderthal-mm!-float 128 128 128)
Evaluation count : 53238 in 6 samples of 8873 calls.
             Execution time mean : 14.473891 µs
    Execution time std-deviation : 4.845955 µs
   Execution time lower quantile : 11.640464 µs ( 2.5%)
   Execution time upper quantile : 21.214843 µs (97.5%)
                   Overhead used : 1.364399 ns

Here, I get different numbers. Neanderthal is still 5 times faster! Read more for my explanation what might have happened in Paul's benchmarks.

1024x1024

OK, let's see. Maybe Paul's machine is simply different than mine, so the catch-up happens a bit later here?

(bench-nd4j-mmuli-float 1024 1024 1024)
Evaluation count : 48 in 6 samples of 8 calls.
             Execution time mean : 13.769772 ms
    Execution time std-deviation : 1.392356 ms
   Execution time lower quantile : 12.100929 ms ( 2.5%)
   Execution time upper quantile : 15.414595 ms (97.5%)
                   Overhead used : 1.364399 ns
(bench-neanderthal-mm!-float 1024 1024 1024)
Evaluation count : 72 in 6 samples of 12 calls.
             Execution time mean : 7.969413 ms
    Execution time std-deviation : 2.128109 ms
   Execution time lower quantile : 5.592680 ms ( 2.5%)
   Execution time upper quantile : 10.434048 ms (97.5%)
                   Overhead used : 1.364399 ns

Nope, on 1024x1024 matrices, which is already quite large, so the overhead should not be noticeable, Neanderthal is still almost twice as fast as ND4J!

4096x4096

What about increasing the dimensions even further?

(bench-nd4j-mmuli-float 4096 4096 4096)
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 519.947225 ms
    Execution time std-deviation : 88.737480 ms
   Execution time lower quantile : 457.144434 ms ( 2.5%)
   Execution time upper quantile : 626.192143 ms (97.5%)
                   Overhead used : 1.364399 ns
(bench-neanderthal-mm!-float 4096 4096 4096)
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 421.734319 ms
    Execution time std-deviation : 38.450777 ms
   Execution time lower quantile : 385.047354 ms ( 2.5%)
   Execution time upper quantile : 464.568442 ms (97.5%)
                   Overhead used : 1.364399 ns

Neanderthal is still faster. Not many times, but still a healthy 20% faster, which pleasantly surprised me!

10000x10000

Cool. But what about huge, not-powers of 2 matrices? When the actual computation in MKL runs in seconds, both libraries should do the same? Apparently, not:

(bench-nd4j-mmuli-float 10000 10000 10000)
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 7.454176 sec
    Execution time std-deviation : 330.407747 ms
   Execution time lower quantile : 7.079223 sec ( 2.5%)
   Execution time upper quantile : 7.883593 sec (97.5%)
                   Overhead used : 1.364399 ns
(bench-neanderthal-mm!-float 10000 10000 10000)
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 6.496778 sec
    Execution time std-deviation : 216.152094 ms
   Execution time lower quantile : 6.299091 sec ( 2.5%)
   Execution time upper quantile : 6.796785 sec (97.5%)
                   Overhead used : 1.364399 ns

There is still more than 15% lead by Neanderthal! This is much better than I expected!

Some random odd dimensions

MKL is rather optimized, so dimensions that are not power of 2 are still performant. Moreover, both libraries use the same MKL backend, so this should not be much relevant, but since it costs me nothing to run the benchmark function with different arguments, why not?

(bench-nd4j-mmuli-float 3129 7845 1299)
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 288.518780 ms
    Execution time std-deviation : 82.622658 ms
   Execution time lower quantile : 217.015311 ms ( 2.5%)
   Execution time upper quantile : 392.145625 ms (97.5%)
                   Overhead used : 1.364399 ns
(bench-neanderthal-mm!-float 3129 7845 1299)
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 196.261770 ms
    Execution time std-deviation : 36.661680 ms
   Execution time lower quantile : 166.886059 ms ( 2.5%)
   Execution time upper quantile : 253.401196 ms (97.5%)
                   Overhead used : 1.364399 ns

By this time, I expect Neanderthal to win no matter what shape I throw at it, and I'm not disappointed to find out that it is a few dozen percents faster on average.

But why Paul got different results?

Of course, since I can't see his code, and I can't see the actual results that he got, I can not know for sure. However, I might guess. My guess is that he created double precision matrices in Neanderthal and then run those. Let's try to do that:

(defn bench-neanderthal-mm!-double [^long m ^long k ^long n]
  (with-release [m1 (fmap! random (dge m k))
                 m2 (fmap! random (dge k n))
                 result (dge m n)]
    (quick-bench
     (do (mm! 1.0 m1 m2 0.0 result)
         true))))

This should be the dimension where ND4J catches up in his report:

(bench-neanderthal-mm!-double 128 128 128)
Evaluation count : 20778 in 6 samples of 3463 calls.
             Execution time mean : 43.818634 µs
    Execution time std-deviation : 14.212249 µs
   Execution time lower quantile : 25.217389 µs ( 2.5%)
   Execution time upper quantile : 58.573658 µs (97.5%)
                   Overhead used : 1.364399 ns

Scroll up to see that this is still faster than ND4J's single precision result of 70 microseconds! But Neanderthal's 43 microseconds here with doubles are slower than Neanderthal's 14 microseconds with floats, so it's consistent with what Paul got on his machine.

Let's try to ramp it up:

(bench-neanderthal-mm!-double 256 256 256)
Evaluation count : 2664 in 6 samples of 444 calls.
             Execution time mean : 289.675935 µs
    Execution time std-deviation : 52.723462 µs
   Execution time lower quantile : 237.323191 µs ( 2.5%)
   Execution time upper quantile : 341.534073 µs (97.5%)
                   Overhead used : 1.364399 ns
(bench-nd4j-mmuli-float 256 256 256)
Evaluation count : 2628 in 6 samples of 438 calls.
             Execution time mean : 265.474912 µs
    Execution time std-deviation : 11.124842 µs
   Execution time lower quantile : 250.634715 µs ( 2.5%)
   Execution time upper quantile : 276.895468 µs (97.5%)
                   Overhead used : 1.364399 ns

So, running 256x256 matrices, ND4J single-precision and Neanderthal double precision, and the speed is roughly the same.

Of course, Neanderthal running single-precision computation is twice as fast, as expected:

(bench-neanderthal-mm!-float 256 256 256)
Evaluation count : 2340 in 6 samples of 390 calls.
             Execution time mean : 155.588860 µs
    Execution time std-deviation : 53.847954 µs
   Execution time lower quantile : 96.616695 µs ( 2.5%)
   Execution time upper quantile : 217.512076 µs (97.5%)
                   Overhead used : 1.364399 ns

Conclusions

This post was long overdue, and it's good that I got intrigued and took time to write it. I hope that it's a start of a series of experiments where you, me, and people from Skymind test and showcase Java for high speed computations.

I'm pleasently surprised by how Neanderthal was much faster across the board. I expected smaller difference. But these tests show both libraries in favorable light. I am almost sure that both would be faster than Numpy; that would be a good comparison.

Of course, matrix multiplication speed is not the only one measurement. There are many more things to consider. But matrix multiplication is a good indicator of the overall overhead. Feel free to do some tests of the things you think are important, fix the goalposts, and I'll try to replicate those with Neanderthal.

Some ideas for the next comparisons are GPU computing, linear equation solving, shifting data around etc. My favorite theme is also combining Neanderthal with custom kernels, both CUDA and OpenCL, so that will be covered for sure :)

Please join the discussion :)

Link to the response from the ND4J folks and the second part is already on my blog: Neanderthal vs. ND4J – Vol 2 – The Same Native MKL Backend, 1000 X Speedup

You can find Neanderthal's source code here:

Neanderthal vs ND4J - vol 1 - Native performance, Java and CPU - June 22, 2018 - Dragan Djuric