Neanderthal vs ND4J - vol 2 - The Same Native MKL Backend, 1000 x Speedup

Need help with your custom Clojure software? I'm open to (selected) contract work.

June 28, 2018

Please share: .

These books fund my work! Please check them out.

Mercenary work have been finished ahead of time - sounds incredible, but happens! - a handful of time to spare. With my shoulder slightly injured, I have to avoid the gym for a few weeks. Time for free software! More Neanderthal vs ND4J comparison, then?

The part 1 of this series contains the backstory and the first series of experiments comparing Neanderthal () and ND4J, fast scientific computing libraries for Java platform. We are ready to dig further, still staying with matrix multiplication on the CPU. Upcoming articles will tackle other matrix and linear algebra functions, as well as GPU computing. There is more valuable insight at a place where we currently are.

Which one is faster, after all?

Comparisons of Neanderthal () and ND4J started with ND4J's creator claiming that ND4J was considerably faster than Neanderthal in the tests that Skymind (the company that develops ND4J) conducted. Having been intrigued, I run the benchmarks they suggested. I've found the contrary, that Neanderthal was multiple times faster for small matrices, a couple of times faster than mid-sized matrices, and a few dozen percents faster for huge matrices. That was better than I expected! Since I checked the ND4J documentation and confirmed that I followed its recommendations, I called it a day and handed the ball over to the ND4J guys. This has been described in part 1 of this series, Neanderthal vs ND4J - vol 1 - Native Performance, Java, and CPU.

The ND4J guys investigated what happened, and found out that:

  1. While my measurements were correct, ND4J's performance is not optimal with default memory layout. To get the best performance, the resulting matrix has to be in \f order, while the default is \c.
  2. They dug out and fixed some bugs in ND4J.

Those being fixed, ND4J improved. \f instead of \c order makes ND4J only a handful of times slower for small matrices, a few dozen percents slower for mid-sized, and almost the same speed as Neanderthal with large matrices. Paul published the process and the result on his blog.

I might have a minor issue with him changing the test to use the low-level static Nd4j.gemm method instead of mmuli, a method suggested by the documentation that the user would normally call, but since Neanderthal's high-level polymorphic mm! function still keeps the lead, I won't be splitting hairs and crying foul here. The results he reports are valid. Paul also reasoned that mm! is, after all, only a GEMM call with a few additional checks. I argue it does more than that.

As a warm-up, I'll rerun some benchmarks with \f order, leaving you to run various dimensions yourself.

The updated project is here.

First, the imports:

(require '[uncomplicate.commons.core :refer [with-release release double-fn]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.neanderthal
           [core :refer [mm! mm]]
           [native :refer [dge fge]]]
         '[criterium.core :refer [quick-bench]])
(import org.nd4j.linalg.factory.Nd4j
        org.nd4j.linalg.api.ndarray.INDArray
        org.nd4j.linalg.cpu.nativecpu.NDArray
        java.util.SplittableRandom)

We'll use \f layout and Nd4j/gemm:

(defn bench-nd4j-gemm-float [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)
        result (Nd4j/createUninitialized (int-array [m n]) \f)]
    (quick-bench
     (do (Nd4j/gemm ^INDArray m1 ^INDArray m2 ^INDArray result false false 1.0 0.0)
         true))))
(bench-nd4j-gemm-float 128 128 128)
Evaluation count : 34482 in 6 samples of 5747 calls.
             Execution time mean : 19.596878 µs
    Execution time std-deviation : 3.621484 µs
   Execution time lower quantile : 17.475822 µs ( 2.5%)
   Execution time upper quantile : 24.146490 µs (97.5%)
                   Overhead used : 1.140086 ns

Much better than previously, but still behind Neanderthal's timing, which is similar to what ND4J guys got.

May I mention that Neanderthal handles all memory layouts equally well? It doesn't care whether it is CCC, FFF, CCF, CFC (in ND4J's terminology), Neanderthal will figure out how to get the same top speed from the underlying backend. Check that out by providing {:layout :row} (ND4J's C) as an optional argument to fge: (fge m n {:layout :row}).

I can conclude that Neanderthal is the faster library indeed, at least when it comes to matrix multiplication.

  • when the layout is optimal for ND4J, Neandrethal is at equal speed at worst, but often faster;
  • with arbitrary memory layout combinations ND4J slows down, while Neanderthal keeps up its top speed.

If the only thing we had to do was a single matrix multiplication, this would have been the concluding article in the series. It's not a difference that matters in most programs, and they have much larger inefficiencies in other places anyway.

Beyond basic matrix multiplication

Just multiplying a few matrices rarely ever happens, though. Even the simplest algorithm does several calls, and combine different operations. Therefore, it is important that the API is amenable to composition of operations, and when those operations get composed, how they work together.

Neanderthal is much more than native BLAS wrapper - it is a high-level numerical linear algebra environment. Its purpose is not only to provide fast access to underlying computation routines, but also:

  • automate most of the bookkeeping and data tending
  • automatically choose the right method for the underlying structure and call it with optimal options
  • make the API as simple, composable, and declarative as possible
  • guide and teach the programmer, keep simple things simple, and hard things still possible

Enough talk, on with the examples!

Meet the pure mm method. It's similar to mm! but it doesn't overwrite any of its arguments. It doesn't require a pre-created memory for the result. For simple use cases, ND4J has a counterpart, mmul. If we compare them, we'll find out that the difference in speed is what we'd expect by now.

ND4J mmul

(defn bench-nd4j-mmul [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)]
    (quick-bench
     (do (.mmul ^INDArray m1 ^INDArray m2)
         true))))
(bench-nd4j-mmul 128 128 128)
Evaluation count : 21276 in 6 samples of 3546 calls.
             Execution time mean : 36.131573 µs
    Execution time std-deviation : 4.488518 µs
   Execution time lower quantile : 32.052519 µs ( 2.5%)
   Execution time upper quantile : 42.751941 µs (97.5%)
                   Overhead used : 1.140086 ns

Neanderthal mm

(let [splittable-random (SplittableRandom.)]
  (defn random ^double [^double _]
    (.nextDouble ^SplittableRandom splittable-random)))

(defn bench-neanderthal-mm [^long m ^long k ^long n]
  (with-release [m1 (fmap! random (fge m k))
                 m2 (fmap! random (fge k n))]
    (quick-bench
     (release (mm m1 m2)))))
(bench-neanderthal-mm 128 128 128)
Evaluation count : 33714 in 6 samples of 5619 calls.
             Execution time mean : 18.404674 µs
    Execution time std-deviation : 3.261362 µs
   Execution time lower quantile : 15.999125 µs ( 2.5%)
   Execution time upper quantile : 22.894910 µs (97.5%)
                   Overhead used : 1.140086 ns

Better, but nothing to write home about. I've run this only to show you that the starting positions for these methods are close.

Matrix Transformations

You've probably seen formulas like this one in Linear Algebra literature:

\begin{gather*} M_{m\times n} = A_{m \times k_1} B_{k_1 \times k_2} C_{k_2 \times k_3} D_{k_3 \times k_4} E_{k_4 \times n}\\ \end{gather*}

or

\begin{gather*} M = M_1 M_2 M_3 M_4 M_5 E \end{gather*}

We start from some data in matrix E, and do a series of matrix transformations on it. The example is abstract, but I hope you agree that things like this are not unheard of.

Neanderthal mm M=ABCDEF uniform sizes

Neanderthal's mm method can handle as many of those as we need in one call. Let's try with six:

(defn bench-neanderthal-mm [m k1 k2 k3 k4 k5 n]
  (with-release [m1 (fmap! random (fge m k1))
                 m2 (fmap! random (fge k1 k2))
                 m3 (fmap! random (fge k2 k3))
                 m4 (fmap! random (fge k3 k4))
                 m5 (fmap! random (fge k4 k5))
                 m6 (fmap! random (fge k5 n))]
    (quick-bench
     (release (mm m1 m2 m3 m4 m5 m6)))))
(bench-neanderthal-mm 128 128 128 128 128 128 128)
Evaluation count : 6804 in 6 samples of 1134 calls.
             Execution time mean : 98.326749 µs
    Execution time std-deviation : 17.289847 µs
   Execution time lower quantile : 89.346895 µs ( 2.5%)
   Execution time upper quantile : 128.161994 µs (97.5%)
                   Overhead used : 1.140086 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 47.8944 % Variance is moderately inflated by outliers

ND4J mmul M=ABCDEF uniform sizes

I was unable to find the counterpart in ND4J. It seems that it only provides the simple binary matrix multiplication. We'll have to write the chain ourselves, which gives a somewhat awkward, but straightforward code.

(defn bench-nd4j-mmul [m k1 k2 k3 k4 k5 n]
  (with-release [m1 (Nd4j/rand ^int m ^int k1)
                 m2 (Nd4j/rand ^int k1 ^int k2)
                 m3 (Nd4j/rand ^int k2 ^int k3)
                 m4 (Nd4j/rand ^int k3 ^int k4)
                 m5 (Nd4j/rand ^int k4 ^int k5)
                 m6 (Nd4j/rand ^int k5 ^int n)]
    (quick-bench
     (do (.mmul ^INDArray m1
                (.mmul ^INDArray m2
                       (.mmul ^INDArray m3
                              (.mmul ^INDArray m4
                                     (.mmul ^INDArray m5 ^INDArray m6)) )))
         true))))
(bench-nd4j-mmul 128 128 128 128 128 128 128)
Evaluation count : 4752 in 6 samples of 792 calls.
             Execution time mean : 202.247243 µs
    Execution time std-deviation : 68.648835 µs
   Execution time lower quantile : 147.452665 µs ( 2.5%)
   Execution time upper quantile : 313.785035 µs (97.5%)
                   Overhead used : 1.140086 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 81.7637 % Variance is severely inflated by outliers

Hey, this ND4J fella keeps up quite well. Only twice as slow. Not something to write the blog post about.

M=ABCDEF Neanderthal's mm vs ND4J's mmul: 1000 times faster

It's unlikely that all matrices will be such finely serendipitous to have exactly the same dimension. Let's try to vary that a bit, and feed them some arbitrary sizes.

(bench-neanderthal-mm 8 120 800 28 3128 18 1208)
Evaluation count : 14346 in 6 samples of 2391 calls.
             Execution time mean : 55.179664 µs
    Execution time std-deviation : 9.232348 µs
   Execution time lower quantile : 42.028083 µs ( 2.5%)
   Execution time upper quantile : 65.236952 µs (97.5%)
                   Overhead used : 1.140086 ns

Neanderthal's still sprinting well!

(bench-nd4j-mmul 8 120 800 28 3128 18 1208)
Evaluation count : 210 in 6 samples of 35 calls.
             Execution time mean : 4.046360 ms
    Execution time std-deviation : 705.958030 µs
   Execution time lower quantile : 3.060699 ms ( 2.5%)
   Execution time upper quantile : 4.626018 ms (97.5%)
                   Overhead used : 1.140086 ns

Whoops. ND4J is starting to lag behind. It took 4000 μs, compared to Neanderthal's 55 μs. That's 80 × slower! Two orders of magnitude…

Let's continue to play with dimensions.

(bench-neanderthal-mm 13 13456 2300 125 7700 810 100000)
Evaluation count : 30 in 6 samples of 5 calls.
             Execution time mean : 24.621988 ms
    Execution time std-deviation : 1.619110 ms
   Execution time lower quantile : 23.443505 ms ( 2.5%)
   Execution time upper quantile : 26.843352 ms (97.5%)
                   Overhead used : 1.140086 ns

Neanderthal got into the millisecond range. This seems to be a bit tougher problem than the last one.

Now, when I tried to bench ND4J for the same problem, it seemed as if ND4J left for a long trip. That's why I have to resort to using the Clojure's time function to measure a single run.

(defn time-nd4j-mmul [m k1 k2 k3 k4 k5 n]
  (with-release [m1 (Nd4j/rand ^int m ^int k1)
                 m2 (Nd4j/rand ^int k1 ^int k2)
                 m3 (Nd4j/rand ^int k2 ^int k3)
                 m4 (Nd4j/rand ^int k3 ^int k4)
                 m5 (Nd4j/rand ^int k4 ^int k5)
                 m6 (Nd4j/rand ^int k5 ^int n)]
    (time
     (do (.mmul ^INDArray m1
                (.mmul ^INDArray m2
                       (.mmul ^INDArray m3
                              (.mmul ^INDArray m4
                                     (.mmul ^INDArray m5 ^INDArray m6)) )))
         true))))
(time-nd4j-mmul 13 13456 2300 125 7700 810 100000)

My patience was unbreakable, and I finally met the report:

"Elapsed time: 25815.869081 msecs"

That's right. While Neanderthal sprinted, ND4J struggled to finish the task.

25 seconds vs Neanderthal's 25 milli seconds. That's 1000 × the difference.

Little API niceties in Clojure

As a bonus, here's a benchmark function for Neanderthal that works for any number of matrices.

(defn bench-neanderthal-mm [dimensions]
  (with-release [ms (map (comp (partial fmap! random) fge)
                         (butlast dimensions) (rest dimensions))]
    (quick-bench
     (release (apply mm ms)))))

This version expects a sequence of dimensions (note the [ and ]).

(bench-neanderthal-mm [13 13456 2300 125 7700 810 100000])

Yep, Neanderthal comes equipped with a fine touch of Clojure's famous elegance.

If you feel adventurous, write the looping version for ND4J and play with various dimensions. Witness the wild swings in timing differences. Sometimes Neanderthal is multiple orders of magnitude faster, sometimes only a dozen times. When Moon is in the Seventh House, and Jupiter aligns with Mars, then peace will guide the planets and love will steer the stars. Once in a while ND4J aligns with planets, too, and comes within sight.

Conclusions

Now, It would be unfair to say that the speedup would always be 1000 ×, or that you'll see such drastic speedup often. But it shows that, as soon as we move beyond straightforward operations into a more involved stuff, performance gotchas start to accumulate. A program might waste lots of resources without you even suspecting. After all, aren't all those libraries using the same superbly optimized MKL underneath?

In addition to the top speed, Neanderthal holds the programmer's hand as much as possible - meaning: a lot - while ND4J seems to let me fend for myself.

Would you like to see some more matrix multiplication stuff in the next article, or we are ready to explore other functions? Which ones? Feel free to suggest the themes for future articles.

One cool thing that we should do is compare with Numpy. I'm quite confident in expecting that Clojure and Java win, but who knows. If you know Python ecosystem well, it would be great if you can provide some benchmark code.

You can find Neanderthal's source code here: Documentation can be reached via Neanderthal's homepage, while there's a bunch of tutorials at the blog you're reading right now. If you need more theory-oriented textbook style linear algebra code, you can start with Clojure Linear Algebra Refresher (1) - Vector Spaces, while if you remember your matrix-fu, you might need some numerical computation tutorials; start with Clojure Numerics - Part 1 - Use Matrices Efficiently.

Neanderthal vs ND4J - vol 2 - The Same Native MKL Backend, 1000 x Speedup - June 28, 2018 - Dragan Djuric