Neanderthal vs ND4J - vol 3 - Clojure Beyond the Fast Native MKL Backend

June 29, 2018

Some free time fell into my lap, and I decided to continue helping the DL4J team at Skymind improve ND4J, free of charge, since they are generous folks who contribute DL4J as free software so anyone can use it for learning, fun, and profit. In Part 2 of this series, I demonstrated just one of the strengths of Neanderthal () (the library I develop), and here I analyze the code and measurements the ND4J developers wrote as the response.

The part 1 of this series contains the backstory and the initial comparisons of these two fast scientific computing libraries for Java platform.

First of all, let me thank the ND4J guys, especially Paul Dubs, for handling this thread very politely. I know these kinds of discussions can heat up, so I apologize in advance if I cause the heat to go up more than necessary from time to time. I respect them and their work. I hope we'll continue these purely technical comparisons cool-headedly.

ND4J team's implementation

Instead of the ND4J code that I wrote in Part 2 and used as a strawman, the ND4J guys supplied a better implementation.

First, the imports.

(require '[uncomplicate.commons.core :refer [with-release release]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.neanderthal
           [core :refer [mm! mm]]
           [native :refer [dge fge]]]
         '[criterium.core :refer [quick-bench]])
(import org.nd4j.linalg.factory.Nd4j
        org.nd4j.linalg.api.ndarray.INDArray
        org.nd4j.linalg.cpu.nativecpu.NDArray
        java.util.SplittableRandom)

The new ND4J benchmark method:

(defn bench-nd4j-mm-forwards [m k1 k2 k3 k4 k5 n]
  (let [m1 (Nd4j/rand ^int m ^int k1)
        m2 (Nd4j/rand ^int k1 ^int k2)
        m3 (Nd4j/rand ^int k2 ^int k3)
        m4 (Nd4j/rand ^int k3 ^int k4)
        m5 (Nd4j/rand ^int k4 ^int k5)
        m6 (Nd4j/rand ^int k5 ^int n)]
    (quick-bench
     (do (-> ^INDArray m1
             (.mmul ^INDArray m2)
             (.mmul ^INDArray m3)
             (.mmul ^INDArray m4)
             (.mmul ^INDArray m5)
             (.mmul ^INDArray m6))
         true))))

You'll notice that this mmul chain is called from left to right instead from right to left.

Trying out the improved ND4J code

Trying it with the same data as in the previous part:

(bench-nd4j-mm-forwards 13 13456 2300 125 7700 810 100000)
Evaluation count : 12 in 6 samples of 2 calls.
             Execution time mean : 57.653097 ms
    Execution time std-deviation : 14.216856 ms
   Execution time lower quantile : 44.730686 ms ( 2.5%)
   Execution time upper quantile : 71.051185 ms (97.5%)
                   Overhead used : 1.640938 ns

As Paul also measured, that's roughly only twice slower than Neanderthal's 25 milliseconds, and nowhere near 25 seconds!

Paul also correctly noted that the new implementation is not really any better than the implementation I provided in my previous post. It does not work so well with arbitrary dimensions, and is optimized to handle the opposite direction.

Let's do the most trivial thing and reverse those arguments:

(bench-nd4j-mm-forwards 100000 810 7700 125 2300 13456 13)

Don't do this at home. You'll have to wait for two hours. Use the equivalent time-nd4j-mm-forwards, since ND4J has the same problem again: 26 seconds:

"Elapsed time: 25971.33278 msecs"

In almost everything Paul said in his response, he is right, and I agree (I might nitpick here and there, but it doesn't matter). However, he missed the main point in thinking that any of these directions specifically favor Neanderthal. In general case, which is not amenable to any of the two directions, you'll get more than 2x difference.

First, ND4J, from both directions

Paul complains that the dimension 13 at left and 100000 at right stacks this computation favorably to be computed from the left, while I intentionally implemented ND4J.mmul to compute from the right. I'll remove that obstacle and make both sides equal: 10000 (100000×100000 can't fit into memory).

(time-nd4j-mm-forwards 10000 13456 2300 125 7700 810 10000)
"Elapsed time: 2570.365788 msecs"

The right-sided ND4J call from the previous article:

(time-nd4j-mmul 10000 13456 2300 125 7700 810 10000)
"Elapsed time: 10013.720258 msecs"

Neandrthal's mm

Let's see Neanderthal:

(defn bench-neanderthal-mm [m k1 k2 k3 k4 k5 n]
  (with-release [m1 (fmap! random (fge m k1))
                 m2 (fmap! random (fge k1 k2))
                 m3 (fmap! random (fge k2 k3))
                 m4 (fmap! random (fge k3 k4))
                 m5 (fmap! random (fge k4 k5))
                 m6 (fmap! random (fge k5 n))]
    (quick-bench
     (release (mm m1 m2 m3 m4 m5 m6)))))
(bench-neanderthal-mm 10000 13456 2300 125 7700 810 10000)
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 308.434844 ms
    Execution time std-deviation : 63.680791 ms
   Execution time lower quantile : 242.705198 ms ( 2.5%)
   Execution time upper quantile : 376.655029 ms (97.5%)
                   Overhead used : 1.640938 ns

That's roughly 8 times faster than the fastest of the two directions computed with ND4J.

It's not as rare as one may think

And, the implementation of the chained multiplication that I used in the previous post is not even a fictional strawman. It's exactly what 3-rd party ND4J wrappers use. See this implementation of chained mmul in core.matrix, which is a library DL4J guys (semi)officially provide a ND4J backend for. That library is used by many Clojure projects, and anyone using this functionality will be bitten by it. It's not ND4J's fault, but it's relevant.

I will eat my hat (don't worry, it's a hat made of fine Belgian chocolate) if it is not true that virtually every user of ND4J would come up with exactly the same ND4J implementation that Paul and I used here, and will think that it's as fast as it can go, since ND4J is fast, and is using MKL under the hood!

That specific functionality may not be used by that many people, much often, I admit, but still - there are many places that accumulate similar sorts of problems…

Conclusions

ND4J developers may (rightfully) complain that it is not fair to compare high-level features that Neanderthal has in cases where ND4J does not provide the exact match. OK, I might agree. In the next article, I'll invert that unfair situation, and be unfair to Neanderthal. I'll try to test some of the core operations that ND4J has, and Neanderthal doesn't, so I'll have to implement them through the article using whatever makeshift contraption I can come up with using Neanderthal's existing features.

I can repeat the conclusions from Part 2. Performance gotchas are easy to accumulate. Neanderthal helps you avoid that. ND4J a bit less (but it's still a cool library). This is only an illustration of the general challenge: a program might waste lots of resources without you even suspecting. It may not always be the case, but it happens more often than you think.

I'm not saying that Neanderthal gets everything right, nor that it is perfect. It seems to get more things right than ND4J does. And I'd like if people help me in unearthing some bugs in Neanderthal, or bring to my attention better ways to implement some things.

I'm rather satisfied with the results the ND4J folks are reporting. From Adam Gibson's initial claim that ND4J is considerably faster than Neanderthal, we now got to the place where it's a success for ND4J to be only 2× slower than Neanderthal. We still couldn't find a case to support Adam's claims yet.

You can find Neanderthal's source code here: Documentation can be reached via Neanderthal's homepage, while there's a bunch of tutorials at the blog you're reading right now. If you need more theory-oriented textbook style linear algebra code, you can start with Clojure Linear Algebra Refresher (1) - Vector Spaces, while if you remember your matrix-fu, you might need some numerical computation tutorials; start with Clojure Numerics - Part 1 - Use Matrices Efficiently.

Neanderthal vs ND4J - vol 3 - Clojure Beyond the Fast Native MKL Backend - June 29, 2018 - Dragan Djuric