Neanderthal vs ND4J - vol 5 - Why are native map and reduce up to 100x faster in Clojure?

You can adopt a pet function! Support my work on my Patreon page, and access my dedicated discussion server. Can't afford to donate? Ask for a free invite.

November 14, 2018

Please share: .

See Part 1, 2, 3, and 4, of this series, for comparisons of matrix operations on the CPU and broadcasting on the CPU and GPU. Now I demonstrate another common area where Neanderthal () is much faster than Nd4j despite both libraries using optimized native routines and the same hardware.

When I wrote the last article, I had too much material, and I left comparisons of mapping and reduction functions for later. Some months pass, without me finding time to write a proper narrative. But, now I think there is no need for much explanation. Here is the code, and I put results that I get on my old Intel i7 4790k processor for different sizes of vectors and matrices. What else speaks better than code?

As examples, I typically use functions that I found in the Nd4j documentation and tutorial.

TL;DR: Neanderthal () is typically an order of magnitude faster. For small data even more. For very large, Nd4j catches up.

(require '[uncomplicate.commons.core :refer [with-release release]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
           [core :refer [dot raw row col entry! mv! sum mrows asum axpy! vctr rk!]]
           [native :refer [fge fv]]
           [math :refer [sqrt]]]
         '[criterium.core :refer [quick-bench]])
(import org.nd4j.linalg.factory.Nd4j

(let [splittable-random (SplittableRandom.)]
  (defn random ^double [^double _]
    (.nextDouble ^SplittableRandom splittable-random)))

Distance (dot product)

Calculating distance between two vectors or matrices involves both mapping, which is trivially parallelizable, and reduction.


(defn bench-nd4j-distance
   (let [x (Nd4j/rand (int-array [n]))
         y (Nd4j/rand (int-array [n]))]
     (quick-bench (.distance2 ^INDArray x ^INDArray y))))
  ([m n]
   (let [x (Nd4j/rand \f (int-array [m n]))
         y (Nd4j/rand \f (int-array [m n]))]
     (quick-bench (.distance2 ^INDArray x ^INDArray y)))))


(defn bench-neanderthal-distance
   (with-release [x (fmap! random (fv n))
                  y (fmap! random (fv n))]
     (quick-bench (sqrt (dot x y)))))
  ([m n]
   (with-release [x (fmap! random (fge m n))
                  y (fmap! random (fge m n))]
     (quick-bench (sqrt (dot x y))))))


  • Distance between vectors
    dimension Nd4j Neanderthal Neanderthal speedup
    2 2.865009 µs 64.005593 ns \(\times\) 45
    10 3.055880 µs 62.042329 ns \(\times\) 49
    100 3.104767 µs 66.816429 ns \(\times\) 47
    1,000 3.712104 µs 92.034678 ns \(\times\) 40
    10,000 7.369057 µs 908.968553 ns \(\times\) 8
    1 M 369.138225 µs 68.781036 µs \(\times\) 5
    100 M 39.677566 ms 25.768197 ms \(\times\) 1.5
  • Distance between matrices
    dimensions Nd4j Neanderthal Neanderthal speedup
    10 \(\times\) 15 3.313088 µs 278.192235 ns \(\times\) 12
    100 \(\times\) 100 7.109816 µs 1.115164 µs \(\times\) 6
    1,000 \(\times\) 1,000 357.002435 µs 82.225425 µs \(\times\) 4
    100 \(\times\) 10,000 423.156695 µs 76.504216 µs \(\times\) 5
    10,000 \(\times\) 100 335.000853 µs 81.241423 µs \(\times\) 4
    10,000 \(\times\) 10,000 40.107896 ms 27.167233 ms \(\times\) 1.5



(defn bench-nd4j-sum
   (let [x (Nd4j/rand (int-array [n]))]
     (quick-bench (.sumNumber ^INDArray x))))
  ([m n]
   (let [a (Nd4j/rand \f (int-array [m n]))
         d (int-array [0])]
     (quick-bench (.sum ^INDArray a d)))))


(defn bench-neanderthal-sum
   (with-release [x (fmap! random (fv n))]
     (quick-bench (sum x))))
  ([m n]
   (with-release [a (fmap! random (fge m n))]
     (quick-bench (sum a)))))


You'll notice that this is the only function on the CPU where Nd4j can be faster than Neanderthal.

  • Vector sum
    dimension Nd4j Neanderthal Neanderthal speedup
    2 4.214011 µs 71.435637 ns \(\times\) 59
    10 4.232423 µs 70.562578 ns \(\times\) 60
    100 4.299567 µs 84.437461 ns \(\times\) 51
    1,000 4.627851 µs 120.773336 ns \(\times\) 39
    10,000 14.572367 µs 1.225827 µs \(\times\) 12
    1 M 219.302064 µs 130.245611 µs \(\times\) 1.7
    100 M 22.361447 ms 28.147435 ms \(\times\) 0.8
  • Matrix sum
    dimensions Nd4j Neanderthal Neanderthal speedup
    10 \(\times\) 15 5.904680 µs 84.925045 ns \(\times\) 69
    100 \(\times\) 100 21.092154 µs 1.254015 µs \(\times\) 17
    1,000 \(\times\) 1,000 52.358064 µs 131.412269 µs \(\times\) 0.4
    100 \(\times\) 10,000 103.955403 µs 147.992786 µs \(\times\) 0.7
    10,000 \(\times\) 100 64.909655 µs 129.061267 µs \(\times\) 0.5
    10,000 \(\times\) 10,000 15.360876 ms 27.866048 ms \(\times\) 0.5

Absolute sum

OK, I am guilty of not making sum as fast as I should have. However, I can offer the asum function, which covers most of sum use cases. See how fast that is in Neanderthal.


Absolute sum is cumbersome to call in Nd4j so I'll skip that code and use sum results from Nd4j for comparison. asum has more work to do anyway, and in my measurements, Nd4j sum was never slower than Nd4j asum.


(defn bench-neanderthal-asum
   (with-release [x (fmap! random (fv n))]
     (quick-bench (asum x))))
  ([m n]
   (with-release [a (fmap! random (fge m n))]
     (quick-bench (asum a)))))


  • Vector asum
    dimension Nd4j sum Neanderthal asum Neanderthal speedup
    2 4.214011 µs 50.302131 ns \(\times\) 84
    10 4.232423 µs 45.395512 ns \(\times\) 94
    100 4.299567 µs 49.682850 ns \(\times\) 86
    1,000 4.627851 µs 76.036001 ns \(\times\) 61
    10,000 14.572367 µs 573.208513 ns \(\times\) 25
    1 M 219.302064 µs 32.646665 µs \(\times\) 6.5
    100 M 22.361447 ms 13.238496 ms \(\times\) 1.7
  • Matrix sum
    dimensions Nd4j sum Neanderthal asum Neanderthal speedup
    10 \(\times\) 15 5.904680 µs 50.816077 ns \(\times\) 116
    100 \(\times\) 100 21.092154 µs 466.800109 ns \(\times\) 45
    1,000 \(\times\) 1,000 52.358064 µs 31.799570 µs \(\times\) 1.7
    100 \(\times\) 10,000 103.955403 µs 33.643302 µs \(\times\) 3
    10,000 \(\times\) 100 64.909655 µs 32.994745 µs \(\times\) 2
    10,000 \(\times\) 10,000 15.360876 ms 13.070991 ms \(\times\) 1.15

Finding the position of maximum absolute value

Often we need to know which element is the largest or smallest.


In the last section, I was joking that calling absolute sum in Nd4j is too cumbersome to show you the code, to make you write some code and measure some of that stuff yourself. Rather that being too cumbersome, it is just cumbersome, as in the case of iamax that I show here:

(defn bench-nd4j-iamax
  (let [x (Nd4j/rand (int-array [n]))
        exec (Nd4j/getExecutioner)
        iamax (org.nd4j.linalg.api.ops.impl.indexaccum.IAMax. x)]
    (quick-bench (.getFinalResult (.execAndReturn exec iamax)))))

Notice how you need to setup the executor and keep it around. I do not measure the executor creation time here. But beware, wrapping lots of similar Nd4j code inside simpler API might seem as a low hanging fruit, but might be a trap if you don't find a way to properly handle the executor life-cycle.


(defn bench-neanderthal-iamax [n]
  (with-release [x (fmap! random (fv n))]
    (quick-bench (iamax x))))


  • Vector iamax
    dimension Nd4j sum Neanderthal asum Neanderthal speedup
    2 1.212382 µs 75.503383 ns \(\times\) 16
    10 1.234349 µs 66.782601 ns \(\times\) 19
    100 1.346079 µs 63.711496 ns \(\times\) 21
    1,000 2.080715 µs 305.853934 ns \(\times\) 7
    10,000 10.239278 µs 2.540639 µs \(\times\) 4
    1 M 209.913386 µs 116.195962 µs \(\times\) 1.8
    100 M 20.508549 ms 15.084416 ms \(\times\) 1.4

So, why?

If you run the code from this post on your machine you'll probably get different numbers but proportionally comparable results. You'll notice one more thing: Neanderthal benchmarks will complete in a short time, while Nd4j benchmarks, especially the ones where the function itself is very fast, will drag on while the CPU usage spikes. That's because Nd4j tries to use Java's garbage collector to explicitly handle external resources. It goes against all warnings in Java's GC documentation to not do that. I guess with enough care it works well for DL4J folks, but here is one demonstration how it can be a source of incredible slowdown under pressure. Use with care.

However, this is not the reason for the difference in speed of the computations that we measure.

I have shown you that Neanderthal is much faster, but the title implies that this article will answer why is that so. Sorry for the creative title, the answer might be something you discover inspired by this series.

It seems to me I have shown you enough of the capabilities of both Nd4j and Neandeathal on the CPU. The next article will probably explore the GPU side of both libraries.

Neanderthal vs ND4J - vol 5 - Why are native map and reduce up to 100x faster in Clojure? - November 14, 2018 - Dragan Djuric