Neanderthal vs ND4J - vol 5 - Why are native map and reduce up to 100x faster in Clojure?

Need help with your custom Clojure software? I'm open to (selected) contract work.

November 14, 2018

Please share: Twitter.

These books fund my work! Please check them out.

See Part 1, 2, 3, and 4, of this series, for comparisons of matrix operations on the CPU and broadcasting on the CPU and GPU. Now I demonstrate another common area where Neanderthal () is much faster than Nd4j despite both libraries using optimized native routines and the same hardware.

When I wrote the last article, I had too much material, and I left comparisons of mapping and reduction functions for later. Some months pass, without me finding time to write a proper narrative. But, now I think there is no need for much explanation. Here is the code, and I put results that I get on my old Intel i7 4790k processor for different sizes of vectors and matrices. What else speaks better than code?

As examples, I typically use functions that I found in the Nd4j documentation and tutorial.

TL;DR: Neanderthal () is typically an order of magnitude faster. For small data even more. For very large, Nd4j catches up.

(require '[uncomplicate.commons.core :refer [with-release release]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.neanderthal
           [core :refer [dot raw row col entry! mv! sum mrows asum axpy! vctr rk!]]
           [native :refer [fge fv]]
           [math :refer [sqrt]]]
         '[criterium.core :refer [quick-bench]])

(import org.nd4j.linalg.factory.Nd4j
        org.nd4j.linalg.api.ndarray.INDArray
        java.util.SplittableRandom)

(let [splittable-random (SplittableRandom.)]
  (defn random ^double [^double _]
    (.nextDouble ^SplittableRandom splittable-random)))

Distance (dot product)

Calculating distance between two vectors or matrices involves both mapping, which is trivially parallelizable, and reduction.

Nd4j

(defn bench-nd4j-distance
  ([n]
   (let [x (Nd4j/rand (int-array [n]))
         y (Nd4j/rand (int-array [n]))]
     (quick-bench (.distance2 ^INDArray x ^INDArray y))))
  ([m n]
   (let [x (Nd4j/rand \f (int-array [m n]))
         y (Nd4j/rand \f (int-array [m n]))]
     (quick-bench (.distance2 ^INDArray x ^INDArray y)))))

Neanderthal

(defn bench-neanderthal-distance
  ([n]
   (with-release [x (fmap! random (fv n))
                  y (fmap! random (fv n))]
     (quick-bench (sqrt (dot x y)))))
  ([m n]
   (with-release [x (fmap! random (fge m n))
                  y (fmap! random (fge m n))]
     (quick-bench (sqrt (dot x y))))))

Speed

Distance between vectors

dimension	Nd4j	Neanderthal	Neanderthal speedup
2	2.865009 µs	64.005593 ns	\(\times\) 45
10	3.055880 µs	62.042329 ns	\(\times\) 49
100	3.104767 µs	66.816429 ns	\(\times\) 47
1,000	3.712104 µs	92.034678 ns	\(\times\) 40
10,000	7.369057 µs	908.968553 ns	\(\times\) 8
1 M	369.138225 µs	68.781036 µs	\(\times\) 5
100 M	39.677566 ms	25.768197 ms	\(\times\) 1.5

Distance between matrices

dimensions	Nd4j	Neanderthal	Neanderthal speedup
10 \(\times\) 15	3.313088 µs	278.192235 ns	\(\times\) 12
100 \(\times\) 100	7.109816 µs	1.115164 µs	\(\times\) 6
1,000 \(\times\) 1,000	357.002435 µs	82.225425 µs	\(\times\) 4
100 \(\times\) 10,000	423.156695 µs	76.504216 µs	\(\times\) 5
10,000 \(\times\) 100	335.000853 µs	81.241423 µs	\(\times\) 4
10,000 \(\times\) 10,000	40.107896 ms	27.167233 ms	\(\times\) 1.5

Sum

Nd4j

(defn bench-nd4j-sum
  ([n]
   (let [x (Nd4j/rand (int-array [n]))]
     (quick-bench (.sumNumber ^INDArray x))))
  ([m n]
   (let [a (Nd4j/rand \f (int-array [m n]))
         d (int-array [0])]
     (quick-bench (.sum ^INDArray a d)))))

Neanderthal

(defn bench-neanderthal-sum
  ([n]
   (with-release [x (fmap! random (fv n))]
     (quick-bench (sum x))))
  ([m n]
   (with-release [a (fmap! random (fge m n))]
     (quick-bench (sum a)))))

Speed

You'll notice that this is the only function on the CPU where Nd4j can be faster than Neanderthal.

Vector sum

dimension	Nd4j	Neanderthal	Neanderthal speedup
2	4.214011 µs	71.435637 ns	\(\times\) 59
10	4.232423 µs	70.562578 ns	\(\times\) 60
100	4.299567 µs	84.437461 ns	\(\times\) 51
1,000	4.627851 µs	120.773336 ns	\(\times\) 39
10,000	14.572367 µs	1.225827 µs	\(\times\) 12
1 M	219.302064 µs	130.245611 µs	\(\times\) 1.7
100 M	22.361447 ms	28.147435 ms	\(\times\) 0.8

Matrix sum

dimensions	Nd4j	Neanderthal	Neanderthal speedup
10 \(\times\) 15	5.904680 µs	84.925045 ns	\(\times\) 69
100 \(\times\) 100	21.092154 µs	1.254015 µs	\(\times\) 17
1,000 \(\times\) 1,000	52.358064 µs	131.412269 µs	\(\times\) 0.4
100 \(\times\) 10,000	103.955403 µs	147.992786 µs	\(\times\) 0.7
10,000 \(\times\) 100	64.909655 µs	129.061267 µs	\(\times\) 0.5
10,000 \(\times\) 10,000	15.360876 ms	27.866048 ms	\(\times\) 0.5

Absolute sum

OK, I am guilty of not making sum as fast as I should have. However, I can offer the asum function, which covers most of sum use cases. See how fast that is in Neanderthal.

Nd4j

Absolute sum is cumbersome to call in Nd4j so I'll skip that code and use sum results from Nd4j for comparison. asum has more work to do anyway, and in my measurements, Nd4j sum was never slower than Nd4j asum.

Neanderthal

(defn bench-neanderthal-asum
  ([n]
   (with-release [x (fmap! random (fv n))]
     (quick-bench (asum x))))
  ([m n]
   (with-release [a (fmap! random (fge m n))]
     (quick-bench (asum a)))))

Speed

Vector asum

dimension	Nd4j sum	Neanderthal asum	Neanderthal speedup
2	4.214011 µs	50.302131 ns	\(\times\) 84
10	4.232423 µs	45.395512 ns	\(\times\) 94
100	4.299567 µs	49.682850 ns	\(\times\) 86
1,000	4.627851 µs	76.036001 ns	\(\times\) 61
10,000	14.572367 µs	573.208513 ns	\(\times\) 25
1 M	219.302064 µs	32.646665 µs	\(\times\) 6.5
100 M	22.361447 ms	13.238496 ms	\(\times\) 1.7

Matrix sum

dimensions	Nd4j sum	Neanderthal asum	Neanderthal speedup
10 \(\times\) 15	5.904680 µs	50.816077 ns	\(\times\) 116
100 \(\times\) 100	21.092154 µs	466.800109 ns	\(\times\) 45
1,000 \(\times\) 1,000	52.358064 µs	31.799570 µs	\(\times\) 1.7
100 \(\times\) 10,000	103.955403 µs	33.643302 µs	\(\times\) 3
10,000 \(\times\) 100	64.909655 µs	32.994745 µs	\(\times\) 2
10,000 \(\times\) 10,000	15.360876 ms	13.070991 ms	\(\times\) 1.15

Finding the position of maximum absolute value

Often we need to know which element is the largest or smallest.

Nd4j

In the last section, I was joking that calling absolute sum in Nd4j is too cumbersome to show you the code, to make you write some code and measure some of that stuff yourself. Rather that being too cumbersome, it is just cumbersome, as in the case of iamax that I show here:

(defn bench-nd4j-iamax
  [n]
  (let [x (Nd4j/rand (int-array [n]))
        exec (Nd4j/getExecutioner)
        iamax (org.nd4j.linalg.api.ops.impl.indexaccum.IAMax. x)]
    (quick-bench (.getFinalResult (.execAndReturn exec iamax)))))

Notice how you need to setup the executor and keep it around. I do not measure the executor creation time here. But beware, wrapping lots of similar Nd4j code inside simpler API might seem as a low hanging fruit, but might be a trap if you don't find a way to properly handle the executor life-cycle.

Neanderthal

(defn bench-neanderthal-iamax [n]
  (with-release [x (fmap! random (fv n))]
    (quick-bench (iamax x))))

Speed

Vector iamax

dimension	Nd4j sum	Neanderthal asum	Neanderthal speedup
2	1.212382 µs	75.503383 ns	\(\times\) 16
10	1.234349 µs	66.782601 ns	\(\times\) 19
100	1.346079 µs	63.711496 ns	\(\times\) 21
1,000	2.080715 µs	305.853934 ns	\(\times\) 7
10,000	10.239278 µs	2.540639 µs	\(\times\) 4
1 M	209.913386 µs	116.195962 µs	\(\times\) 1.8
100 M	20.508549 ms	15.084416 ms	\(\times\) 1.4

So, why?

If you run the code from this post on your machine you'll probably get different numbers but proportionally comparable results. You'll notice one more thing: Neanderthal benchmarks will complete in a short time, while Nd4j benchmarks, especially the ones where the function itself is very fast, will drag on while the CPU usage spikes. That's because Nd4j tries to use Java's garbage collector to explicitly handle external resources. It goes against all warnings in Java's GC documentation to not do that. I guess with enough care it works well for DL4J folks, but here is one demonstration how it can be a source of incredible slowdown under pressure. Use with care.

However, this is not the reason for the difference in speed of the computations that we measure.

I have shown you that Neanderthal is much faster, but the title implies that this article will answer why is that so. Sorry for the creative title, the answer might be something you discover inspired by this series.

It seems to me I have shown you enough of the capabilities of both Nd4j and Neandeathal on the CPU. The next article will probably explore the GPU side of both libraries.