Map and Reduce Primitive Arrays Without Clojure Macros

July 5, 2018

I've just released a new version of Fluokitten (), a library that enables "monadic" programming in Clojure. It sounds abstract, but it is really the opposite. I'll show you how you can use its fmap and fold functions as universal replacement for Clojure's map and reduce. Not only that fmap and fold offer more functionality than map and reduce, they even support primitive Java arrays, and are as fast as Clojure's macros amap and areduce. There are much more cool functions there - be sure to check documentation and articles at Flukitten's website.

What map and reduce lack

Clojure's map and reduce work with anything sequential, so they work with arrays too. However, the problems are:

  1. They turn everything into sequences and return sequence.
  2. They turn elements into objects. It impact performance of primitive arrays an order of magnitude.
  3. They do not support in-place, mutable changes of source Java array.
  4. They work only with sequential things.

Fluokitten's elegance at full speed!

I'll just show you Fluokitten's map and reduce equivalents, although it has much more to offer.

fmap is, roughly, a polymorphic alternative to map. Instead of converting data to a specific type, it calls specialized implementation for each data type. fmap! is a destructive, in-place version of fmap. fold is a polymorphic alternative to reduce. foldmap is a combination of map and reduce, that first maps over source arguments, and then reduces the results of mapping in the same pass.

In contrast to map and reduce, they:

  1. They keep the type of the object/container they work on.
  2. They can work with primitives without boxing.
  3. fmap! supports in-place, mutable changes (if applicable for the object's type).
  4. They work with any Clojure or Java object!

What's wrong with amap and areduce macros

Let's demonstrate problems with map, reduce, amap, and areduce.

(require '[uncomplicate.commons.core :refer [double-fn]]
         '[uncomplicate.fluokitten.core :refer [fmap! fmap fold foldmap]])
(def arr (double-array [1 2 3]))

map returns a sequence:

(map inc arr)
2.0 3.0 4.0

The array is unchanged. This is normally desired in Clojure, but primitive arrays are explicitly built for mutability and speed, and sometimes we need to mutate them.

(get arr 0)
1.0

As for the speed, let's measure mapping and reduction of an array of 100.000 elements with criterium. The canonical example is dot product:

\begin{gather*} \vec{x} = [x_1, x_2,\ldots, x_n]\\ \vec{y} = [y_1, y_2,\ldots, y_n]\\ \vec{x} \cdot \vec{y} = \sum_{i=1}^n x_i y_i \end{gather*}

So, create two large arrays and create a function for dot product:

(def arr-a (double-array (range 100000)))
(def arr-b (double-array (range 100000)))

(defn dot-seq [xs ys]
  (reduce + (map * xs ys)))
;;(quick-bench (dot-seq arr-a arr-b))
(dot-seq arr-a arr-b)
3.3332833335E14

Execution time mean : 10.074604 ms

It's not that fast…

Rich Hickey knew that this is a problem, and he created macros amap and areduce to help us with that:

(defn dot-areduce [xs ys]
  (areduce ^doubles xs idx res 0.0
           (+ res (* (aget ^doubles xs idx)
                     (aget ^doubles ys idx)))))
;;(quick-bench (dot-areduce arr-a arr-b))
(dot-areduce arr-a arr-b)
3.3332833335E14

100 times faster than the seq way!

Execution time mean : 94.521934 µs

Much faster, but more verbose. To avoid reflection penalty, we had to sprinkle the code with the ^doubles typehints. This means that our function can work only with doubles, and we will have to create a separate version for floats, longs, ints etc. areduce and map had to be macros, since working with primitives is explicit.

Even in Java, there are special methods for each type of primitive arrays: getFloat, getDouble, getLong, setDouble etc. If amap and areduce were functions instead of macros, they would have to work with all types of data, and would have to box and unbox numbers on every call instead of keeping them primitive. This incures performance penalty of roughly an order of magnitude.

Now, Fluokitten!

(defn p+ ^double [^double x ^double y] (+ x y))
(defn p* ^double [^double x ^double y] (* x y))

(defn dot-fluokitten [xs ys]
  (foldmap p+ 0.0 p* xs ys))
;;(quick-bench (dot-fluokitten arr-a arr-b))
(dot-fluokitten arr-a arr-b)
3.3332833335E14

Execution time mean : 78.358088 µs

Yesss! Even slightly faster than areduce! (your mileage may vary, though!).

Faster than map / reduce, and often at the same speed as amap / areduce. If you're not careful you might be hit by boxing. HotSpot is an unpredictable beast when it comes to hese things. It is always safer to provide primitive typehinted functions as arguments. + and * work, but the arguments will be boxed. There are helper functions in Uncomplicate commons library that can wrap any function into a primitive version.

Where it really shows its convenience is when we work with something a bit more sophisticated than mere + and *:

(defn pointless-id ^double [^double x]
  (Math/exp (Math/log x)))

(defn demo-amap [xs]
  (amap ^doubles xs idx xs
        (pointless-id (aget ^doubles xs idx))))
;;(quick-bench (demo-amap arr-a))
(take 3 (demo-amap arr-a))
0.0 1.0 2.0
Execution time mean : 5.186906 ms

Now, Fluokitten:

;;(quick-bench (fmap! pointless-id arr-b))
(take 3 (fmap! pointless-id arr-b))
0.0 1.0 2.0

The same speed as amap!

Execution time mean : 5.006114 ms

While, if you'd used ordinary map:

;;(quick-bench (doall (map pointless-id arr-b)))
(doall (map pointless-id arr-b))
Execution time mean : 14.845975 ms

Three times slower. JVM spent 5 ms on actual computations, and twice as that on messing up with objects. Sometimes it is the cost that is not important, but in some applications, it is.

And a honorary mention of Neanderthal

Of course, if you really need the speed, there is no better helper with that than Neanderthal ():

(require '[uncomplicate.neanderthal
           [native :refer [dv]]
           [vect-math :refer [log! exp!]]])
(def v (dv (range 100000)))
;;(quick-bench (take 3 (exp! (log! v))))
(take 3 (exp! (log! v)))
#'user/v(0.0 1.0 2.0)

Execution time mean : 110.770416 µs

50 times faster than arrays is quite all right. Keep in mind that it used four cores instead of just one…

Conclusions

Fluokitten helps me a lot. I use it in Neanderthal and other high-performance code to get the elegance of Clojure's higher order map and reduce, while keeping the speed of native and GPU backends. Check it out…

Map and Reduce Primitive Arrays Without Clojure Macros - July 5, 2018 - Dragan Djuric