# More fun with NumPy, CuPy, Clojure and GPU acceleration. Hold my Cider 2!

Need help with your custom Clojure software? I'm open to (selected) contract work.April 14, 2020

Please share: Twitter.

These books fund my work! Please check them out.

NumPy and CuPy converted my computations to `float64`

despite being told to
work with `float32`

, and that prompted some readers to discard my previous article
as *faulty*. My article doesn't make CuPy and NumPy more or less perfect, though,
and that forceful conversion stings everyone who is using Nvidia's
GTX gaming GPUs. I don't have grants that would cover $10,000 for uncrippled
`float64`

support on Tesla GPU. So, if you're poor like me, that article
probably offered valuable information.

I've received some nice suggestions, too. Some readers were inspired by my article
and re-implemented that `corrcoef`

function in their favorite Python tools
(Numpy, CuPy, PyTorch) and found out that the advice from my book Numerical Linear Algebra for Programmers
is not constrained to Clojure, and that it is fast in NumPy and CuPy, too.
Of course! Most of my book applies to *any* language, and is written for *all programmers*,
not just Clojure programmers. Seriously - check it out, and while you're
there, maybe you'll also be interested in Deep Learning for Programmers.

Unfortunately, these readers haven't sent me the code (yet), and I didn't have the opportunity to compare it to Clojure on my machine, but I trust them that they were satisfied with the result.

There is an interesting way to compare my unambitious textbook implementation
with NumPy and CuPy without that `float64`

conversion impact. Simply, use
`float64`

computations in Neanderthal (),
too, just for the sake of comparison.
In this article, I compare implementation
of Pearson's correlation coefficient (a widely used statistical function) on
a range of large and small matrices.

## TL;DR

The following two tables contain the summary of the micro-benchmarks that I run on my CPU and GPU.

Since NumPy and CuPy forcefully do `float64`

computation regardless of
data type, there is one column for them. Neanderthal () follows what
it's told, so I include both `float64`

(`double`

in Javaspeak) for
direct comparison, and `float32`

(`float`

) as a reference.

NumPy and CuPy code is run from the interactive Python console, while Clojure uses interactive REPL.

Note that my code is not a port of NumPy, nor a drop-in replacement; this is simply a head-to-head comparison of two different implementations. Python has its strengths and weaknesses, but JVM has its own constraints, too, so don't blame me for either of them, nor expect that I care about fixing them.

### CPU i7 4790K

All time is reported in milliseconds (lower is better).

size | NumPy | Neanderthal double | Neanderthal float |
---|---|---|---|

1K×100K | 949 | 686 | 350 |

1K×10K | 113 | 111 | 56.2 |

100×100K | 61 | 21 | 10.5 |

100×10K | 5.47 | 2.22 | 1.19 |

10×10K | 0.29 | 0.16 | 0.16 |

10×1K | 0.085 | 0.026 | 0.028 |

10×100 | 0.057 | 0.011 | 0.012 |

10×1M | 43 | 18 | 15 |

10K×10K | 11476 | 8079 | 4902 |

### GPU Nvidia GTX 1080Ti

All time is reported in milliseconds (lower is better).

size | CuPy | Neanderthal double | Neanderthal float |
---|---|---|---|

1K×100K | 686 | 401 | 20 |

1K×10K | 70 | 42 | 2.60 |

100×100K | 9.9 | 9.2 | 2.25 |

100×10K | 1.38 | 1.56 | 0.3 |

10×10K | 0.21 | 0.22 | 0.14 |

10×1K | 0.18 | 0.67 | 0.1 |

10×100 | 0.18 | 0.1 | 0.1 |

10×1M | 6.32 | 6.57 | 0.26 |

10K×10K | 5078 | 2597 | 117 |

Honestly, if the several lines of code from my *textbook* implementation
reached within an order of magnitude of professional libraries such as
NumPy an CuPy, that would still be rather nice. I can't say that these
results do not make me satisfied :)

## The Python code

To prevent folks bugging me about why I called NumPy from Clojure, I wrote a Python script this time. Micro benchmarks in Python code that I see on the Internet are clunky at best, and more sophisticated tools are equally ridiculous. Having to call my functions in a string- embedded DSL? Thanks, but no.

So I did the lazy thing, and used a simple
`time_ns()`

based measurement of a loop that calls the `corrcoef`

function
enough times to mitigate the imprecision of the timestamp that Python
gets. I believe that's a good approximation of how would a user experience
this function when calling it from the code. It's good enough considering
that we are calling exactly one function that delegates to native code anyway.

On the CPU, it looks like this:

def numpy_corrcoef(m, n, reps): a = numpy.random.random(m * n).reshape(m,n) s = time.time_ns() for x in range(reps): numpy.corrcoef(a) e = time.time_ns() print((e-s)/reps/(10**6), " milliseconds") numpy_corrcoef(1000, 100000, 1) numpy_corrcoef(10, 10000, 1000) # etc.

On the GPU, it takes care of synchronization:

def cupy_corrcoef(m, n, reps): a = cupy.random.random(m * n).reshape(m,n) cupy.cuda.Stream.null.synchronize() s = time.time_ns() for x in range(reps): cupy.corrcoef(a) cupy.cuda.Stream.null.synchronize() e = time.time_ns() print((e-s)/reps/(10**6), " milliseconds") cupy_corrcoef(1000, 100000, 1) # etc.

## The Clojure code

In Clojure, we have sophisticate benchmarking capabilities right
there in the REPL through the Criterium library. Although would
like to use its `quick-bench`

, as it gives reliable and precise
measurements with one simple line of code, I opted to follow the
Python way, and use `time`

and the appropriate loop, to make it as
similar as the Python code as possible.

(defn bench-neanderthal-cpu [m n factory reps] (with-release [x (rand-uniform! (ge factory m n))] (time (dotimes [i reps] (matrix-correlation! x)))))

(defn bench-neanderthal-gpu [m n factory reps] (with-default (cuda/with-engine factory default-stream (with-release [x (rand-uniform! (cuge m n))] (synchronize!) (time (do (dotimes [i reps] (release (matrix-correlation! x))) (synchronize!)))))))

In the previous article you can find the code for `matrix-correlation!`

.
That implementation did a full general matrix multiplication, although
the correlation matrix is symmetric, since for most matrix sizes it is
faster to do twice as many flops as necessary instead of using logic to
avoid duplicate work. Especially on the GPU, the matrix should have to be
unrealistically large for the symmetric matrix multiply to pay off.

Also, consider what `corrcoef`

computes: correlation between different variables.
There will typically be much fewer variables than observations, so the matrix
will be a long rectangle. There, my old implementation will be optimal, even
for large matrices.

However, I confirmed by measurement and comparison that NumPy uses symmetric
multiply for large matrices. This cannot be seen directly from the code, but
is 100% certain since it takes NumPy 1.5 second to do the full matrix multiply
of this matrix, and around 1 sec for the whole `corrcoef`

.

That's why I added this option to Neanderthal (), and made a simple heuristic
to use symmetric multiply for huge fat matrices (the `mmt!`

function, and
GEMM for everything else. That made the code less elegant, and several lines
longer, but, hey, everything for performance! :)

(defn matrix-correlation! [a!] (with-release [ones (entry! (vctr a! (ncols a!)) 1.0) work (mv a! ones)] (let [a-mean (rk! (/ -1.0 (dim ones)) work ones a!)] (let-release [large (and (< 512 (mrows a!)) (< 8196 (ncols a!))) sigma (sy a! (mrows a!))] (if large (mmt! a-mean sigma) (mm! 1.0 a-mean (trans a-mean) (view-ge sigma))) (sqrt! (copy! (dia sigma) work)) (with-release [sigma-x*y (rk work)] (div! (view-ge sigma) (view-ge sigma-x*y))) sigma))))

## The books

I guess that this is enough info. If you feel inspired, you can try to write the Python equivalent, and see whether it is faster than the NumPy's and CuPy's implementation.

The books that I write are written for all programmers, not just Clojure programmers. I use Clojure because it is very enjoyable due to its elegance and power, but almost all the lessons in the book are applicable in other environments. Clojure helped me in writing the whole stack of these libraries that support CPU, GPU, CUDA, OpenCL, etc. in surprisingly small numbers of lines of code.

It can help you, too, to be more productive, but even if Clojure is not your cup of tea, the books can help you learn this stuff and apply it in other programming languages, since the books make clear connection from theory to implementation, without skipping any steps in between!

The key difference compared to other books, is that here you can try each small building block, which is usually one or a few lines of code directly in the interactive REPL, see the results right away, experiment, and understand the theoretic concept that is demonstrated. Equally important is that the performance of that code is top notch right away!

There's already 400 pages written. By subscribing to these books you get the access to the drafts right away, and the final book this summer. There's no middle man; 100% of the proceeds goes towards my work on these open source libraries!