Why I sometimes like to write my own number crunching code
June 22, 2021
Let's start right in the middle of the story, and play with cosine similarity. I'll tell you why in a few minutes.
I assume that most programmers have forgotten their math classes, so "cosine similarity" sounds somewhat grandiose. A quick skim of Wikipedia might puzzle you even more but, scrolling down to the definition, you can see that it's just a normalized dot product. I know, I know - now you might wonder what the dot product really is. Please read my recent linear algebra Hello World article if that's the case.
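In symbols, the cosine similarity of two vectors \(x\) and \(y\) is their dot product divided by the product of their norms: \(\frac{x \cdot y}{\|x\|_2 \, \|y\|_2}\).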
A little warm-up coding
We are programmers; it will all be clear after we code this.
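If you want to follow along in the REPL, the snippets below assume that Neanderthal's functions are referred into the namespace. Here's a sketch of the requires I have in mind (the exact namespace layout may differ slightly between Neanderthal versions):
;; a sketch; adjust the :refer lists to your Neanderthal version
(require '[uncomplicate.commons.core :refer [let-release with-release]]
         '[uncomplicate.neanderthal
           [native :refer [dv dge]]
           [core :refer [dot nrm2 mm mmt trans dia copy rk view-ge entry]]
           [vect-math :refer [sqrt! inv! mul!]]
           [math :refer [sqrt]]])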
Here are two vectors:
(def x (dv 1 2 3))
(def y (dv 30 20 10))
Dot product
Now, the dot product can be naively implemented as follows.
(defn naive-dot [x y] (reduce + (map * x y)))
(naive-dot x y)
=> 100.0
It's much easier and faster to use the staple linear algebra function called dot, available in Neanderthal:
(dot x y)
=> 100.0
Normalization
Next, the normalization part. This too might seem mysterious, but it is, again, a staple LA function, nrm2 (the second norm of a vector). Not only is it a staple function, it is also just the square root of the dot product of a vector with itself.
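In symbols: \(\|x\|_2 = \sqrt{x \cdot x}\), which is exactly what the naive implementation below computes.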
(defn naive-nrm2 [x] (sqrt (naive-dot x x)))
(naive-nrm2 x)
=> 3.7416573867739413
A better choice in higher-level code is to use the optimized nrm2 function (available in Neanderthal, or any other similar library):
(nrm2 x) (nrm2 y)
Cosine similarity
This is the cosine similarity between vectors x and y:
(/ (dot x y) (nrm2 x) (nrm2 y))
=> 0.7142857142857142
OK, nice exercise, but not that interesting. There's no particularly useful insight there. I implemented everything naively, and then wrote my own cosine-similarity, which is, after all, the same thing as any library's cosine-similarity (it's probably faster, but that's irrelevant here).
We now have a way to compute a measure of how similar two vectors are (cosine similarity): a number between -1 (opposite) and 1 (they match), with 0 meaning not similar at all.
In our case, x and y are somewhat similar (0.714), which does not tell us that much on its own, but it can tell us that they are more similar to each other than to some third vector z, if the cosine similarity with z is, say, 0.5.
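To make the extremes concrete, here's a quick check with a perpendicular pair and an opposite pair of vectors, using the same dot and nrm2 as above:
(/ (dot (dv 1 0) (dv 0 1)) (nrm2 (dv 1 0)) (nrm2 (dv 0 1)))
=> 0.0
(/ (dot (dv 1 0) (dv -1 0)) (nrm2 (dv 1 0)) (nrm2 (dv -1 0)))
=> -1.0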
So what?
This was only a warm-up, and the insight follows after the intermezzo: we can compute the cosine similarity of many vectors in a batch much faster than if we did it one by one. But why? And when? Well, if we had only blindly used what the library provides, we probably wouldn't even be asking these questions!
Now, the motivation for all this
A few weeks ago a Clojurist was asking about a weird result returned by a function in a Clojure/Java machine learning library. There's nothing unexpected there; bugs are caught in all software all the time, and the author of that (open source and free) library fixed the bug immediately. It's a good pretext to make a point, though, so let's make it :)
We often rely on software libraries. Layers of abstraction make our work possible. Sometimes we stretch the tendency to rely on 3rd party abstractions too far, intentionally giving up control of the central part of our code. The irony is that many times the simplest thing to do is to implement our thing using the right lower layer, rather than to configure a function call of the higher layer.
In this case, the bug was that a cosine similarity function returned a value larger than 1. The user (correctly) wondered whether that's an impossible value for cosine similarity. Several people helped with theoretical knowledge and practical guesses about where the bug might be. How could such a simple thing as cosine similarity be the source of any fuss? Probably because most programmers rely on libraries so much that they don't build confidence even in the simplest things related to math.
That's why I like not to skip the step of implementing custom functions for the central parts of the algorithms that I rely on. Sometimes the existing solutions serve well, and I use them in the final code, but by building at least a naive prototype I build up my understanding and confidence. I have also found that my solution is often much better than the one implemented in a library. If that seems unexpected, consider that machine learning libraries are frequently written by grad students on their path to discovery: domain experts with poor programming skills. Or it might be the case of a good programmer who only barely understands the domain…
Is that it? NO :)
So, we implemented our naive cosine similarity (nothing special).
(defn cosine-similarity [x y] (/ (dot x y) (nrm2 x) (nrm2 y)))
Do we choose this, or the (hypothetical) library function with the same name, and proceed with looping, mapping, reducing, and other Clojure-fu? NO! That would limit us to calling such a function for each pair of vectors. What works well in number crunching is finding an operation that does the same for the whole bunch in one sweep!
But, how do we do that? If we hadn't explored the internals of cosine similarity, that'd be a mystery. But, since we have, we can recognize that matrix multiplication is a bunch of dot products. Explaining the details of this insight would be too much of a digression; I cover this kind of knowledge in the Numerical Linear Algebra for Programmers book. If you'd like to see this stuff applied to a full-blown (and super fast) Deep Learning library, then you might be interested in Deep Learning for Programmers. Both books are part of my Interactive Programming for Artificial Intelligence book series.
So, let's put these two vectors, x and y, in a matrix and call it xs.
(def xs (dge 2 3 [[1 2 3] [30 20 10]] {:layout :row}))
When we multiply two matrices, we get a third matrix, and each entry in the result is the dot product of the respective row of the first matrix and column of the second. For example, entry (2,3) is the dot product of the second row of the first and the third column of the second matrix.
In our trivialized case, we have one matrix, xs, with two rows, and we don't have any other matrix. What now? Since we want all combinations of dot products of our vectors, we can simply matrix multiply xs by its transpose!
(mm xs (trans xs))
=> #RealGEMatrix[double, mxn:2x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →      14.00  100.00
   →     100.00 1400.00
   ┗                       ┛
Do you recognize the familiar number 100.0? That's the dot product of x and y.
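We can verify that programmatically, too. Neanderthal's entry reads a single matrix element (indices are zero-based), so the (0,1) entry of the product should be exactly the dot product of x and y:
(= (dot x y) (entry (mm xs (trans xs)) 0 1))
=> true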
Notice that this matrix is symmetric, so we can use a dedicated function that is optimized for multiplying a matrix by its own transpose, mmt.
(def x-dots (mmt xs))
Now we only need to normalize these. But how? There's no trick with matrix multiplication for nrm2. Or is there?
Remember that we also learned how nrm2 is computed: as the square root of the dot product of a vector with itself.
The diagonal holds the dot products of x with x \((1,1)\), and y with y \((2,2)\)! We only need to take the square root of each diagonal entry.
While we're at it, we might also invert it, so we can later deal with multiplication instead of division (but this is a minor detail).
All these vector operations that we need are available in Neanderthal's vect-math namespace.
(def x-norms (inv! (sqrt! (copy (dia x-dots)))))
Now we have the dot products in the matrix x-dots, and the norms in the vector x-norms.
Whoops… How do we combine that? The formula for cosine similarity uses each combination of norms for vector pairs, not single norms. No problem! We pull the rk function from our pocket. It does the outer product of two vectors, making a matrix that contains the product of each pair of their entries!
Again, this is not some one-off trick, but standard linear algebra stuff explained in the book.
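If rk is new to you, here's a tiny illustration under the same assumptions: entry (i,j) of the result is the product of the i-th entry of the first vector and the j-th entry of the second.
(entry (rk (dv 1 2) (dv 3 4)) 1 0)
=> 6.0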
(mul! (view-ge x-dots) (rk x-norms x-norms))
=> #RealGEMatrix[double, mxn:2x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       1.00    0.00
   →       0.71    1.00
   ┗                       ┛
BAM! Here it is! 0.71! Ones on the diagonal show that each vector is perfectly similar to itself :)
This might seem like much fuss for nothing! Isn't the vector code simpler? Yes, but it can only compute one combination of two vectors. In this example, we've done the most trivial case, with the same two vectors. But this code handles as many vectors as we give it! No need for loops or recursion!
Pimp my cosine similarity!
I've added some bookkeeping code that cleans up memory and reuses matrices where possible, for maximum speed gains.
Note the plural in the function name.
(defn cosine-similarities [a]
  (let-release [a-dots (mmt a)]
    (with-release [a-norms (inv! (sqrt! (copy (dia a-dots))))
                   ab-norms (rk a-norms a-norms)]
      (mul! (view-ge a-dots) ab-norms)
      a-dots)))
Here's a matrix with 4 vectors.
(def a (dge 4 3 [[1 2 3] [-3 8 0] [30 20 10] [-7 1 -5]] {:layout :row}))
And, here are their cosine similarities:
(cosine-similarities a)
=> #RealUploMatrix[double, type:sy, mxn:4x4, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ┓
   →       1.00    *       *       *
   →       0.41    1.00    *       *
   →       0.71    0.22    1.00    *
   →      -0.62    0.39   -0.74    1.00
   ┗                                       ┛
See? All the combinations of the vectors with one single blow! And, if we have thousands of vectors it's orders of magnitude faster than calling the single cosine similarity function thousands of times!
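If you'd like to get a feel for that on your own machine, here's a rough sketch of a throwaway benchmark (the sizes are arbitrary, and entry just samples one element of the result before it is released):
(def big (dge 3000 300 (repeatedly (* 3000 300) rand) {:layout :row}))
(time (with-release [sims (cosine-similarities big)]
        (entry sims 1 0)))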
Vectorization FTW!