Clojure Numerics, Part 1 - Use Matrices Efficiently

June 26, 2017

It's time to step back from theory and look a bit more into implementation details of all that matrix stuff. In this post, I give an overview of data structures that I use to represent matrices in Clojure, methods to manipulate them, and a few tips and tricks that you can use to make your code fast.

Before I continue, a few reminders:

The namespaces we'll use:

(require '[uncomplicate.neanderthal
           [native :refer [dv dge fge dtr native-float]]
           [cuda :as cuda]
           [opencl :as ocl]
           [core :refer [copy copy! row submatrix scal! transfer! transfer mrows ncols nrm2 mm cols view-tr]]
           [real :refer [entry entry!]]
           [linalg :refer [trf tri det]]]
         '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.clojurecl.core :as clojurecl])

Dense Matrices

A matrix is a rectangular array of numbers. That's the usual mathematical definition. Something like this: \(A=\begin{bmatrix}1&1&1&2\\2&3&1&3\\1&-1&-2&-6\end{bmatrix}\).

When we have lots of data, or want top performance, we have to be aware of (at least) three major issues:

  • Are we wasting space by storing unnecessary elements (usually zeros)?
  • Does the way we store the data optimally matches the way we access (read/write) and compute data?
  • Are we using the right algorithm for the data structure at hand?

Let's say that the data is dense, so we have to store all entries. Why the order we store the elements is important?

Most programming languages have means to construct and manipulate 2-d arrays. Code such as a[i][j] is a straightforward way to access the element (number) at i-th row and j-th column of a Java (or C) array of arrays. Straightforward, and good enough for small or medium matrices. A good way to waste performance though!

The catch here is that, although a programming language might provide a fine abstraction for 2-dimensional (or even N-dimensional) arrays, typical computer memory holds data only in one dimension. If we are careful, the data will be stored in consecutive locations, if not - the chunks will be scattered all over! This is a big problem because memory access times are highly variable, by orders of magnitude. When a modern processor needs to access some data, it is best if the data is in registers. The next best is the first level of cache, which is slower. If it's not there, then from the second level, still much slower, and so on. Once the data is fetched to the faster, but smaller, level closer to the processor, the data that was in that place is kicked out, and the data that was near the data that was fetched is also brought in, in the hope that there is increased chance that it will be asked for next. If we access the numbers that are near each other in memory, we will have more cache hits (good) than cache misses (bad).

Take a look at how Java handles access to 2-d arrays: a[343][4098]. You might think that this gets you direct access to the element at index \(343,4098\) but it does not. Java arrays are always one-dimensional. In the case of 2-d arrays, the first-level array actually holds references to m object arrays, each holding the actual numbers, n of them. a[343][4098] fetches 343-rd reference from the first array, and then it's 4098-th element. This is a huge problem because those second level arrays will be scattered through memory.

What I want, is to lay \(A\) in memory in a way that is efficient, but still easy to work with. Obviously, those two dimensions have to be projected into one. But how? Should it be

\(\begin{bmatrix}1&1&1&2&|&2&3&1&3&|&1&-1&-2&-6\end{bmatrix}\),

or \(\begin{bmatrix}1&2&1&|&1&3&-1&|&1&1&-2&|&2&3&-6\end{bmatrix}\)?

The answer is: it depends on how your algorithm accesses it most often.

Neanderthal gives you both options. When you create any kind of matrix, you can specify whether you want it to be column-oriented (:column, which is the default), or row-oriented (:row). In the following example, we will use CPU matrices from the native namespace. The same options also work for functions that create GPU CUDA matrices (cuda namespace), or OpenCL's GPU and CPU matrices (opencl namespace).

(dge 3 2 [1 2 3 4 5 6])
#RealGEMatrix[double, mxn:3x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       1.00    4.00
   →       2.00    5.00
   →       3.00    6.00
   ┗                       ┛

This created a dense \(3\times{2}\) column-oriented matrix. Notice how the 1-d Clojure sequence that we used as data source has been read column-by-column.

The other option is row orientation:

(dge 3 2 [1 2 3 4 5 6] {:order :row})
#RealGEMatrix[double, mxn:3x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       1.00    4.00
   →       2.00    5.00
   →       3.00    6.00
   ┗                       ┛

In this case, the elements have been laid out row-by-row.

When printing the matrix out in the REPL, we can also see the information about its structure.

Memory, Reuse, and (Co)Location

Neanderthal can support any kind of memory implementation or layout. This is what's possible, but what is important is what works well in real use cases. We could have implemented a dense matrix storage that is backed by Clojure's persistent vectors-in-vectors. Something like this:

[[1 2]
 [3 4]
 [5 6]]

This is nice, looks pretty and uncluttered, and it's products can even be calculated quickly! Why not this, then? Keep in mind that today's computers have amazing performance. While the data is reasonably small, everything and anything will be fast. But, once the size gets a bit more serious, the naive approach tragically falls apart. It is common that a naive and elegant implementation of an algorithm works fine while you're playing with toy examples, but starts running in hours (or weeks!) once you get to even moderately large sizes.

For anything that involves many numbers and operations on them, we have to avoid objects and use primitive numbers. Clojure persistent vectors and sequences hold boxed numbers that are scattered all over the memory. That has terrible access times and does not take advantage of how memory cache works. It is good and flexible for data structures that hold http requests, employee records, and other objects, but is a bad match for numbers.

The next thing we need to take care is that those primitive numbers are co-located in memory. We need to make sure the data is as tightly packed in memory, so we get as many cache hits as possible, and have as much data as close to the processor as possible.

Next, we want to avoid memory allocations. Yes, JVM is good at that, but it can still be relatively slow if we do it too often compared to (destructively!) reusing structures that we already have. We also have to be aware that, for example, GPU memory is at a premium, and that JVM garbage collector can not manage memory on the GPU. Sometimes we need to reuse memory because, simply, there is not enough space to create new instance. Isn't that mutability a bad thing? Yes, it is! That's why, generally, it is a good idea to keep these mutations tamed inside the main function of your algorithm, and, for example, do mutations inside a tight loop, not visible to outside world. That is a technique that is well known and used with Clojure transients.

I also want to do as little data movement as possible. Neanderthal is very efficient about this - it does not need to copy memory to make it available to highly efficient computational routines outside JVM optimized for native platforms. Then, do not degrade that performance by needlessly moving the data yourself.

Recall that Neanderthal even supports different devices: CPU's, but also GPU's and even other accelerators. These are typically discrete devices having their own memory. No matter how fast these devices are, transferring data to and from that separate memory is rather slow, compared to accessing data in CPU registers.

Having all this in mind, let's see how Neanderthal can help you do the right thing and avoid shooting yourself in the foot, or, worse, in the hip.

How Neanderthal Handles Memory

Typically, each Neanderthal data structure is backed by an efficient primitive buffer, that holds raw primitive data. This buffer is what the implementer of a highly-efficient computation routine (such as matrix multiplication) uses, but the caller of the routine should, typically, not need to access this directly.

The data structure may access the whole buffer, on only a part. Let's look at a few examples:

(row (dge 2 3 (range 6)) 1)
#RealBlockVector[double, n:3, offset: 1, stride:2]
[   1.00    3.00    5.00 ]

Note how the resulting vector has offset 1 and stride 2. That's because it does not hold a copy of original matrix's buffer. It reuses the original buffer, and just takes care to access the right entries.

In this case, the original matrix is the owner of the buffer, which is responsible for it's eventual releasing (necessary for GPU data, optional for native CPU buffers), and all sub-structures that have been created from it are views that look at the same memory. That also means that, when views change the data, the original structure also changes!

(let [a (dge 2 3 (range 6))
      b (submatrix a 0 1 1 2)]
  (scal! 100 b)
  a)
#RealGEMatrix[double, mxn:2x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       0.00  200.00  400.00
   →       1.00    3.00    5.00
   ┗                               ┛

By scaling the submatrix \(B\) by 100, we also changed the "master" matrix. Neanderthal offers two variants of many functions: pure and destructive. The destructive variant, scal!, changes the original data, while the pure variant scal, would keep the data intact, and work on the copy of the data instead. That is often desirable, but can be inefficient, especially if we work with large data, or change data a lot. You'll probably have to use the destructive variants more often than not, and it's not a matter of taste, but of brutal necessity.

When it comes to memory movement, there are two types in Neanderthal: copy and transfer. Copy works in the same memory space, for example in the main native RAM, or in the GPU RAM. You can copy data from a native matrix to a native matrix, but you have to transfer the data from a native matrix to a GPU matrix, or from a Clojure list to a native vector. Copying is obviously faster, so it should be preferred when necessary, while transfer may involve copying and other kinds of memory movements and coercions under the hood.

When it comes to memory copying and transfer, there are also destructive and pure variants: The destructive copy! function moves the data from the existing source to the existing and pre-allocated destination. The pure copy function accepts the source and returns a fresh objects with the same contents, which is not a view of the buffer, but a master of a separate copy of its own.

Here's a quick demonstration of the copy operation:

(let [a (dge 3 2 (range 6))
      b (dge 3 2)]
  (copy! a b)
  (entry! a 1 1 800)
  (copy b))
#RealGEMatrix[double, mxn:3x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       0.00    3.00
   →       1.00    4.00
   →       2.00    5.00
   ┗                       ┛

I've created matrices a and b, then copied a to b, overwriting b, then changed an element of a, and finally created a new copy of b. Since b has its own buffer independent of a, it is unaffected by the changes to a.

Copy works whenever objects are compatible, meaning they have the same structure, and are in the same memory space.

If I want to move data from a Clojure list to a matrix, I need to transfer it. Constructors, such as ge, or dge, will attempt to transfer the source, as we've already seen.

(let [a (dge 3 2)]
  (transfer! (list 1 2 3 4 5 6) a))
#RealGEMatrix[double, mxn:3x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       1.00    4.00
   →       2.00    5.00
   →       3.00    6.00
   ┗                       ┛

Or, the other way, from a native matrix to a Java array:

(let [a (dge 3 2 (range 6) {:order :row})]
  (seq (transfer! a (double-array (* (mrows a) (ncols a))))))
0.0 1.0 2.0 3.0 4.0 5.0

GPU Matrices and Explicit Memory Control

Let's do something exotic and transfer a matrix from an AMD GPU to an Nvidia GPU:

(clojurecl/with-default
  (ocl/with-default-engine
    (cuda/with-default-engine
      (with-release [amd (ocl/clge 2 3 (range 6))
                     nvidia (cuda/cuge 2 3)]
        [(scal! (nrm2 amd) amd)
         (transfer! amd nvidia)
         (transfer native-float nvidia)]))))
'(#CLGEMatrix(float  mxn:2x3  layout:column  offset:0) #CUGEMatrix(float  mxn:2x3  layout:column  offset:0) #RealGEMatrix(float  mxn:2x3  layout:column  offset:0)
   ▥       ↓       ↓       ↓       ┓
   →       0.00   14.83   29.66
   →       7.42   22.25   37.08
   ┗                               ┛
)

We needed some default setup, with those with-default methods, to choose which platforms and GPU cards our code will run on (I have 2 AMDs and 1 Nvidia in my machine at the moment). Otherwise, the code is completely the same as the normal CPU code!

Let's see what happened here. First, a lazy Clojure sequence (range 6) is transferred to a newly created general matrix in the memory of AMD card. That couldn't be done directly, since the sequence is a bunch of lazy function calls that produce 6 objects scattered through memory inside JVM. It first has to transparently be transferred to the native memory outside the JVM, and then efficiently to the memory of the AMD GPU card. Next, we would like to transfer from AMD to Nvidia GPU. Unfortunately, that cannot be done directly. Both cards communicate with CPU only through the PCIe. Thus, the most efficient available options is to pin both memories to native memory, and then do a transparent transfer through it. Finally, we would like the REPL to print the nvidia matrix to make sure the data has been transferred, but when the REPL get hold to that instance, its buffer has already been released by with-release. I've done one more transfer back to the native instance to print it out.

Our data has traveled a lot! This example was done as a demonstration of Neanderthal's ability to find the most efficient way to move data even through a diverse jungle such as this multi-GPU setup. This kind of transfer should be done as little as possible, and only when there is no other choice. Data has to reach the GPU somehow, but let it be only once in its lifetime, not every time you call a computation routine!

A very important new thing that we should notice here is the use of with-release method. It makes sure that the memory outside the JVM is cleaned and released as soon as it is not needed. When it comes to native memory, it is optional; direct byte buffers get cleaned by Java's garbage collector. However, even if it is optional, it is better to release it explicitly, since it may take quite a time until GC comes to it, and by that time you might need that precious space! Even when optional, that method is a great help that can boost performance quite a lot. GPU memory, though, is completely oblivious to JVM's GC. It is under the control of the appropriate GPU driver, and it absolutely has to be explicitly released. There are libraries out there that try to obscure this using techniques such are relying on Java's finalize method, and praying to GC by calling GC.run(). This relies on pure hope, and according to Java documentation, guaranteed to not work reliably.

GPU memory has to be released explicitly, either by calling the release method, or wrapping with with-release macro! There is no way around this - we might like it or hate it, but, as Spider-Man, one of most well-known modern philosophers, says: "With great power comes great responsibility".

Specialized Matrix Storage

Often, the data is not completely dense. For example, the matrix might be triangular, or symmetric, or having all non-zero data near diagonal, or, perhaps, only have a small percentage of non-zero elements scattered around huge spaces filled with zeros. Keeping such data in a dense structure might be wasteful at best or downright impossible, if the dimensions are so great that the zero-filled dense matrix cannot even fit into available memory.

For these specific cases, there is a number of well-researched storage formats. It is a good idea to familiarize with their pros and cons, so you can select the right one when the need arises.

There are four broad groups of storage:

  • Full storage: a matrix is stored densely, and all or only some elements are accessible. Matrix operations might be optimized for specific storage. Examples are:
    • General matrices (ge), which use a dense rectangle,
    • Triangular matrices, (tr), use lower or upper triangle of a dense rectangle,
    • Symmetric matrices (sy) , use lower or upper triangle of dense storage.
  • Packed storage scheme stores special matrices more compactly. It stores only the data that is used.
    • Symmetric packed (sp),
    • Triangular packed (tp).
  • Banded storage: matrices whose data are concentrated on a relatively narrow band around the diagonal, can take advantage and only store that band in memory.
    • General band matrices (gb),
    • Triangular band matrices, (tb),
    • Symmetric band matrices (sb).
  • Sparse storage: for the case where we need to work with huge, but sparsely populated, matrices, there is a number of standard formats for compressing the data and storing only the (rare) entries that matter:
    • Compressed Sparse Row (CSR),
    • Compressed Sparse Column (CSC),
    • Coordinate format,
    • Diagonal format,
    • Skyline format,
    • Block Sparse Row (BSR) format.

Please note that at the time of writing this post Neanderthal does not yet implement all these formats. Some are already here, some are going to be implemented soon, so by the time you read this they might be ready, and for some, you might have to wait.

Dense Triangular Matrices

Describing all of them might be too much for this post, which is already getting rather long, so I'll just give an example of how triangular matrices are being handled, and refer you to the Neanderthal documentation, and future articles for the rest.

First, the basic stuff. I use dtr to create a dense triangular matrix in memory that is lower triangular, and has ones on the diagonal. Then I multiply it with a dense matrix.

(with-release [a (dtr 3 (range 1 7) {:order :row :diag :unit :uplo :lower})
               b (dge 3 4 (range 1 13))]
  (mm a b))
#RealGEMatrix[double, mxn:3x4, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ┓
   →       1.00    4.00    7.00   10.00
   →       3.00    9.00   15.00   21.00
   →      11.00   29.00   47.00   65.00
   ┗                                       ┛

The result is, as you can see, a dense general matrix, since the product generally has all elements as non-zeros.

Let's print out a triangular matrix:

(dtr 3 (range 1 7) {:order :row :diag :unit :uplo :lower})
#RealUploMatrix[double, type:tr, mxn:3x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →     ·1·       *       *
   →       1.00  ·1·       *
   →       2.00    3.00  ·1·
   ┗                               ┛

As you can see, the triangular matrix prints out both the entries that it can change, and the zeroes and ones that are inaccessible.

Now, how such a structure is useful at all? It does not save space (remember, it is dense, but only one half is accessible) and require some additional knowledge to use. It turns out that many numerical techniques produce one or a few triangular matrices as a result. Those results, that consist of two triangular matrices, for example, can be nicely packed into one rectangle, which is laid out as one chunk of memory. Further, such nicely packaged matrices can be used in further operations that are aware of this structure. This is something that many functions from the linalg namespace.

Take as an example the ubiquitous LU factorization. Its job is to find two triangular matrices, that can further be used in solving systems of linear equations, or finding determinants, or other tasks. In Clojure, we can use the trf! and trf functions to compute that:

(def lu (trf (dge 3 3 [1 0 -1 3 -2 3 4 1 1] {:order :row})))
#'user3/lu

lu
#uncomplicate.neanderthal.internal.common.LUFactorization{:lu #RealGEMatrix[double, mxn:3x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       1.00    3.00    4.00
   →      -1.00    6.00    5.00
   →       0.00   -0.33    2.67
   ┗                               ┛
, :ipiv #IntegerBlockVector[int, n:3, offset: 0, stride:1](1 3 3), :master true, :fresh #atom[true 0x2ab70e40]}

L and U factors as stored in one dense matrix, but we can take lower or upper triangular views:

L (lower) factor is:

(view-tr (:lu lu) {:uplo :lower :diag :unit})
#RealUploMatrix[double, type:tr, mxn:3x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →     ·1·       *       *
   →      -1.00  ·1·       *
   →       0.00   -0.33  ·1·
   ┗                               ┛

U (upper) factor:

(view-tr (:lu lu) {:uplo :upper})
#RealUploMatrix[double, type:tr, mxn:3x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       1.00    3.00    4.00
   →       *       6.00    5.00
   →       *       *       2.67
   ┗                               ┛

When we once computed LU for a given matrix, we can reuse that result to find such expensive operations as inverse, solving the linear equations, or determinant at greatly discounted computation price:

(det lu)
-16.0

(mm  (tri lu) (dge 3 3 [1 0 -1 3 -2 3 4 1 1] {:order :row}))
#RealGEMatrix[double, mxn:3x3, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ┓
   →       1.00    0.00    0.00
   →       0.00    1.00    0.00
   →       0.00    0.00    1.00
   ┗                               ┛

That was the inverse, indeed!

This was a long post…

So it should be the right time to wrap it up and remind you that there is more documentation at Neanderthal's website. Next time you feel like doing some high-performance coding, you can check it out :)

Clojure Numerics, Part 1 - Use Matrices Efficiently - June 26, 2017 - Dragan Djuric