Get Ready for Clojure, GPU, and AI in 2026 with CUDA 13.0
October 30, 2025
Please share this post in your communities. Without your help, it will stay buried under tons of corporate-pushed, AI and blog-farm-generated slop, and very few people will know that it exists.
These books fund my work! Please check them out.
A little anniversary
Did you know that CUDA has been available in Clojure for the last 9 years through ClojureCUDA, and GPU programming through OpenCL for more than 10? I almost forgot about these anniversaries.
Ten years ago most people liked it a lot, starred it on GitHub, patted me on the back, but then concluded that they didn't have an Nvidia card in their laptops, or, if they had GPUs, that they wouldn't have time to learn to think in massively parallel algorithms, or, if they had the time and the will, that there were no GPUs in the servers, so what would they do with their applications even if they created them in Clojure, and so on, and so off :)
But ClojureCUDA and ClojureCL lived on for these 10 years. I used them in creating Neanderthal, Deep Diamond, and Diamond ML, and they proved themselves as simple and reliable tools. I still had trouble convincing Clojure programmers that they can write GPU programs that run as fast as if they'd written them in C++, but interactively, in the Clojure REPL, without C++ hell.
But I'm not easy to shake off! If necessary, I'll continue for 10 more years, for I'm convinced there'll come a moment when Clojure programmers say "hmmm, this is something that we can use and be good at!".
CUDA 13 is here!
I've recently released ClojureCUDA 0.25.0, with support for the latest CUDA 13.0.2!
Why not celebrate that by opening the REPL and coding your first Hello World application on the GPU? I promise it won't be the usual GPU wall of text; this is ClojureCUDA, and it follows the Clojure philosophy by being simple and interactive!
There's not much sense in wielding a GPU to print out "Hello World". It's also not very useful to work with scalar numbers, calling a GPU function to add or multiply two of them. No. Unless you have many, many numbers to crunch, stay with your trusty CPU. For our purposes, many, many numbers will be two vectors of dimension 3 (hey, it's hello world; imagine it's 3 billion). And even when we have many, many numbers, the sheer cost of getting them into GPU memory would destroy any gains in computation speed, so we must also make sure we want to perform many complicated operations on them. Well, we will only do the simple operation of adding these two vectors, and we will pretend that this operation is extra demanding (we're hello-worlders today, we can cheat a bit).
CUDA Hello World
First things first, we require the functions that we'll use.
(require '[uncomplicate.commons.core :refer [release]]
         '[uncomplicate.clojure-cpp :refer [float-pointer pointer-seq]]
         '[uncomplicate.clojurecuda.core
           :refer [compile! context device function grid-1d init launch!
                   mem-alloc-driver mem-alloc-pinned mem-alloc-runtime
                   memcpy-host! module parameters program synchronize!
                   push-context!]])
If it seems to you that this list already looks too large, I agree with you. But, don't be afraid; in Uncomplicate libraries there are so many higher-level helpers that you'll rarely need to touch these. I only use them here because I want to show you that even when we program at the base CUDA level, Clojure can do it interactively and each line can be evaluated by itself, and each intermediate result can be inspected and understood.
I'll use def for poor man's variables, and push-context!, so that everything can be evaluated step by step. The real code would be much simpler; resources can be managed by with-release and with-context (there's a sketch of that at the end of this post).
First, we initialize CUDA and create the CUDA context.
(init)
(def ctx (context (device)))
Unfortunately, CUDA insists on managing contexts in different threads by itself, so we have
to let CUDA know that we want to use the context that we've just created. ClojureCUDA
has some macros that can help with making this simpler, such as with-context, and in-context,
but we'll do this hello world as real ninjas, with basic tools!
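Just to show what those macros buy us, here's a minimal sketch (assuming in-context takes the context and a body, pushes the context, evaluates the body, and pops the context afterwards; it would also need to be added to the require list above, and do-something-on-the-gpu is a hypothetical placeholder):

(in-context ctx
  ;; ctx is current for the duration of the body
  (do-something-on-the-gpu))

But, ninjas that we are, basic tools it is: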
(push-context! ctx)
Many language integrations try to let you write everything in that language, both the
host code that manages GPU computations, and the GPU kernels themselves. So far, they
don't fare too well compared to C++ CUDA, even when it comes to simplicity of such kernel code.
ClojureCUDA doesn't do that, for a reason. We are practical people. We understand that we won't be able to compile kernels ourselves and be competitive with Nvidia. So, we write kernels in the same C++ that Nvidia uses, since kernels are typically not that complicated compared to the host code that manages them. We can load these kernels as strings, and I find it most convenient not to sprinkle these strings throughout my .clj source files, but to load them from .cu files, which contain C++ code that text editors recognize.
We load the kernel source.
(def kernel-source (slurp "test/cuda/examples/jnvrtc-vector-add.cu"))
Next, we compile it.
(def prog (compile! (program kernel-source)))
We create the module…
(def m (module prog))
… and load the add function that was defined in the kernel code. CUDA kernels
are short functions that run on the GPU.
(def add (function m "add"))
This way we get the best of both worlds: we edit short C++ kernels, and then load them into our Clojure REPL and manage the kernels they define. If we change anything, there's no need to recompile everything; just the short .cu file that changed. And this is also interactive, as these tools are available as Clojure functions.
Next, this kernel needs data! We allocate some memory on the GPU. There are a few different memory types that CUDA offers, and even a few CUDA APIs: runtime and driver. Typically, the runtime API is what CUDA uses in C++ code that mixes host and GPU device code, while the driver API is more geared towards tools such as ClojureCUDA. But some CUDA libraries will expect inputs allocated through the runtime API, and ClojureCUDA supports both. We can even mix them!
(def gpu-a (mem-alloc-runtime (* Float/BYTES 3)))
(def gpu-b (mem-alloc-driver (* Float/BYTES 3)))
The data needs to be transferred to GPU memory. There are many functions in CUDA for doing that, one for every combination of different argument types. ClojureCUDA simplifies this a lot with protocols, and usually memcpy-host! will find the right way to transfer the data. We also need a place to keep the results, unless we want to overwrite one of the two input arrays.
(memcpy-host! (float-pointer [1 2 3]) gpu-a)
(memcpy-host! (float-pointer [2 3 4]) gpu-b)
(def gpu-result (mem-alloc-pinned (* Float/BYTES 3)))
That's a lot of work to set everything up! Let's compute it at last!
(launch! add (grid-1d 3) (parameters 3 gpu-a gpu-b gpu-result))
The kernels are launched asynchronously. That means that the launch! function returns
as soon as it puts the kernel in the computation queue of the GPU, typically before the kernel
has actually been executed; there might be 1000 kernels in the queue before this one. We can
explicitly wait until the kernel has completed its work by calling synchronize!.
(synchronize!)
Since we now know that the result is in the gpu-result array, we have to move it back to the host if we want to see it.
(pointer-seq (memcpy-host! gpu-result (float-pointer 3)))
=> (3.0 5.0 7.0)
Yeah, I know, a dozen lines of code for a simple addition of two vectors. But hear me out: the management code for more demanding kernels is not much more complicated. So, it's a dozen lines for this simple case, but it will be a dozen lines for some real crunching. Or 23 lines, but not 2000 lines.
Plus, there are so many functions in the libraries for computing vectors, matrices, and tensors, that you'll write your own kernels only occasionally, and you'll still get great speed, once you learn the basics.
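To give you a taste, here's a minimal sketch with Neanderthal, mentioned above (assuming Neanderthal is on the classpath; cuv, with-default-engine, and with-default are the names from its CUDA hello-world examples as I recall them, so check the documentation for the exact API):

(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.clojurecuda.core :refer [with-default]]
         '[uncomplicate.neanderthal.core :refer [axpy! sum]]
         '[uncomplicate.neanderthal.cuda :refer [with-default-engine cuv]])

(with-default
  (with-default-engine
    (with-release [x (cuv 1 2 3)
                   y (cuv 2 3 4)]
      (axpy! x y) ;; y := x + y, computed on the GPU
      (sum y))))  ;; => 15.0, the sum of (3.0 5.0 7.0)

The same vector addition, and not a kernel in sight.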
So, invest some time in 2026, learn the basics of GPU computing, and enjoy the coming AI age! In Clojure!
Oh, I didn't show you the C++ kernel code. It's C++, it must be scary! No, it's not. CUDA kernels are written in a subset of C++, without the scary parts!
extern "C" __global__ void add(int n, float *a, float *b, float *sum) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        sum[i] = a[i] + b[i];
    }
}
… and, hey, we shouldn't forget to clean up the memory! If we hadn't used def, we could just rely on with-release to do this for us, but we did everything manually, and now we have to clean up manually.
(release gpu-result)
(release gpu-b)
(release gpu-a)
(release add)
(release m)
(release prog)
(release ctx)
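… and, as promised near the beginning, here's a sketch of how the whole program condenses once with-release and with-context handle the housekeeping (assuming with-context releases the context when its body finishes, and relying on memcpy-host! returning its destination, as we saw above):

(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.clojurecuda.core :refer [with-context]])

(init)
(with-context (context (device))
  (with-release [prog (compile! (program (slurp "test/cuda/examples/jnvrtc-vector-add.cu")))
                 m (module prog)
                 add (function m "add")
                 gpu-a (memcpy-host! (float-pointer [1 2 3]) (mem-alloc-runtime (* Float/BYTES 3)))
                 gpu-b (memcpy-host! (float-pointer [2 3 4]) (mem-alloc-driver (* Float/BYTES 3)))
                 gpu-result (mem-alloc-pinned (* Float/BYTES 3))]
    (launch! add (grid-1d 3) (parameters 3 gpu-a gpu-b gpu-result))
    (synchronize!)
    (pointer-seq (memcpy-host! gpu-result (float-pointer 3)))))

No defs, no manual releases, and every step still available for REPL experimentation.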