Not One, Not Two, Not Even Three, but Four Ways to Run an ONNX AI Model on GPU with CUDA

November 9, 2025

Please share this post in your communities. Without your help, it will stay buried under tons of corporate-pushed, AI- and blog-farm-generated slop, and very few people will know that it exists.

These books fund my work! Please check them out.

Two weeks ago, I announced a new Clojure ML library, Diamond ONNX RT, which integrates ONNX Runtime into Deep Diamond. In that post, we explored the classic Hello World example of neural networks, MNIST handwritten digit recognition, step by step. We ran that example on the CPU, from main memory. The next logical step is to execute this stuff on the GPU.

You'll see that with a little help from ClojureCUDA and Deep Diamond's built-in CUDA machinery, this is both easy and simple, requiring almost no effort from a curious Clojure programmer. But don't just trust me; let's fire up your REPL, and we can continue together.

Here's how you can evaluate this directly in your REPL (you can use the Hello World that is provided in the ./examples sub-folder of Diamond ONNX RT as a springboard).

Require Diamond's namespaces

First things first, we require the namespaces and refer the functions that we're going to use.

(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.neanderthal.core :refer [transfer! iamax native]]
         '[uncomplicate.diamond
           [tensor :refer [tensor with-diamond]]
           [dnn :refer [network]]
           [onnxrt :refer [onnx]]]
         '[uncomplicate.diamond.internal.dnnl.factory :refer [dnnl-factory]]
         '[uncomplicate.diamond.internal.cudnn.factory :refer [cudnn-factory]]
         '[hello-world.native :refer [input-desc input-tz mnist-onnx]])
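
If you don't have the Hello World project from the previous post at hand, here is a minimal sketch of what hello-world.native might provide. This is not the actual source; the shape, layout, and model path are my assumptions, reconstructed from the snippets in this post, and the real namespace also fills input-tz with the pixels of a handwritten digit (a seven, judging by the answers below).

(require '[uncomplicate.diamond.tensor :refer [desc]])

;; A hedged sketch, not the actual hello-world.native source.
(def input-desc (desc [1 1 28 28] :float :nchw)) ; one 28x28 grayscale image
(def input-tz (tensor input-desc)) ; a CPU tensor in main memory (default DNNL backend)
(def mnist-onnx (onnx "../../data/mnist-12.onnx" nil)) ; a standalone ONNX blueprint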

None of the following ways to run CUDA models is preferred over the others; use the one that best suits your needs.

Way one

One of the ways to run ONNX models on your GPU is to simply use Deep Diamond's cuDNN factory as the backend for your tensors. Then, the machinery recognizes what you need and proceeds to do everything on the GPU, using the right stream for tensors, Deep Diamond operations, and ONNX Runtime operations. This looks exactly the same as any other Deep Diamond example from this blog or the DLFP book.

(with-diamond cudnn-factory [] ; cuDNN becomes the default tensor factory
  (with-release [cuda-input-tz (tensor input-desc) ; this tensor lives on the GPU
                 mnist (network cuda-input-tz [mnist-onnx])
                 classify! (mnist cuda-input-tz)]
    (transfer! input-tz cuda-input-tz) ; copy the input from main memory to the GPU
    (iamax (native (classify!))))) ; infer, bring the output back, find the winner
7

… it says. (Remember that iamax finds the index of the output entry with the largest magnitude, which is exactly the digit the network recognized.)

Way two

As an ONNX model usually defines the whole network, you don't need to use Deep Diamond's network as a wrapper. The onnx function can create a Deep Diamond blueprint, and Deep Diamond blueprints can be used as standalone layer creators, just like in the following code snippet.

(with-diamond cudnn-factory []
  (with-release [cuda-input-tz (tensor input-desc)
                 mnist-bp (onnx cuda-input-tz "../../data/mnist-12.onnx" nil)
                 infer-number! (mnist-bp cuda-input-tz)]
    (transfer! input-tz cuda-input-tz)
    (iamax (native (infer-number!)))))
7

… again.
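
The infer-number! function can be invoked as many times as you like. To classify another image, load fresh pixels into the native tensor and send them to the GPU with the same transfer pattern. Here's a sketch; the 784 zeros are only a stand-in for a real 28x28 image, so don't expect a meaningful answer from them (and restore the original pixels if you want to re-run the earlier examples).

(with-diamond cudnn-factory []
  (with-release [cuda-input-tz (tensor input-desc)
                 mnist-bp (onnx cuda-input-tz "../../data/mnist-12.onnx" nil)
                 infer-number! (mnist-bp cuda-input-tz)]
    (transfer! (repeat 784 0.0) input-tz) ; stand-in pixels; use a real image here
    (transfer! input-tz cuda-input-tz) ; copy them to the GPU
    (iamax (native (infer-number!)))))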

Way three

We can even mix CUDA and the CPU. Let's say your input and output tensors are in main memory, and you'd like to process them on the CPU, but you want to take advantage of the GPU for the model processing itself. Nothing is easier if you use Deep Diamond. Just specify an :ep (execution provider) in the onnx function configuration, and tell it that you'd like to use only CUDA. Now your network is executed on the GPU, while your input and output tensors stay in main memory, where they can be easily accessed.

(with-release [mnist (network input-tz [(onnx "../../data/mnist-12.onnx" {:ep [:cuda]})])
               infer-number! (mnist input-tz)]
  (iamax (infer-number!)))
7

… and again the same answer.
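
Since input-tz stays an ordinary CPU tensor in this setup, feeding it new data is a single transfer!, with no explicit GPU copies in sight. Here's another sketch with stand-in zeros, just to illustrate the point:

(with-release [mnist (network input-tz [(onnx "../../data/mnist-12.onnx" {:ep [:cuda]})])
               infer-number! (mnist input-tz)]
  (transfer! (repeat 784 0.0) input-tz) ; stand-in for a real 28x28 image
  (iamax (infer-number!))) ; host-device traffic happens behind the scenes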

Way four

Still need more options? No problem: onnx can create a standalone blueprint, and that blueprint recognizes the :ep configuration, too.

(with-release [mnist-bp (onnx input-tz "../../data/mnist-12.onnx" {:ep [:cuda]})
               infer-number! (mnist-bp input-tz)]
  (iamax (infer-number!)))
7

No surprises here.

Is there anything easier?

If you've seen code in any programming language that does this in a simpler and easier way, please let me know, so we can try to make Clojure even better in the age of AI!

The books

Should I mention that the book Deep Learning for Programmers: An Interactive Tutorial with CUDA, OpenCL, DNNL, Java, and Clojure teaches the nuts and bolts of neural networks and deep learning by showing you how Deep Diamond is built, from scratch, in interactive sessions? Each line of code can be executed and the results inspected in a plain Clojure REPL. The best way to master something is to build it yourself!
