Interactive GPU Programming - Part 3 - CUDA Context Shenanigans

March 1, 2018

All communication with the GPU takes place in a context. A context is analogous to a CPU program: it sets up the environment for the particular GPU we want to use in our computations. Simple on the surface, it is, unfortunately, incidentally complex due to early CUDA legacy. So what? We have to deal with it, so instead of complaining, we'll see how to tame it…

Context is a life-cycle manager

A GPU has quite a few computing cores and quite a bit of expensive memory. However, since our demands are huge, there is never enough of either. We always need top performance, and these resources are always scarce, no matter how abundant they are. A context manages the life cycles of the typical objects we find in CUDA programs, some of which we have seen (and used through ClojureCUDA) in Part 1 - Hello CUDA:

memory buffers
modules
functions
streams
events
textures, pinned staging buffers, etc.

Each context isolates the resources within it. For example, memory created in one context cannot be referenced by a function from another context, even if both are physically located on the same GPU. Objects are valid only in the context in which they were created: a linear memory address from one context points to garbage in another. When the context is destroyed (by calling the release function), all the resources inside it (memory, streams, etc.) are destroyed as well.

One for all and all for one

A context is attached to a specific device. Although more than one context may be created on one device, it is usually not a good idea. The recommended usage is to have one and only one context operating with the device.

(require '[uncomplicate.clojurecuda.core :refer :all])
(init)
(def ctx (context (device 0)))

When I create a context using the context function, this context becomes current for the thread where the function has been evaluated, and it is transparently supplied to functions that need a context to operate. Check this with the current-context function:

(current-context)

#object[uncomplicate.clojurecuda.internal.impl.CUContext 0x6677dfba "#CUContext[0x7fb2248a1020]"]

Depending on how much time has passed between evaluating the context creation and querying for the current context, you might get the context we expected, or an empty one (nativePointer=0x0).

This oddity is an (un)fortunate manifestation of the mismatch between CUDA's transparent management of thread-local contexts and multi-threaded code. Although we expect to stay in the same thread in our REPL, Clojure's REPL is executing in a thread pool. You can check this by evaluating the following snippet a couple of times, with a pause of several dozen seconds in between.

(.getId (Thread/currentThread))

You'll see a different id number if the REPL has been inactive for some time. The next call is evaluated in the same REPL, in the thread called "main", but the actual JVM/OS thread that executes it is a random thread from the pool, so the CUDA driver thinks it doesn't have a context and pulls the rug out from under our feet. How unfortunate!
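
The effect can be reproduced without a GPU. The following is only a minimal sketch in plain Clojure: it assumes a hypothetical current-ctx slot backed by a java.lang.ThreadLocal, which merely models per-thread state and does not call the CUDA driver. A value set in one thread is simply not there when another thread looks for it.

```clojure
;; A toy model of per-thread "current context" state; no CUDA involved.
;; `current-ctx`, `set-current!` and `get-current` are hypothetical names,
;; not part of ClojureCUDA.
(def ^ThreadLocal current-ctx (ThreadLocal.))

(defn set-current! [ctx] (.set current-ctx ctx))
(defn get-current [] (.get current-ctx))

;; Set the value in the calling thread.
(set-current! :ctx-a)

;; Same thread: the value is visible.
(def seen-here (get-current))

;; A different thread: its slot was never set, so it reads nil.
(def seen-elsewhere
  (let [result (promise)
        t (Thread. ^Runnable (fn [] (deliver result (get-current))))]
    (.start t)
    (.join t)
    @result))

(println seen-here seen-elsewhere) ; prints ":ctx-a nil"
```

This is the same shape as the REPL surprise above: nothing was destroyed, the second thread just never had a current context.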

The current context stack

During the walk-through in Part 1 - Hello CUDA, you might have been puzzled by my constant calls to the current-context! function. Shouldn't once be enough? Or, even cleaner and more functional, wouldn't it be better to just supply the context as an argument to context-aware functions? Yes, it would be much better, especially in more complex systems. However, early design decisions in popular technologies are here to stay as legacy, and we do not have a choice.

The designers of CUDA tried to design an API that appeals to what people are used to. CPU programs rarely need arguments such as context, device, etc. As you have seen in the Hello CUDA article, a function call that programmers like is (launch! fun params), not (launch! ctx device stream something-else fun params). Most C++ programs are not as heavily multithreaded as Clojure programs are, while managing extra arguments everywhere is an immediately obvious annoyance. However, in more dynamic threading setups, these side effects require constant nannying.

To a functional programmer, that mistake seems obvious. An additional parameter or two is much easier to deal with, especially with Clojure's macros, than having to constantly worry about switching the current context to the right one for the actual thread. In the C++ world, it is still not so obvious! OpenCL takes the right approach by supplying the context as an argument to functions that need it, but that makes the API a bit more verbose than CUDA's, and it is one of the main sources of complaints about OpenCL!

I guess you've got the point by now, so I'll stop complaining about CUDA's sinful ways, and get on to how to handle contexts.

Here is how CUDA context handling was envisioned. The program has one or more threads and handles one or more contexts, while functions that need to be executed in a context do not receive that information as an argument. Instead, each function gets the current context from the thread-local storage that the CUDA driver maintains for each thread.

This leads to the following situations:

One thread manages one context: this is the case in most example programs; CUDA seems easy and simple.
One thread manages many contexts: the programmer needs to keep references to the contexts and switch the current context whenever the subsequent code should run on a different GPU.
Many threads manage one context: the programmer needs to set the current context in each thread before CUDA function calls.
Many threads manage many contexts: there will be a bunch of references and a bunch of setting the current context calls.
Any of the previous cases, with the catch that the code does not control the thread it is executing in! Example: core.async!

Since calling current-context! replaces the context that may have previously been current, it may not be enough for cases beyond simple programs. The CUDA driver maintains a context stack for each thread. The functions push-context! and pop-context! can be used to put a context on top of the stack, making it current, and, after the work in that context has been done, to remove it from the top, reverting the current context to the one that was at the top previously. While current-context! is completely destructive, push/pop offers a somewhat more graceful mechanism. However, this mechanism still relies on side effects, and great care should be taken when working in multithreaded setups that use thread-pooled executors.

A small demonstration:

(do
  (def new-ctx (context (device 0)))
  (current-context! ctx)
  (= ctx (current-context)))

true

(do
  (push-context! new-ctx)
  (= new-ctx (current-context)))

true

(do
  (pop-context!)
  (= ctx (current-context)))

true

The CUDA functions that work inside the context will always work with the top context in the current context stack of the thread.
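
Because push and pop have to stay balanced even when the code in between throws, it can help to pair them in one place. Here is a minimal sketch, assuming a hypothetical helper call-with-pushed that receives the push and pop functions as arguments, so that it can be exercised without a GPU. The stack below is a plain atom standing in for the driver's per-thread stack; it is not how ClojureCUDA is implemented.

```clojure
;; Hypothetical helper, not part of ClojureCUDA: push ctx, call f,
;; and always pop afterwards, even if f throws.
(defn call-with-pushed [push! pop! ctx f]
  (push! ctx)
  (try
    (f)
    (finally
      (pop!))))

;; Toy stand-ins for the driver's context stack (a vector in an atom).
(def stack (atom [:ctx]))
(defn toy-push! [ctx] (swap! stack conj ctx))
(defn toy-pop! [] (swap! stack pop))
(defn toy-current [] (peek @stack))

;; Inside the call, the pushed context is on top.
(def inside (call-with-pushed toy-push! toy-pop! :new-ctx toy-current))

;; After a throwing body, the previous context is on top again.
(def after-throw
  (try
    (call-with-pushed toy-push! toy-pop! :new-ctx
                      #(throw (ex-info "kernel failed" {})))
    (catch clojure.lang.ExceptionInfo _
      (toy-current))))

(println inside after-throw) ; prints ":new-ctx :ctx"
```

With ClojureCUDA loaded, the same helper could presumably be called as (call-with-pushed push-context! pop-context! new-ctx f), where f is a no-argument function doing the GPU work. Since the stack is per thread, this only helps if the push, the work, and the pop all happen on one thread.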

The easy stuff

If you need information about the hardware and software environment that your code will run on, the info function can provide it. There are many specialized info functions in the info namespace.

(require '[uncomplicate.clojurecuda.info :refer :all])

Most of the information is related to the device.

(require '[uncomplicate.commons.core :refer :all])

(info (device 0))

info returns a map, which I show here pretty-printed. You can learn about these attributes in the CUDA documentation and books.

{:async-engine-count 2,
 :managed-memory true,
 :multi-gpu-board false,
 :maximum-surface2d-layered-layers 2048,
 :maximum-texturecubemap-width 32768,
 :ecc-enabled false,
 :max-pitch 2147483647,
 :max-grid-dim-y 65535,
 :compute-mode :default,
 :can-map-host-memory true,
 :max-grid-dim-z 65535,
 :pci-bus-id-string "0000:01:00.0",
 :maximum-texture2d-mipmapped-width 32768,
 :texture-pitch-alignment 32,
 :kernel-exec-timeout false,
 :maximum-texture2d-linear-height 65000,
 :max-shared-memory-per-multiprocessor 98304,
 :total-mem 8508145664,
 :maximum-texture1d-layered-width 32768,
 :maximum-texturecubemap-layered-layers 2046,
 :maximum-texture3d-width 16384,
 :maximum-surface2d-layered-height 32768,
 :max-block-dim-z 64,
 :maximum-surface1d-width 32768,
 :maximum-surface3d-width 16384,
 :name "GeForce GTX 1080",
 :maximum-texture3d-height-alternate 8192,
 :max-threads-per-multiprocessor 2048,
 :max-shared-memory-per-block 49152,
 :maximum-texture3d-width-alternate 8192,
 :compute-capability-major 6,
 :texture-alignment 512,
 :global-memory-bus-width 256,
 :maximum-surface2d-layered-width 32768,
 :memory-clock-rate 5005000,
 :maximum-surfacecubemap-layered-layers 2046,
 :maximum-surface2d-height 65536,
 :clock-rate 1733500,
 :concurrent-kernels 1,
 :compute-capability-minor 1,
 :maximum-texture2d-width 131072,
 :max-threads-per-block 1024,
 :maximum-texture1d-linear-width 134217728,
 :integrated false,
 :maximum-texture2d-layered-layers 2048,
 :max-block-dim-x 1024,
 :maximum-texture1d-mipmapped-width 16384,
 :maximum-texture2d-mipmapped-height 32768,
 :local-L1-cache-supported true,
 :maximum-surface1d-layered-layers 2048,
 :pci-bus-id 1,
 :maximum-texture1d-layered-layers 2048,
 :maximum-surfacecubemap-layered-width 32768,
 :max-grid-dim-x 2147483647,
 :maximum-texture2d-height 65536,
 :global-L1-cache-supported true,
 :maximum-texture2d-linear-pitch 2097120,
 :maximum-texturecubemap-layered-width 32768,
 :multi-gpu-board-group-id 0,
 :pci-domain-id 0,
 :maximum-surface3d-depth 16384,
 :maximum-surface2d-width 131072,
 :stream-priorities-supported true,
 :multiprocessor-count 20,
 :tcc-driver false,
 :warp-size 32,
 :unified-addressing true,
 :maximum-texture3d-height 16384,
 :L2-cache-size 2097152,
 :maximum-surfacecubemap-width 32768,
 :maximum-texture1d-width 131072,
 :maximum-surface1d-layered-width 32768,
 :maximum-surface3d-height 16384,
 :pci-device-id 0,
 :max-registers-per-block 65536,
 :max-block-dim-y 1024,
 :surface-alignment 512,
 :maximum-texture3d-depth-alternate 32768,
 :maximum-texture3d-depth 16384,
 :total-constant-memory 65536,
 :maximum-texture2d-linear-width 131072,
 :max-registers-per-multiprocessor 65536,
 :maximum-texture2d-layered-height 32768}
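
Since the result is an ordinary Clojure map, the usual map functions apply. As a small sketch, the following copies a few entries from the output above into a literal (so that it runs without a GPU) and derives two figures from them; the names device-info, total-gib and max-resident-threads are mine, chosen for illustration. With a device present, (info (device 0)) would take the place of the literal.

```clojure
;; An excerpt of the map printed above, copied as a literal.
(def device-info
  {:name "GeForce GTX 1080"
   :total-mem 8508145664
   :multiprocessor-count 20
   :max-threads-per-multiprocessor 2048
   :max-threads-per-block 1024})

;; Total device memory in GiB (2^30 bytes).
(def total-gib
  (/ (:total-mem device-info) (* 1024.0 1024 1024)))

;; Upper bound on the number of threads resident on the device at once:
;; multiprocessors times threads per multiprocessor.
(def max-resident-threads
  (* (:multiprocessor-count device-info)
     (:max-threads-per-multiprocessor device-info)))

(println (format "%s: %.2f GiB, up to %d resident threads"
                 (:name device-info) total-gib max-resident-threads))
```

For this card that comes to roughly 7.92 GiB and 20 × 2048 = 40960 resident threads.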

What does this have to do with contexts? Nothing directly, but in CUDA, a context is usually related one-to-one to a device. Otherwise, I hate CUDA contexts :)

Clean up what's left

(release ctx)
(release new-ctx)

The next article…

…is about OpenCL contexts, where we can see that this tricky issue can be much more straightforward to work with.

In the meantime, you may wish to check out higher-level performance libraries that offer Clojure functions that transparently execute on the CPU and on GPU devices from various vendors, such as the vectorization library Neanderthal.