Interactive GPU Programming - Part 1 - Hello CUDA

Need help with your custom Clojure software? I'm open to (selected) contract work.

January 17, 2018

Please share: Twitter.

These books fund my work! Please check them out.

Who wouldn't like to speed up their programs by thousands of times? When we reach the limits of software optimizations, the only hope is faster hardware. Today, that means parallel programs that run on Graphical Processing Units (GPUs). Thanks to the recent highly publicized deep learning and artificial intelligence advances, everyone has heard about Nvidia's CUDA - an environment and set of C++ oriented tools that transform your GPU card into a sort of a desktop supercomputer. Yes, that is interesting, and we want to get on board! But we would also like an interactive programming experience. I like power and performance, but I also like to experiment and play with flexible code interactively.

Listening to hunting stories of Google engineers is the furthest most programmers who work with higher-level platforms such as Java, .Net, Ruby, Python, Javascript, etc. will go. Except for the Pythonistas. They will get a bit further despite not knowing how to write any GPU-aware code, since there are DL libraries such as TensorFlow, Caffe, or PyTorch, which run on GPUs under the hood. However, even that depends on the other people writing the code that solves your problem. Sometimes there is such code, but more often there isn't any, and you are the one to write it.

The whole GPU programming ecosystem is not very approachable. Not only that you have to learn about parallel hardware and algorithms, but you have to find your way through the complex maze of different drivers, operating systems, and modified C++ build chains that is in itself the endless jungle of nitty-gritty details and incompatibilities.

There is a saner way I intend to show you in this series of blog posts. This better way relies on the environment that is:

platform-independent (you access it from Java Virtual Machine)
easy to set up and start with (relative to the "official" C++ way at least)
interactive (gives an instant feedback for each tiny piece of code you write)
easy to experiment and play with

And the best of all: it offers the access to almost the full power of your hardware, without hiding the details that do matter.

On top of it, there is a bonus: these tools work on Intel and AMD hardware too, so you are not constrained with Nvidia's proprietary ecosystem!

What also demonstrate how easy and malleable this approach is that this whole page has been automatically generated from a live org-mode session (this is a kind of interactive notebook), connected to a live Clojure REPL (read-eval-print-loop) session. The immediate output from the code execution is shown in the text as-is.

This first post shows the introductory Hello World example, and gives a glimpse of a typical CUDA application. In the following articles - this is only the first part of a series - I will explore major CUDA and OpenCL topics in detail.

Set up the environment

Hardware

The one thing we can't continue without is the hardware: ideally, you'd have a GPU in your machine:

Nvidia's GPU (supports CUDA)
AMD's GPU (you'd have to use OpenCL instead)
Intel's CPU and GPU (also supports OpenCL)

We access these hardware devices through the appropriate device drivers. Our everyday CPU programs do not need any special support of this kind only because the whole operating system is based around the specific architecture of the CPU (usually x86/amd64) and is a sort of the device driver for the CPU. So, you'd install the appropriate device drivers from the manufacturer of your GPU. Note that the generic drivers that come with your operating system are usually only capable of basic 2D display, and do not support GPU computing. Install the "real" drivers from Nvidia or AMD.

Toolkits

On top of that we need the actual GPU computing tools:

Nvidia: install the CUDA toolkit
AMD or Intel: install the support for OpenCL for your operating system

Clojure and Java

I assume that you already have a recent Java Development Kit (JDK). I also assume you like or are willing to try the fantastic Clojure programming language that compiles directly to the Java bytecode. There are many books, newbie tutorials, and conference talks that can help you quickly learning enough Clojure to be dangerous. I thing the free book Clojure for the Brave and True is very approachable and fun. You'll need to know at least basic Clojure to run this code yourself. If you do not have time to learn it right now, though, you'll still be able to read and understand examples, since Clojure has trivially simple syntax.

So, you'll create a new Clojure project (or use an example) that includes the actual library for GPU computing that I am talking about:

ClojureCUDA for Nvidia with CUDA
ClojureCL for AMD, Intel, and Nvidia with OpenCL

You can use any Java IDE, or any text editor + a Java environment that you like that has Clojure support (most do). I recommend the fantastic (if a bit unusual in Java community) Emacs + CIDER, with a convenient easy setup via Prelude.

Ready?

It seems there are lots of steps when I describe it like this, but that's because I've tried to support all kinds of cases. Most computers already have the drivers, many have CUDA Toolkit, most already have Java Virtual Machine, and Clojure is technically just a Java library we include in our Java projects through Maven (or a nicer Clojure build "mavens" Leiningen or Boot). Many readers have already been ready :)

This article uses CUDA, while the next one will show the code for OpenCL (most of the narrative applies for both).

Handle the GPU device(s)

At the beginning, we'll require the namespaces that contain functions for CUDA GPU programming: uncomplicate.clojurecuda.core namespace from the ClojureCUDA library.

(require '[uncomplicate.clojurecuda.core :refer :all])

CUDA environment has to be initialized before use. After that, I query the environment to see how many Nvidia GPU cards I have:

(init)

(device-count)

Now that I know that there are two CUDA-capable GPU devices in my computer, I can grab one by the handle:

(def my-nvidia-gpu (device 0))

I've stored the handle of my GTX 1080 in the my-nvidia-gpu global var. Storing data in global vars is fine and convenient for tutorials, but should not be done in the "real" programs. There is support in ClojureCUDA for nice functional approach to writing programs, though, so this technique is fine in this case.

What is the type of this object and how does it look like?

my-nvidia-gpu

#object[jcuda.driver.CUdevice 0x3e31f727 "CUdevice[nativePointer=0x0]"]

It's an instance of CUDevice, which itself stores a reference to a CUdevice native pointer. This is not something you should concern yourself with most of the time, but there is a reason why I'm showing it here:

ClojureCUDA shields you from the low-level details and lets you concentrate on the things that matter;
Sometimes what matters is low-level, and you should be able to control it for full effectiveness;
You should be able to use existing CUDA-based books, articles, and documentation to learn and properly use GPU programming.

Although ClojureCUDA is fairly pleasant and high-level, it is designed to directly correspond to familiar CUDA constructs. It helps when it can, and moves out of the way when necessary!

Congratulations, there is the hello world! Just kidding. Although this is the code that interacts with the driver, I hope you'd expected some "real" code that actually compute something on the GPU. You won't be disappointed. I would still like to point out that you have witnessed something cool, though: a real code, that lets your write CUDA-ish stuff interactively, without even knowing what a basic CUDA program looks like. The best of all, we are able to experiment and get the instant feedback, which is the key thing for learning how a CUDA program looks like and what it can do in the first place!

Working in the context

The first thing we have to do in all CUDA programs is to create the context through which we will be controlling the device(s). It is a step analogous to setting up a connection with a database in traditional programs.

CPU can execute multiple programs simultaneously, and GPU can as well. An analogous of a CPU proccess is a GPU context. It manages the life cycles of various CUDA objects, such as: memory, modules (program code), streams (a kind of threads), events, etc. It's a management unit that sets up an environment in which your device will execute your programs.

The default context setup can be easily created with ClojureCUDA:

(def ctx (context my-nvidia-gpu))

ctx

#object[uncomplicate.clojurecuda.internal.impl.CUContext 0x367f6baa "#CUContext[0x7f09ec014a70]"]

As with CUDevice, when you need a specific information about how to use contexts, there is a convenient fallback to the official literature; just look for CUcontext.

Manage the memory on the GPU device

To utilize the capacity of GPU computing cores, we need to feed them enough data at enough speed. GPU has its own fast memory and dedicated integrated memory controller(s) on board, but, on top of it, it does offer a direct way to control that memory.

An important thing to notice is that, unlike the CPU, which is an all-round skillful commando, the GPU is good at very narrow set of relatively simple tasks, and most of these tasks revolve around numerical operations. The memory management will usually deal with supplying huge arrays of primitive numbers to huge number of parallel workers, and taking care that each worker work on the appropriate part of a huge raw array.

When thinking about CUDA memory, think about huge raw byte arrays. Additional libraries such as Neanderthal would add more structure on top of this, but CUDA API typically manages raw bytes of memory. Here's how:

(current-context! ctx)
(def gpu-array (mem-alloc 1024))

This creates the handle to a chunk of 1024 bytes of global GPU memory called linear memory in CUDA terminology. It is nothing more than a 1 dimensional array of raw bytes in the main (global) memory on the GPU board, as opposed to graphics-oriented 2D, 3D, and texture memory.

gpu-array

#object[uncomplicate.clojurecuda.internal.impl.CULinearMemory 0x60ad688a "uncomplicate.clojurecuda.internal.impl.CULinearMemory@60ad688a"]

Transferring the data from the main memory to the GPU memory

We have defined the memory on which to unleash our many GPU cores. But, before going further, we should think about how to get the data there in the first place, and how to return the result. I assume that the data is not on the hard disk or in the CSV file available over the internet, but that we have already parsed it and loaded it in a float array in our main Clojure program in the Java virtual machine:

(current-context! ctx)
(def main-array (float-array (range 256)))

(take 10 main-array)

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

We have created a float array of 256 numbers. Each float takes 4 bytes, so its size of 1024 bytes matches the capacity of the GPU array that is the destination for this data.

In GPU computing terminology, the memory on the GPU is called device memory, while the main memory is called host memory.

Let's do the transfer!

(current-context! ctx)
(memcpy-host! main-array gpu-array)

#object[uncomplicate.clojurecuda.internal.impl.CUContext 0x367f6baa "#CUContext[0x7f09ec014a70]"]#object[uncomplicate.clojurecuda.internal.impl.CULinearMemory 0x60ad688a "uncomplicate.clojurecuda.internal.impl.CULinearMemory@60ad688a"]

To convince you that the data have really been transferred to the GPU memory, I'll transfer it back into a new empty float-array:

(take 12 (memcpy-host! gpu-array (float-array 256)))

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

11.0

Now you believe me the data is on the GPU!

Compute something already!

We have the data in memory, but we do not have the program to crunch it. There's a lot to talk about here, but since this is a hello world article after all, I'll be as brief as possible.

In GPU programming terminology, there are two kinds of code:

host code that runs on the CPU and do various management calls to the device driver (such as memcpy-host!)
kernels that run on the GPU cores

We write the host code in Clojure, while the kernels are written in CUDA C. Our hello world example will increment each element in the array, in parallel of course. The kernel looks like this:

extern "C"
__global__ void add(int n, float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = a[i] + 1.0f;
    }
};

Keyword __global__ indicates that this is a CUDA kernel. Otherwise, it is a plain C function, that has some parameters, and calls some operators on arrays. Another CUDA-specific things are blockIdx.x and similar fields. Remember that each parallel worker will execute this same code at the same time. Thus, each worker need to position itself in the whole squadron. CUDA environment will make sure that each unit ("worker") will get this data populated. In this hello world case, each worker will be able to compute its ID, and work only on one cell of the array. It will read the value of that cell, add one, and write it to the same location in the global GPU memory.

How will the GPU know which kernel is going to be executed on which array in which order? That's the job of the host code:

(def kernel-source
      "extern \"C\"
         __global__ void increment (int n, float *a) {
           int i = blockIdx.x * blockDim.x + threadIdx.x;
           if (i < n) {
             a[i] = a[i] + 1.0f;
        }
       };")

(current-context! ctx)
(def hello-program (compile! (program kernel-source)))
(def hello-module (module hello-program))
(def increment (function hello-module "increment"))
(launch! increment (grid-1d 256) (parameters 256 gpu-array))
(def result (memcpy-host! gpu-array (float-array 256)))

(take 12 result)

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

11.0

12.0

Each element of our array has been incremented! Cheers!

Why not write kernels in Clojure, or another higher-level language? It is possible to implement, but in my opinion it does more harm than good. Kernels are tiny, not very complex compared to the typical C++ code, and they match the problem of floating point computations really well. On the other hand, the host code can be long and complex, and C++ is notoriously hard to build. Therefore, simplifying host code by wrapping it with Clojure makes sense, while kernels - not so much.

Keep the environment clean!

The key to fast code is managing the scarce resources on the GPU. That means manually managing the key resource: memory!

ClojureCUDA has macros that can do the bookkeeping for us, which I'll show you in the following articles, but for now, we'll do that in the plain CUDA style: by freeing them by hand. Various specific CUDA functions are avilable for this task, and we can uniformly access them with the release function:

(require '[uncomplicate.commons.core :refer :all])

(release gpu-array)
(release hello-module)
(release hello-program)
(release ctx)

What follows next

Now that we have broken the ice by creating and running a complete CUDA program from scratch interactively in the REPL, I'll take time to get into the details of how to handle memory, kernels, transfer, contexts, and all these specialized CUDA topics. We'll also see how to integrate the custom low-level CUDA code we write with some powerful Clojure number crunching libraries such as Neanderthal.

But first, we'll repeat this hello world in OpenCL, for those of us who have AMD or Intel hardware, or just prefer to use a more open and standards-based solution with Nvidia.