Interactive GPU Programming - Part 2 - Hello OpenCL

February 7, 2018

This is really the same article as Part 1 - Hello CUDA, but focused on OpenCL, so I'll skip most of the narration and just show you the code. OpenCL is an open standard for heterogeneous, cross-platform parallel programming that supports GPUs, CPUs, and other accelerators.

Set up the environment

Hardware

Nothing specific to add here. You need an AMD, Intel, or Nvidia GPU; OpenCL can also run on CPUs. You have to install drivers that support OpenCL, which for AMD and Nvidia usually means the regular GPU drivers.

Toolkits

Here, things are a bit fuzzier. Graphics drivers usually support OpenCL, but sometimes an additional library is needed on top. Also, AMD supports OpenCL 2.0 on some hardware, but reverted to OpenCL 1.2 support in its latest libraries. Nvidia officially supports OpenCL 1.2, with some OpenCL 2.0 features supported unofficially. All in all, check carefully what your vendor supports, install the appropriate drivers and, optionally, additional toolkits that best support your hardware. Long story short, on recent GPUs (from the last 4 years or so), you should end up with OpenCL 1.2 or 2.0.

Clojure and Java

Same as in Part 1 - Hello CUDA, here we are using ClojureCL. Visit ClojureCL's page for more installation instructions.

Handle the GPU device(s)

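At the beginning, we'll require the namespaces that contain functions for GPU programming. Functions that work with OpenCL are in the uncomplicate.clojurecl.core namespace of the ClojureCL library, while functions that query the properties of platforms and devices are in uncomplicate.clojurecl.info.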

(require '[uncomplicate.clojurecl.core :refer :all]
         '[uncomplicate.clojurecl.info :refer :all])

Although most concepts of OpenCL are similar to what we've seen with CUDA, the environment setup is a bit more detailed. That's because OpenCL is not limited to GPUs from a single vendor; it supports many kinds of hardware devices from multiple vendors. It has to offer a choice of drivers and supported versions at the same time.

The entry point is the concept of a platform. The platforms function returns a sequence of the platforms available on the system:

(map name-info (platforms))
NVIDIA CUDA AMD Accelerated Parallel Processing

My machine reports two platforms: Nvidia and AMD. I'll use the AMD platform for this session:

(def amd-platform (second (platforms)))
#'user/amd-platform
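Picking the platform by position with second depends on the order in which the installed drivers happen to be listed on a particular machine. A minimal sketch of selecting it by name instead, assuming the platform name contains "AMD" as in the listing above; find-by-name is a hypothetical helper of mine, not part of ClojureCL:

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of ClojureCL: returns the first item
;; whose name (as computed by name-fn) contains fragment, or nil if none does.
(defn find-by-name [name-fn fragment items]
  (first (filter #(str/includes? (name-fn %) fragment) items)))

;; Evaluate in a REPL where ClojureCL and an OpenCL driver are available.
(comment
  (require '[uncomplicate.clojurecl.core :refer [platforms]]
           '[uncomplicate.clojurecl.info :refer [name-info]])
  (def amd-platform (find-by-name name-info "AMD" (platforms))))
```

The helper returns nil when nothing matches, so it is worth checking the result before passing it on to devices.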

Different platforms support different versions of the OpenCL standard, along with some vendor-specific extensions. Each platform can be used to access the appropriate hardware. What devices do I have on AMD's platform?

(map name-info (devices amd-platform))
Hawaii Hawaii Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

Now I know that there are two Hawaii GPUs (R9 290X) and one CPU that can be accessed through AMD's platform. Let's grab the handle of the first device:

(def my-amd-gpu (first (devices amd-platform)))

What is the type of this object, and what does it look like?

my-amd-gpu
#object[org.jocl.cl_device_id 0x30af1ea4 "cl_device_id[0x7f6ffc71fc30]"]
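The printed handle says little about the hardware behind it. As a sketch, the functions from uncomplicate.clojurecl.info can summarize the device, and version can also be applied to the platform to see which OpenCL version it reports. The map keys are my own labels, and the values depend entirely on your hardware and drivers, so no output is shown here.

```clojure
(require '[uncomplicate.clojurecl.info
           :refer [name-info vendor version device-type
                   max-compute-units max-work-group-size global-mem-size]])

;; Assumes amd-platform and my-amd-gpu from the session above.
;; The keys are illustrative labels, not ClojureCL names.
{:platform-version    (version amd-platform)
 :device-name         (name-info my-amd-gpu)
 :device-vendor       (vendor my-amd-gpu)
 :device-type         (device-type my-amd-gpu)
 :device-version      (version my-amd-gpu)
 :compute-units       (max-compute-units my-amd-gpu)
 :max-work-group-size (max-work-group-size my-amd-gpu)
 :global-memory-bytes (global-mem-size my-amd-gpu)}
```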

Having the handle of the device, we can proceed in a fashion similar to the Hello CUDA tutorial.

Working in the context

The default context setup can be easily created with ClojureCL:

(def ctx (context [my-amd-gpu]))
ctx
#object[org.jocl.cl_context 0x5a773a7a "cl_context[0x7f6ffc3b7310]"]

As with CUDA, when you need specific information about how to use contexts, you can conveniently fall back on the official OpenCL literature; just look for cl_context.

Manage the memory on the GPU device

Memory in OpenCL is created in the explicitly supplied context:

(def gpu-array (cl-buffer ctx 1024 :read-write))
gpu-array
#object[uncomplicate.clojurecl.core.CLBuffer 0xad4e84b "[email protected]"]
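The size passed to cl-buffer is in bytes, so 1024 bytes hold exactly the 256 single-precision floats used in the rest of this session. A small sketch that derives the size instead of hard-coding it; float-bytes is a hypothetical helper, not a ClojureCL function:

```clojure
;; Hypothetical helper: number of bytes needed to store n 32-bit floats.
(defn float-bytes [n]
  (* n Float/BYTES))

(float-bytes 256) ; => 1024

(comment
  ;; Assumes ctx from the session above.
  (def gpu-array (cl-buffer ctx (float-bytes 256) :read-write)))
```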

Transferring the data from the main memory to the GPU memory

Create the data:

(def main-array (float-array (range 256)))
(take 10 main-array)
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0

Let's do the transfer!

Unlike in CUDA, in OpenCL we have to explicitly set the context and the command queue (the equivalent of a CUDA stream) that will process the tasks (which is a good thing!).

(def queue (command-queue ctx my-amd-gpu))
(enq-write! queue gpu-array main-array)
#'user/queue
#object[org.jocl.cl_command_queue 0x4821e434 "cl_command_queue[0x7f6ffcf166e0]"]

To convince you that the data have really been transferred to the GPU memory, I'll transfer it back into a new empty float-array:

(def roundtrip-array (float-array 256))
(enq-read! queue gpu-array roundtrip-array)
(take 12 roundtrip-array)
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0

Now you believe me that the data is on the GPU!
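Looking at the first dozen elements is only a spot check. A sketch of comparing the whole arrays, using java.util.Arrays/equals from the JDK; same-floats? is a hypothetical helper name:

```clojure
;; Hypothetical helper: true when both float arrays have the same length
;; and the same element at every index.
(defn same-floats? [^floats a ^floats b]
  (java.util.Arrays/equals a b))

(comment
  ;; Assumes main-array and roundtrip-array from the session above.
  (same-floats? main-array roundtrip-array))
```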

Compute something already!

Both OpenCL and CUDA kernels are based on C, with some extensions for parallel programming. Here's the OpenCL version of the mul10 kernel.

__kernel void mul10(__global float *a) {
    // Each work-item handles the element at its own global index.
    int i = get_global_id(0);
    a[i] = a[i] * 10.0f;
}

The host code:

(def kernel-source
      "__kernel void mul10(__global float *a) {
         int i = get_global_id(0);
         a[i] = a[i] * 10.0f;
       }")


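;; Compile the kernel source for the devices in ctx, then look up the kernel by name.
(def hello-program (build-program! (program-with-source ctx [kernel-source])))
(def mul10 (kernel hello-program "mul10"))
(def result (float-array 256))
;; Bind gpu-array to the kernel's first parameter (index 0).
(set-arg! mul10 0 gpu-array)
;; Launch 256 work-items, one per array element.
(enq-nd! queue mul10 (work-size-1d 256))
;; Copy the contents of the GPU buffer back into the host array.
(enq-read! queue gpu-array result)
(take 12 result)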
0.0 10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 100.0 110.0

Each element of our array has been multiplied by 10! Cheers!
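To make the kernel's indexing concrete, here is a CPU stand-in written in plain Clojure: every loop iteration does the work of one work-item, with the loop index playing the role of get_global_id(0). It is a sketch for checking results, not how ClojureCL executes anything; mul10-reference is a hypothetical name.

```clojure
;; Hypothetical CPU stand-in for the mul10 kernel (not part of ClojureCL).
;; Each loop iteration does the work of one work-item; i plays the role
;; of get_global_id(0). The input array is left unchanged.
(defn mul10-reference ^floats [^floats a]
  (let [n (alength a)
        out (float-array n)]
    (dotimes [i n]
      (aset out i (float (* 10.0 (aget a i)))))
    out))

(comment
  ;; Assumes main-array and result from the session above.
  (= (seq result) (seq (mul10-reference main-array))))
```

Unlike the kernel, which overwrites the GPU buffer in place, this version returns a new array so that main-array stays available for the comparison.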

Keep the environment clean!

OpenCL also requires that we take care of the loose ends:

(require '[uncomplicate.commons.core :refer :all])
;; Release in roughly the reverse order of creation.
(release gpu-array)
(release mul10)
(release hello-program)
(release queue)
(release ctx)
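Manual release calls are easy to forget, especially when an exception interrupts the session. As a sketch, the with-release macro from uncomplicate.commons.core binds resources like let and releases them when the body exits. The example repeats the whole mul10 workflow and assumes my-amd-gpu, main-array, and kernel-source from the session above; no output is shown because it requires a working OpenCL device.

```clojure
(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.clojurecl.core :refer :all])

;; Assumes my-amd-gpu, main-array, and kernel-source from the session above.
(with-release [ctx (context [my-amd-gpu])
               queue (command-queue ctx my-amd-gpu)
               gpu-array (cl-buffer ctx 1024 :read-write)
               program (build-program! (program-with-source ctx [kernel-source]))
               mul10 (kernel program "mul10")]
  (let [result (float-array 256)]
    (enq-write! queue gpu-array main-array)
    (set-arg! mul10 0 gpu-array)
    (enq-nd! queue mul10 (work-size-1d 256))
    (enq-read! queue gpu-array result)
    ;; Realize the values before the resources are released.
    (vec (take 12 result))))
```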

What follows next

We've seen that working in OpenCL is similar to CUDA programming. Next, we will explore each of the steps we've done in more detail. I know that you are most interested in kernels and algorithms, but contexts, platforms, memory, and command queues are what we should get a firm grasp on first. Please be patient; it is for a good reason.

Interactive GPU Programming - Part 2 - Hello OpenCL - February 7, 2018 - Dragan Djuric