Interactive GPU Programming - Part 2 - Hello OpenCL

Need help with your custom Clojure software? I'm open to (selected) contract work.

February 7, 2018

Please share: Twitter.

These books fund my work! Please check them out.

This is really the same article as Part 1 - Hello CUDA, but focused on OpenCL, so I'll skip most of the narration and just show you the code. OpenCL is an open standard for heterogeneous, cross-platform parallel programming that supports GPU's, CPU's, and other accelerators.

Set up the environment

Hardware

Nothing specific to add here. You have to have either an AMD, Intel, or Nvidia GPU. OpenCL can work on CPU's too. You have to install the appropriate drivers that support OpenCL, which usually means regular GPU drivers from AMD and Nvidia.

Toolkits

Here, things are a bit more fuzzy. Graphic drivers usually support OpenCL, but sometimes there need to be an additional library on top. Also, AMD supports OpenCL 2.0 for some hardware, but reverted back to OpenCL 1.2 support in their latest libraries. Nvidia supports OpenCL 1.2, but some OpenCL 2.0 features are supported unofficially. All in all, you have to be careful to see what is supported by your vendor, install the appropriate drivers and, optionally, additional toolkits that best support that hardware. Long story short, on recent (4 years) GPU's, you should end up with OpenCL 1.2 or 2.0.

Clojure and Java

Same as in Part 1 - Hello CUDA, here we are using ClojureCL. Visit ClojureCL's page for more installation instructions.

Handle the GPU device(s)

At the beginning, we'll require the namespaces that contain functions for GPU programming. Functions that work with OpenCL are in the uncomplicate.clojurecl.core namespace of the ClojureCL library.

(require '[uncomplicate.clojurecl.core :refer :all]
         '[uncomplicate.clojurecl.info :refer :all])

nil

Although most concepts of OpenCL are similar to what we've seen with CUDA, the environment setup is a bit more detailed. That's because OpenCL does not support only GPU's, and only from one vendor, but multiple hardware devices from multiple vendors. It has to offer a choice of drivers and supported versions, at the same time.

The entry point is the concept of platform. The platforms function returns a sequence of platforms available on the system:

(map name-info (platforms))

AMD Accelerated Parallel Processing

My machine reports one platform: AMD.

(def amd-platform (first (platforms)))

#'user/amd-platform

Different platforms support different versions of OpenCL standard, and some additional vendor-specific extensions. Each platform can be used to access the appropriate hardware. What devices do I have on AMD's platform?

(map name-info (devices amd-platform))

Hawaii

Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

Now I know that there is one Hawaii GPU's (R9 290X) and one CPU that can be accessed through AMD's platform. Let's grab the handle of the first device:

(def my-amd-gpu (first (devices amd-platform)))

What is the type of this object and how does it look like?

my-amd-gpu

#object[uncomplicate.clojurecl.internal.impl.CLDevice 0x3f393c1e "#CLDevice[0x7f34a472bfb0]"]

Having the handle of the device, we can proceed in a fashion similar to the Hello CUDA tutorial.

Working in the context

The default context setup can be easily created with ClojureCL:

(def ctx (context [my-amd-gpu]))

ctx

#object[uncomplicate.clojurecl.internal.impl.CLContext 0x56ba4c21 "#CLContext[0x0]"]

As with CUDA, when you need a specific information about how to use contexts, there is a convenient fallback to the official literature; just look for cl_context.

Manage the memory on the GPU device

Memory in OpenCL is created in the explicitly supplied context:

(def gpu-array (cl-buffer ctx 1024 :read-write))

gpu-array

#object[uncomplicate.clojurecl.internal.impl.CLBuffer 0x3ea1bac2 "#CLBuffer[0x0]"]

Transferring the data from the main memory to the GPU memory

Create the data:

(def main-array (float-array (range 256)))

(take 10 main-array)

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

Let's do the transfer!

Differently from CUDA, in OpenCL we have to explicitly set the context and command queue (equivalent of CUDA stream) that will process the tasks (that is a good thing!).

(def queue (command-queue ctx my-amd-gpu))
(enq-write! queue gpu-array main-array)

class clojure.lang.Compiler$CompilerExceptionclass clojure.lang.ExceptionInfoclass clojure.lang.Compiler$CompilerExceptionclass clojure.lang.ExceptionInfoCompilerException clojure.lang.ExceptionInfo: OpenCL error: CL_INVALID_CONTEXT. {:name "CL_INVALID_CONTEXT", :code -34, :type :opencl-error, :details {:device #uncomplicate.clojurecl.info.DeviceInfo{:address-bits 64, :available true, :built-in-kernels #{}, :compiler-available true, :double-fp-config #{:round-to-zero :inf-nan :denorm :round-to-inf :round-to-nearest :fma}, :endian-little true, :error-correction-support false, :execution-capabilities #{:kernel}, :extensions #{"cl_amd_printf" "cl_khr_spir" "cl_amd_device_attribute_query" "cl_khr_local_int32_base_atomics" "cl_khr_local_int32_extended_atomics" "cl_amd_popcnt" "cl_khr_gl_depth_images" "cl_khr_global_int32_base_atomics" "cl_amd_fp64" "cl_khr_mipmap_image_writes" "cl_khr_subgroups" "cl_khr_global_int32_extended_atomics" "cl_khr_fp64" "cl_amd_media_ops" "cl_amd_media_ops2" "cl_khr_depth_images" "cl_ext_atomic_counters_32" "cl_khr_gl_sharing" "cl_khr_3d_image_writes" "cl_khr_image2d_from_buffer" "cl_khr_mipmap_image" "cl_khr_byte_addressable_store" "cl_khr_gl_event" "cl_khr_int64_extended_atomics" "cl_khr_int64_base_atomics" "cl_amd_vec3"}, :global-mem-cache-size 16384, :global-mem-cache-type :read-write, :global-mem-cacheline-size 64, :global-mem-size 4063047488, :global-variable-preferred-total-size 4063047488, :image2d-max-height 16384, :image2d-max-width 16384, :image3d-max-depth 2048, :image3d-max-height 2048, :image3d-max-width 2048, :image-base-address-alignment 256, :image-max-array-size 2048, :image-max-buffer-size 65536, :image-pitch-alignment 256, :image-support true, :linker-available true, :local-mem-size 32768, :local-mem-type :local, :max-clock-frequency 1040, :max-compute-units 44, :max-constant-args 8, :max-constant-buffer-size 65536, :max-global-variable-size 2642637312, :max-mem-aloc-size 2936263680, :max-on-device-events 1024, :max-on-device-queues 1, :max-parameter-size 1024, :max-pipe-args 16, :max-read-image-args 128, :max-read-write-image-args 64, :max-samplers 16, :max-work-group-size 256, :max-work-item-dimensions 3, :max-work-item-sizes [256 256 256], :max-write-image-args 64, :mem-base-addr-align 2048, :name "Hawaii", :native-vector-width-char 4, :native-vector-width-short 2, :native-vector-width-int 1, :native-vector-width-long 1, :native-vector-width-double 1, :native-vector-width-float 1, :native-vector-width-half 1, :opencl-c-version {:version 2.0, :vendor-specific-info nil}, :parent-device nil, :partition-affinity-domain #{}, :partition-max-sub-devices 44, :partition-properties (), :partition-type (), :pipe-max-active-reservations 16, :pipe-max-packet-size 2936263680, :platform #object[org.jocl.cl_platform_id 0x52ebe789 "cl_platform_id[0x7f350fedba18]"], :preferred-global-atomic-alignment 0, :preferred-interop-user-sync true, :preferred-local-atomic-alignment 0, :preferred-platform-atomic-alignment 0, :preferred-vector-width-char 4, :preferred-vector-width-short 2, :preferred-vector-width-int 1, :preferred-vector-width-long 1, :preferred-vector-width-double 1, :preferred-vector-width-float 1, :preferred-vector-width-half 1, :printf-buffer-size 1048576, :profile "FULL_PROFILE", :profiling-timer-resolution 1, :queue-on-device-max-size 8388608, :queue-on-device-preferred-size 262144, :queue-on-device-properties #{:out-of-order-exec-mode :profiling}, :queue-on-host-properties #{:profiling}, :reference-count 1, :single-fp-config #{:correctly-rounded-divide-sqrt :round-to-zero :inf-nan :round-to-inf :round-to-nearest :fma}, :spir-versions #{1.2}, :svm-capabilities #{:coarse-grain-buffer :fine-grain-buffer}, :device-type :gpu, :vendor "Advanced Micro Devices, Inc.", :vendor-id 4098, :device-version "OpenCL 2.0 AMD-APP (1912.5)", :driver-version "1912.5 (VM)"}}}, compiling:(form-init7666750429595232756.clj:1:12)
ExceptionInfo OpenCL error: CL_INVALID_COMMAND_QUEUE.  clojure.core/ex-info (core.clj:4739)

To convince you that the data have really been transferred to the GPU memory, I'll transfer it back into a new empty float-array:

(def roundtrip-array (float-array 256))
(enq-read! queue gpu-array roundtrip-array)

(take 12 roundtrip-array)

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

11.0

Now you believe me the data is on the GPU!

Compute something already!

Both OpenCL and CUDA kernels are based on C, with some additional parallel programming additions. Here's the OpenCL version of the mul10 kernel.

__kernel void mul10(__global float *a) {
    int i = get_global_id(0);
    a[i] = a[i] * 10.0f;
};

The host code:

(def kernel-source
      "__kernel void mul10(__global float *a) {
         int i = get_global_id(0);
         a[i] = a[i] * 10.0f;
       };")


(def hello-program (build-program! (program-with-source ctx [kernel-source])))
(def mul10 (kernel hello-program "mul10"))
(def result (float-array 256))
(set-arg! mul10 0 gpu-array)
(enq-kernel! queue mul10 (work-size-1d 256))
(enq-read! queue gpu-array result)

(take 12 result)

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

90.0

100.0

110.0

Each element of our array has been multiplied by 10! Cheers!

Keep the environment clean!

OpenCL also requires that we take care of the loose ends:

(require '[uncomplicate.commons.core :refer :all])

(release gpu-array)
(release hello-program)
(release queue)
(release ctx)

What follows next

We've seen that working in OpenCL is similar to CUDA programming. Next, we will explore each of the steps we've done in more detail. I know that you are most interested in kernels and algorithms, but contexts, platforms, memory, and streams is what we should get a firm grasp on first. Please be patient, it is for a good reason.