Clojure Walk through the OpenCL in Action GPU Computing Book - Part 1


October 24, 2018


Since its inception, the ClojureCL test suite has carried a walk-through of the examples from what is, in my opinion, the best introductory book for GPU computing: OpenCL in Action by Matthew Scarpino. I've just never had time to write the commentary as proper narration. Thanks to Nikola Milikic, we now have it for Chapter 4.

Please note that this OpenCL book is also perfectly suitable as a first CUDA book. I have never seen an introductory CUDA book that teaches the fundamentals of GPU programming better than Matthew's underappreciated text. Had it been written for CUDA, or had OpenCL had more luck, it would probably have been a bestseller!

Note: the code evaluation was run on my machine when this page was automatically generated, so where Nikola says "Apple" and the results show "AMD", don't be confused. Your machine might say "Nvidia".

Enjoy this article written by Nikola Milikic!

Basic ClojureCL Examples - Part 1

In the initial post titled Interactive GPU Programming - Part 2 - Hello OpenCL, Dragan explained how to get started with the ClojureCL library and its basic use. The idea of this and the following posts in this series is to show more examples of how to use ClojureCL and to explain the basic OpenCL concepts needed to understand the examples.

Literature on OpenCL is relatively scarce. I took Dragan’s advice and started learning it by going through the book OpenCL in Action. But instead of doing the examples in C/C++, as in the book, I did them in Clojure (they can be found in the ClojureCL repo). This is a good way of learning OpenCL host programming, as you can concentrate on the main OpenCL concepts in your favourite language.

Example 1 (Listing 4.1)

Let’s start with the first example, from Listing 4.1 in the book (p. 69). The example illustrates how to run a basic OpenCL program, i.e. a kernel. A kernel is a function that can be executed by one or more OpenCL-compliant devices. This is an example of a kernel:

__kernel void hello_kernel(__global char16 *msg) {
        *msg = (char16)('H', 'e', 'l', 'l', 'o', ' ',
                'k', 'e', 'r', 'n', 'e', 'l', '!', '!', '!', '\0');
}

We won't go into the details of writing kernels (you should consult the book for this purpose). But generally, kernels can accept arguments by value or by reference. If you pass an argument by reference, you are actually providing a pointer that references a memory object. In our example, the argument msg references a 16-byte buffer object, and what this kernel does is store the 16 characters of "Hello kernel!!!" (including the terminating '\0') into that buffer.
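To make the distinction concrete, here is a minimal hypothetical kernel (my own sketch, not from the book): the scalar n arrives by value, while out is passed by reference as a pointer into a device buffer.

```c
__kernel void scale(const float n, __global float *out) {
    int i = get_global_id(0); /* index of this work-item */
    out[i] = out[i] * n;      /* n came by value; out references device memory */
}
```

Any result such a kernel produces has to be written through a reference like out; the `__kernel` function itself is declared void.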

In order to run this kernel, we need to write a host program, i.e. an application on a development system. The role of a host program is to build a kernel from its source, set kernel argument data and send the kernel to the device’s command queue for execution. Also, the host program is responsible for reading back results from the executed kernel and observing the events coming in during the execution process.

This is what a host program that executes this kernel looks like in Clojure (I have adjusted the test “Section 4.1, Page 69.” from the ClojureCL repo so that it is self-contained and not executed as a test):

(ns openclinaction.ch04
  (:require [clojure.java.io :as io]
            [clojure.core.async :refer [chan <!!]]
            [uncomplicate.commons
             [core :refer [with-release info]]
             [utils :refer [direct-buffer]]]
            [uncomplicate.clojurecl
             [core :refer :all]
             [info :refer :all]]))
(let [notifications (chan)
      follow (register notifications)]

  (with-release [dev (first (devices (first (platforms))))
                 ctx (context [dev])
                 cqueue (command-queue-1 ctx dev)]

    (let [host-msg (direct-buffer 16)
          work-sizes (work-size [1])
          program-source "__kernel void hello_kernel(__global char16 *msg) {
                            *msg = (char16)('H', 'e', 'l', 'l', 'o', ' ',
                              'k', 'e', 'r', 'n', 'e', 'l', '!', '!', '!', '\\0');
                          }"]
      (with-release [cl-msg (cl-buffer ctx 16 :write-only)
                     prog (build-program! (program-with-source ctx [program-source]))
                     hello-kernel (kernel prog "hello_kernel")
                     read-complete (event)]

        (set-args! hello-kernel cl-msg)
        (enq-kernel! cqueue hello-kernel work-sizes)
        (enq-read! cqueue cl-msg host-msg read-complete)
        (follow read-complete host-msg)
        (let [data ^java.nio.ByteBuffer (:data (<!! notifications))
              res ^bytes (make-array Byte/TYPE 16)]
          (dotimes [i 16] (aset res i (.get data i)))
          (apply str (map char res)))))))
Hello kernel!!!

In the remainder of the post I will explain what the code does.

In a simplified world, every host application performs the following steps in order to execute a kernel:

  1. Select an OpenCL device (or multiple devices),
  2. Create a context for communicating with the device,
  3. Create a command queue through which the host tells the device what to do,
  4. Compile and build a kernel (from the source text where the OpenCL program is written),
  5. Set arguments to the kernel, through which data is passed to the kernel or returned from it (kernels cannot actually return a result, but they can write it to an argument reference),
  6. Send a kernel to the command queue for execution,
  7. After the kernel execution, read the results,
  8. Release all resources.

Note that steps 1-3 need to be performed only once in an application. Once you obtain references to a device, a context, and a command queue, you can reuse them multiple times. Similarly, Step 8 should also be performed only once, at the end of interacting with your devices. Step 4 can be performed multiple times, meaning several kernels can be built and used with the same context. And in Step 5, arguments can be reused for multiple kernels.

In the previous code, most of these steps are achieved in a single line of code. So, let's dissect the code from the previous example and associate it with these steps.

In lines 11-12, a Clojure channel called notifications is created, along with a function follow that will be used for creating callbacks for events coming from the command queue. This can be pretty handy, as Clojure’s channels can be used for communicating between different threads.

Next, in line 14, from the available OpenCL platforms (returned by the platforms function) we select the OpenCL platform to be used (Step 1). On my MacBook Pro (mid 2014) I have only one OpenCL platform, called Apple.

(name-info (first (platforms)))
;;=> "Apple"
AMD Accelerated Parallel Processing

And on my platform, I have two devices supporting OpenCL, my processor and my graphics card.

(map name-info (devices (first (platforms))))
;;=> ("Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz" "Iris")
Hawaii Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

In our example we can choose either of them to execute the sample kernel.

Once the device is selected (in my case, my CPU), in line 15 we create a context ctx for communicating with it (Step 2) and the command queue cqueue through which we will send the kernel and read the results (Step 3). Note that here I am using a legacy version of the function for creating the command queue, command-queue-1. The legacy function is intended for devices that support only OpenCL 1.2 or older versions, and since my CPU supports only OpenCL 1.2, I need to use it.

(device-version (first (devices (first (platforms)))))
; => "OpenCL 1.2 "
OpenCL 2.0 AMD-APP (1912.5)

If your device(s) support OpenCL 2.0, you can simply use the regular command-queue function.
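If you want one code path that works on both kinds of devices, you could choose the constructor based on the version string the device reports. This is my own sketch, not from the book: it assumes the dev binding from the main example, and that device-version strings start with "OpenCL <major>.<minor>" as shown above.

```clojure
(with-release [ctx (context [dev])
               cqueue (if (.startsWith ^String (device-version dev)
                                       "OpenCL 1.")
                        (command-queue-1 ctx dev) ; legacy path, OpenCL <= 1.2
                        (command-queue ctx dev))] ; regular path, OpenCL 2.0+
  ;; use cqueue exactly as in the main example
  )
```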

Continuing with our example, in lines 18-20 we set up some variables we need later. We create a byte buffer of size 16, called host-msg, to store the result returned from the kernel. In the following line, we initialize the work-sizes record that will be used later when enqueuing the kernel to the command queue (I will get back to this later). And finally, we define our OpenCL program to be executed. It is usually better to store the program source in a separate file, but I wanted a self-contained example here.
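For context, work-size describes how many work-items (parallel instances of the kernel) to launch; our example needs only one, since the kernel writes a single message. As I understand ClojureCL's work-size function, the first vector is the global size and the optional second one the local (work-group) size, so other shapes look roughly like this:

```clojure
(work-size [1])         ; 1 work-item in 1 dimension -- our example
(work-size [1024])      ; 1024 work-items in 1 dimension
(work-size [1024] [64]) ; global size 1024, work-groups of 64
(work-size [256 256])   ; a two-dimensional 256x256 grid
```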

Next, in line 24, we call the cl-buffer function. This creates an OpenCL data structure that represents a memory object and can have specific properties. In our case, we create a 16-byte memory object, called cl-msg, that can only be written to by an OpenCL program (hence the :write-only flag).

In line 25 we compile and build our OpenCL program so it can be used with all devices (the build-program! function). And finally, in the following line, we create a kernel named “hello_kernel” from the compiled program (Step 4).

In line 27, we create an instance of an OpenCL event that we will use later.

Next, we come to line 29, where we set the arguments to be passed to the kernel (Step 5). We pass the cl-msg memory object to the kernel, and in line 30 we enqueue the kernel to the command queue (Step 6). As already mentioned, a kernel cannot return a result the way functions in most programming languages do; the usual approach is to write the result of an execution (in our case, the message) to an argument reference. So, after the kernel is executed, the program enqueues a read operation from cl-msg (line 31). We also supply a read-complete event that will notify us when the read operation completes. After that, we invoke the follow function, which registers a callback on the read-complete event, and also pass it host-msg as the memory reference the result should be read into. Finally, in the remainder of the code, we read from the notifications channel (once there is an event in it), fetch the result data (initially stored in a byte buffer), and copy it into a byte array res that is more convenient to work with (Step 7).

What I left out on purpose (so I could explain it here at the end) is that all resources should be released after you finish working with them, i.e. the memory they occupy should be freed. This includes releasing all references to memory objects, kernels, command queues, programs, and contexts. We do this by wrapping the code in the with-release macro. The purpose of this macro is to bind releasable objects to symbols (as in a let form), evaluate the body, and release the resources at the end.
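Conceptually, with-release behaves roughly like a let wrapped in try/finally. This is a simplified sketch of the idea, not the macro's actual expansion (do-something is a placeholder; release comes from uncomplicate.commons.core):

```clojure
;; (with-release [buf (cl-buffer ctx 16 :write-only)]
;;   (do-something buf))
;; behaves roughly like:
(let [buf (cl-buffer ctx 16 :write-only)]
  (try
    (do-something buf)
    (finally
      (release buf)))) ; resources are freed even if the body throws
```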

In the next post, I’ll continue with dissecting the examples from the OpenCL in Action book. I know this post is a bit lengthy since I wanted to introduce the basic OpenCL concepts and the flow of the kernel execution from the host program. But hopefully, this text helped you understand these concepts and you can follow the upcoming posts with more advanced examples.

Guest post by Nikola Milikic

Clojure Walk through the OpenCL in Action GPU Computing Book - Part 1 - October 24, 2018 - Dragan Djuric