Interactive, Functional,
GPU Accelerated
Programming in Clojure

Dragan Djuric

[email protected]

Clojure and FP are great!

  • Dynamic and fast
  • First-class functions
  • Great abstractions and data structures
  • Many useful libraries
  • Even more experimental libraries
  • Access to Java and the JVM
  • Hey, the community is amazing!

Number crunching?

only thinking about the money and the fame and attention. –Urban Dictionary

Number cruncher?

a computer that can solve many problems at a fast rate –Urban Dictionary

  • How fast is fast?

What will I talk about?

  • Not (necessarily) Big Data
  • Lots of computations regardless of data size
  • Numerical data
  • Numerical algorithms

Is FP good at number crunching?

  • Good? Sometimes.
  • Great? NO!
  • Poor access to hardware-specific optimizations.
  • Some FP ideas are very useful!

CPU is not so great either!

  • R, Python? Even worse than Java or Haskell.
  • C? Complicated, verbose, platform-specific optimizations.
  • CPU? Too beefed-up!

GPU has a lot to offer …at a price

  • many dumb computing units
  • but, power-efficient for number crunching
  • hardware support for massive parallelism
  • faster and cheaper each year
  • notoriously difficult to program

Uncomplicate

ClojureCL
take control of the GPU, CPU, and accelerators from Clojure
ClojureCUDA
take control of the Nvidia GPU
Neanderthal
vectors and matrices, but optimized for CPU and GPU
Bayadera
high performance Bayesian statistics and data analysis on the GPU

Hello world: dot product

  • One of the simplest linear algebra functions
  • Consists of two familiar FP building blocks:
    • map
    • reduce
  • Multiply the corresponding elements of two arrays (vectors) and sum the products.

Idiomatic Clojure

(let [dot-product (fn [xs ys]
                    (reduce + (map * xs ys)))

      x-vec (vec (range 100000))
      y-vec (vec (range 100000))]

  (dot-product x-vec y-vec))
333328333350000

  • Execution time: 14 ms

Neanderthal: using optimized library

(require '(uncomplicate.neanderthal
           [core :refer :all]
           [native :refer :all]))
(let [x (fv (range 100000))
      y (copy x)]

  (dot x y))
3.33328352E14

  • Execution time: 6 μs
  • fv creates a single-precision (float) vector, hence the slightly rounded result

Neanderthal: a taste of GPU

(require '[uncomplicate.clojurecuda.core :refer :all]
         '[uncomplicate.neanderthal.cuda :refer :all]
         '[uncomplicate.commons.core :refer :all])
(with-default
  (with-default-engine
    (with-release [gpu-x (cuv (range 100000))
                   gpu-y (copy gpu-x)]

      (dot gpu-x gpu-y))))
3.33328352E14

  • Execution time: 26 μs
  • Not faster than the CPU at all! Why?

A million dot products on the GPU

(with-default
  (with-default-engine
    (with-release [gpu-x (entry! (cuge 1000 100000) 0.01)
                   gpu-y (copy (trans gpu-x))
                   gpu-c (cuge 1000 1000)]

      (do (mm! 1 gpu-x gpu-y 0 gpu-c)
          (synchronize!)
          true))))
true

  • Execution time: 23 ms
  • 23 nanoseconds per 100,000-element dot product!
  • Our Clojure code started at 14 milliseconds (a 600,000× difference!)
  • 1000× faster than computing the dot products one at a time!
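These numbers are easy to sanity-check; a quick Python sketch of the arithmetic behind the slide's claims (no GPU required):

```python
# A 1,000 x 1,000 result matrix means 1,000,000 dot products in one mm! call.

def per_dot_seconds(total_seconds, n_dots):
    """Average cost of one dot product when n_dots of them take total_seconds."""
    return total_seconds / n_dots

# 1,000,000 dot products computed in 23 ms
per_dot = per_dot_seconds(23e-3, 1000 * 1000)   # 2.3e-8 s, i.e. 23 ns each

# idiomatic Clojure: 14 ms for a single 100,000-element dot product
print(14e-3 / per_dot)   # ~608,696 -> the "600,000x difference"

# solo GPU dot product: 26 us
print(26e-6 / per_dot)   # ~1,130 -> the "1000x faster" batched speedup
```

The batched speedup comes from amortizing kernel-launch and memory-latency overhead across a million dot products instead of paying it once per call.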

A glimpse of ClojureCUDA API

(init)
true
(device-count)
2
(def my-nvidia-gpu (device 0))
#'user/my-nvidia-gpu

Querying information

(info my-nvidia-gpu)
{:max-grid-dim-y 65535, :total-mem 11721506816,
 :name "GeForce GTX 1080 Ti", :max-threads-per-multiprocessor 2048,
 :max-shared-memory-per-block 49152, :compute-capability-major 6,
 :global-memory-bus-width 352, :memory-clock-rate 5505000,
 :max-threads-per-block 1024, :multiprocessor-count 28,
 :warp-size 32, :max-registers-per-block 65536
 ;;... much more data
}

Managing Context

(def ctx (context my-nvidia-gpu))
#'user/ctx
(info ctx)
{:dev-runtime-pending-launch-count 2048  :dev-runtime-sync-depth 2
 :malloc-heap-size 8388608  :stack-size 1024  :api-version 3020
 :stream-priority-range (0 -1)  :cache-config :prefer-none  :printf-fifo-size 1048576
 :device #object(jcuda.driver.CUdevice 0x12be4426 "CUdevice[nativePointer=0x0]")
 :shared-config :four-byte-bank-size}
(= ctx (current-context))
true

Memory

(def gpu-array (mem-alloc 1024))
#'user/gpu-array
(def main-array (float-array (range 256)))
#'user/main-array
(take 10 main-array)
(0 1 2 3 4 5 6 7 8 9)

Transfer data to GPU memory

(memcpy-host! main-array gpu-array)
#object[uncomplicate.clojurecuda.internal.impl.CULinearMemory 0x38701ca4 "uncomplicate.clojurecuda.internal.impl.CULinearMemory@38701ca4"]
(take 12 (memcpy-host! gpu-array (float-array 256)))
(0 1 2 3 4 5 6 7 8 9 10 11)

Compute something already!

The kernel

extern "C"
__global__ void increment(int n, float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = a[i] + 1.0f;
    }
}
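The index arithmetic is the heart of the CUDA data-parallel model: every thread computes its own global index from its block and thread coordinates, and the `i < n` guard handles the last block overhanging the data. A minimal CPU-side Python sketch that serializes what the GPU threads do in parallel (grid and block sizes here are illustrative, not a real launch configuration):

```python
def increment(grid_dim, block_dim, n, a):
    """Serialize the kernel: one loop iteration per (blockIdx.x, threadIdx.x)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # the same arithmetic as blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * block_dim + thread_idx
            if i < n:          # guard: the last block may overhang n
                a[i] += 1.0
    return a

# hypothetical launch: 2 blocks of 4 threads, but only n = 6 elements
print(increment(2, 4, 6, [0, 1, 2, 3, 4, 5]))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Threads 6 and 7 of the second block compute indices past the end of the array and do nothing, which is exactly what the guard in the kernel is for.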

The program

(def kernel-source
  "extern \"C\"
   __global__ void increment (int n, float *a) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           a[i] = a[i] + 1.0f;
       }
   }")
#'user/kernel-source

(def hello-program (compile! (program kernel-source)))
#'user/hello-program

The GPU Function

(def hello-module (module hello-program))
#'user/hello-module
(def increment (function hello-module "increment"))
#'user/increment

Running the GPU function

(launch! increment (grid-1d 256) (parameters 256 gpu-array))
nil
(take 12 (memcpy-host! gpu-array (float-array 256)))
(1 2 3 4 5 6 7 8 9 10 11 12)
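grid-1d takes the total number of work items and works out the launch dimensions: the number of blocks is the ceiling of n divided by the block size. A quick Python sketch of that arithmetic (the block size of 1024 is illustrative; ClojureCUDA chooses its own default when one isn't supplied):

```python
def grid_1d(n, block_size=1024):
    """Blocks needed so that blocks * block_size covers all n work items."""
    blocks = (n + block_size - 1) // block_size  # ceiling division
    return blocks, block_size

# 256 work items fit in a single block
print(grid_1d(256))      # (1, 1024)
# 100,000 work items need 98 blocks (98 * 1024 = 100,352 >= 100,000)
print(grid_1d(100_000))  # (98, 1024)
```

Because the grid is rounded up to whole blocks, some threads in the last block fall past the data, which is why the kernel takes n and guards with `if (i < n)`.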

Wrap-up

  • Managing low-level CUDA kernels from interactive Clojure code
  • Managing low-level third-party GPU code (cuBLAS, cuDNN, etc.) (also interactively!)
  • Mixing the two approaches! (again, interactively)

…instead of writing everything as a huge, impenetrable C++ codebase and then just calling it with dummy wrappers

Thank You

The presentation can be accessed on my blog:

Find more at: