Interactive, Functional,
GPU Accelerated
Programming in Clojure

Dragan Djuric

[email protected]

Clojure and FP are great!

  • Dynamic and fast
  • First-class functions
  • Great abstractions and data structures
  • Many useful libraries
  • Even more experimental libraries
  • Access to Java and the JVM
  • Hey, the community is amazing!

Number crunching?

only thinking about the money and the fame and attention. –Urban Dictionary

Number cruncher?

a computer that can solve many problems at a fast rate –Urban Dictionary

  • How fast is fast?

What will I talk about?

  • Not (necessarily) Big Data
  • Lots of computations regardless of data size
  • Numerical data
  • Numerical algorithms

Is FP good at number crunching?

  • Good? Sometimes.
  • Great? NO!
  • Poor access to hardware-specific optimizations.
  • Some FP ideas are very useful!

CPU is not so great either!

  • R, Python? Even worse than Java or Haskell.
  • C? Complicated, verbose, platform-specific optimizations.
  • CPU? Too beefed-up!

GPU has a lot to offer …at a price

  • many dumb computing units
  • but, power-efficient for number crunching
  • hardware support for massive parallelism
  • faster and cheaper each year
  • notoriously difficult to program

Uncomplicate

ClojureCL
take control of the GPU, CPU, and accelerators from Clojure
ClojureCUDA
take control of the Nvidia GPU
Neanderthal
vectors and matrices, but optimized for CPU and GPU
Bayadera
high performance Bayesian statistics and data analysis on the GPU

Hello world: dot product

  • One of the simplest linear algebra functions
  • Consists of two familiar FP building blocks:
    • map
    • reduce
  • Multiply the corresponding elements of two arrays (vectors) and sum the products.

Idiomatic Clojure

(let [dot-product (fn [xs ys]
                    (reduce + (map * xs ys)))

      x-vec (vec (range 100000))
      y-vec (vec (range 100000))]

  (dot-product x-vec y-vec))
333328333350000

  • Execution time: 14 ms

Neanderthal: using optimized library

(require '(uncomplicate.neanderthal
           [core :refer :all]
           [native :refer :all]))
(let [x (fv (range 100000))
      y (copy x)]

  (dot x y))
3.33328352E14

  • Execution time: 6 μs
  • fv creates a single-precision (float) vector, hence the slightly rounded result

Neanderthal: a taste of GPU

(require '[uncomplicate.clojurecuda.core :refer :all]
         '[uncomplicate.neanderthal.cuda :refer :all]
         '[uncomplicate.commons.core :refer :all])
(with-default
  (with-default-engine
    (with-release [gpu-x (cuv (range 100000))
                   gpu-y (copy gpu-x)]

      (dot gpu-x gpu-y))))
3.33328352E14

  • Execution time: 26 μs
  • Not faster than the CPU at all! Why?

A million dot products on the GPU

(with-default
  (with-default-engine
    (with-release [gpu-x (entry! (cuge 1000 100000) 0.01)
                   gpu-y (copy (trans gpu-x))
                   gpu-c (cuge 1000 1000)]

      (do (mm! 1 gpu-x gpu-y 0 gpu-c)
          (synchronize!)
          true))))
true

  • Execution time: 23 ms
  • 23 nanoseconds per 100,000-element dot product!
  • Our Clojure code started at 14 milliseconds (a 600,000× difference!)
  • 1000× faster than computing the dot products one at a time!
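These numbers are easy to sanity-check; a quick Python sketch of the arithmetic behind the slide's claims (no GPU required):

```python
# A 1,000 x 1,000 result matrix means 1,000,000 dot products in one mm! call.

def per_dot_seconds(total_seconds, n_dots):
    """Average cost of one dot product when n_dots of them take total_seconds."""
    return total_seconds / n_dots

# 1,000,000 dot products computed in 23 ms
per_dot = per_dot_seconds(23e-3, 1000 * 1000)   # 2.3e-8 s, i.e. 23 ns each

# idiomatic Clojure: 14 ms for a single 100,000-element dot product
print(14e-3 / per_dot)   # ~608,696 -> the "600,000x difference"

# solo GPU dot product: 26 us
print(26e-6 / per_dot)   # ~1,130 -> the "1000x faster" batched speedup
```

The batched speedup comes from amortizing kernel-launch and memory-latency overhead across a million dot products instead of paying it once per call.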

A glimpse of ClojureCUDA API

(init)
true
(device-count)
2
(def my-nvidia-gpu (device 0))
#'user/my-nvidia-gpu

Querying information

(info my-nvidia-gpu)
{:max-grid-dim-y 65535, :total-mem 11721506816,
 :name "GeForce GTX 1080 Ti", :max-threads-per-multiprocessor 2048,
 :max-shared-memory-per-block 49152, :compute-capability-major 6,
 :global-memory-bus-width 352, :memory-clock-rate 5505000,
 :max-threads-per-block 1024, :multiprocessor-count 28,
 :warp-size 32, :max-registers-per-block 65536
 ;;... much more data
}

Managing Context

(def ctx (context my-nvidia-gpu))
#'user/ctx
(info ctx)
{:dev-runtime-pending-launch-count 2048  :dev-runtime-sync-depth 2
 :malloc-heap-size 8388608  :stack-size 1024  :api-version 3020
 :stream-priority-range (0 -1)  :cache-config :prefer-none  :printf-fifo-size 1048576
 :device #object(jcuda.driver.CUdevice 0x12be4426 "CUdevice[nativePointer=0x0]")
 :shared-config :four-byte-bank-size}
(= ctx (current-context))
true

Memory

(def gpu-array (mem-alloc 1024))
#'user/gpu-array
(def main-array (float-array (range 256)))
#'user/main-array
(take 10 main-array)
(0 1 2 3 4 5 6 7 8 9)

Transfer data to GPU memory

(memcpy-host! main-array gpu-array)
#object[uncomplicate.clojurecuda.internal.impl.CULinearMemory 0x38701ca4 "uncomplicate.clojurecuda.internal.impl.CULinearMemory@38701ca4"]
(take 12 (memcpy-host! gpu-array (float-array 256)))
(0 1 2 3 4 5 6 7 8 9 10 11)

Compute something already!

The kernel

extern "C"
__global__ void increment(int n, float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = a[i] + 1.0f;
    }
}
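The index arithmetic is the heart of the CUDA data-parallel model: every thread computes its own global index from its block and thread coordinates, and the `i < n` guard handles the last block overhanging the data. A minimal CPU-side Python sketch that serializes what the GPU threads do in parallel (grid and block sizes here are illustrative, not a real launch configuration):

```python
def increment(grid_dim, block_dim, n, a):
    """Serialize the kernel: one loop iteration per (blockIdx.x, threadIdx.x)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # the same arithmetic as blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * block_dim + thread_idx
            if i < n:          # guard: the last block may overhang n
                a[i] += 1.0
    return a

# hypothetical launch: 2 blocks of 4 threads, but only n = 6 elements
print(increment(2, 4, 6, [0, 1, 2, 3, 4, 5]))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Threads 6 and 7 of the second block compute indices past the end of the array and do nothing, which is exactly what the guard in the kernel is for.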

The program

(def kernel-source
  "extern \"C\"
   __global__ void increment (int n, float *a) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           a[i] = a[i] + 1.0f;
       }
   }")
#'user/kernel-source

(def hello-program (compile! (program kernel-source)))
#'user/hello-program

The GPU Function

(def hello-module (module hello-program))
#'user/hello-module
(def increment (function hello-module "increment"))
#'user/increment

Running the GPU function

(launch! increment (grid-1d 256) (parameters 256 gpu-array))
nil
(take 12 (memcpy-host! gpu-array (float-array 256)))
(1 2 3 4 5 6 7 8 9 10 11 12)
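grid-1d takes the total number of work items and works out the launch dimensions: the number of blocks is the ceiling of n divided by the block size. A quick Python sketch of that arithmetic (the block size of 1024 is illustrative; ClojureCUDA chooses its own default when one isn't supplied):

```python
def grid_1d(n, block_size=1024):
    """Blocks needed so that blocks * block_size covers all n work items."""
    blocks = (n + block_size - 1) // block_size  # ceiling division
    return blocks, block_size

# 256 work items fit in a single block
print(grid_1d(256))      # (1, 1024)
# 100,000 work items need 98 blocks (98 * 1024 = 100,352 >= 100,000)
print(grid_1d(100_000))  # (98, 1024)
```

Because the grid is rounded up to whole blocks, some threads in the last block fall past the data, which is why the kernel takes n and guards with `if (i < n)`.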

Wrap-up

  • Managing low-level CUDA kernels from interactive Clojure code
  • Managing low-level third-party GPU code (cuBLAS, cuDNN, etc.) (also interactively!)
  • Mixing the two approaches! (again, interactively)

…instead of writing everything as a huge, impenetrable C++ codebase and then just calling it with dummy wrappers

Thank You

The presentation can be accessed on my blog:

Find more at: