Going faster than TensorFlow on the GPU with Clojure (GTX 1080Ti)

November 2, 2020

A few weeks ago, I showed you how simple Clojure's Deep Diamond is, even compared to Keras. I also mentioned that it's superfast. Here's how fast it is on the GPU!

TL;DR Much faster than Keras+TensorFlow on the GPU, too!

In the previous article, we compared the libraries only on the CPU. Deep Diamond was considerably faster: 368 seconds vs. 509 seconds. Most readers were intrigued but, being skeptical as they should be, they complained that CPU performance doesn't matter anyway, since everybody uses GPUs for training convolutional networks. Let's do the GPU comparison, then.

Both Deep Diamond and Keras with TensorFlow use Nvidia's cuDNN low-level performance library under the hood, so any difference is due to the higher-level implementation.

Deep Diamond completes this training in 21 seconds, while Keras + TensorFlow takes 35 seconds. The gap even increased in favor of Deep Diamond! The ratio is now 1.67, up from 1.38 on the CPU.
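
Both ratios follow directly from the reported running times. Here is a quick check in Python; the dictionary and function names are mine, introduced only for this illustration:

```python
# Wall-clock training times reported in this article and the previous one, in seconds.
cpu_seconds = {"Deep Diamond": 368, "Keras + TensorFlow": 509}
gpu_seconds = {"Deep Diamond": 21, "Keras + TensorFlow": 35}


def slowdown(times):
    """How many times longer Keras + TensorFlow takes than Deep Diamond."""
    return times["Keras + TensorFlow"] / times["Deep Diamond"]


cpu_ratio = slowdown(cpu_seconds)
gpu_ratio = slowdown(gpu_seconds)
print(f"CPU: {cpu_ratio:.2f}x, GPU: {gpu_ratio:.2f}x")
# prints "CPU: 1.38x, GPU: 1.67x"
```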

Keras CNN in Python

I repeat the relevant model code for reference. We're interested in the running time of model.fit, with minimal verbosity, for 12 epochs. I'm using Nvidia's GTX 1080Ti GPU. The Keras code is taken from the official Keras examples.

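import time

import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.optimizers import Adam

# x_train, y_train and num_classes come from the data-preparation code,
# which is omitted here.

# Two convolutional layers, max pooling, and two dense layers.
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=Adam(learning_rate=0.01),
              metrics=['accuracy'])

# Measure the wall-clock time of model.fit only.
s = time.time_ns()
model.fit(x_train, y_train,
          batch_size=128,
          verbose=2,
          epochs=12)
e = time.time_ns()
print((e-s)/(10**9), " seconds")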
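
If you want to repeat this kind of measurement for other calls, the start/stop bookkeeping can be wrapped in a small helper. This is only a sketch: time_call is a hypothetical name of mine, not part of Keras, and it uses time.perf_counter from the Python standard library in place of time.time_ns.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the benchmark above, fn would be model.fit.
result, seconds = time_call(sum, range(1_000_000))
print(f"{seconds:.3f} seconds")
```

With the model above, the call would look like time_call(model.fit, x_train, y_train, batch_size=128, verbose=2, epochs=12).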

Deep Diamond CNN in Clojure

In Clojure, we're measuring the runtime of the train function.

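;; Network blueprint: input batches of 128 single-channel 28x28 images,
;; stored as floats in NCHW layout.
(defonce net-bp
  (network (desc [128 1 28 28] :float :nchw)
           [(convo [32] [3 3] :relu)
            (convo [64] [3 3] :relu)
            (pooling [2 2] :max)
            (dropout)
            (dense [128] :relu)
            (dropout)
            (dense [10] :softmax)]))

;; Instantiate the blueprint with the Adam optimizer and initialize the network.
(defonce net (init! (net-bp :adam)))

;; Time the training: 12 epochs with the cross-entropy cost.
(time (train net train-images y-train :crossentropy 12 []))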

The books

The book Deep Learning for Programmers: An Interactive Tutorial with CUDA, OpenCL, DNNL, Java, and Clojure teaches the nuts and bolts of neural networks and deep learning by showing you how Deep Diamond is built, from scratch, in interactive sessions. Each line of code can be executed and the results inspected in the plain Clojure REPL. The best way to master something is to build it yourself!

It's simple. But fast and powerful!

Please subscribe, read the drafts, get the full book soon, and support my work on this free open source library.