A Common Gotcha with Asynchronous GPU Computing


September 18, 2019

For the most part, computing libraries such as Neanderthal abstract away the complexity of GPU computing. When you have high-level linear algebra operations, such as mm! (matrix multiply), nrm2 (vector/matrix norm), or sum, you do not have to worry about compiling kernels, loading modules, sending operations to the right stream in the right context, handling stream execution errors, and a bunch of other details. The code we write and test on the CPU can be executed on the GPU!

However, there is one important thing we have to keep in mind: GPU computing routines are asynchronous by default! Most of the time we see the benefits of such behavior, but in some common situations, it can surprise even an experienced developer (yes, humans are fallible).
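To build an intuition for this fire-and-forget behavior before we touch the GPU, here is a plain-Clojure analogy of my own (it has nothing to do with Neanderthal's API): launching a future returns immediately, much like enqueueing a GPU operation, while dereferencing it blocks until the work is done, much like an explicit stream synchronization.

```clojure
;; A plain-Clojure sketch of asynchronous dispatch (no GPU required).
;; Creating the future returns immediately, like enqueueing a GPU kernel;
;; dereferencing it blocks until the computation finishes.
(let [work (future (reduce + (range 100000000)))]
  (println "request accepted; the sum is still computing...")
  ;; @work blocks here until the result is ready
  (println "result:" @work))
```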

Suppose we have some code that uses a costly matrix multiplication operation.

(with-release [cpu-a (entry! (fge 100 100) 0.01)
               cpu-b (entry! (fge 100 100) 0.02)
               cpu-c (entry! (fge 100 100) 0.02)]
  (time (dotimes [i 100]
          (mm! 1.0 cpu-a cpu-b 0.0 cpu-c))))

"Elapsed time: 1.153745 msecs"


We called the mm! operation 100 times in a loop and measured the execution time: a bit over 1 millisecond in total, or around 11 microseconds per operation.

We decided that, although it is quite fast, it is not fast enough for our customer's requirements. There is an opportunity to use powerful GPU accelerators, and we read in their specs that they can achieve quite a performance boost over Intel CPUs. Since we barely have to adjust the code to run it on the GPU, we quickly write a few tests to see whether the platform switch is worth the effort.

The following code is a port of the previous example to the Nvidia GPU. It is a bit different, since the GPU context has to be created and managed. It is possible to abstract these differences away further, but I wanted to keep this example straightforward and obvious.

(with-default
  (with-default-engine
    (with-release [gpu-a (entry! (cuge 100 100) 0.01)
                   gpu-b (entry! (cuge 100 100) 0.02)
                   gpu-c (entry! (cuge 100 100) 0.02)]
      ;; synchronize! makes sure that the measurement is not initiated
      ;; until the GPU stream is ready
      (synchronize!)
      (time
       (dotimes [i 100]
         (mm! 1.0 gpu-a gpu-b 0.0 gpu-c))))))

"Elapsed time: 0.796928 msecs"


With the same matrix size and computational complexity, we got only a modest speedup. What is the problem? We expected a boost! We ask around and learn that communication with the GPU driver carries a fixed overhead, so the GPU pays off only for more expensive operations.

OK, then; let's give it some monster matrices.

(with-default
  (with-default-engine
    (with-release [gpu-a (entry! (cuge 10000 10000) 0.01)
                   gpu-b (entry! (cuge 10000 10000) 0.02)
                   gpu-c (entry! (cuge 10000 10000) 0.02)]
      (synchronize!)
      (time
       (dotimes [i 100]
         (mm! 1.0 gpu-a gpu-b 0.0 gpu-c))))))

"Elapsed time: 2.232541 msecs"


Whoa, we increased the matrix dimensions a hundredfold, and the GPU didn't even break a sweat. It finished in almost the same time as before!

What happens on the CPU? Do large matrices slow it down?

(with-release [cpu-a (entry! (fge 10000 10000) 0.01)
               cpu-b (entry! (fge 10000 10000) 0.02)
               cpu-c (entry! (fge 10000 10000) 0.02)]
  (time (mm! 1.0 cpu-a cpu-b 0.0 cpu-c)))

"Elapsed time: 6557.759385 msecs"


We excitedly tell everyone that we have decided to transfer all computations to the GPU, since what took days to complete on the CPU will, according to our calculations, complete in seconds.

A skeptical colleague is not so sure!

We are reminded that (many) GPU operations are asynchronous.

Our mm! operation does have the same signature on the GPU and the CPU, but while on the CPU it is synchronous, on the GPU it is asynchronous. Simply put, on the GPU, calling mm! means "I acknowledge that I have received your request for matrix multiplication and have put it in the computation queue", not "I have completed the matrix multiplication that you requested", as it means on the CPU.

When we measured it in the last example, we just measured how fast the GPU can receive the requests for the mm! operation: about 22 microseconds each. Not surprisingly, the size of the arguments does not make much difference there.

To measure how fast the GPU can complete the operation, we must synchronize the queue.

(with-default
  (with-default-engine
    (with-release [gpu-a (entry! (cuge 10000 10000) 0.01)
                   gpu-b (entry! (cuge 10000 10000) 0.02)
                   gpu-c (entry! (cuge 10000 10000) 0.02)]
      (synchronize!)
      (time
       (dotimes [i 100]
         (mm! 1.0 gpu-a gpu-b 0.0 gpu-c)
         (synchronize!))))))

"Elapsed time: 17117.724387 msecs"


Ouch! Instead of 2 milliseconds for 100 invocations, we get 17 seconds. That's 171 milliseconds per operation, not 22 microseconds. Disappointed? We should not be! It is almost 40 times faster than the 6.5 seconds we measured on the CPU. For large matrices, it usually does make sense to use the GPU; for medium-sized ones, it depends.
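As a quick sanity check, some back-of-the-envelope arithmetic (assuming the standard estimate of roughly 2n³ floating-point operations for an n×n matrix multiplication) shows that 171 milliseconds per operation is a very respectable rate:

```clojure
;; rough throughput estimate for the synchronized GPU measurement
(let [n 10000
      flops (* 2.0 n n n)          ;; ~2e12 FLOPs per mm! call
      seconds (/ 17.117724 100)]   ;; ~171 ms per call, measured above
  (/ flops seconds 1e12))          ;; roughly 11.7 TFLOPS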

If we only work with small-ish data, it never pays off to use the GPU. Never? Well, that depends, too. We can group small chunks into larger units that can be computed in batches, but that is out of the scope of this article.

Anyway, suppose we have decided that it pays off to add a GPU engine to our project. The code is similar, but there are enough differences. The call to synchronize! bothers us, since it is CUDA-specific.

Fortunately, most of the time we do not need to synchronize streams, and even when we do have to, we do not have to do it explicitly. There are operations, available on both the CPU and the GPU, that implicitly synchronize the stream.

For example, summary matrix and vector operations such as sum, asum, nrm2, and dot, which compute a scalar result, implicitly synchronize the stream. The operations that transfer data between the CPU and the GPU are synchronous by default, too.

(with-default
  (with-default-engine
    (with-release [gpu-a (entry! (cuge 10000 10000) 0.01)
                   gpu-b (entry! (cuge 10000 10000) 0.02)
                   gpu-c (entry! (cuge 10000 10000) 0.02)]
      ;; (synchronize!) is no longer needed
      (time
       (dotimes [i 100]
         (mm! 1.0 gpu-a gpu-b 0.0 gpu-c)
         ;; asum computes a scalar, which implicitly synchronizes the stream
         (asum gpu-c))))))

"Elapsed time: 17211.007038 msecs"