Tips
To fully utilise the additional compute a GPU provides, we must keep in mind some common causes of poor performance.
Copying Data
Remember that a GPU has its own local memory, and any data processed by the GPU must first be transferred into that memory. To make the terms clear, we refer to the main memory the CPU has access to as the host memory; you will often see the CPU (and the rest of the machine) referred to as the host. The GPU only has direct access to its own memory, called device memory. Copying data from the host to the device, and vice versa, can take a long time relative to the computation itself. This is why it is strongly recommended to keep data on the GPU and do all possible processing there until the results are needed back on the CPU.
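As a minimal sketch (assuming the CUDA.jl package and an NVIDIA GPU; the array names are illustrative), the host/device transfers look like this:

```julia
using CUDA

x_host = rand(Float32, 10_000)   # lives in host (CPU) memory
x_dev  = CuArray(x_host)         # copy host -> device (relatively slow)

y_dev = sqrt.(x_dev) .+ 1f0      # computed entirely in device memory

y_host = Array(y_dev)            # copy device -> host, only when needed
```

Every `CuArray`/`Array` conversion triggers a transfer, so each one you can eliminate is a win.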
If you have a function that performs much better on the GPU, you must consider whether it is worth transferring the data to the GPU, processing it there, and then transferring it back to the CPU. Sometimes the speed-up is significant enough that the copies are worth the cost. However, consider streamlining the pipeline so that all of the processing happens on the GPU. Even if one stage is actually faster on the CPU, it may be worth implementing a GPU version of it simply to keep the data on the device and avoid the copies.
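For instance (again a sketch assuming CUDA.jl; `clamp` stands in for any stage that happens to run faster on the CPU), round-tripping through the host for a single stage costs two transfers:

```julia
# Round-tripping through the host for one stage (avoid this):
tmp = Array(x_dev)               # device -> host copy
tmp .= clamp.(tmp, 0f0, 1f0)     # stage that is individually fast on the CPU
x_dev = CuArray(tmp)             # host -> device copy

# Better: run the stage on the GPU too, keeping the data on the device:
x_dev .= clamp.(x_dev, 0f0, 1f0)
```

Even if the device version of the stage is somewhat slower in isolation, avoiding the two transfers often makes the pipeline faster overall.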
Launching Kernels
An individual kernel launch has quite a high latency, especially when compared with launching a CPU thread. While the CPU is constructing the kernel launch and sending it to the GPU for scheduling, the GPU may sit idle. Launching many small kernels therefore incurs a large overhead and can leave the GPU idle for much of the time. When profiling with the NVIDIA tools, this is a key performance problem to look out for.
To remedy this, fuse as many kernel calls together as possible, grouping the work into larger items. For example, if you are using array programming, make full use of fused broadcasting:
y .= sin.(x)
y .*= x
y .+= x .^ 3
y .= exp.(y)
# use a single statement instead
y .= exp.(x.*sin.(x) .+ x .^ 3)
While splitting the operations across multiple lines may look better in the source code, it is often better to fuse them into a single statement: the first snippet launches four kernels, whereas the single statement launches only one. If the expression becomes too long, write a function that operates on a single element and broadcast it across the entire array:
_f(x) = exp(x*sin(x) + x * x * x)
y .= _f.(x)
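Julia's `@.` macro is also useful here: it converts every call and assignment in an expression into its dotted, fused form, which avoids accidentally leaving the dot off one operation and splitting the fusion:

```julia
# Equivalent to the fused single-statement version above:
@. y = exp(x * sin(x) + x^3)
```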
We will give an example of this optimisation in the case study section, where we load our kernels with more work.