Benchmarking & Profiling
Benchmarking GPU code can be a bit tricky. Launching a CUDA kernel is asynchronous: the call simply submits the kernel to the GPU and returns without waiting for it to finish executing. Let’s take the following example:
using CUDA, BenchmarkTools
a = CUDA.rand(256, 256); b = similar(a);
display(@benchmark begin $b .= $a .* $a end)
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
Range (min … max): 5.543 μs … 641.615 μs ┊ GC (min … max): 0.00% … 98.07%
Time (median): 5.671 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.905 μs ± 7.817 μs ┊ GC (mean ± σ): 1.82% ± 1.38%
▂▅▇███▇▆▄▃▂ ▁▁▂▁▁ ▁▂▁▁ ▁▂▂▂▂▂▁▁▂▁▂▁▁ ▂
▆████████████████████▇▇████████████████████████▇▇▇▇▇▅▅▅▃▄▃▄ █
5.54 μs Histogram: log(frequency) by time 6.75 μs <
Memory estimate: 3.25 KiB, allocs estimate: 115.
This measures only the time taken to launch the kernel, not the time the kernel takes to execute. To include the execution time, we can re-run the benchmark with the CUDA.@sync macro, which waits until the GPU has finished:
display(@benchmark CUDA.@sync begin $b .= $a .* $a end)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 10.609 μs … 63.151 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 11.257 μs ┊ GC (median): 0.00%
Time (mean ± σ): 11.372 μs ± 1.515 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▅▆▇█▆▂
▂▂▂▂▂▂▂▂▂▃▄▆████████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
10.6 μs Histogram: frequency by time 12.9 μs <
Memory estimate: 3.25 KiB, allocs estimate: 115.
This shows that the kernel actually takes about twice as long as the launch alone: the median rises from 5.671 μs to 11.257 μs. CUDA.jl also provides the CUDA.@elapsed and CUDA.@time macros, which are useful for measuring the performance of your code.
Remember that when you are benchmarking GPU code, you should wrap the code in CUDA.@sync so that you measure the total execution time of a CUDA kernel rather than just its launch.
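As a minimal sketch of the two macros mentioned above, the snippet below times the same broadcast with CUDA.@elapsed and CUDA.@time. It assumes a working CUDA-capable GPU; the warm-up call and the variable name `t` are illustrative additions, not part of the original example.

```julia
using CUDA

a = CUDA.rand(256, 256)
b = similar(a)

# Warm-up call (an addition for this sketch) so that compilation time
# is not included in the measurements below.
CUDA.@sync begin
    b .= a .* a
end

# CUDA.@elapsed discards the result and returns the GPU time in seconds.
t = CUDA.@elapsed begin
    b .= a .* a
end
println("Elapsed: ", t * 1e6, " μs")

# CUDA.@time synchronizes and prints the elapsed time together with
# CPU and GPU allocation statistics.
CUDA.@time begin
    b .= a .* a
end
```

Unlike @benchmark, both macros time a single execution, so they are better suited to quick checks than to collecting statistics.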
Application Profiling
When you have a larger application with multiple calls to custom CUDA kernels, it is often more informative to profile the entire application. We can use NVIDIA Nsight Systems to do this. The CUDA.jl documentation [1] suggests installing Nsight Systems directly from the NVIDIA website [2].
Example: Ensure that your Nsight Systems installation is working correctly by typing the following into the terminal:
nsys --version
On Windows, you will need to run the terminal as an Administrator and also ensure that the folder with nsys.exe is in your system PATH environment variable.
Nsight Systems can be used to find the bottlenecks in your application and to diagnose whether kernels are being launched too frequently, which hurts performance.
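A script to profile might look like the sketch below. The file name `profile_example.jl` and the loop count are hypothetical, and the `external=true` keyword of CUDA.@profile is an assumption that holds for recent CUDA.jl releases; older versions may expect a plain `CUDA.@profile` instead, so check the documentation for your installed version.

```julia
# profile_example.jl (hypothetical file name)
using CUDA

a = CUDA.rand(256, 256)
b = similar(a)

# Warm-up so that compilation does not dominate the trace.
CUDA.@sync begin
    b .= a .* a
end

# Assumption: `external=true` marks this region for an external profiler
# such as Nsight Systems instead of using CUDA.jl's built-in profiler.
CUDA.@profile external=true begin
    for _ in 1:100
        b .= a .* a   # many small launches, visible individually in the timeline
    end
    CUDA.synchronize()
end
```

The script would then be run under the profiler from the terminal, for example with `nsys profile julia profile_example.jl`, and the resulting report opened in the Nsight Systems GUI. A timeline showing many short kernels separated by gaps is a typical sign that launch overhead, rather than computation, is limiting performance.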