Case Study: Monte-Carlo Simulations
Here we will provide an example of porting a Monte-Carlo random-walk simulation, such as the one shown in Figure 1, to the GPU. Following the earlier advice in this book, we start by writing a function that performs a single step of the random walk.
function mc_random_walk_step(x, sigma)
    return x + randn(typeof(x)) * sigma
end

Now imagine that we want to study some population-level statistics by running many of these walkers in parallel. Although we have chosen a very simple example, Monte-Carlo simulations are an extremely useful tool in computational science. Our aim for this exercise is to perform a number of steps of this Monte-Carlo update for many independent walkers and obtain an array containing their final positions.
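For the Gaussian step above, the distribution of the final position is known in closed form: after T steps,

\[
y \;=\; x + \sigma \sum_{t=1}^{T} \xi_t, \qquad \xi_t \sim \mathcal{N}(0, 1)
\quad\Longrightarrow\quad
y \sim \mathcal{N}\!\left(x, \; \sigma^2 T\right).
\]

Starting from x = 0, the standard deviation of the final positions should therefore be close to sigma * sqrt(steps), which gives us a quick sanity check for the implementations that follow.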
We can implement a non-allocating array version of our desired algorithm:
function mc_random_walk!(y, x, sigma, steps)
    # Copy the initial values from x into y
    y .= x
    for t in 1:steps
        # Advance every walker by one step, in place
        y .= mc_random_walk_step.(y, sigma)
    end
    return nothing
end
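If an allocating variant is more convenient at the call site, the usual Julia pattern is to wrap the in-place function; a minimal sketch (the name mc_random_walk_out is ours, not part of the code above):

# Hypothetical out-of-place convenience wrapper around the in-place version:
# it allocates the output array and forwards to mc_random_walk!.
function mc_random_walk_out(x, sigma, steps)
    y = similar(x)
    mc_random_walk!(y, x, sigma, steps)
    return y
end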
We can run our algorithm on the CPU easily:

n = 2048; x = zeros(Float32, n); y = similar(x);
sigma = 1.0f0; steps = 100;
mc_random_walk!(y, x, sigma, steps);

We can extend this to run on the GPU just by changing the types (the GPU examples assume CUDA.jl has been loaded with using CUDA, and the benchmarks below use BenchmarkTools.jl):
x_gpu = cu(x); y_gpu = similar(x_gpu);
mc_random_walk!(y_gpu, x_gpu, sigma, steps);
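Before benchmarking, it is worth checking that the GPU run produces sensible results. A minimal sketch (it uses the standard library Statistics and assumes the sigma = 1 and steps = 100 values above, so the expected spread is sqrt(100) = 10):

using Statistics

# Copy the results back to host memory for inspection
y_host = Array(y_gpu)

# The spread of the final positions should be close to sigma * sqrt(steps) ≈ 10
println(std(y_host))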
Let's benchmark the CPU version:

display(@benchmark mc_random_walk!($y, $x, $sigma, $steps))

BenchmarkTools.Trial: 9725 samples with 1 evaluation per sample.
 Range (min … max):  481.505 μs … 554.328 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     512.029 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   512.760 μs ±   7.746 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                      ▁▃██▇█▄▅▃▂▁
  ▂▁▂▂▁▁▁▁▁▂▁▂▁▂▂▂▂▂▄▅▆▇▆▆▆▅▅▅▆███████████████▇▆▅▅▄▄▄▄▃▃▃▃▃▃▂▂▂ ▄
  482 μs          Histogram: frequency by time          535 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

And now the GPU version. Note that we wrap the call in CUDA.@sync: GPU operations are launched asynchronously, so without synchronisation we would only time the kernel launch, not the computation itself.
display(@benchmark CUDA.@sync mc_random_walk!($y_gpu, $x_gpu, $sigma, $steps))

BenchmarkTools.Trial: 5210 samples with 1 evaluation per sample.
 Range (min … max):  821.190 μs …   5.750 ms  ┊ GC (min … max): 0.00% … 86.19%
 Time  (median):     942.847 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   952.241 μs ± 156.685 μs  ┊ GC (mean ± σ):  1.19% ±  5.52%

    ▂▁ ▁▁  ▄█▇▄▂
  ▂▄▇▇███▆▅████▇▇▆█████▇▆▄▄▃▃▃▂▂▂▂▁▁▁▁▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▁▁▁ ▃
  821 μs          Histogram: frequency by time         1.28 ms <

 Memory estimate: 211.44 KiB, allocs estimate: 8686.

We see that our GPU version is actually slower than the CPU version for this number of walkers, and there is a performance bug under the hood that should be addressed: every iteration of our for loop launches a new GPU kernel (one per broadcast), so a run with 100 steps launches 100 kernels.
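One way to confirm this, assuming a recent CUDA.jl where the integrated profiler is available, is to profile a single call and inspect the number of kernel launches it reports; a sketch:

# Profile one call of the un-fused version; the host-side trace should show
# roughly `steps` separate kernel launches (one per broadcast in the loop).
CUDA.@profile mc_random_walk!(y_gpu, x_gpu, sigma, steps)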
Instead, we should fuse these kernels together:

function mc_random_walk(x, sigma, steps)
    for t in 1:steps
        x = mc_random_walk_step(x, sigma)
    end
    return x
end

function mc_random_walk_fused!(y, x, sigma, steps)
    y .= mc_random_walk.(x, sigma, steps)
    return nothing
end

Now we can benchmark this again:
display(@benchmark CUDA.@sync mc_random_walk_fused!($y_gpu, $x_gpu, $sigma, $steps))

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  417.505 μs …   1.375 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     446.091 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   470.362 μs ±  70.401 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▆▇█▇▆▇▆▄▃▂▂▁▂▁▁▂▂▃▃▂▂▂▁ ▁▁▁▁ ▁▁                              ▂
  █████████████████████████████████▆▇▆▅▄▆▅▅▆▁▆▄▅▆▅▇▆▆▆▆▅▅▇▇▇▇█ █
  418 μs        Histogram: log(frequency) by time        798 μs <

 Memory estimate: 2.19 KiB, allocs estimate: 86.

By fusing all the kernel calls into a single launch we minimise the scheduling overhead and allow the GPU cores to stay busy for the duration of the computation. This small change dramatically improved the performance of our GPU code: the median time drops from roughly 943 μs to 446 μs, now comfortably ahead of the CPU version.
Note that if each kernel performs a large amount of work (for example a large matrix multiplication), the overhead of launching multiple kernels is only a small fraction of the total time, and fusion matters much less.
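A rough way to gauge the per-launch overhead on a given system is to time a trivial broadcast kernel on its own; a sketch (the exact numbers will depend on your GPU and driver):

# Time a single, almost-free kernel launch; the synchronised wall time is
# dominated by launch and scheduling overhead rather than by the arithmetic.
display(@benchmark CUDA.@sync ($y_gpu .= $x_gpu .+ 1f0))

Multiplying that per-launch cost by the number of steps gives a rough idea of the overhead removed by fusing.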
Custom Kernel
It is a good exercise to write a custom kernel and compare its execution time to the broadcasted version. Our kernel will be straightforward:
function _mc_random_walk_cuda!(y, x, sigma, steps)
    # Global thread index (1-based)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Guard against threads beyond the end of the array
    if i > length(y)
        return nothing
    end
    @inbounds pos = x[i]
    for t in 1:steps
        # Same update as mc_random_walk_step, applied to this thread's walker
        pos += randn(typeof(pos)) * sigma
    end
    @inbounds y[i] = pos
    return nothing
end
function mc_random_walk_gpu!(y, x, sigma, steps)
    n = length(y)
    @assert n == length(x)
    # Compile the kernel without launching it, so we can query a good launch configuration
    kernel = @cuda name="mc_walk" launch=false _mc_random_walk_cuda!(y, x, sigma, steps)
    config = launch_configuration(kernel.fun)
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    kernel(y, x, sigma, steps; threads=threads, blocks=blocks)
    return nothing
end
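If you are curious which configuration the occupancy API picks for this kernel on your device, you can query it directly; a minimal sketch using the same calls as in mc_random_walk_gpu! above:

# Compile without launching and inspect the suggested launch configuration.
kernel = @cuda launch=false _mc_random_walk_cuda!(y_gpu, x_gpu, sigma, steps)
config = launch_configuration(kernel.fun)
@show config.threads config.blocks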
We can now benchmark on the same data:

display(@benchmark CUDA.@sync mc_random_walk_gpu!($y_gpu, $x_gpu, $sigma, $steps))

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  414.685 μs …   1.092 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     437.232 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   467.754 μs ±  78.637 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▆██▆▅▄▂▂      ▁▂▂▃▃▃▂▂▁▁                                   ▁ ▂
  █████████████████████████▇▇█████▇▆▆▆▅▄▄▅▅▅▆▆▆▅▆▅▆▆▆▆▅▅▆▇██▇▇ █
  415 μs        Histogram: log(frequency) by time        801 μs <

 Memory estimate: 416 bytes, allocs estimate: 22.

We can see that our custom kernel performed on par with the simpler broadcasted approach. Writing a custom kernel is not necessary to achieve good performance here: the compiler already generates fast GPU code from the array notation. It may be possible to tune the hand-written kernel to pull ahead of the broadcasted implementation, but that would require more effort on our part.
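One common direction for such tuning, shown here only as a sketch, is a grid-stride loop, where each thread processes several walkers instead of exactly one; whether this helps depends on the problem size and the GPU:

# Hypothetical grid-stride variant of the kernel above: each thread walks
# over the array with a stride equal to the total number of launched threads.
function _mc_random_walk_cuda_stride!(y, x, sigma, steps)
    start  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in start:stride:length(y)
        @inbounds pos = x[i]
        for t in 1:steps
            pos += randn(typeof(pos)) * sigma
        end
        @inbounds y[i] = pos
    end
    return nothing
end

This variant can be launched with fewer blocks than walkers (for example the block count suggested by launch_configuration), which can reduce scheduling work for very large arrays.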