Case Study: Monte-Carlo Simulations
Here, we will provide an example of porting a Monte-Carlo random-walk simulation to the GPU, such as the one shown in Figure 1. Following the earlier advice in this book, we start by writing a function that performs a single step of the random walk.
function mc_random_walk_step(x, sigma)
    return x + randn(typeof(x)) * sigma
end
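As a quick illustration (a usage sketch, not from the original text), a single step simply adds one Gaussian increment to the current position; the exact value differs on every call because it is random:

x0 = 0.0f0
# x1 is x0 plus one Gaussian increment with standard deviation 1
x1 = mc_random_walk_step(x0, 1.0f0)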
Now imagine that we want to study some population-level statistics by running many of these walkers in parallel. Although we have chosen a very simple example, Monte-Carlo simulations are an extremely useful tool in computational science. Our aim for this exercise is to perform a number of steps of this Monte-Carlo update for many independent walkers and obtain an array containing their final positions.
We can implement a non-allocating array version of our desired algorithm:
function mc_random_walk!(y, x, sigma, steps)
    # Copy the initial values from x into y
    y .= x
    for t in 1:steps
        # Update each walker from its current position in y,
        # not from the initial positions in x
        y .= mc_random_walk_step.(y, sigma)
    end
    return nothing
end
We can run our algorithm on the CPU easily:
n = 2048; x = zeros(Float32, n); y = similar(x);
sigma = 1.0f0; steps = 100;
mc_random_walk!(y, x, sigma, steps);
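Since our stated aim is population-level statistics, a quick sanity check is worthwhile (a sketch using the standard-library Statistics package, not shown in the original text): for independent Gaussian steps, the final positions should have a mean close to zero and a standard deviation close to sigma * sqrt(steps).

using Statistics

mc_random_walk!(y, x, sigma, steps)
# With sigma = 1.0f0 and steps = 100, std(y) should be close to
# sigma * sqrt(steps) = 10, and mean(y) close to 0.
println(mean(y), " ", std(y))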
We can extend this to run on the GPU just by changing the types:
x_gpu = cu(x); y_gpu = similar(x_gpu);
mc_random_walk!(y_gpu, x_gpu, sigma, steps);
Let’s benchmark the CPU version:
display(@benchmark mc_random_walk!($y, $x, $sigma, $steps))
And the GPU version:
display(@benchmark CUDA.@sync mc_random_walk!($y_gpu, $x_gpu, $sigma, $steps))
BenchmarkTools.Trial: 5902 samples with 1 evaluation per sample.
Range (min … max): 803.100 μs … 1.264 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 838.400 μs ┊ GC (median): 0.00%
Time (mean ± σ): 844.387 μs ± 33.240 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▅▁ ▁▂▂ ▁▁
▄█████▇████▇▇▆▇▇▇▆▆▅▇██▇▆▅▅▇█▇▆▄▄▄▄▄▄▃▃▃▃▂▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▄
803 μs Histogram: frequency by time 952 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
We see that our GPU version is actually faster (for this number of walkers); however, there is a performance bug under the hood that should be addressed. On each iteration of our for loop, we launch a new kernel. Instead, we should fuse these kernels together:
function mc_random_walk(x, sigma, steps)
    for t in 1:steps
        x = mc_random_walk_step(x, sigma)
    end
    return x
end

function mc_random_walk_fused!(y, x, sigma, steps)
    y .= mc_random_walk.(x, sigma, steps)
    return nothing
end
Now we can benchmark this again:
display(@benchmark CUDA.@sync mc_random_walk_fused!($y_gpu, $x_gpu, $sigma, $steps))
BenchmarkTools.Trial: 4192 samples with 1 evaluation per sample.
Range (min … max): 874.400 μs … 7.485 ms ┊ GC (min … max): 0.00% … 86.19%
Time (median): 1.001 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.189 ms ± 411.545 μs ┊ GC (mean ± σ): 0.99% ± 4.85%
▂▆▇██▆▅▅▄▃▂▂▁▁▁▁▁ ▁▂▃▃▃▂▂▂▁▁▁ ▂
██████████████████▇▇▇▇▅▆▆▄▅▅▅▃▃▅▃▇█████████████▅▆▆▇▆▅▅▃▅▃▆▄▆▅ █
874 μs Histogram: log(frequency) by time 2.48 ms <
Memory estimate: 211.44 KiB, allocs estimate: 8686.
We have fused all the kernel calls together. This minimises the scheduling overhead and allows the GPU cores to stay busy for the duration of the computation. This small change allowed us to dramatically improve the performance of our GPU code.
Note that if the kernels themselves are very large (e.g. a large matrix multiply), the overhead of launching multiple kernels is only a small fraction of the total time.
Custom Kernel
It is a good exercise to write a custom kernel and compare its execution time against the broadcasted version. Our kernel is straightforward:
function _mc_random_walk_cuda!(y, x, sigma, steps)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i > length(y)
        return nothing
    end
    @inbounds pos = x[i]
    for t in 1:steps
        # Each step adds a Gaussian increment scaled by sigma,
        # matching the broadcasted implementation
        pos += randn(typeof(pos)) * sigma
    end
    @inbounds y[i] = pos
    return nothing
end
function mc_random_walk_gpu!(y, x, sigma, steps)
    n = length(y)
    @assert n == length(x)
    kernel = @cuda name="mc_walk" launch=false _mc_random_walk_cuda!(y, x, sigma, steps)
    config = launch_configuration(kernel.fun)
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    kernel(y, x, sigma, steps; threads=threads, blocks=blocks)
    return nothing
end
We can now benchmark on the same data:
display(@benchmark CUDA.@sync mc_random_walk_gpu!($y_gpu, $x_gpu, $sigma, $steps))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 410.000 μs … 1.565 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 438.700 μs ┊ GC (median): 0.00%
Time (mean ± σ): 473.074 μs ± 102.420 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▆▆▆▅▄▃▂▂▁▂▂▂▂▂▂▁▁▁ ▁ ▂
████████████████████████▇█▇█▇▇▇▆▆▆▆▆▇▆▆▆▆▆▄▅▅▅▄▅▅▆▆▅▄▅▄▂▅▆▆▇▇ █
410 μs Histogram: log(frequency) by time 977 μs <
Memory estimate: 2.20 KiB, allocs estimate: 87.
We can see that our custom kernel did not perform as well as the simpler broadcasted approach. Writing a custom kernel is clearly not always necessary to achieve performance gains; we can rely on the compiler to generate fast GPU code from the array notation. It would be entirely possible to rewrite our kernel to match the performance of the broadcasted implementation, but this would require more effort on our part.
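As a closing check (a sketch, not part of the original text), we can confirm that the broadcasted and custom-kernel implementations agree statistically, even though the individual random draws differ between runs:

using Statistics

mc_random_walk_fused!(y_gpu, x_gpu, sigma, steps)
fused_std = std(Array(y_gpu))

mc_random_walk_gpu!(y_gpu, x_gpu, sigma, steps)
kernel_std = std(Array(y_gpu))

# Both should be close to sigma * sqrt(steps); the exact values vary
# from run to run because the increments are random.
println(fused_std, " ", kernel_std)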