Case Study: Monte-Carlo Simulations

Figure 1: An example of a continuous-space, discrete-time random walk, starting at the origin.

Here, we will provide an example of porting a Monte-Carlo random walk simulation, such as the one shown in Figure 1, to the GPU. Following the earlier advice in our book, we start by writing a function that performs a single step of the random walk.

julia
# Take a single Gaussian step of standard deviation sigma from position x
function mc_random_walk_step(x, sigma)
    return x + randn(typeof(x)) * sigma
end

Now imagine that we want to study some population-level statistics by running many of these walkers in parallel. Although we have chosen a very simple example, Monte-Carlo simulations are an extremely useful tool in computational science. Our aim for this exercise is to perform a number of steps of this Monte-Carlo update for many independent walkers and obtain an array containing their final positions.

We can implement a non-allocating array version of our desired algorithm:

julia
function mc_random_walk!(y, x, sigma, steps)
    # Copy the initial values from x into y
    y .= x
    for t in 1:steps
        # Advance every walker by one step, updating y in place
        y .= mc_random_walk_step.(y, sigma)
    end
    return nothing
end

We can run our algorithm on the CPU easily:

julia
n = 2048; x = zeros(Float32, n); y = similar(x);
sigma = 1.0f0; steps = 100;
mc_random_walk!(y, x, sigma, steps);
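
As a quick sanity check (an illustrative aside rather than part of the benchmarks below), each walker accumulates `steps` independent Gaussian increments, so the final positions should have a standard deviation of roughly sigma * sqrt(steps) = 10 for the values above:

julia
using Statistics

mc_random_walk!(y, x, sigma, steps)
std(y)   # ≈ sigma * sqrt(steps) ≈ 10, up to Monte-Carlo error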

We can extend this to run on the GPU just by changing the types:

julia
x_gpu = cu(x); y_gpu = similar(x_gpu);
mc_random_walk!(y_gpu, x_gpu, sigma, steps);
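
To inspect the results on the host we can copy them back; the device array itself should not be read element-by-element from the CPU. For example:

julia
Array(y_gpu)   # copies the device array back into a regular CPU Array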

Let’s benchmark the CPU version:

julia
display(@benchmark mc_random_walk!($y, $x, $sigma, $steps))
Output
BenchmarkTools.Trial: 5902 samples with 1 evaluation per sample.
 Range (min … max):  803.100 μs …  1.264 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     838.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   844.387 μs ± 33.240 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆█▅▁  ▁▂▂           ▁▁                                       
  ▄█████▇████▇▇▆▇▇▇▆▆▅▇██▇▆▅▅▇█▇▆▄▄▄▄▄▄▃▃▃▃▂▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▄
  803 μs          Histogram: frequency by time          952 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

And the GPU version:

julia
display(@benchmark CUDA.@sync mc_random_walk!($y_gpu, $x_gpu, $sigma, $steps))
Output
BenchmarkTools.Trial: 4192 samples with 1 evaluation per sample.
 Range (min … max):  874.400 μs …   7.485 ms  ┊ GC (min … max): 0.00% … 86.19%
 Time  (median):       1.001 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.189 ms ± 411.545 μs  ┊ GC (mean ± σ):  0.99% ±  4.85%

  ▂▆▇██▆▅▅▄▃▂▂▁▁▁▁▁                 ▁▂▃▃▃▂▂▂▁▁▁                 ▂
  ██████████████████▇▇▇▇▅▆▆▄▅▅▅▃▃▅▃▇█████████████▅▆▆▇▆▅▅▃▅▃▆▄▆▅ █
  874 μs        Histogram: log(frequency) by time       2.48 ms <

 Memory estimate: 211.44 KiB, allocs estimate: 8686.

Comparing the two benchmarks, we see that for this number of walkers our GPU version is actually slower than the CPU version. There is a performance bug under the hood that should be addressed: every iteration of our for loop launches a new kernel. Instead, we should try to fuse these kernels together:

julia
# Walk a single walker forward by `steps` steps
function mc_random_walk(x, sigma, steps)
    for t in 1:steps
        x = mc_random_walk_step(x, sigma)
    end
    return x
end

# Broadcasting the whole walk produces a single fused kernel
function mc_random_walk_fused!(y, x, sigma, steps)
    y .= mc_random_walk.(x, sigma, steps)
    return nothing
end
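
Before benchmarking, we can repeat the earlier sanity check (again purely illustrative) to confirm that the fused version produces walkers with the expected spread:

julia
mc_random_walk_fused!(y_gpu, x_gpu, sigma, steps)
std(Array(y_gpu))   # again ≈ sigma * sqrt(steps) ≈ 10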

Now we can benchmark the fused version:

julia
display(@benchmark CUDA.@sync mc_random_walk_fused!($y_gpu, $x_gpu, $sigma, $steps))
Output
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  410.000 μs …   1.565 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     438.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   473.074 μs ± 102.420 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▆▆▅▄▃▂▂▁▂▂▂▂▂▂▁▁▁ ▁                                        ▂
  ████████████████████████▇█▇█▇▇▇▆▆▆▆▆▇▆▆▆▆▆▄▅▅▅▄▅▅▆▆▅▄▅▄▂▅▆▆▇▇ █
  410 μs        Histogram: log(frequency) by time        977 μs <

 Memory estimate: 2.20 KiB, allocs estimate: 87.

We have fused all the kernel calls together. This minimises the amount of overhead for scheduling, and allows the GPU cores to stay busy for the duration of the computation. This small change allowed us to dramatically improve the performance of our GPU code.
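
If we want to see this difference in kernel launches directly, CUDA.jl's integrated profiler (available in recent versions of the package) reports every launch; the exact output will depend on your setup:

julia
CUDA.@profile mc_random_walk!(y_gpu, x_gpu, sigma, steps)        # roughly one kernel launch per step
CUDA.@profile mc_random_walk_fused!(y_gpu, x_gpu, sigma, steps)  # a single fused kernel launch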

If each kernel launch performs a large amount of work (e.g. a large matrix multiplication), the overhead of launching multiple kernels is only a small fraction of the total time, and fusing them matters much less.
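
As a rough illustration of this point (the sizes here are chosen arbitrarily), a tiny element-wise kernel is dominated by launch and synchronisation overhead, whereas a large matrix multiplication is not:

julia
A = CUDA.rand(Float32, 4096, 4096); B = CUDA.rand(Float32, 4096, 4096);
display(@benchmark CUDA.@sync ($y_gpu .= $x_gpu .+ 1f0))   # dominated by launch overhead
display(@benchmark CUDA.@sync ($A * $B))                   # launch overhead is a tiny fraction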

Custom Kernel

It is a good exercise to write a custom kernel and compare its execution time to the broadcast notation. Our kernel will be straightforward:

julia
function _mc_random_walk_cuda!(y, x, sigma, steps)
    # Global thread index: one thread per walker
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Guard against threads that fall beyond the end of the array
    if i > length(y)
        return nothing
    end
    @inbounds pos = x[i]
    for t in 1:steps
        pos += randn(typeof(pos)) * sigma
    end
    @inbounds y[i] = pos
    return nothing
end

function mc_random_walk_gpu!(y, x, sigma, steps)
    n = length(y)
    @assert n == length(x)

    # Compile the kernel without launching it, then ask the occupancy API
    # for a suitable launch configuration
    kernel = @cuda name="mc_walk" launch=false _mc_random_walk_cuda!(y, x, sigma, steps)
    config = launch_configuration(kernel.fun)
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    kernel(y, x, sigma, steps; threads=threads, blocks=blocks)
    return nothing
end

We can now benchmark on the same data:

julia
display(@benchmark CUDA.@sync mc_random_walk_gpu!($y_gpu, $x_gpu, $sigma, $steps))

In our tests, the custom kernel did not perform as well as the simpler broadcast approach. Writing a custom kernel is clearly not necessary to achieve good performance here: we can rely on the compiler to generate fast GPU code from the array notation. It is entirely possible to rewrite our kernel to reach a similar performance to our previous implementation, but this would require more effort on our part.
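
One possible direction for such a rewrite, sketched here only as an untuned starting point (the function name _mc_random_walk_strided! is ours and does not appear elsewhere in this chapter), is a grid-stride loop in which each thread processes several walkers:

julia
function _mc_random_walk_strided!(y, x, sigma, steps)
    # Each thread starts at its global index and then strides across the
    # array, so a fixed-size grid can cover any number of walkers
    start  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    i = start
    while i <= length(y)
        @inbounds pos = x[i]
        for t in 1:steps
            pos += randn(typeof(pos)) * sigma
        end
        @inbounds y[i] = pos
        i += stride
    end
    return nothing
end

Whether this, or other changes such as tuning the launch configuration by hand, actually closes the gap would have to be measured on the target GPU.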