Case Study: Monte-Carlo Simulations

Figure 1: An example of a continuous-space, discrete-time random walk, starting at the origin.

Here, we will provide an example of porting a Monte-Carlo random walk simulation, such as the one shown in Figure 1, to the GPU. Following the earlier advice in this book, we start by writing a function that performs a single step of the random walk.

julia
# Advance a single walker by one step: add a Gaussian increment of width sigma
function mc_random_walk_step(x, sigma)
    return x + randn(typeof(x)) * sigma
end
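
As a quick illustration (the values here are arbitrary), the same function works on a single scalar and, via broadcasting, on a whole vector of walker positions:

julia
# Step a single walker once
mc_random_walk_step(0.0f0, 1.0f0)

# Step a vector of walkers once, element-wise, via broadcasting
positions = zeros(Float32, 4)
mc_random_walk_step.(positions, 1.0f0)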

Now imagine that we want to study some population-level statistics by running many of these walkers in parallel. Although we have chosen a very simple example, Monte-Carlo simulations are an extremely useful tool in computational science. Our aim for this exercise is to perform some number of steps of this Monte-Carlo update for many independent walkers and collect their final positions in an array.

We can implement a non-allocating array version of our desired algorithm:

julia
function mc_random_walk!(y, x, sigma, steps)
    # Copy the initial values from x into y
    y .= x
    # Advance every walker by one step per iteration
    for t in 1:steps
        y .= mc_random_walk_step.(y, sigma)
    end
    return nothing
end

We can run our algorithm on the CPU easily:

julia
n = 2048; x = zeros(Float32, n); y = similar(x);
sigma = 1.0f0; steps = 100;
mc_random_walk!(y, x, sigma, steps);
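
As a quick sanity check (a minimal sketch; the exact number varies from run to run), the final positions of a Gaussian random walk should have a standard deviation of roughly sigma * sqrt(steps), which is about 10 for these parameters:

julia
using Statistics

# Spread of the final positions; should be close to sigma * sqrt(steps) ≈ 10
std(y)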

We can extend this to run on the GPU just by changing the types:

julia
using CUDA

x_gpu = cu(x); y_gpu = similar(x_gpu);
mc_random_walk!(y_gpu, x_gpu, sigma, steps);
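
If we want to inspect the results on the host, we can copy the device array back into an ordinary Array (here into a new variable, y_host, named just for illustration):

julia
# Copy the results from device memory back to the host
y_host = Array(y_gpu)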

Let’s benchmark the CPU version:

julia
using BenchmarkTools

display(@benchmark mc_random_walk!($y, $x, $sigma, $steps))
Output
BenchmarkTools.Trial: 9725 samples with 1 evaluation per sample.
 Range (min … max):  481.505 μs … 554.328 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     512.029 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   512.760 μs ±   7.746 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▁▃██▇█▄▅▃▂▁                       
  ▂▁▂▂▁▁▁▁▁▂▁▂▁▂▂▂▂▂▄▅▆▇▆▆▆▅▅▅▆███████████████▇▆▅▅▄▄▄▄▃▃▃▃▃▃▂▂▂ ▄
  482 μs           Histogram: frequency by time          535 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

And the GPU version:

julia
display(@benchmark CUDA.@sync mc_random_walk!($y_gpu, $x_gpu, $sigma, $steps))
Output
BenchmarkTools.Trial: 5210 samples with 1 evaluation per sample.
 Range (min … max):  821.190 μs …   5.750 ms  ┊ GC (min … max): 0.00% … 86.19%
 Time  (median):     942.847 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   952.241 μs ± 156.685 μs  ┊ GC (mean ± σ):  1.19% ±  5.52%

      ▂▁    ▁▁    ▄█▇▄▂                                          
  ▂▄▇▇███▆▅████▇▇▆█████▇▆▄▄▃▃▃▂▂▂▂▁▁▁▁▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▁▁▁ ▃
  821 μs           Histogram: frequency by time         1.28 ms <

 Memory estimate: 211.44 KiB, allocs estimate: 8686.

We see that our GPU version is actually slower than the CPU version (for this number of walkers). Part of the reason is a performance bug under the hood: on every iteration of the for loop we launch a new GPU kernel. Instead, we should fuse these kernels into one:

julia
# Advance a single walker through all steps (runs entirely on-device when broadcast)
function mc_random_walk(x, sigma, steps)
    for t in 1:steps
        x = mc_random_walk_step(x, sigma)
    end
    return x
end

function mc_random_walk_fused!(y, x, sigma, steps)
    # One broadcast call: the whole walk runs inside a single kernel
    y .= mc_random_walk.(x, sigma, steps)
    return nothing
end
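
Note that nothing in this fused version is GPU-specific, so the same function also runs unchanged on the CPU arrays from before:

julia
# The fused version works unchanged on CPU arrays
mc_random_walk_fused!(y, x, sigma, steps);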

Now we can benchmark this again:

julia
display(@benchmark CUDA.@sync mc_random_walk_fused!($y_gpu, $x_gpu, $sigma, $steps))
Output
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  417.505 μs …  1.375 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     446.091 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   470.362 μs ± 70.401 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▆▇█▇▆▇▆▄▃▂▂▁▂▁▁▂▂▃▃▂▂▂▁ ▁▁▁▁ ▁▁                             ▂
  █████████████████████████████████▆▇▆▅▄▆▅▅▆▁▆▄▅▆▅▇▆▆▆▆▅▅▇▇▇▇█ █
  418 μs        Histogram: log(frequency) by time       798 μs <

 Memory estimate: 2.19 KiB, allocs estimate: 86.

We have now fused all the kernel calls into a single one. This minimises the scheduling overhead and allows the GPU cores to stay busy for the duration of the computation. This small change roughly halved the runtime of our GPU code and makes it faster than the CPU version.

If each kernel call does a lot of work (e.g. a large matrix multiply), the overhead of launching multiple kernels is only a small fraction of the total time; fusion matters most when the kernels are small and numerous, as they are here.
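
To get a feel for the launch overhead itself (a rough sketch; the exact figure depends on your GPU and driver, but is typically a few microseconds per launch), we can time a single trivial kernel and compare that against the hundred launches per call in the unfused version:

julia
# A single cheap element-wise kernel; most of its runtime is launch and
# synchronisation overhead rather than computation
display(@benchmark CUDA.@sync ($y_gpu .= $x_gpu .+ 1.0f0))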

Custom Kernel

It is a good exercise to write a custom kernel and compare its performance to the broadcasted version. Our kernel will be straightforward:

julia
function _mc_random_walk_cuda!(y, x, sigma, steps)
    # Global thread index: one thread per walker
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i > length(y)
        return nothing
    end
    @inbounds pos = x[i]
    # Advance this walker through all steps using the device-side RNG
    for t in 1:steps
        pos += randn(typeof(pos)) * sigma
    end
    @inbounds y[i] = pos
    return nothing
end

function mc_random_walk_gpu!(y, x, sigma, steps)
    n = length(y)
    @assert n == length(x)

    # Compile the kernel without launching so we can query a good launch configuration
    kernel = @cuda name="mc_walk" launch=false _mc_random_walk_cuda!(y, x, sigma, steps)
    config = launch_configuration(kernel.fun)
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    kernel(y, x, sigma, steps; threads=threads, blocks=blocks)
    return nothing
end

We can now benchmark on the same data:

julia
display(@benchmark CUDA.@sync mc_random_walk_gpu!($y_gpu, $x_gpu, $sigma, $steps))
Output
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  414.685 μs …  1.092 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     437.232 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   467.754 μs ± 78.637 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▆██▆▅▄▂▂    ▁▂▂▃▃▃▂▂▁▁      ▁                               ▂
  █████████████████████████▇▇█████▇▆▆▆▅▄▄▅▅▅▆▆▆▅▆▅▆▆▆▆▅▅▆▇██▇▇ █
  415 μs        Histogram: log(frequency) by time       801 μs <

 Memory estimate: 416 bytes, allocs estimate: 22.

We can see that our custom kernel performs roughly on par with the broadcasted version. Writing a custom kernel is not necessary to achieve good performance here: we can rely on the compiler to generate fast GPU code from the array notation. A hand-written kernel does give us more room for low-level optimisation, but for this simple example it takes noticeably more effort for little gain.
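
Since the walks are random, we cannot compare the two implementations element by element, but we can check that they produce the same distribution of final positions (a minimal sketch; the exact values vary from run to run, and both should be close to sigma * sqrt(steps) ≈ 10):

julia
using Statistics

# Spread of final positions from the broadcasted (fused) version
mc_random_walk_fused!(y_gpu, x_gpu, sigma, steps)
std_fused = std(Array(y_gpu))

# Spread of final positions from the custom kernel
mc_random_walk_gpu!(y_gpu, x_gpu, sigma, steps)
std_custom = std(Array(y_gpu))

(std_fused, std_custom)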