
## Chunking

The problem of false sharing can be avoided by chunking the larger calculation into much smaller ones. An additional benefit of this method is that we can use the original implementation to maximise code reuse. If we make any performance improvements to the serial algorithm, this will also improve the parallel implementation.

A very helpful tool for chunking the work is `Iterators.partition` from the base `Iterators` module, which breaks a collection up into blocks of a specified size, with the final block holding whatever remains.
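
For example, partitioning a range of 10 elements into blocks of 4 (a REPL sketch; the exact element type may vary between Julia versions):

```julia
julia> collect(Iterators.partition(1:10, 4))
3-element Vector{UnitRange{Int64}}:
 1:4
 5:8
 9:10
```

Combining this with `enumerate` gives each chunk an index that we can use to write into a per-thread slot: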

```julia
function est_pi_mc_threaded_chunked(n)
    n_threads = Threads.nthreads()
    num_inside = zeros(Float64, n_threads)
    # Calculate maximum chunk size
    chunk_size = div(n, n_threads, RoundUp)

    # Create an iterator and collect to turn into an array
    iter = collect(enumerate(Iterators.partition(1:n, chunk_size)))
    Threads.@threads for info in iter
        i, idx_range = info # Unpack the tuple from enumerate
        n_block = length(idx_range)
        pi_est = est_pi_mc_serial(n_block)
        num_inside[i] = pi_est*n_block/4 # Recover the inside-count from the estimate
    end

    n_c = sum(num_inside)
    return 4 * n_c / n
end
```
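
As a reminder, the number of threads is fixed when Julia starts, e.g. via the `--threads` flag or the `JULIA_NUM_THREADS` environment variable, so it is worth confirming it before benchmarking:

```julia
julia> Threads.nthreads()  # reports e.g. 4 when started with `julia --threads 4`
4
```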

Now, we can benchmark this final algorithm and compare it to the others:

```julia
using BenchmarkTools
function est_pi_mc_serial(n)
    n_c = zero(typeof(n))
    for _ in 1:n
        # Choose random numbers between -1 and +1 for x and y
        x = rand() * 2 - 1
        y = rand() * 2 - 1
        # Work out the distance from origin using Pythagoras
        r2 = x*x+y*y
        # Count point if it is inside the circle (r^2=1)
        if r2 <= 1
            n_c += 1
        end
    end
    return 4 * n_c / n
end

function est_pi_mc_threaded(n)
    n_cs = zeros(typeof(n), Threads.nthreads())
    Threads.@threads for _ in 1:n
        # Choose random numbers between -1 and +1 for x and y
        x = rand() * 2 - 1
        y = rand() * 2 - 1
        # Work out the distance from origin using Pythagoras
        r2 = x*x+y*y
        # Count point if it is inside the circle (r^2=1)
        if r2 <= 1
            n_cs[Threads.threadid()] += 1
        end
    end
    n_c = sum(n_cs)
    return 4 * n_c / n
end

n = 100_000_000
mc_pi_serial_time = @belapsed est_pi_mc_serial($n)
mc_pi_threaded_time = @belapsed est_pi_mc_threaded($n)
mc_pi_threaded_chunked_time = @belapsed est_pi_mc_threaded_chunked($n)
mc_pi_serial_time/mc_pi_threaded_time
mc_pi_serial_time/mc_pi_threaded_chunked_time
mc_pi_threaded_time/mc_pi_threaded_chunked_time
```

We see that the chunked approach was much faster, almost reaching the theoretical maximum performance. In general, it is better to split parallel work into large chunks that each thread can process sequentially. This avoids much of the scheduling and orchestration overhead of managing the threads, as most variables can live on the stack with little need to coordinate execution. It also massively reduces the number of writes to shared memory: each count can live in a register close to the CPU and only be saved to memory once the bulk of the calculation is complete.
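
To make that point concrete, here is a minimal sketch (not the implementation benchmarked above) in which each task accumulates into a local variable, so the shared array is written exactly once per chunk:

```julia
function est_pi_mc_threaded_local(n)
    n_threads = Threads.nthreads()
    counts = zeros(Int, n_threads)
    chunk_size = div(n, n_threads, RoundUp)
    iter = collect(enumerate(Iterators.partition(1:n, chunk_size)))
    Threads.@threads for info in iter
        i, idx_range = info
        n_c = 0 # Local accumulator; can stay in a register
        for _ in idx_range
            x = rand() * 2 - 1
            y = rand() * 2 - 1
            if x*x + y*y <= 1
                n_c += 1
            end
        end
        counts[i] = n_c # Single write to shared memory per chunk
    end
    return 4 * sum(counts) / n
end
```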

However, it should be noted that the overhead of parallel execution can be significant. The only way to see this effect is to measure performance as a function of the input size $n$. We can already assess that this algorithm has a time complexity of $\mathcal{O}(n)$. Instead of plotting these timings together, we will plot the speed-up $S = t_{\text{serial}}/t_{\text{parallel}}$, compared to the theoretical maximum given by Amdahl's law.
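
For reference, Amdahl's law bounds the speed-up achievable with $N$ threads when a fraction $p$ of the runtime can be parallelised:

$$
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
$$

As $p \to 1$, this bound approaches $N$, the number of threads.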

[Figure: speed-up $S$ plotted against the number of samples $n$, with series for the serial baseline, the threaded chunked implementation, and the Amdahl maximum.]

Figure 1: The relative speed-up $S$ of the chunked parallel implementation over the serial implementation. We have also plotted the theoretical maximum performance increase given by Amdahl's law.

Inspecting Figure 1, we can see that the chunked implementation approaches the maximum speed-up for this algorithm, but suffers at lower values of $n$.

One can infer that the cost of using multithreading is quite high, especially when the body of the for loop is computationally inexpensive. However, this cost is more or less constant, so increasing the amount of work minimises its relative size. It takes at least $10^5$ to $10^6$ samples for the switch to be worth it; below that, the threading overhead outweighs the time spent on the calculation itself.
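
Where both implementations are available, a simple wrapper can pick between them. This is a hypothetical convenience, not part of the benchmarked code, and the `threshold` default is a machine-dependent guess based on the measurements above:

```julia
# Hypothetical dispatcher: use the serial path for small workloads so
# they don't pay the threading overhead. Tune `threshold` per machine.
est_pi_mc(n; threshold=100_000) =
    n < threshold ? est_pi_mc_serial(n) : est_pi_mc_threaded_chunked(n)
```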