Naive Parallel Implementation
Each dart thrown requires two randomly generated numbers, mapped to lie between -1 and 1. For each dart, we then compute the squared distance from the dart to the origin and check whether it is at most 1. All of these steps can be done in parallel; however, the counter n_c
is shared between the threads, creating a race condition if multiple workers were to try to update it at the same time. To solve this, we can give each thread its own counter and then sum these separately at the end.
function est_pi_mc_threaded(n)
    n_cs = zeros(typeof(n), Threads.nthreads())
    Threads.@threads for _ in 1:n
        # Choose random numbers between -1 and +1 for x and y
        x = rand() * 2 - 1
        y = rand() * 2 - 1
        # Work out the distance from origin using Pythagoras
        r2 = x*x + y*y
        # Count point if it is inside the circle (r^2 <= 1)
        if r2 <= 1
            n_cs[Threads.threadid()] += 1
        end
    end
    n_c = sum(n_cs)
    return 4 * n_c / n
end
Here, we have used very little memory to keep track of the counts, and the implementation is thread-safe because each thread writes to a different element of the array. Note that Threads.threadid()
returns the ID of the currently executing thread within the threaded loop, and that the rand()
function call is thread-safe on recent versions of Julia (v1.6+). Note also that this implementation, even when seeded, will not be deterministic: even if each thread had its own seeded RNG, additional randomness is introduced when the threads are scheduled for execution. This is usually not an issue, but it is something to keep in mind, especially when one would like to write testable code.
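If reproducibility is needed, one option is to partition the darts into one fixed chunk per thread and give each chunk its own deterministically seeded generator, so each chunk's count depends only on its own RNG rather than on scheduling order. The sketch below assumes Julia 1.7+ for the Xoshiro generator; the function name est_pi_mc_seeded and the base_seed parameter are illustrative, not from the text.

```julia
using Random

# Sketch: a deterministic threaded estimate via per-chunk seeded RNGs.
# est_pi_mc_seeded and base_seed are hypothetical names.
function est_pi_mc_seeded(n, base_seed)
    nt = Threads.nthreads()
    chunk = cld(n, nt)                # darts per chunk, rounded up
    counts = zeros(Int, nt)
    Threads.@threads :static for t in 1:nt
        rng = Xoshiro(base_seed + t)  # each chunk owns its own seeded RNG
        c = 0
        for _ in 1:chunk
            x = rand(rng) * 2 - 1
            y = rand(rng) * 2 - 1
            c += (x*x + y*y <= 1)     # Bool promotes to Int
        end
        counts[t] = c
    end
    return 4 * sum(counts) / (chunk * nt)
end
```

Because every chunk is replayed from the same seed, two runs with the same n, seed, and thread count produce the same estimate, at the cost of slightly changing the total dart count to chunk * nt.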
Now, we can benchmark these algorithms and see which is fastest. We can calculate the efficiency based on the number of threads:
function est_pi_mc_serial(n)
    n_c = zero(typeof(n))
    for _ in 1:n
        # Choose random numbers between -1 and +1 for x and y
        x = rand() * 2 - 1
        y = rand() * 2 - 1
        # Work out the distance from origin using Pythagoras
        r2 = x*x + y*y
        # Count point if it is inside the circle (r^2 <= 1)
        if r2 <= 1
            n_c += 1
        end
    end
    return 4 * n_c / n
end
num_threads = Threads.nthreads()
The benchmarks:
using BenchmarkTools  # provides @belapsed

n = 100_000_000;
mc_pi_serial_time = @belapsed est_pi_mc_serial($n)
mc_pi_threaded_time = @belapsed est_pi_mc_threaded($n)
mc_pi_serial_time/mc_pi_threaded_time
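The parallel efficiency mentioned above is the speedup divided by the thread count. A minimal sketch, with illustrative stand-in timings in place of the measured @belapsed values:

```julia
# Sketch: parallel efficiency from wall-clock times.
# The timing values here are illustrative placeholders.
serial_time   = 0.80   # seconds, e.g. from @belapsed on the serial version
threaded_time = 0.25   # seconds, e.g. from @belapsed on the threaded version
num_threads   = 4

speedup    = serial_time / threaded_time
efficiency = speedup / num_threads   # 1.0 would mean perfect scaling
```

An efficiency close to 1.0 means the threads are being used well; values far below that indicate contention or overhead.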
One would expect the ratio of those times to approach the number of available threads; however, even for this very large value of n
, the parallel efficiency is still very low. This is very likely a case of false sharing: the per-thread counters in n_cs are adjacent in memory and therefore share cache lines, so every increment forces the cores to pass the cache line back and forth.