Naive Parallel Implementation
Each dart thrown requires two randomly generated numbers, mapped to lie between -1 and 1. For each dart, we then compute the squared distance from the dart to the origin and check whether it is at most 1. All of these steps can be done in parallel; however, the counter n_c
is shared between the threads, creating a race condition if multiple workers were to try to update it at the same time. To solve this, we can give each thread its own counter and then sum these separately at the end.
function est_pi_mc_threaded(n)
    n_cs = zeros(typeof(n), Threads.nthreads())
    Threads.@threads for _ in 1:n
        # Choose random numbers between -1 and +1 for x and y
        x = rand() * 2 - 1
        y = rand() * 2 - 1
        # Work out the distance from origin using Pythagoras
        r2 = x*x + y*y
        # Count point if it is inside the circle (r^2 <= 1)
        if r2 <= 1
            n_cs[Threads.threadid()] += 1
        end
    end
    n_c = sum(n_cs)
    return 4 * n_c / n
end
Here, we have used very little memory to keep track of the counts, and the implementation is thread-safe because each thread writes to a different element of the array. Note that Threads.threadid()
returns the ID of the currently executing thread within the threaded loop, and that the rand()
function call is thread-safe on recent versions of Julia (v1.6+). Note also that this implementation, even when seeded, will not be deterministic: even if each thread had its own seeded RNG, additional randomness is introduced when the threads are scheduled for execution. This is usually not an issue, but it is something to keep in mind, especially when one would like to write testable code.
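If reproducibility is needed, one option is to partition the darts into one fixed chunk per thread and give each chunk its own deterministically seeded generator, so each chunk's count depends only on its own RNG rather than on scheduling order. The sketch below assumes Julia 1.7+ for the Xoshiro generator; the function name est_pi_mc_seeded and the base_seed parameter are illustrative, not from the text.

```julia
using Random

# Sketch: a deterministic threaded estimate via per-chunk seeded RNGs.
# est_pi_mc_seeded and base_seed are hypothetical names.
function est_pi_mc_seeded(n, base_seed)
    nt = Threads.nthreads()
    chunk = cld(n, nt)                # darts per chunk, rounded up
    counts = zeros(Int, nt)
    Threads.@threads :static for t in 1:nt
        rng = Xoshiro(base_seed + t)  # each chunk owns its own seeded RNG
        c = 0
        for _ in 1:chunk
            x = rand(rng) * 2 - 1
            y = rand(rng) * 2 - 1
            c += (x*x + y*y <= 1)     # Bool promotes to Int
        end
        counts[t] = c
    end
    return 4 * sum(counts) / (chunk * nt)
end
```

Because every chunk is replayed from the same seed, two runs with the same n, seed, and thread count produce the same estimate, at the cost of slightly changing the total dart count to chunk * nt.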
Now, we can benchmark these algorithms and see which is fastest. We can calculate the efficiency based on the number of threads:
function est_pi_mc_serial(n)
    n_c = zero(typeof(n))
    for _ in 1:n
        # Choose random numbers between -1 and +1 for x and y
        x = rand() * 2 - 1
        y = rand() * 2 - 1
        # Work out the distance from origin using Pythagoras
        r2 = x*x + y*y
        # Count point if it is inside the circle (r^2 <= 1)
        if r2 <= 1
            n_c += 1
        end
    end
    return 4 * n_c / n
end
num_threads = Threads.nthreads()
The benchmarks:
using BenchmarkTools  # provides @belapsed

n = 100_000_000;
mc_pi_serial_time = @belapsed est_pi_mc_serial($n)
mc_pi_threaded_time = @belapsed est_pi_mc_threaded($n)
mc_pi_serial_time/mc_pi_threaded_time
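The parallel efficiency mentioned above is the speedup divided by the thread count. A minimal sketch, with illustrative stand-in timings in place of the measured @belapsed values:

```julia
# Sketch: parallel efficiency from wall-clock times.
# The timing values here are illustrative placeholders.
serial_time   = 0.80   # seconds, e.g. from @belapsed on the serial version
threaded_time = 0.25   # seconds, e.g. from @belapsed on the threaded version
num_threads   = 4

speedup    = serial_time / threaded_time
efficiency = speedup / num_threads   # 1.0 would mean perfect scaling
```

An efficiency close to 1.0 means the threads are being used well; values far below that indicate contention or overhead.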
One would expect the ratio of those times to approach the number of available threads; however, even for this very large value of n
, the parallel efficiency is still very low. This is very likely a case of false sharing: the per-thread counters in n_cs are adjacent in memory and therefore share cache lines, so every increment forces the cores to pass the cache line back and forth.