False Sharing
When a CPU requests data from memory, it fetches an entire cache line at a time, and a cache line is large enough to hold several adjacent values. When a core loads a line into its L1 cache, that line may therefore contain values that other cores are currently writing to, even though this core never uses them. Every time another core's write invalidates the line (such as incrementing a neighbouring value), extra processing time has to be spent synchronising the caches.
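A quick way to see how many adjacent values share a line is to look at the addresses of consecutive array elements. The sketch below assumes a 64-byte cache line, which is typical on current x86 hardware:

a = zeros(Int64, 16)
line = 64  # assumed cache-line size in bytes
for i in 1:16
    # Elements that print the same line index live on the same cache line;
    # with 8-byte Int64 values, they group into runs of eight.
    println("a[", i, "] -> line ", UInt(pointer(a, i)) ÷ line)
end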
Let’s modify our previous algorithm to conduct an experiment on false sharing by spreading out each thread's counter in memory so the counters no longer land on the same cache line. We will do this by adding a spacing variable.
function est_pi_mc_threaded_spaced(n, spacing=1)
    # One counter per thread, placed `spacing` elements apart; larger
    # spacings push each thread's counter onto its own cache line.
    n_cs = zeros(typeof(n), Threads.nthreads() * spacing)
    Threads.@threads for _ in 1:n
        # Sample a random point in the square [-1, 1] x [-1, 1].
        x = rand() * 2 - 1
        y = rand() * 2 - 1
        r2 = x * x + y * y
        # Count the point if it falls inside the unit circle.
        if r2 <= 1
            n_cs[Threads.threadid() * spacing] += 1
        end
    end
    n_c = sum(n_cs)
    return 4 * n_c / n
end
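One way to run the experiment is to time the function at a few spacings. This sketch assumes the BenchmarkTools package is installed and that Julia was started with multiple threads (for example, julia -t 4):

using BenchmarkTools

n = 10_000_000
for spacing in (1, 2, 4, 8, 16)
    print("spacing = ", spacing, ": ")
    # @btime reports the minimum time over many samples; the $ signs
    # interpolate the variables so their setup cost is not measured.
    @btime est_pi_mc_threaded_spaced($n, $spacing)
end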
We can clearly see that increasing the spacing between elements has a huge impact on performance. The gains level off around a spacing of 8: with 64-bit (8-byte) integer counters, a spacing of 8 elements corresponds to 64 bytes, suggesting that each cache line is around 512 bits (64 bytes) wide.
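On Linux, this inference can be checked against what the kernel reports. The sysfs path below is an assumption that holds on most Linux systems; it does not exist on macOS or Windows:

# Read the coherency (cache) line size, in bytes, of one of cpu0's caches.
path = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
line_size = parse(Int, readchomp(path))
println(line_size, " bytes = ", 8 * line_size, " bits")  # typically 64 bytes = 512 bits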
The main reason that false sharing is such a huge problem for performance here is the frequency with which each core writes to the shared array: nearly every iteration updates a counter. The problem is far less severe when cores only read shared memory, since data that never changes is never invalidated and no synchronisation needs to occur. A tight loop of concurrent writes like this one is close to the worst case.
It should be noted that one should almost never implement an algorithm like the spaced version just to work around false sharing. There are much better methods, discussed next.