Theoretical Expectations
Amdahl’s Law
Let’s benchmark the algorithm from the previous section with an array of 1024 numbers:
function _inner_solve(x)
    # Promote the constants to the same type as the input element
    rate, limit, x = promote(0.01, 2.0, x)
    for t in 1:100
        x += rate*clamp(x, -limit, limit)
    end
    return x
end

function map_solve!(y, x)
    # Apply the inner solver to every element of x, writing the results into y
    @inbounds for i in eachindex(x, y)
        y[i] = _inner_solve(x[i])
    end
    nothing
end
using BenchmarkTools  # provides @btime and @benchmark (if not already loaded)

x = rand(1024);
y = similar(x);

@btime _inner_solve($(x[1]))
serial_benchmark = @benchmark map_solve!($y, $x)
best_serial_time = minimum(serial_benchmark.times)
display(serial_benchmark)

  223.031 ns (0 allocations: 0 bytes)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 227.755 μs … 446.794 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 232.209 μs ┊ GC (median): 0.00%
Time (mean ± σ): 231.890 μs ± 5.901 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
[histogram: log(frequency) by time, 228 μs to 247 μs]
Memory estimate: 0 bytes, allocs estimate: 0.

We can see that the processing time scales linearly with the size of the array. If we had 1000 processors, we could theoretically speed up this function call by having each processor run the inner solve on its own elements and put the results in the final array. However, we would still have to spend the ~200 nanoseconds required to process a single element. This brings up the concept of Amdahl’s law, which states:

$$S = \frac{1}{(1 - p) + \frac{p}{s}}$$
where $S$ is the speed-up factor, $p$ is the proportion of the program which can be parallelised, and $s$ is the speed-up of the part of the task that is parallelisable. We can interpret the factor $(1 - p)$ as the proportion of the program which cannot be improved.
Amdahl’s Law tells us the theoretical maximum speed-up we can achieve by parallelising part of our program. In our example, we know that $p = 1$, as all of the per-element work can be done in parallel. However, the speed-up factor $s$ is not as trivial. We know that each worker can process one element at a time, and so the minimum amount of time needed is proportional to $\lceil n / N \rceil$, where $n$ is the number of elements, $N$ is the number of workers, and $\lceil \cdot \rceil$ represents the ceiling function, which rounds up to the nearest integer (for example, $\lceil 2.1 \rceil = 3$, but $\lceil 2 \rceil = 2$).
If we put these equations together, we will see that an embarrassingly parallel problem (where each element takes the same amount of time) will have a speed-up given by

$$S = s = \frac{n}{\lceil n / N \rceil}$$
If we take the limit as $N \to \infty$, we know that $\lceil n / N \rceil \to 1$ and hence $S \to n$, which is what is expected.
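As a quick sanity check, we can encode these formulas in a few lines of Julia. This is a minimal sketch; the names amdahl_speedup and parallel_map_speedup are purely illustrative.

# Amdahl's Law: maximum speed-up when a fraction p of the work is
# parallelisable and that part is sped up by a factor s.
amdahl_speedup(p, s) = 1 / ((1 - p) + p / s)

# For our embarrassingly parallel map over n elements with N workers,
# the parallel part is sped up by s = n / ceil(n / N).
parallel_map_speedup(n, N) = amdahl_speedup(1.0, n / cld(n, N))

parallel_map_speedup(1024, 8)     # 8.0    -- each worker handles 128 elements
parallel_map_speedup(1024, 1000)  # 512.0  -- 1000 workers still need two "rounds"
parallel_map_speedup(1024, 1024)  # 1024.0 -- one element per worker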
Let’s say that we also need to read in the data from the disk, which takes a time $t_s$ per element and must be done sequentially. If the parallelisable operation on each element takes $t_p \approx 200$ ns, then the factor changes to $p = \frac{t_p}{t_s + t_p}$. This will change our maximum speed-up:

$$S = \frac{1}{(1 - p) + \frac{p}{s}} = \frac{1}{\frac{t_s}{t_s + t_p} + \frac{1}{s}\,\frac{t_p}{t_s + t_p}}$$
If we take the limit as $s \to \infty$ and assume we have unlimited resources (i.e. $N \to \infty$), then the maximum speed-up is $\frac{1}{1 - p} = 1 + \frac{t_p}{t_s}$. This shows us that no matter how many resources we have, if the sequential part of an algorithm takes much longer than the part which can be parallelised, then we are severely limited in how much we can speed up a task.
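To make this concrete, here is a small sketch. The sequential read time of 1 μs per element is an assumed value for illustration only; the ~200 ns of per-element work comes from the benchmark above.

# Hypothetical timings (the read time is an assumption, not a measurement)
t_s = 1000e-9   # 1 μs of sequential disk read per element (assumed)
t_p = 200e-9    # ~200 ns of parallelisable work per element (from the benchmark)

p = t_p / (t_s + t_p)        # fraction of the work that can be parallelised
max_speedup = 1 / (1 - p)    # Amdahl's limit as s → ∞

println("p           = ", round(p; digits=3))           # ≈ 0.167
println("max speedup = ", round(max_speedup; digits=2)) # ≈ 1.2x, no matter how many cores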
Embarrassingly Parallel Problems
For our practical example, we need to know the number of workers we have available. We can see the number of threads available within Julia with the built-in function:

display(Threads.nthreads())

8

If this is set to 1, you most likely need to start Julia with more threads. It is common to set the number of threads to the number of physical cores available on your processor. You can use julia -t auto to start Julia with the number of threads equal to the number of logical processors. Due to Simultaneous Multithreading (SMT), this can be double the number of your physical cores.
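A quick way to see the SMT effect is to compare the number of Julia threads with the number of logical CPUs the system reports. This is a small sketch; the exact numbers will depend on your machine.

# Threads this Julia session was started with
println("Julia threads: ", Threads.nthreads())

# Logical CPUs visible to the OS (can be twice the physical core count with SMT)
println("Logical CPUs:  ", Sys.CPU_THREADS)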
Let’s quickly implement a parallel version of our algorithm using multithreading. We will discuss this in more detail later. In Julia, one can perform tasks in parallel by simply adding a macro at the beginning of a for loop:
function map_solve_parallel!(y, x)
    @inbounds Threads.@threads for i in eachindex(x, y)
        y[i] = _inner_solve(x[i])
    end
    nothing
end

This is the same as the previous algorithm, but using the Threads.@threads macro from the Julia base library. Note that the @inbounds is not required, but it improves performance.
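Before benchmarking, it is worth checking that the threaded version produces the same result as the serial one. This is a quick sketch using the arrays defined earlier; y_parallel is just an illustrative name.

# Fill y with the serial result and a fresh array with the parallel result,
# then check that they agree element-wise.
map_solve!(y, x)

y_parallel = similar(x)
map_solve_parallel!(y_parallel, x)

@assert y ≈ y_parallel   # the two implementations should agree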
We can benchmark this algorithm the same as before:
parallel_benchmark = @benchmark map_solve_parallel!($y, $x)
display(parallel_benchmark)
best_parallel_time = minimum(parallel_benchmark.times)
println("Speedup: ", round(best_serial_time / best_parallel_time; sigdigits=3), "x")
println("Best Speedup: ", Threads.nthreads())BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 34.742 μs … 308.690 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 36.038 μs ┊ GC (median): 0.00%
Time (mean ± σ): 39.690 μs ± 8.810 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
[histogram: log(frequency) by time, 34.7 μs to 61.2 μs]
Memory estimate: 4.25 KiB, allocs estimate: 42.
Speedup: 6.56x
Best Speedup: 8

Notice how the number of allocations has increased. This is because the threading macro rewrites your code and inserts some heap allocations. However, the performance of this algorithm is much better. The actual speedup is lower than our theoretical maximum (the output of Threads.nthreads()) for several reasons. The first is that running code in parallel requires overhead and synchronisation across the cores. The second is that each core may access the same cache line in memory, and even worse, each core can be modifying the same cache line, invalidating it and causing a huge slowdown. This is what is known as false sharing, which we will see the effects of later on in the course. Additionally, the memory access is not uniform and the compiler cannot vectorise the code as effectively, negatively impacting performance. We will visit better ways of parallelising this sort of algorithm in a later chapter.
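One way to see the overhead term directly is to shrink the problem until the parallel bookkeeping dominates. This is a small sketch; the array size of 8 is arbitrary, and whether the parallel version ends up slower will depend on your machine.

# With only a handful of elements, spawning and synchronising the threads
# can cost more than the work itself, so the "parallel" version may be slower.
x_small = rand(8);
y_small = similar(x_small);

@btime map_solve!($y_small, $x_small)           # serial: roughly 8 × 200 ns of real work
@btime map_solve_parallel!($y_small, $x_small)  # parallel: dominated by threading overhead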
Gustafson’s Law
While Amdahl’s Law focuses on fixed problem sizes, Gustafson’s Law addresses a more practical scenario: scaled speedup. In real-world applications, when we get more computing resources, we often increase the problem size rather than just solving the same problem faster.
Gustafson’s Law states that the scaled speedup is given by:

$$S(N) = (1 - p) + pN$$
where $N$ is the number of processors, $p$ is the fraction of time spent in the parallelisable portion when running on $N$ processors, and $(1 - p)$ is the serial fraction.
The key insight is that as we scale up both the problem size and number of processors, the parallel portion often grows while the serial portion remains relatively constant.
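A couple of lines of Julia make the contrast with Amdahl’s Law explicit. This is a sketch; gustafson_speedup is just an illustrative name.

# Gustafson's Law: scaled speedup when a fraction p of the (scaled) run time
# is parallel and we use N processors.
gustafson_speedup(p, N) = (1 - p) + p * N

# With 5% serial work, Amdahl caps the fixed-size speedup at 1/0.05 = 20x,
# but the scaled speedup keeps growing with the number of processors.
gustafson_speedup(0.95, 8)     # ≈ 7.65
gustafson_speedup(0.95, 128)   # ≈ 121.65
gustafson_speedup(0.95, 1024)  # ≈ 972.85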
Let’s demonstrate this with a theoretical example of matrix multiplication, which has a complexity of roughly $O(n^3)$ for multiplying two $n \times n$ matrices. There is also some amount of serial overhead for such an algorithm. We assume that this overhead comes from coordinating the workers to handle the problem and that this time is constant, regardless of the number of elements in the matrix. This means that as $n$ grows, so does $p$. A larger problem size is therefore more parallelisable, and hence it becomes more efficient to throw a larger number of workers at the problem, allowing us to handle the larger problem size. You can see this effect in Figure 1 below.
We can essentially recover Amdahl’s law if we look at a vertical slice of Figure 1. For a fixed (and smaller) input size, the factor $p$ is small and hence the maximum speedup is very limited, even for a large number of workers. One can see this for the smaller matrix sizes, where the maximum speedup is only around 20%, and that is even when using a large number of workers.
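The shape of Figure 1 can be reproduced with a rough model. This is a sketch only; the per-operation time and the fixed coordination overhead below are assumed values, not measurements from the figure.

# Rough model: an n×n matrix multiplication does ~2n^3 floating point operations
# that parallelise, plus a fixed serial coordination overhead t_serial.
t_flop   = 1e-9    # seconds per floating point operation (assumed)
t_serial = 1e-3    # fixed serial overhead in seconds (assumed)

parallel_fraction(n) = (2n^3 * t_flop) / (2n^3 * t_flop + t_serial)
scaled_speedup(n, N) = (1 - parallel_fraction(n)) + parallel_fraction(n) * N

for n in (50, 500, 5000)
    p = parallel_fraction(n)
    println("n = ", n, ":  p ≈ ", round(p; digits=3),
            ",  scaled speedup with N = 64 workers ≈ ", round(scaled_speedup(n, 64); digits=1))
end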
Key differences from Amdahl’s Law:
- Fixed vs. Scaled Problems: Amdahl assumes fixed problem size; Gustafson assumes problem size scales with resources
- Practical vs. Theoretical Limits: Amdahl shows theoretical limits; Gustafson shows practical benefits of scaling
- Pessimistic vs. Optimistic: Amdahl is pessimistic about parallel benefits; Gustafson is more optimistic for scaled workloads
Note on thread count: The number of threads in Julia is fixed at startup and cannot be changed during execution. To truly test Gustafson’s Law with different processor counts, you would need to run separate Julia sessions with different thread settings (e.g., julia -t 1, julia -t 4, etc.) and compare results across those runs.
The demonstration above shows how parallel efficiency changes as we scale the problem size with a fixed number of threads, which illustrates the core principle of Gustafson’s Law: larger problems often maintain better parallel efficiency than smaller ones.
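To actually carry out the cross-session comparison described above, a small script along the following lines could be used. This is a sketch: the file names bench.jl, map_solve.jl, and scaling_results.csv are assumptions, and it presumes the map functions from this section live in the included file.

# bench.jl -- run repeatedly as `julia -t 1 bench.jl`, `julia -t 4 bench.jl`, etc.
using BenchmarkTools

include("map_solve.jl")   # assumed file defining map_solve! and map_solve_parallel!

x = rand(1024)
y = similar(x)

t = @belapsed map_solve_parallel!($y, $x)   # minimum time in seconds

# Append "nthreads,seconds" so results from separate runs can be compared afterwards
open("scaling_results.csv", "a") do io
    println(io, Threads.nthreads(), ",", t)
end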