Theoretical Expectations

Amdahl’s Law

Let’s benchmark the algorithm from the previous section with an array of 1024 numbers:

julia
using BenchmarkTools  # provides @btime and @benchmark

function _inner_solve(x)
    rate, limit, x = promote(0.01, 2.0, x)
    for t in 1:100
        x += rate*clamp(x, -limit, limit)
    end
    return x
end

function map_solve!(y, x)
    @inbounds for i in eachindex(x, y)
        y[i] = _inner_solve(x[i])
    end
    nothing
end

x = rand(1024);
y = similar(x);
@btime _inner_solve($(x[1]))
serial_benchmark = @benchmark map_solve!($y, $x)
best_serial_time = minimum(serial_benchmark.times)
display(serial_benchmark)
Output
223.031 ns (0 allocations: 0 bytes)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  227.755 μs … 446.794 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     232.209 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   231.890 μs ±   5.901 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇▅▁▄▄▂▂▁▁▂▁▁▁▁█▄▁▅▅▂ ▁▁     ▄   ▁                             ▂
  █████████████████████████████████▆▇▇▇▇▅▅▆▆▅▆▇▄▅▆▅▄▅▅▄▄▄▄▂▂▄▃▅ █
  228 μs        Histogram: log(frequency) by time        247 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

We can see that the processing time scales linearly with the size of the array: the full call takes roughly 1024 times the ~223 ns needed for a single element. If we had 1000 processors, we could theoretically speed up this function call by having each processor run the inner loop for a single element and write the result into the output array. However, we would still have to spend the ~200 nanoseconds required to process a single element. This brings up the concept of Amdahl’s law, which states:

$$S = \frac{1}{(1-p) + \frac{p}{s}}$$

where $S$ is the speed-up factor, $p$ is the proportion of the program which can be parallelised, and $s$ is the speed-up of the part of the task that is parallelisable. We can interpret the $(1-p)$ factor as the proportion of the program which cannot be improved.

Amdahl’s Law tells us the theoretical maximum speed-up we can achieve by parallelising part of our program. In our example, we know that $p=1$, as all $N$ tasks can be done in parallel. However, the speed-up factor $s$ is not as trivial. Each of our $w$ workers can process one element at a time, so the minimum time needed is proportional to $\left\lceil \frac{N}{w} \right\rceil$ (where $\left\lceil \cdot \right\rceil$ denotes the ceiling function, which rounds up to the nearest integer; for example, $\left\lceil 3.2 \right\rceil = 4$, but $\left\lceil 3 \right\rceil = 3$).

If we put these equations together, we will see that an embarrassingly parallel problem (where each element takes the same amount of time) will have a speed-up given by

$$S = \frac{N}{\left\lceil \frac{N}{w} \right\rceil}$$

If we take the limit as $N \rightarrow \infty$, the effect of the ceiling becomes negligible, so $\left\lceil \frac{N}{w} \right\rceil \approx \frac{N}{w}$ and hence $S \rightarrow w$, which is what we expect.
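
As a quick sanity check, here is a minimal sketch of this formula (the helper name parallel_speedup is ours, not from the earlier code):

julia
# Speed-up of an embarrassingly parallel problem with N equal-cost elements
# spread across w workers: S = N / ceil(N / w).
parallel_speedup(N, w) = N / cld(N, w)

parallel_speedup(9, 8)     # 4.5 -- nine elements need two "rounds" on eight workers
parallel_speedup(1024, 8)  # 8.0 -- the work divides evenly, so we get the full speed-up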

Let’s say that we also need to read the $x$ data from disk, which takes around $1\,\mu\text{s}$ per element and must be done sequentially. If the operation on each element takes $200\,\text{ns}$, then each element costs $1.2\,\mu\text{s}$ in total and the parallelisable proportion becomes $p = \frac{200\,\text{ns}}{1.2\,\mu\text{s}} = \frac{1}{6} \approx 0.17$. This changes our maximum speed-up:

$$S = \frac{1}{\frac{5}{6} + \frac{1}{6} \cdot \frac{\left\lceil \frac{N}{w} \right\rceil}{N}}$$

If we take the limit as $N \rightarrow \infty$ and assume we have unlimited resources (i.e. $w \rightarrow \infty$), then the maximum speed-up is $\frac{1}{1-p} = 1.2$. This shows that no matter how many resources we have, if the sequential part of an algorithm takes much longer than the part which can be parallelised, we are severely limited in how much we can speed up the task.
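
We can evaluate this limit numerically with a small sketch (the amdahl helper is our own name for the formula above):

julia
# Amdahl's law: S = 1 / ((1 - p) + p / s).
amdahl(p, s) = 1 / ((1 - p) + p / s)

# Parallelisable fraction when each element costs 1 μs of serial I/O
# plus 200 ns of parallelisable computation.
p = 0.2 / 1.2

for w in (2, 8, 32, 10^6)
    println(w, " workers: ", round(amdahl(p, w); digits=3), "x")
end
# prints roughly 1.091x, 1.171x, 1.193x and 1.2x -- capped by the serial I/O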

Embarrassingly Parallel Problems

For our practical example, we need to know the number of workers we have available. We can see the number of threads available within Julia with the built-in function:

julia
display(Threads.nthreads())
Output
8

If this is set to 1, you most likely need to start Julia with more threads. It is common to set the number of threads to the number of physical cores available on your processor. You can use julia -t auto to start Julia with one thread per logical processor; due to Simultaneous Multithreading (SMT), this can be double the number of physical cores.
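
To compare against the hardware, Julia also exposes the number of logical cores it detected, so you can check whether SMT is doubling the count:

julia
# Logical CPU cores visible to Julia; with SMT this is typically
# twice the number of physical cores.
display(Sys.CPU_THREADS)
# Threads actually available to this session (fixed at startup).
display(Threads.nthreads())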

Let’s quickly implement a parallel version of our algorithm using multithreading; we will discuss this in more detail later. In Julia, one can run the iterations of a for loop in parallel by simply adding a macro in front of it:

julia
function map_solve_parallel!(y, x)
    @inbounds Threads.@threads for i in eachindex(x, y)
        y[i] = _inner_solve(x[i])
    end
    nothing
end

This is the same as the previous algorithm, but using the Threads.@threads macro from Julia’s built-in Threads module. Note that @inbounds is not required, but it improves performance.

We can benchmark this algorithm the same as before:

julia
parallel_benchmark = @benchmark map_solve_parallel!($y, $x)
display(parallel_benchmark)
best_parallel_time = minimum(parallel_benchmark.times)

println("Speedup: ", round(best_serial_time / best_parallel_time; sigdigits=3), "x")
println("Best Speedup: ", Threads.nthreads())
Output
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  34.742 μs … 308.690 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.038 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   39.690 μs ±   8.810 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█▇▄▂▂▁▁    ▁ ▁▂▁          ▁▁▁                 ▃▃▃▄▃▁▁      ▂
  █████████▇▆▇██████▇▇▆▇▆▄▇▆▆▇████▇▆▅▃▃▃▃▃▁▃▃▁▁▁▁█████████▆▆▅▅ █
  34.7 μs       Histogram: log(frequency) by time      61.2 μs <

 Memory estimate: 4.25 KiB, allocs estimate: 42.
Speedup: 6.56x
Best Speedup: 8

Notice how the number of allocations has increased. This is because the threading macro rewrites your code and inserts some heap allocations for the tasks it spawns. However, the performance of this algorithm is much better. The actual speed-up is lower than our theoretical maximum (the output of Threads.nthreads()) for several reasons. The first is that running code in parallel requires overhead for spawning tasks and synchronising across the cores. The second is that several cores may access the same cache line in memory, and even worse, modify the same cache line, invalidating it and causing a huge slowdown. This is known as false sharing, whose effects we will see later in the course. Additionally, the memory access is not uniform and the compiler cannot vectorise the code as well, negatively impacting performance. We will visit better ways of parallelising this sort of algorithm in a later chapter.
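
To make the extra allocations more concrete, here is a rough sketch of the kind of static chunking a threaded loop performs, written with Threads.@spawn. The function map_solve_chunked! is our own illustration, not Julia's actual implementation of Threads.@threads:

julia
function map_solve_chunked!(y, x)
    n = length(x)
    nchunks = Threads.nthreads()
    chunk_size = cld(n, nchunks)
    # Spawn one task per chunk; each task (and its closure) is heap
    # allocated, which is roughly where the extra allocations come from.
    tasks = map(1:nchunks) do c
        lo = (c - 1) * chunk_size + 1
        hi = min(c * chunk_size, n)
        Threads.@spawn begin
            @inbounds for i in lo:hi
                y[i] = _inner_solve(x[i])
            end
        end
    end
    foreach(wait, tasks)  # synchronise: wait for every chunk to finish
    nothing
end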

Gustafson’s Law

While Amdahl’s Law focuses on fixed problem sizes, Gustafson’s Law addresses a more practical scenario: scaled speedup. In real-world applications, when we get more computing resources, we often increase the problem size rather than just solving the same problem faster.

Gustafson’s Law states that the scaled speedup is given by:

$$S = (1-p) + p \cdot N$$

where $N$ is the number of processors, $p$ is the fraction of time spent in the parallelisable portion when running on $N$ processors, and $(1-p)$ is the serial fraction.

The key insight is that as we scale up both the problem size and number of processors, the parallel portion often grows while the serial portion remains relatively constant.
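
As a minimal sketch, the formula itself is a one-liner (the helper name gustafson is ours):

julia
# Gustafson's scaled speed-up on N processors with parallel fraction p.
gustafson(p, N) = (1 - p) + p * N

gustafson(0.95, 32)  # 30.45 -- a 95% parallel workload scales well
gustafson(0.50, 32)  # 16.5  -- a half-serial workload scales far worse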

Let’s demonstrate this with a theoretical example of matrix multiplication, which has a complexity of roughly $\mathcal{O}(n^3)$ for multiplying two $n \times n$ matrices. There is also some serial overhead, which we assume comes from coordinating the workers and stays constant regardless of the number of elements in the matrices. As $n$ grows, so does $p$: a larger problem is more parallelisable, so it becomes more efficient to throw a larger number of workers at it, allowing us to handle the larger problem size. You can see this effect in Figure 1 below.

Figure 1: Gustafson's Law speedup for matrix multiplication. Note how larger problems achieve better speedup with more workers, unlike Amdahl's Law which shows diminishing returns.
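
As a rough illustration of how curves like those in Figure 1 could be produced, here is a sketch; the per-flop time and the constant coordination overhead below are made-up assumptions, not values taken from the figure:

julia
t_flop   = 1e-9  # assumed seconds per floating-point operation
t_serial = 0.05  # assumed constant coordination overhead in seconds

# Parallelisable fraction for an n×n matrix multiplication (~2n^3 flops).
parallel_fraction(n) = (2n^3 * t_flop) / (2n^3 * t_flop + t_serial)

for n in (50, 500, 5000)
    p = parallel_fraction(n)
    println("n = ", n, ": p ≈ ", round(p; digits=4),
            ", scaled speed-up on 32 workers ≈ ", round((1 - p) + 32p; digits=1))
end
# With these assumed constants, small matrices barely benefit from 32 workers,
# while the largest one approaches the full 32x.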

We can essentially recover Amdahl’s law by looking at a vertical slice of Figure 1. For a fixed (and small) input size, the factor $p$ is small and hence the maximum speed-up is very limited, even with a large number of workers. One can see this for the smaller matrix sizes ($n < 50$), where the maximum speed-up is only around 20%, and that is when using 32 workers.

Key differences from Amdahl’s Law:

  1. Fixed vs. Scaled Problems: Amdahl assumes fixed problem size; Gustafson assumes problem size scales with resources
  2. Practical vs. Theoretical Limits: Amdahl shows theoretical limits; Gustafson shows practical benefits of scaling
  3. Pessimistic vs. Optimistic: Amdahl is pessimistic about parallel benefits; Gustafson is more optimistic for scaled workloads

Note on thread count: The number of threads in Julia is fixed at startup and cannot be changed during execution. To truly test Gustafson’s Law with different processor counts, you would need to run separate Julia sessions with different thread settings (e.g., julia -t 1, julia -t 4, etc.) and compare results across those runs.

The demonstration above shows how parallel efficiency changes as we scale the problem size with a fixed number of threads, which illustrates the core principle of Gustafson’s Law: larger problems often maintain better parallel efficiency than smaller ones.
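
If you want to explore this yourself, a minimal sketch (reusing map_solve! and map_solve_parallel! from earlier; the problem sizes are arbitrary) could look like this:

julia
using BenchmarkTools

# Measure the speed-up at several problem sizes with a fixed thread count.
# Larger arrays amortise the threading overhead better, which is the
# scaled-speed-up effect Gustafson's Law describes.
for n in (256, 4096, 65536)
    x = rand(n)
    y = similar(x)
    t_serial   = @belapsed map_solve!($y, $x)
    t_threaded = @belapsed map_solve_parallel!($y, $x)
    println("n = ", n, ": speed-up ≈ ", round(t_serial / t_threaded; sigdigits=3), "x")
end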