
When to Parallelise?

The first question a developer must answer is when to parallelise. Since a parallel implementation is often difficult to write, and can sometimes even be detrimental to performance, this question deserves careful consideration before any code is changed.

Profiling and Benchmarking

Usually, one already has a serial implementation, which is the target of optimisation. Before deciding to parallelise a function, one should begin by benchmarking it on a typical workload. Ideally, this benchmark should be written as a function so that it can be repeated throughout the optimisation process to verify that improvements are actually being made. Profiling should also confirm that the function is worth optimising, i.e. that it accounts for a large fraction of the application’s overall running time. If the application spends only 0.001% of its time in a function, making it twice as fast will have a negligible overall effect (as stated by Amdahl’s law, discussed earlier). We, as developers, have a limited amount of time to spend on our code, and it is important to use it as efficiently as possible.

If we have decided that this function should be optimised and that doing so will have a significant effect, then we should make sure the benchmarks are clearly written out as functions, so they can be repeated in the future.
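As a concrete illustration, here is a minimal sketch of such a repeatable benchmark in C++. The function under test (process) and the workload size are hypothetical stand-ins for your own code:

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the serial function we are considering optimising.
double process(const std::vector<double>& data) {
    return std::accumulate(data.begin(), data.end(), 0.0);
}

// A repeatable benchmark: run the function on a typical workload and
// report the best-of-N wall-clock time, so the same measurement can be
// re-run after every optimisation attempt.
void benchmark_process(std::size_t workload_size, int repetitions = 5) {
    std::vector<double> data(workload_size, 1.0);
    auto best = std::chrono::duration<double>::max();
    for (int i = 0; i < repetitions; ++i) {
        auto start = std::chrono::steady_clock::now();
        volatile double result = process(data);  // volatile: keep the call from being optimised away
        auto end = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(end - start));
        (void)result;
    }
    std::cout << "process(" << workload_size << " elements): "
              << best.count() << " s\n";
}

int main() {
    benchmark_process(10'000'000);  // a typical workload for the application
}
```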

Identifying Speed-up Candidates

Once one has an idea of which parts of the code will have the largest effect on the overall computation time of the program, one should use the techniques and laws from the previous sections of this chapter to estimate analytically whether parallelisation is likely to increase performance.

Firstly, one should think about the dependency graph of this part of the algorithm and see if there are significant sections which can be done at the same time.

Secondly, one should ask if the identified parallel sections make up a significant enough fraction of the whole to make a worthwhile impact on the overall execution of the program. This can be estimated using Amdahl’s Law.

Lastly, one can ask whether increasing the problem size would change the answer to the previous question: if the parallel fraction grows with the problem size, the speed-up becomes more favourable as the workload scales up (this is the scaled speed-up described by Gustafson’s law). This comes with the assumption that increasing data throughput has tangible benefits.
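These estimates are simple enough to compute directly as a sanity check before writing any parallel code. The sketch below applies Amdahl’s law for a fixed problem size and Gustafson’s law for a scaled one; the parallel fraction p and serial fraction s are numbers you would estimate from profiling:

```cpp
#include <iostream>

// Amdahl's law: predicted speed-up on n workers when a fraction p of
// the runtime is parallelisable (fixed problem size).
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// Gustafson's law: scaled speed-up when the problem grows to keep all
// n workers busy (serial fraction s is fixed per run).
double gustafson_speedup(double s, int n) {
    return n - s * (n - 1);
}

int main() {
    // If profiling shows only 0.001% of total runtime in the candidate
    // function, even a thousand workers barely help:
    std::cout << amdahl_speedup(0.00001, 1000) << '\n';  // ~1.00001
    // If 90% of the runtime is parallelisable, 8 workers give ~4.7x:
    std::cout << amdahl_speedup(0.9, 8) << '\n';
    // With a 10% serial fraction and a problem that scales with the
    // machine, 8 workers give ~7.3x:
    std::cout << gustafson_speedup(0.1, 8) << '\n';
}
```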

These are all clues that will help you to avoid spending time parallelising a problem which will have little effect on the overall performance of your program. However, we have not yet discussed the “how” of parallelising code. That will be the topic of the next few chapters.

Practical Slowdowns

Whenever one wishes to parallelise an algorithm, there are significant factors that can get in the way of gaining the maximum speed-up. The main factors can be summarised as:

  1. Time taken to spawn multiple threads/workers (see the sketch after this list).
  2. Time taken to schedule the work and divide it amongst the workers.
  3. Time spent creating additional storage for the workers to avoid race conditions.
  4. Time wasted due to idling because of poor load balancing. (Load balancing means scheduling tasks amongst the workers so that each is kept as busy as possible; if tasks take different amounts of time, this becomes more complicated.)
  5. Idle time of workers due to locks (mutexes, semaphores and atomics); the sketch after this list also measures this effect.
  6. When more cores of a processor are in use, the power drawn by the chip increases substantially. To keep the CPU at a safe temperature, the firmware may reduce the clock speed of each core; lowering the clock speed lowers both the overall power draw and the temperature of the CPU. Additionally, many CPUs have a turbo mode which ramps up the clock speed when only a single core is active, and which is reduced when multiple cores are being used. For these reasons, one may not see as big a speed-up as expected.
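To make the first and fifth factors concrete, here is a minimal C++ sketch that measures the cost of spawning a single thread and the cost of four threads contending on one mutex. The absolute numbers will vary by platform, but both overheads are easily large enough to swamp a small parallel task:

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    using clock = std::chrono::steady_clock;

    // Factor 1: spawning a worker is not free. Time creating and
    // joining a thread whose body does nothing at all.
    auto t0 = clock::now();
    std::thread empty([] {});
    empty.join();
    auto t1 = clock::now();
    std::cout << "spawn + join one thread: "
              << std::chrono::duration<double, std::micro>(t1 - t0).count()
              << " us\n";

    // Factor 5: locks serialise workers. Four threads incrementing a
    // shared counter under one mutex spend most of their time waiting
    // for each other rather than doing useful work.
    std::mutex m;
    long counter = 0;
    auto worker = [&] {
        for (int i = 0; i < 1'000'000; ++i) {
            std::lock_guard<std::mutex> lock(m);
            ++counter;
        }
    };
    auto t2 = clock::now();
    std::thread workers[4];
    for (auto& w : workers) w = std::thread(worker);
    for (auto& w : workers) w.join();
    auto t3 = clock::now();
    std::cout << "4 threads contending on one mutex: "
              << std::chrono::duration<double>(t3 - t2).count()
              << " s (counter = " << counter << ")\n";
}
```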

We will spend time in the next chapters discussing not only how to parallelise code, but also strategies for mitigating these sources of overhead.