Benchmarking

If the changes in measurement are only small fluctuations, we can average over a number of repeats to get a better idea of how long the code takes to run. This is where the idea of benchmarking comes in. Benchmarking can be thought of as a more systematic approach to timing a piece of code. In Julia, benchmarking is handled by a package called BenchmarkTools.jl, which will be used frequently throughout this book. In the same spirit as the @time macro from before, this package provides a few macros for systematically timing a piece of code. Let’s apply the @benchmark macro to our rand function:

julia
import BenchmarkTools: @benchmark
display(@benchmark rand(1000,1000))
Output
BenchmarkTools.Trial: 2820 samples with 1 evaluation per sample.
 Range (min … max):  1.105 ms …  12.147 ms  ┊ GC (min … max):  0.00% … 89.32%
 Time  (median):     1.711 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.763 ms ± 538.265 μs  ┊ GC (mean ± σ):  16.96% ± 14.94%

  ▁█▂   ▃▆▆▂    ▃▃▁  ▃▅▅▂                                      
  ████▆▅████▇▆▄████▇▆██████▅▄▃▄▃▃▃▃▃▂▂▃▂▁▂▁▁▂▁▁▁▁▁▁▁▁▁▂▂▂▂▁▁▂ ▃
  1.1 ms          Histogram: frequency by time        3.48 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 3.

Note: If you run @benchmark in the REPL, it returns a benchmark object that is printed to the console automatically, so the display function is not needed.

This produces a much more comprehensive breakdown of the performance of the rand function. One can see that this benchmark yielded a huge range of results. Notice that it also reports memory and allocation data, which is usually highly correlated with the execution speed of a function; this is discussed in more detail later in the chapter.

Why are some executions of this function so much faster than what we saw before? To answer this, we must remember that Julia is a “just-in-time” compiled language: when Julia executes a piece of code for the first time, it has to compile it into something the computer can process. Most of the built-in timing methods in Julia include this compilation time in their measurements. This is the main reason for using the external package, as it screens out the compilation time and only shows you the results that are important.
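
To see this effect directly, you can time the same call twice with @time. The sketch below uses a hypothetical helper my_sum, defined fresh so that the first call is guaranteed to trigger compilation; exact timings will vary by machine:

julia
my_sum(x) = sum(x)  # hypothetical helper; no compiled method exists yet
A = rand(1000, 1000)
@time my_sum(A)     # first call: includes compilation time
@time my_sum(A)     # second call: reuses the already-compiled method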

If you only care about the minimum time taken, you might prefer the @btime macro over @benchmark, since it gives a much more succinct output:

julia
import BenchmarkTools: @btime
@btime rand(1000,1000);
Output
1.117 ms (3 allocations: 7.63 MiB)

Note: This macro returns the minimum time taken over a set of evaluations. This can give misleading results for code that is intermittently slowed down, such as code with a heavy amount of allocation. For such code, it is often better to look at the histogram of results from a full @benchmark. Likewise, if you embed a function you have benchmarked with @btime inside a loop that calls it N times, you might expect the loop to take N times the figure reported by @btime, but this is often an underestimate. This is also key when comparing two implementations, as it is often important to compare the averages and not just the best-case scenario.
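
If you want aggregates other than the minimum, you can compute them from the object returned by @benchmark. A minimal sketch (minimum and mean are extended by BenchmarkTools.jl to work on its Trial type; mean must be loaded from Statistics):

julia
import BenchmarkTools: @benchmark
import Statistics: mean

b = @benchmark rand(1000, 1000)
display(minimum(b))  # best-case estimate, what @btime reports
display(mean(b))     # average over all samples; fairer for heavily allocating code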

All timings are usually given with an SI prefix, such as n, μ or m, to indicate nano, micro or milli respectively. Remember, these stand for 10⁻⁹, 10⁻⁶ and 10⁻³ respectively; for example, the 1.117 ms reported above is 1.117 × 10⁻³ s.

If you want to time a function and store the result in a variable, you can use the @belapsed macro, which simply returns the timing in seconds, aggregated (via the minimum) over many samples. If you need more precise control over how the benchmark is run, you should consult the documentation for BenchmarkTools.jl.
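
For example, here is a minimal sketch, where A is a hypothetical input array (note the $ interpolation, which is explained next):

julia
import BenchmarkTools: @belapsed

A = rand(1000, 1000)
t = @belapsed sum($A)              # minimum time in seconds, as a Float64
println("sum took $(t * 1e6) μs")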

One very important factor when benchmarking with the macros from BenchmarkTools.jl is to ensure that you are properly interpolating variables. The benchmarked expression is compiled into a function by BenchmarkTools.jl, and any top-level variable it references is a non-constant global; accessing such a global is type unstable and introduces spurious heap allocations. Let’s take a look at an example:

julia
my_arr = rand(1024);
display(@benchmark sum(my_arr))
Output
BenchmarkTools.Trial: 10000 samples with 988 evaluations per sample.
 Range (min … max):  48.887 ns … 732.996 ns  ┊ GC (min … max): 0.00% … 86.90%
 Time  (median):     52.935 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   58.978 ns ±  26.072 ns  ┊ GC (mean ± σ):  0.36% ±  1.75%

  ▅█▆▄▃▂▁             ▃                                        ▁
  █████████▇█▆▆▆▅▅▅▅▁▄█▆▇█▇▆▅▆▄▅▅▄▄▄▁▃▁▄▅▅▁▅▄▁▅▅▄▄▄▄▅▆▅▃▁▃▅▅▄▅ █
  48.9 ns       Histogram: log(frequency) by time       199 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

Here, we have an allocation of just 16 bytes, but it should not exist at all: we don’t expect this sum to allocate. The reason for it is that we passed a global variable into our function. Instead of referencing the global directly, we should interpolate its contents into the benchmark with $:

julia
display(@benchmark sum($my_arr))
Output
BenchmarkTools.Trial: 10000 samples with 992 evaluations per sample.
 Range (min … max):  36.593 ns … 801.512 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.633 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.362 ns ±  24.333 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆██▆▄▂▁▁       ▃▂▂▁                                          ▂
  ██████████▇▇▇▆▆███████▇▆▆▆▆▆▄▅▆▄▄▆▅▄▄▅▄▄▄▄▃▁▃▃▅▅▃▅▅▃▁▃▃▄▁▃▄▅ █
  36.6 ns       Histogram: log(frequency) by time       172 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Notice that this removed the mysterious allocation, while also seemingly improving the performance. When passing arguments to a function benchmarked via BenchmarkTools.jl, it is critical to interpolate the values being passed in. Similarly, if you want to avoid creating a temporary variable, but do not want to benchmark its creation, you can interpolate the expression itself:

julia
display(@benchmark sum($(rand(1024))))
Output
BenchmarkTools.Trial: 10000 samples with 975 evaluations per sample.
 Range (min … max):  36.513 ns … 782.564 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     72.513 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   67.799 ns ±  25.930 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▅▄▂              █▁▅▃▄▃▁▂▁                                  ▂
  ████▇▇▆▆▅▆▅▅▆▅▄▅▁▄█████████▇█▇▆▆▆▅▅▅▅▅▆▆▆▆▇▆▇▄▅▅▄▅▅▄▄▅▄▅▆▄▄▅ █
  36.5 ns       Histogram: log(frequency) by time       152 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Interpolation is key to accurate benchmarking with these macros. There are many options to expand on this capability, such as the setup keyword argument, which may be of interest for low-level benchmarking.
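
As a brief sketch of what setup looks like (this pattern appears in the BenchmarkTools.jl documentation): setup runs before each sample, so each sort! call below operates on fresh, unsorted data, and evals=1 ensures a sample never re-sorts an already sorted array:

julia
import BenchmarkTools: @benchmark

# `setup` generates a fresh array before each sample; `evals=1`
# evaluates the expression exactly once per sample.
display(@benchmark sort!(x) setup=(x = rand(1024)) evals=1)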