Performance Annotations

Julia provides several macros and annotations that can help improve performance by giving the compiler additional information or relaxing certain constraints.

Fast Math

We can allow the compiler to apply floating-point optimisations that are correct for real numbers but may produce different results for IEEE-encoded floats. These optimisations can change numerical results and accuracy, but may also improve the performance of your code. They are enabled via the @fastmath macro in Julia:

julia
function nofastmath_example!(y, x)
    @inbounds for i in eachindex(y, x)
        y[i] = sin(sqrt(x[i] * x[i] + 1.0) / cos(x[i] + 0.1))
    end
    nothing
end

function fastmath_example!(y, x)
    @inbounds @fastmath for i in eachindex(y, x)
        y[i] = sin(sqrt(x[i] * x[i] + 1.0) / cos(x[i] + 0.1))
    end
    nothing
end

We can benchmark these two algorithms:

julia
import BenchmarkTools: @benchmark, @btime, @belapsed

x = rand(1024) .+ 1.0; y = similar(x);
display(@benchmark nofastmath_example!($y, $x))
Output
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  9.066 μs …  22.909 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.186 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.317 μs ± 504.506 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▆▆▃▅▆▅▅▄▄▃▃▄▄▂▂▁▁▁▁▁                                      ▂
  ███████████████████████████▇▇▆▇▆▆▆▇▆▇▆▆▇▆▅▆▅▆▆▃▆▅▄▅▄▄▁▅▄▆▇█ █
  9.07 μs      Histogram: log(frequency) by time        11 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
display(@benchmark fastmath_example!($y, $x))
Output
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.981 μs …  21.634 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.226 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.545 μs ± 705.802 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▄▄ █▇▅▃▃▂   ▂▃▁         ▂▃    ▂ ▂▂▂▄▃▁▁▁                   ▂
  ██████████████████▇▆▇▇▆▆▆███▇▇▇██████████▆▆▆▆▆██▇█▇▆▅▅▅▅▆▄▅ █
  8.98 μs      Histogram: log(frequency) by time      11.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

We can see that the @fastmath version doesn't provide much of a performance improvement here. It is still worth keeping in mind, as it may improve performance on some systems by allowing the compiler to use faster floating-point instructions and to reorder computations.

Using the “fast” math operations can violate strict IEEE 754 semantics; for example, @fastmath may assume that no NaN or Inf values occur, in which case operations on such values produce undefined results. For this reason it is often avoided in scientific applications and remains a strictly opt-in performance enhancement.
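
To see the effect, consider this minimal sketch (the function names are our own): under @fastmath the compiler may reassociate the reduction below, changing the order of floating-point roundings, so the two results can differ in the last few bits:

julia
# Sketch: @fastmath permits reassociating (and vectorising) this reduction,
# which changes the rounding order, so the two sums may differ slightly.
function sum_strict(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    s
end

function sum_fast(x)
    s = zero(eltype(x))
    @fastmath for v in x
        s += v
    end
    s
end

x = rand(10^6);
sum_strict(x) - sum_fast(x)  # typically a small non-zero difference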

Bounds Checking

We have already seen the use of the @inbounds macro throughout this course. This is one of the easiest optimisations to make, as long as you are confident that you are accessing memory correctly. Turning off bounds checking while accessing invalid indices may lead to undefined behaviour, memory corruption, and crashes. This risk can be mitigated by the proper use of constructs like eachindex or axes.

julia
function with_bounds_check(x)
    s = zero(eltype(x))
    for i in 1:length(x)
        s += x[i]
    end
    s
end

function without_bounds_check(x)
    s = zero(eltype(x))
    @inbounds for i in 1:length(x)
        s += x[i]
    end
    s
end

x = rand(1000);
display(@benchmark with_bounds_check($x))
Output
BenchmarkTools.Trial: 10000 samples with 210 evaluations per sample.
 Range (min … max):  364.448 ns …  1.011 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     364.743 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   368.054 ns ± 15.801 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▂  ▂▂▂▁ ▁     ▂                                            ▁
  ██████████████████▇▇▆▆▆▆▅▅▅▅▄▄▅▅▅▄▄▅▅▄▅▄▄▄▃▄▃▄▃▅▄▃▃▄▂▃▃▃▃▂▂▃ █
  364 ns        Histogram: log(frequency) by time       417 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
display(@benchmark without_bounds_check($x))
Output
BenchmarkTools.Trial: 10000 samples with 215 evaluations per sample.
 Range (min … max):  340.591 ns … 772.391 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     347.042 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   351.020 ns ±  13.861 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         █▄▁  ▁▃▂▂  ▁▁▁  ▂▂ ▂▁   ▃▁                             ▁
  █▅▁▃▁▄████▇███████████▇██▇██▇▇▆██▇▆▆▆▆▇▆▆▅▅▆▅▅▆▆▆▆▅▅▅▅▅▄▅▅▄▅▅ █
  341 ns        Histogram: log(frequency) by time        392 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The performance improvement from removing bounds checking can sometimes be significant, especially in tight loops. This optimisation was more important in older versions of Julia and is not always required today, especially when iteration constructs such as eachindex or axes are used, since these allow the compiler to prove that accesses are in bounds and elide the checks itself.
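
As a minimal sketch of that last point, the same loop written with eachindex often needs no @inbounds annotation at all, because the compiler can prove every access is in bounds:

julia
# Sketch: eachindex ties the loop range to the array's own indices, so the
# compiler can often remove the bounds checks without @inbounds.
function safe_sum(x)
    s = zero(eltype(x))
    for i in eachindex(x)
        s += x[i]
    end
    s
end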

Inlining

If a function is short enough, you can encourage the compiler to always inline it, so that at runtime you do not pay the cost of a function call at each call site. This can be done in Julia using the @inline macro:

julia
@inline function fast_add(a, b)
    a + b
end

function use_fast_add(x)
    s = zero(eltype(x))
    for val in x
        s = fast_add(s, val)
    end
    s
end

This makes a strong suggestion to the compiler that wherever fast_add is used, its body should be inserted directly at the call site rather than emitting a function call. We should note that the compiler already has heuristics for deciding whether or not to inline your code, and the @inline macro is only a suggestion that encourages inlining. Most of the time it is not necessary, but it can be useful in performance-critical code.
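
For completeness, Julia also provides the opposite annotation, @noinline, which discourages inlining. A brief sketch of where this might help (the function here is our own illustration):

julia
# Sketch: @noinline discourages inlining, which can keep compiled code small
# when a function body is large or sits on a rarely-taken error path.
@noinline function handle_rare_error(msg)
    error("unexpected state: " * msg)
end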

Constant Propagation

We have already seen that some code can get compiled away if all the constants are known at compile time. Take the following example:

julia
compiled_fn() = sum(1:1000);
display(@code_typed compiled_fn())
Output
CodeInfo(
1 ─     return 500500
) => Int64

One can see that the compiled code simply returns the precomputed constant: the sum is never actually performed at runtime, since it can be evaluated entirely at compile time.
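
Constant propagation also kicks in, as a minimal sketch, when a function is called with literal arguments, since the compiler can push those constants through the call:

julia
# Sketch: with a literal argument, constant propagation folds the whole
# computation away; @code_typed typically shows a bare `return 42`.
double(x) = 2x
calls_double() = double(21)
display(@code_typed calls_double())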

If you can give the compiler information about constants at compile time, it can propagate that information forwards to avoid costly computations later on. Julia has a special type called Val which allows us to embed a value directly in the type information:

julia
function dynamics_rule_with_val(u, ::Val{N}) where {N}
    unit = one(typeof(u))
    mask = ~(~zero(typeof(u)) << N)
    u_left = (u << 1) | ((u & (unit << (N-1))) >> (N-1))
    u_right = (u >> 1) | ((u & unit) << (N-1))
    return (xor(xor(u_left, u), u_right) & mask)
end

function dynamics_rule_no_val(u, N)
    unit = one(typeof(u))
    mask = ~(~zero(typeof(u)) << N)
    u_left = (u << 1) | ((u & (unit << (N-1))) >> (N-1))
    u_right = (u >> 1) | ((u & unit) << (N-1))
    return (xor(xor(u_left, u), u_right) & mask)
end

u = rand(Int)
N = 32
display(@benchmark dynamics_rule_no_val($u, $N))
Output
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  2.563 ns … 5.977 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.627 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.632 ns ± 0.117 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅▂     █▇                                                
  ▄███▄▂▂▄███▅▂▂▂▂▄▆▄▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▁▁▂ ▃
  2.56 ns        Histogram: frequency by time       2.96 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
display(@benchmark dynamics_rule_with_val($u, $(Val(N))))
Output
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  1.521 ns … 5.884 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.532 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.560 ns ± 0.129 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▂     ▃▄▄       ▁▂    ▁                                ▂
  ████▇▆▅▆▆███▇▅▄▃▄▄▄███▅▁██▇▄▃▄▆█▇▄▁▄▁▁▃▃▄▄▇▆▄▁▁▄▅▇▆▄▁▃▅██ █
  1.52 ns     Histogram: log(frequency) by time      1.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Making use of the Val type can save significant time by moving work from runtime to compile time. It can also help keep your code as generic as possible while retaining as much performance as possible. The technique should only be used when the value wrapped in Val has just a few possible values and is unlikely to change during execution, since each distinct value triggers a fresh specialisation and compilation.

Note: When introducing Val into your code, you will likely have to propagate the changed call signature up the call stack, which might involve significant changes to your codebase. This is because constructing a Val from a runtime value is usually not type stable (the compiler cannot predict the resulting type), so you want to construct it as far up the call stack as possible, confining the resulting dynamic dispatch to a single point rather than paying for type-unstable code in your hot loops. In these cases it is sometimes better to simply pass in the datatype you want to use instead.
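
As a minimal sketch of that pattern (the wrapper names below are our own), construct the Val once at the top of the call stack and let a function barrier absorb the single dynamic dispatch:

julia
# Sketch: Val(N) for a runtime N is type unstable, so we pay one dynamic
# dispatch at the barrier; everything inside _simulate is then specialised.
simulate(u, N::Int, steps) = _simulate(u, Val(N), steps)  # one dynamic dispatch

function _simulate(u, v::Val{N}, steps) where {N}
    for _ in 1:steps
        u = dynamics_rule_with_val(u, v)
    end
    u
end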