
Performance Annotations

Julia provides several macros and annotations that can help improve performance by giving the compiler additional information or relaxing certain constraints.

Fast Math

We can allow the compiler to apply floating-point optimisations that are correct for real numbers, but may not preserve IEEE 754 semantics. These can change numerical results and accuracy, but may also improve the performance of your code. This is enabled via the @fastmath macro in Julia:

julia
function nofastmath_example!(y, x)
    @inbounds for i in eachindex(y, x)
        y[i] = sin(x[i])
    end
    nothing
end

function fastmath_example!(y, x)
    @inbounds @fastmath for i in eachindex(y, x)
        y[i] = sin(x[i])
    end
    nothing
end

We can benchmark these two algorithms:

julia
import BenchmarkTools: @benchmark, @btime, @belapsed

x = rand(1024); y = similar(x);
display(@benchmark nofastmath_example!($y, $x))
Output
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.333 μs …  21.478 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.644 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.666 μs ± 408.535 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ██▄▃   ▁                                  
  ▁▁▁▁▁▁▃▁▂▇▆▄▃▇▆▅▇▅▄████▇███▅▄▆▂▃▄▃▂▂▃▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  2.33 μs         Histogram: frequency by time        3.19 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
display(@benchmark fastmath_example!($y, $x))
Output
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.400 μs …   7.644 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.578 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.603 μs ± 178.657 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▃▂    █▅▁                                        
  ▂▃▄▄▃▃▃▁▃▃▃▄██▁▇▅████▄▁▅▅▃▃▅▃▁▃▃▄▅▄▃▁▃▃▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▃▂▂▂ ▃
  2.4 μs          Histogram: frequency by time        2.97 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

We can see that the @fastmath example is slightly faster. These optimisations can violate strict IEEE 754 semantics (for example, by assuming no NaNs or infinities occur), making some operations undefined behaviour. For this reason it is avoided in many scientific applications and remains an opt-in performance enhancement.
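To see the semantic difference rather than the speed, we can compare the two directly. A minimal sketch (the exact discrepancy, if any, depends on platform and Julia version):

```julia
# Compare a strict IEEE evaluation of sin with its @fastmath variant.
# The difference, if any, is typically only a few ulps, but it is not
# guaranteed to be zero, which is why @fastmath is opt-in.
x = 0.7
strict = sin(x)
fast   = @fastmath sin(x)
println(abs(strict - fast))  # small (possibly zero) discrepancy
```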

Bounds Checking

We have already seen the use of the @inbounds macro throughout this course. It is one of the easiest optimisations to apply, as long as you are confident that every index you access is valid. Turning off bounds checking while accessing invalid indices can lead to undefined behaviour, memory corruption and crashes. This risk can be mitigated by iterating with methods like eachindex or axes, which produce only valid indices.

julia
function with_bounds_check(x)
    s = zero(eltype(x))
    for i in 1:length(x)
        s += x[i]
    end
    s
end

function without_bounds_check(x)
    s = zero(eltype(x))
    @inbounds for i in 1:length(x)
        s += x[i]
    end
    s
end

x = rand(1000);
display(@benchmark with_bounds_check($x))
Output
BenchmarkTools.Trial: 10000 samples with 210 evaluations per sample.
 Range (min … max):  364.762 ns …  5.597 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     365.714 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   385.122 ns ± 79.397 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▆▃▄▄▃▂▂▃▁▁▁                                                ▂
  █████████████▇▆▆▅▄▄▅▄▄▄▅▃▃▄▄▁▃▁▃▃▁▃▃▁▃▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▆▃▄ █
  365 ns        Histogram: log(frequency) by time       777 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
display(@benchmark without_bounds_check($x))
Output
BenchmarkTools.Trial: 10000 samples with 215 evaluations per sample.
 Range (min … max):  340.930 ns …   2.090 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     348.372 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   386.827 ns ± 122.265 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▆▄▃▂                                                  ▄▁ ▂  ▂
  ████████▇▇▇▆▅▄▄▄▃▄▁▃▄▄▃▃▄▃▃▄▁▄▁▁▁▁▁▁▃▁▃▁▁▁▄▃▁▁▁▁▃▁▁▁▁▁▃▁██▅██ █
  341 ns        Histogram: log(frequency) by time        775 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The performance improvement from removing bounds checking can be significant, especially in tight loops.
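As mentioned above, iterating with eachindex is a safer alternative: because the resulting indices are provably valid, the compiler can often elide the bounds checks itself, without @inbounds. A sketch (safe_sum is an illustrative name):

```julia
function safe_sum(x)
    s = zero(eltype(x))
    for i in eachindex(x)   # indices are provably in bounds, so the
        s += x[i]           # compiler can often elide the checks
    end
    s
end

safe_sum(rand(1000))
```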

Inlining

If a function is short enough, one can encourage the compiler to always inline it, so that at runtime the cost of a function call is avoided. This can be done in Julia using the @inline macro:

julia
@inline function fast_add(a, b)
    a + b
end

function use_fast_add(x)
    s = zero(eltype(x))
    for val in x
        s = fast_add(s, val)
    end
    s
end

This encourages the compiler to inline fast_add at every call site. Note that the compiler already inlines small functions automatically, and @inline is only a suggestion, not a guarantee. Most of the time it is unnecessary.
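The opposite hint also exists: @noinline asks the compiler to keep a real function call, which can be useful for benchmarking the cost of call overhead or for limiting code size. A sketch (slow_add and use_slow_add are illustrative names):

```julia
# @noinline discourages inlining, so each iteration pays the cost
# of an actual function call.
@noinline function slow_add(a, b)
    a + b
end

function use_slow_add(x)
    s = zero(eltype(x))
    for val in x
        s = slow_add(s, val)   # call overhead paid on each iteration
    end
    s
end
```

Comparing @code_typed for the two loop bodies shows whether the call was kept or folded away.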

Constant Propagation

We have already seen that some code can get compiled away if all the constants are known at compile time. Take the following example:

julia
compiled_fn() = sum(1:1000);
@code_typed compiled_fn()

One can see that the generated code simply returns a precomputed constant: the sum is performed at compile time, never at runtime.
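Constant propagation also works across function boundaries when arguments are literals. A sketch (f and g are illustrative names):

```julia
# When g calls f with a literal argument, the compiler can usually
# fold the entire call into a constant.
f(n) = n * n + 1
g() = f(10)        # @code_typed g() typically shows the constant 101
g()
```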

If you can give the compiler information about constants during compile time, it can propagate that information forwards to avoid costly computations down the line. There is a special type in Julia called Val which allows us to insert data into the type information:

julia
function dynamics_rule_with_val(u, ::Val{N}) where {N}
    unit = one(typeof(u))
    mask = ~(~zero(typeof(u)) << N)
    u_left = (u << 1) | ((u & (unit << (N-1))) >> (N-1))
    u_right = (u >> 1) | ((u & unit) << (N-1))
    return (xor(xor(u_left, u), u_right) & mask)
end

function dynamics_rule_no_val(u, N)
    unit = one(typeof(u))
    mask = ~(~zero(typeof(u)) << N)
    u_left = (u << 1) | ((u & (unit << (N-1))) >> (N-1))
    u_right = (u >> 1) | ((u & unit) << (N-1))
    return (xor(xor(u_left, u), u_right) & mask)
end

u = rand(Int)
N = 32
display(@benchmark dynamics_rule_no_val($u, $N))
Output
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  2.600 ns … 188.200 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.700 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.013 ns ±   4.314 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ █  ▃                                                  ▄ ▄ ▂
  █▁█▁▁█▁▁█▁▁▆▁▁▇▁▇▁▁▆▁▁▄▁▁▁▁▁▄▁▇▁▁█▁▁▅▁▁▄▁▁▁▁▃▁▁▃▁▁▁▁▁▃▁▁█▁█ █
  2.6 ns       Histogram: log(frequency) by time       4.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
display(@benchmark dynamics_rule_with_val($u, $(Val(N))))
Output
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  1.500 ns … 75.800 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.600 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.632 ns ±  0.922 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆   █    ▁                                           ▂     ▁
  █▁▁▁█▁▁▁▁█▁▁▁▁▄▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁█▁▁▁▅ █
  1.5 ns       Histogram: log(frequency) by time      2.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Making use of the Val type can save significant time by precomputing values at compile time rather than runtime.
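One caveat: Val only pays off when the wrapped value is known at compile time. Constructing Val from a runtime value triggers dynamic dispatch at the call site, which can cost more than it saves. A sketch (f and n are illustrative names):

```julia
# Val{N} puts the value N into the type, so N is a compile-time
# constant inside the method body.
f(::Val{N}) where {N} = N + 1

n = rand(30:34)      # value not known until runtime
f(Val(n))            # dynamic dispatch: the concrete Val{n} type
                     # is only discovered at runtime
f(Val(32))           # literal: the compiler sees Val{32} directly
```

If a runtime value must be lifted into a Val, it is best done once, outside any hot loop, behind a function barrier.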