Performance Annotations
Julia provides several macros and annotations that can help improve performance by giving the compiler additional information or relaxing certain constraints.
Fast Math
We can allow the compiler to apply floating-point optimisations that are correct for real numbers but may change results for IEEE-encoded floats. These may alter numerical results and accuracy, but can improve the performance of your code. This is enabled via the @fastmath macro in Julia:
function nofastmath_example!(y, x)
    @inbounds for i in eachindex(y, x)
        y[i] = sin(x[i])
    end
    nothing
end

function fastmath_example!(y, x)
    @inbounds @fastmath for i in eachindex(y, x)
        y[i] = sin(x[i])
    end
    nothing
end
We can benchmark these two algorithms:
import BenchmarkTools: @benchmark, @btime, @belapsed
x = rand(1024); y = similar(x);
display(@benchmark nofastmath_example!($y, $x))
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
Range (min … max): 2.333 μs … 21.478 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.644 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.666 μs ± 408.535 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
██▄▃ ▁
▁▁▁▁▁▁▃▁▂▇▆▄▃▇▆▅▇▅▄████▇███▅▄▆▂▃▄▃▂▂▃▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
2.33 μs Histogram: frequency by time 3.19 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
display(@benchmark fastmath_example!($y, $x))
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
Range (min … max): 2.400 μs … 7.644 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.578 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.603 μs ± 178.657 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▂ █▅▁
▂▃▄▄▃▃▃▁▃▃▃▄██▁▇▅████▄▁▅▅▃▃▅▃▁▃▃▄▅▄▃▁▃▃▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▃▂▂▂ ▃
2.4 μs Histogram: frequency by time 2.97 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
We can see that the @fastmath example is slightly faster on the median, though the difference here is small. These optimisations can violate strict IEEE 754 semantics: code that relies on properties such as NaN handling, signed zeros, or a particular evaluation order may produce different or undefined results. For this reason, @fastmath is an opt-in performance enhancement and is often avoided in scientific applications where bitwise reproducibility matters.
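To see that @fastmath can change results rather than just timings, consider a plain summation. The function names below are ours; the point is that @fastmath may reassociate the additions, so the result can differ from the strict left-to-right IEEE order in the last few bits, while remaining approximately equal.

```julia
# Sketch: @fastmath may reorder/reassociate floating-point additions.
function strict_sum(x)
    s = zero(eltype(x))
    for v in x
        s += v            # strict left-to-right accumulation
    end
    s
end

function fast_sum(x)
    s = zero(eltype(x))
    @fastmath for v in x
        s += v            # compiler may reassociate (e.g. to vectorise)
    end
    s
end

x = rand(10_000)
strict_sum(x) ≈ fast_sum(x)   # approximately equal, not necessarily bitwise
```

The reassociation is exactly what allows the compiler to vectorise the reduction, which is where the speedup comes from.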
Bounds Checking
We have already seen the use of the @inbounds macro throughout this course. This is one of the easiest optimisations to make, as long as you are confident that every access is within bounds. Turning off bounds checking while accessing invalid regions of memory leads to undefined behaviour, memory corruption and crashes. This risk can be mitigated by generating indices with methods like eachindex or axes.
function with_bounds_check(x)
    s = zero(eltype(x))
    for i in 1:length(x)
        s += x[i]
    end
    s
end

function without_bounds_check(x)
    s = zero(eltype(x))
    @inbounds for i in 1:length(x)
        s += x[i]
    end
    s
end
x = rand(1000);
display(@benchmark with_bounds_check($x))
BenchmarkTools.Trial: 10000 samples with 210 evaluations per sample.
Range (min … max): 364.762 ns … 5.597 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 365.714 ns ┊ GC (median): 0.00%
Time (mean ± σ): 385.122 ns ± 79.397 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅▆▃▄▄▃▂▂▃▁▁▁ ▂
█████████████▇▆▆▅▄▄▅▄▄▄▅▃▃▄▄▁▃▁▃▃▁▃▃▁▃▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▆▃▄ █
365 ns Histogram: log(frequency) by time 777 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
display(@benchmark without_bounds_check($x))
BenchmarkTools.Trial: 10000 samples with 215 evaluations per sample.
Range (min … max): 340.930 ns … 2.090 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 348.372 ns ┊ GC (median): 0.00%
Time (mean ± σ): 386.827 ns ± 122.265 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▆▄▃▂ ▄▁ ▂ ▂
████████▇▇▇▆▅▄▄▄▃▄▁▃▄▄▃▃▄▃▃▄▁▄▁▁▁▁▁▁▃▁▃▁▁▁▄▃▁▁▁▁▃▁▁▁▁▁▃▁██▅██ █
341 ns Histogram: log(frequency) by time 775 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
The performance improvement from removing bounds checking can be significant, especially in tight loops.
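The examples above index with 1:length(x), which is only valid for one-based arrays. A safer pattern (sketched below with a name of our choosing) is to pair @inbounds with eachindex, which yields only valid indices for the given array, so the annotation cannot be invalidated by views or arrays with non-standard axes.

```julia
# Sketch: eachindex produces only valid indices for x, so @inbounds is
# safe here even for arrays whose axes do not start at 1.
function safe_sum(x)
    s = zero(eltype(x))
    @inbounds for i in eachindex(x)
        s += x[i]
    end
    s
end

safe_sum([1.0, 2.0, 3.0])   # returns 6.0
```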
Inlining
If a function is short enough, one can encourage the compiler to always inline it, so that at runtime one does not pay the cost of a function call. This is done in Julia with the @inline macro:
@inline function fast_add(a, b)
    a + b
end

function use_fast_add(x)
    s = zero(eltype(x))
    for val in x
        s = fast_add(s, val)
    end
    s
end
This encourages the compiler to inline the addition at every call site. Note that the compiler already inlines small functions automatically, and @inline is only a suggestion, not a guarantee. Most of the time it is not necessary.
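For completeness, Julia also provides the opposite annotation. The sketch below (function name is ours) uses @noinline, which asks the compiler to keep a real call at every use site, which can be useful to limit code size or to keep a function visible as its own frame in profiles.

```julia
# Sketch: @noinline is the counterpart to @inline, discouraging the
# compiler from inlining this function into its callers.
@noinline function kept_as_call(a, b)
    a + b
end

kept_as_call(1, 2)   # returns 3; only the calling convention changes
```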
Constant Propagation
We have already seen that some code can get compiled away if all the constants are known at compile time. Take the following example:
compiled_fn() = sum(1:1000);
@code_typed compiled_fn()
One can see that the actual code simply returns the constant value which was calculated. The code never actually performs the sum at runtime, since it can be precomputed at compile time.
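Constant propagation also works through function calls with literal arguments. In the sketch below (names are ours), the compiler can typically propagate the literal 7 into the helper and fold the arithmetic away entirely, which you can verify by inspecting @code_typed on the wrapper.

```julia
# Sketch: a literal argument can be propagated and the arithmetic folded
# at compile time, leaving only a returned constant.
square_plus_one(n) = n * n + 1
call_with_literal() = square_plus_one(7)

call_with_literal()   # returns 50
```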
If you can give the compiler information about constants at compile time, it can propagate that information forwards to avoid costly computations down the line. There is a special type in Julia called Val which allows us to embed data in the type information:
function dynamics_rule_with_val(u, ::Val{N}) where {N}
    unit = one(typeof(u))
    mask = ~(~zero(typeof(u)) << N)
    u_left = (u << 1) | ((u & (unit << (N-1))) >> (N-1))
    u_right = (u >> 1) | ((u & unit) << (N-1))
    return (xor(xor(u_left, u), u_right) & mask)
end

function dynamics_rule_no_val(u, N)
    unit = one(typeof(u))
    mask = ~(~zero(typeof(u)) << N)
    u_left = (u << 1) | ((u & (unit << (N-1))) >> (N-1))
    u_right = (u >> 1) | ((u & unit) << (N-1))
    return (xor(xor(u_left, u), u_right) & mask)
end
u = rand(Int)
N = 32
display(@benchmark dynamics_rule_no_val($u, $N))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
Range (min … max): 2.600 ns … 188.200 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.700 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.013 ns ± 4.314 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ ▃ ▄ ▄ ▂
█▁█▁▁█▁▁█▁▁▆▁▁▇▁▇▁▁▆▁▁▄▁▁▁▁▁▄▁▇▁▁█▁▁▅▁▁▄▁▁▁▁▃▁▁▃▁▁▁▁▁▃▁▁█▁█ █
2.6 ns Histogram: log(frequency) by time 4.7 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
display(@benchmark dynamics_rule_with_val($u, $(Val(N))))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
Range (min … max): 1.500 ns … 75.800 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.600 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.632 ns ± 0.922 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆ █ ▁ ▂ ▁
█▁▁▁█▁▁▁▁█▁▁▁▁▄▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁█▁▁▁▅ █
1.5 ns Histogram: log(frequency) by time 2.7 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Making use of the Val type can save significant time by moving computation from runtime to compile time, at the cost of compiling a separate method specialisation for each value of N used.
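A minimal sketch of the Val pattern on its own (the function name is ours): dispatching on Val{N} makes N available as a compile-time constant inside the method body, so expressions involving N can be folded.

```julia
# Sketch: N is part of the type Val{N}, so it is known at compile time
# inside the method and the shift/subtract can be constant-folded.
lowmask(::Val{N}) where {N} = (one(UInt64) << N) - one(UInt64)

lowmask(Val(4))   # returns 0x000000000000000f
```

Note that constructing Val(n) from a runtime value n triggers dynamic dispatch, so this pattern pays off only when the Val is created once (or from a literal) and reused in a hot path.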