Reducing Allocations
By far the easiest thing to look at during optimisation is the number and size of allocations your code makes, and whether these can be reduced. In this context, an “allocation” means reserving space in memory (RAM) to store data. Most algorithms can be written so that the space they require is allocated once, at the start of execution, and reused throughout the rest of the algorithm.
Appending vs Preallocating
In languages like Python and MATLAB, it is very common to see developers appending to their arrays instead of preallocating them. Appending means adding an element to the end of an array, which changes the array’s size.
Many languages make this distinction deliberately, to discourage the use of dynamically resized arrays wherever possible. When MATLAB notices code that resizes an array (i.e. appends a value to it), it warns the user about the performance hit this causes. Let’s look at an example of an algorithm which appends to an array:
numbers = rand(100)
cumulative_sum_array = []
total_sum = 0
for a in numbers
    total_sum += a
    push!(cumulative_sum_array, total_sum)
end
Every time we append to cumulative_sum_array, the computer may need to find a free contiguous block of memory with enough space to hold a new array, consisting of the old array plus the newly appended value. This also involves copying the contents of the old array into the new location. Fortunately, the growth algorithm involved usually allocates more space than is immediately required, precisely to avoid copying and reallocating on every append, but this approach is still quite inefficient.
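If you want to observe this behaviour directly, a small experiment along the following lines records the lengths at which the underlying buffer moves (find_reallocations is a throwaway name for this illustration; the exact lengths reported depend on your Julia version’s growth strategy):
function find_reallocations(n)
    v = Float64[]
    last_ptr = Ptr{Float64}(0)
    realloc_lengths = Int[]
    for i in 1:n
        push!(v, i)
        p = pointer(v)  # address of the array's data buffer
        if p != last_ptr
            # The buffer moved: a larger block was allocated and the old
            # contents were copied across.
            push!(realloc_lengths, i)
            last_ptr = p
        end
    end
    return realloc_lengths
end
find_reallocations(1000)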
The same algorithm can be written without appending to an array, since we know the size of the output:
numbers = rand(100)
# The 'similar' function allocates an array,
# which is the same size and type as the input.
result_arr = similar(numbers)
total_sum = zero(eltype(numbers))
for i in eachindex(result_arr, numbers)
    total_sum += numbers[i]
    result_arr[i] = total_sum
end
Notice that this only required a small number of changes to the algorithm, but this can have a huge impact on performance. Let us turn this algorithm into a function, and see the benchmark performance:
function cumulative_sum_preallocated(numbers)
    results = similar(numbers)
    total = zero(eltype(numbers))
    for i in eachindex(numbers)
        total += numbers[i]
        results[i] = total
    end
    return results
end
function cumulative_sum_appending(numbers)
    results = eltype(numbers)[]
    total = zero(eltype(numbers))
    for i in eachindex(numbers)
        total += numbers[i]
        push!(results, total)
    end
    return results
end
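We can benchmark both functions on the same input (assuming the BenchmarkTools package is available; the cumsum function from Base is used here purely as a correctness check):
using BenchmarkTools

numbers = rand(100)
@assert cumulative_sum_preallocated(numbers) ≈ cumulative_sum_appending(numbers) ≈ cumsum(numbers)

display(@benchmark cumulative_sum_appending($numbers))
display(@benchmark cumulative_sum_preallocated($numbers))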
Let’s create a comparison plot showing the performance difference:
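One possible sketch of such a plot, assuming the BenchmarkTools and Plots packages are available (the exact timings will differ between machines):
using BenchmarkTools, Plots

sizes = [2^k for k in 6:14]
append_times = [(x = rand(n); @belapsed cumulative_sum_appending($x)) for n in sizes]
prealloc_times = [(x = rand(n); @belapsed cumulative_sum_preallocated($x)) for n in sizes]

plot(sizes, append_times; xscale=:log10, yscale=:log10,
     xlabel="array length", ylabel="time (s)", label="appending")
plot!(sizes, prealloc_times; label="preallocated")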
Looking at the results, we can see that the two lines are roughly parallel, meaning the algorithms have the same computational complexity (in this case linear). However, there is a constant offset between the lines, showing that the preallocated code is faster for all array sizes.
Using a Cache
Many parts of your code will involve calculating a formula. For readability, many programmers separate the different terms of an equation into separate variables and then combine them at the end. This has benefits for the programmer, as the code becomes more readable and maintainable, but it can also introduce performance problems. Let’s take a look at the following example:
Suppose that we had to write a function which calculates, for each element of an array $x$, the value
$$y = \frac{5x^5 \sin(x^2) + 20}{e^{-4x} - x^2}.$$
An initial implementation could look like:
function example_equation_no_cache(x)
    numerator = 5 .* x .^ 5 .* sin.(x .^ 2) .+ 20
    denominator = exp.(-4 .* x) .- x .^ 2
    y = numerator ./ denominator
    return y
end
Let’s benchmark this function with some data:
x = rand(8192);
display(@benchmark example_equation_no_cache($x))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  68.524 μs … 1.309 ms  ┊ GC (min … max): 0.00% … 93.20%
 Time  (median):     77.349 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   83.073 μs ± 34.662 μs ┊ GC (mean ± σ):  4.83% ± 9.48%
 Memory estimate: 192.21 KiB, allocs estimate: 9.
Notice that there are many more allocations than actually required for this method, which should only need to allocate as much memory as is stored in x. We can modify the algorithm to store the intermediate values directly in the y array, so that we only allocate once:
function example_equation_cache!(y, x)
    # Set y to the value of the numerator
    y .= 5 .* x .^ 5 .* sin.(x .^ 2) .+ 20
    # Divide out the denominator
    y ./= exp.(-4 .* x) .- x .^ 2
    return y
end
y = similar(x);
display(@benchmark example_equation_cache!($y, $x))
While the speed difference between these two implementations is not that large at the lower end of the distribution, the non-cache version has a significant tail in its timings, driven by garbage collection. Using it inside a hot loop can lead to a huge performance degradation.
One major benefit to using broadcasting notation is that Julia’s compiler can fuse broadcasted operations together, so that the intermediate results do not allocate any memory. This is a significant advantage over Python, where each intermediate result in an equation must allocate memory.
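We can see the effect of fusion directly by benchmarking a fused, in-place broadcast against the same expression without a preallocated output (assuming BenchmarkTools is available; a and out are throwaway names for this illustration):
using BenchmarkTools

a = rand(1000);
out = similar(a);

# A single fused broadcast writes straight into out: no temporaries are allocated.
@btime $out .= 2 .* $a .^ 2 .+ sin.($a);
# The same expression without a preallocated output allocates a fresh array on every call.
@btime 2 .* $a .^ 2 .+ sin.($a);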
There are a few rules which can be used to make this process simpler:
- Write out formulae and equations as scalar functions which are, in turn, broadcasted, so that one need not add a . before each operation (see the sketch at the end of this list)
- Use broadcasting frequently, so that operations can be fused and intermediate results avoided
- Identify whether using a cache would make a significant performance difference in your application
- If many cache variables are required, consider using a struct to group them all together
- If the same-sized cache variable is required many times, consider using a functor to wrap the cache away so that one need not keep track of it (again, see the sketch below)
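As a sketch of the first and last points, using the x and y arrays defined earlier (the names equation_scalar and CachedEquation are invented for this example, and the cache is assumed to hold Float64 values):
# A scalar version of the earlier equation: no dots are needed inside the function.
equation_scalar(x) = (5x^5 * sin(x^2) + 20) / (exp(-4x) - x^2)

# Broadcasting the scalar function fuses into a single loop over the cache.
y .= equation_scalar.(x)

# A functor that owns its cache, so callers do not have to manage it themselves.
struct CachedEquation{T}
    cache::Vector{T}
end
CachedEquation(n::Integer) = CachedEquation(zeros(n))

function (eq::CachedEquation)(x)
    eq.cache .= equation_scalar.(x)
    return eq.cache
end

eq = CachedEquation(length(x))
eq(x)  # every call reuses the same preallocated cache
Writing the formula once as a scalar function keeps the equation readable, while broadcasting it into a cache keeps the allocation count at zero.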