Reducing Allocations
By far the easiest thing to look at during optimisation is the number and size of allocations your code makes, and whether these can be reduced. In this context, an “allocation” means reserving space in memory (RAM) to store data. Most algorithms can be written so that the space they require is allocated once, at the start of execution, and reused throughout the rest of the algorithm.
Appending vs Preallocating
In languages like Python and MATLAB, it is very common to see developers appending to their arrays instead of preallocating them. Appending means adding an element to the end of an array, which changes the array’s size.
Many languages make this distinction deliberately, to discourage the use of dynamically resized arrays wherever possible. When MATLAB notices code that resizes an array (i.e. appends a value to it), it warns the user about the performance hit this causes. Let’s look at an example of an algorithm which appends to an array:
numbers = rand(100)
cumulative_sum_array = []
total_sum = 0
for a in numbers
    total_sum += a
    push!(cumulative_sum_array, total_sum)
end
Every time we append to cumulative_sum_array, the computer may need to find a free contiguous block of memory with enough space to hold a new array, consisting of the old array plus the newly appended value. This also involves copying the contents of the old array into the new location. Fortunately, the growth algorithm involved usually allocates more space than is immediately required, precisely to avoid copying and reallocating on every append, but this approach is still quite inefficient.
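If you want to observe this behaviour directly, a small experiment along the following lines records the lengths at which the underlying buffer moves (find_reallocations is a throwaway name for this illustration; the exact lengths reported depend on your Julia version’s growth strategy):
function find_reallocations(n)
    v = Float64[]
    last_ptr = Ptr{Float64}(0)
    realloc_lengths = Int[]
    for i in 1:n
        push!(v, i)
        p = pointer(v)  # address of the array's data buffer
        if p != last_ptr
            # The buffer moved: a larger block was allocated and the old
            # contents were copied across.
            push!(realloc_lengths, i)
            last_ptr = p
        end
    end
    return realloc_lengths
end
find_reallocations(1000)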
The same algorithm can be written without appending to an array, since we know the size of the output:
numbers = rand(100)
# The 'similar' function allocates an array,
# which is the same size and type as the input.
result_arr = similar(numbers)
total_sum = zero(eltype(numbers))
for i in eachindex(result_arr, numbers)
    total_sum += numbers[i]
    result_arr[i] = total_sum
end
Notice that this only required a small number of changes to the algorithm, but this can have a huge impact on performance. Let us turn this algorithm into a function, and see the benchmark performance:
function cumulative_sum_preallocated(numbers)
    results = similar(numbers)
    total = zero(eltype(numbers))
    for i in eachindex(numbers)
        total += numbers[i]
        results[i] = total
    end
    return results
end
function cumulative_sum_appending(numbers)
    results = eltype(numbers)[]
    total = zero(eltype(numbers))
    for i in eachindex(numbers)
        total += numbers[i]
        push!(results, total)
    end
    return results
end
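We can benchmark both functions on the same input (assuming the BenchmarkTools package is available; the cumsum function from Base is used here purely as a correctness check):
using BenchmarkTools

numbers = rand(100)
@assert cumulative_sum_preallocated(numbers) ≈ cumulative_sum_appending(numbers) ≈ cumsum(numbers)

display(@benchmark cumulative_sum_appending($numbers))
display(@benchmark cumulative_sum_preallocated($numbers))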
Let’s create a comparison plot showing the performance difference:
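One possible sketch of such a plot, assuming the BenchmarkTools and Plots packages are available (the exact timings will differ between machines):
using BenchmarkTools, Plots

sizes = [2^k for k in 6:14]
append_times = [(x = rand(n); @belapsed cumulative_sum_appending($x)) for n in sizes]
prealloc_times = [(x = rand(n); @belapsed cumulative_sum_preallocated($x)) for n in sizes]

plot(sizes, append_times; xscale=:log10, yscale=:log10,
     xlabel="array length", ylabel="time (s)", label="appending")
plot!(sizes, prealloc_times; label="preallocated")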
Looking at the results, we can see that the two lines are roughly parallel, meaning the algorithms have the same computational complexity (in this case linear). However, there is a constant offset between the lines, showing that the preallocated code is faster for all array sizes.
Using a Cache
Many parts of your code will involve calculating a formula. For readability, many programmers separate the different terms of an equation into separate variables and then combine them at the end. This has benefits for the programmer, as the code becomes more readable and maintainable, but it can also introduce performance problems. Let’s take a look at the following example:
Suppose that we had to write a function which calculates, for each element of an array $x$, the value
$$y = \frac{5x^5 \sin(x^2) + 20}{e^{-4x} - x^2}.$$
An initial implementation could look like:
function example_equation_no_cache(x)
    numerator = 5 .* x .^ 5 .* sin.(x .^ 2) .+ 20
    denominator = exp.(-4 .* x) .- x .^ 2
    y = numerator ./ denominator
    return y
end
Let’s benchmark this function with some data:
x = rand(8192);
display(@benchmark example_equation_no_cache($x))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  68.524 μs … 1.309 ms  ┊ GC (min … max): 0.00% … 93.20%
 Time  (median):     77.349 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   83.073 μs ± 34.662 μs ┊ GC (mean ± σ):  4.83% ± 9.48%
 Memory estimate: 192.21 KiB, allocs estimate: 9.
Notice that there are many more allocations than actually required for this method, which should only need to allocate as much memory as is stored in x. We can modify the algorithm to store the intermediate values directly in the y array, so that we only allocate once:
function example_equation_cache!(y, x)
    # Set y to the value of the numerator
    y .= 5 .* x .^ 5 .* sin.(x .^ 2) .+ 20
    # Divide out the denominator
    y ./= exp.(-4 .* x) .- x .^ 2
    return y
end
y = similar(x);
display(@benchmark example_equation_cache!($y, $x))
While the speed difference between these two implementations is not that large at the lower end of the distribution, the non-cache version has a significant tail in its timings, driven by garbage collection. Using it inside a hot loop can lead to a huge performance degradation.
One major benefit to using broadcasting notation is that Julia’s compiler can fuse broadcasted operations together, so that the intermediate results do not allocate any memory. This is a significant advantage over Python, where each intermediate result in an equation must allocate memory.
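We can see the effect of fusion directly by benchmarking a fused, in-place broadcast against the same expression without a preallocated output (assuming BenchmarkTools is available; a and out are throwaway names for this illustration):
using BenchmarkTools

a = rand(1000);
out = similar(a);

# A single fused broadcast writes straight into out: no temporaries are allocated.
@btime $out .= 2 .* $a .^ 2 .+ sin.($a);
# The same expression without a preallocated output allocates a fresh array on every call.
@btime 2 .* $a .^ 2 .+ sin.($a);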
There are a few rules which can be used to make this process simpler:
- Write out formulae and equations as scalar functions which are, in turn, broadcasted, so that one need not add a . before each operation (see the sketch at the end of this list)
- Use broadcasting frequently, so that operations can be fused and intermediate results avoided
- Identify whether using a cache would make a significant performance difference in your application
- If many cache variables are required, consider using a struct to group them all together
- If the same-sized cache variable is required many times, consider using a functor to wrap the cache away so that one need not keep track of it (again, see the sketch below)
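As a sketch of the first and last points, using the x and y arrays defined earlier (the names equation_scalar and CachedEquation are invented for this example, and the cache is assumed to hold Float64 values):
# A scalar version of the earlier equation: no dots are needed inside the function.
equation_scalar(x) = (5x^5 * sin(x^2) + 20) / (exp(-4x) - x^2)

# Broadcasting the scalar function fuses into a single loop over the cache.
y .= equation_scalar.(x)

# A functor that owns its cache, so callers do not have to manage it themselves.
struct CachedEquation{T}
    cache::Vector{T}
end
CachedEquation(n::Integer) = CachedEquation(zeros(n))

function (eq::CachedEquation)(x)
    eq.cache .= equation_scalar.(x)
    return eq.cache
end

eq = CachedEquation(length(x))
eq(x)  # every call reuses the same preallocated cache
Writing the formula once as a scalar function keeps the equation readable, while broadcasting it into a cache keeps the allocation count at zero.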