Caching and Memory Locality

In Hardware, we briefly discussed how modern CPUs are structured. This topic usually takes a backseat when comparing the performance of various algorithms, but it is of paramount importance when writing performance-critical code.

Figure 1: A diagram of a typical cache setup for a dual-core processor. The green arrows show the direction of memory travel: the closer the memory is to the CPU, the faster the memory transactions are.

An important observation is that if the memory required by the CPU is already in the L1 cache, then retrieving it is extremely fast (low latency). If it is instead in the L2 cache, access is a little slower, but the CPU still avoids a trip to RAM. The same holds for the L3 cache, which is shared amongst all CPU cores. The decrease in speed comes from the latency of each lookup: if you have to check more places, it takes longer, even if each individual check takes the same amount of time.

Ideally, all the data required by a program would fit in one of the caches, and the processor would minimise the number of times memory has to be fetched from RAM. A cache miss occurs when the requested memory is not available in the cache and must be retrieved from RAM. If the CPU can predict which memory will be needed next, it can load that memory into the cache ahead of time so that cache misses are avoided.

Making effective use of the cache relies on knowing as much as possible about the program's effect on memory, so that the compiler and the generated machine code can lay out memory in a cache-friendly way. The CPU uses heuristics to decide which memory to cache, and it is important to align your code with the expectations of these heuristics. This is also why types are so important: a type specifies exactly how much memory an object occupies and what its physical bits mean. If the compiler knows the type of an object, it can write instructions that reserve exactly the amount of space required, making the object more likely to be available in the cache when it is needed.
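To make this concrete, here is a small illustrative sketch (the struct names are ours, not from the text above): when every field of a struct has a concrete type, Julia knows the exact size and layout of the object, which is what allows arrays of such objects to be stored compactly and traversed in a cache-friendly way.

julia
struct Point2D
    x::Float64   # concrete field types: size and layout are fixed
    y::Float64
end

sizeof(Point2D)      # 16 bytes: two Float64 fields, known exactly
isbitstype(Point2D)  # true: values can be stored inline in arrays

struct BoxedPoint
    x::Real          # abstract field types: size unknown at compile time,
    y::Real          # so values must be stored behind pointers
end
isbitstype(BoxedPoint)  # false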

One trick computers use to minimise cache misses is to transfer memory in fixed-size blocks, known as cache lines, rather than one value at a time. When you access the first element of an array, the hardware fetches the entire cache line containing it (and may prefetch the following lines), so the next few elements are copied into the cache as well. This way, when the program iterates to the next element, it is already in the cache, and a journey back to RAM is not required.
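As an illustration, consider summing the same array sequentially versus with a large stride. The function names below are hypothetical and the exact timings depend on your hardware, but the strided version should typically be measurably slower for large arrays, because it uses only one value from each cache line it fetches:

julia
function sum_sequential(x)
    s = zero(eltype(x))
    @inbounds for i in eachindex(x)   # consecutive memory: cache friendly
        s += x[i]
    end
    s
end

function sum_strided(x, stride)
    s = zero(eltype(x))
    @inbounds for offset in 1:stride, i in offset:stride:length(x)
        s += x[i]   # jumps through memory: most of each cache line is wasted
    end
    s
end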

Finally, code may be written so as to minimise the amount of memory that needs to be stored at all. If we only operate on a few values at a time, the machine instructions can keep them all in registers on the CPU, avoiding the need to store anything in the cache!

Multidimensional Array Indexing

As Julia is designed around numerical computation, it includes an implementation of multidimensional arrays in the core library. Multidimensional array support is critical for scientific computing: Python's main package for this is numpy, and MATLAB is centred around the multidimensional array. It is important to understand how these arrays are actually stored in memory, so that we can avoid introducing performance hits into our applications.

Figure 2: A diagram from Wikipedia showing the way in which multidimensional arrays are stored in linear memory under row-major and column-major ordering schemes.

You will remember that one-dimensional arrays are stored as a contiguous block of memory, and the variable which “stores” the array is, in fact, just a pointer to the start of that block. The memory address of any element can be calculated from its index: multiply the index by the element size and add the result to the pointer as an offset. The question then arises: if an array is stored as a contiguous block of memory, how is a multidimensional array stored? The answer is simple. It is still stored as a single 1D block of memory; the only thing that changes is how one calculates the index into that 1D block from the indices of each dimension.
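We can check this pointer arithmetic directly in Julia. As a sketch: the address of element k of a Float64 vector is the base pointer plus an offset of (k - 1) × 8 bytes, since Julia uses one-based indexing.

julia
x = rand(Float64, 10)
pointer(x, 4) == pointer(x) + 3 * sizeof(Float64)   # true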

Take the example of a matrix. A matrix has two numbers, $i$ and $j$, which we will take to represent which row and column the element lives in respectively. The element of the matrix $A$, indexed as $A[i,j]$, is the number in the $i$-th row and the $j$-th column. There are two options here for storing this array in linear memory. The one which Julia chooses is known as column-major order, which calculates the linear index, $k$, as follows:

$$k = j \times N_{\text{rows}} + i$$

where $i$, $j$ and $k$ use zero-based indexing and $N_{\text{rows}}$ is the total number of rows in the matrix. The alternative scheme, row-major ordering (as used in numpy's implementation), computes the linear index as:

$$k = i \times N_{\text{cols}} + j$$

where $N_{\text{cols}}$ is the total number of columns.
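As a quick sanity check, we can compare the zero-based column-major formula against Julia's own (one-based) LinearIndices. This is a sketch, shifting each index by one to convert between the two conventions:

julia
nrows, ncols = 3, 4
lin = LinearIndices((nrows, ncols))
all(lin[i + 1, j + 1] == (j * nrows + i) + 1
    for i in 0:nrows-1, j in 0:ncols-1)   # true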

These schemes extend naturally to any number of dimensions. The general rule of thumb for column-major storage is that the left-most index should vary fastest: one increments the left-most index first, and only moves on to incrementing the next index once the current one reaches its end.
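In loop form, this means the left-most index belongs in the innermost loop. For example, for a three-dimensional array, the following nesting (a sketch) visits the elements in the order they are laid out in memory:

julia
A = zeros(2, 3, 4)
for k in axes(A, 3), j in axes(A, 2), i in axes(A, 1)
    A[i, j, k] = 1.0   # consecutive iterations touch consecutive memory
end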

We can design an experiment to compare the same algorithm, with the only difference being the order in which data is iterated over through an array. For this experiment, we will take the problem of implementing matrix addition:

julia
function row_major_matrix_add!(C, A, B)
    @assert size(C) == size(A) == size(B)
    # Outer loop over rows, inner loop over columns: row-major traversal.
    @inbounds for i in axes(A, 1)
        for j in axes(A, 2)
            C[i, j] = A[i, j] + B[i, j]
        end
    end
    nothing
end

function column_major_matrix_add!(C, A, B)
    @assert size(C) == size(A) == size(B)
    # Outer loop over columns, inner loop over rows: column-major traversal,
    # matching the memory layout of Julia arrays.
    @inbounds for j in axes(A, 2)
        for i in axes(A, 1)
            C[i, j] = A[i, j] + B[i, j]
        end
    end
    nothing
end

These two functions are theoretically equivalent: each performs exactly the same additions on the same data, and the order in which those additions happen has no effect on the output. One would therefore expect them to have the same performance, but let's test this:

julia
import BenchmarkTools: @benchmark, @btime, @belapsed
N = 1024; A = rand(N, N); B = rand(N, N); C = similar(A);
display(@benchmark row_major_matrix_add!($C, $A, $B))
Output
BenchmarkTools.Trial: 352 samples with 1 evaluation per sample.
 Range (min … max):  10.533 ms … 19.907 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.850 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.165 ms ±  1.476 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▃▃▅ ▅▄  ▄█ ▂ ▁    ▁▃                        
  ▃▁▁▁▁▁▃▁▃▃▅▁▃▅▇▇▇██████▇█████▅██▅▅███▇█▆▇▆▆▇▅▄▄▄▄▃▃▄▃▅▅▃▁▁▃ ▄
  10.5 ms         Histogram: frequency by time        17.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
display(@benchmark column_major_matrix_add!($C, $A, $B))
Output
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  337.600 μs …   2.602 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     390.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   420.395 μs ± 101.681 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄▂▇█▄▁                                                       
  ▃▆██████▇▅▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  338 μs           Histogram: frequency by time          881 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

There is a huge difference between the performance of these two functions! Simply swapping the order in which one iterates over rows and columns has a large impact on performance. This is because Julia stores matrices in column-major order: the column-major loop reads consecutive memory addresses, making full use of each cache line, whereas the row-major loop jumps by the length of an entire column between consecutive accesses, causing a cache miss on almost every load.

Hardware Vectorisation

One of the big selling points of Julia is that we do not need to write our code in a vectorised way for performance. Here, we explicitly reserve "vectorisation" to mean performing bulk, array-level operations (often element-wise). Let's take the example of adding two matrices together, as in the previous section. There, we used a for loop; we can do the same using Julia's broadcasting notation, which operates element-wise:

julia
function broadcasting_add!(C, A, B)
    C .= A .+ B
    nothing
end

We can benchmark this to see that its performance is very similar to the previous implementation, while being much simpler to write:

julia
display(@benchmark broadcasting_add!($C, $A, $B))
Output
BenchmarkTools.Trial: 9535 samples with 1 evaluation per sample.
 Range (min … max):  334.900 μs …   3.084 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     430.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   518.994 μs ± 237.234 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▆▇██▇▆▅▄▄▄▄▄▃▃▂▂▂▂▁▁ ▁▁▂▂▂▂▂▁▁▁                              ▂
  █████████████████████████████████████▇▇▇▆▆▇▇▆▆▆▆▆▆▅▄▆▅▅▅▂▅▂▂▅ █
  335 μs        Histogram: log(frequency) by time       1.47 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
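As an aside, Julia's @. macro offers an even terser way to write the same broadcast: it converts every operation and assignment in the expression into its dotted form. The function name here is ours, for illustration:

julia
function broadcasting_dot_add!(C, A, B)
    @. C = A + B   # expands to C .= A .+ B
    nothing
end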

We can also write the equivalent for loop, taking advantage of Julia’s linear indexing, even for multidimensional arrays:

julia
function vector_add!(C, A, B)
    # eachindex(C, A, B) checks that the arrays share the same indices and
    # returns an efficient iterator over them.
    @inbounds for i in eachindex(C, A, B)
        C[i] = A[i] + B[i]
    end
    nothing
end

display(@benchmark vector_add!($C, $A, $B))
Output
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  333.600 μs …   4.806 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     408.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   466.436 μs ± 182.759 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▇▇▇█▇▆▅▅▄▄▄▃▃▃▂▂▂▂▁▁▁▁▁▁      ▁                              ▂
  ███████████████████████████▇▇████████▇▇▇▇▇▆▇▆▇▆▇▆▆▅▅▅▆▅▆▅▅▄▄▅ █
  334 μs        Histogram: log(frequency) by time       1.24 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

This means there is really only an aesthetic difference between using the broadcasting (vectorised) notation and a for loop. Even if we use @simd to take advantage of hardware vectorisation, we will see no difference, as the Julia compiler is smart enough to do this automatically where it is safe:

julia
function vector_simd_add!(C, A, B)
    @inbounds @simd for i in eachindex(C, A, B)
        C[i] = A[i] + B[i]
    end
    nothing
end

display(@benchmark vector_simd_add!($C, $A, $B))
Output
BenchmarkTools.Trial: 8834 samples with 1 evaluation per sample.
 Range (min … max):  333.800 μs …   3.341 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     429.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   560.307 μs ± 329.970 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆██▇▆▅▄▄▄▃▃▂▂▂▂▃▄▄▃▃▂▂▁▁▁ ▁                                   ▂
  ████████████████████████████████▇▇▇▇▇▇▆▆▆▆▆▇▆▆▆▆▆▆▆▆▆▆▆▃▆▆▆▁▆ █
  334 μs        Histogram: log(frequency) by time        2.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

The @simd macro is usually not needed to speed up your for loops, but it can be helpful for giving the compiler additional leeway when the ordering of operations affects the result, for example when performing a reduction, since the compiler cannot reorder such operations on its own. From the help documentation, this macro asserts several properties of the for loop:

  • It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
  • Floating-point operations on reduction variables can be reordered, possibly causing different results than without @simd.

The @simd macro simply gives the compiler additional flexibility in auto-vectorising a for loop; a sketch of a reduction where it genuinely helps follows the list below. Some constraints that should be followed:

  • The loop must be an innermost loop.
  • The loop body must be straight-line code. Therefore, @inbounds is currently needed for all array accesses.
  • Accesses must have a stride pattern and cannot be “gathers” (random-index reads) or “scatters” (random-index writes).
  • The stride should be unit stride.
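To make the reduction case concrete, here is a sketch of a sum where @simd earns its keep: without it, the compiler must preserve the strict left-to-right summation order of floating-point addition, which prevents full vectorisation.

julia
function simd_sum(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]   # @simd allows the additions to be reassociated,
    end             # possibly changing the last few bits of the result
    s
end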