JM

Table of Contents

Where Does Memory Live?

In Data Types, we introduced the idea of the stack and the heap as two data structures a program uses to organise its data. Again, these are both stored in RAM, and there is no physical difference between the memory in the stack and the heap. As a rough approximation:

  • The stack is a special area of memory which stores temporary variables (such as local variables of a function) which are deleted (or forgotten) once a function has finished executing. This is temporary storage.
  • The heap is used for storage which has variable length, or for objects that are too large for the stack.

It is important to be able to identify which variables will be allocated on the stack, and which will be allocated on the heap. We will talk in later sections about “allocating” memory, but this almost exclusively refers to storing memory on the heap.

Let us analyse the following function:

julia
function stack_only_function(x::T) where {T<:Number}
    a = x + 2
    b = sin(x)+10
    c = cos(x)-5
    d = a*b*c
    return d
end

This function seems to create 4 different variables. It is critical to understand that this function does not allocate any memory on the heap, when the input is just a number. But it is obviously using memory, since we create many different variables. The answer to this apparent contradiction is that it allocates memory on the stack. The reason these variables can exist on the stack is that the size can be predicted at compile-time and hence, space in the stack can be made for these variables.

Instead, let us change the example, so that instead we pass in an array with a single element:

julia
function heap_function(x::T) where {T<:AbstractArray}
    a = x .+ 2
    b = sin.(x) .+ 10
    c = cos.(x) .- 5
    d = a.*b.*c
    return d
end

We can benchmark these two functions on an array:

julia
A = rand(128); B = similar(A);
@show isapprox(stack_only_function.(A), heap_function(A))
Output
isapprox(stack_only_function.(A), heap_function(A)) = true
julia
import BenchmarkTools: @benchmark, @btime, @belapsed
display(@benchmark begin
    ($B) .= stack_only_function.($A)
    nothing
end)
Output
BenchmarkTools.Trial: 10000 samples with 153 evaluations per sample.
 Range (min … max):  671.654 ns …  1.819 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     698.843 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   709.337 ns ± 41.243 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂█   ▁                                                 
  ▁▁▁▄▆▁▂██▃▃▅█▂▂▂▃▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  672 ns          Histogram: frequency by time          865 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
display(@benchmark begin
    ($B) .= heap_function($A)
    nothing
end)
Output
BenchmarkTools.Trial: 10000 samples with 159 evaluations per sample.
 Range (min … max):  642.736 ns …   3.670 μs  ┊ GC (min … max): 0.00% … 58.51%
 Time  (median):     713.604 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   791.681 ns ± 264.888 ns  ┊ GC (mean ± σ):  8.25% ± 14.07%

  ▂▆██▇▇▅▄▃▂▁                                         ▁▂▂▂▂▁▂▁  ▂
  ████████████▇██▇▆▅▅▅▄▃▁▃▄▃▁▃▁▃▁▁▃▃▁▁▁▄▄▅▄▄▁▁▃▄▄▄▃▆▆▇█████████ █
  643 ns        Histogram: log(frequency) by time       1.76 μs <

 Memory estimate: 4.38 KiB, allocs estimate: 8.

Looking at the results, we see that the stack only function does not allocate any memory, but the heap function allocates some memory four times. These algorithms are theoretically identical and yield the exact same results, so why does one algorithm allocate memory and the other does not?

In each example, we are using the broadcasting notation. Notice that in the stack function we are applying the entire function element wise to the array A and then storing the result element wise in the array B. The broadcast notation will simply compile down into a for loop. Inside this loop, the temporary variables a, b, c and d are stored on the stack.

In the heap function, we instead take in the entire array. Each line uses the broadcast notation, but only on the right-hand side. This means that a temporary array must be allocated to store the result. Each of the variables is a new array allocated on the heap, hence the 4 observed allocations from benchmarking.

The performance difference between the two implementations is actually somewhat understated here. We are using @btime which reports the minimum time taken. This will exclude any evaluations which are interrupted by the Garbage Collector, which is likely to happen if this code is used many times within the hot loop of your program.

The reason that the stack is able to store all the temporary information needed is because the size of the intermediate values are known at compile time, whereas the size of the array is not known and therefore must go on the heap.

What if the array we are using is of a known size? For example if we write a function to calculate the cross product of two three-dimensional vectors:

julia
function cross(a, b)
    [
        (a[2]*b[3]-a[3]*b[2]),
        (a[3]*b[1]-a[1]*b[3]),
        (a[1]*b[2]-a[2]*b[1])
    ]
end

a = rand(3);
b = rand(3);
@btime cross($a, $b)
Output
9.830 ns (2 allocations: 80 bytes)

This function runs very quickly, but there is a way we can make this even faster. If we know we are using 3D vectors a lot, we can make use of a package called StaticArrays.jl, which provides a nice utility for implementing fast algorithms which allow arrays to be stack allocated since they are forced to be of a fixed size:

julia
using StaticArrays

function cross(a::SVector{3, T}, b::SVector{3, T}) where {T}
    SVector(
        (a[2]*b[3]-a[3]*b[2]),
        (a[3]*b[1]-a[1]*b[3]),
        (a[1]*b[2]-a[2]*b[1])
    )
end

sa = SVector(a...);
sb = SVector(b...);
@btime cross($sa, $sb)
Output
1.705 ns (0 allocations: 0 bytes)

We can see that this implementation is a lot faster! This is because there are zero allocations and the entire vector fits into the stack, and hence, is very likely to be in the cache. Since the cross product is just 6 multiplications and 3 subtractions, this is designed to be calculated locally.

One thing to remember about static vectors is that the default SArray object is immutable. Instead, one should create new vectors which can be pushed onto the stack. Sometimes, you want some mutability, which can be achieved using the MArray class instead.