Where Does Memory Live?

In Data Types, we introduced the idea of the stack and the heap as two data structures a program uses to organise its data. Again, these are both stored in RAM, and there is no physical difference between the memory in the stack and the heap. As a rough approximation:

The stack is a special area of memory which stores temporary variables (such as local variables of a function) which are deleted (or forgotten) once a function has finished executing. This is temporary storage.
The heap is used for storage which has variable length, or for objects that are too large for the stack.

It is important to be able to identify which variables will be allocated on the stack, and which will be allocated on the heap. We will talk in later sections about “allocating” memory, but this almost exclusively refers to storing memory on the heap.

Let us analyse the following function:

julia

function stack_only_function(x::T) where {T<:Number}
    a = x + 2
    b = sin(x)+10
    c = cos(x)-5
    d = a*b*c
    return d
end

This function seems to create 4 different variables. It is critical to understand that this function does not allocate any memory on the heap, when the input is just a number. But it is obviously using memory, since we create many different variables. The answer to this apparent contradiction is that it allocates memory on the stack. The reason these variables can exist on the stack is that the size can be predicted at compile-time and hence, space in the stack can be made for these variables.

Instead, let us change the example, so that instead we pass in an array with a single element:

julia

function heap_function(x::T) where {T<:AbstractArray}
    a = x .+ 2
    b = sin.(x) .+ 10
    c = cos.(x) .- 5
    d = a.*b.*c
    return d
end

We can benchmark these two functions on an array:

julia

A = rand(128); B = similar(A);
isapprox(stack_only_function.(A), heap_function(A))

julia

import BenchmarkTools: @benchmark, @btime, @belapsed
@btime begin
    ($B) .= stack_only_function.($A)
    nothing
end

Output

585.556 ns (0 allocations: 0 bytes)

julia

@btime begin
    ($B) .= heap_function($A)
    nothing
end

Output

707.092 ns (8 allocations: 4.38 KiB)

Looking at the results, we see that the stack only function does not allocate any memory, but the heap function allocates some memory four times. These algorithms are theoretically identical and yield the exact same results, so why does one algorithm allocate memory and the other does not?

In each example, we are using the broadcasting notation. Notice that in the stack function we are applying the entire function element wise to the array A and then storing the result element wise in the array B. The broadcast notation will simply compile down into a for loop. Inside this loop, the temporary variables a, b, c and d are stored on the stack.

In the heap function, we instead take in the entire array. Each line uses the broadcast notation, but only on the right-hand side. This means that a temporary array must be allocated to store the result. Each of the variables is a new array allocated on the heap, hence the 4 observed allocations from benchmarking.

The performance difference between the two implementations is actually somewhat understated here. We are using @btime which reports the minimum time taken. This will exclude any evaluations which are interrupted by the Garbage Collector, which is likely to happen if this code is used many times within the hot loop of your program.

The reason that the stack is able to store all the temporary information needed is because the size of the intermediate values are known at compile time, whereas the size of the array is not known and therefore must go on the heap.

What if the array we are using is of a known size? For example if we write a function to calculate the cross product of two three-dimensional vectors:

julia

function cross(a, b)
    [
        (a[2]*b[3]-a[3]*b[2]),
        (a[3]*b[1]-a[1]*b[3]),
        (a[1]*b[2]-a[2]*b[1])
    ]
end

a = rand(3);
b = rand(3);
@btime cross($a, $b)

Output

10.600 ns (2 allocations: 80 bytes)

This function runs very quickly, but there is a way we can make this even faster. If we know we are using 3D vectors a lot, we can make use of a package called StaticArrays.jl, which provides a nice utility for implementing fast algorithms which allow arrays to be stack allocated since they are forced to be of a fixed size:

julia

using StaticArrays

function cross(a::SVector{3, T}, b::SVector{3, T}) where {T}
    SVector(
        (a[2]*b[3]-a[3]*b[2]),
        (a[3]*b[1]-a[1]*b[3]),
        (a[1]*b[2]-a[2]*b[1])
    )
end

sa = SVector(a...);
sb = SVector(b...);
@btime cross($sa, $sb)

Output

1.500 ns (0 allocations: 0 bytes)

We can see that this implementation is a lot faster! This is because there are zero allocations and the entire vector fits into the stack, and hence, is very likely to be in the cache. Since the cross product is just 6 multiplications and 3 subtractions, this is designed to be calculated locally.

One thing to remember about static vectors is that they are best used statically, and one should avoid mutating the elements. This is because a lot of the speed comes from their immutability. Instead, one should create a copy of the vector, with the value you want changed.

Back Next

High Performance Computing in Julia

1. Introduction

2. Foundations

3. Julia

4. Measuring Performance

5. Optimisation

6. Parallel Programming

7. Multithreading

8. GPU Programming

Table of Contents

Where Does Memory Live?