JM

Table of Contents

Vectorising vs Loops

Even though we have already covered this in some detail, we must drive home the point of comparing vectorised code and using a loop.

If you have come from MATLAB and Python, the gospel of optimisation is to vectorise your code. Let’s first break down what this means. Take the following two implementations of the same method:

julia
function add_loop!(c, a, b)
    @inbounds for i = 1:length(a)
        c[i] = a[i] + b[i]
    end
    nothing
end

function add_vectorised!(c, a, b)
    c .= a .+ b
    nothing
end

These two implementations are essentially the same in Julia, except that the first implementation does not check the bounds of the inputs before the loop. The first function, containing the for loop, would not be considered to be vectorised, while the second one is the vectorised form. Vectorising your code, simply means writing your code without these for loops, and in terms of only array operations.

In Python or MATLAB, vectorising your code makes a huge difference to performance, despite the actual contents of the computation being essentially the same. Why is there a difference? Well, the difference is because Python and MATLAB rely on array calculations from a library written in a different, faster language. For example, Python extensively uses numpy, which is mostly written in C, and only has Python bindings. When you write C=A+B, when A and B are numpy arrays, then this addition happens “inside C”, and not Python, which is much slower as it is not compiled. Vectorising then, just means offloading most of the computation into a lower level library (such as numpy), having the program spend as much time as possible in the faster language and avoiding doing any bulk of the calculation in the interpreted language.

So what about Julia? Well Julia is a compiled language, and hence is a fast language and so does not need to make this distinction between vectorisation and loops and both operations compile down to practically the same machine code. For this reason, both implementations are practically the same in terms of performance. This means you are free to write loops if it is easier for your code.

Let’s benchmark both approaches:

julia
a = rand(1000); b = rand(1000); c = similar(a);

import BenchmarkTools: @benchmark, @btime, @belapsed
display(@benchmark add_loop!($c, $a, $b))
Output
BenchmarkTools.Trial: 10000 samples with 988 evaluations per sample.
 Range (min … max):  46.154 ns … 484.211 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     54.453 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.737 ns ±   7.416 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁          ▁  █▂  ▆                               
  ▂▁▄▄▁▃█▁▂▆▂▂█▂▂▆▆▃▂▆▅▃▄█▃▄██▄▄█▅▃▃▆▃▃▂▅▄▄▆▆▂▃▂▆▂▂▂▃▂▁▁▁▁▁▁▁▁ ▃
  46.2 ns         Histogram: frequency by time         64.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
display(@benchmark add_vectorised!($c, $a, $b))
Output
BenchmarkTools.Trial: 10000 samples with 987 evaluations per sample.
 Range (min … max):  50.760 ns … 337.386 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     59.878 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   58.932 ns ±   5.865 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄  ▅  ▆▁ ▄▁  ▆▂▁▄▃▁▃▅▁▁▂▅▂▂▁█▄▄▄▅▇▅▃▁▄▂▄ ▁ ▂▁                ▂
  █▄▆█▇▇███████████████████████████████████████▇███▅▇▆▆▆▅█▆▇▇▆ █
  50.8 ns       Histogram: log(frequency) by time      69.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

As we can see, the performance is essentially identical.

However, there are still reasons for preferring vectorisation in Julia. The first reason is readability and maintainability. The broadcast notation (the vectorised notation) is very powerful and can make the code look much cleaner, and hence easier to maintain over time due to the simplicity. On the other hand, even though running native code has practically negligible differences between vector and loop form, if one would like to extend one’s algorithm to run on the GPU, then the vectorised form can be easily (and automatically) translated into efficient GPU kernels. Usually, this can be done without having to change a single line of the source code, only changing the types of the arrays. This means you can have the same code executing on both the CPU and the GPU, which is incredibly powerful. This expressive power massively reduces developer time (since there is only one implementation of an algorithm), while also reducing the effort spent on testing.