
Maps and Reductions

The example given in the previous sections is the essence of what we call a map, a term used widely in functional programming. Essentially, a map simply applies the same operation (or function) to every input. It is an extension of SIMD (Single Instruction, Multiple Data), and this form of computation is “embarrassingly parallel”. In Julia, whenever we use the broadcast operator, we are applying a map:

julia
x = [1,2,3,4];
is_odd(x) = x % 2 == 1;
is_odd.(x)

Here, we have mapped the is_odd function over each element of the input array and returned an array of the outputs.
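
In fact, for a single array argument, broadcasting a function is equivalent to calling Julia's built-in map directly:

julia
map(is_odd, x) == is_odd.(x)  # true: broadcasting a function over an array is a map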

Mapping is a very useful concept, especially since it can be expressed so elegantly in Julia. However, what happens when we need to combine elements together, for example, to sum all the elements in an array? Traditionally, we keep a variable to track the running total, initialised at zero, and then iterate over all the elements, adding each one to the total. This sort of operation is called a reduction, aptly named for producing an output with fewer terms than the input, i.e. reducing a collection to a single element.
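
As a concrete illustration, a serial sum reduction might be written as follows (a minimal sketch; in practice Julia's built-in sum does this for you):

julia
function serial_sum(xs)
    total = zero(eltype(xs))  # running total, initialised at zero
    for x in xs
        total += x            # fold each element into the total
    end
    return total
end

serial_sum([1, 2, 3, 4])      # 10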

Let us take the example of the sum function. The dependency graph of the serial algorithm is shown below. We know that addition is associative (i.e. $a + (b + c) = (a + b) + c$), which means that we are free to rearrange the order of additions and transform the dependencies in a way that aids parallelism. However, inside a computer, numbers are represented with a finite number of bits. Provided there is no overflow, integer addition is still associative. Floating point numbers, on the other hand, are represented in standard form, with a sign, mantissa and exponent. These representations are necessarily limited and approximate most numbers. The approximations lead to a loss of associativity in floating point operations, so different implementations of the same sum of floating point numbers may give slightly different results.
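
This loss of associativity is easy to demonstrate at the REPL:

julia
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)  # false: each grouping rounds differently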

A dependency graph showing serial reduction
Figure 1: Shows a dependency graph (and execution graph) of a reduction (in the form of a sum) on the numbers from 1 to 8. Some steps have been excluded for clarity.

A simple way of making the algorithm parallel is to divide the array into $k$ chunks, one for each worker, have each worker serially reduce its chunk of the array, and then have a final step which combines the results of all $k$ chunks serially. Provided that the number of elements in the array is much larger than the number of workers, this is a very easy task to parallelise.
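
A minimal sketch of this chunked strategy, using Julia's task-based multithreading (the name chunked_sum and the keyword nchunks are illustrative, not from any library):

julia
function chunked_sum(xs; nchunks = Threads.nthreads())
    # split the array into nchunks roughly equal pieces
    chunks = Iterators.partition(xs, cld(length(xs), nchunks))
    # each task serially reduces its own chunk...
    tasks = [Threads.@spawn sum(chunk) for chunk in chunks]
    # ...and a final serial step combines the k partial results
    return sum(fetch.(tasks))
end

chunked_sum(collect(1:8))  # 36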

Another way is to fundamentally change the algorithm. We can assign a pair of numbers to each worker to add together, producing an intermediate result with around half the number of elements of the initial array. From here, the workers repeat the process, combining pairs of numbers, until only a single number is left. This algorithm has a practical benefit when applied to a sum reduction of elements of similar magnitude, which relates to the floating point problem above. Additions of floating point numbers are most accurate when the exponents of the operands are roughly equal (similarly sized numbers). If the exponents differ, the mantissa of one number must be shifted to match the magnitude of the other before addition, propagating forward any errors from the finite bit representation. In the serial algorithm this effect is magnified as the running total grows larger and larger relative to the elements still to be summed.
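
A serial sketch of this pairwise scheme, where each pass through the while loop corresponds to one round of parallel work (pairwise_sum is an illustrative name, and it assumes a non-empty input):

julia
function pairwise_sum(xs)
    ys = collect(xs)
    while length(ys) > 1
        # combine neighbouring pairs; a leftover odd element carries over unchanged
        ys = [i + 1 <= length(ys) ? ys[i] + ys[i + 1] : ys[i]
              for i in 1:2:length(ys)]
    end
    return ys[1]
end

pairwise_sum(1:8)  # 36, matching the tree in Figure 2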

This adapted algorithm, given enough resources, can vastly improve the parallelism of a reduction. In the serial case, the algorithm scales as $\mathcal{O}(n)$, where $n$ is the number of elements in the array. However, in the parallel case, with enough resources, the algorithm scales as $\mathcal{O}(\log_2(n))$. This can be derived easily: each halving of the problem set can be computed in constant time (given enough resources), and it takes $\log_2(n)$ iterations to halve the initial $n$ elements down to $1$, since $t = \log_2(n)$ is the solution to the equation $\left(\frac{1}{2}\right)^t n = 1$.

A dependency graph showing parallel reduction
Figure 2: Shows a dependency graph of a binary parallel reduction algorithm, which sums the numbers from 1 to 8. This is a fundamentally different algorithm from the serial version; however, it is equivalent under the assumption that addition is associative, which holds for integers provided there is no overflow.

This is just a simple example of a reduction for a sum, but there are many interesting and varied implementations, usually relying on the associative (and sometimes also commutative) properties of the underlying operations.

Julia provides default implementations of the map and reduce functions. map takes in a function and a collection, and applies the function to each element of the collection. For example, if we want to square every element of a collection and add 1, we would write:

julia
x = LinRange(0, 1, 5);
x2 = map(y->y*y + 1, x)

Note that y->y*y + 1 is an example of an anonymous function; these are frequently used in Julia when following a more functional style, such as the “map/reduce” format.

If we now want to find the product of this array, we can use the reduce method with the multiplication operator:

julia
reduce(*, x2)

This reduction function takes in a two-argument function and a collection. It iterates over the collection, combining elements until a single scalar result is left. Note that the result is only guaranteed to be well defined if the function is associative (and, for some parallel implementations, commutative), because reduce makes no guarantee about the order in which elements are combined; foldl and foldr provide a guaranteed left or right ordering.
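
For example, subtraction is not associative, so the bracketing matters; foldl and foldr make the order explicit, whereas reduce leaves it implementation-defined:

julia
foldl(-, [1, 2, 3, 4])  # ((1 - 2) - 3) - 4 = -8
foldr(-, [1, 2, 3, 4])  # 1 - (2 - (3 - 4)) = -2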

As it is extremely common to first map something and then apply a reduction, a fused method is provided:

julia
mapreduce(y->y*y+1, *, x)

The first argument of this function is the mapping function, the second is the reduction function, and the third is the collection of elements to which they will be applied. We can benchmark the difference between these two approaches when wrapped in functions.

julia
using BenchmarkTools  # provides the @btime macro

map_then_reduce(x) = reduce(+, map(y->y*y+1, x));
map_and_reduce(x) = mapreduce(y->y*y+1, +, x);
x = rand(10_000);
@btime map_then_reduce($x)
@btime map_and_reduce($x)
Output
  2.700 μs (3 allocations: 78.19 KiB)
  695.238 ns (0 allocations: 0 bytes)

Note that the reduction operation has been changed to + to avoid overflowing to infinity. One can see that the second, fused algorithm is much more efficient, as it does not need to allocate an intermediate array for the map. It is almost always better to use the fused version if you do not need the results of the intermediate map.