Dependency Graphs

In order to parallelise, we first need to formalise how to describe a process that benefits from parallelism. It is easy to build up an intuition for this. Firstly, let’s define what a worker is:

In the context of parallelisation, a worker is an entity which can process information and perform certain tasks.

Purposefully, we have kept this notion as abstract as possible. When thinking about algorithms in an abstract sense, it is often useful to imagine the process being performed by hand, and humans filling the role of the worker. While an algorithm, when implemented on a computer, will be executed by a CPU, it is possible that each core will act like a separate worker. We will often talk about a worker as if they are a human with agency, but this is only for illustration purposes.

We can imagine breaking up a large task into many smaller tasks. A convenient way to organise these subtasks is in a hierarchy, in which tasks at the top do not depend on any other tasks. Moving down the hierarchy, tasks depend on the completion of tasks higher up. This hierarchy is often drawn as a tree. Let’s take a trivial example of a simple stir-fry recipe to illustrate the point:

  1. Slice chicken/protein into thin strips.
  2. Julienne (slice into strips) an onion.
  3. Julienne a bell pepper.
  4. Mince garlic.
  5. Peel and grate ginger.
  6. Cook the rice according to its instructions.
  7. Pre-heat a wok to high heat.
  8. Add half the oil for frying.
  9. Brown the protein in the wok, stirring frequently.
  10. Remove the protein to a plate once slightly browned.
  11. Add the onion to the wok and fry until slightly browned.
  12. Add garlic, ginger and pepper.
  13. Add in the cooked protein (with seasoning and soy sauce) and cook for a couple of minutes.
  14. Serve the stir-fry on the bed of cooked rice.
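The dependencies implied by the steps above can be sketched directly as data. The task names and dependency sets below are our own, heavily simplified, illustration of a few of the recipe's steps, not the full recipe:

```julia
# Map each task to the set of tasks it depends on (a simplified,
# illustrative subset of the recipe).
recipe_deps = Dict(
    "slice protein"  => String[],
    "julienne onion" => String[],
    "cook rice"      => String[],
    "preheat wok"    => String[],
    "brown protein"  => ["slice protein", "preheat wok"],
    "fry onion"      => ["julienne onion", "brown protein"],
    "serve"          => ["fry onion", "cook rice"],
)

# A task is ready to start once every task it depends on is completed.
is_ready(task, completed) = all(dep -> dep in completed, recipe_deps[task])

completed = Set(["slice protein", "preheat wok"])
println(is_ready("brown protein", completed))  # both prerequisites are done
println(is_ready("serve", completed))          # dependencies still missing
```

The tasks with empty dependency sets are exactly the ones at the top of the hierarchy, which can all begin immediately.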

This recipe has many items, but we can draw a diagram of the process. In order to trace dependencies, it is often easiest to start from the last item and work backwards, asking what each step needs. We can draw this process in a dependency graph, shown below.

A dependency graph showing the cooking process for a stir-fry
Figure 1: This is a dependency graph depiction of making a basic stir-fry. Arrows going into a task show a dependency on the task at the origin of the arrow. This type of graph roughly shows the necessary temporal aspect of a task.

Notice how all the tasks in the top row can theoretically be done at the same time. Obviously, we are ignoring the restriction that food loses heat once cooked, and so cooking tasks should only be completed “just in time”. If we had enough resources (enough workers, knives and equipment), we could have one worker cutting an onion while another cuts a bell pepper. Notice how in doing this, we have not only increased the number of people needed, but we have also increased the resources required (one knife to two knives). Additionally, what is not obvious from the recipe is that, given two woks and two people, we could brown the chicken and the onions at the same time, one in each wok. This is because they are not inherently dependent on each other. If we only had a single wok, then suddenly browning the onions would depend on having already browned the chicken. Similarly, if the rice starts cooking at the beginning of the process, it will be ready by the time the rest of the meal has cooked; this process can happen in parallel too. The important aspect of this diagram is that we can visually group tasks by which ones can be done in parallel. This gives us an idea of how to speed up the process, given more workers and resources.
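This visual grouping can also be computed: repeatedly collect every task whose dependencies are already done, and each batch forms a "wave" of tasks that could run in parallel given enough workers. A sketch, using an illustrative cut-down task graph rather than the full recipe:

```julia
# An illustrative, cut-down task graph: task => tasks it depends on.
deps = Dict(
    :chop_onion    => Symbol[],
    :chop_pepper   => Symbol[],
    :cook_rice     => Symbol[],
    :brown_protein => Symbol[],
    :fry_onion     => [:chop_onion, :brown_protein],
    :serve         => [:fry_onion, :chop_pepper, :cook_rice],
)

function parallel_waves(deps)
    done  = Set{Symbol}()
    waves = Vector{Vector{Symbol}}()
    while length(done) < length(deps)
        # Every not-yet-done task whose dependencies are all satisfied
        # can run in the same wave.
        wave = [t for (t, ds) in deps if t ∉ done && all(in(done), ds)]
        isempty(wave) && error("cycle detected: no task is ready")
        push!(waves, sort(wave))  # sort only for reproducible output
        union!(done, wave)
    end
    return waves
end

for (i, wave) in enumerate(parallel_waves(deps))
    println("wave $i: ", wave)
end
```

The number of waves is the length of the longest dependency chain; no amount of extra workers can finish faster than that.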

Imagine a kitchen of professional chefs, who work together on a single meal. If managed correctly, this will usually speed up the process as some tasks can be done in parallel. However, if each task in the preparation depends on the previous task, then having all the additional resources and workers is not helpful, as only one task can be completed at a time.

A dependency graph shows how a process can be parallelised, and which resources are required at which points during processing. However, we have left out any concept of time. We are not trying to solve a scheduling problem, which would inherently involve knowing the time taken to complete each task. A dependency graph tells you only whether a task can be parallelised, not whether it should be, as there is not enough information to answer that question.

One may recognise a dependency graph as having similarities to a flowchart. A flowchart encapsulates the logic of an algorithm, independent of implementation details. One may use flowcharts with small tweaks to act as a dependency graph, to visualise the structure of a task.

While a flowchart is visually very easy to understand, when talking about dependency graphs we must remember that a dependency graph should never have a cycle, i.e. these graphs should be acyclic. This type of graph has a special name: Directed Acyclic Graph (DAG). While we may draw flowcharts with loops (as in a for/while loop), each iteration of such a loop inherently depends on the previous one and is only grouped into a cycle for illustration purposes.
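Whether a given directed graph actually is acyclic can be checked with a depth-first search: if the search ever reaches a vertex that is still on the current path, it has found a cycle. A minimal sketch (the vertex names are illustrative):

```julia
# Check acyclicity of a directed graph given as vertex => out-neighbours.
function is_acyclic(edges)
    state = Dict(v => :unvisited for v in keys(edges))
    function visit(v)
        state[v] = :in_progress
        for w in edges[v]
            state[w] == :in_progress && return false   # back edge ⇒ cycle
            state[w] == :unvisited && !visit(w) && return false
        end
        state[v] = :done
        return true
    end
    for v in keys(edges)
        state[v] == :unvisited && !visit(v) && return false
    end
    return true
end

dag    = Dict(:a => [:b, :c], :b => [:c], :c => Symbol[])
cyclic = Dict(:a => [:b], :b => [:a])
println(is_acyclic(dag))     # true
println(is_acyclic(cyclic))  # false
```

A graph that fails this check cannot be a dependency graph: a cycle would mean some task transitively depends on itself and could never start.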

For clarity, let’s take the simple algorithm written below. The program maps each element of an input array to another value, storing the results in a second, preallocated array.

```julia
function _inner_solve(x)
    # Promote the constants to a type compatible with x, so that no
    # conversion is needed inside the loop.
    rate, limit, x = promote(0.01, 2.0, x)
    for t in 1:100
        x += rate * clamp(x, -limit, limit)
    end
    return x
end

function map_solve!(y, x)
    # Each iteration reads only x[i] and writes only y[i], so no
    # iteration depends on any other.
    @inbounds for i in eachindex(x, y)
        y[i] = _inner_solve(x[i])
    end
    nothing
end
```

An example function which performs an expensive operation on each element of an array and stores the results in a preallocated array. The function promote ensures that the constants are converted to a single type compatible with the element type during compilation, so that no type conversion happens at runtime. It also allows us to name the constants.
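To make the behaviour concrete, here is a minimal usage sketch (the two functions are repeated so the snippet runs on its own):

```julia
# Definitions repeated from above so this snippet is self-contained.
function _inner_solve(x)
    rate, limit, x = promote(0.01, 2.0, x)
    for t in 1:100
        x += rate * clamp(x, -limit, limit)
    end
    return x
end

function map_solve!(y, x)
    @inbounds for i in eachindex(x, y)
        y[i] = _inner_solve(x[i])
    end
    nothing
end

x = collect(range(-1.0, 1.0; length = 5))  # sample inputs
y = similar(x)                             # preallocated output
map_solve!(y, x)
println(y)
```

Note that the caller owns the output array; the function itself allocates nothing, which is why its name carries the conventional `!` suffix for mutating functions.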

If we label the number of elements in x as $n$, then we know that we are calling the function _inner_solve $n$ times, once for each element. We know that processing the inner loop requires loading the $i^{\text{th}}$ element of the array x, then processing this value in another function, and finally storing the output in an array.

We can see a representative dependency graph below. Firstly, processing one element of the inner loop requires a pointer to the array x. We can also see that each evaluation of the inner loop is independent of all other inner loop evaluations. Another way of saying this is that these inner operations can safely be done in any order without changing the final result. Finally, once all the inner loop processes have completed, the algorithm is complete.

A dependency graph showing the parallel structure of the example algorithm
Figure 2: This diagram shows a flowchart-like DAG of the processing dependencies between each part of the program. As the graph is directed and there are no connections between the blue processing steps, we can conclude that these processes share no data and hence can be processed at the same time.

For clarity, the diagram of the inner loop has been included below. Note that this graph does not have any independent processes and so cannot be parallelised. We can roll this process into a single processing step as shown in the previous figure.

A dependency graph showing the inner loop structure
Figure 3: This shows the expanded dependency graph for the blue processes in the previous figure. While this graph clearly contains a cycle in the loop block, it is possible to unroll this loop into a DAG-like structure, since the loop is finite and eventually ends. The process has been depicted this way for convenience, but one should think of it as being unrolled in terms of dependencies. It is clear that each subsequent value of $x$ requires the result of the previous iteration. Unlike the process in the previous figure, this one cannot be parallelised, as each block depends on the last.
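Concretely, unrolling the loop body from the earlier code gives the recurrence, with $r = 0.01$ and $L = 2$:

$$x_t = x_{t-1} + r \operatorname{clamp}(x_{t-1}, -L, L), \qquad t = 1, \dots, 100.$$

Each $x_t$ appears on the right-hand side of the next step, so the chain must be evaluated strictly in order.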

What is important to see is that each evaluation of the inside of the loop is completely independent of all other processing. This means that it is safe for us to work on multiple inner loop evaluations at the same time, i.e. in parallel. This type of parallelism is extremely common and has a name: embarrassingly parallel. Many problems fit into this category, and thankfully they are the easiest to deal with.
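Because each iteration is independent, a threaded variant of the map needs only one change to the loop. This is a sketch using Julia's built-in Threads.@threads; any observed speedup depends on how many threads Julia was started with (e.g. `julia --threads=4`):

```julia
# Definitions repeated so this snippet is self-contained.
function _inner_solve(x)
    rate, limit, x = promote(0.01, 2.0, x)
    for t in 1:100
        x += rate * clamp(x, -limit, limit)
    end
    return x
end

function map_solve_threaded!(y, x)
    # Independent iterations are split across the available threads.
    Threads.@threads for i in eachindex(x, y)
        @inbounds y[i] = _inner_solve(x[i])
    end
    nothing
end

x = randn(1_000)
y = similar(x)
map_solve_threaded!(y, x)
println(y[1] == _inner_solve(x[1]))  # matches the serial result
```

Because the iterations share no data, the threaded version produces exactly the same output as the serial one, in whatever order the threads happen to run.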