CMU 15-418/618 (Spring 2013) Final Project Report
Parallel Minimum Spanning Tree
Michael Choquette
Main Project Page
Summary
I wrote and tested a number of sequential and parallel algorithms for the minimum spanning tree problem, and compared their performance on a collection of 11 random graphs. Preliminary results include not only parallel algorithms that get a 3x speedup on 6 cores, which is good for graphs with no inherent locality, but also optimized sequential ones with a speedup in the realm of 30% over the common approaches.
Background
A spanning tree of a connected undirected graph G on n vertices is a subset of n-1 edges of G that, together with the vertices of G, forms a tree. The Minimum Spanning Tree problem is: given a connected undirected graph G with edge weights, find a spanning tree of G whose total edge weight is as small as possible. Minimum Spanning Tree (hereafter called MST) has a host of applications, from travelling-salesman approximations to circuit design to utility networks. With this in mind, it's useful to develop MST algorithms that scale well to the amount of concurrency available on modern-day machines.

Essentially all implementations of MST work because of a theorem known as the Light Edge Theorem: partition the vertices of any graph with a minimum spanning tree into two nonempty subsets, and the edge of minimum weight among those that go between the two subsets must be in the minimum spanning tree. This theorem implies a few easy sequential algorithms, but it also hints at parallel ones. In particular, it suggests that edges from different parts of the graph could be selected for inclusion in the MST in parallel.
Challenges
Here are things, some that I knew ahead of time and some that I figured out as I went, that make MST a hard problem to solve well, both sequentially and in parallel:
• The first and most fundamental problem is that random graphs (the graphs I was working with) don't have any inherent structure. This manifests itself in two ways:
• Even a small set of vertices has a large set of neighbors, so the graphs have very little locality, and MST algorithms that run on them can easily become bandwidth-bound
• If you divided the graph into k pieces (say, one for each processor), about (k-1)/k of the edges would go between two different pieces. This makes it very hard to statically assign vertices to processors, as every edge in the MST that crosses between pieces (and most of them will) is likely a synchronization point.
• The light edge rule isn't a local property, and applying it repeatedly can result in a lot of duplicate work. Data structures that avoid this duplicate work add complexity, and are liable to hinder parallelism.
• There are many more edges in the graph than end up in the MST, so the runtime of an MST algorithm is often determined by how much work it does on these "bad" edges, rather than the "good" ones it ultimately selects.
Discussion of Known Algorithms
There are three main MST algorithms: Kruskal's, Prim's, and Boruvka's. A detailed description of each is below:
Kruskal's
Kruskal's algorithm is the simplest of the three, and the easiest to implement efficiently.

Description:
• For each edge in the graph in increasing order of weight: if adding that edge doesn't create a cycle, then add it.
Correctness:
Think of the graph as initially having no edges, where each vertex is really a "group" of vertices of size 1: every edge Kruskal's algorithm adds must join two different groups (or else it would create a cycle). Note that the min-weight edge out of a group must be in the MST by the light edge rule (split the graph into that group and the rest; it's the lightest edge between them). Since the algorithm examines edges in increasing order of weight, the edge it adds to merge two groups must be the lightest edge out of at least one of those groups, and so it must be in the MST.

Implementation Details:
The standard way to implement this is to sort the edges by weight at the beginning, then loop through them in sorted order, using a union-find data structure to track and merge the groups. Short of fine-tuning your union-find, not much can be done to improve this. All I did to parallelize it was to do the sort in parallel.
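As a sketch of that standard sequential structure (Python for brevity; my actual code was C++ on pthreads, and the names here are illustrative only):

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v). Returns (total_weight, mst_edges)."""
    parent = list(range(n))  # union-find over vertex groups

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, mst = 0, []
    for w, u, v in sorted(edges):      # increasing order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                   # edge joins two groups: no cycle
            parent[ru] = rv            # merge the groups
            total += w
            mst.append((u, v))
    return total, mst
```

Swapping `sorted(edges)` for a parallel mergesort is the only parallelization my version did.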

Advantages:
• A major piece of the runtime is sorting, which is well-optimized, has fairly good locality, and parallelizes well.
• After the sorting, each edge is only considered once.
• The code is pretty straightforward, and still runs comparably to the fastest solution I can come up with.
Drawbacks:
• The loop through the edges at the end is hard to do in parallel; in particular, we can't cluster the graph and loop through each cluster's edges because most edges will be between clusters (see the Challenges section), and we can't loop through different parts of the array in parallel because it's not easy to tell ahead of time which edges will be included in the MST. More generally, the problem is that each edge is processed so quickly (querying the union-find is basically constant-time) that any parallelization would have to have little to no overhead, and the nonobvious dependencies make this too hard.
• While the sort is fast, it's still doing log(E) work per edge, which isn't great since most of those edges won't end up in the final answer. Using a radix sort can improve this, but it's still not ideal.
Prim's
Prim's algorithm is the other well-known MST algorithm.

Description:
• Start with a single node as a singleton group, with the set of its out-edges being your frontier.
• On each time step:
• Add the lightest-weight edge in the frontier to the MST.
• Add the other endpoint of this edge to your group, remove any other edges in the frontier that point to it, and add its out-edges to the frontier.
Correctness:
This one is easier to see than Kruskal's: your seen set is one side of the cut, the rest of the graph is the other side, and your frontier is all the edges that go across. You're selecting the min-weight edge across at each step, which we know by the light edge rule must be in the MST, and you don't stop until you run out of edges or reach every vertex, so if there is an MST this algorithm will find it.

Implementation Details:
The standard way to implement this is to represent the frontier as a min-heap and remove edges lazily as they come up. However, I was able to get a speedup of around 2x by only keeping the lightest edge to each vertex in my heap, and removing invalid edges eagerly. I also got moderate improvements on top of that by using heaps optimized for cache-efficiency.
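A sketch of the one-edge-per-vertex idea (Python for brevity, with illustrative names; not the project's C++, which removed invalid edges eagerly): track the lightest edge weight seen to each vertex and push to the heap only on improvement, so the heap stays closer to one entry per vertex than one per edge. Since Python's heapq has no decrease-key, superseded entries are skipped lazily when popped:

```python
import heapq

def prim(n, adj):
    """adj[u]: list of (weight, v) out-edges. Returns total MST weight."""
    INF = float("inf")
    best = [INF] * n        # lightest known edge weight to each vertex
    in_tree = [False] * n
    best[0] = 0
    heap = [(0, 0)]         # (weight, vertex)
    total = 0
    while heap:
        w, u = heapq.heappop(heap)
        if in_tree[u]:
            continue        # stale entry, removed lazily
        in_tree[u] = True
        total += w
        for ew, v in adj[u]:
            if not in_tree[v] and ew < best[v]:
                best[v] = ew            # lighter edge to v: supersede old one
                heapq.heappush(heap, (ew, v))
    return total
```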

Advantages:
• Prim's is the simplest MST algorithm, provided that your language of choice has builtin priority queues.
• It's lazier than Kruskal's, so it does less unnecessary work.
Drawbacks:
• Heaps don't have much locality, and the edge frontier can grow to contain the majority of the edges in the graph. If it's too big to fit in cache, your access times can become unacceptably slow.
• It's still doing log(E) work per edge once the heap gets large, though that can be reduced to log(V) if you only keep one edge to each vertex as discussed above.
• This algorithm is inherently sequential. It might be possible to start from multiple vertices at once, but dealing with contention would be challenging.
Boruvka's
Boruvka's is less well-known than Prim's and Kruskal's, but it was the original MST algorithm and is inherently parallel.

Description:
• Start out with every vertex in your graph being a singleton group.
• While there are multiple groups in your graph:
• Find the minimum-weight edge out of every group, and add some or all of them to the MST.
• For every edge you added, merge its endpoints into a single group.
Correctness:
The correctness of this is the easiest to see: the only edges ever added are the lightest edges out of a group, so they're all guaranteed to be in the MST, no matter which edges you choose to add in a round (or even if you choose to add all of them). It's also not too hard to show that adding these edges doesn't create a cycle, but I won't do so here.

Implementation Details:
Boruvka's is less precisely stated than Kruskal's or Prim's, so I had to make a couple of choices about how to structure my code that can't be summarized here. See the section on Boruvka's below.
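Still, a high-level sequential sketch may help orient the reader (Python with illustrative names; this version searches by scanning all edges and binning candidates by group, merges with tree contraction, and assumes a connected graph with consistently tie-broken edge weights):

```python
def boruvka(n, edges):
    """edges: list of (weight, u, v). Returns total MST weight."""
    parent = list(range(n))  # union-find over groups

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, components = 0, n
    while components > 1:
        # search phase: lightest out-edge of each group
        cheapest = {}
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue                 # bad edge: internal to a group
            for r in (ru, rv):
                if r not in cheapest or (w, u, v) < cheapest[r]:
                    cheapest[r] = (w, u, v)
        # merge phase: tree contraction, i.e. add every candidate edge
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:                 # the same edge may appear twice
                parent[ru] = rv
                total += w
                components -= 1
    return total
```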

Advantages:
• Boruvka's has the most theoretical parallelism of the three classic MST solvers.
• If implemented carefully, Boruvka's can do less work per edge than the other MST solvers.
Drawbacks:
• Boruvka's has a lot of synchronization (numerous rounds, each of which has multiple synchronization points).
• Boruvka's has the least locality of any of the solvers: due to the unstructured nature of my graphs (see the Challenges section), each processor has to be aware of most of the vertices all the time, which can strain your memory bandwidth and limit parallelism if the graph is too big.
Discussion of Boruvka's
I thought of Boruvka's algorithm as consisting of a series of passes, each of which is made up of two phases: the search phase and the merge phase. During the search phase, some number of candidate edges are identified, each of which is the lightest out-edge from a group of vertices. During the merge phase, some (or all) of those edges are added to the MST, and the groups that were newly connected by an edge are merged. Almost all the complexity in Boruvka's comes from trying to implement these two phases efficiently.
How to Search
For now, don't worry about how to merge groups; focus on how to identify the lightest out-edge from each group efficiently. Two options come to mind: we could find the lightest edge out of each vertex, bin those by group, and take the min for each group, or we could find the lightest out-edge for each group independently (by keeping a list of the vertices in each group, for example). I'll call the first way vertex-based and the second group-based. The vertex-based search has the advantage of being simple to code and cache-efficient (since you're looping over the vertices in order), without any complex data structures, but doing the binning in parallel is a challenging problem in its own right. The group-based approach, on the other hand, has much less contention and allows for more sophisticated data structures like a heap of edges for each group, but often at the expense of locality, which is precious for bandwidth-bound applications.
How to Merge
The real question here isn't how to merge per se, but how to choose which edges to add in the merge step. One might think the best option is to always add every candidate edge; this is called tree contraction because the edges you add form trees. Tree contraction is guaranteed to reduce the number of groups by at least a factor of two each round, and does far better than that in practice, so this method keeps the number of rounds small while also not wasting any work. On the other hand, merging the groups in a single tree into one big group isn't a trivial task, so tree contraction doesn't scale well to large numbers of cores. One common alternative is called star contraction: choose a random subset of your groups to be sources, and call the rest sinks. At each merge phase, only add an edge if it goes from a source to a sink. This is much easier to do in parallel, since a group that's merging into another group can't itself be the target of a merge, but it also causes many more rounds to occur (it only reduces the number of groups by about a quarter on average), leading to more synchronization overhead.
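The star-contraction filter can be sketched as a small helper (Python, illustrative names; `candidates` holds each group's lightest out-edge and `is_source` the per-round coin flips):

```python
def star_merges(candidates, is_source):
    """candidates: list of (weight, src_group, dst_group), one per group.
    is_source: dict mapping each group id to its coin flip this round.
    Returns {source_group: sink_group} merges to perform this round."""
    merges = {}
    for w, u, v in candidates:
        # keep an edge only if it goes from a source group to a sink group;
        # a merging source can never itself be the target of a merge
        if is_source[u] and not is_source[v]:
            merges[u] = v
    return merges
```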
Choice of Data Structures
The operations required of the core data structures are as follows:
• Get the lightest out-edge from a vertex (or group).
• Merge two groups.
It's important to remember that most of the edges examined by your algorithm will be bad edges, and that your data structure should do as little work on those edges as possible. Also, note that merging needs to be efficient even if one or both groups are very large, and doable concurrently with other merges, ideally even merges involving the same groups. This is a very tough set of restrictions, but my solution was simple: sort the out-edges of each vertex by weight, and track the next edge to examine from each vertex. This is space-efficient, examines bad edges only once, and requires no modifications on a merge. The downside is that as group sizes get large, you have to look at a lot of values to determine the min out-edge, but it turns out to be worth it at the scale I'm working at.
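A sketch of this structure (Python, illustrative names): each vertex's out-edges are sorted by weight once, and a per-vertex cursor advances past edges that have become internal to the vertex's group, so each bad edge is skipped at most once over the whole run:

```python
def make_cursors(adj):
    """adj[v]: out-edges (weight, dst) of vertex v, sorted by weight."""
    return [0] * len(adj)

def lightest_out_edge(v, adj, cursor, group_of):
    """Return v's lightest remaining out-edge leaving its group, or None."""
    edges, i = adj[v], cursor[v]
    while i < len(edges) and group_of[edges[i][1]] == group_of[v]:
        i += 1                 # bad edge: now internal to v's group
    cursor[v] = i              # never revisit skipped edges
    return edges[i] if i < len(edges) else None
```

The min out-edge of a group is then the min of this value over the group's vertices, and nothing needs to move when two groups merge.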
The Versions I Wrote
The two major design choices I outlined above suggest four different implementations; I wrote three of them:

Tree-based vertex-based: hereafter called btv, this had multiple threads write to a shared map of candidate edges, then read that candidate map and do concurrent updates on a shared lock-free union-find data structure. This data structure wrote out all its group information to a shared array after every round, so that threads could more quickly look up the group of a vertex during the search phase.

Star-based vertex-based: hereafter called bsv, this also had multiple threads writing to a single shared candidate map, but replaced the fancy union-find with a simple array that only supported batch updates. Only supporting batch updates also allowed me to maintain a list of existing groups; it would get updated at the same time as the other array. Having a list of groups meant I didn't have to loop through as much of the candidate map, which was very important since star contraction does so many iterations and for most of the time the candidate map is sparse.

Star-based group-based: hereafter called bsg, this ditched the shared candidate map search phase in favor of giving each thread a chunk of groups and having them look through those groups' vertices directly. I stored the vertices for each group in a space-optimized circular linked-list, but didn't move the edges at all. This way I get very fast (concurrent) merges, and don't have to keep copying the edges around.
Why No btg?
I considered writing a tree-based group-based implementation, but I was low on time and ended up deciding against it for the following reasons:
• The main advantage of a group-based approach is that you can use a sophisticated data structure to store the out-edges of each group, and I didn't want to think of one that could be efficiently tree-contracted.
• Preliminary results had suggested that the group-based approach was going to do worse than vertex-based, and I didn't want to waste any time.
Other Relevant Code
I wrote a lot of library-style code for my solvers. Most of it is explained above, but I wanted to highlight some pieces I didn't get a chance to talk about:

Executor:
I used no external libraries other than pthreads for this project, which made dealing with parallelism messy in places. To combat this, I made a class called Executor, which starts up a fixed-size pool of worker threads on construction, takes in encapsulated functions I called Tasks, and executes them on the worker threads. Once I wrote it, the Executor dramatically increased the clarity of my code.

Parallel Sorts:
Since I didn't use any libraries except pthreads, I had to write the parallel sort for Kruskal's myself. I wrote both a mergesort and a samplesort, and the mergesort was decidedly better at low core counts, so that's what I used.

Random Number Generation:
I used the standard library random number generators to determine the source set in star contraction, but generated ints instead of bools and treated each bit as a bool to speed up the generation. I also ran the random number generators concurrently with other code to take them off the critical path of the program.
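The bit-reuse trick looks roughly like this (Python for illustration; the real code used the C++ generators and ran them off the critical path):

```python
import random

def coin_flips(n, rng):
    """Draw 32-bit ints and peel off one bit per coin flip, so one
    generator call yields up to 32 source/sink decisions instead of 1."""
    flips = []
    while len(flips) < n:
        bits = rng.getrandbits(32)
        for _ in range(min(32, n - len(flips))):
            flips.append(bits & 1 == 1)
            bits >>= 1
    return flips
```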
Results and Discussion
The results of running my implementations of MST on a graph with 5 million vertices and 50 million edges, where the edges and their weights were selected uniformly at random, are as follows:

Here are my observations:
• Prim's was comparable in speed to Kruskal's even though it uses a heap (heaps generally perform worse than sorting), because it does extra work to limit the number of edges kept in the heap. Across my 11 test cases, Prim's does better than Kruskal's on the graphs with higher average degree, and worse on the graphs with lower average degree.
• Parallel Kruskal's running on 1 core was only 10% slower than the serial version, because my parallel mergesort nearly degenerated into a serial sort in this case.
• Parallel Kruskal's achieved a speedup of 2X on 6 cores. My timing data suggests that about 1/4 of the runtime is sequential, so this makes sense (the sort itself parallelizes around 5.5X-5.75X).

Here are my observations:
• Tree contraction performed at least twice as well as star contraction on one core. This is because each round of contraction contains a loop which examines at least one edge for each vertex, so when the edge list doesn't fit in cache it involves a ton of slow memory accesses. Star contraction has 4-5 times more rounds than tree contraction, so even though its logic is simpler it has to read too much memory to be competitive.
• Group-based looping was worse than vertex-based looping for star contraction on one core. I think this was because traversing the group data structures added more memory accesses, and the vertex-based approach has no contention on one core anyway.
• Vertex-based tree contraction (btv) got a speedup of just over 3X on 6 cores. I looked through my time logs, and came up with the following reasons it wasn't higher:
• Memory allocation: about 5-10% of the runtime was spent allocating and initializing memory, which was completely sequential.
• The search phase gets around a 3X speedup, I think because of contention.
• The merge phase parallelized poorly, especially when the number of groups was small, because this drove up contention. In particular, my lock-free merge avoided deadlock by always merging the larger-numbered group into the smaller-numbered one, so all the groups at the end are likely to be near the start of the array.
• Flattening the union-find after each round parallelized around 3X too; my original thought was that the problem was too small to parallelize more (which may still be true), but I also noticed that it suffers from workload imbalance because smaller-numbered groups will traverse fewer pointers on average to find their representative element (see previous).
• Vertex-based star contraction (bsv) got a speedup of about 3X on 6 cores as well. I was expecting it to get a higher speedup ratio than tree contraction, since the merge phase has much less contention; here are some reasons why it didn't get better speedup:
• There were many more rounds, increasing the amount of synchronization needed.
• A noticeable portion of the runtime was made up of phases that were too small to parallelize well.
• The merge, which should parallelize nearly perfectly, gets closer to 4X speedup. I think this is because of memory bandwidth issues.
• Updating the union-find parallelizes well, but also now requires a parallel scan, which caps its speedup at around 3X.
• Group-based star contraction (bsg) was unexpectedly bad: not only is it the slowest on one core, but its speedup was only 2X on 6 cores. On investigation, I discovered that this was because of worst-case workload imbalance in the search phase: near the end of star contraction, one group tends to end up orders of magnitude larger than any other group in the graph. My code was only parallelizing the search step across groups, not within groups, so half the time in the 6-core version was spent with one thread working on a large group and the rest idle. Here are two musings on this:
• Vertex-based looping is the extreme solution to this problem: it trades perfect intra-group parallelism for higher contention. Maybe a hybrid of the two approaches could work better than either alone?
• Another way of phrasing this problem is that my code was too lazy; out of fear of doing too much extra work, it limited its parallelism. A more eager data structure would do more work earlier, but (hopefully) continue to parallelize well at the end.
• Both of the above ideas involve adapting my algorithms and data structures to an unfortunate property of star contraction (i.e. that one group gets much larger than any other). If I instead changed how I did the contraction itself, for example favoring smaller groups, I might get better results.