CMU 15-418/618 (Spring 2013) Final Project Report
Parallel Minimum Spanning Tree
Michael Choquette
Summary
I wrote and tested a number of sequential and parallel minimum-spanning-tree algorithms, and compared their performance on a collection of 11 random graphs. Preliminary results include parallel algorithms that achieve a 3x speedup on 6 cores, which is good for graphs with no inherent locality, as well as optimized sequential algorithms roughly 30% faster than the common approaches.
Background
A spanning tree of a connected undirected graph G is a subset of n-1 edges of G that, together with the vertices of G, forms a tree. The Minimum Spanning Tree problem is: given a connected undirected graph G with edge weights, find a spanning tree of G whose total edge weight is as small as possible. Minimum Spanning Tree (hereafter called MST) has a host of applications, from travelling-salesman approximations to circuit design to utility networks. With this in mind, it's useful to develop MST algorithms that scale well to the amount of concurrency available on modern-day machines.

Essentially all implementations of MST work because of a theorem known as the Light Edge Theorem: partition the vertices of any graph with a minimum spanning tree into two nonempty subsets, and the edge of minimum weight among those that go between the two subsets must be in the minimum spanning tree. This theorem implies a few easy sequential algorithms, but it also hints at parallel ones as well. In particular, it suggests that edges from different parts of the graph could be selected for inclusion in the MST in parallel.
Challenges
Here are some things, which I either knew ahead of time or figured out as I went, that make MST a hard problem to solve well, both sequentially and in parallel:
Discussion of Known Algorithms
There are three main MST algorithms: Kruskal's, Prim's, and Boruvka's. A detailed description of each is below:
Kruskal's
Kruskal's algorithm is the simplest of the three, and the easiest to implement efficiently.

Description:
Sort the edges by weight and examine them in increasing order, adding each edge to the tree unless it would create a cycle with the edges added so far.

Correctness:
Think of the graph as initially having no edges, where each vertex is really a "group" of vertices of size 1: every edge Kruskal's algorithm adds must join two different groups (or else it would create a cycle). Side note: the min-weight edge out of a group must be in the MST by the light edge rule (split the graph into that group and the rest; it's the lightest edge between them). Since the algorithm examines edges in increasing order, any edge it adds to merge two groups must be the lightest edge out of either group. Because it's the lightest edge out of at least one of the groups, it must be in the MST.

Implementation Details:
The standard way to implement this is to sort the edges by weight at the beginning, then loop through them in sorted order, using a union-find data structure to track and merge the groups. Short of fine-tuning your union-find, not much can be done to improve this. All I did to parallelize it was to do the sort in parallel.
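As a concrete reference, here is a minimal sequential sketch of that standard scheme (names and types are mine, not the project's actual code): sort the edges, then use a union-find with path compression and union by size to track and merge the groups.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int u, v; double w; };

// Union-find with path compression and union by size.
struct UnionFind {
    std::vector<int> parent, size;
    explicit UnionFind(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    bool unite(int a, int b) {   // returns false if a and b were already joined
        a = find(a); b = find(b);
        if (a == b) return false;
        if (size[a] < size[b]) std::swap(a, b);
        parent[b] = a; size[a] += size[b];
        return true;
    }
};

// Kruskal's: examine edges in increasing weight order; keep each edge that
// joins two different groups.
double kruskal(int n, std::vector<Edge> edges, std::vector<Edge>* tree = nullptr) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    UnionFind uf(n);
    double total = 0.0;
    for (const Edge& e : edges) {
        if (uf.unite(e.u, e.v)) {
            total += e.w;
            if (tree) tree->push_back(e);
        }
    }
    return total;
}
```

The parallel version differs only in performing the initial sort on multiple threads; the union-find loop stays sequential.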

Advantages: Drawbacks:
Prim's
Prim's algorithm is the other well-known MST algorithm.

Description:
Grow the tree from an arbitrary start vertex: maintain a "seen" set of vertices and a frontier of edges from seen vertices to unseen ones, and repeatedly add the lightest frontier edge, moving its far endpoint into the seen set.

Correctness:
This one is easier to see than Kruskal's: your seen set is one side of the cut, the rest of the graph is the other side, and your frontier is all the edges that cross the cut. You're selecting the min-weight crossing edge at each step, which by the light edge rule must be in the MST, and you don't stop until you run out of edges or reach every vertex, so if there is an MST this algorithm will find it.

Implementation Details:
The standard way to implement this is to represent the frontier as a min-heap and remove edges lazily as they come up. However, I was able to get a speedup of around 2x by only keeping the lightest edge to each vertex in my heap, and removing invalid edges eagerly. I also got moderate improvements on top of that by using heaps optimized for cache-efficiency.
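A sketch of that one-entry-per-vertex idea (identifiers are mine): the heap holds at most one live entry per unseen vertex, keyed by the lightest edge found to it so far. For brevity this sketch skips superseded entries when they are popped, whereas the report's version deletes them eagerly, but like that version it never tracks more than one useful edge per vertex.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Adjacency list: adj[u] holds (neighbor, weight) pairs.
using Adj = std::vector<std::vector<std::pair<int, double>>>;

double prim(const Adj& adj) {
    const int n = static_cast<int>(adj.size());
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> best(n, INF);   // lightest known edge to each vertex
    std::vector<bool> seen(n, false);
    // Min-heap of (weight, vertex); std::greater flips priority_queue to a min-heap.
    std::priority_queue<std::pair<double, int>,
                        std::vector<std::pair<double, int>>,
                        std::greater<std::pair<double, int>>> heap;
    best[0] = 0.0;
    heap.push({0.0, 0});
    double total = 0.0;
    while (!heap.empty()) {
        auto [w, v] = heap.top();
        heap.pop();
        if (seen[v] || w != best[v]) continue;  // stale entry, skip it
        seen[v] = true;
        total += w;
        for (auto [u, wu] : adj[v]) {
            if (!seen[u] && wu < best[u]) {     // lighter edge to u found
                best[u] = wu;
                heap.push({wu, u});
            }
        }
    }
    return total;
}
```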

Advantages: Drawbacks:
Boruvka's
Boruvka's is less well-known than Prim's and Kruskal's, but it was the original MST algorithm and is inherently parallel.

Description:
Proceed in rounds. Initially every vertex is its own group; in each round, each group finds the lightest edge leaving it, some or all of those edges are added to the MST, and the groups they connect are merged. Repeat until a single group remains.

Correctness:
The correctness of this is the easiest to see: the only edges ever added are lightest edges out of a group, so by the light edge rule they're all guaranteed to be in the MST, no matter which edges you choose to add in a round (or even if you choose to add all of them). It's also not too hard to show that adding these edges doesn't create a cycle, but I won't do so here.

Implementation Details:
Boruvka's is less precisely specified than Kruskal's or Prim's, so I had to make a couple of choices about how to structure my code that can't be summarized here. See the section on Boruvka's below.

Advantages:
Disadvantages:
Discussion of Boruvka's
I thought of Boruvka's algorithm as consisting of a series of passes, each made up of two phases: a search phase and a merge phase. During the search phase, some number of candidate edges are identified, each of which is the lightest out-edge from a group of vertices. During the merge phase, some (or all) of those edges are added to the MST, and the groups newly connected by an edge are merged. Almost all the complexity in Boruvka's comes from trying to implement these two phases efficiently.
How to Search
For now, don't worry about how to merge groups; focus on how to identify the lightest out-edge from each group efficiently. Two options come to mind: we could find the lightest edge out of each vertex, bin those by group, and take the min for each group; or we could find the lightest out-edge of each group independently (say, by keeping a list of the vertices in each group). I'll call the first way vertex-based and the second group-based. The vertex-based search has the advantage of being simple to code and cache-efficient (since you're looping over the vertices in order), without any complex data structures, but doing the binning in parallel is a challenging problem in its own right. The group-based approach, on the other hand, has much less contention and allows for more sophisticated data structures like a heap of edges for each group, but it often does so at the expense of locality, which is precious for memory-bound applications.
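A sequential sketch of the vertex-based search (all names here are mine; the parallel version bins the per-vertex minima into a shared candidate map rather than a local array):

```cpp
#include <vector>

struct Edge { int u, v; double w; };

// For each group id, return the index of the lightest edge leaving that
// group, or -1 if none. group[v] maps vertex -> group id; inc[v] lists the
// ids of edges incident to v.
std::vector<int> searchVertexBased(int numGroups,
                                   const std::vector<int>& group,
                                   const std::vector<std::vector<int>>& inc,
                                   const std::vector<Edge>& edges) {
    std::vector<int> cand(numGroups, -1);
    for (int v = 0; v < static_cast<int>(group.size()); ++v) {
        // Step 1: lightest edge out of this vertex that leaves its group.
        int best = -1;
        for (int id : inc[v]) {
            const Edge& e = edges[id];
            int other = (e.u == v) ? e.v : e.u;
            if (group[other] == group[v]) continue;  // internal ("bad") edge
            if (best == -1 || e.w < edges[best].w) best = id;
        }
        if (best == -1) continue;
        // Step 2: bin by group, keeping the per-group minimum.
        int g = group[v];
        if (cand[g] == -1 || edges[best].w < edges[cand[g]].w) cand[g] = best;
    }
    return cand;
}
```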
How to Merge
The real question here isn't how to merge per se, but how to choose which edges to add in the merge step. One might think the best option is to always add every candidate edge; this is called tree contraction, because the edges you add form trees. Tree contraction is guaranteed to reduce the number of groups by at least a factor of two each round, and does far better than that in practice, so this method keeps the number of rounds small while also not wasting any work. On the other hand, merging the groups in a single tree into one big group isn't a trivial task, so tree contraction doesn't scale well to large numbers of cores. One common alternative is called star contraction: choose a random subset of your groups to be sources, and call the rest sinks. At each merge phase, only add an edge if it goes from a source to a sink. This is much easier to do in parallel, since a group that's merging into another group can't itself be the target of a merge, but it also requires many more rounds (it only reduces the number of groups by a quarter on average), leading to more synchronization overhead.
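The source/sink rule can be sketched as follows (hypothetical names; note that a group disappears in a round only when it is a source whose candidate edge leads to a sink, which with fair coins happens with probability 1/2 × 1/2 = 1/4, matching the quarter-on-average reduction):

```cpp
#include <vector>

// Decide the star-contraction merges for one round. isSource[g] is g's coin
// flip; candTarget[g] is the group on the far end of g's candidate edge, or
// -1 if g has no out-edge. Returns, for each group, the group it merges into
// (itself if it stays put). Sinks never merge away, and sources never
// receive, so every merge chain has length one and can run concurrently.
std::vector<int> starMerge(const std::vector<bool>& isSource,
                           const std::vector<int>& candTarget) {
    const int n = static_cast<int>(isSource.size());
    std::vector<int> mergeInto(n);
    for (int g = 0; g < n; ++g) {
        mergeInto[g] = g;  // default: stay put
        int t = candTarget[g];
        // Only a source-to-sink candidate edge triggers a merge.
        if (t != -1 && isSource[g] && !isSource[t]) mergeInto[g] = t;
    }
    return mergeInto;
}
```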
Choice of Data Structures
The core data structures must support two operations: finding the lightest out-edge of a group, and merging two groups. It's important to remember that most of the edges examined by the algorithm will be bad edges (edges with both endpoints in the same group), and the data structure should do as little work on those edges as possible. Also note that merging needs to be efficient even when one or both groups are very large, and doable concurrently with other merges, ideally to the same groups. This is a tough set of restrictions, but my solution was simple: sort the out-edges of each vertex by weight, and track the next edge to examine from each vertex. This is space-efficient, examines each bad edge only once, and requires no modifications on a merge. The downside is that as group sizes get large, you have to look at many values to determine the min out-edge, but that turns out to be worth it at the scale I'm working at.
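A sketch of this cursor scheme (the structure and names are mine): each vertex's edges are sorted by weight once, and a per-vertex cursor skips past edges that have become internal. Since groups only grow, an internal edge stays internal forever, so the cursor never backs up and every bad edge is examined exactly once over the whole run.

```cpp
#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

struct SortedAdj {
    // edges[v] = (weight, neighbor) pairs, sorted ascending by weight.
    std::vector<std::vector<std::pair<double, int>>> edges;
    std::vector<size_t> next;  // first edge of v not yet known to be internal

    explicit SortedAdj(std::vector<std::vector<std::pair<double, int>>> e)
        : edges(std::move(e)), next(edges.size(), 0) {
        for (auto& v : edges) std::sort(v.begin(), v.end());
    }

    // Lightest out-edge from v that leaves v's group, as (weight, neighbor),
    // or (inf, -1) if v has none left. Internal edges are skipped permanently.
    std::pair<double, int> lightestOut(int v, const std::vector<int>& group) {
        while (next[v] < edges[v].size() &&
               group[edges[v][next[v]].second] == group[v])
            ++next[v];  // bad edge: never look at it again
        if (next[v] == edges[v].size())
            return {std::numeric_limits<double>::infinity(), -1};
        return edges[v][next[v]];
    }
};
```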
The Versions I Wrote
The two major design choices I outlined above suggest four different implementations; I wrote three of them:

Tree-based vertex-based: hereafter called btv, this had multiple threads write to a shared map of candidate edges, then read that map back and perform concurrent updates on a shared lock-free union-find data structure. After every round, the union-find wrote out all its group information to a shared array, so that threads could quickly look up the group of each vertex during the search phase.

Star-based vertex-based: hereafter called bsv, this also had multiple threads writing to a single shared candidate map, but replaced the fancy union-find with a simple array that only supported batch updates. Only supporting batch updates also allowed me to maintain a list of existing groups, updated at the same time as the group array. Having a list of groups meant I didn't have to loop through as much of the candidate map, which was very important since star contraction does so many iterations and the candidate map is sparse most of the time.

Star-based group-based: hereafter called bsg, this ditched the shared candidate map in favor of giving each thread a chunk of groups and having it look through those groups' vertices directly. I stored the vertices of each group in a space-optimized circular linked list, but didn't move the edges at all. This way I got very fast (concurrent) merges and didn't have to keep copying edges around.
Why No btg?
I considered writing a tree-based group-based implementation, but I was low on time and ended up deciding against it for the following reasons:
Other Relevant Code
I wrote a lot of library-style code for my solvers. Most of it is explained above, but I wanted to highlight some pieces I didn't get a chance to talk about:

Executor:
I used no external libraries other than pthreads for this project, which made dealing with parallelism messy in places. To combat this, I made a class called Executor which starts up a fixed-size pool of worker threads on construction, takes in encapsulated functions I called Tasks, and executes them on the worker threads. Introducing it dramatically improved the clarity of my code.
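A minimal pthreads-only sketch of such a pool follows; the interface (submit, wait, the Task base class) is my guess at the shape of the idea, not the report's actual Executor API.

```cpp
#include <pthread.h>
#include <atomic>
#include <queue>
#include <vector>

// A Task is an encapsulated function object, as described in the report.
struct Task { virtual void run() = 0; virtual ~Task() {} };

// Fixed pool of pthread workers: submit() enqueues a Task; wait() blocks
// until every submitted Task has finished running.
class Executor {
    std::vector<pthread_t> workers_;
    std::queue<Task*> tasks_;
    pthread_mutex_t mu_;
    pthread_cond_t hasWork_, allDone_;
    int active_ = 0;         // tasks currently running
    bool shutdown_ = false;

    static void* workerMain(void* arg) {
        Executor* e = static_cast<Executor*>(arg);
        for (;;) {
            pthread_mutex_lock(&e->mu_);
            while (e->tasks_.empty() && !e->shutdown_)
                pthread_cond_wait(&e->hasWork_, &e->mu_);
            if (e->tasks_.empty()) {           // woken for shutdown
                pthread_mutex_unlock(&e->mu_);
                return nullptr;
            }
            Task* t = e->tasks_.front();
            e->tasks_.pop();
            ++e->active_;
            pthread_mutex_unlock(&e->mu_);
            t->run();                          // run outside the lock
            pthread_mutex_lock(&e->mu_);
            --e->active_;
            if (e->tasks_.empty() && e->active_ == 0)
                pthread_cond_broadcast(&e->allDone_);
            pthread_mutex_unlock(&e->mu_);
        }
    }

public:
    explicit Executor(int nthreads) : workers_(nthreads) {
        pthread_mutex_init(&mu_, nullptr);
        pthread_cond_init(&hasWork_, nullptr);
        pthread_cond_init(&allDone_, nullptr);
        for (pthread_t& w : workers_)
            pthread_create(&w, nullptr, workerMain, this);
    }
    void submit(Task* t) {
        pthread_mutex_lock(&mu_);
        tasks_.push(t);
        pthread_cond_signal(&hasWork_);
        pthread_mutex_unlock(&mu_);
    }
    void wait() {
        pthread_mutex_lock(&mu_);
        while (!tasks_.empty() || active_ > 0)
            pthread_cond_wait(&allDone_, &mu_);
        pthread_mutex_unlock(&mu_);
    }
    ~Executor() {
        wait();
        pthread_mutex_lock(&mu_);
        shutdown_ = true;
        pthread_cond_broadcast(&hasWork_);
        pthread_mutex_unlock(&mu_);
        for (pthread_t w : workers_) pthread_join(w, nullptr);
        pthread_mutex_destroy(&mu_);
        pthread_cond_destroy(&hasWork_);
        pthread_cond_destroy(&allDone_);
    }
};

// Example task: atomically bump a counter.
struct CountTask : Task {
    std::atomic<int>* counter;
    explicit CountTask(std::atomic<int>* c) : counter(c) {}
    void run() override { counter->fetch_add(1); }
};
```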

Parallel Sorts:
Since I didn't use any libraries except pthreads, I had to write the parallel sort for Kruskal's myself. I wrote both a mergesort and a samplesort; the mergesort was decidedly better at low core counts, so that's what I used.

Random Number Generation:
I used the standard library random number generators to determine the source set in star contraction, but generated ints instead of bools and treated each bit as a bool to speed up the generation. I also ran the random number generators concurrently with other code to take them off the critical path of the program.
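The bit-per-bool trick might look like the sketch below (the report doesn't specify which generator it used; I assume `std::mt19937` here for concreteness). One 32-bit draw yields 32 coin flips, cutting the number of generator calls by 32x when choosing star-contraction sources.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Generate n random booleans using one generator call per 32 flips, by
// treating each bit of a 32-bit draw as its own coin flip.
std::vector<bool> randomBools(int n, std::mt19937& rng) {
    std::vector<bool> bits(n);
    uint32_t word = 0;
    for (int i = 0; i < n; ++i) {
        if (i % 32 == 0) word = rng();       // refill every 32 flips
        bits[i] = (word >> (i % 32)) & 1u;   // peel off one bit
    }
    return bits;
}
```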
Results and Discussion
The results of running my implementations of MST on a graph with 5 million vertices and 50 million edges, where the edges and their weights were selected uniformly at random, are as follows:


graph 1

Here are my observations: