CMU 15-418/618 (Spring 2013) Final Project Report

Parallel Minimum Spanning Tree

Michael Choquette

Summary

I wrote and tested a number of sequential and parallel algorithms for the
minimum spanning tree problem, and compared their performance on a collection
of 11 random graphs. The results include parallel algorithms that achieve a 3x
speedup on 6 cores, which is good for graphs with no inherent locality, as well
as optimized sequential algorithms roughly 30% faster than the common
approaches.
Background

A spanning tree of a connected undirected graph G with n vertices is a subset
of n-1 edges of G that, together with the vertices of G, forms a tree. The
Minimum Spanning Tree problem is: given a connected undirected graph G with
edge weights, find a spanning tree of G whose total edge weight is as small as
possible. Minimum Spanning Tree (hereafter called MST) has a host of
applications, from travelling-salesman approximations to circuit design to
utility networks. With this in mind, it's useful to develop MST algorithms that
scale well to the amount of concurrency available on modern-day machines.

Essentially all implementations of MST work because of a theorem known as the
light edge rule: for any split of the vertices into two nonempty sets, the
lightest edge crossing between the two sets must be in the MST.

Challenges

Here are the things, some of which I knew ahead of time and some of which I
figured out as I went, that make MST a hard problem to solve well, both
sequentially and in parallel:
- The first and most fundamental problem is that random graphs (the graphs I
was working with) don't have any inherent structure. This manifests itself in
two ways:
- Even a small set of vertices has a large set of neighbors, so the graphs have very little locality, and MST algorithms that run on them can easily become bandwidth-bound
- If you divided the graph into k pieces (say, one for each processor), (k-1)/k of the edges would go between two pieces. This makes it very hard to statically assign vertices to processors, as every edge in the MST that crosses between pieces (and most of them will) is likely a synchronization point.

- The light edge rule isn't a local property, and applying it repeatedly can result in a lot of duplicate work. Data structures that avoid this duplicate work add complexity, and are liable to hinder parallelism.
- There are many more edges in the graph than end up in the MST, so the runtime of an MST algorithm is often determined by how much work it does on these "bad" edges, rather than the "good" ones it ultimately selects.

Discussion of Known Algorithms

There are three main MST algorithms: Kruskal's, Prim's, and Boruvka's. A
detailed description of each is below:
Kruskal's

Kruskal's algorithm is the simplest of the three, and the easiest to implement
efficiently:
- For each edge in the graph, in increasing order of weight: if adding that edge doesn't create a cycle, add it.

Think of the graph as initially having no edges, where each vertex is really a "group" of vertices of size 1: every edge Kruskal's algorithm adds must join two different groups (or else it would create a cycle). Side note: the min-weight edge out of a group must be in the MST by the light edge rule (split the graph into that group and the rest; it's the lightest edge between them). What this means is that, since the algorithm examines edges in increasing order, the edge it adds to merge two groups must be the lightest edge out of at least one of them, and therefore must be in the MST.

The standard way to implement this is to sort the edges by weight at the beginning, then loop through them in sorted order, using a union-find data structure to track and merge the groups. Short of fine-tuning your union-find, not much can be done to improve this. All I did to parallelize it was to do the sort in parallel.
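As a concrete reference, here is a minimal sequential sketch of the standard sort-plus-union-find approach (the identifiers are illustrative, not taken from my actual code):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int u, v; double w; };

// Find the root of x's group, with path halving to keep trees shallow.
static int findRoot(std::vector<int>& parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

// Returns the edges selected for the MST of an n-vertex connected graph.
std::vector<Edge> kruskal(int n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);  // each vertex its own group
    std::vector<Edge> mst;
    for (const Edge& e : edges) {
        int ru = findRoot(parent, e.u), rv = findRoot(parent, e.v);
        if (ru != rv) {  // joins two different groups, so no cycle
            parent[std::max(ru, rv)] = std::min(ru, rv);
            mst.push_back(e);
            if ((int)mst.size() == n - 1) break;  // tree is complete
        }
    }
    return mst;
}
```

The parallel version only changes the `std::sort` call; the scan stays sequential.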

Advantages:
- A major piece of the runtime is sorting, which is well-optimized, has fairly good locality, and parallelizes well.
- After the sorting, each edge is only considered once.
- The code is pretty straightforward, and still runs comparably to the fastest solution I can come up with.

Drawbacks:
- The loop through the edges at the end is hard to do in parallel; in particular, we can't cluster the graph and loop through each cluster's edges, because most edges will be between clusters (see the Challenges section), and we can't loop through different parts of the array in parallel, because it's not easy to tell ahead of time which edges will be included in the MST. More generally, the problem is that each edge is processed so quickly (querying the union-find is basically constant-time) that any parallelization would have to have little to no overhead, and the nonobvious dependencies make this too hard.
- While the sort is fast, it's still doing log(E) work per edge, which isn't great since most of those edges won't end up in the final answer. Using a radix sort can improve this, but it's still not ideal.

Prim's

Prim's algorithm is the other well-known MST algorithm:
- Start with a single node as a singleton group, with the set of its out-edges being your frontier.
- On each time step:
- Add the lightest-weight edge in the frontier to the MST.
- Add the other endpoint of this edge to your group, remove any other edges in the frontier that point to it, and add its out-edges (to vertices not yet in your group) to the frontier.

This one is easier to see than Kruskal's: your seen set is one side of the cut, the rest of the graph is the other side, and your frontier is all the edges that go across. You're selecting the min-weight edge across at each step, which we know by the light edge rule must be in the MST, and you don't stop until you run out of edges or reach every vertex, so if there is an MST this algorithm will find it.

The standard way to implement this is to represent the frontier as a min-heap and remove edges lazily as they come up. However, I was able to get a speedup of around 2x by only keeping the lightest edge to each vertex in my heap, and removing invalid edges eagerly. I also got moderate improvements on top of that by using heaps optimized for cache-efficiency.
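A sketch of the one-edge-per-vertex idea: a `best` array records the lightest known edge into each vertex, so the heap holds at most one live entry per vertex and its size is O(V) rather than O(E). (My actual implementation removes invalid edges eagerly; this sketch uses the simpler skip-stale-entries-on-pop variant, and the identifiers are illustrative.)

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using AdjList = std::vector<std::vector<std::pair<int, double>>>;  // (to, weight)

// Returns the total MST weight, assuming the graph is connected.
double prim(const AdjList& g) {
    const double INF = std::numeric_limits<double>::infinity();
    int n = (int)g.size();
    std::vector<double> best(n, INF);  // lightest known edge into each vertex
    std::vector<bool> done(n, false);
    using Entry = std::pair<double, int>;  // (weight, vertex), min-heap
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    best[0] = 0;
    heap.push({0, 0});
    double total = 0;
    while (!heap.empty()) {
        auto [w, u] = heap.top();
        heap.pop();
        if (done[u]) continue;  // stale entry: a lighter edge already won
        done[u] = true;
        total += w;
        for (auto [v, wv] : g[u]) {
            if (!done[v] && wv < best[v]) {
                best[v] = wv;  // keep only the lightest edge into v
                heap.push({wv, v});
            }
        }
    }
    return total;
}
```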

Advantages:
- Prim's is the simplest MST algorithm to implement, provided that your language of choice has built-in priority queues.
- It's lazier than Kruskal's, so it does less unnecessary work.

Drawbacks:
- Heaps don't have much locality, and the edge frontier can grow to contain the majority of the edges in the graph. If it's too big to fit in cache, your access times can become unacceptably slow.
- It's still doing log(E) work per edge once the heap gets large, though that can be reduced to log(V) if you only keep one edge to each vertex as discussed above.
- This algorithm is inherently sequential. It might be possible to start from multiple vertices at once, but dealing with contention would be challenging.

Boruvka's

Boruvka's is less well-known than Prim's and Kruskal's, but it was the original
MST algorithm and is inherently parallel:
- Start out with every vertex in your graph being a singleton group.
- While there are multiple groups in your graph:
- Find the minimum-weight edge out of every group, and add some or all of them to the MST.
- For every edge you added, merge its endpoints into a single group.

The correctness of this is the easiest to see: the only edges ever added are the lightest edges out of a group, so they're all guaranteed to be in the MST, no matter which edges you choose to add in a round (or even if you choose to add all of them). It's also not too hard to show that adding these edges doesn't create a cycle, but I won't here.
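The round structure above can be sketched as a simple sequential loop over a union-find (a minimal illustration, not my parallel implementation; it assumes the graph is connected and, for simplicity, that edge weights are distinct, so no round can create a cycle):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int u, v; double w; };

static int findRoot(std::vector<int>& parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  // path halving
        x = parent[x];
    }
    return x;
}

// Returns the total MST weight of an n-vertex connected graph.
double boruvka(int n, const std::vector<Edge>& edges) {
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    double total = 0;
    int components = n;
    while (components > 1) {
        // Search phase: lightest out-edge of every group.
        std::vector<int> cheapest(n, -1);  // candidate edge index per root
        for (int i = 0; i < (int)edges.size(); ++i) {
            int ru = findRoot(parent, edges[i].u);
            int rv = findRoot(parent, edges[i].v);
            if (ru == rv) continue;  // internal edge: ignore
            if (cheapest[ru] < 0 || edges[i].w < edges[cheapest[ru]].w)
                cheapest[ru] = i;
            if (cheapest[rv] < 0 || edges[i].w < edges[cheapest[rv]].w)
                cheapest[rv] = i;
        }
        // Merge phase: add every candidate edge and merge its endpoints.
        for (int g = 0; g < n; ++g) {
            if (cheapest[g] < 0) continue;
            const Edge& e = edges[cheapest[g]];
            int ru = findRoot(parent, e.u), rv = findRoot(parent, e.v);
            if (ru == rv) continue;  // the other endpoint already merged us
            parent[std::max(ru, rv)] = std::min(ru, rv);
            total += e.w;
            --components;
        }
    }
    return total;
}
```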

Boruvka's is less precisely stated than Kruskal's or Prim's, so I had to make a couple of choices about how to structure my code; they're too involved to summarize here, so see the section on Boruvka's below.

Advantages:
- Boruvka's has the most theoretical parallelism of the three classic MST solvers.
- If implemented carefully, Boruvka's can do less work per edge than the other MST solvers.

Drawbacks:
- Boruvka's has a lot of synchronization (numerous rounds, each of which has multiple synchronization points).
- Boruvka's has the least locality of any of the solvers: due to the unstructured nature of my graphs (see the Challenges section), each processor has to be aware of most of the vertices all the time, which can strain your memory bandwidth and limit parallelism if the graph is too big.

Discussion of Boruvka's

I thought of Boruvka's algorithm as consisting of a series of passes, each of
which is made up of two phases: a search phase, which finds the lightest
out-edge of each group, and a merge phase, which adds some of those edges and
combines groups accordingly.

How to Search

For now, don't worry about how to merge groups; focus on how to identify the
lightest out-edge from each group efficiently. Two options come to mind: we
could either find the lightest edge out of each vertex, bin those by group, and
take the min within each group, or we could find the lightest out-edge for each
group independently (maybe by keeping a list of the vertices in each group, for
example). I'll call the first way vertex-based looping and the second
group-based looping.

How to Merge

The real question here isn't how to merge per se, but how to choose which edges
to add in the merge step. One might think the best option is to always add
every candidate edge; this is called tree contraction, since the edges added in
a round can chain groups together into arbitrarily deep trees. The alternative,
star contraction, uses random coin flips to merge only some groups directly
into neighboring ones, so every merged cluster is a star; it adds fewer edges
per round, and therefore needs more rounds.

Choice of Data Structures

The operations required of the core data structures are as follows:
- Get the lightest out-edge from a vertex (or group).
- Merge two groups.
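For the merge operation, a union-find is the natural fit. Here is a minimal sequential sketch showing two details the report relies on elsewhere: merges always point the larger-numbered root at the smaller-numbered one, and the structure is flattened after each round so subsequent finds are one hop. (My actual version is concurrent; this is an illustration.)

```cpp
#include <numeric>
#include <utility>
#include <vector>

struct UnionFind {
    std::vector<int> parent;

    explicit UnionFind(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }

    int find(int x) {
        while (parent[x] != x) x = parent[x];
        return x;
    }

    // Always point the larger-numbered root at the smaller-numbered one;
    // this fixed ordering is what makes a lock-free merge deadlock-free.
    bool merge(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return false;
        if (ra > rb) std::swap(ra, rb);
        parent[rb] = ra;
        return true;
    }

    // Make every element point directly at its root. Because merges always
    // point larger indices at smaller ones, parent[i] <= i, so one pass in
    // increasing index order fully flattens the structure.
    void flatten() {
        for (int i = 0; i < (int)parent.size(); ++i)
            parent[i] = parent[parent[i]];
    }
};
```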

The Versions I Wrote

The two major design choices I outlined above suggest four different
implementations; I wrote three of them: btv (tree contraction, vertex-based
looping), bsv (star contraction, vertex-based looping), and bsg (star
contraction, group-based looping).

Why No btg?

I considered writing a tree-based, group-based implementation (btg), but I was
low on time and ended up deciding against it for the following reasons:
- The main advantage of a group-based approach is that you can use a sophisticated data structure to store the out-edges of each group, and I didn't want to think of one that could be efficiently tree-contracted.
- Preliminary results had suggested that the group-based approach was going to do worse than vertex-based, and I didn't want to waste any time.

Other Relevant Code

I wrote a lot of library-style code for my solvers. Most of it is explained
above, but I wanted to highlight some pieces I didn't get a chance to talk
about:
I used no external libraries other than pthreads for this project, which made dealing with parallelism messy in places. To combat this, I made a class called Executor, which starts up a fixed-size pool of worker threads on construction, takes in encapsulated functions I called Tasks, and executes them on the worker threads. Once I realized I needed this, it dramatically increased the clarity of my code.

Since I didn't use any libraries except pthreads, I had to write my parallel sort for Kruskal's myself. I wrote both a mergesort and samplesort, and the mergesort was decidedly better for low core counts so that's what I used.
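The fork-join shape of a parallel mergesort can be sketched as follows (std::thread is used here to keep the sketch self-contained; my version ran on the thread pool instead, and the cutoff constant is illustrative):

```cpp
#include <algorithm>
#include <thread>

// Sort [begin, end) by recursively sorting the halves in parallel,
// then merging. `depth` bounds how many levels spawn a new thread.
template <typename It>
void parallelMergeSort(It begin, It end, int depth = 2) {
    auto n = end - begin;
    if (n < 2) return;
    if (depth <= 0 || n < 4096) {  // small or deep enough: go sequential
        std::sort(begin, end);
        return;
    }
    It mid = begin + n / 2;
    std::thread left([=] { parallelMergeSort(begin, mid, depth - 1); });
    parallelMergeSort(mid, end, depth - 1);  // sort right half on this thread
    left.join();
    std::inplace_merge(begin, mid, end);
}
```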

I used the standard library random number generators to determine the source set in star contraction, but generated ints instead of bools and treated each bit as a bool to speed up the generation. I also ran the random number generators concurrently with other code to take them off the critical path of the program.
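The bit-reuse trick amounts to drawing one 32-bit value per 32 coin flips and peeling off its bits (a sketch; the function name and the choice of std::mt19937 are illustrative, and the concurrency is omitted):

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Generate n random coin flips using one generator call per 32 flips.
std::vector<bool> randomCoins(std::size_t n, std::uint32_t seed) {
    std::mt19937 gen(seed);
    std::vector<bool> coins(n);
    std::uint32_t bits = 0;
    int remaining = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (remaining == 0) {  // refill: one generator call per 32 coins
            bits = gen();
            remaining = 32;
        }
        coins[i] = bits & 1u;
        bits >>= 1;
        --remaining;
    }
    return coins;
}
```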

Results and Discussion

The results of running my implementations of MST on a graph with 5 million
vertices and 50 million edges, where the edges and their weights were selected
uniformly at random, are as follows:

[timing table omitted]

Here are my observations:

- Prim's was comparable in speed to Kruskal's even though it uses a heap (heaps generally perform worse than sorting), because it does extra work to limit the number of edges kept in the heap. Across my 11 test cases, Prim's does better than Kruskal's on the graphs with higher average degree, and worse than Kruskal's on the graphs with low average degree.
- Parallel Kruskal's running on 1 core was only 10% slower than the serial version, because my parallel mergesort nearly degenerated into a serial sort in this case.
- Parallel Kruskal's achieved a speedup of 2X on 6 cores. My timing data suggests that about 1/4 of the runtime is sequential, so this makes sense (the sort itself parallelizes around 5.5X-5.75X).

Here are my observations on the Boruvka's implementations:

- Tree contraction performed at least twice as well as star contraction on one core. Each round of contraction contains a loop that examines at least one edge per vertex, so when the edge list doesn't fit in cache, every round involves a ton of slow memory accesses. Star contraction has 4-5 times more rounds than tree contraction, so even though its logic is simpler, it has to read too much memory to be competitive.
- Group-based looping was worse than vertex-based looping for star contraction on one core. I think this was because traversing the group data structures added more memory accesses, and the vertex-based approach has no contention on one core anyway.
- Vertex-based tree contraction (btv) got a speedup of just over 3X on 6
cores. I looked through my time logs, and came up with the following reasons:
- Memory allocation: about 5-10% of the runtime was spent allocating and initializing memory, which was completely sequential.
- The search phase gets around a 3X speedup, I think because of contention.
- The merge phase parallelized poorly, especially when the number of groups was small, because this drove up contention. In particular, my lock-free merge avoided deadlock by always merging the larger-numbered group into the smaller-numbered one, so all the groups at the end are likely to be near the start of the array.
- Flattening the union-find after each round parallelized around 3X too; my original thought was that the problem was too small to parallelize more (which may still be true), but I also noticed that it suffers from workload imbalance because smaller-numbered groups will traverse fewer pointers on average to find their representative element (see previous).

- Vertex-based star contraction (bsv) got a speedup of about 3X on 6 cores
as well. I was expecting it to get a higher speedup ratio than tree contraction,
since the merge phase has much less contention; here are some reasons why it
didn't get better speedup:
- There were many more rounds, increasing the amount of synchronization needed.
- A noticeable portion of the runtime was made up of phases that were too small to parallelize well.
- The merge, which should parallelize nearly perfectly, gets closer to 4X speedup. I think this is because of memory bandwidth issues.
- Updating the union-find parallelizes well, but also now requires a parallel scan, which caps its speedup at around 3X.

- Group-based star contraction (bsg) was unexpectedly bad: not only is it
the slowest on one core, but its speedup rate was only 2X on 6 cores. On
investigation, I've discovered that this was because of worst-case workload
imbalance in the find phase: it seems that near the end of star contraction, one
group will end up orders of magnitude larger than any other group in the graph.
My code was only parallelizing the find step across groups, not within groups,
so half the time in the 6 core version was spent with one thread working on a
large group and the rest idle. Here are two musings on this:
- Vertex-based looping is the extreme solution to this problem: it gets perfect intra-group parallelism at the cost of higher contention. Maybe it's possible for a hybrid of the two approaches to work better than either alone?
- Another way of phrasing this problem is that my code was too lazy; out of fear of doing too much extra work, it limited its parallelism. A more eager data structure would do more work earlier, but (hopefully) continue to parallelize well at the end.
- Both of the above ideas involve adapting my algorithms and data structures to an unfortunate property of star contraction (i.e. that one group gets much larger than any other). If I instead changed how I did the contraction itself, for example favoring smaller groups, I might get better results.