CMU 15-418/618 (Spring 2013) Final Project Report
Parallel Minimum Spanning Tree
I wrote and tested a number of sequential and parallel algorithms solving
minimum spanning tree, and compared their performance on a collection of 11
random graphs. Preliminary results include not only parallel algorithms that
achieve a 3x speedup on 6 cores, which is good for graphs with no inherent
locality, but also optimized sequential algorithms with speedups in the realm of
30% over the common implementations.
Background
A spanning tree of a connected undirected graph G is a subset of n-1 edges of G
that, together with the vertices of G, form a tree. The Minimum Spanning Tree
problem is: given a connected undirected graph G with edge weights, find a
spanning tree of G whose total edge weight is as small as possible.
Minimum Spanning Tree (hereafter called MST) has a host of applications, from
travelling-salesman approximations to circuit design to utility networks. With
this in mind, it's useful to develop MST algorithms that scale well to the
amount of concurrency available on modern-day machines.
Essentially all implementations of MST work because of a theorem known as the
Light Edge Theorem: partition the vertices of any graph with a minimum
spanning tree into two nonempty subsets, and the edge of minimum weight
among those that go between the two subsets must be in the minimum
spanning tree. This theorem implies a few easy sequential algorithms, but it
also hints at parallel ones. In particular, it suggests that edges from
different parts of the graph could be selected for inclusion in the MST in
parallel.
Challenges
Here are things that I either knew ahead of time or figured out as I went that
make MST a hard problem to solve well, both sequentially and in parallel:
- The first and most fundamental problem is that random graphs (the graphs I
was working with) don't have any inherent structure. This manifests itself in
a couple of ways:
- Even a small set of vertices has a large set of neighbors, so the graphs
have very little locality, and MST algorithms that run on them can easily become
bandwidth-bound.
- If you divided the graph into k pieces (say, one for each processor),
(k-1)/k of the edges would go between two pieces. This makes it very hard to
statically assign vertices to processors, as every edge in the MST that crosses
between pieces (and most of them will) is likely a synchronization point.
- The light edge rule isn't a local property, and applying it repeatedly can
result in a lot of duplicate work. Data structures that avoid this duplicate
work add complexity, and are liable to hinder parallelism.
- There are many more edges in the graph than end up in the MST, so the
runtime of an MST algorithm is often determined by how much work it does on
these "bad" edges, rather than the "good" ones it ultimately selects.
Discussion of Known Algorithms
There are three main MST algorithms: Kruskal's, Prim's, and Boruvka's. A
detailed description of each is below:
Kruskal's algorithm is the simplest of the three, and the easiest to implement:
- For each edge in the graph in increasing order of weight: if adding that
edge doesn't create a cycle, then add it.
Think of the graph as initially having no edges, where each vertex is really a
"group" of vertices of size 1: every edge Kruskal's algorithm adds must join
two different groups (or else it would create a cycle). Side note: the
min-weight edge out of a group must be in the MST by the light edge rule (split
the graph into that group and the rest; it's the lightest edge between them).
What this means is that, since the algorithm examines edges in increasing order,
the edge it adds to merge two groups must be the lightest edge out of either
group. Because it's the lightest edge out of at least one of the groups, it must
be in the MST.
The standard way to implement this is to sort the edges by weight at the
beginning, then loop through them in sorted order, using a union-find data
structure to track and merge the groups. Short of fine-tuning your union-find,
not much can be done to improve this. All I did to parallelize it was to do the
sort in parallel.
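As a concrete reference, here is a minimal sequential sketch of the standard implementation described above (sort, then union-find); the type and function names are illustrative, not the report's actual code:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Illustrative edge type.
struct Edge { int u, v, w; };

// Union-find with path halving and union by size.
struct UnionFind {
    std::vector<int> parent, size;
    explicit UnionFind(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    bool unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;               // would create a cycle
        if (size[a] < size[b]) std::swap(a, b);
        parent[b] = a; size[a] += size[b];
        return true;
    }
};

// Kruskal's: sort edges by weight, then greedily add any edge that joins
// two different groups. Returns the total weight of the MST.
long long kruskal(int n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    UnionFind uf(n);
    long long total = 0;
    for (const Edge& e : edges)
        if (uf.unite(e.u, e.v)) total += e.w;   // e is the lightest edge out of a group
    return total;
}
```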
Pros:
- A major piece of the runtime is sorting, which is well-optimized, has
fairly good locality, and parallelizes well.
- After the sorting, each edge is only considered once.
- The code is pretty straightforward, and still runs comparably to the
fastest solution I can come up with.
Cons:
- The loop through the edges at the end is hard to do in parallel; in
particular, we can't cluster the graph and loop through each cluster's edges
because most edges will be between clusters (see the Challenges section),
and we can't loop through different parts of the array in parallel because it's
not easy to tell ahead of time what edges will be included in the MST. To be
more general, the problem is that each edge is processed so quickly (querying
the union-find is basically constant-time) that any parallelization would have
to have little to no overhead, and the nonobvious dependencies make this too
difficult.
- While the sort is fast, it's still doing log(E) work per edge, which isn't
great since most of those edges won't end up in the final answer. Using a radix
sort can improve this, but it's still not ideal.
Prim's algorithm is the other well-known MST algorithm.
- Start with a single node as a singleton group, with the set of its
out-edges being your frontier.
- On each time step:
- Add the lightest-weight edge in the frontier to the MST.
- Add the endpoint of this edge to your group, remove any other edges in
the frontier which point to it, and add its neighbors to the frontier.
This one is easier to see than Kruskal's: your seen set is one side of the cut,
the rest of the graph is the other side, and your frontier is all the edges that
go across. You're selecting the min-weight edge across at each step, which we
know by the light edge rule must be in the MST, and you don't stop until you run
out of edges or reach every vertex, so if there is an MST this algorithm will
find it.
The standard way to implement this is to represent the frontier as a min-heap
and remove edges lazily as they come up. However, I was able to get a speedup of
around 2x by only keeping the lightest edge to each vertex in my heap, and
removing invalid edges eagerly. I also got moderate improvements on top of that
by using heaps optimized for cache-efficiency.
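A sketch of Prim's with the one-edge-per-vertex idea described above. Since `std::priority_queue` has no decrease-key, this version keeps only the best known weight per vertex and skips stale heap entries when they are popped, which approximates (rather than reproduces) the eager removal in the report:

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Adjacency list: adj[u] = list of (neighbor, weight).
using Adj = std::vector<std::vector<std::pair<int,int>>>;

// Prim's with at most one live heap entry per vertex: best[v] is the lightest
// known weight connecting v to the grown tree; entries that no longer match
// best[v] are stale and skipped. Assumes a connected graph.
long long prim(const Adj& adj) {
    int n = (int)adj.size();
    const int INF = 1 << 30;
    std::vector<int> best(n, INF);
    std::vector<bool> done(n, false);
    using Item = std::pair<int,int>;  // (weight, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    best[0] = 0;
    heap.push({0, 0});
    long long total = 0;
    while (!heap.empty()) {
        auto [w, u] = heap.top(); heap.pop();
        if (done[u] || w != best[u]) continue;   // stale entry: skip
        done[u] = true;
        total += w;                              // light edge into the tree
        for (auto [v, wv] : adj[u])
            if (!done[v] && wv < best[v]) {      // keep only the lightest edge to v
                best[v] = wv;
                heap.push({wv, v});
            }
    }
    return total;
}
```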
Pros:
- Prim's is the simplest MST algorithm, provided that your language of
choice has builtin priority queues.
- It's lazier than Kruskal's, so it does less unnecessary work.
Cons:
- Heaps don't have much locality, and the edge frontier can grow to contain
the majority of the edges in the graph. If it's too big to fit in cache, your
access times can become unacceptably slow.
- It's still doing log(E) work per edge once the heap gets large, though
that can be reduced to log(V) if you only keep one edge to each vertex, as
described above.
- This algorithm is inherently sequential. It might be possible to start
from multiple vertices at once, but dealing with contention would be difficult.
Boruvka's is less well-known than Prim's and Kruskal's, but it was the original
MST algorithm and is inherently parallel.
- Start out with every vertex in your graph being a singleton group.
- While there are multiple groups in your graph:
- Find the minimum-weight edge out of every group, and add some or all of
them to the MST.
- For every edge you added, merge its endpoints into a single group.
The correctness of this is the easiest to see: the only edges ever added are the
lightest edges out of a group, so they're all guaranteed to be in the MST, no
matter which edges you choose to add in a round (or even if you choose to
add all of them). It's also not too hard to show that adding these edges doesn't
create a cycle, but I won't here.
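A sequential sketch of one way to structure these rounds (adding every candidate edge each pass, and assuming a connected graph); the names here are illustrative:

```cpp
#include <numeric>
#include <vector>

// Illustrative edge type.
struct Edge { int u, v, w; };

// Boruvka's: each pass finds the lightest out-edge of every group (groups are
// tracked with a simple parent array) and adds all of them. Returns the total
// weight of the MST; assumes the graph is connected.
long long boruvka(int n, const std::vector<Edge>& edges) {
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    auto find = [&](int x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]];  // path halving
        return x;
    };
    long long total = 0;
    int groups = n;
    while (groups > 1) {
        // Search phase: lightest out-edge per group root.
        std::vector<int> cheapest(n, -1);
        for (int i = 0; i < (int)edges.size(); ++i) {
            int a = find(edges[i].u), b = find(edges[i].v);
            if (a == b) continue;  // both endpoints already in the same group
            if (cheapest[a] < 0 || edges[i].w < edges[cheapest[a]].w) cheapest[a] = i;
            if (cheapest[b] < 0 || edges[i].w < edges[cheapest[b]].w) cheapest[b] = i;
        }
        // Merge phase: add every candidate edge that still joins two groups.
        for (int r = 0; r < n; ++r) {
            if (cheapest[r] < 0) continue;
            const Edge& e = edges[cheapest[r]];
            int a = find(e.u), b = find(e.v);
            if (a == b) continue;  // endpoints were merged earlier this round
            parent[b] = a;
            total += e.w;
            --groups;
        }
    }
    return total;
}
```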
Boruvka's is less precisely stated than Kruskal's or Prim's, so I had to make a
couple of choices about how to structure my code that can't be summarized here.
See the section on Boruvka's below.
Pros:
- Boruvka's has the most theoretical parallelism of the three classic MST
algorithms.
- If implemented carefully, Boruvka's can do less work per edge than the
other MST solvers.
Cons:
- Boruvka's has a lot of synchronization (numerous rounds, each of which has
multiple synchronization points).
- Boruvka's has the least locality of any of the solvers: due to the
unstructured nature of my graphs (see the Challenges section), each processor has
to be aware of most of the vertices all the time, which can strain your memory
bandwidth and limit parallelism if the graph is too big.
Discussion of Boruvka's
I thought of Boruvka's algorithm as consisting of a series of passes, each of
which is made up of two phases: the search phase and the merge phase. During
the search phase some number of candidate edges are identified,
each of which is the lightest out-edge from a group of vertices. During the
merge phase, some (or all) of those edges are added to the MST, and the groups
that were newly connected by an edge are merged. Almost all the complexity in
Boruvka's comes from trying to implement these efficiently.
How to Search
For now, don't worry about how to merge groups; focus on how to identify the
lightest out-edge from each group efficiently. Two options come to mind: we
could either find the lightest edge out of each vertex, bin those by group, and
find the min for each group, or we could find the lightest out-edge for each
group independently (maybe by keeping a list of the vertices in each group, for
example). I'll call the first way vertex-based and the second group-based. The
vertex-based search has the advantage of being simple to
code and cache-efficient (since you're looping over the vertices in order),
without any complex data structures, but doing binning in parallel is a
challenging problem in its own right. The group-based approach on the other hand
has much less contention and allows for more sophisticated data structures like
a heap of edges for each group, but it often does so at the expense of locality,
which is precious for I/O-bound applications.
How to Merge
The real question here isn't how to merge per se, but how to choose which edges
to add in the merge step. One might think the best option is to always add every
candidate edge; this is called tree contraction
because the edges you add
form trees. Tree contraction is guaranteed to reduce the number of groups by at
least a factor of two each round, and does far better than that in practice, so
this method keeps the number of rounds small, while also not wasting any work.
On the other hand, merging the groups in a single tree into one big group isn't
a trivial task, so tree contraction doesn't scale well to large numbers of
cores. One common alternative is called star contraction: choose a
random subset of your groups to be sources, and call the rest sinks. At each
merge phase, only add an edge if it's from a source to a sink. This is much easier to
do in parallel, since a group that's merging into another group can't itself
be the target of a merge, but it also causes many more rounds to occur (it only
reduces the number of groups by a quarter on average), leading to more
synchronization overhead.
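A sketch of the source/sink filter in one star-contraction merge phase, assuming one candidate edge per group and an array mapping each group to its (possibly new) group id; all names here are hypothetical:

```cpp
#include <vector>

// Illustrative candidate edge: the lightest out-edge of group `from`,
// pointing at group `to`.
struct Candidate { int from, to, weight; };

// Contract only source -> sink candidates. A sink never merges away, so no
// merge chains form and every contraction is independent of the others.
// Returns the number of merges performed.
int star_merge(const std::vector<Candidate>& cands,
               const std::vector<bool>& is_source,
               std::vector<int>& group_of) {
    int merged = 0;
    for (const Candidate& c : cands)
        if (is_source[c.from] && !is_source[c.to]) {
            group_of[c.from] = c.to;  // the source merges into its sink target
            ++merged;
        }
    return merged;
}
```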
Choice of Data Structures
The operations required of the core data structures are as follows:
- Get the lightest out-edge from a vertex (or group).
- Merge two groups.
It's important to remember that most of the edges examined by your algorithm
will be bad edges, and that your data structure should do as little work on
those edges as possible. Also, note that merging needs to be efficient even if
one or both groups are very large, and doable concurrently with other merges,
ideally to the same groups. This is a very tough set of restrictions, but my
solution was simple: sort the out-edges of each vertex by weight, and track the
next edge from each vertex to examine. This is space-efficient, examines bad
edges only once, and requires no modifications on a merge. The downside is that
as group sizes get large, you have to look at a lot of values to determine the
min out-edge, but it turns out to be worth it for the scale I'm working at.
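A sketch of this structure, under the assumption that group membership is supplied by the caller (e.g. a union-find lookup). Skipping a bad edge permanently is safe because groups only grow: an edge that is internal to a group stays internal forever.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Each vertex's out-edges are sorted by weight once up front, and a cursor
// ("next") records how far we've scanned. A bad edge (other endpoint already
// in the same group) is skipped by advancing the cursor, so it is examined
// only once; merges require no changes to this structure at all.
struct SortedEdges {
    std::vector<std::vector<std::pair<int,int>>> adj;  // adj[v]: (weight, neighbor)
    std::vector<size_t> next;                          // cursor into adj[v]

    explicit SortedEdges(std::vector<std::vector<std::pair<int,int>>> a)
        : adj(std::move(a)), next(adj.size(), 0) {
        for (auto& es : adj) std::sort(es.begin(), es.end());
    }

    // Lightest remaining out-edge from v whose endpoint lies in a different
    // group, per the caller-supplied lookup. Returns (weight, neighbor),
    // or (-1, -1) if v has no out-edges left.
    template <class GroupOf>
    std::pair<int,int> lightest(int v, GroupOf group) {
        while (next[v] < adj[v].size() &&
               group(adj[v][next[v]].second) == group(v))
            ++next[v];  // bad edge: skip it, and never look at it again
        if (next[v] == adj[v].size()) return {-1, -1};
        return adj[v][next[v]];
    }
};
```

Finding a group's minimum still means scanning every member vertex, which is exactly the "look at a lot of values" cost noted above.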
The Versions I Wrote
The two major design choices I outlined above suggest four different
implementations; I wrote three of them:
Tree contraction with vertex-based search: hereafter called btv, this had
multiple threads
write to a shared map of candidate edges, then read that candidate map and do
concurrent updates on a shared lock-free union-find data structure. This data
structure wrote out all its group information to a shared array after every
round, so that threads could more quickly look up the group of vertices during
the search phase.
Star contraction with vertex-based search: hereafter called bsv, this also had
multiple
threads writing to a single shared candidate map, but replaced the fancy
union-find with a simple array that only supported batch updates. Only
supporting batch updates also allowed me to maintain a list of existing groups;
it would get updated at the same time as the other array. Having a list of
groups meant I didn't have to loop through as much of the candidate map, which
was very important since star contraction does so many iterations and for most
of the time the candidate map is sparse.
Star contraction with group-based search: hereafter called bsg, this ditched
the shared
candidate map search phase in favor of giving each thread a chunk of groups and
having them look through those groups' vertices directly. I stored the vertices
for each group in a space-optimized circular linked-list, but didn't move the
edges at all. This way I get very fast (concurrent) merges, and don't have to
keep copying the edges around.
Why No btg?
I considered writing a tree-based group-based implementation, but I was low on
time and ended up deciding against it for the following reasons:
- The main advantage of a group-based approach is that you can use a
sophisticated data structure to store the out-edges of each group, and I didn't
want to spend time devising one that could be efficiently tree-contracted.
- Preliminary results had suggested that the group-based approach was going
to do worse than vertex-based, and I didn't want to waste any time.
Other Relevant Code
I wrote a lot of library-style code for my solvers. Most of it is explained
above, but I wanted to highlight some pieces I didn't get a chance to talk
about.
Executor:
I used no external libraries other than pthreads for this project, which made
dealing with parallelism messy in places. To combat this, I made a class called
Executor which starts up a fixed-size pool of worker threads on construction,
takes in encapsulated functions I called Tasks, and executes them on the worker
threads. Once I realized I needed to do this, it dramatically increased the
clarity of my code.
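A sketch of what such an Executor might look like using only pthreads; the interface here (`Task`, `submit`, and the drain-on-destruction behavior) is hypothetical, not the report's actual class:

```cpp
#include <pthread.h>
#include <deque>
#include <vector>

// Hypothetical encapsulated function.
struct Task { virtual void run() = 0; virtual ~Task() {} };

// Fixed-size pool of pthreads pulling Tasks from a shared queue. The
// destructor drains any queued tasks, then joins the workers.
class Executor {
public:
    explicit Executor(int nthreads) : stop_(false) {
        pthread_mutex_init(&mu_, nullptr);
        pthread_cond_init(&cv_, nullptr);
        threads_.resize(nthreads);
        for (auto& t : threads_) pthread_create(&t, nullptr, &Executor::worker, this);
    }
    ~Executor() {
        pthread_mutex_lock(&mu_);
        stop_ = true;
        pthread_cond_broadcast(&cv_);
        pthread_mutex_unlock(&mu_);
        for (auto& t : threads_) pthread_join(t, nullptr);
        pthread_cond_destroy(&cv_);
        pthread_mutex_destroy(&mu_);
    }
    void submit(Task* task) {
        pthread_mutex_lock(&mu_);
        queue_.push_back(task);
        pthread_cond_signal(&cv_);
        pthread_mutex_unlock(&mu_);
    }
private:
    static void* worker(void* arg) {
        Executor* ex = static_cast<Executor*>(arg);
        for (;;) {
            pthread_mutex_lock(&ex->mu_);
            while (ex->queue_.empty() && !ex->stop_)
                pthread_cond_wait(&ex->cv_, &ex->mu_);
            if (ex->queue_.empty()) {            // stop_ set and nothing left
                pthread_mutex_unlock(&ex->mu_);
                return nullptr;
            }
            Task* t = ex->queue_.front();
            ex->queue_.pop_front();
            pthread_mutex_unlock(&ex->mu_);
            t->run();                            // execute outside the lock
        }
    }
    pthread_mutex_t mu_;
    pthread_cond_t cv_;
    std::deque<Task*> queue_;
    std::vector<pthread_t> threads_;
    bool stop_;
};

// Tiny usage demo: run n independent tasks, each filling its own slot
// (distinct slots, so no synchronization is needed between tasks).
struct FillSlot : Task {
    int* slot = nullptr;
    void run() override { *slot = 1; }
};

int run_demo(int n) {
    std::vector<int> slots(n, 0);
    std::vector<FillSlot> tasks(n);
    {
        Executor ex(4);
        for (int i = 0; i < n; ++i) { tasks[i].slot = &slots[i]; ex.submit(&tasks[i]); }
    }   // destructor drains the queue, then joins the workers
    int done = 0;
    for (int s : slots) done += s;
    return done;
}
```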
Parallel Sorting:
Since I didn't use any libraries except pthreads, I had to write my parallel
sort for Kruskal's myself. I wrote both a mergesort and samplesort, and the
mergesort was decidedly better for low core counts so that's what I used.
Random Number Generation:
I used the standard library random number generators to determine the source set
in star contraction, but generated ints instead of bools and treated each bit as
a bool to speed up the generation. I also ran the random number generators
concurrently with other code to take them off the critical path of the program.
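The bit-peeling trick might look like this sketch (the generator choice and names are illustrative):

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Draw 32-bit integers and peel off one bit per group, instead of calling the
// generator once per coin flip. Returns the source/sink assignment for one
// star-contraction round.
std::vector<bool> random_sources(int num_groups, std::mt19937& gen) {
    std::vector<bool> src(num_groups);
    uint32_t bits = 0;
    int left = 0;                  // unused bits remaining in `bits`
    for (int g = 0; g < num_groups; ++g) {
        if (left == 0) { bits = gen(); left = 32; }
        src[g] = bits & 1u;        // one coin flip per bit
        bits >>= 1;
        --left;
    }
    return src;
}
```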
Results and Discussion
The results of running my implementations of MST on a graph with 5 million
vertices and 50 million edges, where the edges and their weights were selected
uniformly at random, are as follows:
Here are my observations:
- Prim's was comparable in speed to Kruskal's even though it uses a heap
(heaps generally perform worse than sorting), because it does extra work to
limit the number of edges kept in the heap. Across my 11 test cases, Prim's does
better than Kruskal's on the graphs with higher average degree, and worse than
Kruskal's on the graphs with low average degree.
- Parallel Kruskal's running on 1 core was only 10% slower than the serial
version, because my parallel mergesort nearly degenerated into a serial sort in
that case.
- Parallel Kruskal's achieved a speedup of 2X on 6 cores. My timing data
suggests that about 1/4 of the runtime is sequential, so this makes sense (the
sort parallelizes around 5.5X-5.75X).
Here are my observations:
- Tree contraction performed at least twice as well as star contraction on
one core. This is because each round of contraction contains a loop which
examines at least one edge for each vertex, so when the edge list doesn't fit in
cache it involves a ton of slow memory accesses. Star contraction has 4-5 times
more rounds than tree contraction, so even though its logic is simpler it has to
read too much memory to be competitive.
- Group-based looping was worse than vertex-based looping for star
contraction on one core. I think this was because traversing the group data
structures added more memory accesses, and the vertex-based approach has no
contention on one core anyway.
- Vertex-based tree contraction (btv) got a speedup of just over 3X on 6
cores. I looked through my time logs, and came up with the following reasons:
- Memory allocation: about 5-10% of the runtime was spent allocating and
initializing memory, which was completely sequential.
- The search phase gets around a 3X speedup, I think because of memory
bandwidth limitations.
- The merge phase parallelized poorly, especially when the number of
groups was small, because this drove up contention. In particular, my lock-free
merge avoided deadlock by always merging the larger-numbered group into the
smaller-numbered one, so all the groups at the end are likely to be near the
start of the array.
- Flattening the union-find after each round parallelized around 3X too;
my original thought was that the problem was too small to parallelize more
(which may still be true), but I also noticed that it suffers from workload
imbalance because smaller-numbered groups will traverse fewer pointers on
average to find their representative element (see previous).
- Vertex-based star contraction (bsv) got a speedup of about 3X on 6 cores
as well. I was expecting it to get a higher speedup ratio than tree contraction,
since the merge phase has much less contention; here are some reasons why it
didn't get better speedup:
- There were many more rounds, increasing the amount of synchronization
overhead.
- A noticeable portion of the runtime was made up of phases that were too
small to parallelize well.
- The merge, which should parallelize nearly perfectly, gets closer to 4X
speedup. I think this is because of memory bandwidth issues.
- Updating the union-find parallelizes well, but also now requires a
parallel scan, which caps its speedup at around 3X.
- Group-based star contraction (bsg) was unexpectedly bad: not only is it
the slowest on one core, but its speedup rate was only 2X on 6 cores. On
investigation, I've discovered that this was because of worst-case workload
imbalance in the find phase: it seems that near the end of star contraction, one
group will end up orders of magnitude larger than any other group in the graph.
My code was only parallelizing the find step across groups, not within groups,
so half the time in the 6 core version was spent with one thread working on a
large group and the rest idle. Here are two musings on this:
- Vertex-based looping is the ultimate solution to this problem; it
gains perfect intra-group parallelism at the cost of higher contention. Maybe it's
possible for a hybrid of the two approaches to work better than either alone?
- Another way of phrasing this problem is that my code was too lazy;
out of fear of doing too much extra work, it limited its parallelism. A more
eager data structure would do more work earlier, but (hopefully) continue to
parallelize well at the end.
- Both of the above ideas involve adapting my algorithms and data
structures to an unfortunate property of star contraction (i.e. that one group
gets much larger than any other). If I instead changed how I did the contraction
itself, for example favoring smaller groups, I might get better results.