Zachary Pomper, Maxwell Johnson
Over the course of our project, we implemented and parallelized two
heuristic algorithms to solve the bin packing problem. We chose one
commonly used deterministic algorithm, Best-Fit Decreasing (BFD),
and one randomized algorithm, WalkPack. We first implemented both
algorithms sequentially in C++, then parallelized both using
Nvidia's CUDA. Neither algorithm is designed to be fast, and the
different techniques used make a comparison between the speed of
each algorithm misleading. Instead, our project focuses on comparing
the quality of the solutions generated by the two algorithms and how
amenable each is to parallelization. WalkPack was parallelized for
the GPU using a Java library in a research paper from RIT [1]. Over
the course of our project, we attempted to improve upon the speedup
they achieved by using CUDA directly.
Bin packing is a known NP-hard problem with applications in task
scheduling, logistics, and virtual machine allocation, and a host of
other areas. Briefly put, the bin packing problem is this: given a
set of objects of various sizes and an unlimited supply of
uniformly-sized bins, what is the assignment of objects to bins that
minimizes the number of bins required to hold all objects? Several
approximation algorithms, including Best-Fit Decreasing, have a
proven upper bound of 11/9 · OPT + 1 bins, where OPT is the number of
bins in an optimal packing. The same guarantee does not hold for
randomized algorithms such as WalkPack, although they consistently
average much closer to optimal than 11/9 · OPT.
Best-Fit Decreasing is simple to describe and implement, making it a
popular choice for solving the bin packing problem. Its proven upper
bound is also an advantage, especially in fields where reliability
is critical, such as in real-time scheduling. BFD sorts the objects
in decreasing order of size, then places each object in the fullest
bin into which it fits, opening a new bin only when no existing bin
has room. The pseudocode can be written as follows:
Sort objects by decreasing size
bins = []
For object O in objects
    best = none
    For bin B in bins
        If O fits in B and (best is none or B is fuller than best)
            best = B
    If best exists
        Put O in best
    Else
        Add a new bin A to the end of bins
        Put O in A
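The same procedure can be sketched in C++. This is a minimal version: objects and bin capacities are plain integers and each bin is tracked only by its occupancy, which is simpler than a full implementation needs to be.

```cpp
#include <algorithm>
#include <vector>

// Sequential Best-Fit Decreasing, following the pseudocode above.
// Returns one occupancy value per bin used.
std::vector<int> bestFitDecreasing(std::vector<int> objects, int capacity) {
    // Sort objects by decreasing size.
    std::sort(objects.begin(), objects.end(), std::greater<int>());
    std::vector<int> bins;  // bins[i] = occupied space in bin i
    for (int o : objects) {
        int best = -1;
        // Find the fullest bin that can still hold o.
        for (int i = 0; i < static_cast<int>(bins.size()); ++i) {
            if (bins[i] + o <= capacity &&
                (best == -1 || bins[i] > bins[best])) {
                best = i;
            }
        }
        if (best != -1) {
            bins[best] += o;    // place o in the tightest fit
        } else {
            bins.push_back(o);  // no bin fits: open a new one
        }
    }
    return bins;
}
```

The inner search over all bins is the O(n) step that our parallel version targets.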
BFD performs poorly on certain adversarial test vectors. These are
coined the "2/3 Case" in [1], although what the authors describe
actually results in a worst-case utilization closer to 3/4. In these
adversarial cases, BFD is supplied with objects whose sizes fall
close to 1/4 the size of the bins, with some variation. BFD will first
place all the objects larger than 1/4, which will result in many bins
filled with three objects too large to fit a fourth. Figure 1 shows
an example of such a suboptimal packing, using four bins of size 10.
Figure 1: BFD's solution to the "2/3 Case". The number in each object
indicates its size. By comparison, an optimal packing for these
objects uses only three bins, as shown in figure 2.
Figure 2: Optimal solution to the "2/3 Case".
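One hypothetical instance with this shape (the exact sizes in the figures may differ) uses six objects of size 3 and six of size 2 with bins of size 10. BFD fills two bins with {3, 3, 3} and needs two more for the 2s, a utilization of 30/40 = 3/4, while three bins of {3, 3, 2, 2} hold everything exactly:

```cpp
#include <algorithm>
#include <vector>

// Count the bins Best-Fit Decreasing uses on a given instance.
int bfdBinCount(std::vector<int> objects, int capacity) {
    std::sort(objects.begin(), objects.end(), std::greater<int>());
    std::vector<int> bins;  // occupancy per bin
    for (int o : objects) {
        int best = -1;
        for (int i = 0; i < static_cast<int>(bins.size()); ++i)
            if (bins[i] + o <= capacity && (best == -1 || bins[i] > bins[best]))
                best = i;
        if (best != -1) bins[best] += o;
        else bins.push_back(o);
    }
    return static_cast<int>(bins.size());
}

// Adversarial instance: sizes just above and below 1/4 of the bin size.
// bfdBinCount({3,3,3,3,3,3,2,2,2,2,2,2}, 10) uses 4 bins; the optimum is 3.
```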
The randomized algorithm we worked with is WalkPack. It is described
in [1] with the goal of addressing BFD's weakness to adversarial
inputs. The algorithm employs randomization with repeated trials to
improve on the quality of BFD in these cases. A notable first step
in WalkPack is to order the bins using a seed algorithm. In [1] as
well as in our own work, Next-fit is used. Next-fit produces
solutions of reasonable quality in linear time. WalkPack reduces the
bin count slowly, so starting with a decent packing speeds the
process considerably. The basic sequential pseudocode for WalkPack
is as follows. See the accompanying visuals below for a graphical
example of one iteration.
Assign objects to bins using seed algorithm (fig. 3)
Repeat for number of iterations
    Pick a random bin S (fig. 4)
    For each object O in S (fig. 5)
        Pick a random bin D != S
        Move O to D
    Destroy the now-empty S
    For each bin A with occupancy > bin capacity (fig. 6)
        Create a new bin B
        While A.occupancy > bin capacity
            Pick a random object P in A
            Move P to B
Figure 3: The objects after running Next-fit to seed the bins. The
numbers in each object correspond to their index in the input.
Figure 4: The selected bin S.
Figure 5: Bin S is emptied and deleted, its objects distributed randomly.
Figure 6: Bin A, previously overfull, is constrained by moving a
random object into a new bin B.
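One iteration of the loop above can be sketched in C++. This version picks S and D uniformly at random; the weighted selection heuristics that the algorithm needs for reasonable quality are described next and are omitted here for clarity.

```cpp
#include <random>
#include <vector>

using Bin = std::vector<int>;  // object sizes in one bin

// One unweighted WalkPack iteration over a packing, following the
// pseudocode above.
void walkPackIteration(std::vector<Bin>& bins, int capacity,
                       std::mt19937& rng) {
    if (bins.size() < 2) return;
    auto occupancy = [](const Bin& b) {
        int sum = 0;
        for (int o : b) sum += o;
        return sum;
    };
    // Pick a random bin S, scatter its objects over the remaining bins,
    // and destroy the now-empty S.
    std::uniform_int_distribution<size_t> pickS(0, bins.size() - 1);
    size_t s = pickS(rng);
    Bin source = std::move(bins[s]);
    bins.erase(bins.begin() + s);
    std::uniform_int_distribution<size_t> pickD(0, bins.size() - 1);
    for (int o : source) bins[pickD(rng)].push_back(o);
    // Constrain every bin that is now overfull by moving random objects
    // into a fresh bin.
    size_t n = bins.size();
    for (size_t a = 0; a < n; ++a) {
        if (occupancy(bins[a]) <= capacity) continue;
        Bin fresh;
        while (occupancy(bins[a]) > capacity) {
            std::uniform_int_distribution<size_t> pickO(0, bins[a].size() - 1);
            size_t p = pickO(rng);
            fresh.push_back(bins[a][p]);
            bins[a].erase(bins[a].begin() + p);
        }
        bins.push_back(fresh);
    }
}
```

Note that an iteration never loses objects; it only redistributes them, so the total size packed is invariant.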
To achieve reasonable quality, heuristics must be added to the
selection of bins S and D. When selecting S, the bin to be
destroyed, we must favor bins with more empty space to avoid
destroying well-filled bins (see figure 7). Likewise, D should be
chosen in a way that favors bins with full space so as to create
better-filled bins (see figure 8). Care must be taken when choosing
D not to pick a bin that cannot fit O, unless O does not fit in any
bins. This prevents needlessly creating overfull bins, which create
additional bins when constrained. These heuristics improve quality
significantly and make WalkPack a much more interesting problem to
parallelize.
Figure 7: Heuristic: The choice for a bin S to delete is weighted
by empty space.
Figure 8: Heuristic: The choice for each bin D to move an object to
is weighted by full space.
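The weighting for S can be sketched with the standard library's discrete distribution, which draws an index with probability proportional to its weight; flipping the weights from empty space to full space gives the destination heuristic for D.

```cpp
#include <random>
#include <vector>

// Weighted selection of the bin S to destroy: bins with more empty space
// are proportionally more likely to be chosen, and a completely full bin
// (weight 0) is never destroyed.
size_t pickBinToDestroy(const std::vector<int>& occupancy, int capacity,
                        std::mt19937& rng) {
    std::vector<double> weights;
    for (int occ : occupancy)
        weights.push_back(static_cast<double>(capacity - occ));  // empty space
    // discrete_distribution draws index i with probability weights[i] / sum.
    std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
    return dist(rng);
}
```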
A final improvement to WalkPack that the authors of the RIT paper
included is to run the algorithm repeatedly, taking the solution
that uses the fewest bins once all trials have completed. This
improves the quality and reliability of WalkPack by making it less
likely to be caught in a low-quality local minimum.
We started by implementing the framework described in our reference
paper in single-threaded C++. Once we had this working with acceptable
quality, we moved on to a CUDA implementation of this highly dynamic
code. This is where we spent much of our time, dealing with the
difficulties of writing this kind of code in CUDA (for the GTX 1080)
as we went. For WalkPack, after writing the sequential code, we knew
that there wouldn't be much room for parallelization directly, due to
the intense data dependencies involved, so we decided to do multiple
searches in parallel over different thread blocks and do a reduction
to get the best result back. We found only minor improvements in
execution time from increasing the thread count within each block.
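The block-level structure of our parallel WalkPack can be sketched on the CPU: launch several independent searches with different seeds, then reduce to the packing with the fewest bins. The `searchOnce` function here is a hypothetical stand-in for a full WalkPack run; only the launch-and-reduce pattern mirrors the CUDA version.

```cpp
#include <algorithm>
#include <future>
#include <random>
#include <vector>

// Placeholder for one full randomized search. A real implementation would
// run WalkPack to completion and return the bin count of its best packing.
int searchOnce(unsigned seed) {
    std::mt19937 rng(seed);
    return 40 + static_cast<int>(rng() % 10);  // pretend bin counts 40..49
}

// Run numSearches independent searches in parallel (one per thread block
// in the CUDA version), then reduce to the fewest bins found.
int bestOfParallelSearches(int numSearches) {
    std::vector<std::future<int>> runs;
    for (int i = 0; i < numSearches; ++i)
        runs.push_back(std::async(std::launch::async, searchOnce, i));
    int best = runs[0].get();
    for (size_t i = 1; i < runs.size(); ++i)
        best = std::min(best, runs[i].get());  // reduction step
    return best;
}
```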
BFD benefits from parallelizing the search for a destination bin for
each object. To perform this search in parallel, we have two
phases. In the first, we do a simple tabulation: each index is
mapped to the space that would remain in the bin at that index after
placing the target object, or to INT_MAX if the object does not
fit. We then perform a parallel min-reduction on the resulting array
to find the bin that the target object fits most tightly. We achieve
this parallelism by mapping the problem onto a single block with many
threads (see "Results" for further details).
[1] Bauer, Andrew; Kaminsky, Alan. GPU Parallel Program for the Bin Packing Problem, May 10, 2017. https://www.cs.rit.edu/~ark/students/amb4757/report.pdf
At the poster session, we will demonstrate speedup graphs and
quality statistics for our implementations. These can be compared
with the similar graphs from [1] to establish the quality of our
work. We have created a visualizer that gives some intuition about
the progress of the algorithm and the ultimate form of the final
packing.
- Write project proposal. Create framework and test input generator.
- Write sequential implementation of Best-Fit Decreasing. Start sequential implementation of WalkPack. Create visualizer and output analyzer.
- Refine sequential implementation of WalkPack. Start parallel implementation of WalkPack. Submit checkpoint.
- Improve sequential implementation of WalkPack to match paper's quality.
- Improve parallel implementation.
- Finish parallel implementation of WalkPack.
- Start parallel implementation of Best-Fit Decreasing.
- Finish parallel implementation of Best-Fit Decreasing.
- Optimize parallel implementation of Best-Fit Decreasing.
- Write final report, finishing touches, polish visualizer, submit final report.