
An Exploration of the Performance of Parallel Bin Packing AlgorithmsZachary Pomper, Maxwell Johnson Over the course of our project, we implemented and parallelized two heuristic algorithms to solve the bin packing problem. We chose one commonly used deterministic algorithm, BestFit Decreasing (BFD), and one randomized algorithm, WalkPack. We first implemented both algorithms sequentially in C++, then parallelized both using Nvidia's CUDA. Neither algorithm is designed to be fast, and the different techniques used make a comparison between the speed of each algorithm misleading. Instead, our project focuses on comparing the quality of the solutions generated by the two algorithms and how amenable each is to parallelization. WalkPack was parallelized for the GPU using a Java library in a research paper from RIT [1]. Over the course of our project, we attempt to improve upon the speedup they achieved by using CUDA directly. BackgroundBin packing is a known NPhard problem with applications in task scheduling, logistics, and virtual machine allocation, and a host of other areas. Briefly put, the bin packing problem is this: given a set of objects of various sizes and an unlimited supply of uniformlysized bins, what is the assignment of objects to bins that minimizes the number of bins required to hold all objects? Several approximation algorithms, including BestFit Decreasing, have a proven upper bound of 1 + 11/9 · optimal bins. The same guarantee does not hold for randomized algorithms such as WalkPack, although they consistently have averages much closer to optimal than 11/9 · optimal. BestFit Decreasing is simple to describe and implement, making it a popular choice for solving the bin packing problem. Its proven upper bound is also an advantage, especially in fields where reliability is critical, such as in realtime scheduling. BFD sorts the objects in decreasing size, then places each object in the first bin into which it fits. The pseudocode can be written as follows:
BFD performs poorly on certain adversarial test vectors. These are coined the "2/3 Case" in [1], although what the authors describe actually results in a worstcase utilization closer to 3/4. In these adversarial cases, BFD is supplied with objects whose sizes fall close to 1/4 the size of the bins, with some variation. BFD will first place all the objects larger than 1/4, which will result in many bins filled with three objects too large to fit a fourth. Figure 1 shows an example of such a suboptimal packing, using four bins of size 10. Figure 1: BFD's solution to the "2/3 Case". The number in each object indicate its size. By comparison, an optimal packing for these objects uses only three bins, as shown in figure 2. Figure 2: Optimal solution to the "2/3 Case". The randomized algorithm we worked with is WalkPack. It is described in [1] with the goal of solving BFD's weakness to adversarial inputs. The algorithm employs randomization with repeated trials to improve on the quality of BFD in these cases. A notable first step in WalkPack is to order the bins using a seed algorithm. In [1] as well as in our own work, Nextfit is used. Nextfit produces solutions of reasonable quality in linear time. WalkPack reduces the bin count slowly, so starting with a decent packing speeds the process considerably. The basic sequential pseudocode for WalkPack is as follows. See the accompanying visuals below for a graphical example of one iteration.
Figure 3: The objects after running Nextfit to seed the bins. The numbers in each object correspond to their index in the input. Figure 4: The selected bin S. Figure 5: Bin S is emptied and deleted, its objects distributed randomly. Figure 6: Bin A, previously overfull, is constrained by moving a random object into a new bin B. To achieve reasonable quality, heuristics must be added to the selection of bins S and D. When selecting S, the bin to be destroyed, we must favor bins with more empty space to avoid destroying wellfilled bins (see figure 7). Likewise, D should be chosen in a way that favors bins with full space so as to create betterfilled bins (see figure 8). Care must be taken when choosing D not to pick a bin that cannot fit O, unless O does not fit in any bins. This prevents needlessly creating overfull bins, which create additional bins when constrained. These heuristics improve quality significantly and make WalkPack a much more interesting problem to parallelize. Figure 7: Heuristic: The choice for a bin S to delete is weighted by empty space. Figure 8: Heuristic: The choice for each bin D to move an object to is weighted by full space. A final improvement to WalkPack that the authors of the RIT paper included is to run the algorithm repeatedly, taking the solution that uses the fewest bins once all trials have completed. This improves the quality and reliability of WalkPack by making it less likely to be caught in a lowquality local minimum. ApproachWe started by implementing the framework described in our reference paper in singlethreaded C++. Once we had this working with acceptable quality, we moved on to a CUDA implementation of this highly dynamic code. This is where we spent much of our time, dealing with the difficulties of writing this kind of code in CUDA (for the GTX 1080) as we went. For WalkPack, after writing the sequential code, we knew that there wouldn't be much room for parallelization directly, due to the intense data dependencies involved, so we decided to do multiple searches in parallel over different thread blocks and do a reduction to get the best result back. We were only able to find minor optimizations in execution time from increasing thread counts within each search. BFD benefits from parallelizing the search for a destination bin for each object. To perform this search in parallel, we have two phases. In the first, we do a simple tabulation: Each index is mapped to itself if the bin at that index can fit the target object, otherwise it is mapped to INT_MAX. We perform a parallel reduction on the resulting array of indices to find the lowest index that fits the target bin. We achieve this parallelism by mapping the problem onto a single block with many threads (see "Results" for further details). References[1] Bauer, Andrew; Kaminsky, Alan. GPU Parallel Program for the Bin Packing Problem, May 10, 2017. https://www.cs.rit.edu/~ark/students/amb4757/report.pdfPresentation DeliverablesAt the poster session, we will demonstrate speedup graphs and quality statistics for our implementations. These can be compared with the similar graphs from [1] to establish the quality of our work. We have created a visualizer that will give some intuition about the progress of the algorithm and the ultimate form of the solution. Schedule
