Team members:

Final Report

SUMMARY

We built a parallelized checkers AI. It delivers 4.73x speedup over a sequential AI on 6 cores. It is undefeated in the 10+ games our testers have played against it.

BACKGROUND

We started with a basic MiniMax algorithm, which considers the tree of possible outcomes within k moves. It evaluates how good each outcome is using some heuristic, and moves towards the best outcome assuming the opponent plays optimally. This algorithm is naturally parallel, as the each subtree can be traversed in parallel.

Despite its parallelism, MiniMax is not incredibly fast. The next step is to add alpha-beta pruning, which allows us to skip certain branches using the results already found. In particular, we keep track of the best outcome the maximizer and minimizer can guarantee along the path to the root, and do not explore branches where we know the best cannot be better than what we already found. Unlike MiniMax, this algorithm is inherently sequential, as we use previous results to prune.

The final algorithm used was Jamboree, which combines MiniMax with alpha-beta pruning to reach a good balance between pruning and parallelism to obtain the fastest runtime. Jamboree runs alpha-beta pruning sequentially on the first j possible moves, using the outputted values to prune when evaluating the remaining moves in parallel.

APPROACH

For the heuristic, we simply counted how many pieces each side had, counting kings double. The code has a modular design so that any board game could be dropped in to replace checkers, as long as its heuristic and list_moves functions are provided, as well as methods for users inputting moves and the board being printed.

We targeted 6-core machines with c++ and pthreads. After tuning for an 8x8 board using six threads, we settled on sequentially computing alpha/beta bounds for the first two next-moves, then using these bounds to analyze the remaining next-moves in parallel. We scheduled work onto the threads by having them read from a work queue. Initially we spawned pthreads during each jamboree call, but this resulted in far too many threads, so we decided to switch to using a thread pool that is created at the start of the program. Concurrency control is accomplished with a coarse lock on the work queue and by using conditional variables to wake threads up when work is available and keep them descheduled when there is nothing for them to do.

Any subtree evaluation to be performed in parallel does not further divide the workload. We favored maintaining relatively large subproblems over improved workload balance in order to minimize overall work through pruning. For the first two sequential calls, we experimented with calling jamboree recursively as opposed to simply calling the sequential alpha-beta pruning, and found that recursively calling up to max_depth/2 gave the best performance. Such a cutoff is useful because sequential alpha-beta pruning is faster than jamboree on one thread, so once the work has been divided up well enough for all threads to have something to do, it is better for them to use sequential alpha-beta pruning.

RESULTS

Results were obtained by running a game between two AI and determining the average time per turn across 30-40 turns. We chose to limit the number of turns because towards the end of the game we saw a great deal of variability in time per turn, and we were able to get consistent results by doing this.

We evaluated the performance using depths 12 and 15 on an 8x8 board, comparing the parallel AI running on 1, 2, 4, and 6 threads to the sequential version.

A few things of note:

  1. The poor performance of the single threaded parallel implementation as opposed to the sequential one is not purely due to overhead (which we think alone would be pretty small), but rather can primarily be explained by the fact that less pruning is done in the parallel version, as initial move options are still queued independently.
  2. We notice a dip at 4 threads likely because of workload imbalance - the usual number of possible moves is usually around 7-8, so after the first 2 are evaluated sequentially we can distribute the remaining ones onto 6 threads well, but with 4 threads we have some doing more work than others.
  3. While we cared about speedup for any number of threads, we focused on tuning for 6 threads. We are very happy with the 4.73x speedup obtained at depth 12, and still happy with 3.75x speedup for depth 15. We attribute the difference here again to workload imbalance - with depth 15, there is greater potential for variability in subtree sizes.

Overall, speedup was limited by the workload imbalance created by trying to maintain large subtrees to be run sequentially, as we wanted to maximize pruning. Dividing into more parallel subproblems might help balance the workload, but creates more work overall. After tuning our system, we are happy with the balance we found and with the speedup obtained.

REFERENCES

https://www.youtube.com/watch?v=xBXHtz4Gbdo
http://en.wikipedia.org/wiki/Alpha%E2%80%93beta_pruning

LIST OF WORK BY EACH STUDENT

Equal work was performed by both project members. In the beginning, Andrew focused more on the checkers framework while Jason focused more on the sequential AI. Afterwards, most of the work was done while pair programming.