I am going to implement the PVSplit chesss-solver algorithm on the quad-core GHC machines. I will quantify speedup and the factors affecting it, such as fundamentally serial sections, workload imbalance, and various locking schemes. I will explore how this speedup translates into a competitive edge.
The program does actually run in parallel.
Done naively, in achieves 1.4x speedup.When using the principle variation splitting algorithm, a single subtree is searched in its entirety before other subtrees are searched, providing good baseline alpa-beta values.
I correctly implemented this algorithm, and now speedup is 2.5 on eight cores.
The non-linear speedup is caused primarily by workload imbalance and search overhead. Search overhead was reduced when we switched to the principal variation splitting, but it can't be completely avoided.
I have strong quantitative data on search overhead and workload imbalance
In addition to measuring speedup, I have measured the search depth when processing time is held constant, like you'd see in a real chess match.
Last but certainly not least, I have simulated hundreds of games to show that the parallelized version does actually perform better, specifically it wins 15% more games than the serial version. Equivalently, it performs better in 30% of games; sometimes it changes a loss to a draw, sometimes it changes a draw to a win.
I will use my remaining time before the cometition to implement work stealing to reduce workload imbalance
Like I said, I had two final projects in other classes due last week, and no significant projects in other classes between now and the end of the year. Nonetheless, I have been hard at work during this time, spending many days in their entirety on this project .Some things I have done:
The biggest challenge that I have run into is that in the theory of this algorithm, the branches of the minimax tree are supposed to be independent. However, the activities on these branches are coded in a non-independent way in the starter code. They tend to read a global variable, make changes to it, run a heuristic on those changes, then undo the changes. This doesn't work in parallel. For the past three days my plan has to be to replace these functions with functions that utilize a copy of the data. However this plan has two issues. The first is that it is long and tedious (I can get over that though). The second is that it is easy to make errors and hard to isolate those errors when they get made (this may prove too difficult an obstacle to overcome). I'm trying to fix the bugs on this front tonight, but if I can't do that I'll probably abandon Cilk entirely and look into using fork() system calls to copy all of the global variables, but that will take a little research. I am probably going to talk to Kayvon about it in the next couple days.
The game of chess has long been considered a great test of intellect. Accordingly developing software that can master it has been of interest since the early days of artificial intelligence.
In order to make decisions in chess you need a system for evaluating the strategic value of the board state, and there are well-known heuristics for this, such as summing the point values of the pieces. But chess is all about planning several moves ahead. For an AI, this is acclmplished with an algorithm called the minimax. The minimax algorithm looks at a tree where each path is a possible sequence of future moves. A smart player will always pick the move resulting in the best board state. But, because the opponent is also intelligent and wants to minimize your chances of winning, the value of that board state is actually equal to the minimum of the values of the possible board states that follow it. To find the best move, we need to repeatedly perform this minimizing and maximizing search pattern until we find a path guaranteeing victory or we run out of resources and are forced to use a heuristic.
A technique called alpha-beta pruning is used to dramatically speed up this search. With alpha-beta pruning we can skip many possible moves because we know that no player will ever have an incentive to choose them.
Parallelism has been shown to improve the performance of chess AI because one can search deeper into the minimax tree in the same amount of time. Alpha-beta pruning is parallelized through a process known as Principle Variation Splitting. With principle variation splitting, a single branch of the tree is searched in serial to provide alpha-beta bounds for the following searches. Then each subtree is placed in a subtree. At this point they are mostly independent so many threads can each grab one subtree and solve it in serial.
A number of challenges are present. Alpha-beta pruning causes searches to abruptly stop, so there is substantial workload imbalance. Also, the alpha-beta bounds on what need searched become tighter as the algorithm progresses. This means a parallel algorithm might use older alpha-beta values to determine something needs searched, even though a serial algorithm would have used up-to-date alpha-beta values to elminate that work; this is called serach overhead. There is also substantial contention over the global alpha and beta values. Lastly, a shared hash table is used so repeated states need not be recalculated. Contention-resolution strategies on the hash table have particularly interesting trade-offs because errors in the hash table don't cause incorrect behavior, just sub-optimal behavior.
I have starter code called Marcel's Simple Chess Program. This program, written in open source c, has a working chess engine with an AI that does minimax solving in serial, including the heuristics, but has no parallelism.
The core structure of the project was heavily taken from this 15-418 project from last year:http://www.andrew.cmu.edu/user/arsinha/15418/418_final_report.pdf
That paper led me to this significantly more helpful paper from a student at University of Illinois Urbana Champagne:http://iacoma.cs.uiuc.edu/~greskamp/pdfs/412.pdf
After talking with Dr. Kayvon, it seems prudent to avoid looking at the CMU project while I'm working to ensure our projects aren't identical. It's less clear whether it is problematic to look at the UIUC paper, so feedback of what's acceptable on that front is appreciated (I don't think I need to look at it because I found the papers that originally published the algorithms, but if I'm stuck it might get me out of a jam a little faster...)
Correctly implement a parallel search algorithm that is at least as competitive as the original program.
I plan to get a speedup of at least 3 on a quad-core machine versus serial search, holding search depth constant.
I plan to solve contention over the hash table with locking (a similar project last year considered locking but ran out of time to implement it. This will allow for intersting comparisons).
I plan to get rigorous quantitative data on factors affecting my speedup, including contention of shared resources, workload imbalance, and search overhead ('search overhead' is additional searches performed because we aren't using the most up-to-date alpha-beta values).
I hope to also use an optimistic scheme for hash table updates so I can compare this to the locking-based, more pessimistic scheme from above.
I hope to implement my program so that it can play in online servers.
I hope to bring a ~4x speedup on the quad-core machine.
This may be a little redundant, but Piazza told me to make sure it's clear.
My primary benchmark will be speedup. I will see how long a mini-max search of a fixed depth takes with the parallel algorithm versus the original serial implemetation. I will need to put in timing code, called in the move function, to do this. I hope for close to 4x speedup.
A more ambitious benchmark is in-game performance. I will play the serial version against the parllel version. To do this, I'll need to add code that selects which algorithm it's using at the start of each turn.I'll also want a shell-script that automates the game start-up and records results.
I actually want to live-demo the parallel version of my program against the serial version of my program playing chess against the parallel version of my program. I'll have a little randomization built into the algorithm, so the parallel one won't win every time, but I suspect in the span of a few minutes it can win enough games to prove that it is superior. I already figured out how to get the chess to display into the terminal. Supposedly the starter code plugs into a graphical chess program.
In addition, I'll have speedup graphs, and detailed win-loss data.
I am planning on using the quad-core GHC machines. This algorithm has a fundamentally serial section (it searches the first branch to get an initial alpha and beta value), so its maximum speedup is limited to being around 4.
I will be using the Cilk language. Cilk was designed to shine in divide-and-conquer parallel algorithms and that's exactly what PVSplit is.
Figure out how to compile a program with the MSCP engine and Cilk commands.
Get the pseudocode solid.
Research Cilk tools.
Complete code for checking if parallel solution matches serial solution (we'll worry about making parallel do better later).
Begin coding Cilk version of serial implementation
Do midterm report.
Finish the Cilk version, and confirm it correctly matches serial implementation.
Begin parallelizing Cilk.
Finish parallelizing Cilk with most basic locking scheme possible.
Begin working on more efficient locking to increase performance.
Finish improving locking.
Insert benchmarking code to how much time we lose to the various factors inhibiting speedup.
NOTE: This schedule may look backloaded. This was intentional and done with substantial forethought. Of my remaining work in my other classes, over 85% of it is due before April 17. Thus this project will have almost 100% of my work attention after April 17.
It is very difficult for me to know in advance how long this project will take. On the one hand, my starter code is substantial and this is supposed to be a big project. On the other hand, the PVSplit algorithm is more confusing than I initially expected it to be, integrating code into the starter code will be a challenge, and everything in programming takes longer than you expect. I am definitely hesitant to bite off more than I can chew, so if this project looks too ambitious PLEASE let me know. I will not be offended. At all. I promise.
On the other hand, I don't want to be one of those people who's surprised when his project is deemed too trivial. So if my project isn't challenging enough, please let me know.