In this project, I implemented a parallel prime cluster search algorithm in CUDA, with an OpenMP version as a reference. My goal was to find prime clusters as fast as possible and to analyze the performance characteristics of the implementation in detail.
Prime clusters are not only an interesting number theory problem for mathematicians; primes underpin the cryptography that makes Internet commerce possible, and prime clusters in particular are the basis of a cryptocurrency called Riecoin.
The prime cluster search problem in Riecoin is: given a target number t and a search limit l, find n with t <= n <= t + l such that the following six numbers are all prime: n, n+4, n+6, n+10, n+12, and n+16. A prime cluster of this form is called a "sextuplet". For instance, the smallest sextuplet is 7, 11, 13, 17, 19, 23.
Typically we use a "sieve" to separate prime numbers from composites. The best-known algorithm is the Sieve of Eratosthenes, in which we iteratively mark the multiples of each prime as composite, starting with the multiples of 2; the numbers that remain are prime. While this is an efficient way to find small primes, it performs a lot of redundant work on large numbers. Wheel Factorization, on the other hand, exploits the fixed pattern that emerges during sieving to accelerate the process. For example, after sieving out all multiples of 2, 3, and 5, the remaining candidates repeat with a period of 2 * 3 * 5 = 30. This means we don't have to sieve these small primes over and over; once the basic pattern is established, we can simply skip over their multiples. The product of the first few primes, which is also the period of the pattern, is called a "primorial". In this manner, I can iteratively build larger patterns with little extra effort and rapidly eliminate a huge number of composites.
The prime cluster search algorithm consists of two steps. In the first step, I use Wheel Factorization to generate prime cluster candidates up to a specific primorial. In the second step, I use the Fermat primality test to eliminate the remaining composites.
The key idea of the first step is to iteratively build larger prime patterns. For each iteration, the input is the current prime pattern, an array of candidate primes up to a primorial PnX corresponding to prime X; the output is a larger prime pattern up to the next primorial after PnX (corresponding to the next prime after X). The following figure shows a simple example of building the prime pattern of Pn5 from that of Pn3. The input is the prime pattern 100010 (1 represents a candidate number, 0 a sieved-out number); every 6 (Pn3) numbers exhibit the same pattern. We extend it to 30 (Pn5) by appending four more copies of this pattern after the numbers 1 to 6, and then sieve out the candidates that are multiples of 5. The result is the output: the prime pattern up to Pn5.
The final output of step 1 is a very large array of candidate numbers; in the second step, I run the Fermat primality test on those candidates. The pseudo-code is as follows.
In the first step, the iterations are dependent: the input of each iteration is the output of the previous one, so iterations cannot be executed in parallel. However, within each iteration the numbers are processed independently, so they can be handled simultaneously. In the second step, the Fermat test runs on every candidate number independently, so this part can be fully parallelized. I decided to use CUDA to exploit this parallelism.
The challenges come from two aspects. First, in order to handle large numbers on the GPU, I needed a general-purpose multiple-precision integer library. Thanks to the work of AJ Kaufman and David Matlack, I had a fully functional multiple-precision integer library supporting all the basic arithmetic operations, including addition, subtraction, multiplication, division, comparison, and modular exponentiation. I optimized its modular exponentiation function (which I use heavily) for a ~25% speedup, and I added Fermat primality test functionality to the library. However, because each number is large (40 bytes), a cache line holds fewer numbers, which may hurt locality; it also means more data must be transferred to and from the device, making communication more costly. So I had to find a way around these potential locality and communication issues. The second challenge is that the sizes of the input and output arrays change in every Wheel Factorization iteration, which makes device resources hard to manage and can produce high contention when constructing the output array.
My target machine is the lab computer with an NVIDIA GeForce GTX 480. I started by implementing a sequential version of prime cluster search running the above algorithm, and then parallelized it. As mentioned above, the numbers within each Wheel Factorization iteration are processed independently, so I launch one kernel call on the GPU to sieve the numbers of each iteration. When the array grows large enough to exceed the GPU's global memory, I split it into subarrays and process one subarray at a time. For the Fermat primality test, if the array is small enough I call the corresponding kernel function once to test all candidates in parallel; otherwise I likewise process subarrays serially. To achieve better performance, I adopted the following engineering decisions.
A naive way to represent prime patterns is to use arrays containing only candidate numbers for both the input and output of every Wheel Factorization iteration. However, to construct the output array in this scheme, I would have to maintain a global index in my CUDA code, and each CUDA thread would fetch and increment that index to write its data to the output array. This would lead to high contention on the index.
An alternative is to use arrays with one slot for every number, with zeros masking the sieved-out numbers. In this scheme each CUDA thread simply updates the slot for a specific number, so there is no contention among threads. However, because primes grow sparser at larger values, this representation wastes a lot of memory and also results in poor locality.
I came up with a hybrid way to represent prime patterns. For each iteration (kernel call) of Wheel Factorization, the input is an array containing only candidate numbers, while the output is an array of all numbers with zeros masking composites. In my host code, a separate sequential function condenses the output array and prepares the input for the next iteration. A major advantage of this scheme is that there is no contention when constructing the output, while memory usage stays at a reasonable level. An obvious disadvantage is that, by Amdahl's law, the extra sequential part of each iteration on the host would slow overall performance; to deal with this, I introduced a pipelining method to hide the latency of the sequential part. Another disadvantage is that, because the input and output arrays have different formats, I cannot reuse data on the device and have to transfer data back and forth. But as I'll show later, this CUDA program is computation-intensive, so the communication overhead is not a big issue. After measuring the performance of all three prime pattern representations, I adopted the hybrid one in my implementation.
To hide the latency of the sequential part, I interleaved the execution of Wheel Factorization and the Fermat test: after each Wheel Factorization iteration, I call the Fermat test kernel on the input array of the current iteration, and the output array omits the first period corresponding to that input, so that the Fermat test does no redundant work. At the same time, the sequential function runs on the host to prepare the input for the next iteration. In this manner, the program does not stall waiting for the sequential function, hiding its latency. The following figure shows a rough timeline of pipelining Wheel Factorization and the Fermat test.
There are two choices of CUDA thread granularity: one thread per cluster (coarse-grained), or one thread per candidate number (fine-grained). The former enjoys better locality, because loading one member of a cluster brings all six numbers into the cache, so later accesses hit; the latter may produce more cache misses. However, one thread per candidate number gives better workload balance, while one thread per cluster can cause serious thread divergence, as shown below. It is therefore a tradeoff between locality and workload balance. I implemented both methods and will refer to them as "CUDA-coarse" and "CUDA-fine" respectively.
I measured the performance of both CUDA-coarse and CUDA-fine on the NVIDIA GeForce GTX 480 with 128 threads per block, and I also implemented an OpenMP version with 12 hyper-threaded threads as a reference. The speedups reported below are relative to a sequential version of prime cluster search running on the GHC3000 machines.
The overall speedup of generating sextuplets is shown in the following figure. The x-axis is the size of the prime pattern; I generate all sextuplets from 1 up to the value of x. In the long run, the speedup of CUDA-fine reaches about 8x, and CUDA-coarse about 7x. But when x is small, the speedup of both CUDA implementations is less than 1, because of the GPU management overhead of the first kernel invocation.
To remove the impact of the first CUDA invocation's overhead, I compare the speedup of each individual Wheel Factorization iteration and of the Fermat test. The result is more convincing regarding the effect of parallel execution in CUDA. We can see from the figure below that the Fermat test gains a better speedup from CUDA than Wheel Factorization does.
I then used the CUDA profiling tool "nvprof" to examine both programs (CUDA-coarse and CUDA-fine). First, I found that the program is computation-intensive: the communication-to-computation ratio is really low. For example, for CUDA-fine running up to Pn29, "memcpy" time (both directions) accounts for only 0.25% of the overall execution time. The ratio grows with larger array sizes, but remains negligible. So I don't worry about the disadvantage of my hybrid prime pattern representation, namely that data cannot be reused on the device and has to be transferred back and forth.
The second thing I tried to understand is the tradeoff between locality and workload balance in my CUDA-coarse and CUDA-fine implementations. I checked the L1 global cache hit rate of both implementations and found the result (bottom left) similar to what I expected. Since CUDA-coarse handles a whole cluster in each thread, a cache line loads all six numbers, so CUDA-coarse achieves a high hit rate (~80% for the Fermat test and 67%~87% for Wheel Factorization). On the other hand, CUDA-fine's L1 hit rate decreases as array sizes grow; at Pn29, CUDA-fine's hit rate is 35% lower in Wheel Factorization and 53% lower in the Fermat test. However, as shown in the previous figures, CUDA-fine has a higher speedup than CUDA-coarse under all conditions, because the effect of workload balance outweighs that of locality. In the bottom-right figure, I compare the warp execution efficiency of both CUDA implementations in the Fermat test; this value is the ratio of the average number of active threads per warp to the maximum number of threads per warp supported on a multiprocessor, expressed as a percentage. The figure shows that CUDA-fine has better warp efficiency, about 34% higher than CUDA-coarse at Pn23 and Pn29. I believe this is why CUDA-fine achieves an overall higher speedup than CUDA-coarse.
In summary, the performance of my CUDA implementations is not as good as I expected. One possible reason is that I did not achieve a good tradeoff between locality and workload balance. If I could find a way to combine good locality with a balanced workload, I might achieve a higher speedup than I currently have.