Vectorizing Python for concurrent SIMD execution.

Overall Progress

Over the past two weeks I have made significant progress on VecPy. Starting with a bare-bones proof of concept (which I created during spring break), I determined that the project, though ambitious, was feasible. This early implementation consisted of three main components: a Python parser to decode the user's kernel function, a language-independent kernel builder to hold the kernel logic, and a compiler to generate C++ code from that logic. I extended the parser to handle all Python floating-point operators and most calls to built-in math functions. I modified the kernel builder to accommodate the more sophisticated parsing logic. I made significant changes to the compiler, which are discussed in more detail below. I also implemented build script generation and run-time compilation of the native module via system calls to the C++ compiler (g++). Finally, I created a basic test harness to check for program correctness and to assess speedups using the native modules generated by VecPy.
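To give a feel for what the parser stage does, here is a minimal sketch using Python's standard ast module. This is purely illustrative and assumes nothing about VecPy's actual internals; the function name collect_operations is hypothetical. It walks a kernel's syntax tree and records the binary operators and math-module calls that a back end would need to translate.

```python
import ast

# Toy kernel source, standing in for what a parser like VecPy's receives.
src = """
def shapes(radius, edge, vol_sphere, vol_icos):
    vol_sphere = (4/3 * math.pi) * (radius ** 3)
    vol_icos = (5/12) * (3 + math.sqrt(5)) * (edge ** 3)
    return (vol_sphere, vol_icos)
"""

def collect_operations(source):
    """Walk the kernel's AST, recording the binary operators and the
    math-module calls a code generator would need to handle."""
    ops, calls = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp):
            ops.add(type(node.op).__name__)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            calls.add(node.func.attr)
    return ops, calls

ops, calls = collect_operations(src)
# ops -> {'Mult', 'Div', 'Add', 'Pow'}; calls -> {'sqrt'}
```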

One of the main goals I had for the first half of the project was to make the division of labor more robust. Previously I required the input data for the native module to be a multiple of the vector size and thread count, but I was able to relax that requirement; now arrays of any length can be passed to the native kernel. Similarly, I upgraded to a variable number of threads (either specified by the user or determined through the Python runtime based on the number of CPU cores) instead of using two hard-coded threads. A more distant goal is to support SSE4 and AVX2 architectures, and I was able to get a little bit ahead of schedule by finishing the SSE4 implementation. One of the goals on my wish-list was to include support for data types other than 32-bit floats, and I made some pretty significant architectural changes to the codebase to accommodate the future inclusion of additional data types, if I get that far. For now though, VecPy has full support for 32-bit floats, and extremely basic support for 32-bit unsigned integers.
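The partitioning logic can be sketched as follows. This is illustrative only (the function and its signature are hypothetical, and VecPy's actual scheme may differ): give each thread a contiguous chunk whose length is a multiple of the SIMD vector width, and sweep up any leftover elements with a scalar clean-up pass.

```python
def partition(n, num_threads, vector_width):
    """Split n elements into per-thread (start, stop) chunks whose lengths
    are multiples of vector_width; leftover elements are returned
    separately for a scalar clean-up pass."""
    vectors = n // vector_width              # whole SIMD vectors available
    per_thread = vectors // num_threads      # vectors per thread (floor)
    extra = vectors % num_threads            # spread leftover vectors evenly
    chunks, start = [], 0
    for t in range(num_threads):
        count = (per_thread + (1 if t < extra else 0)) * vector_width
        chunks.append((start, start + count))
        start += count
    return chunks, (start, n)                # (SIMD chunks, scalar remainder)

# An array of any length is now accepted, e.g. one that is not a
# multiple of the vector size or thread count:
chunks, remainder = partition(1000003, 4, 4)
```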



I have accomplished all of the goals that I set for this checkpoint: the project was determined to be feasible, workload has been distributed evenly among a variable number of threads, and all Python built-in floating-point math operators and functions (limited to the subset that also exists in the C++ standard library) have been implemented. I also made progress toward the next goal by finishing up support for SSE4. I do not anticipate any major hurdles with the remaining goals, and I may be able to complete my wish-list item of adding integer support if I have enough time. See my updated schedule below for a more detailed list of goals moving forward.


Competition Plans

For the parallelism competition I primarily plan to show two things: an analysis (graphs and maybe tables) of the speedup granted by VecPy on sample problems and a "hello world" VecPy program. If the presentation format permits, I will run the code on my laptop; otherwise, I will provide a static code listing to highlight the most interesting features of VecPy.


Preliminary Results

The most basic result that I have at this point is a working "hello world" program, which is currently displayed as VecPy's usage example on GitHub. I also have preliminary speedups for various debugging kernels. For example, consider the following kernel function which calculates the volume of a sphere and an icosahedron given a radius and an edge length:

def shapes(radius, edge, vol_sphere, vol_icos):
    vol_sphere = (4/3 * math.pi) * (radius ** 3)
    vol_icos = (5/12) * (3 + math.sqrt(5)) * (edge ** 3)
    return (vol_sphere, vol_icos)
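For reference, the pure-Python baseline simply applies the kernel element by element; this is the loop that the generated native module parallelizes across threads and SIMD lanes. The driver below is a sketch (the kernel is repeated so the snippet runs standalone), and VecPy's actual calling convention may differ.

```python
import math

def shapes(radius, edge, vol_sphere, vol_icos):
    vol_sphere = (4/3 * math.pi) * (radius ** 3)
    vol_icos = (5/12) * (3 + math.sqrt(5)) * (edge ** 3)
    return (vol_sphere, vol_icos)

# Pure-Python baseline: apply the kernel to each element in turn.
radii = [1.0, 2.0, 3.0]
edges = [1.0, 1.0, 1.0]
spheres, icos = [], []
for r, e in zip(radii, edges):
    vs, vi = shapes(r, e, 0.0, 0.0)
    spheres.append(vs)
    icos.append(vi)
# A unit sphere has volume 4/3 * pi, about 4.18879
```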

Calling the pure Python kernel with an input of one million radii and edges takes 1.390 seconds. The native module generated by VecPy runs in 0.090 seconds on the same dataset using two threads and SSE4 SIMD intrinsics, a speedup of 15.462x. When using a single thread and no SIMD instructions, the native module (which is in essence a plain C++ implementation of the kernel) takes 0.152 seconds. Threading and SIMD give a speedup of about 1.7x, indicating that there is still room for improvement.


Known Issues

The less-than-ideal speedup of 1.7x described above illustrates one problem I've run into: I haven't found a way to efficiently perform operations on individual elements of a SIMD vector. The current implementation uses a very simple fallback strategy: when appropriate SIMD instructions aren't available, it stores the vector to a temporary buffer and operates on the elements one at a time. For example, the pow function calls (below) severely impact performance in the C++ code generated by VecPy for the shapes kernel shown above. I'm almost positive there is a much better way to implement this, but so far I've focused on correctness over performance.

//>>> vol_sphere = (4/3 * math.pi) * (radius ** 3)
var005 = _mm_div_ps(lit006, lit007);
var004 = _mm_mul_ps(var005, lit008_PI);
_mm_store_ps(tempA, radius);
_mm_store_ps(tempB, lit007);
tempA[0] = pow(tempA[0], tempB[0]);
tempA[1] = pow(tempA[1], tempB[1]);
tempA[2] = pow(tempA[2], tempB[2]);
tempA[3] = pow(tempA[3], tempB[3]);
var009 = _mm_load_ps(tempA);
vol_sphere = _mm_mul_ps(var004, var009);
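One possible improvement, which I have not implemented: when the exponent is a small non-negative integer, x ** n can be lowered to a short chain of multiplies, each of which maps directly to an instruction like _mm_mul_ps, avoiding the store/pow/load round trip entirely. The sketch below shows the square-and-multiply idea in plain Python; the function name is hypothetical.

```python
def pow_small_int(x, n):
    """Compute x ** n for a small non-negative integer n using only
    multiplies (square-and-multiply). Each multiply here corresponds to
    one vector multiply a SIMD back end could emit, keeping the whole
    computation in vector registers."""
    result, base = 1.0, x
    while n > 0:
        if n & 1:           # current exponent bit set: fold base in
            result *= base
        base *= base        # square the base for the next bit
        n >>= 1
    return result
```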



Updated Schedule

April 4: Project Proposal + Proof of Concept. Finalize the proposal! Also, verify that there are no significant obstacles that would make this project infeasible.
April 11: Finalize VecPy Core. The PoC code spawns a hard-coded number of threads; spawn an appropriate number of threads and verify that the workload is evenly distributed.
April 18: Wrap Up Operators and Math Functions. The PoC code only handles a small subset of the intended functionality. I want to have all operators and functions from the math library implemented by this date. Do the checkpoint writeup.
April 25: Implement Additional Architectures. The PoC code only uses a very small subset of SSE. I want to have SSE4 and AVX2 fully operational at this point.
April 29: Basic Branching. There is no support for branching yet. I plan to add support for if statements, which will require basic comparison operators to be implemented.
May 2: Basic Looping. I plan to add support for while loops. This will allow for significantly more useful kernel functions.
May 6: Example Problems. I plan to implement several sample kernels to illustrate VecPy's performance and utility. I will include the shapes kernel (see checkpoint), a Mandelbrot set kernel, and a Julia set kernel. On the wish-list is the SHA2 hash function final compression kernel.
May 9: Finished Product. Wrap up any outstanding todo items, and prepare the final project writeup and presentation. Polish the code and make sure there is sufficient documentation for others to use VecPy.