Vectorizing Python for concurrent SIMD execution.


Python has recently become one of the most popular high-level programming languages, with many unique features making it very appealing to a wide range of programmers. In contrast with older programming languages, Python is very programmer-friendly and is well-suited for rapid prototyping and testing. Unfortunately, Python is slow. But there are myriad third-party projects which aim to bring Python up-to-speed.

One widely used approach is symmetric multiprocessing. Due to Python's Global Interpreter Lock (GIL) though, simultaneous multithreading isn't possible in pure Python. Many projects, including pp and dispy, solve this problem by using multiple Python processes (similar to forking in other languages). In these types of solutions, multiple instances of the Python interpreter are launched, and true multiprocessing is possible because each instance has a unique GIL. This fork-like approach has also been used to implement cluster-based, cloud-based, and grid-based distributed processing in Python. In many of these projects though, the fundamental issue remains: Python is interpreted and executed slower than machine code.

A more direct approach is to write native modules, which are simply wrappers around native libraries and allow performance-critical code to be written in other languages which can be compiled to machine code and called from Python. Native modules are critical for the fastest-possible processing and are used in many high-performance libraries, like NumPy and SciPy. While this type of solution offers very high performance on a set of common functions, it does not enable native execution of arbitrary Python code.

As opposed to writing a module in C++, a hybrid approach is to include C++ source code in the original Python program, which is then compiled at runtime and executed natively. Projects taking this approach include PyCUDA and PyOpenCL, and have the advantage of being very flexible in what functionally can be parallelized. The disadvantage to this approach is that the programmer must write both Python and C++ code.

The paradoxical solution for writing fast Python code seems to be: rely on fixed functionality exposed by third-party libraries, or use a different language. VecPy exists as a solution to this problem. Using Python, a programmer can quickly prototype and test an arbitrary kernel function, and VecPy will compile that kernel into a native module which can take advantage of multi-threading and SIMD instructions on modern processors. This is similar to the approach taken by the copperhead project, which dynamically compiles Python functions and targets multicore CPUs (via OpenMP) and Nvidia GPUs (via CUDA). In this context, what sets VecPy apart is its utilization of SIMD vector intrinsics, which (to my knowledge) are not used in any of the previously mentioned projects. Another project that is similar in spirit is py2llvm, which translates Python into low-level LLVM code, taking advantage of SIMD when appropriate. One of the distinguishing features of VecPy is the creation of a portable, reusable native module, which is not achieved by either of the aforementioned projects.



The challenges of this project are numerous. At the most basic level, VecPy will have to correctly translate Python code to C++, adding various entry points for each supported language binding. More importantly, the generated code will need to be able to vectorize operations where appropriate by using SIMD intrinsics. VecPy will target many architectures (SSE2, SSE4, AVX, etc.), and making good use of the hardware will require careful usage of the available features of each target architecture. Furthermore, VecPy itself will execute across multiple threads, and strategies for division of labor must be employed to avoid workload imbalance.

Another challenge will be to support a reasonably sized feature set of Python so that useful kernels can be vectorized. A trivial implementation could support common arithmetic operators, but a meaningful implementation will need to be able to handle many additional functions (such as those from Python's built-in math library). Finally, there is the issue of data types, which by default are not explicitly set in a Python program. Therefore, VecPy will need to be able to distinguish between integral and floating-point types and generate appropriate C++ code for the kernel being compiled.



Although VecPy will be developed from scratch, Python ships with a couple of very useful built-in modules for source code manipulation. The AST module, for instance, takes as input Python source code and generates an abstract syntax tree. In essence, this eliminates the need for a custom parser, and has the added benefit of catching Python syntax errors early in the process. Another very useful built-in module is the Inspect module, which, among other things, takes as input a live handle to an arbitrary Python function and returns its source code. These two modules will together make it possible to generate native libraries for functions even when the source code isn't immediately available.

In terms of computing, no special hardware is required. Because multi-core x86 processors supporting SIMD vector instructions are near-ubiquitous, VecPy can be developed and run on available commodity hardware. As for software, VecPy will be written in Python 3, and will require at minimum a suitable C++ compiler. For linking to the Python and Java APIs, the appropriate SDKs containing header files and static libraries need to be available when VecPy is invoked. Once the native library has been compiled, it will (at least, in theory) be able to be called on different machines as long as the requisite runtime environments are available.



The primary goal is to develop a Python module compiler that leverages SIMD vector intrinsics on modern processors. There are many available architectures, and I plan to support at least SSE4 and AVX2. VecPy is primarily designed to generate modules called from Python, but I also plan to add API bindings for C++ and Java. The data types that VecPy will support include at least 32-bit floats. In terms of a feature set, I aim to support all Python arithmetic and logic operators and the subset of functions that exist in both the Python math library and the standard C++ math library.

If all goes well and I am able to make more progress than originally planned, there are several extra features I can implement: more target architectures (including potentially GPUs), additional programming language bindings, alternative data types (64-bit doubles, integer types), and a more diverse feature set.

I plan to demonstrate the correctness, performance, and utility of VecPy by implementing several example programs and comparing naive Python performance with VecPy performance. In particular, I will reimplement the first class homework, generating fractals. Another potential comparison could illustrate a speedup in calculating cryptographic hashes in parallel. Although maximum speedup is problem-dependent, for trivial kernels a close to perfect speedup should be possible.



As mentioned previously, Python is an immensely popular programming language but suffers from poor performance on data parallel tasks. As a result, common practice is to code performance-critical sections in another language and call the resulting module from Python. Python has several built-in libraries that greatly simplify source code manipulation though, and the process of module generation can be automated by tools like VecPy. The powerful prototyping features of Python can save valuable time, and VecPy can help bridge the gap between Python, where the kernel is written, and other languages, like C++ and Java, where the kernel is called.



April 4 Project Proposal + Proof of Concept Finalize this document! Also, verify that there are no significant obstacles that would make this project infeasible.
April 11 Finalize VecPy Core The PoC code spwans a hard-coded number of threads; spawn an appropriate number of threads and verify that workload is evenly distributed.
April 18 Wrap up operators and math functions The PoC code only handles a small subset of the intended functionality. I want to have all operators and functions from the math library implemented by this date.
April 25 Implement remaining architectures The PoC code only uses a very small subset of SSE. I want to have SSE4 and AVX2 fully operational at this point.
May 2 Branching The PoC code can't handle branching. This is when I want to add support for if statements and while loops. This will allow for significantly more useful kernel functions.
May 9 Finished Product Wrap up any outstanding todo items, and make the final project writeup. Implement test programs and show speedup gained by using VecPy.