Example: Parallel Computation with Pools
To demonstrate some of the tools available for parallel computation, I decided to try out my Mandelbrot set example again, this time doing as much as possible in parallel. The Mandelbrot calculation is embarrassingly parallel, so it was very easy to modify. Since almost all of the work is parallelizable, Amdahl's law predicts a good speedup, approximately proportional to the number of cores (or hardware threads).
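For reference, Amdahl's law says that if a fraction p of the work can be parallelized, the best possible speedup on N cores is

S(N) = 1 / ((1 - p) + p / N)

For example, assuming p = 0.99 (an illustrative value, not something I measured), four cores give S ≈ 3.9x, which is about the best scaling we can hope for here.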
In each case, this was done with the thread pool pattern: creating a bunch of threads (or processes) that pick jobs from a queue and execute them. All of the tests were done on an Intel Ivy Bridge 4 core/8 thread processor.
Python Implementation
The Python multiprocessing module makes working with process pools easy: you create a process pool and just start feeding it tasks.
My parallel implementation in mandel-parallel.py adds relatively little to my original implementation. I mostly just had to split the calculation of a single row off into a separate function that could be called by a pool process.
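To give a sense of that structure, here is a minimal sketch of the row-per-task approach (the function names and image dimensions are my own illustrative choices, not the contents of mandel-parallel.py):

import multiprocessing

WIDTH, HEIGHT = 800, 600
MAX_ITER = 256

def mandel(x, y):
    # standard escape-time calculation for a single point
    c = complex(x, y)
    z = 0j
    for i in range(MAX_ITER):
        z = z*z + c
        if abs(z) > 2:
            return i
    return MAX_ITER

def mandel_row(row):
    # one unit of work for the pool: every pixel in one row
    y = -1.5 + 3.0*row/HEIGHT
    return [mandel(-2.0 + 3.0*col/WIDTH, y) for col in range(WIDTH)]

if __name__ == '__main__':
    # by default the pool creates one worker process per core
    with multiprocessing.Pool() as pool:
        image = pool.map(mandel_row, range(HEIGHT))

Here pool.map hands each row index to a worker process and collects the results in order, so the surrounding program barely changes.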
The speedup was 3.9x with four cores.
Also note that there is absolutely no problem using code from a Cython module in a multiprocessing job: you just have to import the module and call the function. As long as you have pure Cython functions, this should be very fast and easy.
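For example, if the escape-time calculation lived in a (hypothetical) compiled Cython module named mandel_cy, the pool code would look essentially the same:

import multiprocessing
import mandel_cy  # hypothetical compiled Cython module exposing mandel(x, y)

WIDTH, HEIGHT = 800, 600

def mandel_row(row):
    # plain Python wrapper handed to the pool; the heavy lifting is in Cython
    y = -1.5 + 3.0*row/HEIGHT
    return [mandel_cy.mandel(-2.0 + 3.0*col/WIDTH, y) for col in range(WIDTH)]

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        image = pool.map(mandel_row, range(HEIGHT))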
Results with PyPy were a little better: close to a 4.5x speedup (and much faster baseline time).
Java Implementation
The Java implementation comes with a little more Java-ish syntax overhead, but can do thread pools relatively easily using the ThreadPoolExecutor class. The code that does the work must be in a separate class that implements Runnable.
My implementation MandelParallel.java does all of this. The speedup was 5.8x with the parallel implementation.
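For a sense of the shape of the code, here is a minimal sketch of the same row-per-task pattern (class and method names are my own; this is not the actual MandelParallel.java). Executors.newFixedThreadPool returns a ThreadPoolExecutor under the hood:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MandelPoolSketch {
    static final int WIDTH = 800, HEIGHT = 600, MAX_ITER = 256;
    static final int[][] image = new int[HEIGHT][WIDTH];

    // the Runnable: computes every pixel in one row
    static class RowTask implements Runnable {
        private final int row;
        RowTask(int row) { this.row = row; }
        public void run() {
            double y = -1.5 + 3.0 * row / HEIGHT;
            for (int col = 0; col < WIDTH; col++) {
                image[row][col] = escapeTime(-2.0 + 3.0 * col / WIDTH, y);
            }
        }
    }

    // standard escape-time calculation for one point
    static int escapeTime(double x0, double y0) {
        double x = 0, y = 0;
        int i = 0;
        while (x * x + y * y <= 4 && i < MAX_ITER) {
            double xt = x * x - y * y + x0;
            y = 2 * x * y + y0;
            x = xt;
            i++;
        }
        return i;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (int row = 0; row < HEIGHT; row++) {
            pool.execute(new RowTask(row));   // queue one task per row
        }
        pool.shutdown();                      // no new tasks; let the queue drain
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}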
Haskell Implementation
The Haskell parallel package provides some convenient tools for parallelizing functional code. I used it in my mandel-parallel.hs. (Compare this with my other parallel Haskell example.)
The Haskell code was the easiest to modify to run in parallel: I simply replaced a map with parMap rseq (and used deepseq to kill off some lazy evaluation of the inner lists). The parMap function does the same thing as map, but does it in parallel. The first argument, rseq, indicates a “strategy”: it basically says not to be lazy about the calculation and that parMap should actually compute each value (only to weak head normal form, though, which is why deepseq is still needed to force the inner lists).
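As a sketch of what that looks like (with my own mandelRow and image dimensions; this is not the actual mandel-parallel.hs):

import Control.Parallel.Strategies (parMap, rseq)
import Control.DeepSeq (force)

maxIter :: Int
maxIter = 256

-- standard escape-time count for one point
mandel :: Double -> Double -> Int
mandel x0 y0 = go 0 0 0
  where
    go x y i
      | i >= maxIter || x*x + y*y > 4 = i
      | otherwise = go (x*x - y*y + x0) (2*x*y + y0) (i + 1)

-- one row of the image; force (from deepseq) makes sure the whole row
-- is computed inside the spark, not lazily later
mandelRow :: Int -> [Int]
mandelRow row = force [mandel (scaleX c) (scaleY row) | c <- [0 .. 799]]
  where
    scaleX c = -2.0 + 3.0 * fromIntegral c / 800
    scaleY r = -1.5 + 3.0 * fromIntegral r / 600

main :: IO ()
main = print (sum (map sum rows))
  where
    -- parMap rseq: like map, but each row is evaluated in parallel
    rows = parMap rseq mandelRow [0 .. 599]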
GHC does take some convincing to actually run the code in parallel. It must be compiled and run like this:
ghc -O2 --make -with-rtsopts="-N8" -threaded mandel-parallel.hs
./mandel-parallel
(The -threaded argument tells GHC to include the multi-thread stuff. The -N8 argument tells the runtime to use eight threads.)
The speedup was 7.1x over the single-threaded version.
Chunk Size
The size of the piece of work that is given to each thread is important.
If the chunk is too small, then the threads spend an excessive amount of time communicating with each other. As an example, I modified the Python implementation so it gives each pixel to a worker (instead of each row) in mandel-parallel-tiny.py. This implementation took around 30% longer than the above (which was actually better than I expected, presumably because the amount of work required to calculate each pixel is relatively large).
If the chunk size is too large, we risk not using our available cores effectively. For example, if we split the job into exactly one chunk per core, we might find that one of the jobs completes earlier (because it was easier, or because that core didn't get interrupted by other work). That leaves the remaining cores working to finish while one sits idle.
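To make the trade-off concrete in the Python version, here is a hedged sketch of the per-pixel variant (my own names, not the actual mandel-parallel-tiny.py); Pool.map's chunksize argument is the knob that controls how many of these tiny tasks are handed to a worker at once:

import multiprocessing

WIDTH, HEIGHT, MAX_ITER = 800, 600, 256

def mandel_pixel(coords):
    # one unit of work is now a single pixel
    row, col = coords
    c = complex(-2.0 + 3.0*col/WIDTH, -1.5 + 3.0*row/HEIGHT)
    z = 0j
    for i in range(MAX_ITER):
        z = z*z + c
        if abs(z) > 2:
            return i
    return MAX_ITER

if __name__ == '__main__':
    coords = [(r, c) for r in range(HEIGHT) for c in range(WIDTH)]
    with multiprocessing.Pool() as pool:
        # chunksize=1 maximizes communication overhead; a very large value
        # risks leaving cores idle near the end of the job
        pixels = pool.map(mandel_pixel, coords, chunksize=1)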
See also my parallel Haskell example for another look at the chunk size given to each thread/process.