Example: Parallel Computation with Pools
To demonstrate some of the tools available for parallel computation, I decided to try out my Mandelbrot set example again, this time doing as much as possible in parallel. The Mandelbrot calculation is embarrassingly parallel, so it was very easy to modify. Since almost all of the work is parallelizable, Amdahl's law predicts good speed up, approximately proportionate to the number of cores. Or threads.
In each case, this was done with the thread pool pattern: creating a bunch of threads (or processes) that pick jobs from a queue and execute them. All of the tests were done on an Intel Ivy Bridge 4 core/8 thread processor.
My parallel implementation in
mandel-parallel.py adds relatively little to my original implementation. I mostly just had to split the calculation of a single row off into a separate function that could be called by a pool process.
The speedup was 3.9x with four cores.
Also note that there is absolutely no problem using code from a Cython module in a
multiprocessing job: you just have to import the module and call the function. As long as you have pure Cython functions, this should be very fast and easy.
Results with PyPy were a little better: close to a 4.5x speedup (and much faster baseline time).
The Java implementation comes with a little more Java-ish syntax overhead, but can do thread pools relatively easily using the
ThreadPoolExecutor class. The code that does the work must be in a separate class that implements
MandelParallel.java does all of this. The speedup was 5.8x with the parallel implementation.
The Haskell code was the easiest to modify to run in parallel: I simply replaced a
parMap rseq (and used
deepseq to kill off some lazy evaluation of the inner lists). The
parMap function does the same thing as
map, but does it in parallel. The first argument,
rseq, indicates a “strategy”: it basically says not to be lazy about the calculation and that
parMap should actually compute the values.
GHC does take some convincing to actually run the code in parallel. It must be compiled and run like this:
ghc -O2 --make -with-rtsopts="-N8" -threaded mandel-multi.hs ./mandel-multi
-threaded argument tells GHC to include the multi-thread stuff. The
-N8 argument tells the runtime to use eight threads.)
The speedup was 7.1x over the single-threaded version..
The size of the piece of work that is given to each thread is important.
If the chunk is too small, then the threads spend an excessive time communicating with each other. As an example, I modified the Python implementation so it gives each pixel to a worker (instead of each row) in
mandel-parallel-tiny.py. This implementation took around 30% longer than the above (which was actually better than I thought, presumably because the amount of work required to calculate each pixel is relatively large).
If the chunk size is too large, we risk not using our available cores effectively. For example, if we split our job into exactly one chunk for each core, we might find that one of the jobs completes earlier (because it was easier, or that core didn't get interrupted by other work). That will leave the other cores working to finish while one sits idle.
See also my parallel Haskell example for another look at the chunk size given to each thread/process.