Example: Parallel Computation with Pools
To demonstrate some of the tools available for parallel computation, I decided to try out my Mandelbrot set example again, this time doing as much as possible in parallel. The Mandelbrot calculation is embarrassingly parallel, so it was very easy to modify. Since almost all of the work is parallelizable, Amdahl's law predicts a good speedup, approximately proportional to the number of cores (or hardware threads).
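For reference, Amdahl's law says that if a fraction p of the work can be parallelized, the best possible speedup on N cores is

S(N) = 1 / ((1 - p) + p / N)

For example, assuming p = 0.99 (an illustrative value, not something I measured), four cores give S ≈ 3.9x, which is about the best scaling we can hope for here.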
In each case, this was done with the thread pool pattern: creating a bunch of threads (or processes) that pick jobs from a queue and execute them. All of the tests were done on an Intel Ivy Bridge 4 core/8 thread processor.
Python Implementation
The Python multiprocessing module makes working with process pools easy: you create a process pool and just start feeding it tasks.
My parallel implementation in mandel-parallel.py adds relatively little to my original implementation. I mostly just had to split the calculation of a single row off into a separate function that could be called by a pool process.
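To give a sense of that structure, here is a minimal sketch of the row-per-task approach (the function names and image dimensions are my own illustrative choices, not the contents of mandel-parallel.py):

import multiprocessing

WIDTH, HEIGHT = 800, 600
MAX_ITER = 256

def mandel(x, y):
    # standard escape-time calculation for a single point
    c = complex(x, y)
    z = 0j
    for i in range(MAX_ITER):
        z = z*z + c
        if abs(z) > 2:
            return i
    return MAX_ITER

def mandel_row(row):
    # one unit of work for the pool: every pixel in one row
    y = -1.5 + 3.0*row/HEIGHT
    return [mandel(-2.0 + 3.0*col/WIDTH, y) for col in range(WIDTH)]

if __name__ == '__main__':
    # by default the pool creates one worker process per core
    with multiprocessing.Pool() as pool:
        image = pool.map(mandel_row, range(HEIGHT))

Here pool.map hands each row index to a worker process and collects the results in order, so the surrounding program barely changes.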
The speedup was 3.9x with four cores.
Also note that there is absolutely no problem using code from a Cython module in a multiprocessing job: you just have to import the module and call the function. As long as you have pure Cython functions, this should be very fast and easy.
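For example, if the escape-time calculation lived in a (hypothetical) compiled Cython module named mandel_cy, the pool code would look essentially the same:

import multiprocessing
import mandel_cy  # hypothetical compiled Cython module exposing mandel(x, y)

WIDTH, HEIGHT = 800, 600

def mandel_row(row):
    # plain Python wrapper handed to the pool; the heavy lifting is in Cython
    y = -1.5 + 3.0*row/HEIGHT
    return [mandel_cy.mandel(-2.0 + 3.0*col/WIDTH, y) for col in range(WIDTH)]

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        image = pool.map(mandel_row, range(HEIGHT))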
Results with PyPy were a little better: close to a 4.5x speedup (and much faster baseline time).
Java Implementation
The Java implementation comes with a little more Java-ish syntax overhead, but can do thread pools relatively easily using the ThreadPoolExecutor class. The code that does the work must be in a separate class that implements Runnable.
My implementation MandelParallel.java does all of this. The speedup was 5.8x with the parallel implementation.
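For a sense of the shape of the code, here is a minimal sketch of the same row-per-task pattern (class and method names are my own; this is not the actual MandelParallel.java). Executors.newFixedThreadPool returns a ThreadPoolExecutor under the hood:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MandelPoolSketch {
    static final int WIDTH = 800, HEIGHT = 600, MAX_ITER = 256;
    static final int[][] image = new int[HEIGHT][WIDTH];

    // the Runnable: computes every pixel in one row
    static class RowTask implements Runnable {
        private final int row;
        RowTask(int row) { this.row = row; }
        public void run() {
            double y = -1.5 + 3.0 * row / HEIGHT;
            for (int col = 0; col < WIDTH; col++) {
                image[row][col] = escapeTime(-2.0 + 3.0 * col / WIDTH, y);
            }
        }
    }

    // standard escape-time calculation for one point
    static int escapeTime(double x0, double y0) {
        double x = 0, y = 0;
        int i = 0;
        while (x * x + y * y <= 4 && i < MAX_ITER) {
            double xt = x * x - y * y + x0;
            y = 2 * x * y + y0;
            x = xt;
            i++;
        }
        return i;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (int row = 0; row < HEIGHT; row++) {
            pool.execute(new RowTask(row));   // queue one task per row
        }
        pool.shutdown();                      // no new tasks; let the queue drain
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}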
Haskell Implementation
The Haskell parallel package provides some convenient tools for parallelizing functional code. I used it in my mandel-parallel.hs. (Compare this with my other parallel Haskell example.)
The Haskell code was the easiest to modify to run in parallel: I simply replaced a map with parMap rseq (and used deepseq to kill off some lazy evaluation of the inner lists). The parMap function does the same thing as map, but does it in parallel. The first argument, rseq, indicates a “strategy”: it basically says not to be lazy about the calculation and that parMap should actually compute each value (only to weak head normal form, though, which is why deepseq is still needed to force the inner lists).
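As a sketch of what that looks like (with my own mandelRow and image dimensions; this is not the actual mandel-parallel.hs):

import Control.Parallel.Strategies (parMap, rseq)
import Control.DeepSeq (force)

maxIter :: Int
maxIter = 256

-- standard escape-time count for one point
mandel :: Double -> Double -> Int
mandel x0 y0 = go 0 0 0
  where
    go x y i
      | i >= maxIter || x*x + y*y > 4 = i
      | otherwise = go (x*x - y*y + x0) (2*x*y + y0) (i + 1)

-- one row of the image; force (from deepseq) makes sure the whole row
-- is computed inside the spark, not lazily later
mandelRow :: Int -> [Int]
mandelRow row = force [mandel (scaleX c) (scaleY row) | c <- [0 .. 799]]
  where
    scaleX c = -2.0 + 3.0 * fromIntegral c / 800
    scaleY r = -1.5 + 3.0 * fromIntegral r / 600

main :: IO ()
main = print (sum (map sum rows))
  where
    -- parMap rseq: like map, but each row is evaluated in parallel
    rows = parMap rseq mandelRow [0 .. 599]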
GHC does take some convincing to actually run the code in parallel. It must be compiled and run like this:
ghc -O2 --make -with-rtsopts="-N8" -threaded mandel-parallel.hs
./mandel-parallel
(The -threaded argument tells GHC to include the multi-thread stuff. The -N8 argument tells the runtime to use eight threads.)
The speedup was 7.1x over the single-threaded version.
Chunk Size
The size of the piece of work that is given to each thread is important.
If the chunk is too small, then the threads spend an excessive amount of time communicating with each other. As an example, I modified the Python implementation so it gives each pixel to a worker (instead of each row) in mandel-parallel-tiny.py. This implementation took around 30% longer than the above (which was actually better than I expected, presumably because the amount of work required to calculate each pixel is relatively large).
If the chunk size is too large, we risk not using our available cores effectively. For example, if we split the job into exactly one chunk per core, we might find that one of the jobs completes earlier (because it was easier, or because that core didn't get interrupted by other work). That leaves the remaining cores working to finish while one sits idle.
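To make the trade-off concrete in the Python version, here is a hedged sketch of the per-pixel variant (my own names, not the actual mandel-parallel-tiny.py); Pool.map's chunksize argument is the knob that controls how many of these tiny tasks are handed to a worker at once:

import multiprocessing

WIDTH, HEIGHT, MAX_ITER = 800, 600, 256

def mandel_pixel(coords):
    # one unit of work is now a single pixel
    row, col = coords
    c = complex(-2.0 + 3.0*col/WIDTH, -1.5 + 3.0*row/HEIGHT)
    z = 0j
    for i in range(MAX_ITER):
        z = z*z + c
        if abs(z) > 2:
            return i
    return MAX_ITER

if __name__ == '__main__':
    coords = [(r, c) for r in range(HEIGHT) for c in range(WIDTH)]
    with multiprocessing.Pool() as pool:
        # chunksize=1 maximizes communication overhead; a very large value
        # risks leaving cores idle near the end of the job
        pixels = pool.map(mandel_pixel, coords, chunksize=1)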
See also my parallel Haskell example for another look at the chunk size given to each thread/process.