Not logged in. Login

Example: Concurrent Haskell Fibonacci

This example involves calculating Fibonacci numbers recursively. I'm doing it the stupid way that leads to an explosion of recursive calls, but that's nice specifically because it takes some time to compute. (Also known as the worst algorithm in the world.)

Since we're going to be writing parallel code, we'll need Haskell's parallel module:

cabal install parallel

I did the following tests with GHC 8.6.5 on a six-core (twelve thread) processor with Ubuntu Linux.

For each test, the code was compiled and executed with these commands:

ghc -O2 -with-rtsopts="-N12" -threaded concurrent.hs
time ./concurrent

You can have a look at the complete code used; I have excerpted key chunks below.

Non-Parallel Version

The first version of the code, fibSer, is naïve and non-parallel. The hope is that we can do the work in a quarter of the time using four cores (or an eighth the time using eight threads).

Calculating fibSer 46 took 23 seconds without the -with-rtsopts compilation flag.

Adding the -with-rtsopts switch gives GHC permission to use many cores: it parallelized somehow and used two cores (showed ~200% processor usage) but took 24 seconds to do the same calculation. Not a good start.

Parallel Version 1: very parallel

The first parallel version of the function, fibPar1, always calculates both recursive calls in parallel:

fibPar1 n = (term1 `par` term2) `pseq` (term1 + term2)
      term1 = fibPar1 (n-1)
      term2 = fibPar1 (n-2)

This version apparently used twelve threads (~1200% processor usage) and took 10 seconds to do the same calculations. That's a disappointing speedup: it's actually much longer than the non-parallel version when you count the amount of processor being used. The problem here is too much parallelism: too many operations are sent off to separate threads, making the system spend a lot of time synchronizing everything.

Parallel Version 2: minimally parallel

So, for the next version of the function, fibPar2, we'll try only parallelizing the top level call:

fibPar2 n = (term1 `par` term2) `pseq` (term1 + term2)
      term1 = fibSer (n-1)
      term2 = fibSer (n-2)

Notice that the “recursive” calls there are to fibSer. For fibPar2 46, this code calculates fibSer 45 and fibSer 44 in parallel, but each of those calculations is done single-threaded.

This version took 15 seconds: faster but still close to the original version. This code isn't parallel enough. It isn't creating enough threads to really keep the cores busy.

Parallel Version 3: appropriately parallel

So, for the next version of the function, fibPar3, we will try to be parallel when appropriate:

cutoff = 20
fibPar3 n
  | n<cutoff     = fibSer (n-1) + fibSer (n-2)
  | n>=cutoff    = (term1 `par` term2) `pseq` (term1 + term2)
      term1 = fibPar3 (n-1)
      term2 = fibPar3 (n-2)

This code does the same thing as fibPar1 for values of n larger than cutoff, but falls back to the non-parallel version of the function for smaller n. That is, it is parallel when it makes sense, but not when it doesn't

The value for cutoff was determined by experiment, but values 10–30 all seem to be just about as good.

This version took 3.5 seconds. We finally have the speedup we should get for four cores. The actual process is showing ~1200% processor usage: it is firing off 12 threads, but losing a lot of that to overhead, thread contention in hyperthreading, etc.

Chunk Size Exploration

After the success of fibPar3, it occurred to me that it actually generalized to mimic the other versions of the code. With cutoff==0, the function is equivalent to fibPar1; with cutoff==n, it is fibPar2; and with cutoff>=n+1, it is fibSer.

So, I decided to test out the more generalized version and time how long the calculation takes with each cutoff. I have put the code for these experiments in a separate file to avoid clutter. This was run in the same conditions described above.

Here are the times taken for each cutoff value: chart of cutoff timing results

The Lesson

Writing parallel code and using all of your cores can make things faster, but you can't just blindly throw concurrency at some code and expect it to do magic.

Updated Mon Aug. 31 2020, 09:00 by ggbaker.