Lab Exercise 13
Provided code: lab13.zip.
There are no marks associated with this "lab". It cannot be submitted and will not be marked. It is not a bonus activity. You will not be tested on it. It simply exists.
You have been provided with a file lab13.h with headers for the functions described, and a partially-completed lab13.cpp. The tests.cpp contains reasonably thorough tests of the functions as described, and timing.cpp contains code to time the implementations.
Note: Async Calls
In these questions, you'll use std::async to start function calls that solve pieces of the problem, each in a different thread. The C++ standard doesn't guarantee that each async call runs in its own thread: an implementation may create a limited number of threads and queue the async tasks to run in them (or it may create no extra threads at all if that isn't possible, say on an embedded system, but we're not in that case).
You may assume that the size of the problem (array size or b-a
, respectively) is divisible by the number of tasks. Test sizes have been chosen so they are divisible by likely choices you might make: 2, 4, 6, 8, 12, 24.
A constant TASKS has been defined in lab13.cpp: you're free to change it. Twice your number of physical cores is probably a good default.
These will probably be useful commands:
g++ -Wall -Wpedantic -std=c++17 -march=haswell -O1 lab13.cpp tests.cpp
g++ -Wall -Wpedantic -std=c++17 -march=haswell -O3 lab13.cpp timing.cpp
Note: Timing Threads
In previous code, to count the running time, we used clock_gettime like this:
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
That measures the amount of processor time used by the process: number of CPU seconds. If there are two concurrent threads each running for one second, measuring this way will record two seconds.
In timing.cpp here, clock_gettime is called like this:
clock_gettime(CLOCK_REALTIME, &start);
That measures real time (clock time). We're hoping to use approximately the same CPU time, but wait a fraction as long by using multiple cores. Here, two threads each running for one second will count as one second of time, which is what we're interested in.
The overall timing with CLOCK_REALTIME will likely be noisier: it's much more sensitive to whatever else is happening on your computer, but it's what we need to measure here.
Dot Product
The dot_single_c from previous exercises has been included.
Write a function dot_single_threaded that produces the same result (possibly with different rounding error), but with a multi-threaded calculation.
Your dot_single_threaded should make TASKS calls to dot_single_c with std::async to start parts in separate threads. You will need to store the std::future values it returns (in an array or vector).
Then call .get on each std::future to retrieve the results as they finish, and sum them to get the final result.
How much speedup do you get? Try it with and without -funsafe-math-optimizations. ❓
Hailstone
When timing the hailstone_length function in lab 3, I did it by calling the function on values 1 to n and summing the results to get a "total" hailstone length for all values in that range. Let's use that as a calculation-heavy problem that won't be limited by memory speed.
The provided hailstone_total(a, b) will calculate the total hailstone length for values a to b-1.
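For reference, the provided hailstone_total probably looks something like the sketch below. This is my reconstruction, not the actual lab13 code: in particular, lab 3's exact counting convention (for instance, whether the starting value itself is counted in the length) may differ.

```cpp
#include <cstdint>

// Length of the hailstone (Collatz) sequence starting at n,
// counting n itself as the first term. (Assumed convention.)
uint64_t hailstone_length(uint64_t n) {
    uint64_t length = 1;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        length++;
    }
    return length;
}

// Total hailstone length over every starting value in [a, b).
uint64_t hailstone_total(uint64_t a, uint64_t b) {
    uint64_t total = 0;
    for (uint64_t n = a; n < b; n++) {
        total += hailstone_length(n);
    }
    return total;
}
```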
Write a function hailstone_total_threaded that makes TASKS calls to std::async(hailstone_total, ...), then .gets the results and returns their sum.
How much speedup do you get? ❓
Number of Tasks
Try running timing.cpp with different numbers of TASKS:
- the exact number of threads your processor is capable of running, we'll call it \(t\). (The number beside "CPU(s)" in lscpu output.)
- \(10t\)
- \(1000t\)
Questions
- How much did multi-threading change the running time of the dot product calculation? Did that change with -funsafe-math-optimizations?
- Same question, for the hailstone calculation.
- How did the number of threads affect the running time? Did you see any speedup/slowdown with more tasks than your processor could run concurrently?
Answers
I did these tests on a 16-core, 24-thread processor.
- For me, the dot product code sped up by about 4.1 times with -O3 but only around 2.2 times when I added -funsafe-math-optimizations. The threaded result took about the same amount of time with either compilation: I think I'm seeing the limit of my memory bandwidth. Your results will vary depending on processor speed, number of cores, and memory speed.
- I got a speedup around 14.1 here and that's closer to what I'd expect: around one thread worth of work done per core. I might have expected a little more than one (because of hyperthreading) but didn't see it on this workload.
- For me, the timing between 24 and 240 tasks was indistinguishable. For 12000 tasks, everything got much slower: trying to queue too many tasks cost more than the calculation itself.
Code Solutions
My code for the two threaded problems:
float dot_single_threaded(float* arr1, float* arr2, uint64_t length) {
    assert(length % TASKS == 0);
    float total = 0.0;
    std::vector<std::future<float>> partials;
    uint64_t p = length / TASKS;
    // Start TASKS async calls, each computing the dot product of one slice.
    for (uint64_t i = 0; i < TASKS; i++) {
        partials.push_back(std::async(dot_single_c, arr1 + i*p, arr2 + i*p, p));
    }
    // Collect each partial result (blocking until it's ready) and sum.
    for (uint64_t i = 0; i < TASKS; i++) {
        total += partials[i].get();
    }
    return total;
}
uint64_t hailstone_total_threaded(uint64_t a, uint64_t b) {
    assert((b-a) % TASKS == 0);
    uint64_t total = 0;
    uint64_t step = (b-a) / TASKS;
    std::vector<std::future<uint64_t>> partials;
    // Start TASKS async calls, each totalling one sub-range of [a, b).
    for (uint64_t i = 0; i < TASKS; i++) {
        partials.push_back(std::async(hailstone_total, a + i*step, a + i*step + step));
    }
    // Collect each partial total (blocking until it's ready) and sum.
    for (uint64_t i = 0; i < TASKS; i++) {
        total += partials[i].get();
    }
    return total;
}