Lab Exercise 13
Provided code: lab13.zip.
There are no marks associated with this "lab". It cannot be submitted and will not be marked. It is not a bonus activity. You will not be tested on it. It simply exists.
You have been provided with a file lab13.h with headers for the functions described, and a partially-completed lab13.cpp. The tests.cpp contains reasonably thorough tests of the functions as described, and timing.cpp contains code to time the implementations.
Note: Async Calls
In these questions, you'll use std::async to start function calls that solve pieces of the problem, each in a different thread. The C++ standard doesn't guarantee that each async call runs in its own thread: an implementation may create a limited number of threads and queue the async tasks to run in them (or it may create no extra threads at all if that isn't possible, say on an embedded system, but we're not in that case).
You may assume that the size of the problem (array size or b-a
, respectively) is divisible by the number of tasks. Test sizes have been chosen so they are divisible by likely choices you might make: 2, 4, 6, 8, 12, 24.
A constant TASKS has been defined in lab13.cpp: you're free to change it. Twice your number of physical cores is probably a good default.
These will probably be useful commands:
g++ -Wall -Wpedantic -std=c++17 -march=haswell -O1 lab13.cpp tests.cpp
g++ -Wall -Wpedantic -std=c++17 -march=haswell -O3 lab13.cpp timing.cpp
Note: Timing Threads
In previous code, to count the running time, we used clock_gettime like this:
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
That measures the amount of processor time used by the process: number of CPU seconds. If there are two concurrent threads each running for one second, measuring this way will record two seconds.
In timing.cpp here, clock_gettime is called like this:
clock_gettime(CLOCK_REALTIME, &start);
That measures real time (clock time). We're hoping to use approximately the same CPU time, but wait a fraction as long by using multiple cores. Here, two threads each running for one second will count as one second of time, which is what we're interested in.
The overall timing with CLOCK_REALTIME will likely be noisier: it's much more sensitive to whatever else is happening on your computer, but it's what we need to measure here.
Dot Product
The dot_single_c from previous exercises has been included.
Write a function dot_single_threaded that produces the same result (possibly with different rounding error), but with a multi-threaded calculation.
Your dot_single_threaded should make TASKS calls to dot_single_c with std::async to start parts in separate threads. You will need to store the std::future values it returns (in an array or vector).
Then call .get on each std::future to retrieve the results as they finish, and sum them to get the final result.
How much speedup do you get? Try it with and without -funsafe-math-optimizations. ❓
Hailstone
When timing the hailstone_length function in lab 3, I did it by calling the function on values 1 to n and summing the results to get a "total" hailstone length for all values in that range. Let's use that as a calculation-heavy problem that won't be limited by memory speed.
The provided hailstone_total(a, b) will calculate the total hailstone length for values a to b-1.
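For reference, the provided hailstone_total probably looks something like the sketch below. This is my reconstruction, not the actual lab13 code: in particular, lab 3's exact counting convention (for instance, whether the starting value itself is counted in the length) may differ.

```cpp
#include <cstdint>

// Length of the hailstone (Collatz) sequence starting at n,
// counting n itself as the first term. (Assumed convention.)
uint64_t hailstone_length(uint64_t n) {
    uint64_t length = 1;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        length++;
    }
    return length;
}

// Total hailstone length over every starting value in [a, b).
uint64_t hailstone_total(uint64_t a, uint64_t b) {
    uint64_t total = 0;
    for (uint64_t n = a; n < b; n++) {
        total += hailstone_length(n);
    }
    return total;
}
```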
Write a function hailstone_total_threaded that makes TASKS calls to std::async(hailstone_total, ...), then .gets the results and returns their sum.
How much speedup do you get? ❓
Number of Tasks
Try running timing.cpp with different numbers of TASKS:
- the exact number of threads your processor is capable of running, we'll call it \(t\). (The number beside "CPU(s)" in lscpu output.)
- \(10t\)
- \(1000t\)
Questions
- How much did multi-threading change the running time of the dot product calculation? Did that change with -funsafe-math-optimizations?
- Same question, for the hailstone calculation.
- How did the number of threads affect the running time? Did you see any speedup/slowdown with more tasks than your processor could run concurrently?
Answers
I did these tests on a 16-core, 24-thread processor.
- For me, the dot product code sped up by about 4.1 times with -O3 but only around 2.2 times when I added -funsafe-math-optimizations. The threaded result took about the same amount of time with either compilation: I think I'm seeing the limit of my memory bandwidth. Your results will vary depending on processor speed, number of cores, and memory speed.
- I got a speedup around 14.1 here and that's closer to what I'd expect: around one thread worth of work done per core. I might have expected a little more than one (because of hyperthreading) but didn't see it on this workload.
- For me, the timing between 24 and 240 tasks was indistinguishable. For 12000 tasks, everything got much slower: trying to queue too many tasks cost more than the calculation itself.
Code Solutions
My code for the two threaded problems:
float dot_single_threaded(float* arr1, float* arr2, uint64_t length) {
    assert(length % TASKS == 0);
    float total = 0.0;
    std::vector<std::future<float>> partials;
    uint64_t p = length / TASKS;
    // Start TASKS async calls, each computing the dot product of one slice.
    for (uint64_t i = 0; i < TASKS; i++) {
        partials.push_back(std::async(dot_single_c, arr1 + i*p, arr2 + i*p, p));
    }
    // Collect each partial result (blocking until it's ready) and sum.
    for (uint64_t i = 0; i < TASKS; i++) {
        total += partials[i].get();
    }
    return total;
}
uint64_t hailstone_total_threaded(uint64_t a, uint64_t b) {
    assert((b-a) % TASKS == 0);
    uint64_t total = 0;
    uint64_t step = (b-a) / TASKS;
    std::vector<std::future<uint64_t>> partials;
    // Start TASKS async calls, each totalling one sub-range of [a, b).
    for (uint64_t i = 0; i < TASKS; i++) {
        partials.push_back(std::async(hailstone_total, a + i*step, a + i*step + step));
    }
    // Collect each partial total (blocking until it's ready) and sum.
    for (uint64_t i = 0; i < TASKS; i++) {
        total += partials[i].get();
    }
    return total;
}