Lab Exercise 11
Provided code: lab11.zip.
As usual, you have been provided with these files:
- lab11.h with headers for the functions described,
- lab11c.cpp with analogous C implementations,
- tests.cpp with reasonably thorough tests of the functions as described.
We're switching to C++ this week so we can use vectorclass.
Now that we know about the x86 vector/SIMD instructions, let's use them with problems we're familiar with.
In lab11.S, copy your dot product and polynomial evaluation implementations from last week.
Write functions that produce the same results but use the SIMD instructions, including map_poly_single_vec. You can assume the array length is divisible by 4 (for double-precision) or 8 (for single-precision).
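To make the data flow of a vectorized dot product concrete, here is a minimal sketch in portable C++ that models four double-precision lanes with an ordinary array. It is not the assembly you are asked to write, and the function name dot_double_lanes is hypothetical. It assumes the length is divisible by 4, as the exercise allows.

```cpp
#include <array>
#include <cstddef>

// Hypothetical name: a scalar model of a 4-lane (256-bit, double-precision)
// SIMD dot product. Assumes n is divisible by 4.
double dot_double_lanes(const double* a, const double* b, std::size_t n) {
    // One partial sum per lane, like one vector register used as an accumulator.
    std::array<double, 4> acc = {0.0, 0.0, 0.0, 0.0};

    for (std::size_t i = 0; i < n; i += 4) {
        // In real SIMD code this inner loop is a single multiply and a single
        // add (or one fused multiply-add) over all four lanes at once.
        for (std::size_t lane = 0; lane < 4; ++lane) {
            acc[lane] += a[i + lane] * b[i + lane];
        }
    }

    // Horizontal add: combine the four partial sums once, after the loop.
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

Note that the partial sums are added in a different order than in a simple one-element-at-a-time loop, so the last few bits of the result can differ for general floating point inputs.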
You will need the vbroadcastsd and vbroadcastss instructions (which weren't mentioned in lecture), which broadcast one double- or single-precision value, respectively, to all of the fields in a vector register.
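The role of the broadcast can be sketched in portable C++: each scalar coefficient is copied into every lane once, before the loop, so the same coefficient can be combined with four different x values at a time. This is only a model under stated assumptions. The cubic form a + b*x + c*x^2 + d*x^3, the signature, and the names Lanes4, broadcast, and map_poly_lanes are all hypothetical; use the polynomial and headers given in lab11.h.

```cpp
#include <array>
#include <cstddef>

// Stands in for one 256-bit vector register holding four doubles.
using Lanes4 = std::array<double, 4>;

// Models a broadcast: one scalar copied into every lane.
Lanes4 broadcast(double v) { return {v, v, v, v}; }

// Hypothetical signature and polynomial (an assumption, not from lab11.h):
// out[i] = a + b*x + c*x^2 + d*x^3 with x = in[i]. Assumes n divisible by 4.
void map_poly_lanes(const double* in, double* out, std::size_t n,
                    double a, double b, double c, double d) {
    // Broadcast each coefficient once, outside the loop.
    const Lanes4 va = broadcast(a);
    const Lanes4 vb = broadcast(b);
    const Lanes4 vc = broadcast(c);
    const Lanes4 vd = broadcast(d);

    for (std::size_t i = 0; i < n; i += 4) {
        // Each lane is independent: in real SIMD code, every step of
        // Horner's rule below is one instruction over all four lanes.
        for (std::size_t lane = 0; lane < 4; ++lane) {
            const double x = in[i + lane];
            out[i + lane] =
                ((vd[lane] * x + vc[lane]) * x + vb[lane]) * x + va[lane];
        }
    }
}
```

Unlike the dot product, there is no horizontal step here: every output element depends on exactly one input element.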
SIMD with Vectorclass
Create a C++ file lab11_vc.cpp. In it, repeat the above, creating C++ implementations (including map_poly_single_vc) that use the vectorclass library to implement this logic with SIMD instructions accessed from C++.
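As a rough illustration of the vectorclass style, here is a sketch of a double-precision dot product using Vec4d. The function name dot_double_vc_sketch is hypothetical and the signature is an assumption; match the headers in lab11.h. The block uses the real library if vectorclass.h is on the include path. Otherwise it falls back to a tiny stand-in type that mimics only the few Vec4d operations used here, so that the example still compiles on its own. The stand-in is not part of the library.

```cpp
#include <cstddef>

#if defined(__has_include)
#  if __has_include("vectorclass.h")
#    include "vectorclass.h"  // the real library, when available
#    define SKETCH_HAVE_VCL 1
#  endif
#endif

#ifndef SKETCH_HAVE_VCL
// Stand-in for illustration only: same spelling as the Vec4d operations
// used below, implemented with plain scalar loops.
struct Vec4d {
    double v[4];
    Vec4d() : v{0.0, 0.0, 0.0, 0.0} {}
    Vec4d(double s) : v{s, s, s, s} {}  // broadcast one value to all lanes
    Vec4d& load(const double* p) {
        for (int i = 0; i < 4; ++i) v[i] = p[i];
        return *this;
    }
};
inline Vec4d operator*(const Vec4d& a, const Vec4d& b) {
    Vec4d r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}
inline Vec4d& operator+=(Vec4d& a, const Vec4d& b) {
    for (int i = 0; i < 4; ++i) a.v[i] += b.v[i];
    return a;
}
inline double horizontal_add(const Vec4d& a) {
    return (a.v[0] + a.v[1]) + (a.v[2] + a.v[3]);
}
#endif

// Hypothetical name and signature. Assumes n is divisible by 4.
double dot_double_vc_sketch(const double* a, const double* b, std::size_t n) {
    Vec4d acc(0.0);  // four partial sums, all starting at zero
    for (std::size_t i = 0; i < n; i += 4) {
        Vec4d x, y;
        x.load(a + i);  // load four consecutive doubles from each array
        y.load(b + i);
        acc += x * y;   // lane-wise multiply, then lane-wise accumulate
    }
    return horizontal_add(acc);  // sum the four lanes into one double
}
```

The single-precision versions would follow the same pattern with Vec8f, which holds eight floats per vector.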
timing.cpp provides some timing tests on reasonably-sized arrays. Have a look. How does your code compare to what the compiler wrote? (Use -O3 to give the compiler its best chance.) ❓
This lab exercise is a little shorter than usual, to leave some time for you to get some of the Mini-Project done. Do that.
- Relative to your assembly code last week, how much did the "dot product" and "map polynomial" implementations speed up when using the vector instructions?
- On the two problems, what was the relative speedup of vectorized implementations on single-precision floating point values, over double-precision?
- When timing your assembly (and vectorclass) implementations and the implementations created by the compiler, you likely saw that for the "dot product" problem, the C implementation performed more like the non-vectorized assembly. For the "map polynomial" problem, the C implementation performed more like the vectorized assembly. Why was the compiler able to vectorize one but not the other?
Submit your work to Lab 11 in CourSys.