SIMD Code Generation in the LLVM Back-End
Expressing SIMD operations in LLVM IR
LLVM IR has some very natural representations for SIMD processing.
- The LLVM vector type naturally specifies a set of "multiple data" elements that can be processed together. `<4 x i32>` is the LLVM vector type for four 32-bit integers treated as a single (128-bit) value. `<8 x float>` is the LLVM vector type for eight 32-bit floating-point values treated as a single (256-bit) value. `<64 x i8>` is the LLVM vector type for sixty-four 8-bit integers treated together as a single (512-bit) value.
- Most SIMD vertical operations are expressed very naturally in LLVM IR. `op <M x iN> %x, %y` is a SIMD vertical operation yielding an `<M x iN>` result for `(iN, iN) -> iN` operators: `add`, `and`, `sub`, `or`, `mul`, `udiv`, `urem`, `sdiv`, `srem`, `shl`, `lshr`, ...
- This also applies to LLVM intrinsics such as `llvm.ctpop`.
- Likewise, `op <M x float> %x, %y` is a SIMD vertical operation yielding an `<M x float>` result for `(float, float) -> float` operators.
- SIMD comparison to produce masks of all 1s or all 0s involves two IR operations: `%c = icmp ult <M x iN> %a, %b` gives a `<M x i1>` result, and `%d = sext <M x i1> %c to <M x iN>` sign-extends each one-bit field to an N-bit field of all 1s or all 0s. BUT, these semantics are not supported correctly by current code generators!
- Possible project opportunity for this course: improve back-end code generators to support masking correctly in all cases.
- Another possible project: add `icmpmsk` instructions to LLVM IR.
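For instance, the two-operation mask idiom spelled out as a complete function (the name `mask4i32` is just for illustration):

```llvm
define <4 x i32> @mask4i32(<4 x i32> %a, <4 x i32> %b) {
entry:
  %c = icmp ult <4 x i32> %a, %b        ; per-element compare -> <4 x i1>
  %d = sext <4 x i1> %c to <4 x i32>    ; widen each bit to all 1s or all 0s
  ret <4 x i32> %d
}
```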
- Horizontal SIMD expansion is expressed with extension operations: `sext` (sign extend), `zext` (zero extend), and `fpext` (floating-point extend).
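For example, sign extension applies element-wise to vectors, widening each 8-bit field to 16 bits (`zext` and `fpext` behave analogously; the function name is ours):

```llvm
define <8 x i16> @expand8i8(<8 x i8> %a) {
entry:
  %w = sext <8 x i8> %a to <8 x i16>    ; sign-extend each field to 16 bits
  ret <8 x i16> %w
}
```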
- Horizontal SIMD packing is supported by truncation operations: `trunc` (integer truncation) and `fptrunc` (floating-point truncation).
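A small illustrative example of packing, keeping only the low 8 bits of each 16-bit field (hypothetical function name):

```llvm
define <8 x i8> @pack8i16(<8 x i16> %a) {
entry:
  %t = trunc <8 x i16> %a to <8 x i8>   ; keep the low 8 bits of each field
  ret <8 x i8> %t
}
```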
- Arbitrary field movement is supported by the `shufflevector` operation.
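As one example of field movement, this sketch reverses the fields of a `<4 x i32>` vector; the shuffle mask selects source elements 3, 2, 1, 0 in that order (function name is ours):

```llvm
define <4 x i32> @rev4i32(<4 x i32> %a) {
entry:
  %r = shufflevector <4 x i32> %a, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x i32> %r
}
```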
With these facilities, language implementers targeting SIMD operations can, in general, express the desired SIMD computations quite easily as LLVM IR.
The Back-End SIMD Compilation Problem
The richness and expressivity of LLVM IR creates challenges for the writers of compiler back-ends, however. Most target architectures support very few of the actual vector operations expressible in LLVM IR. The problem of choosing good machine-specific instructions for particular IR operations is called instruction selection.
Type Legalization
One of the key initial steps in compiling LLVM IR for a particular back-end is the process of type legalization. Essentially, the "legal" types for a given back-end are the ones that are directly supported by the instruction set architecture. We can thus expect `i32` to be a legal type on most architectures, while `i23` is not. In the case of SIMD, `<4 x i32>` will quite often be considered legal (x86 SSE registers, ARM Neon registers, PowerPC AltiVec registers), while `<7 x i11>` and `<64 x i64>` are not.
Type legalization is the process of transforming LLVM IR so that it only uses the "legal" types that the back-end compiler writer chooses to support.
Fortunately, the process of type legalization is mostly generic. That is, given a set of legal LLVM types, the operation to transform IR to use these types should be machine-independent. In general, three techniques are used:
- splitting: split vector types that are too big into smaller ones
- widening: add elements to vector types to reach a legal vector size
- scalarizing: turn vector instructions into a sequence of instructions on individual fields.
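As a conceptual illustration of splitting, here is a hand-written sketch (not actual legalizer output) of how an illegal `<8 x i32>` add could be rewritten as two legal `<4 x i32>` adds on a target with 128-bit vector registers:

```llvm
define <8 x i32> @add8i32_split(<8 x i32> %a, <8 x i32> %b) {
entry:
  ; extract the low and high halves of each operand
  %alo = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %ahi = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %blo = shufflevector <8 x i32> %b, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %bhi = shufflevector <8 x i32> %b, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  ; two legal 128-bit adds
  %rlo = add <4 x i32> %alo, %blo
  %rhi = add <4 x i32> %ahi, %bhi
  ; concatenate the halves back into a 256-bit result
  %r = shufflevector <4 x i32> %rlo, <4 x i32> %rhi, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i32> %r
}
```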
Project area: SWARization.
See the SelectionDAG LegalizeTypes Phase.
Operation Legalization
Once we have IR confined to a suitable set of types, the next phase is to "legalize" operations. This is to transform the IR to only use the operations actually supported by the underlying architecture.
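As an example of an operation needing legalization: x86 has no SIMD integer divide instruction, so even though `<4 x i32>` is a legal type, a vector `sdiv` like the one below is typically expanded into four scalar divides (an assumption about the default x86 back-end; the function name is ours):

```llvm
define <4 x i32> @div4i32(<4 x i32> %a, <4 x i32> %b) {
entry:
  %0 = sdiv <4 x i32> %a, %b   ; no SSE/AVX instruction for this; scalarized
  ret <4 x i32> %0
}
```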
Exploring Back-End Translations with llc
LLVM has a handy back-end compiler, `llc`, that works directly with LLVM IR, translating it to machine code for your desired target architecture. It can even be built for cross-compilation, e.g., generating ARM code on an x86 platform.
For example, consider the following `add16i8` function.
```llvm
define <16 x i8> @add16i8(<16 x i8> %a, <16 x i8> %b) {
entry:
  %0 = add <16 x i8> %a, %b
  ret <16 x i8> %0
}
```
By putting this function in a file `add16i8.ll` and compiling it with `llc`, we can see the results of code generation.
```
llc -filetype=asm add16i8.ll
```
With the `-filetype=asm` specification, we get assembler output for our target machine, generated in the `add16i8.s` file.
```asm
	.text
	.file	"add16i8.ll"
	.globl	add16i8
	.align	16, 0x90
	.type	add16i8,@function
add16i8:                                # @add16i8
	.cfi_startproc
# BB#0:                                 # %entry
	paddb	%xmm1, %xmm0
	retq
.Lfunc_end0:
	.size	add16i8, .Lfunc_end0-add16i8
	.cfi_endproc

	.section	".note.GNU-stack","",@progbits
```
Now, let's see what happens if we compile the `add32i8` function.
```llvm
define <32 x i8> @add32i8(<32 x i8> %a, <32 x i8> %b) {
entry:
  %0 = add <32 x i8> %a, %b
  ret <32 x i8> %0
}
```
```
llc -filetype=asm add32i8.ll
```
With the default x86 architecture as the target, the 256-bit vectors are first split into two, so we see the generated code using 2 instructions to implement the 256-bit add (boilerplate removed).
```asm
add32i8:                                # @add32i8
# BB#0:                                 # %entry
	paddb	%xmm2, %xmm0
	paddb	%xmm3, %xmm1
	retq
```
But we can try again, asking specifically for 256-bit AVX-2 instructions.
```
llc -filetype=asm add32i8.ll -mattr=+avx2
```
Now we see that the code generator has taken advantage of AVX-2 native support for 256-bit vector adds.
```asm
add32i8:                                # @add32i8
# BB#0:                                 # %entry
	vpaddb	%ymm1, %ymm0, %ymm0
	retq
```
Now consider what happens for the unusual case of `<25 x i5>` vectors.
```llvm
define <25 x i5> @add25i5(<25 x i5> %a, <25 x i5> %b) {
entry:
  %0 = add <25 x i5> %a, %b
  ret <25 x i5> %0
}
```
```
llc -filetype=asm add25i5.ll
```
What do you get? (Try it yourself.) The compilation goes through steps of widening, splitting, and scalarizing to give a pretty ugly-looking result. Definitely room for improvement!