
SIMD Code Generation in the LLVM Back-End

Expressing SIMD operations in LLVM IR

LLVM IR has some very natural representations for SIMD processing.

  • The LLVM vector type naturally specifies a set of "multiple data" elements that can be processed together.
    • <4 x i32> is the LLVM vector type for four 32-bit integers treated as a single (128-bit) value.
    • <8 x float> is the LLVM vector type for eight 32-bit floating-point values treated as a single (256-bit) value.
    • <64 x i8> is the LLVM vector type for sixty-four 8-bit integers treated together as a single (512-bit) value.
  • Most SIMD vertical operations are expressed very naturally in LLVM IR.
    • op <M x iN> %x, %y is a SIMD vertical operation yielding a <M x iN> result for (iN, iN) -> iN operators
    • add, and, sub, or, mul, udiv, urem, sdiv, srem, shl, lshr, ...
    • This also applies to LLVM intrinsics such as llvm.ctpop.
    • op <M x float> %x, %y is a SIMD vertical operation yielding a <M x float> result for (float, float) -> float operators
  • SIMD comparison to produce masks of all 1s or all 0s involves two IR operations:
    • %c = icmp ult <M x iN> %a, %b gives a <M x i1> result
    • %d = sext <M x i1> %c to <M x iN> sign-extends each one-bit field to an N-bit field of all 1s or all 0s.
    • BUT, this semantics is not supported correctly by current code generators!
    • Possible project opportunity for this course: improve back-end code generators to support masking correctly in all cases.
    • Another possible project: add icmpmsk instructions to LLVM IR.
  • Arbitrary field movement is supported by the shufflevector operation.
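
As a concrete sketch of the two-step mask idiom above, the following function (the name mask_ult is illustrative) compares two <4 x i32> vectors and produces fields of all 1s or all 0s:

```llvm
define <4 x i32> @mask_ult(<4 x i32> %a, <4 x i32> %b) {
entry:
  ; unsigned less-than comparison: one bit per field
  %c = icmp ult <4 x i32> %a, %b
  ; sign-extend each i1 to a 32-bit field of all 1s or all 0s
  %d = sext <4 x i1> %c to <4 x i32>
  ret <4 x i32> %d
}
```

On x86 with SSE, a good code generator can select a single pcmpgtd-style comparison for this pair of IR operations, since the hardware comparison already yields full-width masks.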

With these facilities, language implementers targeting SIMD operations can, in general, very easily express the desired SIMD computations as LLVM IR.

The Back-End SIMD Compilation Problem

The richness and expressivity of LLVM IR creates challenges for the writers of compiler back-ends, however. Most target architectures support very few of the actual vector operations expressible in LLVM IR. The problem of choosing good machine-specific instructions for particular IR operations is called instruction selection.

Type Legalization

One of the key initial steps in compiling LLVM IR for a particular back-end is the process of type legalization. Essentially, the "legal" types for a given back-end are the ones that are directly supported by the instruction set architecture. We can thus expect i32 to be a legal type on most architectures, while i23 is not. In the case of SIMD, <4 x i32> will quite often be considered legal (x86 SSE registers, ARM NEON registers, PowerPC AltiVec registers), while <7 x i11> and <64 x i64> are not.

Type legalization is the process of transforming LLVM IR so that it only uses the "legal" types that the back-end compiler writer chooses to support.

Fortunately, the process of type legalization is mostly generic. That is, given a set of legal LLVM types, the operation to transform IR to use these types should be machine-independent. In general, three techniques are used:

  • splitting: split vector types that are too big into smaller ones
  • widening: add elements to vector types to reach a legal vector size
  • scalarizing: turn vector instructions into a sequence of instructions on individual fields.
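
As a sketch of how splitting looks, an add on <4 x i64> vectors for a target whose widest legal type is <2 x i64> could be rewritten as two legal adds on the halves. (This is illustrative IR only; the actual legalizer performs the equivalent rewrite on SelectionDAG nodes rather than on IR.)

```llvm
; before splitting: <4 x i64> is too wide for a 128-bit target
;   %r0 = add <4 x i64> %a, %b

; after splitting: extract halves, add each, recombine
%a.lo = shufflevector <4 x i64> %a, <4 x i64> undef, <2 x i32> <i32 0, i32 1>
%a.hi = shufflevector <4 x i64> %a, <4 x i64> undef, <2 x i32> <i32 2, i32 3>
%b.lo = shufflevector <4 x i64> %b, <4 x i64> undef, <2 x i32> <i32 0, i32 1>
%b.hi = shufflevector <4 x i64> %b, <4 x i64> undef, <2 x i32> <i32 2, i32 3>
%r.lo = add <2 x i64> %a.lo, %b.lo
%r.hi = add <2 x i64> %a.hi, %b.hi
%r1   = shufflevector <2 x i64> %r.lo, <2 x i64> %r.hi,
                      <4 x i32> <i32 0, i32 1, i32 2, i32 3>
```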

Project area: SWARization.

See the SelectionDAG LegalizeTypes Phase.

Operation Legalization

Once we have IR confined to a suitable set of types, the next phase is to "legalize" operations, that is, to transform the IR to use only the operations actually supported by the underlying architecture.
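
For example, x86 has no SIMD integer division instruction, so the following function uses a type that is legal on x86 but an operation that is not:

```llvm
define <4 x i32> @div4i32(<4 x i32> %a, <4 x i32> %b) {
entry:
  ; legal type, but no corresponding vector instruction on x86
  %0 = udiv <4 x i32> %a, %b
  ret <4 x i32> %0
}
```

Compiling this with llc for x86, we can expect the code generator to expand the vector udiv into a sequence of element extractions, scalar divisions, and element insertions.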

Exploring Back-End Translations with llc

LLVM has a handy back-end compiler, llc, that works directly with LLVM IR, translating it to machine code for your desired target architecture. It can even be built for cross-compilation, e.g., generating ARM code on an x86 platform.

For example, consider the following add16i8 function.

define <16 x i8> @add16i8(<16 x i8> %a, <16 x i8> %b)  {
entry:
  %0 = add <16 x i8> %a, %b
  ret <16 x i8> %0
}

By putting this function in a file add16i8.ll and compiling it with llc, we can see the results of code generation.

llc -filetype=asm add16i8.ll

With the -filetype=asm specification, we get assembler output for our target machine in the generated add16i8.s file.

	.text
	.file	"add16i8.ll"
	.globl	add16i8
	.align	16, 0x90
	.type	add16i8,@function
add16i8:                                # @add16i8
	.cfi_startproc
# BB#0:                                 # %entry
	paddb	%xmm1, %xmm0
	retq
.Lfunc_end0:
	.size	add16i8, .Lfunc_end0-add16i8
	.cfi_endproc


	.section	".note.GNU-stack","",@progbits

Now, let's see what happens if we compile the add32i8 function.

define <32 x i8> @add32i8(<32 x i8> %a, <32 x i8> %b)  {
entry:
  %0 = add <32 x i8> %a, %b
  ret <32 x i8> %0
}
llc -filetype=asm add32i8.ll

With the default x86 architecture as the target, the 256-bit vectors are first split into two 128-bit halves, so we see the generated code using two instructions to implement the 256-bit add (boilerplate removed).

add32i8:                                # @add32i8
# BB#0:                                 # %entry
	paddb	%xmm2, %xmm0
	paddb	%xmm3, %xmm1
	retq

But we can try again, asking specifically for 256-bit AVX-2 instructions.

llc -filetype=asm add32i8.ll -mattr=+avx2

Now we see that the code generator has taken advantage of AVX-2 native support for 256-bit vector adds.

add32i8:                                # @add32i8
# BB#0:                                 # %entry
	vpaddb	%ymm1, %ymm0, %ymm0
	retq

Now consider what happens for the unusual case of <25 x i5> vectors.

define <25 x i5> @add25i5(<25 x i5> %a, <25 x i5> %b)  {
entry:
  %0 = add <25 x i5> %a, %b
  ret <25 x i5> %0
}
llc -filetype=asm add25i5.ll

What do you get? (Try it yourself.) The compilation goes through steps of widening, splitting, and scalarizing to give us a pretty ugly-looking result! Definitely room for improvement!

Updated Wed Jan. 10 2018, 11:43 by cameron.