Not logged in. Login

Horizontal SIMD

Horizontal SIMD operations refer to those which move data fields across SIMD registers, possibly combining them in some way.

Data Rearrangement

The simplest type of horizontal SIMD operations are data rearrangement operations, which rearrange the data fields in some way, but do not change the values within fields.

LLVM provides a very general data rearrangement operation called shufflevector. This operation permits arbitrary selection of fields from two input vectors to produce an output vector. A selection mask specifies, for each element of the output vector, which field from the input vectors is taken.

For example, consider the following operation on two <4 x i64> vectors.

shufflevector <4 x i64> <A, B, C, D>, <4 x i64> <E, F, G, H>, <4 x i32> <3, 7, 1, 6>

Here, the constant selection mask specifies a shuffled selection of elements from the two vectors, producing the following result.

<4 x i64> <D, H, B, G>

Elements are numbered left to right across both vectors, starting from 0. The selection mask must be a constant mask of i32 values.

We can see how such an operation can be compiled into actual SIMD operations using llc. Consider the LLVM function:

define <4 x i64> @myshuffle(<4 x i64> %a, <4 x i64> %b)  {
entry:
  %0 = shufflevector <4 x i64> %a, <4 x i64> %b, <4 x i32> <i32 3, i32 7, i32 1, i32 6>
  ret <4 x i64> %0
}

Now we can use llc specifying an AVX2 target.

llc myshuffle.ll -filetype=asm +mattr=avx2

The assembler output shows the actual horizontal SIMD operations generated in myshuffle.s.

myshuffle:                                      # @myshuffle
# %bb.0:                                        # %entry
	vpermpd	$172, %ymm1, %ymm1              # ymm1 = ymm1[0,3,2,2]
	vpermpd	$215, %ymm0, %ymm0              # ymm0 = ymm0[3,1,1,3]
	vblendps $204, %ymm1, %ymm0, %ymm0      # ymm0 = ymm0[0,1],ymm1[2,3],ymm0[4,5],ymm1[6,7]
	retq

Note that the generated code actually required three AVX2 operations to implement the shufflevector in this case. Shufflevector can be used to express arbitrary rearrangements of vector elements, but particular SIMD instruction set architectures may confine their instruction set to the most common cases.

Merge

An important special case of data rearrangement are merge operations. These operation select alternating elements from two vectors, merging the values together in a single result. For each different vector type, two different operations are required: mergel for merging fields from the low half of two input registers and mergeh for merging from the high half.

The following LLVM functions illustrate how merge operations can be expressed using shufflevector.

define <4 x i32> @mergel(<4 x i32> %a, <4 x i32> %b)  {
entry:
  %0 = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  ret <4 x i32> %0
}

define <4 x i32> @mergeh(<4 x i32> %a, <4 x i32> %b)  {
entry:
  %0 = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
  ret <4 x i32> %0
}

Compiling with llc on SSE2, we can see that these operations are directly supported as single SIMD instructions unpcklps and unpckhps.

	.file	"merge.ll"
mergel:                                 # @mergel
# %bb.0:                                # %entry
	unpcklps	%xmm1, %xmm0    # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	retq

mergeh:                                 # @mergeh
# %bb.0:                                # %entry
	unpckhps	%xmm1, %xmm0    # xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
	retq

Pack

Another class of important horizontal operations are pack operations. These operations essentially pack two vectors into one by taking every second element.

define <4 x i32> @packl(<4 x i32> %a, <4 x i32> %b)  {
entry:
  %0 = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ret <4 x i32> %0
}

define <4 x i32> @packh(<4 x i32> %a, <4 x i32> %b)  {
entry:
  %0 = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  ret <4 x i32> %0
}

Another way of viewing these pack operations is that packh takes the high-half of each 64-bit field, while packl takes the low half of each 64-bit field.

Pack operations on 8-, 16-, 32- and 64-bit fields are fairly important in SIMD data rearrangement, so they tend to be directly implemented as shown by the following SSE2 code generated by llc.

packl:                                  # @packl
# %bb.0:                                # %entry
	shufps	$136, %xmm1, %xmm0              # xmm0 = xmm0[0,2],xmm1[0,2]
	retq

packh:                                  # @packh
# %bb.0:                                # %entry
	shufps	$221, %xmm1, %xmm0              # xmm0 = xmm0[1,3],xmm1[1,3]
	retq

Pack and Merge in the Parabix Framework

Pack and merge are fundamental operations for the Parabix framework, implementing the Parabix Transform and the inverse Parabix Transform, respectively. Note that the Parabix transform actually would benefit from direct support of narrow bit-fields (1-bit, 2-bit, 4-bit), but these need to be simulated by multiple operations on current architectures.

Other Special Shufflevector Patterns

While support for fully general shuffle vectors may be lacking, or expensive on various SIMD architectures, there are many special cases that might be supported. A good SIMD compiler project would be to identify the important cases that are supported by particular architectures and to ensure that the LLVM code generator for the given architecture produces optimal code in each case. That is the goal of the Shuffle Pattern Library project.

Reduction

Horizontal reduction operations not only move data fields horizontally across SIMD registers, but also combine those fields in some way.

A simple but important example is to add all the fields together to produce a single reduced sum.

LLVM has a set of Vector Reduction Intrinsics for such operations. While SIMD instruction sets may directly implement some of these intrinsics, others require code generators to simulate the intrinsic through a series of operations.

Updated Tue Dec. 17 2024, 05:25 by cameron.