
Lane Crossing Issues in Wide SIMD Instruction Sets

Pack and Merge: Key Operations in Transposition

  • The key operation in Parabix transposition (s2p) is the horizontal SIMD pack operation, which packs \(2n\)-bit fields into \(n\)-bit fields.
  • The key operation in Parabix reverse transposition (p2s) is the IDISA expansion merge operation.
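
To make the relationship between the two operations concrete, here is a minimal scalar sketch for \(n = 8\) on 128-bit registers. The function names pack16 and merge8, the array representation, and the operand order are assumptions made for illustration; they are not the IDISA definitions.

```c
#include <stdint.h>

enum { N16 = 8 };  /* number of 16-bit fields in one 128-bit register */

/* Pack: keep one 8-bit half of every 16-bit field of a, then of b.
   high = 0 keeps the low halves, high = 1 keeps the high halves. */
void pack16(const uint16_t a[N16], const uint16_t b[N16], int high,
            uint8_t dst[2 * N16]) {
    int shift = high ? 8 : 0;
    for (int i = 0; i < N16; i++) {
        dst[i]       = (uint8_t)(a[i] >> shift);
        dst[N16 + i] = (uint8_t)(b[i] >> shift);
    }
}

/* Merge: interleave 8-bit fields of hi and lo into 16-bit fields.
   upper = 0 consumes bytes 0..7, upper = 1 consumes bytes 8..15. */
void merge8(const uint8_t hi[2 * N16], const uint8_t lo[2 * N16], int upper,
            uint16_t dst[N16]) {
    int base = upper ? N16 : 0;
    for (int i = 0; i < N16; i++)
        dst[i] = (uint16_t)((hi[base + i] << 8) | lo[base + i]);
}
```

In this model, packing the high and low halves of two registers and then merging the two packed results recovers the original registers, which is the sense in which merge reverses pack.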

Pure Merge in SSE2

The SSE2 instruction set includes punpcklwd, a pure merge operation. The Intel Intrinsics Guide defines it as follows.

__m128i _mm_unpacklo_epi16 (__m128i a, __m128i b)
#include "emmintrin.h"
Instruction: punpcklwd xmm, xmm
CPUID Flags: SSE2
INTERLEAVE_WORDS(src1[127:0], src2[127:0]){
	dst[15:0] := src1[15:0] 
	dst[31:16] := src2[15:0] 
	dst[47:32] := src1[31:16] 
	dst[63:48] := src2[31:16] 
	dst[79:64] := src1[47:32] 
	dst[95:80] := src2[47:32] 
	dst[111:96] := src1[63:48] 
	dst[127:112] := src2[63:48] 
	RETURN dst[127:0]
}	
dst[127:0] := INTERLEAVE_WORDS(a[127:0], b[127:0])
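
As a usage sketch, the low and high unpack intrinsics together give a complete merge of two 128-bit registers. The wrapper name merge16_128 and its array interface are hypothetical; the scalar branch is only a fallback for builds that do not target SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#endif

/* Merge the eight 16-bit fields of a and b.
   lo receives a[0],b[0],...,a[3],b[3]; hi receives a[4],b[4],...,a[7],b[7]. */
void merge16_128(const uint16_t a[8], const uint16_t b[8],
                 uint16_t lo[8], uint16_t hi[8]) {
#if defined(__SSE2__) || defined(_M_X64)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)lo, _mm_unpacklo_epi16(va, vb));  /* punpcklwd */
    _mm_storeu_si128((__m128i *)hi, _mm_unpackhi_epi16(va, vb));  /* punpckhwd */
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 4; i++) {
        lo[2 * i] = a[i];     lo[2 * i + 1] = b[i];
        hi[2 * i] = a[4 + i]; hi[2 * i + 1] = b[4 + i];
    }
#endif
}
```

With a 128-bit register there is only one lane, so no further shuffling is needed after the unpack.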

AVX2: Merge/Unpack Within Two 128-bit Lanes

With AVX2, the unpack instruction was extended to 256 bits, but it was defined to operate separately within each of two 128-bit lanes.

__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)
#include "immintrin.h"
Instruction: vpunpcklwd ymm, ymm, ymm
CPUID Flags: AVX2
dst[127:0] := INTERLEAVE_WORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_WORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

Because the instruction operates within each 128-bit lane, it cannot directly implement the full-width simd_merge operation. Additional work is needed to exchange 128-bit halves between two registers: the low 128 bits of one register must be swapped with the high 128 bits of the other.
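
One way to do the extra work described above is to compute both in-lane unpacks and then recombine their 128-bit halves with a cross-lane permute. The sketch below assumes this approach and uses a hypothetical wrapper name, merge16_256; it is not necessarily how Parabix implements simd_merge. The scalar branch is a fallback for builds without AVX2.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Full-width merge of the sixteen 16-bit fields of a and b.
   lo receives a[0],b[0],...,a[7],b[7]; hi receives a[8],b[8],...,a[15],b[15]. */
void merge16_256(const uint16_t a[16], const uint16_t b[16],
                 uint16_t lo[16], uint16_t hi[16]) {
#if defined(__AVX2__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* In-lane results: each register holds half of lo and half of hi. */
    __m256i ul = _mm256_unpacklo_epi16(va, vb);
    __m256i uh = _mm256_unpackhi_epi16(va, vb);
    /* 0x20 selects the low lanes of ul and uh, 0x31 their high lanes. */
    _mm256_storeu_si256((__m256i *)lo, _mm256_permute2x128_si256(ul, uh, 0x20));
    _mm256_storeu_si256((__m256i *)hi, _mm256_permute2x128_si256(ul, uh, 0x31));
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 8; i++) {
        lo[2 * i] = a[i];     lo[2 * i + 1] = b[i];
        hi[2 * i] = a[8 + i]; hi[2 * i + 1] = b[8 + i];
    }
#endif
}
```

In this sketch a pair of merges costs four instructions (two unpacks and two permutes) where SSE2 needs only two.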

AVX-512BW: Merge/Unpack Within Four 128-bit Lanes

AVX-512 continues the extension of SIMD operations to 512 bits, but now has four 128-bit lanes. This adds further cost to the implementation of simd_merge.

__m512i _mm512_unpacklo_epi16 (__m512i a, __m512i b)
#include "immintrin.h"
Instruction: vpunpcklwd
CPUID Flags: AVX512BW
dst[127:0] := INTERLEAVE_WORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_WORDS(a[255:128], b[255:128])
dst[383:256] := INTERLEAVE_WORDS(a[383:256], b[383:256])
dst[511:384] := INTERLEAVE_WORDS(a[511:384], b[511:384])
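
To show where the extra cost comes from, the following scalar model (not intrinsics) mimics the four-lane unpack and then assembles a full-width merge from the in-lane results. The names unpack_inlane and merge_full and the array representation are assumptions made for this sketch.

```c
#include <stdint.h>

enum { LANES = 4, PER_LANE = 8, FIELDS = LANES * PER_LANE };  /* 512 bits of 16-bit fields */

/* Model of the 512-bit vpunpcklwd (high = 0) or vpunpckhwd (high = 1):
   each 128-bit lane interleaves four fields of a with four fields of b. */
void unpack_inlane(const uint16_t a[FIELDS], const uint16_t b[FIELDS],
                   int high, uint16_t dst[FIELDS]) {
    for (int lane = 0; lane < LANES; lane++) {
        int base = lane * PER_LANE;
        int src = base + (high ? PER_LANE / 2 : 0);
        for (int i = 0; i < PER_LANE / 2; i++) {
            dst[base + 2 * i]     = a[src + i];
            dst[base + 2 * i + 1] = b[src + i];
        }
    }
}

/* Full-width merge assembled from the two in-lane results.
   high = 0 merges the low 256 bits of a and b, high = 1 the high 256 bits. */
void merge_full(const uint16_t a[FIELDS], const uint16_t b[FIELDS],
                int high, uint16_t dst[FIELDS]) {
    uint16_t ul[FIELDS], uh[FIELDS];
    unpack_inlane(a, b, 0, ul);
    unpack_inlane(a, b, 1, uh);
    for (int out = 0; out < LANES; out++) {
        /* Even output lanes come from the low unpack, odd ones from the
           high unpack; the source lane is out / 2, offset by 2 when high. */
        const uint16_t *src = (out % 2 == 0) ? ul : uh;
        int src_lane = out / 2 + (high ? LANES / 2 : 0);
        for (int i = 0; i < PER_LANE; i++)
            dst[out * PER_LANE + i] = src[src_lane * PER_LANE + i];
    }
}
```

In this model every output register draws two lanes from each of the two unpack results, so the recombination step needs a two-source permute across all four lanes. On hardware, an instruction such as the one behind _mm512_permutex2var_epi64 might serve for that step, though that choice is an assumption here rather than something this page specifies.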
Updated Wed March 07 2018, 08:13 by cameron.