
An Example MultiBlock Kernel: Expand3_4 Kernel

The MultiBlockKernel interface is the standard way to program Parabix kernels directly in terms of LLVM IR. A kernel using this interface processes multiple data blocks on each call to its kernel logic.

  • Block size in Parabix is equal to the number of bits in the current SIMD registers.
  • For AVX2, the block size is 256; when processing a byte stream, each block therefore holds 256 bytes (see the sketch after this list).
  • Don't confuse data blocks with LLVM IR Basic Blocks!
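
To make the size arithmetic concrete, the sketch below (ordinary C++, not Parabix code) tabulates the pack and block sizes of a byte stream for common register widths. It assumes the layout that the kernel code and generated IR on this page also reflect: a block of a byte stream is stored as 8 SIMD-register-sized packs, each holding getBitBlockWidth()/8 bytes.

#include <cstdio>

int main() {
    // SIMD register widths for SSE2, AVX2 and AVX-512.
    const unsigned blockBits[] = {128, 256, 512};
    for (unsigned bits : blockBits) {
        unsigned packBytes = bits / 8;         // bytes held in one SIMD register (PACK_SIZE in the kernel below)
        unsigned blockBytes = 8 * packBytes;   // a byte-stream block is stored as 8 such packs
        printf("block size %3u: pack = %2u bytes, byte-stream block = %3u bytes\n",
               bits, packBytes, blockBytes);
    }
    return 0;
}

Note that the IR listing at the end of this page was generated for 128-bit registers, so its packs are 16 bytes (<16 x i8>) and each byte-stream block is 128 bytes ([8 x <2 x i64>] per stream).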

Let us consider the expand3_4 kernel used in the base64 program as an example. The purpose of this kernel is to expand an input byte stream by duplicating every third byte. Thus, every 3 input bytes abc produce a 4-byte output abcc.
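
As a point of reference before looking at the SIMD implementation, a simple byte-at-a-time version of the same transformation could look like the following sketch (ordinary C++, not the Parabix kernel):

#include <cassert>
#include <cstddef>
#include <string>

// Expand a byte stream by duplicating every third byte: "abc" -> "abcc".
std::string expand3_4_reference(const std::string & input) {
    std::string output;
    output.reserve((input.size() * 4 + 2) / 3);
    for (std::size_t i = 0; i < input.size(); ++i) {
        output.push_back(input[i]);
        if (i % 3 == 2) {                 // the third byte of each triple is emitted twice
            output.push_back(input[i]);
        }
    }
    return output;
}

int main() {
    assert(expand3_4_reference("abc") == "abcc");
    assert(expand3_4_reference("abcdef") == "abccdeff");
    return 0;
}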

From include/kernel/util/radix64.h we can see the class declaration for this kernel.

class expand3_4Kernel final : public MultiBlockKernel {
public:
    expand3_4Kernel(KernelBuilder & b, StreamSet * input, StreamSet * expandedOutput);
private:
    void generateMultiBlockLogic(KernelBuilder & b, llvm::Value * const numOfStrides) override;
};

The constructor implementation in lib/kernel/util/radix64.cpp declares the name of the kernel and its input and output streams. It also specifies the processing rates: four output elements are generated for every three input elements.

expand3_4Kernel::expand3_4Kernel(KernelBuilder & b, StreamSet * input, StreamSet * expandedOutput)
: MultiBlockKernel(b, "expand3_4",
                   {Binding{"sourceStream", input, FixedRate(3)}},
                   {Binding{"expand34Stream", expandedOutput, FixedRate(4)}},
                   {}, {}, {}) {

}
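
The FixedRate(3) and FixedRate(4) declarations together mean that n processed input bytes correspond to ceil(4n/3) produced output bytes. A minimal sketch of that arithmetic (ordinary C++, not Parabix code):

#include <cassert>
#include <cstdint>

// For n input items consumed at FixedRate(3), the output produced at FixedRate(4)
// is ceil(4 * n / 3) items.
std::uint64_t expectedOutputItems(std::uint64_t inputItems) {
    std::uint64_t scaled = inputItems * 4;
    return scaled / 3 + (scaled % 3 != 0);   // ceiling division by 3
}

int main() {
    assert(expectedOutputItems(3) == 4);     // one full triple: abc -> abcc
    assert(expectedOutputItems(6) == 8);     // two full triples
    assert(expectedOutputItems(7) == 10);    // a trailing partial triple rounds up
    return 0;
}

The same ceiling computation appears in the entry block of the generated IR at the end of this page, where fixedRateFactor is multiplied by 4, divided by 3, and rounded up before being added to the produced item count of expand34Stream.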

The main logic for generating the IR is programmed in the generateMultiBlockLogic method. The input parameter numOfStrides identifies the number of strides of input data that the kernel must process before returning control to the pipeline.

  • The generateMultiBlockLogic function must program a loop to iterate through the required number of strides; a distilled sketch of this loop structure is shown below, followed by the full method.
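
Distilled from the full method below, that loop follows the usual LLVM pattern: an entry block, a loop body with a PHI node for the stride counter, and an exit block. The following fragment is only a sketch of the control flow (the actual expand3_4 kernel multiplies numOfStrides by 8 and loops over that count, since each of its iterations processes one triple of input packs rather than a whole stride):

    BasicBlock * entry = b.GetInsertBlock();
    BasicBlock * loop = b.CreateBasicBlock("loop");
    BasicBlock * exit = b.CreateBasicBlock("exit");

    b.CreateBr(loop);

    b.SetInsertPoint(loop);
    // The stride counter enters the loop as 0 and is advanced at the bottom of the body.
    PHINode * strideOffset = b.CreatePHI(b.getSizeTy(), 2);
    strideOffset->addIncoming(b.getSize(0), entry);

    // ... load input packs, shuffle, and store output packs for this iteration ...

    Value * const nextStrideOffset = b.CreateAdd(strideOffset, b.getSize(1));
    strideOffset->addIncoming(nextStrideOffset, loop);
    Value * const continueLoop = b.CreateICmpULT(nextStrideOffset, numOfStrides);
    b.CreateCondBr(continueLoop, loop, exit);

    b.SetInsertPoint(exit);

The source comments and full implementation of generateMultiBlockLogic from lib/kernel/util/radix64.cpp follow.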
// This kernel produces an expanded input stream by duplicating every third byte.
// It is implemented using SIMD shufflevector operations.  With 16-byte registers,
// a single shufflevector operation produces 16 bytes of output data from the
// 12 bytes of input data.   With 32-byte registers, 32 bytes of output data are
// produced from 24 bytes of input data.
//
// Using aligned SIMD loads, an inner loop processes three registers full of input
// data (i.e., three BytePacks) to produce four registers full of output.   This is
// a 3 step process.
// Step 1:  Load input_pack0, apply the shuffle operation to produce output_pack0.
//          At this point 3/4 of the data in input_pack0 has been processed.
// Step 2:  Load input_pack1, apply a shuffle operation to use the remaining
//          1/4 of input_pack0 and 1/2 of input_pack1 to produce output_pack1.
//          At this point 1/2 of the data in input_pack1 has been processed.
// Step 3:  Load input_pack2, apply a shuffle operation to use the remaining 1/2
//          of input_pack1 and 1/4 of input_pack2 to produce output_pack2.
//          Then apply a further shuffle operation to use the remaining 3/4 of
//          input_pack2 to produce output_pack3.

// The MultiBlockLogic is based on a natural stride taking 3 packs at a time.
// In this case, the output produced is exactly 4 packs or 4 blocks, with no pending
// data maintained in the kernel state.
//
// When processing the final partial stride of data, the kernel performs full
// triple-pack processing for each full or partial triple-pack remaining,
// relying on the MultiBlockKernel builder to only copy the correct number
// of bytes to the actual output stream.

void expand3_4Kernel::generateMultiBlockLogic(KernelBuilder & b, Value * const numOfStrides) {

    BasicBlock * expand2_3entry = b.GetInsertBlock();
    BasicBlock * expand_3_4_loop = b.CreateBasicBlock("expand_3_4_loop");
    BasicBlock * expand3_4_exit = b.CreateBasicBlock("expand3_4_exit");

    // Determine the required shufflevector constants.
    const unsigned PACK_SIZE = b.getBitBlockWidth()/8;

    ConstantInt * const ZERO = b.getSize(0);
    ConstantInt * const ONE = b.getSize(1);
    ConstantInt * const THREE = b.getSize(3);
    ConstantInt * const FOUR = b.getSize(4);
    ConstantInt * const SEVEN = b.getSize(7);

    // Construct a list of indexes in  the form
    // 0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, ...
    unsigned sourceByteIndex = 0;
    SmallVector<unsigned, 64> expand3_4_index(PACK_SIZE);
    for (unsigned i = 0; i < PACK_SIZE; i++) {
        expand3_4_index[i] = sourceByteIndex;
        if (i % 4 != 2) sourceByteIndex++;
    }
    unsigned const expand3_4_offset[4] = {PACK_SIZE, 3*PACK_SIZE/4, PACK_SIZE/2, PACK_SIZE/4};
    Value * expand_3_4_shuffle[4];
    for (unsigned j = 0; j < 4; j++) {
        std::vector<Constant *> Idxs;
        for (unsigned i = 0; i < PACK_SIZE; i++) {
            Idxs.push_back(ConstantInt::get(b.getInt32Ty(), expand3_4_offset[j] + expand3_4_index[i]));
        }
        expand_3_4_shuffle[j] = ConstantVector::get(Idxs);
    }

    UndefValue * undefPack = UndefValue::get(b.fwVectorType(8));
    Value * const numOfBlocks = b.CreateMul(numOfStrides, b.getSize(8));
    // The main loop processes 3 packs of data at a time.
    b.CreateBr(expand_3_4_loop);

    b.SetInsertPoint(expand_3_4_loop);
    PHINode * strideOffset = b.CreatePHI(b.getSizeTy(), 2);
    strideOffset->addIncoming(ZERO, expand2_3entry);

    Value * const baseInputOffset = b.CreateMul(strideOffset, THREE);
    Value * const baseOutputOffset = b.CreateMul(strideOffset, FOUR);
    Value * carryOver = undefPack;
    for (unsigned i = 0; i < 3; ++i) {
        ConstantInt * const index = b.getSize(i);
        Value * const inputOffset = b.CreateAdd(baseInputOffset, index);
        Value * const inputPackIndex = b.CreateAnd(inputOffset, SEVEN);
        Value * const inputBlockOffset = b.CreateLShr(inputOffset, THREE);
        Value * const input = b.fwCast(8, b.loadInputStreamPack("sourceStream", ZERO, inputPackIndex, inputBlockOffset));
        Value * const expanded = b.CreateShuffleVector(carryOver, input, expand_3_4_shuffle[i]);
        Value * const outputOffset = b.CreateAdd(baseOutputOffset, index);
        Value * const outputPackIndex = b.CreateAnd(outputOffset, SEVEN);
        Value * const outputBlockOffset = b.CreateLShr(outputOffset, THREE);
        b.storeOutputStreamPack("expand34Stream", ZERO, outputPackIndex, outputBlockOffset, b.bitCast(expanded));
        carryOver = input;
    }
    Value * expanded = b.CreateShuffleVector(carryOver, undefPack, expand_3_4_shuffle[3]);
    Value * outputOffset = b.CreateAdd(baseOutputOffset, THREE);
    Value * const outputPackIndex = b.CreateAnd(outputOffset, SEVEN);
    Value * const outputBlockOffset = b.CreateLShr(outputOffset, THREE);
    b.storeOutputStreamPack("expand34Stream", ZERO, outputPackIndex, outputBlockOffset, b.bitCast(expanded));

    Value * const nextStrideOffset = b.CreateAdd(strideOffset, ONE);
    strideOffset->addIncoming(nextStrideOffset, expand_3_4_loop);
    Value * continueLoop = b.CreateICmpULT(nextStrideOffset, numOfBlocks);
    b.CreateCondBr(continueLoop, expand_3_4_loop, expand3_4_exit);

    b.SetInsertPoint(expand3_4_exit);
}
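
It can be helpful to see the concrete shuffle masks this code constructs. The following standalone sketch (plain C++, not part of the kernel) recomputes the index vectors for 16-byte packs; the four rows it prints, from 16 17 18 18 ... down to 4 5 6 6 ..., are exactly the <16 x i32> masks of the four shufflevector instructions in the IR listing below.

#include <cstdio>

int main() {
    const unsigned PACK_SIZE = 16;                 // 128-bit registers, as in the IR below
    unsigned expand3_4_index[PACK_SIZE];
    unsigned sourceByteIndex = 0;
    for (unsigned i = 0; i < PACK_SIZE; i++) {     // 0, 1, 2, 2, 3, 4, 5, 5, ...
        expand3_4_index[i] = sourceByteIndex;
        if (i % 4 != 2) sourceByteIndex++;
    }
    const unsigned expand3_4_offset[4] = {PACK_SIZE, 3 * PACK_SIZE / 4, PACK_SIZE / 2, PACK_SIZE / 4};
    for (unsigned j = 0; j < 4; j++) {
        printf("shuffle[%u]:", j);
        for (unsigned i = 0; i < PACK_SIZE; i++) {
            printf(" %2u", expand3_4_offset[j] + expand3_4_index[i]);
        }
        printf("\n");
    }
    return 0;
}

In each CreateShuffleVector call the first operand is the carry-over pack (or an undef pack) and the second is the newly loaded pack, so mask entries below 16 select leftover bytes from the previous pack while entries of 16 or more select bytes from the current one, matching the three-step description in the comments above.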

Using the -ShowIR option, we can see the actual LLVM IR that is generated.

; Function Attrs: norecurse
define void @expand3_4_DoSegment(i64 %numOfStrides, i64 %fixedRateFactor, i8* %sourceStream, i64 %sourceStream_processed, i8* %expand34Stream, i64 %expand34Stream_produced) #0 {
entry:
  %0 = alloca { [1 x [8 x <2 x i64>]]*, i64 }, align 8
  %1 = alloca { [1 x [8 x <2 x i64>]]*, i64 }, align 8
  %2 = icmp eq i64 %numOfStrides, 0
  %3 = select i1 %2, i64 1, i64 %numOfStrides
  %4 = bitcast i8* %sourceStream to [1 x [8 x <2 x i64>]]*
  %5 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %1, i32 0, i32 0
  store [1 x [8 x <2 x i64>]]* %4, [1 x [8 x <2 x i64>]]** %5, align 8
  %6 = add i64 %sourceStream_processed, %fixedRateFactor
  %7 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %1, i32 0, i32 1
  store i64 %6, i64* %7, align 8
  %8 = bitcast i8* %expand34Stream to [1 x [8 x <2 x i64>]]*
  %9 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %0, i32 0, i32 0
  store [1 x [8 x <2 x i64>]]* %8, [1 x [8 x <2 x i64>]]** %9, align 8
  %10 = mul i64 %fixedRateFactor, 4
  %11 = udiv i64 %10, 3
  %12 = urem i64 %10, 3
  %13 = icmp ne i64 %12, 0
  %14 = zext i1 %13 to i64
  %15 = add i64 %11, %14
  %16 = add i64 %expand34Stream_produced, %15
  %17 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %0, i32 0, i32 1
  store i64 %16, i64* %17, align 8
  %18 = mul i64 %3, 8
  br label %expand_3_4_loop

expand_3_4_loop:                                  ; preds = %expand_3_4_loop, %entry
  %19 = phi i64 [ 0, %entry ], [ %88, %expand_3_4_loop ]
  %20 = mul i64 %19, 3
  %21 = mul i64 %19, 4
  %22 = add i64 %20, 0
  %23 = and i64 %22, 7
  %24 = lshr i64 %22, 3
  %25 = lshr i64 %sourceStream_processed, 7
  %26 = add i64 %25, %24
  %27 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %1, i32 0, i32 0
  %28 = load [1 x [8 x <2 x i64>]]*, [1 x [8 x <2 x i64>]]** %27, align 8
  %29 = getelementptr inbounds [1 x [8 x <2 x i64>]], [1 x [8 x <2 x i64>]]* %28, i64 %26, i64 0, i64 %23
  %30 = load <2 x i64>, <2 x i64>* %29, align 16
  %31 = bitcast <2 x i64> %30 to <16 x i8>
  %32 = shufflevector <16 x i8> undef, <16 x i8> %31, <16 x i32> <i32 16, i32 17, i32 18, i32 18, i32 19, i32 20, i32 21, i32 21, i32 22, i32 23, i32 24, i32 24, i32 25, i32 26, i32 27, i32 27>
  %33 = add i64 %21, 0
  %34 = and i64 %33, 7
  %35 = lshr i64 %33, 3
  %"expand3_4.expand34Stream[0]" = bitcast <16 x i8> %32 to <2 x i64>
  %36 = lshr i64 %expand34Stream_produced, 7
  %37 = add i64 %36, %35
  %38 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %0, i32 0, i32 0
  %39 = load [1 x [8 x <2 x i64>]]*, [1 x [8 x <2 x i64>]]** %38, align 8
  %40 = getelementptr inbounds [1 x [8 x <2 x i64>]], [1 x [8 x <2 x i64>]]* %39, i64 %37, i64 0, i64 %34
  store <2 x i64> %"expand3_4.expand34Stream[0]", <2 x i64>* %40, align 16
  %41 = add i64 %20, 1
  %42 = and i64 %41, 7
  %43 = lshr i64 %41, 3
  %44 = lshr i64 %sourceStream_processed, 7
  %45 = add i64 %44, %43
  %46 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %1, i32 0, i32 0
  %47 = load [1 x [8 x <2 x i64>]]*, [1 x [8 x <2 x i64>]]** %46, align 8
  %48 = getelementptr inbounds [1 x [8 x <2 x i64>]], [1 x [8 x <2 x i64>]]* %47, i64 %45, i64 0, i64 %42
  %49 = load <2 x i64>, <2 x i64>* %48, align 16
  %50 = bitcast <2 x i64> %49 to <16 x i8>
  %51 = shufflevector <16 x i8> %31, <16 x i8> %50, <16 x i32> <i32 12, i32 13, i32 14, i32 14, i32 15, i32 16, i32 17, i32 17, i32 18, i32 19, i32 20, i32 20, i32 21, i32 22, i32 23, i32 23>
  %52 = add i64 %21, 1
  %53 = and i64 %52, 7
  %54 = lshr i64 %52, 3
  %"expand3_4.expand34Stream[0]1" = bitcast <16 x i8> %51 to <2 x i64>
  %55 = lshr i64 %expand34Stream_produced, 7
  %56 = add i64 %55, %54
  %57 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %0, i32 0, i32 0
  %58 = load [1 x [8 x <2 x i64>]]*, [1 x [8 x <2 x i64>]]** %57, align 8
  %59 = getelementptr inbounds [1 x [8 x <2 x i64>]], [1 x [8 x <2 x i64>]]* %58, i64 %56, i64 0, i64 %53
  store <2 x i64> %"expand3_4.expand34Stream[0]1", <2 x i64>* %59, align 16
  %60 = add i64 %20, 2
  %61 = and i64 %60, 7
  %62 = lshr i64 %60, 3
  %63 = lshr i64 %sourceStream_processed, 7
  %64 = add i64 %63, %62
  %65 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %1, i32 0, i32 0
  %66 = load [1 x [8 x <2 x i64>]]*, [1 x [8 x <2 x i64>]]** %65, align 8
  %67 = getelementptr inbounds [1 x [8 x <2 x i64>]], [1 x [8 x <2 x i64>]]* %66, i64 %64, i64 0, i64 %61
  %68 = load <2 x i64>, <2 x i64>* %67, align 16
  %69 = bitcast <2 x i64> %68 to <16 x i8>
  %70 = shufflevector <16 x i8> %50, <16 x i8> %69, <16 x i32> <i32 8, i32 9, i32 10, i32 10, i32 11, i32 12, i32 13, i32 13, i32 14, i32 15, i32 16, i32 16, i32 17, i32 18, i32 19, i32 19>
  %71 = add i64 %21, 2
  %72 = and i64 %71, 7
  %73 = lshr i64 %71, 3
  %"expand3_4.expand34Stream[0]2" = bitcast <16 x i8> %70 to <2 x i64>
  %74 = lshr i64 %expand34Stream_produced, 7
  %75 = add i64 %74, %73
  %76 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %0, i32 0, i32 0
  %77 = load [1 x [8 x <2 x i64>]]*, [1 x [8 x <2 x i64>]]** %76, align 8
  %78 = getelementptr inbounds [1 x [8 x <2 x i64>]], [1 x [8 x <2 x i64>]]* %77, i64 %75, i64 0, i64 %72
  store <2 x i64> %"expand3_4.expand34Stream[0]2", <2 x i64>* %78, align 16
  %79 = shufflevector <16 x i8> %69, <16 x i8> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 6, i32 7, i32 8, i32 9, i32 9, i32 10, i32 11, i32 12, i32 12, i32 13, i32 14, i32 15, i32 15>
  %80 = add i64 %21, 3
  %81 = and i64 %80, 7
  %82 = lshr i64 %80, 3
  %"expand3_4.expand34Stream[0]3" = bitcast <16 x i8> %79 to <2 x i64>
  %83 = lshr i64 %expand34Stream_produced, 7
  %84 = add i64 %83, %82
  %85 = getelementptr inbounds { [1 x [8 x <2 x i64>]]*, i64 }, { [1 x [8 x <2 x i64>]]*, i64 }* %0, i32 0, i32 0
  %86 = load [1 x [8 x <2 x i64>]]*, [1 x [8 x <2 x i64>]]** %85, align 8
  %87 = getelementptr inbounds [1 x [8 x <2 x i64>]], [1 x [8 x <2 x i64>]]* %86, i64 %84, i64 0, i64 %81
  store <2 x i64> %"expand3_4.expand34Stream[0]3", <2 x i64>* %87, align 16
  %88 = add i64 %19, 1
  %89 = icmp ult i64 %88, %18
  br i1 %89, label %expand_3_4_loop, label %expand3_4_exit

expand3_4_exit:                                   ; preds = %expand_3_4_loop
  ret void
}

; Function Attrs: norecurse
define void @expand3_4_Finalize() #0 {
entry:
  ret void
}