Parabix Kernels
Kernels + Stream Sets = Parabix Programs
Parabix programs are composed of kernels operating on stream sets of parallel streams.
A stream is an arbitrary-length vector of iN
fields, where N = 1 in the case of bit streams and N = 8 in the case of byte streams. The Parabix framework is generally concerned with streams in which the field widths are powers of 2.
A stream set is a set of streams all of the same field width and same length, designed to be processed together in one-to-one correspondence.
Kernels are computational abstractions that take one or more stream sets as inputs and produce one or more stream sets as outputs.
Source Kernels and Sink Kernels
Source Kernels
Parabix pipeline programs always begin with a kernel that produces an output stream set from an external source. Such a kernel is called a source kernel to reflect its role in the Parabix pipeline: it has no input stream set, but generates an output stream set as the first step in processing.
MMapSourceKernel
: file input usingmmap
ReadSourceKernel
: file input usingread
(including stdin)FDSourceKernel
: unified kernel using eithermmap
orread
MemorySourceKernel
: input from a memory buffer
Sink Kernels
The final kernel in a Parabix pipeline program is always a sink kernel, one that takes a stream set as input, but produce no stream sets as output. Sink kernels direct their output externally.
StdOutKernel
FileSinkKernel
Fixed-Rate Bitstream Kernels: Pablo
Pablo is designed to automatically handle all buffering and boundary-crossing when kernels are restricted as follows:
- There is a one-to-one correspondence between input stream sets and output stream sets.
- The fundamental operations are bitwise logic, stream shift (
Advance
) and addition (MatchStar
,ScanThru
).
- Well-suited to search problems: 1 bit per position marks locations of found items.
- regular expression matching
- edit distance
- parsing
- Pablo Extensions (work in progress)
- Handle arbitrary
iN
streams.- s2p in Pablo
- Direct CC builder
Sequential Bitscan Kernels
After bitstreams have been computed in parallel, sequential processing of identified positions may be required.
If bit stream density is low, sequential bit scanning is very fast.
bsf
instruction identifies the next one bit in a 64-bit word.- Additional
BMI
(bit manipulation instructions) introduced by Intel in 2013.
ScanMatch Kernel
For example, icgrep has a parallel regular expression matching phase followed by a sequential output phase.
- Find matches within lines and mark the ends of matched lines with a bitstream (bitwise data parallel)
- Scan through line breaks and matched lines bitstreams to produce output.
Early Termination: The UntilN
Kernel
- Shortens a bit stream to include only those positions up to and including the Nth one bit.
To Do:
- Modify
ScanMatch
to print lines surrounding a match (grep context parameters-A
,-B
,-C
)
- Create a programming model/language for generalized sequential scanning.
Parabix/LLVM Kernel Programming
Processing Rates
When stream sets are not processed in strict one-to-one correspondence, processing rate attributes must be declared.
Example: base64
application.
- This application produces an ASCII-encoded representation of arbitrary binary files.
- Four output bytes for every three input bytes.
- The application is build with three kernels with rate declarations:
expand3_4
input:FixedRate(3)
, outputFixedRate(4)
.radix64
default (FixedRate(1)
) for input and outputbase64
input:FixedRate(1)
, outputRoundUpTo(4)
Example: u8u16
application
- Input: UTF-8 code units (bytes)
- Output: UTF-16 code units (doublebytes)
BoundedRate(0, 1)
used inP2SKernelWithCompressedOutput
Output rate attribute: Add1()
- Used when final carry-out bit is significant.
- Example:
icgrep
Block-Oriented Kernel Interface
- Simplest interface for kernel programming at LLVM level.
- Programmer writes the logic for processing a single block of input.
- Programmer must explicitly store any state information that must be carried from block to block.
Multiblock Kernel Interface
- General efficient interface for kernel programming.
- Always called with either
- a whole multiple number of blocks of data, or
- a final partial block.
- Pipeline makes sure that there is sufficient data on all input buffers and sufficient space on all output buffers according to declared processing rates.
- Programmer must updated processed and produced item counts for any streams with inexact processing rates.
To Do: Generalized Filtering/Merging Kernels
Deletion/Extraction
- A deletion mask marks positions that are to be deleted from one or more input streams.
- Output stream sets are compressed, including only those bits at non-deleted positions.
- An extraction mask is the logical complement of a deletion mask.
- Define the
popcount(extraction_stream)
processing rate for output. - Note Intel
BMI
instructions includePEXT
.
- Define the
- Parabix contains several implementations involving deletion.
- Different algorithmic choices.
- NEEDS unification/simplification
Expansion/Deposit
- An deposit mask is the used to spread bits of a stream out.
- Can be used to merge two stream sets:
- 0 bits select items from stream set 1
- 1 bits select items from stream set 2
- To do: use the
popcount(deposit_stream)
processing rate for input. - Note Intel
BMI
instructions includePDEP
.
- Partially implemented, but needs work.