PyParabix Tool Chain

Parallel Bit Stream Programming/Prototyping with the PyParabix Tools

The Python tool chain for Parabix programming makes it easy to explore parallel bit stream programming both for prototyping and for building interesting applications.

The tool chain consists of the following components.

  • The Python character set compiler, which you can check out of svn with the command:
svn co http://parabix.costar.sfu.ca/svn/proto/charsetcompiler
  • The Python Pablo compiler, which you can check out of svn with the command:
svn co http://parabix.costar.sfu.ca/svn/proto/Compiler
  • The IDISA simd libraries, which you can export from svn with the command:
svn export http://parabix.costar.sfu.ca/svn/trunk/lib simd-lib
  • C++ template driver programs, such as that for parabix2, which you can export with the command:
svn export http://parabix.costar.sfu.ca/svn/proto/parabix2/pablo_template.cpp pablo_template.cpp
  • g++

Example Application: UTF-8 Validation

UTF-8 is an encoding of Unicode codepoints using sequences of 8-bit code units.

UTF-8 classifies the bytes according to the following rules.

  • Bytes in the range 0x00-0x7F are single-byte UTF-8 sequences.
  • Bytes in the range 0xC2-0xDF are prefix (i.e., first) bytes of two-byte UTF-8 sequences.
  • Bytes in the range 0xE0-0xEF are prefix bytes of three-byte UTF-8 sequences.
  • Bytes in the range 0xF0-0xF4 are prefix bytes of four-byte UTF-8 sequences.
  • Bytes in the range 0x80-0xBF are suffix bytes, used for the second byte of two-byte UTF-8 sequences, the second and third bytes of three-byte UTF-8 sequences and the second, third and fourth bytes of four-byte UTF-8 sequences.
  • Bytes in the range 0xC0-0xC1 and 0xF5-0xFF are invalid prefix bytes.
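The classification above can be sketched as a small byte-at-a-time Python function. This is only an illustration of the ranges listed; the name classify_byte and the class labels it returns are made up for this example and are not part of the tool chain.

```python
def classify_byte(b):
    """Return the UTF-8 class of a single byte value (0-255)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "unibyte"    # single-byte sequence
    if b <= 0xBF:
        return "suffix"     # 0x80-0xBF
    if b <= 0xC1:
        return "invalid"    # 0xC0-0xC1
    if b <= 0xDF:
        return "prefix2"    # 0xC2-0xDF
    if b <= 0xEF:
        return "prefix3"    # 0xE0-0xEF
    if b <= 0xF4:
        return "prefix4"    # 0xF0-0xF4
    return "invalid"        # 0xF5-0xFF


# Classify each byte of a short UTF-8 encoded string.
for b in "a\u00e9\u20ac".encode("utf-8"):
    print("0x{:02X} {}".format(b, classify_byte(b)))
```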

To determine that a byte sequence is valid UTF-8, the following rules must be checked.

  • No invalid prefix bytes are allowed.
  • After an n-byte UTF-8 prefix, the immediately following n-1 bytes must be suffix bytes.
  • Suffix bytes may never be used except in the n-1 byte positions following an n-byte UTF-8 prefix.
  • Suffix bytes in the range 0x80-0x9F are not allowed immediately after a 0xE0 prefix byte.
  • Suffix bytes in the range 0xA0-0xBF are not allowed immediately after a 0xED prefix byte.
  • Suffix bytes in the range 0x80-0x8F are not allowed immediately after a 0xF0 prefix byte.
  • Suffix bytes in the range 0x90-0xBF are not allowed immediately after a 0xF4 prefix byte.
  • The file may not terminate with an incomplete n-byte UTF-8 sequence.
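As a point of reference, the rules above can be checked one byte at a time. The following Python sketch is a conventional sequential validator, not the parallel bit stream method this page develops; the function name is_valid_utf8 is chosen for this example only.

```python
def is_valid_utf8(data):
    """Check a bytes object against the UTF-8 validation rules, byte by byte."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            return False  # stray suffix byte or invalid prefix byte
        if i + length > n:
            return False  # incomplete sequence at end of data
        tail = data[i + 1:i + length]
        if any(not 0x80 <= s <= 0xBF for s in tail):
            return False  # a required suffix byte is missing
        second = tail[0]
        # Restrictions on the byte immediately after certain prefixes.
        if b == 0xE0 and second <= 0x9F:
            return False
        if b == 0xED and second >= 0xA0:
            return False
        if b == 0xF0 and second <= 0x8F:
            return False
        if b == 0xF4 and second >= 0x90:
            return False
        i += length
    return True


print(is_valid_utf8("a\u00e9\u20ac".encode("utf-8")))
print(is_valid_utf8(b"\xe0\x80\x80"))
```

A validator of this kind is handy for cross-checking the output of a bit stream implementation on test files.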

Defining the Character Classes

u8.unibyte = [\x00-\x7F]
u8.prefix = [\xC0-\xFF]
u8.prefix2 = [\xC0-\xDF]
u8.prefix3 = [\xE0-\xEF]
u8.prefix4 = [\xF0-\xFF]
u8.suffix = [\x80-\xBF]
# For 2 byte-sequence validation
u8.badprefix2 = [\xC0-\xC1]
# For 3 byte-sequence validation
u8.xE0 = [\xE0]
u8.xED = [\xED]
u8.xA0_xBF = [\xA0-\xBF]
u8.x80_x9F = [\x80-\x9F]
# 4 byte sequence validation
u8.badprefix4 = [\xF5-\xFF]
u8.xF0 = [\xF0]
u8.xF4 = [\xF4]
u8.x90_xBF = [\x90-\xBF]
u8.x80_x8F = [\x80-\x8F]

The character set compiler is then run on these definitions:

python ../charsetcompiler/charset_compiler.py ../charsetcompiler/inputs/UTF8

Here is the output:

        u8.unibyte = (~basis_bits.bit_0)
        u8.prefix = (basis_bits.bit_0 & basis_bits.bit_1)
        u8.prefix2 = (u8.prefix &~ basis_bits.bit_2)
        temp1 = (basis_bits.bit_2 &~ basis_bits.bit_3)
        u8.prefix3 = (u8.prefix & temp1)
        temp2 = (basis_bits.bit_2 & basis_bits.bit_3)
        u8.prefix4 = (u8.prefix & temp2)
        u8.suffix = (basis_bits.bit_0 &~ basis_bits.bit_1)
        temp3 = (basis_bits.bit_2 | basis_bits.bit_3)
        temp4 = (u8.prefix &~ temp3)
        temp5 = (basis_bits.bit_4 | basis_bits.bit_5)
        temp6 = (temp5 | basis_bits.bit_6)
        u8.badprefix2 = (temp4 &~ temp6)
        temp7 = (basis_bits.bit_6 | basis_bits.bit_7)
        temp8 = (temp5 | temp7)
        u8.xE0 = (u8.prefix3 &~ temp8)
        temp9 = (basis_bits.bit_4 & basis_bits.bit_5)
        temp10 = (basis_bits.bit_7 &~ basis_bits.bit_6)
        temp11 = (temp9 & temp10)
        u8.xED = (u8.prefix3 & temp11)
        u8.xA0_xBF = (u8.suffix & basis_bits.bit_2)
        u8.x80_x9F = (u8.suffix &~ basis_bits.bit_2)
        temp12 = (basis_bits.bit_5 & temp7)
        temp13 = (basis_bits.bit_4 | temp12)
        u8.badprefix4 = (u8.prefix4 & temp13)
        u8.xF0 = (u8.prefix4 &~ temp8)
        temp14 = (basis_bits.bit_5 &~ basis_bits.bit_4)
        temp15 = (temp14 &~ temp7)
        u8.xF4 = (u8.prefix4 & temp15)
        u8.x90_xBF = (u8.suffix & temp3)
        u8.x80_x8F = (u8.suffix &~ temp3)
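To see what these equations compute, the bit streams can be simulated in plain Python, using one unbounded integer per stream with bit i standing for byte position i. This is a sketch under assumptions: the helper names transpose, classify and positions are invented for the illustration, only some of the equations are reproduced, and bit_0 is taken to be the most significant bit of each byte (which is what u8.unibyte = ~bit_0 implies).

```python
def transpose(data):
    """Return 8 basis bit streams as ints.

    Stream k has bit i set when bit k of data[i] is 1, where k = 0 is
    the most significant bit of the byte (an assumption, see above).
    """
    bits = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if byte & (0x80 >> k):
                bits[k] |= 1 << i
    return bits


def classify(data):
    """Apply a subset of the compiler-generated equations to the data."""
    bit = transpose(data)
    mask = (1 << len(data)) - 1  # restricts ~ to positions that exist
    u8 = {}
    u8["unibyte"] = ~bit[0] & mask
    u8["prefix"] = bit[0] & bit[1]
    u8["prefix2"] = u8["prefix"] & ~bit[2]
    temp1 = bit[2] & ~bit[3]
    u8["prefix3"] = u8["prefix"] & temp1
    temp2 = bit[2] & bit[3]
    u8["prefix4"] = u8["prefix"] & temp2
    u8["suffix"] = bit[0] & ~bit[1]
    temp3 = bit[2] | bit[3]
    u8["x90_xBF"] = u8["suffix"] & temp3
    u8["x80_x8F"] = u8["suffix"] & ~temp3
    return u8


def positions(stream):
    """List the byte positions marked in a stream."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]


# Bytes of the sample text: 61 C3 A9 E2 82 AC
u8 = classify("a\u00e9\u20ac".encode("utf-8"))
for name in ("unibyte", "prefix2", "prefix3", "suffix"):
    print(name, positions(u8[name]))
```

Each equation is evaluated once for all byte positions at the same time, which is the point of the parallel bit stream representation.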

The Complete Pablo Program

To build the complete Pablo program for UTF-8 validation, we need the following components.

  • Class definitions for the basis bit streams (the 8 parallel bit streams resulting from transposition), the u8 character class streams, and the error stream.
  • The Pablo function classify_bytes(basis_bits, u8) which incorporates the logic produced by the character class compiler.
  • The Pablo function Validate_utf8(u8, error) which applies the UTF-8 validation rules.
  • The Main program which performs transposition, byte classification and UTF-8 validation.

The complete u8check.pablo shows the details.
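For readers without the file at hand, the following plain Python sketch shows one way the validation rules can be expressed on whole bit streams. It is not the contents of u8check.pablo. Streams are modelled as Python integers with bit i standing for byte position i, the helper names stream, advance and validate_utf8 are invented for the illustration, and the character classes are computed by direct range tests rather than by the compiled classify_bytes logic.

```python
def stream(data, lo, hi):
    """Bit stream (as an int) with bit i set when lo <= data[i] <= hi."""
    s = 0
    for i, b in enumerate(data):
        if lo <= b <= hi:
            s |= 1 << i
    return s


def advance(s, n=1):
    """Move every marker in the stream n positions forward."""
    return s << n


def validate_utf8(data):
    """Return the error stream; zero means the data is valid UTF-8."""
    prefix2 = stream(data, 0xC0, 0xDF)
    prefix3 = stream(data, 0xE0, 0xEF)
    prefix4 = stream(data, 0xF0, 0xFF)
    suffix = stream(data, 0x80, 0xBF)

    # No invalid prefix bytes are allowed.
    error = stream(data, 0xC0, 0xC1) | stream(data, 0xF5, 0xFF)

    # Positions at which a suffix byte is required.
    scope = (advance(prefix2)
             | advance(prefix3) | advance(prefix3, 2)
             | advance(prefix4) | advance(prefix4, 2) | advance(prefix4, 3))
    # Suffix required but absent, or present but not required. A marker
    # advanced past the last byte flags an incomplete final sequence.
    error |= scope ^ suffix

    # Restricted suffix ranges immediately after particular prefixes.
    error |= advance(stream(data, 0xE0, 0xE0)) & stream(data, 0x80, 0x9F)
    error |= advance(stream(data, 0xED, 0xED)) & stream(data, 0xA0, 0xBF)
    error |= advance(stream(data, 0xF0, 0xF0)) & stream(data, 0x80, 0x8F)
    error |= advance(stream(data, 0xF4, 0xF4)) & stream(data, 0x90, 0xBF)
    return error


# The 0xC3 prefix at position 1 is not followed by a suffix byte,
# so position 2 is flagged in the error stream.
print(bin(validate_utf8(b"a\xc3(")))  # prints 0b100
```

Every rule becomes a handful of shifts and bitwise operations applied to all positions at once, with no per-byte branching.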

The PyParabix C++ Template

Given Pablo source code such as that shown above, the Pablo compiler can build a C++ program by inserting code into a C++ template. The template has a number of insertion points, marked as follows.

  • @global: The point for inserting global declarations.
  • @decl: The point for inserting local declarations.
  • @stream_stmts: The point for inserting stream processing statements that are executed only once (for initialization).
  • @block_stmts: The point for inserting the translated Pablo Main program for block-by-block processing.
  • @final_block_stmts: The point for inserting the translated block-by-block statements with end-of-file masking (for the final block of input data).
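Conceptually, filling such a template amounts to replacing each marker with generated text. The Python sketch below illustrates only that idea: the function fill_template, the toy template and the placeholder strings (including more_blocks) are all hypothetical, and the real Pablo compiler's substitution mechanism may work differently.

```python
import re


def fill_template(template, sections):
    """Replace each @marker in the template with its generated text.

    Markers that have no entry in `sections` are left unchanged.
    """
    def substitute(match):
        return sections.get(match.group(0), match.group(0))
    return re.sub(r"@\w+", substitute, template)


# A toy template; the C++ text here is placeholder material only.
template = """@global
int main() {
  @decl
  @stream_stmts
  while (more_blocks()) {
    @block_stmts
  }
  @final_block_stmts
}
"""

sections = {
    "@global": "/* generated global declarations */",
    "@decl": "/* generated local declarations */",
    "@stream_stmts": "/* generated one-time initialization */",
    "@block_stmts": "/* generated per-block statements */",
    "@final_block_stmts": "/* generated final-block statements */",
}

print(fill_template(template, sections))
```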

The complete u8check_template.cpp shows the template for this application.

Generating a cpp file

Given the Pablo program and the template, we can now generate the standalone C++ program.

python ../Compiler/pablomain.py u8check.pablo -t u8check_template.cpp -o u8check.cpp

Compiling the cpp

Compiling with g++ is straightforward, assuming the simd-lib is available in the working directory.

g++ u8check.cpp -I. -o u8check

Debug Output

If the -a flag is added when running the Python Pablo compiler, the generated code is instrumented to print every bitstream as each block is processed. Note that the output is in hexadecimal and the bitstreams are represented in little-endian form (i.e., right-to-left).
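The right-to-left convention can be illustrated with a short Python sketch. The exact layout of the -a output is not reproduced here; the function stream_to_hex and the 128-bit block width are assumptions made for the example.

```python
BLOCK_BITS = 128  # assumed block width, for illustration only


def stream_to_hex(positions, block_bits=BLOCK_BITS):
    """Render a block as hex, with stream position 0 in the rightmost bit."""
    value = 0
    for p in positions:
        value |= 1 << p
    return format(value, "0{}x".format(block_bits // 4))


# Markers at stream positions 0, 4 and 5 show up in the last two hex digits.
print(stream_to_hex([0, 4, 5]))
```

Under this convention the earliest positions of a block appear at the right-hand end of the printed value, so a dump is read from right to left.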

Download and Testing

The full application can be downloaded with the command:
svn export http://parabix.costar.sfu.ca/svn/proto/u8check u8check

The TestFiles directory contains useful test data files.

Updated Fri Jan. 19 2018, 08:27 by cameron.