UTF Compiler
The UTF Compiler compiles sets of Unicode code points into Parabix Pablo code.
The UTF Compiler is designed to generate Pablo code that works with a basis bits representation of a Unicode data stream in one of the standard Unicode representations such as UTF-8.
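As a rough illustration of what a basis bits representation looks like, the following Python sketch transposes a UTF-8 byte sequence into eight bit streams and then uses bitwise logic over those streams to locate one byte value. The function names, the use of Python integers as bit streams, and the bit numbering (stream k holds bit k of each byte, counting from the least significant bit) are assumptions made for this example only; they are not the Parabix implementation or its naming convention.

```python
def transpose_to_basis_bits(data: bytes) -> list[int]:
    """Return 8 bit streams; bit `pos` of stream k is bit k of data[pos]."""
    basis = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                basis[k] |= 1 << pos
    return basis


def match_byte(basis: list[int], value: int, length: int) -> int:
    """Return a bit stream marking every position whose byte equals `value`."""
    mask = (1 << length) - 1  # one bit per input position
    result = mask
    for k in range(8):
        if (value >> k) & 1:
            result &= basis[k]          # bit k must be 1
        else:
            result &= ~basis[k] & mask  # bit k must be 0
    return result


data = "aβc".encode("utf-8")  # bytes 61 CE B2 63
basis = transpose_to_basis_bits(data)
lead_positions = match_byte(basis, 0xCE, len(data))  # marks the CE lead byte
```

The point of the sketch is that a character class becomes a short sequence of AND/NOT operations applied to whole streams at once, rather than a per-byte comparison loop.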
The Pablo code produced by the UTF Compiler can be displayed using the icgrep program. For example,
icgrep -c '\p{Greek}' -ShowPablo file
will show the Pablo code generated for recognizing Greek characters.
Dynamic If-Hierarchy
The UTF Compiler generates code that achieves good performance primarily through an if-hierarchy. This is based on the notion that the input data in each block will generally be confined to a small number of consecutive ranges of Unicode code points. Code for ranges that do not occur in the current data block is therefore skipped. The current if-hierarchy is based on a static, hand-developed structure.
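The skipping idea can be sketched in Python as follows. This is a byte-level stand-in, not the generated Pablo code: the function name, the block size of 64 bytes, and the choice of the Greek and Coptic block (U+0370 to U+03FF, which is not the same set as \p{Greek}) are assumptions for illustration. Each guard is a cheap test over the whole block, and the detailed matching runs only when both guards pass.

```python
def scan_greek_block(data: bytes, block_size: int = 64):
    """Return (lead-byte positions of U+0370..U+03FF, number of skipped blocks)."""
    matches = []
    skipped_blocks = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # Outer if-test: an all-ASCII block cannot contain the target range.
        if not any(b >= 0x80 for b in block):
            skipped_blocks += 1
            continue
        # Inner if-test: U+0370..U+03FF always starts with lead byte CD, CE or CF.
        if not any(0xCD <= b <= 0xCF for b in block):
            skipped_blocks += 1
            continue
        # Detailed matching, reached only when both guards pass.
        for i, b in enumerate(block):
            pos = start + i
            if pos + 1 >= len(data):
                continue
            trail = data[pos + 1]
            if b == 0xCD and 0xB0 <= trail <= 0xBF:           # U+0370..U+037F
                matches.append(pos)
            elif b in (0xCE, 0xCF) and 0x80 <= trail <= 0xBF:  # U+0380..U+03FF
                matches.append(pos)
    return matches, skipped_blocks


text = "a" * 64 + "αβ" + "b" * 62
positions, skipped = scan_greek_block(text.encode("utf-8"))
```

In this sketch the guards are fixed in the source, which mirrors the static nature of the hierarchy described above.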
A better approach would be to derive the if-hierarchy from the actual set of characters being compiled in a given case. The goal would be to use an if-block only when the number of characters it covers is large enough to justify the cost of the if-test and the potential branch misprediction penalty.
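One possible way to build such a hierarchy is sketched below in Python. It is only an illustration of the threshold rule, not a proposed design for the UTF Compiler: the `IfBlock` type, the `build_hierarchy` function, the `min_count` parameter, and the midpoint splitting strategy are all hypothetical. A range is wrapped in an if-block only when it covers at least `min_count` of the code points being compiled; smaller groups are emitted unguarded inside the enclosing block.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class IfBlock:
    lo: int     # lowest code point guarded by this if-test
    hi: int     # highest code point guarded by this if-test
    body: list  # code points emitted directly, or nested IfBlock items


def build_hierarchy(cps: list[int], min_count: int) -> list:
    """Build a nested if-structure for a sorted list of unique code points."""
    # Too few characters to pay for an if-test: emit them unguarded.
    if len(cps) < max(min_count, 2):
        return list(cps)
    lo, hi = cps[0], cps[-1]
    mid = (lo + hi) // 2
    split = bisect_right(cps, mid)  # cps[:split] <= mid < cps[split:]
    body = (build_hierarchy(cps[:split], min_count)
            + build_hierarchy(cps[split:], min_count))
    return [IfBlock(lo, hi, body)]


# 'A', four Greek letters, and one CJK ideograph.
cps = [0x41, 0x3B1, 0x3B2, 0x3B3, 0x3B4, 0x4E00]
hierarchy = build_hierarchy(cps, min_count=4)
```

With `min_count=4`, the four adjacent Greek letters end up under their own guard, while the isolated code points are emitted without one. A real implementation would presumably weigh the cost of each if-test against measured branch behaviour rather than a simple count, and would likely split along UTF-8 byte boundaries instead of numeric midpoints.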