The Parabix Log2 Bounded Repetition Technique

Bounded Repetition

A bounded repetition is a regular expression of the form \(E\{m,n\}\) where:

\(E\) is a regular expression and \(m\) and \(n\) are non-negative integers such that \(m \leq n\).
\(E\{m,n\}\) denotes a set of strings that may be formed by \(k\) substrings such that \(m \leq k \leq n\) and each substring matches \(E\).
The integers \(m\) and \(n\) are respectively known as the lower bound and the upper bound of the expression.

For example, the bounded repetition [0-9]{3,8} denotes the set of strings of digits whose length is at least 3 and at most 8. When \(m=n\), the notation \(E\{m\}\) may be used. When \(m=0\), the notation \(E\{,n\}\) may be used. The notation \(E\{m,\}\) denotes a repetition with a lower bound but no upper bound, equivalent to \(E\{m\}E*\).

Bounded Repetition of Single Character Expressions

Consider a bounded repetition is of the form \(C\{m,n\}\), where \(C\) is a single character or character class.

In this event, Parabix methods allow an efficient implementation involving only \(\lceil \log_2 m \rceil + \lceil \log_2{n-m}\rceil\) steps.
The technique is fully bitwise data parallel, allowing all matching instances to be found using a fixed number of long-stream operations.

We reformulate \(C\{m,n\}\) as \(C\{m\}C\{0,n-m\}\) and describe matching steps for each component expression.

Lower bound matching for \(C\{m\}\)
Upper bound matching for \(C\{0,n-m\}\)

Each of these involve the determination of runs of consecutive one bits.

Runs of Consecutive Bits

Define \(\text{Run}(C, m)\).

The stream which denotes those positions in the input stream such that the immediately prior \(m\) characters are all within the character class \(C\).
In other words, the immediately prior \(m\) bits of the character class stream \(C\) are all one bits.
Calculated recursively, as follows: \[\text{Run}(C, 1) = C \,\text{<<}\, 1\] \[\text{Run}(C, k) = \text{Run}(C, \lceil k/2\rceil) \wedge (\text{Run}(C, \lceil k/2\rceil) \,\text{<<}\, (k - \lceil k/2 \rceil))\]

For example, consider a run of 13 characters within matching \(C\). \[\text{Run}(C, 13) = \text{Run}(C, 7) \wedge (\text{Run}(C, 7) \,\text{<<}\, 6)\] This stream will have a one bit at (and only at) those positions that have 1 bits in all 7 immediately prior positions (\(\text{Run}(C, 7)\)), as well has having a one bit in the in each of the 7 positions from 6 positions prior to 13 positions prior. Three recursive steps complete the calculation. \[\text{Run}(C, 7) = \text{Run}(C, 4) \wedge (\text{Run}(C, 4) \,\text{<<}\, 3)\] \[\text{Run}(C, 4)= \text{Run}(C, 2) \wedge (\text{Run}(C, 2) \,\text{<<}\, 2)\] \[\text{Run}(C, 2) = \text{Run}(C, 1) \wedge (\text{Run}(C, 1) \,\text{<<}\, 1)\]

Lower-Bound Matching

Consider the lower bound expression \(C\{m\}\).
Assume that we have a stream of starting marker positions \(M_0\).
Goal: compute \(M_1 = \text{Match}(M_0, C\{m\})\).
These are just the positions that are both \(m\) positions past a marker position in \(M_0\) and are at the end of a Run of \(m\) members of \(C\). \[M_1 = (M_0 \,\text{<<}\, m) \wedge \text{Run}(C, m)\]

Note: this calculation is fully data parallel!

Upper-Bound Matching

Now consider the upper-bound expression: \(C\{0,n-m\}\).
Let \(u = n - m\).
Goal: compute \(M_2 = \text{Match}(M_1, C\{0,u\})\).

Method.

Find the positions that are reachable by any number of repetitions of the character class \(C\) from positions marked in \(M_1\).
Filter out from these positions those that are more than \(u\) positions past match positions identified by \(M_1\).

\(\text{MatchStar}(M_1, C)\) produces the set of all positions that are reachable from a position in \(M_1\) through a continous run of characters in \(C\).
Positions that are more than \(u\) positions past match positions in \(M_1\) are those that are preceded by at least \(u\) consecutive zero bits in \(M_1\).
Positions satisfying both constraints are calculated as follows. \[M_2 = \text{MatchStar}(M_1, C) \wedge \neg \text{Run}(\neg M_1, u)\]

Bounded Repetitions of UTF-8 Sequences

The bounded repetition technique relies on runs of consecutive zeroes or ones. How can we handle expressions such as .{1000} where . matches any Unicode character encoded as a UTF-8 sequence?

Consider the UTF-8 final stream marking only the final positions of UTF-8 sequences.
We need to calculate consecutive runs of final positions.
Method #1:
- Apply stream extraction to the character class stream, to select only the bits at final positions.
- Use the \(log2\) technique.
- Reverse the extraction using stream deposit.
Method #2:
- Create an indexed advance operation that advances markers through \(m\) positions through a marker stream.

Bounded Repetitions of More Complex Expressions

Can we extend the technique to more complex expressions?

Yes, if the expression includes at least one element that must occur exactly once.
Yes, if the expression is fixed length.

Updated Fri Feb. 02 2018, 09:22 by cameron.

Simon Fraser University
Engaging the World

CourSys