The Parabix Log2 Bounded Repetition Technique
Bounded Repetition
A bounded repetition is a regular expression of the form \(E\{m,n\}\) where:
- \(E\) is a regular expression and \(m\) and \(n\) are non-negative integers such that \(m \leq n\).
- \(E\{m,n\}\) denotes a set of strings that may be formed by \(k\) substrings such that \(m \leq k \leq n\) and each substring matches \(E\).
- The integers \(m\) and \(n\) are respectively known as the lower bound and the upper bound of the expression.
For example, the bounded repetition [0-9]{3,8}
denotes the set of strings of digits whose length is at least 3 and at most 8.
When \(m=n\), the notation \(E\{m\}\) may be used. When \(m=0\), the notation \(E\{,n\}\) may be used. The notation \(E\{m,\}\) denotes a repetition
with a lower bound but no upper bound, equivalent to \(E\{m\}E*\).
Bounded Repetition of Single Character Expressions
Consider a bounded repetition is of the form \(C\{m,n\}\), where \(C\) is a single character or character class.
- In this event, Parabix methods allow an efficient implementation involving only \(\lceil \log_2 m \rceil + \lceil \log_2{n-m}\rceil\) steps.
- The technique is fully bitwise data parallel, allowing all matching instances to be found using a fixed number of long-stream operations.
We reformulate \(C\{m,n\}\) as \(C\{m\}C\{0,n-m\}\) and describe matching steps for each component expression.
- Lower bound matching for \(C\{m\}\)
- Upper bound matching for \(C\{0,n-m\}\)
Each of these involve the determination of runs of consecutive one bits.
Runs of Consecutive Bits
Define \(\text{Run}(C, m)\).
- The stream which denotes those positions in the input stream such that the immediately prior \(m\) characters are all within the character class \(C\).
- In other words, the immediately prior \(m\) bits of the character class stream \(C\) are all one bits.
- Calculated recursively, as follows: \[\text{Run}(C, 1) = C \,\text{<<}\, 1\] \[\text{Run}(C, k) = \text{Run}(C, \lceil k/2\rceil) \wedge (\text{Run}(C, \lceil k/2\rceil) \,\text{<<}\, (k - \lceil k/2 \rceil))\]
For example, consider a run of 13 characters within matching \(C\). \[\text{Run}(C, 13) = \text{Run}(C, 7) \wedge (\text{Run}(C, 7) \,\text{<<}\, 6)\] This stream will have a one bit at (and only at) those positions that have 1 bits in all 7 immediately prior positions (\(\text{Run}(C, 7)\)), as well has having a one bit in the in each of the 7 positions from 6 positions prior to 13 positions prior. Three recursive steps complete the calculation. \[\text{Run}(C, 7) = \text{Run}(C, 4) \wedge (\text{Run}(C, 4) \,\text{<<}\, 3)\] \[\text{Run}(C, 4)= \text{Run}(C, 2) \wedge (\text{Run}(C, 2) \,\text{<<}\, 2)\] \[\text{Run}(C, 2) = \text{Run}(C, 1) \wedge (\text{Run}(C, 1) \,\text{<<}\, 1)\]
Lower-Bound Matching
- Consider the lower bound expression \(C\{m\}\).
- Assume that we have a stream of starting marker positions \(M_0\).
- Goal: compute \(M_1 = \text{Match}(M_0, C\{m\})\).
- These are just the positions that are both \(m\) positions past a marker position in \(M_0\) and are at the end of a Run of \(m\) members of \(C\). \[M_1 = (M_0 \,\text{<<}\, m) \wedge \text{Run}(C, m)\]
- Note: this calculation is fully data parallel!
Upper-Bound Matching
- Now consider the upper-bound expression: \(C\{0,n-m\}\).
- Let \(u = n - m\).
- Goal: compute \(M_2 = \text{Match}(M_1, C\{0,u\})\).
Method.
- Find the positions that are reachable by any number of repetitions of the character class \(C\) from positions marked in \(M_1\).
- Filter out from these positions those that are more than \(u\) positions past match positions identified by \(M_1\).
- \(\text{MatchStar}(M_1, C)\) produces the set of all positions that are reachable from a position in \(M_1\) through a continous run of characters in \(C\).
- Positions that are more than \(u\) positions past match positions in \(M_1\) are those that are preceded by at least \(u\) consecutive zero bits in \(M_1\).
- Positions satisfying both constraints are calculated as follows. \[M_2 = \text{MatchStar}(M_1, C) \wedge \neg \text{Run}(\neg M_1, u)\]
Bounded Repetitions of UTF-8 Sequences
The bounded repetition technique relies on runs of consecutive zeroes or ones. How can we handle expressions such as .{1000}
where .
matches any Unicode character encoded as a UTF-8 sequence?
- Consider the UTF-8 final stream marking only the final positions of UTF-8 sequences.
- We need to calculate consecutive runs of final positions.
- Method #1:
- Apply stream extraction to the character class stream, to select only the bits at final positions.
- Use the \(log2\) technique.
- Reverse the extraction using stream deposit.
- Method #2:
- Create an indexed advance operation that advances markers through \(m\) positions through a marker stream.
Bounded Repetitions of More Complex Expressions
Can we extend the technique to more complex expressions?
- Yes, if the expression includes at least one element that must occur exactly once.
- Yes, if the expression is fixed length.