Not logged in. Login

The Parabix Log2 Bounded Repetition Technique

Bounded Repetition

A bounded repetition is a regular expression of the form \(E\{m,n\}\) where:

  • \(E\) is a regular expression and \(m\) and \(n\) are non-negative integers such that \(m \leq n\).
  • \(E\{m,n\}\) denotes a set of strings that may be formed by \(k\) substrings such that \(m \leq k \leq n\) and each substring matches \(E\).
  • The integers \(m\) and \(n\) are respectively known as the lower bound and the upper bound of the expression.

For example, the bounded repetition [0-9]{3,8} denotes the set of strings of digits whose length is at least 3 and at most 8. When \(m=n\), the notation \(E\{m\}\) may be used. When \(m=0\), the notation \(E\{,n\}\) may be used. The notation \(E\{m,\}\) denotes a repetition with a lower bound but no upper bound, equivalent to \(E\{m\}E*\).

Bounded Repetition of Single Character Expressions

Consider a bounded repetition is of the form \(C\{m,n\}\), where \(C\) is a single character or character class.

  • In this event, Parabix methods allow an efficient implementation involving only \(\lceil \log_2 m \rceil + \lceil \log_2{n-m}\rceil\) steps.
  • The technique is fully bitwise data parallel, allowing all matching instances to be found using a fixed number of long-stream operations.

We reformulate \(C\{m,n\}\) as \(C\{m\}C\{0,n-m\}\) and describe matching steps for each component expression.

  • Lower bound matching for \(C\{m\}\)
  • Upper bound matching for \(C\{0,n-m\}\)

Each of these involve the determination of runs of consecutive one bits.

Runs of Consecutive Bits

Define \(\text{Run}(C, m)\).

  • The stream which denotes those positions in the input stream such that the immediately prior \(m\) characters are all within the character class \(C\).
  • In other words, the immediately prior \(m\) bits of the character class stream \(C\) are all one bits.
  • Calculated recursively, as follows: \[\text{Run}(C, 1) = C \,\text{<<}\, 1\] \[\text{Run}(C, k) = \text{Run}(C, \lceil k/2\rceil) \wedge (\text{Run}(C, \lceil k/2\rceil) \,\text{<<}\, (k - \lceil k/2 \rceil))\]

For example, consider a run of 13 characters within matching \(C\). \[\text{Run}(C, 13) = \text{Run}(C, 7) \wedge (\text{Run}(C, 7) \,\text{<<}\, 6)\] This stream will have a one bit at (and only at) those positions that have 1 bits in all 7 immediately prior positions (\(\text{Run}(C, 7)\)), as well has having a one bit in the in each of the 7 positions from 6 positions prior to 13 positions prior. Three recursive steps complete the calculation. \[\text{Run}(C, 7) = \text{Run}(C, 4) \wedge (\text{Run}(C, 4) \,\text{<<}\, 3)\] \[\text{Run}(C, 4)= \text{Run}(C, 2) \wedge (\text{Run}(C, 2) \,\text{<<}\, 2)\] \[\text{Run}(C, 2) = \text{Run}(C, 1) \wedge (\text{Run}(C, 1) \,\text{<<}\, 1)\]

Lower-Bound Matching

  • Consider the lower bound expression \(C\{m\}\).
  • Assume that we have a stream of starting marker positions \(M_0\).
  • Goal: compute \(M_1 = \text{Match}(M_0, C\{m\})\).
  • These are just the positions that are both \(m\) positions past a marker position in \(M_0\) and are at the end of a Run of \(m\) members of \(C\). \[M_1 = (M_0 \,\text{<<}\, m) \wedge \text{Run}(C, m)\]
  • Note: this calculation is fully data parallel!

Upper-Bound Matching

  • Now consider the upper-bound expression: \(C\{0,n-m\}\).
  • Let \(u = n - m\).
  • Goal: compute \(M_2 = \text{Match}(M_1, C\{0,u\})\).

Method.

  1. Find the positions that are reachable by any number of repetitions of the character class \(C\) from positions marked in \(M_1\).
  2. Filter out from these positions those that are more than \(u\) positions past match positions identified by \(M_1\).
  • \(\text{MatchStar}(M_1, C)\) produces the set of all positions that are reachable from a position in \(M_1\) through a continous run of characters in \(C\).
  • Positions that are more than \(u\) positions past match positions in \(M_1\) are those that are preceded by at least \(u\) consecutive zero bits in \(M_1\).
  • Positions satisfying both constraints are calculated as follows. \[M_2 = \text{MatchStar}(M_1, C) \wedge \neg \text{Run}(\neg M_1, u)\]

Bounded Repetitions of UTF-8 Sequences

The bounded repetition technique relies on runs of consecutive zeroes or ones. How can we handle expressions such as .{1000} where . matches any Unicode character encoded as a UTF-8 sequence?

  • Consider the UTF-8 final stream marking only the final positions of UTF-8 sequences.
  • We need to calculate consecutive runs of final positions.
  • Method #1:
    • Apply stream extraction to the character class stream, to select only the bits at final positions.
    • Use the \(log2\) technique.
    • Reverse the extraction using stream deposit.
  • Method #2:
    • Create an indexed advance operation that advances markers through \(m\) positions through a marker stream.

Bounded Repetitions of More Complex Expressions

Can we extend the technique to more complex expressions?

  • Yes, if the expression includes at least one element that must occur exactly once.
  • Yes, if the expression is fixed length.
Updated Fri Feb. 02 2018, 09:22 by cameron.