Regular Expression Matching with Glushkov Automata

Finite Automata

A finite automaton is an abstract machine that has a finite number of states, and a set of transitions between states that may be taken upon recognition of symbols from an input stream. It may be represented as a directed graph in which nodes represent states and labelled arcs represent transitions. The labels on the arcs identify the input symbols that enable a transition from one state to another.

Consider the following example automata (Cornell University).

These automata correspond to the following regular expressions.

ab*
a(b*|bcb)

In these diagrams, states are numbered sequentially. The states with arrows entering from "nowhere" are the initial states, that is, the states used to start the matching process. The double-circled states are the acceptance states, that is the states which, when entered, indicate a complete match to a regular expression.

A deterministic finite automaton (DFA) is one in which the set of transitions from any given state have at most one occurrence of a symbol. That is the choice of state transition for any given state and any given input symbol is unambiguous.

The automata above are DFAs.

A nondeterministic finite automaton (NFA) relaxes the constraint of having unambiguous transitions. Thus, a particular symbol may occur more than once on the set of transitions from a given state.

An NFA may also have an additional type of transition called an epsilon-transition. Such a transition is one that does not consume any input symbols. Epsilon transitions arise naturally from Thompson's method of NFA construction. However, the Glushkov NFA method does not create epsilon-transitions, so we will not deal further with this complication, and confine ourselves to epsilon-free NFAs.

Glushkov Construction Algorithm

The Glushkov constuction algorithm is a straightforward technique that produces an NFA with a very direct correspondence to a regular expression: there is one NFA state for every occurrence of a symbol in the regular expression.

There are three steps to the Glushkov construction algorithm for a regular expression \(E\).

Linearize the regular expression, by adding a sequential subscript to every symbol in the expression. Define one Glushkov state for every such subscripted symbol, plus one initial state.
Add the transitions, by determining:
- The set Start of start symbols of the regular expression.
- The set Final of final symbols of the regular expression.
- The set Pairs of symbol pairs that can occur consecutively.
Remove the subscripts.

Consider the regular expression example (a(ab)*)*|(ba)* used in the Wikipedia article on Glushkov's algorithm.

Linearization

We can linearize this example by adding sequential position numbers to each symbol (character), as follows.

\((\mathtt{a}_1(\mathtt{a}_2\mathtt{b}_3)*)*|(\mathtt{b}_4\mathtt{a}_5)*\)

Glushkov States

The Glushkov automata will have one state for each subscripted symbol in the linearized expression, plus the initial state \(S_0\). Thus, the states are \({S_0, \mathtt{a}_1, \mathtt{a}_2, \mathtt{b}_3, \mathtt{b}_4, \mathtt{a}_5}\). Alternatively, we can represent each of the state by its position index alone, i.e., \(\{0, 1, 2, 3, 4, 5\}\).

Start Symbols

The set of Start symbols is just the set of symbols that can occur first in a string matched by the expression.

\(\mathtt{Start} = \{\mathtt{a}_1, \mathtt{b}_4\}\)

Task 1: Given a Haskell regular expression object, write a function which determines the numeric symbol indices for all the start symbols.

Final Symbols

The set of Final symbols is just the set of symbols that can occur last in a string matched by the expression.

\(\mathtt{Final} = \{\mathtt{a}_1, \mathtt{b}_3, \mathtt{a}_5\}\)

Task 2: Given a Haskell regular expression object, write a function which determines the numeric symbol indices for all the final symbols.

Symbol Pairs

The set of symbol pairs are the symbols that can appear consecutively in a string matched by the expression.

\(\mathtt{Pairs} = \{(\mathtt{a}_1, \mathtt{a}_1), (\mathtt{a}_1, \mathtt{a}_2), (\mathtt{a}_2, \mathtt{b}_3), (\mathtt{b}_3, \mathtt{a}_1), (\mathtt{b}_3, \mathtt{a}_2), (\mathtt{b}_4, \mathtt{a}_5), (\mathtt{a}_5, \mathtt{b}_4)\}\)

Task 3: Given a Haskell regular expression object, write a function which determines the pairs of numeric symbol indices for all the symbol Pairs.

Constructing the NFA

Given the Start, Final and Pair information an NFA diagram can now be constructed. One node is created for each state. One arc from the initial state to each of the Start symbols is created, with the label on the arc being the character (i.e., the state label without its numerical suffix). One arc is created for each 2-tuple in Pairs, extending from the state corresponding to the first component of the 2-tuple to the state corresponding to the second component of the 2-tuple. The final states are marked as accept states.

Notice that the automaton is nondeterministic: some states have more than one outgoing transition with labelled with the same symbol. This means that, at any point in processing the text, there is more than one legal state for the NFA. However, we can represent the set of legal NFA states as a bit vector, with one bit per NFA state.

Bit-Parallel NFA Simulation

Regular expression matching using the Glushkov automaton can now be carried out using a bit-parallel method. Two tables are constructed.

The tables use the concept of a bit vector of states. For example, the state vector 100011 = 35 represents the set of NFA states \(\{0, 1, 5\}\). (Bits are numbered from right-to-left).

\(B\) is a table indexed by the characters of the alphabet indicating the bit vector of NFA states that may be reached (regardless of starting stage) after that character.

In our example, we have the following table entries.

\(B[\mathtt{a}] = \{1, 2, 5\} = \mathtt{100110}\)

\(B[\mathtt{b}] = \{3, 4\} = \mathtt{011000}\)

The second table \(D\) maps bit-vector state entries to the vectors of all legal states that can be reached by transitions in the First and Pairs relationships.

The states that can be reached by transition from state 0 (initial state) are \(\{1, 4\}\), represented by the bit vector 010010. The states that can be reached by transition from state 1 are \(\{1, 2\}\) represented 000110. From state 5, the only state that can be reached is 4 represented 010000. The entry for the state vector 100011 is just the bitwise-or (Haskell operation .|. from Data.Bits) of all these entries, 010110.

In addition to these two tables, the bit vector of final states \(F\) is computed. This bit vector includes the states corresponding to symbols in the Final set. If the regular expression can also match the empty string, then the initial state must also be included. Thus \(F = \mathtt{101011}\) in our example.

Regular Expression Matching

Now the tables can be used to perform regular expression matching as follows. The matching process proceeds from an initial state vector \(S_0 = \mathtt{000001}\), examining characters and updating the state vector each time. If the current state vector is \(S_i\) and the current input character is \(c\), then the next state \(S_{i+1}\) is computed as follows.

\(S_{i+1} = D[S_i] \wedge B[c]\)

Here the bitwise-and operation (\(\wedge\)) is available as the Haskell .&. operator from Data.Bits. The matching process concludes after the final character is processed, at which time a match is found if the the state vector includes any of the states in \(F\). That is, if \(S_n \wedge F\) is nonzero, where \(n\) is the length of the input string.

Regular Expression Search

To search for instances of a regular expression within a string rather than requiring that the entire string be matched, two modifications must be made. The first is that a match could start anywhere, so the initial state is reachable from every other state. (How are \(B\) and \(D\) modified?) The second is that we check for an intersection with the \(F\) final state vector after processing every character, that is a match exists as soon as a final state is entered.

How are \(B\) and \(D\) modified? The idea is that we can start the match anywhere in a line we are processing. In other words the initial state (\(\{0\} = \mathtt{000001}\), in our example) is possible after any character and after any state. So the required modifications to the tables are to include this state for every entry in both tables. That is every entry in both tables has a 1 bit in the state 0 position.

Handling Character Classes

If we extend the regular expression syntax to allow character classes, how can we apply the Glushkov method?

Updated Thu Nov. 29 2018, 11:59 by cameron.

Simon Fraser University
Engaging the World

CourSys