Simpleton Regular Expression Matching in Haskell

Today, we consider the topic of regular expression matching in Haskell as a first full Haskell program. In this program, we will be applying a "simpleton" approach to matching; one that is not necessarily efficient but is a simple direct matching algorithm to illustrate Haskell's symbolic data representations, recursive function definitions, list processing and pattern matching.

The RE Symbolic Data Domain

Recall our definition of basic regular expressions over an alphabet S.

Base Rule 1: ε is a regular expression that stands for the the empty string.
Base Rule 2: Any character of S is a regular expression that stands for itself.
Inductive Rule 1: If A and B are regular expressions, then A B is a regular expression that stands for the strings which are formed by concatenating a string represented by A and a string represented by B.
Inductive Rule 2: If A and B are regular expressions, then A|B is a regular expression that stands for all the strings which are represented by either A or B.
Inductive Rule 3: If A is a regular expression, then A* is a regular expression that stands for all the strings which consist of the concatentation of zero or more strings represented by A.
Inductive Rule 4: If A is a regular expression, then (A) is an equivalent regular expression.

In order to create a data representation for regular expressions in Haskell, we can use Haskells's algebraic data type mechanism to create a one-line data definition!

data RE = Epsilon | Ch Char | Seq RE RE | Alt RE RE | Star RE | Group RE deriving Show

In this data definition, we define the type RE in terms of six alternative forms, using vertical bars to separate the different forms. Each form begins with a symbolic tag, which simply names the form. In this case, the six alternative forms have names reflecting the two base rules and four inductive rules in our definition.

Some alternative forms have no subcomponents. The Epsilon form has no subcomponents because it simply represent the base case of ε, matching the empty string.

The alternative form with the tag Ch has a subcomponent of type Char, Haskell's built-in type for characters. This corresponds to the second base case in our inductive definition.

All the other forms have components of type RE. These are all recursive cases, i.e., we define the structure of an RE in terms of subexpressions which are also REs.

The deriving Show syntax is a declaration that we can add to any Haskell algebraic data type, to indicate that we want a show function to be able to print values of our type.

Using this data declaration, we can then immediately create data objects representing regular expressions. For example, the regular expression ab*d is represented Seq (Seq (Ch 'a') (Star (Ch 'b'))) (Ch 'd').

Defining a Regular Expression Matching Function

Given our definition of the RE data domain, we can now consider how we might define a function to determine whether a given regular expression matches a string.

We give our function match the following Haskell type definition.

match :: RE -> [Char] -> Bool

This is a type signature for our function meaning that, given a symbolic RE and a character string, match returns a boolean value. Here [Char] is the type of character strings in Haskell, simply lists of characters. Note that all type names in Haskell begin with a capital letter (as do the tags in algebraic data type definitions).

To implement functions in Haskell, we generally use equations and pattern matching. This often gives very concise definitions.

Based on our algebraic data type for RE, we can expect to see a series of equations, one for each alternative form. The left hand side of the equation will have Haskell patterns for the arguments.

match Epsilon s = ...
match (Ch c) s = ...
match (Alt r1 r2) s = ...
match (Seq r1 r2) s = ...
match (Star r1) s = ...
match (Group r1) s = ...

In each case the patterns here tell us the type of regular expression for the given equation, as well as defining variable names to the components of the pattern. For example, the second equation is for a Ch structure, with variable c being the specific Char inside that structure.

Now let us work through the logic of simple pattern matching to fill in the right hand sides of these equations. The first case for the Epsilon object is easy: we have a match only when the string is empty.

match Epsilon s = s == ""

Here, we are using Epsilon as a pattern for the first argument, and an ordinary variable s for the second argument.

The second base case is for matching a single character. In this case, the empty string does not match. Nonempty strings match if the first character of the string matches the given character and there are no more characters afterwards. We use two equations, with patterns for both arguments in each case.

match (Ch a) "" = False
match (Ch a) (c : more_chars) = a == c && more_chars == []

In this example, c : more_chars is a pattern for a list consisting of a first element c and remaining elements more_chars. Pattern matching does two things: check whether the input data is of the right pattern, and then bind pattern variables to components of the input. For example, in matching c : more_chars against a string "DEFG", c is bound to "D" and more_chars is bound to "EFG". More details on pattern matching can be found in the Haskell Wikibook pattern matching section.

Now consider how we match Alt forms. In this case, we have a recursive regular expression type, which leads to natural recursive calls in the corresponding equation.

match (Alt r1 r2) string = match r1 string || match r2 string

Matching the Group regular expressions is also easy. These expression just arise from embedding a regular expression in parentheses, so matching a group just requires matching the inner regular expression.

match (Group r1) s = match r1 s

The situation for Seq forms is a bit more complex. A match of Seq r1 r2 with a string "abc" succeeds in four possible cases.

r1 matches "" and r2 matches "abc"
r1 matches "a" and r2 matches "bc"
r1 matches "ab" and r2 matches "c"
r1 matches "abc" and r2 matches ""

We can implement this case using two helper functions. The first function splits returns a list of all the ways of splitting a string into two substrings. The second function match_any_split returns True if any of the splits are matched.

match (Seq r1 r2) string = match_any_split r1 r2 (splits string)

The idea of the splits function is to give us a list of all the ways of splitting a string into two concatenated substrings.

splits "abc" = [("", "abc"), ("a", "bc"), ("ab", "c"), ("abc", "")]

The signature of this function is

splits :: [Char] -> [([Char], [Char])]

To implement the splits function, consider the relationship between splits "abc" given above and the recursive call splits "bc" on the substring after the first character.

splits "bc" = [("", "bc"), ("b", "c"), ("bc", "")]

What is necessary is to concatenate "a" to the first element of each of these recursively calculated splits, and also add the split ("", "abc"). We can do this with the help of another helper function combine_all.

splits "" = [("", "")]
splits (c : chars) = combine_all c (splits chars)

combine_all :: Char -> [([Char], [Char])] -> [([Char], [Char])]
combine_all c [] = []
combine_all c (("", s): more) = ([c], s) : ("", c:s) : (combine_all c more)
combine_all c ((a:as, s): more) = (c:a:as, s) : (combine_all c more)

The match_any_split function is then straightforward. However, we now use a new Haskell feature, guarded equations, one of Haskell's elegant control structures.

match_any_split r1 r2 [] = False
match_any_split r1 r2 ((s1, s2) : more_splits) 
   | match r1 s1 && match r2 s2     = True
   | otherwise                      = match_any_split r1 r2 more_splits

Our final case to consider is how to match Star r forms. We can try the following.

match (Star r1) "" = True
match (Star r1) s = match_any_split r1 (Star r1) (splits s)

This will often work, but may lead to problems. Why?

An infinite recursion may result if r1 can match the empty string. In this case, we will end up with a recursive call that makes no progress. In order to avoid this, we need to ensure that r1 matches at least one character.

match (Star r1) "" = True
match (Star r1) s = match_any_nonempty_split r1 (Star r1) (splits s)

match_any_nonempty_split r1 r2 [] = False
match_any_nonempty_split r1 r2 ((s1, s2) : more) 
   | s1 /= "" && match r1 s1 && match r2 s2     = True
   | otherwise                                  = match_any_nonempty_split r1 r2 more

Finally, to make sure that we have a legal Haskell function, we reorganize our code to first include our data type definition, then our signature declarations for each function, followed by the individual function definitions each implemented as a complete set of equations before moving on to the next.

Updated Thu Sept. 06 2018, 15:18 by cameron.

Simon Fraser University
Engaging the World

CourSys

Simpleton Regular Expression Matching in Haskell

The RE Symbolic Data Domain

Defining a Regular Expression Matching Function