Simpleton Regular Expression Matching in Haskell
Today, we consider the topic of regular expression matching in Haskell as a first full Haskell program. In this program, we will be applying a "simpleton" approach to matching; one that is not necessarily efficient but is a simple direct matching algorithm to illustrate Haskell's symbolic data representations, recursive function definitions, list processing and pattern matching.
The RE Symbolic Data Domain
Recall our definition of basic regular expressions over an alphabet S.
- Base Rule 1: ε is a regular expression that stands for the the empty string.
- Base Rule 2: Any character of S is a regular expression that stands for itself.
- Inductive Rule 1: If A and B are regular expressions, then A B is a regular expression that stands for the strings which are formed by concatenating a string represented by A and a string represented by B.
- Inductive Rule 2: If A and B are regular expressions, then A|B is a regular expression that stands for all the strings which are represented by either A or B.
- Inductive Rule 3: If A is a regular expression, then A* is a regular expression that stands for all the strings which consist of the concatentation of zero or more strings represented by A.
- Inductive Rule 4: If A is a regular expression, then (A) is an equivalent regular expression.
In order to create a data representation for regular expressions in Haskell, we can use Haskells's algebraic data type mechanism to create a one-line data definition!
data RE = Epsilon | Ch Char | Seq RE RE | Alt RE RE | Star RE | Group RE deriving Show
In this data definition, we define the type RE
in terms of six alternative forms, using vertical bars to separate the different forms. Each form begins with a symbolic tag, which simply names the
form. In this case, the six alternative forms have names reflecting the two base rules and four inductive rules in our definition.
Some alternative forms have no subcomponents. The Epsilon
form has no subcomponents because it simply represent the base case of ε,
matching the empty string.
The alternative form with the tag Ch
has a subcomponent of type Char
, Haskell's built-in type for characters. This corresponds to the second base case in our inductive definition.
All the other forms have components of type RE
. These are all recursive cases, i.e., we define the structure of an RE
in terms of subexpressions which are also RE
s.
The deriving Show
syntax is a declaration that we can add to any Haskell algebraic data type, to indicate that we want a show
function to be able to print values of our type.
Using this data declaration, we can then immediately create
data objects representing regular expressions. For example,
the regular expression ab*d
is represented Seq (Seq (Ch 'a') (Star (Ch 'b'))) (Ch 'd')
.
Defining a Regular Expression Matching Function
Given our definition of the RE
data domain, we can now consider
how we might define a function to determine whether a given regular expression matches a string.
We give our function match
the following Haskell type definition.
match :: RE -> [Char] -> Bool
This is a type signature for our function meaning that, given a symbolic RE
and a character string, match
returns
a boolean value. Here [Char]
is the type of character strings in Haskell, simply lists of characters. Note that all type names in Haskell begin with a capital letter (as do the tags in algebraic data type definitions).
To implement functions in Haskell, we generally use equations and pattern matching. This often gives very concise definitions.
Based on our algebraic data type for RE
, we can expect to
see a series of equations, one for each alternative form.
The left hand side of the equation will have Haskell patterns for the arguments.
match Epsilon s = ...
match (Ch c) s = ...
match (Alt r1 r2) s = ...
match (Seq r1 r2) s = ...
match (Star r1) s = ...
match (Group r1) s = ...
In each case the patterns here tell us the type of regular
expression for the given equation, as well as defining
variable names to the components of the pattern. For
example, the second equation is for a Ch
structure, with variable c
being the specific Char
inside that structure.
Now let us work through the logic of simple pattern matching
to fill in the right hand sides of these equations.
The first case for the Epsilon
object is easy: we have a match only when the string is empty.
match Epsilon s = s == ""
Here, we are using Epsilon
as a pattern for the first
argument, and an ordinary variable s
for the second argument.
The second base case is for matching a single character. In this case, the empty string does not match. Nonempty strings match if the first character of the string matches the given character and there are no more characters afterwards. We use two equations, with patterns for both arguments in each case.
match (Ch a) "" = False
match (Ch a) (c : more_chars) = a == c && more_chars == []
In this example, c : more_chars
is a pattern for a list consisting of a first element c
and remaining elements more_chars
. Pattern matching does two things: check whether
the input data is of the right pattern, and then bind pattern variables to components of the input. For example, in matching
c : more_chars
against a string "DEFG", c
is bound to "D" and more_chars
is bound to "EFG".
More details on pattern matching can be found in the Haskell Wikibook pattern matching section.
Now consider how we match Alt
forms. In this case,
we have a recursive regular expression type, which leads to
natural recursive calls in the corresponding equation.
match (Alt r1 r2) string = match r1 string || match r2 string
Matching the Group
regular expressions is also easy. These
expression just arise from embedding a regular expression in parentheses, so matching a group just requires matching the inner regular expression.
match (Group r1) s = match r1 s
The situation for Seq
forms is a bit more complex.
A match of Seq r1 r2
with a string "abc" succeeds
in four possible cases.
r1
matches "" andr2
matches "abc"r1
matches "a" andr2
matches "bc"r1
matches "ab" andr2
matches "c"r1
matches "abc" andr2
matches ""
We can implement this case using two helper functions.
The first function splits
returns a list of all
the ways of splitting a string into two substrings.
The second function match_any_split
returns True
if any of the splits are matched.
match (Seq r1 r2) string = match_any_split r1 r2 (splits string)
The idea of the splits
function is to give us a list of all the ways of splitting a string into two concatenated substrings.
splits "abc" = [("", "abc"), ("a", "bc"), ("ab", "c"), ("abc", "")]
The signature of this function is
splits :: [Char] -> [([Char], [Char])]
To implement the splits
function, consider the relationship
between splits "abc"
given above and the recursive call splits "bc"
on the substring after the first character.
splits "bc" = [("", "bc"), ("b", "c"), ("bc", "")]
What is necessary is to concatenate "a" to the first
element of each of these recursively calculated splits, and also
add the split ("", "abc")
. We can do this with the
help of another helper function combine_all
.
splits "" = [("", "")]
splits (c : chars) = combine_all c (splits chars)
combine_all :: Char -> [([Char], [Char])] -> [([Char], [Char])]
combine_all c [] = []
combine_all c (("", s): more) = ([c], s) : ("", c:s) : (combine_all c more)
combine_all c ((a:as, s): more) = (c:a:as, s) : (combine_all c more)
The match_any_split
function is then straightforward. However,
we now use a new Haskell feature, guarded equations, one of Haskell's
elegant control structures.
match_any_split r1 r2 [] = False
match_any_split r1 r2 ((s1, s2) : more_splits)
| match r1 s1 && match r2 s2 = True
| otherwise = match_any_split r1 r2 more_splits
Our final case to consider is how to match Star r
forms.
We can try the following.
match (Star r1) "" = True
match (Star r1) s = match_any_split r1 (Star r1) (splits s)
This will often work, but may lead to problems. Why?
An infinite recursion may result if r1
can match the empty
string. In this case, we will end up with a recursive call that
makes no progress. In order to avoid this, we need to ensure that
r1
matches at least one character.
match (Star r1) "" = True
match (Star r1) s = match_any_nonempty_split r1 (Star r1) (splits s)
match_any_nonempty_split r1 r2 [] = False
match_any_nonempty_split r1 r2 ((s1, s2) : more)
| s1 /= "" && match r1 s1 && match r2 s2 = True
| otherwise = match_any_nonempty_split r1 r2 more
Finally, to make sure that we have a legal Haskell function, we reorganize our code to first include our data type definition, then our signature declarations for each function, followed by the individual function definitions each implemented as a complete set of equations before moving on to the next.