Parsing: From Symbolic Text to ASTs
The Parsing Problem
The parsing problem is that of transforming strings of text into suitable internal form for manipulation. In our simpleton regular expression example,
how do we convert the regular expression "ab*d
" into the symbolic form Seq (Seq (Chr 'a') (Star (Char 'b'))) (Char 'd')
?
BNF Grammars to Structure Text
In order to tackle the parsing problem, we need clear specifications of how text is to be structured. In order to do this, we rely on BNF grammars.
For our simple regular expression language, the following unambiguous grammar reflects the inductive definition given previously together with our precedence and associativity rules (Kleene-star takes precedence over sequential concatenation, concatenation takes precedence over alternation, both concatenation and alternation are left-associative).
<RE> ::= <seq> | <RE> "|" <seq> <seq> ::= <item> | <seq> <item> <item> ::= <element> | <element> "*" <element> ::= <char> | "(" <RE> ")" <char> ::= any character except "|", "*", "(", ")"
From this grammar, how do we parse the string "ab*d
" as an <RE>
? First note that there is no "|
" in the text, so the <RE>
must be a <seq>
. Can it be a single <item>
? No, because it does not end in "*" and it
does not meet the requirements of a single <element>
.
So, it must be a <seq> <item>
combination, with the
<seq>
parsing as "ab*
" and the <item>
parsing as a <char>
, i.e., "d
". The <seq>
for "ab*
" is in turn a <seq> <item>
for the substrings "a
" and "b*
".
As an <item>
, "b*
" satisfies the second alternative, i.e., as a <element> "*"
, with the <element>
parsing as a <char>
, i.e., "b
".
By this process, we can work out that the structure of
our expression which works out to be Seq (Seq (Chr 'a') (Star (Char 'b'))) (Char 'd')
as desired.
Recursive-Descent Parsers
Many programming languages are designed to be easy to be parsed using top-down parsers written as a set of mutually recursive parsing procedures. These are called recursive-descent parsers.
Wikipedia: Recursive Descent Parser
Indeed, the grammar given here satisfies the requirements for a recursive-descent parser provided that we deal with the problem of left-recursion.
Left-Recursion in Recursive Descent Parsers
The method of recursive-descent parsing must be modified to deal with grammars that have left-recursive productions.
Consider the grammar:
<expr> ::= <var> | <expr> "+" <var> <var> ::= "X" | "Y" | "Z"
Consider writing a parsing procedure expr
.
void expr (void) {
if (Something) {
var();
}
else {
expr();
expect(plus);
var();
}
}
There are two problems. One is that we don't know what
the test Something
should be to distinguish the two alternative forms of <expr>
. The second that, even if we had a test, the else clause has a problem: it immediately makes a recursive call to expr()
again! This would be infinite recursion!
Grammar Rewriting
Many references describe how left-recursive grammars can be rewritten to eliminate the left-recursions. However, the resulting grammars give different parse trees which change the associativity of the operators.
Programming for Left-Recursion
Left-recursive grammar rules can be handled directly in recursive descent parsing, as follows.
- Write the parsing action for all the non-left-recursive cases first.
- Now write a loop that repeatedly tries to extend the parse according to the left-recursive cases.
void expr (void) {
var();
while (sym == plus) {
nextsym();
var();
}
}
Returning to our grammar for regular expressions, it turns out that we have two left-recursive rules. But the same methods apply.
Recursive Descent Parsing in Haskell
A simple recursive descent parser for our regular expression
grammar has one recursive parsing function for each nonterminal:
<RE>
, <seq>
, <item>
, <element>
, <char>
.
Each function parses an instance from the input string
and returns either an error signal (if parsing failed) or an AST for the object and a remaining string to be parsed.
In order to account for errors, we use Haskell's (built-in) Maybe algebraic data type.
parseRE :: [Char] -> Maybe (RE, [Char]) parseSeq :: [Char] -> Maybe (RE, [Char]) parseItem :: [Char] -> Maybe (RE, [Char]) parseElement :: [Char] -> Maybe (RE, [Char]) parseChar :: [Char] -> Maybe (RE, [Char])
Let us deal with these in reverse order.
<RE> ::= <seq> | <RE> "|" <seq> <seq> ::= <item> | <seq> <item> <item> ::= <element> | <element> "*" <element> ::= <char> | "(" <RE> ")" <char> ::= any character except "|", "*", "(", ")"
The parseChar
function is simplest.
parseChar [] = Nothing parseChar (c:s) | c == '|' || c == '*' || c == '(' || c == ')' = Nothing | otherwise = Just ((Ch c), s)
The parseElement
function accepts a <char>
or a
parenthesized <RE>
.
parseElement ('(':more) = case parseRE(more) of Just (re, ')':yet_more) -> Just(Group re, yet_more) _ -> Nothing parseElement s = parseChar s
The parseItem
function has a common left-factor
<element>
for both cases.
<item> ::= <element> | <element> "*"
To do this we parse the <element>
first and then
test for the "*
".
parseItem s = case parseElement(s) of Just (re, '*':more) -> Just (Star re, more) Just (re, more) -> Just (re, more) _ -> Nothing
Now we move on to the left-recursive cases.
<RE> ::= <seq> | <RE> "|" <seq> <seq> ::= <item> | <seq> <item>
The parseSeq
function must deal with the left-recursion.
To do this we parse the non-recursive case (<item>
) first and then try to extend the result.
extendSeq :: (RE, [Char]) -> Maybe (RE, [Char]) parseSeq s = case parseItem(s) of Just (r, more_chars) -> extendSeq(r, more_chars) _ -> Nothing extendSeq (e1, after1) = case parseItem(after1) of Just(e2, more) -> extendSeq(Seq e1 e2, more) Nothing -> Just(e1, after1)
After parsing an item e1
, we try to extend by parsing one more item e2
. If successful, we form Seq e1 e2
and try to
continue extending. The recursion terminates when parseItem
returns Nothing
. In this case the extension was unsuccessful and we return just e1
and the remain characters (after1
).
The parseRE
procedure also uses the extension idea.
extendRE :: (RE, [Char]) -> Maybe (RE, [Char]) parseRE s = case parseSeq(s) of Just (r, more_chars) -> extendRE(r, more_chars) _ -> Nothing extendRE (e1, []) = Just (e1, []) extendRE (e1, '|' : after_bar) = case parseSeq(after_bar) of Just(e2, more) -> extendRE(Alt e1 e2, more) _ -> Nothing extendRE(e1, c:more) = Just (e1, c:more)
The Main Parsing Function
The recursive descent parsing functions are all designed
to partially process the input string, returning a
regular expression parsed and the remaining unprocessed input.
At the top-level, though, we want a parsing function that processes all of its input, or otherwise returns Nothing
. The following
parseMain
function does the job.
parseMain :: [Char] -> Maybe RE parseMain s = case parseRE s of Just (e, []) -> Just e _ -> Nothing