Parsing: From Symbolic Text to ASTs

The Parsing Problem

The parsing problem is that of transforming strings of text into suitable internal form for manipulation. In our simpleton regular expression example, how do we convert the regular expression "ab*d" into the symbolic form Seq (Seq (Chr 'a') (Star (Char 'b'))) (Char 'd')?

BNF Grammars to Structure Text

In order to tackle the parsing problem, we need clear specifications of how text is to be structured. In order to do this, we rely on BNF grammars.

For our simple regular expression language, the following unambiguous grammar reflects the inductive definition given previously together with our precedence and associativity rules (Kleene-star takes precedence over sequential concatenation, concatenation takes precedence over alternation, both concatenation and alternation are left-associative).

<RE> ::= <seq> | <RE> "|" <seq>
<seq> ::= <item> | <seq> <item>
<item> ::= <element> | <element> "*"
<element> ::= <char> | "(" <RE> ")"
<char> ::= any character except "|", "*", "(", ")"

From this grammar, how do we parse the string "ab*d" as an <RE>? First note that there is no "|" in the text, so the <RE> must be a <seq>. Can it be a single <item>? No, because it does not end in "*" and it does not meet the requirements of a single <element>. So, it must be a <seq> <item> combination, with the <seq> parsing as "ab*" and the <item> parsing as a <char>, i.e., "d". The <seq> for "ab*" is in turn a <seq> <item> for the substrings "a" and "b*". As an <item>, "b*" satisfies the second alternative, i.e., as a <element> "*", with the <element> parsing as a <char>, i.e., "b".

By this process, we can work out that the structure of our expression which works out to be Seq (Seq (Chr 'a') (Star (Char 'b'))) (Char 'd') as desired.

Recursive-Descent Parsers

Many programming languages are designed to be easy to be parsed using top-down parsers written as a set of mutually recursive parsing procedures. These are called recursive-descent parsers.

Wikipedia: Recursive Descent Parser

Indeed, the grammar given here satisfies the requirements for a recursive-descent parser provided that we deal with the problem of left-recursion.

Left-Recursion in Recursive Descent Parsers

The method of recursive-descent parsing must be modified to deal with grammars that have left-recursive productions.

Consider the grammar:

<expr> ::= <var> | <expr> "+" <var>
<var> ::= "X" | "Y" | "Z"

Consider writing a parsing procedure expr.

void expr (void) {
    if (Something) {
        var();
    }
    else {
        expr();
        expect(plus);
        var();    
    }
}

There are two problems. One is that we don't know what the test Something should be to distinguish the two alternative forms of <expr>. The second that, even if we had a test, the else clause has a problem: it immediately makes a recursive call to expr() again! This would be infinite recursion!

Grammar Rewriting

Many references describe how left-recursive grammars can be rewritten to eliminate the left-recursions. However, the resulting grammars give different parse trees which change the associativity of the operators.

Programming for Left-Recursion

Left-recursive grammar rules can be handled directly in recursive descent parsing, as follows.

Write the parsing action for all the non-left-recursive cases first.
Now write a loop that repeatedly tries to extend the parse according to the left-recursive cases.

void expr (void) {
    var();
    while (sym == plus) {
        nextsym();
        var();
    }
}

Returning to our grammar for regular expressions, it turns out that we have two left-recursive rules. But the same methods apply.

Recursive Descent Parsing in Haskell

A simple recursive descent parser for our regular expression grammar has one recursive parsing function for each nonterminal: <RE>, <seq>, <item>, <element>, <char>. Each function parses an instance from the input string and returns either an error signal (if parsing failed) or an AST for the object and a remaining string to be parsed. In order to account for errors, we use Haskell's (built-in) Maybe algebraic data type.

parseRE :: [Char] -> Maybe (RE, [Char])
parseSeq :: [Char] -> Maybe (RE, [Char])
parseItem :: [Char] -> Maybe (RE, [Char])
parseElement :: [Char] -> Maybe (RE, [Char])
parseChar :: [Char] -> Maybe (RE, [Char])

Let us deal with these in reverse order.

<RE> ::= <seq> | <RE> "|" <seq>
<seq> ::= <item> | <seq> <item>
<item> ::= <element> | <element> "*"
<element> ::= <char> | "(" <RE> ")"
<char> ::= any character except "|", "*", "(", ")"

The parseChar function is simplest.

parseChar [] = Nothing
parseChar (c:s)
  | c == '|' || c == '*' || c == '(' || c == ')'   = Nothing
  | otherwise                                      = Just ((Ch c), s)

The parseElement function accepts a <char> or a parenthesized <RE>.

parseElement ('(':more) =
    case parseRE(more) of
        Just (re, ')':yet_more) -> Just(Group re, yet_more)
        _ -> Nothing
parseElement s = parseChar s

The parseItem function has a common left-factor <element> for both cases.

<item> ::= <element> | <element> "*"

To do this we parse the <element> first and then test for the "*".

parseItem s =
   case parseElement(s) of
        Just (re, '*':more) -> Just (Star re, more)
        Just (re, more) -> Just (re, more)
        _ -> Nothing

Now we move on to the left-recursive cases.

<RE> ::= <seq> | <RE> "|" <seq>
<seq> ::= <item> | <seq> <item>

The parseSeq function must deal with the left-recursion. To do this we parse the non-recursive case (<item>) first and then try to extend the result.

extendSeq :: (RE, [Char]) -> Maybe (RE, [Char])

parseSeq s =
    case parseItem(s) of
        Just (r, more_chars) -> extendSeq(r, more_chars)
        _ -> Nothing

extendSeq (e1, after1) =
    case parseItem(after1) of 
        Just(e2, more) -> extendSeq(Seq e1 e2, more)
        Nothing -> Just(e1, after1)

After parsing an item e1, we try to extend by parsing one more item e2. If successful, we form Seq e1 e2 and try to continue extending. The recursion terminates when parseItem returns Nothing. In this case the extension was unsuccessful and we return just e1 and the remain characters (after1).

The parseRE procedure also uses the extension idea.

extendRE :: (RE, [Char]) -> Maybe (RE, [Char])
parseRE s =
    case parseSeq(s) of
        Just (r, more_chars) -> extendRE(r, more_chars)
        _ -> Nothing

extendRE (e1, []) = Just (e1, [])
extendRE (e1, '|' : after_bar) =
    case parseSeq(after_bar) of 
        Just(e2, more) -> extendRE(Alt e1 e2, more)
        _ -> Nothing
extendRE(e1, c:more) = Just (e1, c:more)

The Main Parsing Function

The recursive descent parsing functions are all designed to partially process the input string, returning a regular expression parsed and the remaining unprocessed input. At the top-level, though, we want a parsing function that processes all of its input, or otherwise returns Nothing. The following parseMain function does the job.

parseMain :: [Char] -> Maybe RE

parseMain s = case parseRE s of 
    Just (e, []) -> Just e
    _ -> Nothing

Updated Sat Sept. 01 2018, 18:24 by cameron.

Simon Fraser University
Engaging the World

CourSys