Parsing

Given a grammar for a language and an input text representing an instance of that language, parsing is the process of determining the structure of a derivation tree (parse tree) for the input.

Table-Driven Parsers and Parser Generators

There are many table-driven parser technologies depending on the characteristics of the grammar and language to be parsed. Two important classes of grammars that are commonly used as the basis of table-driven technologies are the LL(k) parsers (left-to-right parsing with leftmost derivations and k symbols of lookahead) , and the LR(k) parsers.

Recursive-Descent Parsers

Many programming languages are designed to be easy to be parsed using top-down parsers written as a set of mutually recursive parsing procedures. These are called recursive-descent parsers.

Wikipedia: Recursive Descent Parser

Left-Recursion in Recursive Descent Parsers

The method of recursive-descent parsing must be modified to deal with grammars that have left-recursive productions.

Consider the grammar:

<expr> ::= <var> | <expr> "+" <var>
<var> ::= "X" | "Y" | "Z"

Consider writing a parsing procedure expr.

void expr (void) {
    if (Something) {
        var();
    }
    else {
        expr();
        expect(plus);
        var();    
    }
}

There are two problems. One is that we don't know what the test Something should be to distinguish the two alternative forms of <expr>. The second that, even if we had a test, the else clause has a problem: it immediately makes a recursive call to expr() again! This would be infinite recursion!

Grammar Rewriting

Many references describe how left-recursive grammars can be rewritten to eliminate the left-recursions. However, the resulting grammars give different parse trees which change the associativity of the operators.

Programming for Left-Recursion

Left-recursive grammar rules can be handled directly in recursive descent parsing, as follows.

Write the parsing action for all the non-left-recursive cases first.
Now write a loop that repeatedly tries to extend the parse according to the left-recursive cases.

void expr (void) {
    var();
    while (sym == plus) {
        nextsym();
        var();
    }
}

Parsing in Haskell

A simple recursive descent parser for this example grammar consists of a lexer to produce a string of tokens and a recursive descent parser.

We describe the tokens as an algebraic data type Token. Our lexer produces a list of tokens, including possibly a BadToken for incorrect input. The lexer skips white space.

data Token = PlusOp | VarToken Char | BadToken Char deriving Show

lexer :: [Char] -> [Token]

lexer [] = []
lexer (c: more)
  | c == ' ' || c == '\n' = lexer more  -- skip whitespace
  | c == '+'  = PlusOp : (lexer more)
  | c == 'X' || c == 'Y' || c== 'Z'   = VarToken c : (lexer more)
  | otherwise   = BadToken c: (lexer more)

Now we write a parser to convert a token stream into an abstract syntax tree. First we define a Haskell algebraic data type for expressions.

data Expr = Var Char | PlusExpr(Expr, Expr) deriving Show

A recursive descent parser has one parsing function for each nonterminal. Each function parses an instance from the token stream and returns either an error signal (if parsing failed) or an AST for the object and a remaining token stream. In order to account for errors, we use Haskell's (built-in) Maybe algebraic data type.

parseExpr :: [Token] -> Maybe (Expr, [Token])
parseVar :: [Token] -> Maybe (Expr, [Token])

The parseVar function is simplest.

parseVar (VarToken vname : tokens_after_var) = Just (Var vname, tokens_after_var)
parseVar _ = Nothing

The parseExpr function must deal with the left-recursion. To do this we parse the non-recursive case (Var) first and then try to extend the result.

extendExpr :: (Expr, [Token]) -> Maybe (Expr, [Token])

parseExpr s =
    case parseVar(s) of
        Just (v, more_tokens) -> extendExpr(v, more_tokens)
        _ -> Nothing

extendExpr(e1, []) = Just (e1, [])
extendExpr(e1, PlusOp : after_plus) =
    case parseVar(after_plus) of 
        Just(v, after_var) -> extendExpr(PlusExpr(e1, v), after_var)
        _ -> Nothing
extendExpr(e1, other_token : more) = Just (e1, other_token : more)

A main parsing function applies the lexer and the recursive descent parser and ensures that all the input has been consumed.

parseMain :: [Char] -> Maybe Expr

parseMain s = case parseExpr(lexer s) of 
    Just (e, []) -> Just e
    _ -> Nothing

Updated Tue Oct. 13 2015, 10:48 by cameron.

Simon Fraser University
Engaging the World

CourSys