Parsing
Given a grammar for a language and an input text representing an instance of that language, parsing is the process of determining the structure of a derivation tree (parse tree) for the input.
Table-Driven Parsers and Parser Generators
There are many table-driven parser technologies depending on the characteristics of the grammar and language to be parsed. Two important classes of grammars that are commonly used as the basis of table-driven technologies are the LL(k) parsers (left-to-right parsing with leftmost derivations and k symbols of lookahead) , and the LR(k) parsers.
Recursive-Descent Parsers
Many programming languages are designed to be easy to be parsed using top-down parsers written as a set of mutually recursive parsing procedures. These are called recursive-descent parsers.
Wikipedia: Recursive Descent Parser
Left-Recursion in Recursive Descent Parsers
The method of recursive-descent parsing must be modified to deal with grammars that have left-recursive productions.
Consider the grammar:
<expr> ::= <var> | <expr> "+" <var> <var> ::= "X" | "Y" | "Z"
Consider writing a parsing procedure expr
.
void expr (void) {
if (Something) {
var();
}
else {
expr();
expect(plus);
var();
}
}
There are two problems. One is that we don't know what
the test Something
should be to distinguish the two alternative forms of <expr>
. The second that, even if we had a test, the else clause has a problem: it immediately makes a recursive call to expr()
again! This would be infinite recursion!
Grammar Rewriting
Many references describe how left-recursive grammars can be rewritten to eliminate the left-recursions. However, the resulting grammars give different parse trees which change the associativity of the operators.
Programming for Left-Recursion
Left-recursive grammar rules can be handled directly in recursive descent parsing, as follows.
- Write the parsing action for all the non-left-recursive cases first.
- Now write a loop that repeatedly tries to extend the parse according to the left-recursive cases.
void expr (void) {
var();
while (sym == plus) {
nextsym();
var();
}
}
Parsing in Haskell
A simple recursive descent parser for this example grammar consists of a lexer to produce a string of tokens and a recursive descent parser.
We describe the tokens as an algebraic data type Token. Our lexer produces a list of tokens, including possibly a BadToken for incorrect input. The lexer skips white space.
data Token = PlusOp | VarToken Char | BadToken Char deriving Show lexer :: [Char] -> [Token] lexer [] = [] lexer (c: more) | c == ' ' || c == '\n' = lexer more -- skip whitespace | c == '+' = PlusOp : (lexer more) | c == 'X' || c == 'Y' || c== 'Z' = VarToken c : (lexer more) | otherwise = BadToken c: (lexer more)
Now we write a parser to convert a token stream into an abstract syntax tree. First we define a Haskell algebraic data type for expressions.
data Expr = Var Char | PlusExpr(Expr, Expr) deriving Show
A recursive descent parser has one parsing function for each nonterminal. Each function parses an instance from the token
stream and returns either an error signal (if parsing failed) or an AST for the object and a remaining token stream.
In order to account for errors, we use Haskell's (built-in) Maybe
algebraic data type.
parseExpr :: [Token] -> Maybe (Expr, [Token]) parseVar :: [Token] -> Maybe (Expr, [Token])
The parseVar
function is simplest.
parseVar (VarToken vname : tokens_after_var) = Just (Var vname, tokens_after_var) parseVar _ = Nothing
The parseExpr
function must deal with the left-recursion.
To do this we parse the non-recursive case (Var
) first and then try to extend the result.
extendExpr :: (Expr, [Token]) -> Maybe (Expr, [Token]) parseExpr s = case parseVar(s) of Just (v, more_tokens) -> extendExpr(v, more_tokens) _ -> Nothing extendExpr(e1, []) = Just (e1, []) extendExpr(e1, PlusOp : after_plus) = case parseVar(after_plus) of Just(v, after_var) -> extendExpr(PlusExpr(e1, v), after_var) _ -> Nothing extendExpr(e1, other_token : more) = Just (e1, other_token : more)
A main parsing function applies the lexer and the recursive descent parser and ensures that all the input has been consumed.
parseMain :: [Char] -> Maybe Expr parseMain s = case parseExpr(lexer s) of Just (e, []) -> Just e _ -> Nothing