Lexical Analysis and Parsing

For complex symbolic notations, such as programming languages, conversion of the textual representation of a program into an internal abstract syntax form often involves an intermediate step of lexical analysis prior to the actual parsing process.

Lexical Analysis: White Space, Comments and Tokens

The grammar of programming languages is often described without explicit reference to the white space (space and tab characters, carriage returns, line feeds, etc.) that may separate the significant tokens of the language. Furthermore, it is often convenient to simplify the task of the parser by explicitly forming tokens from the input ahead of time. Most often, tokens are those that can be identified as instances of particular regular expressions. The process of lexical analysis is often called tokenization and the lexical analyzer itself may be called a tokenizer.

For example, for the domain of Small Lisp, the regular expressions that are useful for tokenization are the following:

comment: (^;;;.*$)+
numeral: -?[0-9]+
alphanumeric_symbol: [A-Za-z](-?[A-Za-z0-9])*
special_symbol: [-+*/<>=&|!@#$%?:]+
lparen: \(
rparen: \)
lbrak: \[
rbrak: \]
lbrace: \{
rbrace: \}
equal: =
semicolon: ;
arrow: -->
quote-mark: "
colon: :

We can represent these tokens using a Haskell algebraic data type.

data Token = Comment [Char] |
             Numeral Int |
             AlphaNum [Char] |
             Special [Char] |
             Lparen | Rparen | Lbrak | Rbrak | LBrace | RBrace
             Equal | Semicolon | Arrow | Quote | Colon

Most often tokenization follows the longest-match rule: tokens are formed from the left taking the longest possibly string that matches the given regular expression, while not consuming any whitespace.

For example, the Small Lisp definition

make-negation[expr] = list2["-"; expr]

could be represented using the following Haskell [Token] value.

[AlphaNum "make-negation", Lbrak, AlphaNum "expr", RBrak, Equal, 
 AlphaNum "list2", Lbrak, Quote, Special "-", Quote, Semicolon, AlphaNum "expr", RBrak]

Question: Why don't you expect to see token streams such as the following?

[AlphaNum "make", Special "-", AlphaNum "negation"]

The Tokenizer

How do you write a tokenizer? Below we show how a tokenizer for the Small Lisp programming language can be written in Haskell.

However, it is possible to consider automatically generated tokenizers. The idea is to define the tokens of a language using regular expressions (as above). Then a special symbolic computing application called a lexical-analyzer generator can be used. Such an application would automatically build a lexical analyzer from the given set of regular expressions. An example is the program Flex.

However, writing a lexical analyzer by hand is not conceptually difficult and may sometimes be necessary for various reasons. One reason is there may be ambiguities in token syntax that are not handled by the regular expression given. Another is that there may be special white-space or comment rules.

Small Lisp does have some complexity for lexical analysis. One issue is that some tokens are ambiguous: the - (Equal), --> (Arrow) and : (Colon) tokens are also legal as Special tokens. Another is that the syntax of Small Lisp comments is somewhat unusual.

Now consider the following Small Lisp tokenizer. We design it to produce a tokenization of any input stream, without wrapping the result in a Maybe type. Instead we introduce a BadToken constructor to represent illegal input tokens.

module SmLispTokenize where

import Data.Char

data Token = Comment [Char] |
             Numeral Int |
             AlphaNum [Char] |
             Special [Char] |
             Lparen | Rparen | Lbrak | Rbrak | LBrace | RBrace |
             Equal | Semicolon | Arrow | Quote | Colon | 
             BadToken Char deriving Show

tokenize [] = []
tokenize ('\n': ';' : ';' : ';' : more_chars) = 
    let (rest_of_line, more_lines) = span (/='\n') more_chars
        in (Comment rest_of_line): tokenize more_lines

tokenize (' ':more) = tokenize more            -- skip white space
tokenize ('\n':more) = tokenize more           -- skip white space
tokenize ('(':more) = Lparen: (tokenize more)
tokenize (')':more) = Rparen: (tokenize more)
tokenize ('[':more) = Lbrak: (tokenize more)
tokenize (']':more) = Rbrak: (tokenize more)
tokenize ('{':more) = LBrace: (tokenize more)
tokenize ('}':more) = RBrace: (tokenize more)
tokenize (';':more) = Semicolon: (tokenize more)
tokenize ('"':more) = Quote: (tokenize more)
tokenize t@(c:more)
   | isAlpha c = getAlphaNum [c] (span isAlphaNum more)
   | isDigit c = getNumToken (digitToInt c) more
   | otherwise = getSpecial (span (\c -> elem c "+-*/<>=&|!@#$%?:") t)

getSpecial ("", (c : more)) = BadToken c : (tokenize more)
-- Note: the following tokens may also be used as symbolic atoms;
-- the parser must recognize this situation
getSpecial ("=", more)  = Equal : (tokenize more)
getSpecial ("-->", more)  = Arrow : (tokenize more)
getSpecial (":", more)  = Colon : (tokenize more)
getSpecial ("-", c:more) 
   | isDigit c = let (Numeral n:tokens) = getNumToken (digitToInt c) more 
                     in (Numeral (-n)): tokens
   | otherwise = (Special "-") : (tokenize (c:more))
getSpecial (specials, more) = (Special specials) : (tokenize more)

getNumToken accum [] = [Numeral accum]
getNumToken accum (c:more) 
   | isDigit c  = getNumToken (accum * 10 + (digitToInt c)) more
   | otherwise  = Numeral accum : (tokenize (c:more))

getAlphaNum pfx (alphanum, s@('-':c:more))
   | isAlphaNum c   = getAlphaNum (pfx ++ alphanum ++ ['-',c]) (span isAlphaNum more)
   | otherwise      = AlphaNum (pfx ++ alphanum) : (tokenize s)
getAlphaNum pfx (alphanum, more) = AlphaNum (pfx ++ alphanum) : (tokenize more)

One issue to consider in this tokenizer is the recognition of comment lines. Under what circumstances may the tokenizer fail to correctly recognize a comment? How could that be handled?

Parsing from Token Streams

Given token streams, parsing functions have a slightly different type signature than when working with character strings directly. To parse Small Lisp programs, the main parsing functions would handle the three prinicipal subdomains.

parseSExpression :: [Token] -> Maybe (SExpression, [Token])
parseSmLispExpr :: [Token] -> Maybe (SmLispExpr, [Token])
parseSmLispProgram :: [Token] -> Maybe (SmLispProgram, [Token])

Parsing Small Lisp values: S-expressions

Our first parsing function handles Small Lisp values (S-expressions)

parseSExpression :: [Token] -> Maybe (SExpression, [Token])
parseSExpression (Numeral n: tokens) = Just (NumAtom n, tokens)
parseSExpression (AlphaNum s: tokens) = Just (SymAtom s, tokens)
parseSExpression (Special s: tokens) = Just (SymAtom s, tokens)
parseSExpression (Arrow: tokens) = Just (SymAtom "-->", tokens)
parseSExpression (Equal: tokens) = Just (SymAtom "=", tokens)
parseSExpression (Colon: tokens) = Just (SymAtom ":", tokens)
parseSExpression (Lparen: tokens) = parseSExpressionList [] tokens 
parseSExpression _ = Nothing

parseSExpressionList exprs (Rparen : more) = Just (List exprs, more)
parseSExpressionList exprs tokens =
   case parseSExpression tokens of 
       Just (e, more) -> parseSExpressionList (exprs ++ [e]) more
       _ -> Nothing

Parsing Small Lisp Expressions

type Identifier = [Char]

data SmLispExpr = SExpr SExpression |
                  Variable Identifier |
                  FnCall Identifier [SmLispExpr] |
                  CondExpr [CondClause] |
                  LetExpr [LocalDef] SmLispExpr deriving Show

data CondClause = Clause SmLispExpr SmLispExpr deriving Show
data LocalDef = Binding Identifier SmLispExpr deriving Show

parseSmLispExpr :: [Token] -> Maybe (SmLispExpr, [Token])

parseSmLispExpr (Lbrak : tokens) = parseCondExpr [] tokens
parseSmLispExpr (LBrace : tokens) = parseLetExpr [] tokens
parseSmLispExpr (AlphaNum s: Lbrak : tokens) = parseFnCall s [] tokens
parseSmLispExpr (AlphaNum s: tokens) = Just (Variable s, tokens)
parseSmLispExpr (Quote : t : Quote : tokens) = 
   case parseSExpression [t] of 
       Just (SymAtom s, []) -> Just (SExpr (SymAtom s), tokens)
       _ -> Nothing
parseSmLispExpr tokens =
   case parseSExpression tokens of 
       Just (List s, more) -> Just (SExpr (List s), more)
       Just (NumAtom n, more) -> Just (SExpr (NumAtom n), more)
       _ -> Nothing

parseCondExpr clauses tokens = 
    case parseSmLispExpr tokens of 
        Just (e, Arrow: more) -> 
            case parseSmLispExpr more of
                Just (rslt, Semicolon:yet_more) -> 
                    parseCondExpr (clauses ++ [Clause e rslt]) yet_more
                Just (rslt, Rbrak:yet_more) -> 
                    Just (CondExpr (clauses ++ [Clause e rslt]), yet_more)
                _ -> Nothing
        _ -> Nothing
        
parseFnCall s args tokens = 
    case parseSmLispExpr tokens of 
        Just (e, (Semicolon:more)) -> parseFnCall s (args ++ [e]) more
        Just (e, (Rbrak:more)) -> Just (FnCall s (args ++ [e]), more)
        _ -> Nothing


parseLetExpr defs (AlphaNum s: Equal: tokens) =
    case parseSmLispExpr tokens of 
        Just (e, Semicolon:more) -> 
            parseLetExpr (defs ++ [Binding s e]) more
        Just (e, Colon:more) -> 
            case parseSmLispExpr more of 
                Just (e2, RBrace:yet_more) -> 
                    Just (LetExpr (defs ++ [Binding s e]) e2, yet_more)
                _ -> Nothing
        _ -> Nothing

parseLetExpr defs tokens = Nothing

Parsing Small Lisp Programs (Lists of Definitions)

data Definition = ConstantDef Identifier SmLispExpr |
                  FunctionDef Identifier [Identifier] SmLispExpr deriving Show

parseSmLispProgram [] = Just []

parseSmLispProgram (Comment t: more) = parseSmLispProgram more

parseSmLispProgram (AlphaNum name: Equal : more) =
    case parseSmLispExpr more of
        Just (e, yet_more) -> 
            case parseSmLispProgram yet_more of
                Just p -> Just ((ConstantDef name e): p)
                _ -> Nothing
        _ -> Nothing

parseSmLispProgram (AlphaNum name: Lbrak : AlphaNum arg : more) =
    case parseArguments [arg] more of
        Just (args, Equal: yet_more) -> 
            case parseSmLispExpr yet_more of
                Just (e, further_more) -> 
                    case parseSmLispProgram further_more of
                        Just p -> Just ((FunctionDef name args e): p)
                        _ -> Nothing
                _ -> Nothing
        _ -> Nothing
parseSmLispProgram _ = Nothing

parseArguments args (Rbrak : more) = Just (args, more)
parseArguments args (Semicolon: AlphaNum a: more) =
    parseArguments (args ++ [a]) more
parseArguments _ _ = Nothing


parseExpressionText srctxt = 
    case parseSmLispExpr (tokenize srctxt) of 
        Just (e, []) -> Just e
        _ -> Nothing

parseProgramText srctxt = parseSmLispProgram (tokenize ('\n':srctxt))

Updated Wed Oct. 31 2018, 11:15 by cameron.

Simon Fraser University
Engaging the World

CourSys