Lexical Analysis and Parsing
Converting the textual representation of a program into an internal abstract syntax form often involves an intermediate step of lexical analysis prior to parsing.
Lexical Analysis: White Space, Comments and Tokens
The grammar of programming languages is often described without explicit reference to the white space (space and tab characters, carriage returns, line feeds, etc.) that may separate the significant tokens of the language. Furthermore, it is often convenient to simplify the task of the parser by explicitly forming tokens from the input ahead of time. Most often, tokens are those that can be identified as instances of particular regular expressions.
For example, for the domain of Small Lisp, the regular expressions that are useful for tokenization are the following:
comment: (^;;;.*$)+ numeral: -?[0-9]+ alphanumeric_symbol: [A-Za-z](-?[A-Za-z0-9])* special_symbol: [+-*/<>=&|!@#$%?:]+ lparen: \( rparen: \) lbrak: \[ rbrak: \] lbrace: \{ rbrace: \} equal: = semicolon: ; arrow: --> quote-mark: " colon: :
We can represent these tokens using a Haskell algebraic data type.
data Token = Comment [Char] | NumToken Int | AlphaNumToken [Char] | SpecialToken [Char] | Lparen | Rparen | Lbrak | Rbrak | LBrace | RBrace Equal | Semicolon | Arrow | Quote | Colon
Most often tokenization follows the longest-match rule: tokens are formed from the left taking the longest possibly string that matches the given regular expression, while not consuming any whitespace.
For example, the Small Lisp definition
make-negation[expr] = list2["-"; expr]
could be represented using the following Haskell [Token]
value.
[AlphaNumToken "make-negation", Lbrak, AlphaNumToken "expr", RBrak, Equal, AlphaNumToken "list2", Lbrak, Quote, SpecialToken "-", Quote, Semicolon, AlphaNumToken "expr", RBrak
Question: Why don't you expect to see token streams such as the following.
[AlphaNumToken "make", SpecialToken "-", AlphaNumToken "negation"]
Assignment 3a:
Write a Small Lisp tokenizer that tokenizes a Small Lisp program to produce a list of tokens.
Parsing from Token Streams
Given token streams, parsing functions have a slightly different type signature than when working with character strings directly. To parse Small Lisp programs, the main parsing functions would handle the three prinicipal subdomains.
parseSExpression :: [Token] -> (Maybe SExpression, [Token]) parseSmLispExpr :: [Token] -> (Maybe SmLispExpr, [Token]) parseSmLispProgram :: [Token] -> (Maybe SmLispProgram, [Token])
Assignment 3b: Small Lisp Parser
Write a Small Lisp parser that uses the token stream as produced in part 3a.