
Lexical Analysis and Parsing

Converting the textual representation of a program into an internal abstract syntax form often involves an intermediate step of lexical analysis prior to parsing.

Lexical Analysis: White Space, Comments and Tokens

The grammar of programming languages is often described without explicit reference to the white space (space and tab characters, carriage returns, line feeds, etc.) that may separate the significant tokens of the language. Furthermore, it is often convenient to simplify the task of the parser by explicitly forming tokens from the input ahead of time. Most often, the tokens are the strings that can be identified as instances of particular regular expressions.

For example, for the domain of Small Lisp, the regular expressions that are useful for tokenization are the following:

comment: (^;;;.*$)+
numeral: -?[0-9]+
alphanumeric_symbol: [A-Za-z](-?[A-Za-z0-9])*
special_symbol: [-+*/<>=&|!@#$%?:]+
lparen: \(
rparen: \)
lbrak: \[
rbrak: \]
lbrace: \{
rbrace: \}
equal: =
semicolon: ;
arrow: -->
quote-mark: "
colon: :
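
In Haskell, these character classes can be approximated with predicates based on Data.Char. The following sketch is illustrative only; the helper names isSpecialChar and isSymbolChar are assumptions, not part of the Small Lisp specification.

import Data.Char (isAlphaNum)

-- Characters that may appear in a special_symbol token.
isSpecialChar :: Char -> Bool
isSpecialChar c = c `elem` "+-*/<>=&|!@#$%?:"

-- Characters that may continue an alphanumeric_symbol after its initial
-- letter; the regular expression additionally rules out consecutive or
-- trailing hyphens, which this simple predicate does not.
isSymbolChar :: Char -> Bool
isSymbolChar c = isAlphaNum c || c == '-'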

We can represent these tokens using a Haskell algebraic data type.

data Token = Comment [Char] |
             NumToken Int |
             AlphaNumToken [Char] |
             SpecialToken [Char] |
             Lparen | Rparen | Lbrak | Rbrak | Lbrace | Rbrace |
             Equal | Semicolon | Arrow | Quote | Colon
             deriving (Eq, Show)

Most often, tokenization follows the longest-match rule: tokens are formed from the left, taking the longest possible string that matches one of the given regular expressions; white space between tokens is skipped and never becomes part of a token.
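
As a hedged illustration, the longest-match rule for alphanumeric symbols can be approximated with span; the predicate used here slightly over-approximates the regular expression by permitting consecutive or trailing hyphens.

import Data.Char (isAlphaNum)

-- Take the longest prefix of symbol characters (maximal munch).
longestAlphaNum :: [Char] -> ([Char], [Char])
longestAlphaNum = span (\c -> isAlphaNum c || c == '-')

-- For example:
--   longestAlphaNum "make-negation[expr]"  ==>  ("make-negation", "[expr]")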

For example, the Small Lisp definition

make-negation[expr] = list2["-"; expr]

could be represented using the following Haskell [Token] value.

[AlphaNumToken "make-negation", Lbrak, AlphaNumToken "expr", RBrak, Equal, 
 AlphaNumToken "list2", Lbrak, Quote, SpecialToken "-", Quote, Semicolon, AlphaNumToken "expr", RBrak

Question: Why don't you expect to see token streams such as the following?

[AlphaNumToken "make", SpecialToken "-", AlphaNumToken "negation"]

Assignment 3a:

Write a Small Lisp tokenizer that converts the text of a Small Lisp program into a list of tokens.
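
A rough skeleton for such a tokenizer is sketched below; it is not the required solution. Only some of the token classes are handled (white space, numerals without a leading minus sign, alphanumeric symbols, and single-character punctuation); comments, quote marks within strings, braces, the arrow and multi-character special symbols would be treated analogously, following the longest-match rule.

import Data.Char (isSpace, isDigit, isAlpha, isAlphaNum)

-- A partial tokenizer sketch, assuming the Token type defined above.
tokenize :: [Char] -> [Token]
tokenize [] = []
tokenize (c:cs)
  | isSpace c = tokenize cs                        -- white space separates tokens only
  | isDigit c = let (ds, rest) = span isDigit cs
                in NumToken (read (c:ds)) : tokenize rest
  | isAlpha c = let (s, rest) = span (\x -> isAlphaNum x || x == '-') cs
                in AlphaNumToken (c:s) : tokenize rest
tokenize ('[':cs) = Lbrak : tokenize cs
tokenize (']':cs) = Rbrak : tokenize cs
tokenize ('(':cs) = Lparen : tokenize cs
tokenize (')':cs) = Rparen : tokenize cs
tokenize ('"':cs) = Quote : tokenize cs
tokenize (';':cs) = Semicolon : tokenize cs
tokenize ('=':cs) = Equal : tokenize cs
tokenize (c:cs)   = SpecialToken [c] : tokenize cs -- crude single-character fallback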

Parsing from Token Streams

Given token streams, parsing functions have slightly different type signatures than when working with character strings directly. To parse Small Lisp programs, the main parsing functions would handle the three principal subdomains.

parseSExpression :: [Token] -> (Maybe SExpression, [Token])
parseSmLispExpr :: [Token] -> (Maybe SmLispExpr, [Token])
parseSmLispProgram :: [Token] -> (Maybe SmLispProgram, [Token])
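
As a hedged sketch of how such a function might be structured, the S-expression parser below assumes a simple SExpression representation (the data type shown is a guess for illustration; use whatever representation your interpreter defines) and parses a parenthesized sequence of S-expressions as a list.

-- Assumed S-expression representation, for this sketch only.
data SExpression = NumAtom Int | SymAtom [Char] | List [SExpression]
                   deriving Show

parseSExpression :: [Token] -> (Maybe SExpression, [Token])
parseSExpression (NumToken n : more)      = (Just (NumAtom n), more)
parseSExpression (AlphaNumToken s : more) = (Just (SymAtom s), more)
parseSExpression (SpecialToken s : more)  = (Just (SymAtom s), more)
parseSExpression (Lparen : more)          = parseListTail more []
  where
    parseListTail (Rparen : rest) elems = (Just (List (reverse elems)), rest)
    parseListTail toks elems =
      case parseSExpression toks of
        (Just e, rest)  -> parseListTail rest (e : elems)
        (Nothing, rest) -> (Nothing, rest)
parseSExpression toks                     = (Nothing, toks)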

Assignment 3b: Small Lisp Parser

Write a Small Lisp parser that uses the token stream as produced in part 3a.
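
For instance, the tokenizer from part 3a and the top-level parsing function might be composed along the following lines; parseFromSource is an assumed name, and SmLispProgram stands for whatever representation your parser produces.

-- Accept a parse only if every token is consumed.
parseFromSource :: [Char] -> Maybe SmLispProgram
parseFromSource src =
  case parseSmLispProgram (tokenize src) of
    (Just prog, []) -> Just prog
    _               -> Nothing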

Updated Wed March 07 2018, 13:20 by cameron.