Resolving Ambiguity in Small Lisp Tokenization

Recall the Small Lisp regular expressions for tokenization.

comment: (^;;;.*$)+
numeral: -?[0-9]+
alphanumeric_symbol: [A-Za-z](-?[A-Za-z0-9])*
special_symbol: [+\-*/<>=&|!@#$%?:]+
lparen: \(
rparen: \)
lbrak: \[
rbrak: \]
lbrace: \{
rbrace: \}
equal: =
semicolon: ;
arrow: -->
quote-mark: "
colon: :

Issue #1

There is some ambiguity because equal, arrow and colon tokens are also acceptable as special_symbols.
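The overlap can be checked mechanically. The following is only a minimal sketch: the names specialChars, isSpecialSymbol and overlapping are hypothetical helpers introduced here for illustration, and the character list is copied from the special_symbol rule above.

```haskell
-- Characters allowed in a special_symbol, copied from the rule above.
-- (Hypothetical helper name, not part of the course code.)
specialChars :: [Char]
specialChars = "+-*/<>=&|!@#$%?:"

-- True when the whole string matches special_symbol, i.e. it is a
-- non-empty run of special characters.
isSpecialSymbol :: String -> Bool
isSpecialSymbol s = not (null s) && all (`elem` specialChars) s

-- The lexemes of the equal, arrow and colon rules.
overlapping :: [String]
overlapping = ["=", "-->", ":"]
```

Under these assumptions, all isSpecialSymbol overlapping evaluates to True, while isSpecialSymbol ";" is False, so the semicolon token does not share the problem.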

Solution #1

One solution is to eliminate the Equal, Arrow, and Colon tokens as separate items from the token data type.

data Token = Comment [Char] |
             NumToken Int |
             AlphaNumToken [Char] |
             SpecialToken [Char] |
             Lparen | Rparen | Lbrak | Rbrak | Lbrace | Rbrace |
             Semicolon | Quote

When parsing actually requires an equal token, for example, we could then check for (SpecialToken "=") instead.
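One way such a check might look is sketched below. The helper expectSpecial is hypothetical, and the deriving clause is an addition made here so that tokens can be compared and printed; neither comes from the course code.

```haskell
-- The token type from Solution #1; "deriving (Eq, Show)" is an
-- assumption added for this sketch.
data Token = Comment [Char] |
             NumToken Int |
             AlphaNumToken [Char] |
             SpecialToken [Char] |
             Lparen | Rparen | Lbrak | Rbrak | Lbrace | Rbrace |
             Semicolon | Quote
  deriving (Eq, Show)

-- Hypothetical parser helper: if the next token is exactly the given
-- special symbol, consume it and return the remaining tokens.
expectSpecial :: String -> [Token] -> Maybe [Token]
expectSpecial sym (SpecialToken t : rest)
  | sym == t = Just rest
expectSpecial _ _ = Nothing
```

With this sketch, expectSpecial "=" [SpecialToken "=", NumToken 4] gives Just [NumToken 4], whereas expectSpecial "=" [SpecialToken "=-", NumToken 4] gives Nothing, because the comparison is on the whole symbol.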

Does this solve the problem completely?

Issue #2

What happens if you write:

{x =-4 : plus[x; x]

By the longest match rule, the tokenization would be

Lbrace, AlphaNumToken "x", SpecialToken "=-", NumToken 4, SpecialToken ":", ...

But what we probably want is

Lbrace, AlphaNumToken "x", SpecialToken "=", NumToken (-4), SpecialToken ":", ...
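The longest-match behaviour can be reproduced with a small hand-written tokenizer. This is a sketch under stated assumptions, not the course's implementation: it omits comments and quote marks, uses a cut-down Token type with an added deriving clause, and the names tokenize, alnumRest, punctuation and isAsciiLetter are hypothetical.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit, isSpace)

-- Cut-down token type for this sketch (no Comment or Quote).
data Token = NumToken Int | AlphaNumToken String | SpecialToken String
           | Lparen | Rparen | Lbrak | Rbrak | Lbrace | Rbrace | Semicolon
  deriving (Eq, Show)

specialChars :: [Char]
specialChars = "+-*/<>=&|!@#$%?:"

punctuation :: [(Char, Token)]
punctuation = [ ('(', Lparen), (')', Rparen), ('[', Lbrak), (']', Rbrak)
              , ('{', Lbrace), ('}', Rbrace), (';', Semicolon) ]

isAsciiLetter :: Char -> Bool
isAsciiLetter c = isAsciiUpper c || isAsciiLower c

-- Matches (-?[A-Za-z0-9])* and returns (matched text, remaining input).
alnumRest :: String -> (String, String)
alnumRest (c : cs)
  | isAsciiLetter c || isDigit c =
      let (a, r) = alnumRest cs in (c : a, r)
alnumRest ('-' : c : cs)
  | isAsciiLetter c || isDigit c =
      let (a, r) = alnumRest cs in ('-' : c : a, r)
alnumRest s = ("", s)

-- At each position, take the longest lexeme that any rule matches.
tokenize :: String -> [Token]
tokenize [] = []
tokenize s@(c : cs)
  | isSpace c = tokenize cs
  | isDigit c =
      let (ds, rest) = span isDigit s
      in NumToken (read ds) : tokenize rest
  -- "-" followed by a digit: the numeral is longer than the
  -- one-character special symbol "-", so the numeral wins.
  | c == '-', (d : _) <- cs, isDigit d =
      let (ds, rest) = span isDigit cs
      in NumToken (negate (read ds)) : tokenize rest
  -- Otherwise a run of special characters is taken as far as it goes.
  | c `elem` specialChars =
      let (sym, rest) = span (`elem` specialChars) s
      in SpecialToken sym : tokenize rest
  | isAsciiLetter c =
      let (name, rest) = alnumRest cs
      in AlphaNumToken (c : name) : tokenize rest
  | Just tok <- lookup c punctuation = tok : tokenize cs
  | otherwise = error ("unexpected character: " ++ [c])
```

Applied to the example, tokenize "{x =-4 : plus[x; x]" begins with Lbrace, AlphaNumToken "x", SpecialToken "=-", NumToken 4, SpecialToken ":". When the scanner reaches "=", the special_symbol rule already absorbs the following "-", so the numeral rule never sees the minus sign.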

How would you resolve this?

Updated Sat Sept. 01 2018, 18:24 by cameron.

Simon Fraser University