Resolving Ambiguity in Small Lisp Tokenization
Recall the Small Lisp regular expressions for tokenization.
comment: (^;;;.*$)+
numeral: -?[0-9]+
alphanumeric_symbol: [A-Za-z](-?[A-Za-z0-9])*
special_symbol: [-+*/<>=&|!@#$%?:]+
lparen: \(
rparen: \)
lbrak: \[
rbrak: \]
lbrace: \{
rbrace: \}
equal: =
semicolon: ;
arrow: -->
quote-mark: "
colon: :
Issue #1
There is some ambiguity because equal, arrow and colon tokens are also acceptable as special_symbols.
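To make the overlap concrete, here is a small sketch that lists which of the relevant token classes accept a given lexeme. The names isSpecialSymbol and classesFor are hypothetical helpers introduced for illustration only, and the regular expressions are hand-coded rather than run through a regex library.

```haskell
-- Hypothetical helper: does the whole string match the special_symbol
-- pattern, i.e. one or more characters from the special-symbol set?
isSpecialSymbol :: String -> Bool
isSpecialSymbol s = not (null s) && all (`elem` "+-*/<>=&|!@#$%?:") s

-- Hypothetical helper: names of the token classes (out of the four
-- involved in Issue #1) that accept the given lexeme.
classesFor :: String -> [String]
classesFor s = [name | (name, matches) <- table, matches s]
  where
    table =
      [ ("special_symbol", isSpecialSymbol)
      , ("equal", (== "="))
      , ("arrow", (== "-->"))
      , ("colon", (== ":"))
      ]
```

Under these assumptions, classesFor "=" gives ["special_symbol","equal"], so the lexeme "=" has two possible classifications, and the same holds for "-->" and ":".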
Solution #1
One solution is to eliminate the Equal, Arrow, and Colon tokens as separate items from the token data type.
data Token = Comment [Char] |
             NumToken Int |
             AlphaNumToken [Char] |
             SpecialToken [Char] |
             Lparen | Rparen | Lbrak | Rbrak | Lbrace | Rbrace |
             Semicolon | Quote
When parsing would actually require an equal token, for example, we could then check for (SpecialToken "=") instead.
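As a rough sketch of what that check might look like in a parser, assume the token stream is a plain list and that parsing functions return the remaining tokens on success. The helpers expect and expectEqual are hypothetical names, not part of the course code, and deriving Eq is added so that tokens can be compared.

```haskell
-- Token type from the notes, without Equal, Arrow and Colon.
-- Eq is derived here (an assumption) so tokens can be compared.
data Token = Comment [Char]
           | NumToken Int
           | AlphaNumToken [Char]
           | SpecialToken [Char]
           | Lparen | Rparen | Lbrak | Rbrak | Lbrace | Rbrace
           | Semicolon | Quote
           deriving (Eq, Show)

-- Hypothetical helper: consume one expected token from the front of
-- the input, returning the remaining tokens on success.
expect :: Token -> [Token] -> Maybe [Token]
expect t (t' : rest) | t == t' = Just rest
expect _ _ = Nothing

-- Where the grammar calls for an equal sign, look for the special
-- symbol "=" instead of a dedicated Equal token.
expectEqual :: [Token] -> Maybe [Token]
expectEqual = expect (SpecialToken "=")
```

Note that expectEqual only succeeds on exactly SpecialToken "=". A longer special symbol that merely starts with "=" is a different token and is rejected.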
Does this solve the problem completely?
Issue #2
What happens if you write:
{x =-4 : plus[x; x]
By the longest match rule, the tokenization would be
Lbrace, AlphaNumToken "x", SpecialToken "=-", NumToken 4, SpecialToken ":", ...
But what we probably want is
Lbrace, AlphaNumToken "x", SpecialToken "=", NumToken (-4), SpecialToken ":", ...
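The longest match behaviour can be reproduced with a small hand-written tokenizer. This is only a sketch under some assumptions: it hard-codes the patterns instead of using a regex library, it ignores comments, and the names tokenize, symTail, isSpecial and punctuation are hypothetical. It is meant to show the problem, not to fix it.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit, isSpace)

data Token = Comment [Char]
           | NumToken Int
           | AlphaNumToken [Char]
           | SpecialToken [Char]
           | Lparen | Rparen | Lbrak | Rbrak | Lbrace | Rbrace
           | Semicolon | Quote
           deriving (Eq, Show)

isLetterAscii, isSymChar, isSpecial :: Char -> Bool
isLetterAscii c = isAsciiUpper c || isAsciiLower c
isSymChar c = isLetterAscii c || isDigit c
isSpecial c = c `elem` "+-*/<>=&|!@#$%?:"

-- Single-character tokens.
punctuation :: [(Char, Token)]
punctuation =
  [ ('(', Lparen), (')', Rparen), ('[', Lbrak), (']', Rbrak)
  , ('{', Lbrace), ('}', Rbrace), (';', Semicolon), ('"', Quote) ]

-- Rest of an alphanumeric symbol: (-?[A-Za-z0-9])*
symTail :: String -> (String, String)
symTail ('-' : c : cs) | isSymChar c = let (t, r) = symTail cs in ('-' : c : t, r)
symTail (c : cs) | isSymChar c = let (t, r) = symTail cs in (c : t, r)
symTail cs = ("", cs)

-- Longest match: at each position take the longest lexeme that some
-- pattern accepts. A '-' directly followed by a digit starts a numeral,
-- because the numeral is then longer than the special symbol "-".
tokenize :: String -> [Token]
tokenize [] = []
tokenize s@(c : cs)
  | isSpace c = tokenize cs
  | isDigit c = numeral id s
  | c == '-', (d : _) <- cs, isDigit d = numeral negate cs
  | isSpecial c = let (sym, rest) = span isSpecial s
                  in SpecialToken sym : tokenize rest
  | isLetterAscii c = let (t, rest) = symTail cs
                      in AlphaNumToken (c : t) : tokenize rest
  | Just t <- lookup c punctuation = t : tokenize cs
  | otherwise = error ("unexpected character: " ++ [c])
  where
    numeral sign ds = let (digits, rest) = span isDigit ds
                      in NumToken (sign (read digits)) : tokenize rest
```

With these definitions, tokenize "{x =-4 : plus[x; x]" begins Lbrace, AlphaNumToken "x", SpecialToken "=-", NumToken 4, SpecialToken ":", because the special symbol pattern consumes the minus sign before the numeral pattern gets a chance to. In contrast, tokenize "= -4" yields SpecialToken "=" followed by NumToken (-4).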
How would you resolve this?
Updated Sat Sept. 01 2018, 18:24 by cameron.