Resolving Ambiguity in Small Lisp Tokenization
Recall the Small Lisp regular expressions for tokenization.
comment: (^;;;.*$)+
numeral: -?[0-9]+
alphanumeric_symbol: [A-Za-z](-?[A-Za-z0-9])*
special_symbol: [+-*/<>=&|!@#$%?:]+
lparen: \(
rparen: \)
lbrak: \[
rbrak: \]
lbrace: \{
rbrace: \}
equal: =
semicolon: ;
arrow: -->
quote-mark: "
colon: :
Issue #1
There is some ambiguity because equal, arrow, and colon tokens are also acceptable as special_symbols.
Solution #1
One solution is to eliminate the Equal, Arrow, and Colon tokens as separate items from the token data type.
data Token = Comment [Char]
           | NumToken Int
           | AlphaNumToken [Char]
           | SpecialToken [Char]
           | Lparen | Rparen
           | Lbrak | Rbrak
           | Lbrace | Rbrace
           | Semicolon
           | Quote
When parsing actually requires an equal token, for example, we could then check for (SpecialToken "=") instead.
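As a minimal sketch of that check, a parser can pattern match on the string carried by SpecialToken. The helper names expectEqual and parseBindingHead are hypothetical and are not part of the course code; the deriving clause is added here only so that tokens can be compared and printed.

```haskell
-- Token type from Solution #1; Eq and Show are derived for illustration only.
data Token = Comment [Char]
           | NumToken Int
           | AlphaNumToken [Char]
           | SpecialToken [Char]
           | Lparen | Rparen
           | Lbrak | Rbrak
           | Lbrace | Rbrace
           | Semicolon
           | Quote
           deriving (Eq, Show)

-- Hypothetical helper: succeed only if the next token is the special
-- symbol "=", returning the tokens that follow it.
expectEqual :: [Token] -> Maybe [Token]
expectEqual (SpecialToken "=" : rest) = Just rest
expectEqual _                         = Nothing

-- Hypothetical helper: recognize the start of a binding `name = ...`,
-- returning the bound name and the tokens after the "=".
parseBindingHead :: [Token] -> Maybe ([Char], [Token])
parseBindingHead (AlphaNumToken name : ts) = do
  rest <- expectEqual ts
  Just (name, rest)
parseBindingHead _ = Nothing
```

With these definitions, parseBindingHead [AlphaNumToken "x", SpecialToken "=", NumToken 4] yields Just ("x", [NumToken 4]), while a token list whose second element is SpecialToken "=-" yields Nothing.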
Does this solve the problem completely?
Issue #2
What happens if you write:
{x =-4 : plus[x; x]
By the longest match rule, the tokenization would be
Lbrace, AlphaNumToken "x", SpecialToken "=-", NumToken 4, SpecialToken ":", ...
But what we probably want is
Lbrace, AlphaNumToken "x", SpecialToken "=", NumToken (-4), SpecialToken ":", ...
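To see the longest match behaviour concretely, here is a small sketch of a tokenizer that tries the numeral, alphanumeric_symbol, and special_symbol patterns at each position and keeps the longest lexeme. It is an illustration under simplifying assumptions, not the course's tokenizer: comments and quote marks are left out, the regular expressions are hand-coded as prefix-length functions, and all function names are hypothetical. It reproduces the problem rather than resolving it.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit, isSpace)
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Reduced token type for this sketch (no Comment or Quote).
data Token = NumToken Int | AlphaNumToken [Char] | SpecialToken [Char]
           | Lparen | Rparen | Lbrak | Rbrak | Lbrace | Rbrace | Semicolon
           deriving (Eq, Show)

isLetter' :: Char -> Bool
isLetter' c = isAsciiLower c || isAsciiUpper c

-- Length of the longest prefix matching -?[0-9]+ (0 if there is none).
numeralLen :: String -> Int
numeralLen ('-' : cs) = case length (takeWhile isDigit cs) of
  0 -> 0
  n -> n + 1
numeralLen cs = length (takeWhile isDigit cs)

-- Length of the longest prefix matching [A-Za-z](-?[A-Za-z0-9])*.
symbolLen :: String -> Int
symbolLen (c : cs) | isLetter' c = 1 + go cs
  where
    isAlnum d = isLetter' d || isDigit d
    go ('-' : d : ds) | isAlnum d = 2 + go ds
    go (d : ds)       | isAlnum d = 1 + go ds
    go _                          = 0
symbolLen _ = 0

-- Length of the longest prefix of special-symbol characters.
specialLen :: String -> Int
specialLen = length . takeWhile (`elem` "+-*/<>=&|!@#$%?:")

punctuation :: [(Char, Token)]
punctuation = [ ('(', Lparen), (')', Rparen), ('[', Lbrak), (']', Rbrak)
              , ('{', Lbrace), ('}', Rbrace), (';', Semicolon) ]

-- Longest match: at each position, pick the pattern with the longest lexeme.
-- Calls error on characters this sketch does not handle.
tokenize :: String -> [Token]
tokenize [] = []
tokenize s@(c : cs)
  | isSpace c                      = tokenize cs
  | Just t <- lookup c punctuation = t : tokenize cs
  | bestLen == 0                   = error ("unexpected character: " ++ [c])
  | otherwise                      = mk lexeme : tokenize rest
  where
    candidates = [ (numeralLen s, NumToken . read)
                 , (symbolLen s,  AlphaNumToken)
                 , (specialLen s, SpecialToken) ]
    (bestLen, mk)  = maximumBy (comparing fst) candidates
    (lexeme, rest) = splitAt bestLen s
```

Evaluating tokenize "{x =-4 : plus[x; x]" gives a list that begins Lbrace, AlphaNumToken "x", SpecialToken "=-", NumToken 4, SpecialToken ":". At the "=" position the numeral pattern matches nothing, so the two-character special symbol "=-" wins and the minus sign is no longer available to the numeral.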
How would you resolve this?
Updated Sat Sept. 01 2018, 18:24 by cameron.