Symbolic Data Domains: Concrete and Abstract Syntax
Concrete vs Abstract Syntax
In symbolic computing, we often make a distinction between the concrete syntax of a symbolic data domain and a simplified internal form know as the abstract syntax.
The concrete syntax is the precise textual representation of a notation as defined by its grammar. In general, we use context-free grammars using some variant of BNF or EBNF for defining the concrete syntax.
The abstract syntax is a representation that strips away some details of the concrete syntax. In particular, grammar productions that are used to resolve ambiguities related to operator precedence an d associativity are often removed.
Consider the example of simple regular expressions, with our previously defined grammar.
<RE> ::= <seq> | <RE> "|" <seq> <seq> ::= <item> | <seq> <item> <item> ::= <element> | <element> "*" <element> ::= <char> | "(" <RE> ")" <char> ::= any character except "|", "*", "(", ")"
Now consider the following Haskell algebraic data type declarations that directly reflect the structure of the BNF grammar.
data RE = Seq1 SeqRE | Alt RE SeqRE data SeqRE = Item1 ItemRE | Seq SeqRE ItemRE data ItemRE = Elem1 ElemRE | Star ElemRE data ElemRE = Ch1 CharRE | Group RE data CharRE = Ch Char
These definitions give an internal representation that very much follows the concrete syntax and nature of the parsing process, and gives rise to a representation based on derivation trees or parsing trees.
Compare this to our one-line definition of a Haskell algebraic data type given previously.
data RE = Epsilon | Ch Char | Seq RE RE | Alt RE RE | Star RE | Group RE deriving Show
This definition defines the internal representation as abstract syntax trees (ASTs).
In general, ASTs are simpler than derivation trees. Consider the regular expression "ab".
- As an AST the representation is
Seq (Ch 'a') (Ch 'b')
- As a derivation tree, the representation is
Seq1 (Seq (Item1 (Elem1 (Ch1 (Ch 'a')))) (Elem1 (Ch1 (Ch 'b'))))
Derivation or parsing trees have many more nodes, so most often we prefer to use the AST forms instead. ASTs also simplify the internal type structures and make the node construction more orthogonal. That is, when constructing a recursively-defined RE node, we can choose any type of RE for the components, and not be limited by the more specific types allowed in the derivation tree.
Unparsing
Unparsing is the process of converting the internal form of symbolic data objects back into a textual representation. In principle, this is fairly easy to do with straightforward unparsing equations.
unparse :: RE -> [Char] unparse (Alt r1 r2) = (unparse r1) ++ "|" ++ (unparse r2) unparse (Seq r1 r2) = (unparse r1) ++ (unparse r2) unparse (Star r1) = (unparse r1) ++ "*" unparse (Group r1) = '(': (unparse r1) ++ ")" unparse (Ch c) = c:[]
But what happens if we try to unparse
a structure such as
Star (Seq (Ch 'a') (Ch 'b'))
?
We get an apparently straightforward result.
*Main> unparse (Star (Seq (Ch 'a') (Ch 'b'))) "ab*"
But how does this textual representation parse? According to our grammar and parser, we get a different internal representation!
Seq (Ch 'a') (Star (Ch 'b'))
This is a very different regular expression than
Star (Seq (Ch 'a') (Ch 'b'))
!
In order to deal with this problem, we need to add parentheses systematically. One way to do this is to have functions which
add Group nodes to reflect our BNF grammar.
For example, forceSeq
is used when we require a <seq>
,
to ensure that the RE is parseable as a <seq>
.
forceSeq :: RE -> RE forceSeq (Alt r1 r2) = (Group (Alt r1 r2)) forceSeq x = x
Similarly, when an <item>
is required, we can force the
parentheses for grouping.
forceItem :: RE -> RE forceItem (Alt r1 r2) = (Group (Alt r1 r2)) forceItem (Seq r1 r2) = (Group (Seq r1 r2)) forceItem x = x
Can you define forceElement
?
Given these definitions, how do we now define a recursive function addGroups
that can be used to produce correct unparses?
addGroups :: RE -> RE
Following the constraints given by the BNF grammar, we
need to obey the restrictions for each type of
element construction. For example, in the rule
which defines Alt
structures with the "|" operator,
we have <RE> "|" <seq>
, so we must ensure that
the right hand side is a <seq>
. The left hand
side can be any RE
so there will be change it.
addGroups (Alt r1 r2) = Alt (addGroups r1) (forceSeq (addGroups r2)))
In the case of a Seq
structure, the relevant constraints
are found in the second alternative of the <seq>
BNF rule.
<seq> ::= <item> | <seq> <item>
So, the following equation performs the required transformation.
addGroups (Seq r1 r2) = Seq (forceSeq (addGroups r1)) (forceItem (addGroups r2)))
The BNF for the Star
structure requires an <element>}
as a component, so we have the following equation in this case.
addGroups (Star r) = Star (forceElem (addGroups r))
No other structures have RE
s inside, so there is no
embedding to do.
addGroups r = r
To handle the unparsing problem, then, we can simply apply the addGroups function before using the simple unparse
function above.
Another approach is to modify the unparse
equations directly
to use the force...
functions just-in-time.
unparse :: RE -> [Char]
unparse (Alt r1 r2) = (unparse r1) ++ "|" ++ (unparse (forceSeq r2))
unparse (Seq r1 r2) = (unparse (forceSeq r1)) ++ (unparse (forceItem r2))
unparse (Star r1) = (unparse (forceElem r1)) ++ "*"
unparse (Group r1) = '(': (unparse r1) ++ ")"
unparse (Ch c) = c:[]