Not logged in. Login

Symbolic Data Domains: Concrete and Abstract Syntax

Concrete vs Abstract Syntax

In symbolic computing, we often make a distinction between the concrete syntax of a symbolic data domain and a simplified internal form know as the abstract syntax.

The concrete syntax is the precise textual representation of a notation as defined by its grammar. In general, we use context-free grammars using some variant of BNF or EBNF for defining the concrete syntax.

The abstract syntax is a representation that strips away some details of the concrete syntax. In particular, grammar productions that are used to resolve ambiguities related to operator precedence an d associativity are often removed.

Consider the example of simple regular expressions, with our previously defined grammar.

<RE> ::= <seq> | <RE> "|" <seq>
<seq> ::= <item> | <seq> <item>
<item> ::= <element> | <element> "*"
<element> ::= <char> | "(" <RE> ")"
<char> ::= any character except "|", "*", "(", ")"

Now consider the following Haskell algebraic data type declarations that directly reflect the structure of the BNF grammar.

data RE = Seq1 SeqRE | Alt RE SeqRE
data SeqRE = Item1 ItemRE | Seq SeqRE ItemRE
data ItemRE = Elem1 ElemRE | Star ElemRE
data ElemRE = Ch1 CharRE | Group RE
data CharRE = Ch Char

These definitions give an internal representation that very much follows the concrete syntax and nature of the parsing process, and gives rise to a representation based on derivation trees or parsing trees.

Compare this to our one-line definition of a Haskell algebraic data type given previously.

data RE = Epsilon | Ch Char | Seq RE RE | Alt RE RE | Star RE | Group RE deriving Show

This definition defines the internal representation as abstract syntax trees (ASTs).

In general, ASTs are simpler than derivation trees. Consider the regular expression "ab".

  • As an AST the representation is Seq (Ch 'a') (Ch 'b')
  • As a derivation tree, the representation is
Seq1 (Seq (Item1 (Elem1 (Ch1 (Ch 'a')))) (Elem1 (Ch1 (Ch 'b'))))

Derivation or parsing trees have many more nodes, so most often we prefer to use the AST forms instead. ASTs also simplify the internal type structures and make the node construction more orthogonal. That is, when constructing a recursively-defined RE node, we can choose any type of RE for the components, and not be limited by the more specific types allowed in the derivation tree.

Unparsing

Unparsing is the process of converting the internal form of symbolic data objects back into a textual representation. In principle, this is fairly easy to do with straightforward unparsing equations.

unparse :: RE -> [Char]
unparse (Alt r1 r2) = (unparse r1) ++ "|" ++ (unparse r2)
unparse (Seq r1 r2) = (unparse r1) ++ (unparse r2)
unparse (Star r1) = (unparse r1) ++ "*"
unparse (Group r1) = '(': (unparse r1) ++ ")"
unparse (Ch c) = c:[]

But what happens if we try to unparse a structure such as Star (Seq (Ch 'a') (Ch 'b'))? We get an apparently straightforward result.

*Main> unparse (Star (Seq (Ch 'a') (Ch 'b')))
"ab*"

But how does this textual representation parse? According to our grammar and parser, we get a different internal representation!

Seq (Ch 'a') (Star (Ch 'b'))

This is a very different regular expression than Star (Seq (Ch 'a') (Ch 'b'))!

In order to deal with this problem, we need to add parentheses systematically. One way to do this is to have functions which add Group nodes to reflect our BNF grammar. For example, forceSeq is used when we require a <seq>, to ensure that the RE is parseable as a <seq>.

forceSeq :: RE -> RE

forceSeq (Alt r1 r2) = (Group (Alt r1 r2))
forceSeq x = x

Similarly, when an <item> is required, we can force the parentheses for grouping.

forceItem :: RE -> RE

forceItem (Alt r1 r2) = (Group (Alt r1 r2))
forceItem (Seq r1 r2) = (Group (Seq r1 r2))
forceItem x = x

Can you define forceElement?

Given these definitions, how do we now define a recursive function addGroups that can be used to produce correct unparses?

addGroups :: RE -> RE

Following the constraints given by the BNF grammar, we need to obey the restrictions for each type of element construction. For example, in the rule which defines Alt structures with the "|" operator, we have <RE> "|" <seq>, so we must ensure that the right hand side is a <seq>. The left hand side can be any RE so there will be change it.

addGroups (Alt r1 r2) = Alt (addGroups r1) (forceSeq (addGroups r2)))

In the case of a Seq structure, the relevant constraints are found in the second alternative of the <seq> BNF rule.

<seq> ::= <item> | <seq> <item>

So, the following equation performs the required transformation.

addGroups (Seq r1 r2) = Seq (forceSeq (addGroups r1)) (forceItem (addGroups r2)))

The BNF for the Star structure requires an <element>} as a component, so we have the following equation in this case.

addGroups (Star r) = Star (forceElem (addGroups r))

No other structures have REs inside, so there is no embedding to do.

addGroups r = r

To handle the unparsing problem, then, we can simply apply the addGroups function before using the simple unparse function above.

Another approach is to modify the unparse equations directly to use the force... functions just-in-time.

unparse :: RE -> [Char]
unparse (Alt r1 r2) = (unparse r1) ++ "|" ++ (unparse (forceSeq r2))
unparse (Seq r1 r2) = (unparse (forceSeq r1)) ++ (unparse (forceItem r2))
unparse (Star r1) = (unparse (forceElem r1)) ++ "*"
unparse (Group r1) = '(': (unparse r1) ++ ")"
unparse (Ch c) = c:[]
Updated Tue Oct. 09 2018, 11:16 by cameron.