Modelling Programming Languages as Symbolic Data

Programming languages are substantially more complex that expression languages that we have seen such as the domains of regular expressions and algebraic expressions. How does this affect the approach to modelling programs in such languages as symbolic data objects?

In general, we model data domains starting from some description of the domains, typically given as a grammar in BNF or EBNF.

Small Lisp

As an introduction to the modelling of programming languages as symbolic data objects, we consider the language Small Lisp. This is small, but surprisingly powerful symbolic computing language based on McCarthy's original m-Lisp notation. The Small Lisp Reference Manual defines the complete syntax and semantics in only 7 pages!

Symbolic Computing with Small Lisp

Given this overview of Small Lisp, we can now consider how to model Small Lisp programs as data objects. We begin by working through the grammar and considering the possible Haskell data representations.

Values and Identifiers

Values in Small Lisp are atoms and lists.

<S-expression> ::= <atom> | <list>
<atom> ::= <numeric-atom> | <symbolic-atom>
<list> ::= '(' {<S-expression>} ')'

Numeric-atoms and symbolic-atoms have syntax given by the following regular expressions.

<numeric-atom> = -?[0-9]+
<symbolic-atom> = [A-Za-z](-?[A-Za-z0-9])*|[+-*/<>=&|!@#$%?:]+

Note that <identifier>s for variable and functions in Small Lisp have a similar syntax to that of symbolic atoms.

<identifier> = [A-Za-z](-?[A-Za-z0-9])*

Expressions

The structure of expressions is given by the following grammar rules.

<expression> ::= <value> | <variable> | <function-call> |
                 <conditional-expression> | <let-expression>
<value> ::= <numeric-atom> | '"' <symbolic-atom> '"' | <list>

<variable> ::= <identifier>

<function-call> ::= <function-identifier> '[' <expression> {';' <expression>} ']'
<function-identifier> ::= <identifier>

<conditional-expression> ::= '[' <clause> {';' <clause>}]
<clause> ::= <expression> '-->' <expression>

<let-expression> ::= '{' <local-definition> {';' <local-definition>} ':' <expression> '}'
<local-definition> ::= <identifier> '=' <expression>

Programs: Function and Constant Definitions

<program> ::= {<definition>}

<definition> ::= <constant-definition> | <function-definition>
<constant-definition> ::= 
   [<comment>] <identifier> '=' <expression>

<function-definition> ::= 
   [<comment>] 
   <function-identifier> '[' <parameter> {';' <parameter> }']' '=' <expression>
<parameter> ::= <identifier>

Comments have a syntax given by the following regular expression.

<comment> = (^;;;.*$)*

Symbolic Subdomains

One possible approach to modeling the structure of a programming language would be to take the entire grammar of the language and create a single symbolic data type that would encompass all the constructs of the grammar.

Monolithic Domain Approach

For example, working from the Small Lisp grammar above, we could introduce a single Small Lisp domain.

data SmLisp = NumAtom Int |
              SymAtom [Char] |
              List [SmLisp] |
              Variable [Char] |
              FnCall [Char] [SmLisp] |
              CondExpr [SmLisp] |
              CondClause SmLisp SmLisp |
              LetExpr [SmLisp] SmLisp |
              LocalDef [Char] SmLisp |
              ConstantDef [Char] SmLisp |
              FunctionDef [Char] [[Char]] SmLisp

This creates a single domain structure with 11 subdomains, but has several disadvantages.

It allows inappropriate structures, such as a LocalDef inside a List.
Writing functions on SmLisp objects is potentially complex, potentially involving equations for all 11 patterns.
It mixes things that have different semantics, for examples, it mixes expressions, which can be evaluated to produce a S-expression value, with definitions, which do not.
It creates unnecessary distinctions, such as that for CondClause, which can only occur within a CondExpr.

In essence, this approach is monolithic and non-modular.

Logical Subdomains

For programming languages, the symbolic computing subdomains are often grouped into a few principal kinds.

Expressions: the constructs that express computation of values
Types: the constructs that express the legal sets of values that may arise
Statements: the constructs that express actions that potentially change the internal or external state of a computation
Routine Definitions: the constructs that group together a set of logical computations and/or actions to implement a conceptual task.
Modules: the constructs that group together a number of related entities that typically operate upon a common data structure.

In the example of Small Lisp, it makes sense to use three subdomains: values (S-expressions), expressions and definitions. In addition it is useful to introduce separate definitions that can only occur in a single context.

S-expressions are the fundamental values in Small Lisp.

data SExpression = NumAtom Int | SymAtom [Char] | List [SExpression]

Expressions are constructs that express computations to a produce a single S-expression value.

type Identifier = [Char]

data SmLispExpr = SExpr SExpression |
                  Variable Identifier |
                  FnCall Identifier [SmLispExpr] |
                  CondExpr [CondClause] |
                  LetExpr [LocalDef] SmLispExpr

data CondClause = Clause SMLispExpr SMLispExpr
data LocalDef = Binding Identifier SmLispExpr

Here we have introduced auxiliary types CondClause and LocalDef. But since these constructs are each introduced for one purpose only, we don't need the type tags. We could just have used 2-tuples.

type CondClause = (SMLispExpr, SMLispExpr)
type LocalDef = (Identifier, SmLispExpr)

Finally, we also need to work out the structure for constant and function definitions.

data Definition = ConstantDef Identifier SmLispExpr |
                  FunctionDef Identifier [Identifier] SmLispExpr

type SmLispProgram = [Definition]

A Comment About Comments

Note that we have not included <comment>s within the internal structure of the Definition domains. It is common that comments are discarded during parsing and are absent from the AST (abstract syntax tree) representations. Indeed, in many programming languages comments are treated as white space and my be freely inserted anywhere.

Small Lisp actually confines comments to occur only before a constant or function definition. This means that it would logically be easy to extend the AST structures to include the representation of comments.

data Definition = ConstantDef Comment Identifier SmLispExpr |
                  FunctionDef Comment Identifier [Identifier] SmLispExpr
type Comment = [Char]

Updated Thu Oct. 11 2018, 12:15 by cameron.

Simon Fraser University
Engaging the World

CourSys