Modelling Programming Languages as Symbolic Data
Programming languages are substantially more complex that expression languages that we have seen such as the domains of regular expressions and algebraic expressions. How does this affect the approach to modelling programs in such languages as symbolic data objects?
In general, we model data domains starting from some description of the domains, typically given as a grammar in BNF or EBNF.
Small Lisp
As an introduction to the modelling of programming languages as symbolic data objects, we consider the language Small Lisp. This is small, but surprisingly powerful symbolic computing language based on McCarthy's original m-Lisp notation. The Small Lisp Reference Manual defines the complete syntax and semantics in only 7 pages!
Given this overview of Small Lisp, we can now consider how to model Small Lisp programs as data objects. We begin by working through the grammar and considering the possible Haskell data representations.
Values and Identifiers
Values in Small Lisp are atoms and lists.
<S-expression> ::= <atom> | <list> <atom> ::= <numeric-atom> | <symbolic-atom> <list> ::= '(' {<S-expression>} ')'
Numeric-atoms and symbolic-atoms have syntax given by the following regular expressions.
<numeric-atom> = -?[0-9]+ <symbolic-atom> = [A-Za-z](-?[A-Za-z0-9])*|[+-*/<>=&|!@#$%?:]+
Note that <identifier>
s for variable and functions in Small Lisp have a similar syntax to that of symbolic atoms.
<identifier> = [A-Za-z](-?[A-Za-z0-9])*
Expressions
The structure of expressions is given by the following grammar rules.
<expression> ::= <value> | <variable> | <function-call> | <conditional-expression> | <let-expression> <value> ::= <numeric-atom> | '"' <symbolic-atom> '"' | <list> <variable> ::= <identifier> <function-call> ::= <function-identifier> '[' <expression> {';' <expression>} ']' <function-identifier> ::= <identifier> <conditional-expression> ::= '[' <clause> {';' <clause>}] <clause> ::= <expression> '-->' <expression> <let-expression> ::= '{' <local-definition> {';' <local-definition>} ':' <expression> '}' <local-definition> ::= <identifier> '=' <expression>
Programs: Function and Constant Definitions
<program> ::= {<definition>} <definition> ::= <constant-definition> | <function-definition> <constant-definition> ::= [<comment>] <identifier> '=' <expression> <function-definition> ::= [<comment>] <function-identifier> '[' <parameter> {';' <parameter> }']' '=' <expression> <parameter> ::= <identifier>
Comments have a syntax given by the following regular expression.
<comment> = (^;;;.*$)*
Symbolic Subdomains
One possible approach to modeling the structure of a programming language would be to take the entire grammar of the language and create a single symbolic data type that would encompass all the constructs of the grammar.
Monolithic Domain Approach
For example, working from the Small Lisp grammar above, we could introduce a single Small Lisp domain.
data SmLisp = NumAtom Int | SymAtom [Char] | List [SmLisp] | Variable [Char] | FnCall [Char] [SmLisp] | CondExpr [SmLisp] | CondClause SmLisp SmLisp | LetExpr [SmLisp] SmLisp | LocalDef [Char] SmLisp | ConstantDef [Char] SmLisp | FunctionDef [Char] [[Char]] SmLisp
This creates a single domain structure with 11 subdomains, but has several disadvantages.
- It allows inappropriate structures, such as a
LocalDef
inside aList
. - Writing functions on
SmLisp
objects is potentially complex, potentially involving equations for all 11 patterns. - It mixes things that have different semantics, for examples, it mixes expressions, which can be evaluated to produce a S-expression value, with definitions, which do not.
- It creates unnecessary distinctions, such as that for
CondClause
, which can only occur within aCondExpr
.
In essence, this approach is monolithic and non-modular.
Logical Subdomains
For programming languages, the symbolic computing subdomains are often grouped into a few principal kinds.
- Expressions: the constructs that express computation of values
- Types: the constructs that express the legal sets of values that may arise
- Statements: the constructs that express actions that potentially change the internal or external state of a computation
- Routine Definitions: the constructs that group together a set of logical computations and/or actions to implement a conceptual task.
- Modules: the constructs that group together a number of related entities that typically operate upon a common data structure.
In the example of Small Lisp, it makes sense to use three subdomains: values (S-expressions), expressions and definitions. In addition it is useful to introduce separate definitions that can only occur in a single context.
S-expressions are the fundamental values in Small Lisp.
data SExpression = NumAtom Int | SymAtom [Char] | List [SExpression]
Expressions are constructs that express computations to a produce a single S-expression value.
type Identifier = [Char] data SmLispExpr = SExpr SExpression | Variable Identifier | FnCall Identifier [SmLispExpr] | CondExpr [CondClause] | LetExpr [LocalDef] SmLispExpr data CondClause = Clause SMLispExpr SMLispExpr data LocalDef = Binding Identifier SmLispExpr
Here we have introduced auxiliary types CondClause
and LocalDef
. But since these constructs are each introduced for
one purpose only, we don't need the type tags. We could just have used 2-tuples.
type CondClause = (SMLispExpr, SMLispExpr) type LocalDef = (Identifier, SmLispExpr)
Finally, we also need to work out the structure for constant and function definitions.
data Definition = ConstantDef Identifier SmLispExpr | FunctionDef Identifier [Identifier] SmLispExpr type SmLispProgram = [Definition]
A Comment About Comments
Note that we have not included <comment>
s within the internal
structure of the Definition
domains. It is common that
comments are discarded during parsing and are absent from the AST (abstract syntax tree) representations. Indeed, in many programming
languages comments are treated as white space and my be freely inserted anywhere.
Small Lisp actually confines comments to occur only before a constant or function definition. This means that it would logically be easy to extend the AST structures to include the representation of comments.
data Definition = ConstantDef Comment Identifier SmLispExpr | FunctionDef Comment Identifier [Identifier] SmLispExpr type Comment = [Char]