Between Lexics and Syntax: Whitespace, Layout and Comments

Fixed Format Languages

Early programming languages, such as FORTRAN 66, were fixed-format, specifying fixed character positions at which particular components of program statements and declarations were to occur. The basic conventions of FORTRAN 66 are as follows.

Column 1 was reserved for the comment code C, which, if present, indicated that the rest of the line was a comment rather than a statement or declaration.
Columns 1 through 5 of a non-comment line held numeric labels that could be used as targets of control transfer statements (e.g., GO TO).
Column 6 held the continuation mark, which indicated that the line continued a statement or declaration begun on the previous line.
Columns 7 through 72 held program statements and declarations.
Columns 73 through 80 held sequence numbers, used to re-sort the lines if their order became confused (e.g., if a deck of punch cards was dropped).
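
The column conventions above can be sketched as a small Python function that slices one source line into its fields. This is only an illustration: the names Card and split_card are hypothetical, and the sketch assumes that any character other than blank or zero in column 6 marks a continuation line.

```python
from typing import NamedTuple


class Card(NamedTuple):
    """Fields of one fixed-format source line (hypothetical names)."""
    is_comment: bool
    label: str
    continuation: bool
    statement: str
    sequence: str


def split_card(line: str) -> Card:
    # Pad to 80 columns so that slicing is safe for short lines.
    card = line.rstrip("\n").ljust(80)
    if card[0] == "C":
        # Column 1 holds C: the rest of the line is comment text.
        return Card(True, "", False, card[1:72].strip(), card[72:80].strip())
    return Card(
        is_comment=False,
        label=card[0:5].strip(),            # label field, before column 6
        continuation=card[5] not in " 0",   # column 6 (assumed rule)
        statement=card[6:72].rstrip(),      # columns 7 through 72
        sequence=card[72:80].strip(),       # columns 73 through 80
    )
```

For example, split_card("   10 X = 1") yields the label "10" and the statement "X = 1", with no continuation mark and an empty sequence field.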

BNF Grammars and Implicit Whitespace

Normally, BNF grammars of programming languages are written without an explicit description of the language rules for comments or whitespace. Whitespace refers to the spacing and line break characters that separate programming language tokens. A general rule followed by many languages is that whitespace or comments may appear between any two tokens, but a separator is required only between adjacent alphanumeric tokens, which would otherwise run together into a single token. Languages that freely allow such whitespace insertion are called free-format languages.
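
As a rough illustration of the free-format rule, the following Python sketch tokenizes a toy language in which whitespace is matched but never emitted. The token classes (alphanumeric words and single punctuation characters) and the name tokens are assumptions made for this example, and comments are left out for brevity.

```python
import re

# Hypothetical token classes for a toy language: alphanumeric words and
# single-character punctuation. Whitespace is matched but never emitted.
TOKEN = re.compile(r"\s+|[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")


def tokens(source: str) -> list[str]:
    """Return the tokens of source, discarding all whitespace."""
    return [t for t in TOKEN.findall(source) if not t.isspace()]
```

Here tokens("x=y+1") and tokens("x = y\n  + 1") produce the same token list, so the layout is free. Whitespace matters only between alphanumeric tokens: tokens("return x") gives two tokens, while tokens("returnx") gives one.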

Whitespace is Significant

FORTRAN treated blanks as completely insignificant in the body of a statement. This could be convenient for making numbers and variable names more readable. For example, the statement

HIVAL = 14713678923

could be more readably written as follows.

HI VAL = 14 713 678 923

However, ignoring blanks can sometimes cause problems. For example, consider the looping statement:

DO 3 I = 1,3

Suppose this were mistakenly coded with a period instead of a comma. Because blanks are ignored, the statement would then be interpreted not as a loop but as an assignment of 1.3 to the identifier DO3I:

DO3I = 1.3

Unfortunately, exactly this kind of error has been reported to have occurred in the control program for a Mariner spacecraft to Venus, resulting in its loss (Annals of the History of Computing, 1984, 6(1), page 6).
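
The ambiguity can be reproduced with a small Python sketch that, like FORTRAN, deletes all blanks before deciding what kind of statement it is looking at. The patterns are deliberately simplified assumptions (for instance, they ignore commas inside parentheses), and the names DO_LOOP, ASSIGNMENT and classify are hypothetical.

```python
import re

# Simplified shape of a DO statement once blanks are removed:
# DO <label> <variable> = <start> , <limit>
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")
ASSIGNMENT = re.compile(r"([A-Z][A-Z0-9]*)=(.+)")


def classify(statement: str) -> tuple:
    text = statement.replace(" ", "")   # blanks carry no meaning
    m = DO_LOOP.fullmatch(text)
    if m:
        return ("do", m.group(1), m.group(2), m.group(3), m.group(4))
    m = ASSIGNMENT.fullmatch(text)
    if m:
        return ("assign", m.group(1), m.group(2))
    return ("unknown", text)
```

With these patterns, classify("DO 3 I = 1,3") is recognized as a loop over I, whereas classify("DO 3 I = 1.3") returns ("assign", "DO3I", "1.3"). Only the comma distinguishes the two readings.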

Comment Conventions

Programming languages have two basic conventions for comments.

End-of-line comments. Comments continue from a comment begin token until the end of the input line.
Bracketed comments. Comments are enclosed in "comment brackets" (e.g., /* and */) and may span multiple lines.

When bracketed comments are used, it is important to be aware of the errors that can arise if the closing delimiter is mistakenly omitted. The comment then extends to the next closing delimiter in the file, and the compiler may skip the intervening statements without reporting an error.

Historically, many programming languages have not allowed bracketed comments to nest. This makes it difficult to "comment out" sections of code that themselves contain comments.
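
Both problems can be seen in a minimal Python sketch of a comment stripper for /* and */ brackets. The function name strip_comments and its nested flag are hypothetical, and string literals are ignored for simplicity.

```python
def strip_comments(source: str, nested: bool = False) -> str:
    """Remove /* ... */ comments; raise if a comment is never closed."""
    out = []
    depth = 0          # number of comments currently open
    i = 0
    while i < len(source):
        pair = source[i:i + 2]
        if pair == "/*" and (depth == 0 or nested):
            depth += 1
            i += 2
        elif pair == "*/" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(source[i])   # ordinary program text
            i += 1
    if depth > 0:
        raise SyntaxError("unterminated comment")
    return "".join(out)
```

Given the input "a = 1; /* first\nb = 2; /* second */ c = 3;", in which the first comment was never closed, the non-nesting mode silently drops b = 2; because the first comment runs on to the closing bracket of the second. With nested=True the same input raises an error instead, since one comment is still open at the end of the input. Conversely, commenting out a region that already contains a comment works only in the nesting mode; without nesting, the inner */ ends the comment early.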

Syntactic Comment Positioning

Although most languages allow comments to appear anywhere between tokens, it may be useful to define conventional syntactic positions for certain kinds of comments. For example, Java documentation comments appear immediately before class, interface, method, constructor or field declarations and serve to document those declarations. With conventions about the contents of these comments, documentation can be automatically generated.

Intelligent Layout: The Offside Rule

Miranda and Haskell are functional programming languages that follow Landin's offside rule: every token of an object must lie to the right of, or directly below, its first token.

The implementation of intelligent layout in Python is described in its indentation rules.
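
One common way to implement layout rules of this kind is to have the lexer keep a stack of open indentation levels and emit explicit INDENT and DEDENT markers, which is the approach Python's tokenizer takes. The following is a simplified sketch rather than Python's actual algorithm: the name layout_tokens is hypothetical, only spaces are counted, and each line is kept as a single unit instead of being split into tokens.

```python
def layout_tokens(source: str) -> list[str]:
    """Turn leading indentation into INDENT/DEDENT markers."""
    stack = [0]            # indentation levels currently open
    result = []
    for line in source.splitlines():
        if not line.strip():
            continue       # blank lines do not affect layout
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            result.append("INDENT")
        while width < stack[-1]:
            stack.pop()
            result.append("DEDENT")
        if width != stack[-1]:
            raise IndentationError("dedent matches no outer level")
        result.append(line.strip())
    # Close any blocks still open at the end of the input.
    result.extend("DEDENT" for _ in stack[1:])
    return result
```

A grammar written over this output can treat INDENT and DEDENT like opening and closing brackets, so the BNF itself never needs to mention column positions.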

Updated Tue Sept. 08 2015, 06:14 by cameron.

Simon Fraser University