Nonrecursive Syntax: The Regular Expression Metalanguage
Regular Expressions
Regular expressions are a grammatical notation often used for describing the structure of programming language tokens or symbols. Other practical applications:
- Powerful find-and-replace operations in editors such
emacs
andex
. - Powerful string processing features in scripting languages such as Perl, Javascript, PHP and Python.
- Automatic generation of lexical analyzers using tools such as Flex.
Classical Regular Expressions
A simplified version of regular expressions is the following definition of regular expressions over an alphabet S.
- Base Rule 1: ε is a regular expression that stands for the the empty string.
- Base Rule 2: Any character of S is a regular expression that stands for itself.
- Inductive Rule 1: If A and B are regular expressions, then A B is a regular expression that stands for the strings which are formed by concatenating a string represented by A and a string represented by B.
- Inductive Rule 2: If A and B are regular expressions, then A|B is a regular expression that stands for all the strings which are represented by either A or B.
- Inductive Rule 3: If A is a regular expression, then A* is a regular expression that stands for all the strings which consist of the concatentation of zero or more strings represented by A.
- Inductive Rule 4: If A is a regular expression, then (A) is an equivalent regular expression.
In the above rules, A and B are metavariables which stand for regular expressions.
To complete the syntactic specification of regular expressions, the following precedence rules are standard. Repetition has the highest precedence, so that A B* means A (B*) not (A B)*. Alternation has the lowest precedence so that A B | C means (A B) | C not A (B | C).
Names are frequently used as shorthands for already defined expressions. However, names may not be used recursively. It is always possible to replace a name by the regular expression it denotes.
Examples over the set of ASCII characters.
- digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- integer-numeral = digit digit*
- real-numeral = digit digit* . digit digit* E (ε | + | -) digit digit*
- identifier = letter (letter | digit)*
A common extension is:
- Inductive Rule 5: If A is a regular expression, then A+ is a regular expression that stands for all the strings which consist of the concatentation of one or more strings represented by A. This shortens things:
- integer-numeral = digit+
- real-numeral = digit+ . digit+ E (ε | + | -) digit+
- identifier = letter (letter | digit)*
Regular Expressions Define Sets Of Strings
Formally, regular expressions define sets of strings.
- Base Rule 1: ε stands for the set whose only member is the empty string.
- Base Rule 2: Any character of S is a regular expression that stands for the set whose only member is the string consisting of that single character.
- Inductive Rule 1: A B stands for the set of all possible concatenations of one string from the set A and one string from the set B.
- Inductive Rule 2: A|B is the union of sets A and B.
- Inductive Rule 3: A* is equivalent to the "infinite" regular expression (ε | A | A A | A A A | A A A A | ...). That is, it is the set of all possible concatenations of zero or more strings from set A.
- Inductive Rule 4: (A) denotes the same set as A.
Regular Expressions Versus BNF
Regular expressions and BNF are two grammatical formalisms for describing sets of strings. Regular expressions are a concise and convenient notation for describing syntax when "nesting" is not an issue. BNF is a more powerful notation that allows for the description of nested language constructs using nonterminal symbols in arbitrary recursive combinations. Thus regular expressions are appropriate for token-level syntax of programming languages, while BNF is required for the higher-level recursive syntax of expressions, statements and so on.
In formal language theory, the class of languages definable by regular expressions is called the class of regular languages. The languages definable by BNF grammars are called context-free languages. The class of regular languages is a subset of the class of context-free languages.
Further Material
- Wikipedia: Regular Expressions
- regular-expressions.info - Explaining and comparing regular expression libraries and tools
- UTS #18: Unicode Regular Expressions
- icgrep - Ultra-fast modern grep search tool