Regular Expressions - Introduction

Regular Expressions

Regular expressions are a grammatical notation often used for describing the structure of programming language tokens or symbols. Other practical applications:

Powerful find-and-replace operations in editors such emacs and ex.
Powerful string processing features in scripting languages such as Perl, Javascript, PHP and Python.
Automatic generation of lexical analyzers using tools such as Flex.
Searching for gene and protein sequence patterns in DNA and RNA.

Classical Regular Expressions

A simplified version of regular expressions is the following definition of regular expressions over an alphabet S.

Base Rule 1: ε is a regular expression that stands for the the empty string.
Base Rule 2: Any character of S is a regular expression that stands for itself.
Inductive Rule 1: If A and B are regular expressions, then A B is a regular expression that stands for the strings which are formed by concatenating a string represented by A and a string represented by B.
Inductive Rule 2: If A and B are regular expressions, then A|B is a regular expression that stands for all the strings which are represented by either A or B.
Inductive Rule 3: If A is a regular expression, then A* is a regular expression that stands for all the strings which consist of the concatentation of zero or more strings represented by A.
Inductive Rule 4: If A is a regular expression, then (A) is an equivalent regular expression.

In the above rules, A and B are metavariables which stand for regular expressions.

To complete the syntactic specification of regular expressions, the following precedence rules are standard. Repetition has the highest precedence, so that A B* means A (B*) not (A B)*. Alternation has the lowest precedence so that A B | C means (A B) | C not A (B | C).

Names are frequently used as shorthands for already defined expressions. However, names may not be used recursively. It is always possible to replace a name by the regular expression it denotes.

Examples over the set of ASCII characters.

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
integer-numeral = digit digit*
real-numeral = digit digit* . digit digit* E (ε | + | -) digit digit*
identifier = letter (letter | digit)*

A common extension is:

Inductive Rule 5: If A is a regular expression, then A+ is a regular expression that stands for all the strings which consist of the concatentation of one or more strings represented by A. This shortens things:
integer-numeral = digit+
real-numeral = digit+ . digit+ E (ε | + | -) digit+
identifier = letter (letter | digit)*

Regular Expressions Define Sets Of Strings

Formally, regular expressions define sets of strings.

Base Rule 1: ε stands for the set whose only member is the empty string.
Base Rule 2: Any character of S is a regular expression that stands for the set whose only member is the string consisting of that single character.
Inductive Rule 1: A B stands for the set of all possible concatenations of one string from the set A and one string from the set B.
Inductive Rule 2: A|B is the union of sets A and B.
Inductive Rule 3: A* is equivalent to the "infinite" regular expression (ε | A | A A | A A A | A A A A | ...). That is, it is the set of all possible concatenations of zero or more strings from set A.
Inductive Rule 4: (A) denotes the same set as A.

Regular Expressions Versus BNF

Regular expressions and BNF are two grammatical formalisms for describing sets of strings. Regular expressions are a concise and convenient notation for describing syntax when "nesting" is not an issue. BNF is a more powerful notation that allows for the description of nested language constructs using nonterminal symbols in arbitrary recursive combinations. Thus regular expressions are appropriate for token-level syntax of programming languages, while BNF is required for the higher-level recursive syntax of expressions, statements and so on.

In formal language theory, the class of languages definable by regular expressions is called the class of regular languages. The languages definable by BNF grammars are called context-free languages. The class of regular languages is a subset of the class of context-free languages.

Further Material

Wikipedia: Regular Expressions
regular-expressions.info - Explaining and comparing regular expression libraries and tools
UTS #18: Unicode Regular Expressions
icgrep - Ultra-fast modern grep search tool

Updated Mon Jan. 01 2018, 11:25 by cameron.

Simon Fraser University
Engaging the World

CourSys

Regular Expressions - Introduction