Regular Expressions - Introduction

Regular Expressions

Regular expressions are a grammatical notation often used for describing the structure of programming language tokens or symbols. Other practical applications:

Powerful find-and-replace operations in editors such emacs and ex.
Powerful string processing features in scripting languages such as Perl, Javascript, PHP and Python.
Automatic generation of lexical analyzers using tools such as Flex.
Searching for gene and protein sequence patterns in DNA and RNA.

Regular expressions may be considered to be string patterns which match subject strings under certain conditions.

Classical Regular Expressions

A simplified version of regular expressions is the following definition of regular expressions over an alphabet S.

Base Rule 1: ε is a regular expression that matches the empty string.
Base Rule 2: Any character of S is a regular expression that matches a string consisting only of the character itself.
Inductive Rule 1: If A and B are regular expressions, then A B is a regular expression that matches strings which are formed by concatenating a string matched by A and a string matched by B.
Inductive Rule 2: If A and B are regular expressions, then A|B is a regular expression that matches any string which is matched by either A or B.
Inductive Rule 3: If A is a regular expression, then A* is a regular expression that stands for any string which consists of the concatentation of zero or more strings matched by A.
Inductive Rule 4: If A is a regular expression, then (A) is an equivalent regular expression.

In the above rules, A and B are metavariables which stand for regular expressions.

To complete the syntactic specification of regular expressions, the following precedence rules are standard. Repetition has the highest precedence, so that A B* means A (B*) not (A B)*. Alternation has the lowest precedence so that A B | C means (A B) | C not A (B | C).

Names are frequently used as shorthands for already defined expressions. However, names may not be used recursively. It is always possible to replace a name by the regular expression it denotes.

Examples over the set of ASCII characters.

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
integer-numeral = digit digit*
real-numeral = digit digit* . digit digit* E (ε | + | -) digit digit*
identifier = letter (letter | digit)*

A common extension is:

Inductive Rule 5: If A is a regular expression, then A+ is a regular expression that matches any string which consists of the concatenation of one or more strings matched by A. This shortens things:
integer-numeral = digit+
real-numeral = digit+ . digit+ E (ε | + | -) digit+
identifier = letter (letter | digit)*

Regular Expressions Define Sets Of Strings

Formally, regular expressions define sets of strings.

Base Rule 1: ε stands for the set whose only member is the empty string.
Base Rule 2: Any character of S is a regular expression that stands for the set whose only member is the string consisting of that single character.
Inductive Rule 1: A B stands for the set of all possible concatenations of one string from the set A and one string from the set B.
Inductive Rule 2: A|B is the union of sets A and B.
Inductive Rule 3: A* is equivalent to the "infinite" regular expression (ε | A | A A | A A A | A A A A | ...). That is, it is the set of all possible concatenations of zero or more strings from set A.
Inductive Rule 4: (A) denotes the same set as A.

Practical Regular Expressions

Many programming tools have support for regular expressions with many more features than the definitions above. Some of these features are simple, but useful shorthands, while others extend the power of the matching technology.

The ? character may be added to match 0 or 1 occurrences of a regular expression.
- (ab)? is equivalent to (ab)|ε

The {m,n} notation may be added for m to n occcurences of a regular expression.
- a{2,4} is equivalent to aaa?a?

Character classes within square brackets may be used insted of alternation.
- [abc] is equivalent to a | b | c

Character classes may have ranges indicated with hyphens.
- [a-f0-9] is equivalent to [abcdef0123456789]

Backslashes may be use to match metacharacters.
- [a-z]\*[0-9] matches any 3 character string consisting of a lower case letter, an asterisk, and a digit.

Unicode categories may be used.
- [\p{Greek}&&\p{uppercase}] matches any single character that is an upper case Greek letter.

Regular Expressions Versus BNF

Regular expressions and BNF are two grammatical formalisms for describing sets of strings. Regular expressions are a concise and convenient notation for describing syntax when "nesting" is not an issue. BNF is a more powerful notation that allows for the description of nested language constructs using nonterminal symbols in arbitrary recursive combinations. Thus regular expressions are appropriate for token-level syntax of programming languages, while BNF is required for the higher-level recursive syntax of expressions, statements and so on.

How can we describe the syntax of regular expressions?

In formal language theory, the class of languages definable by regular expressions is called the class of regular languages. The languages definable by BNF grammars are called context-free languages. The class of regular languages is a subset of the class of context-free languages.

Further Material

Wikipedia: Regular Expressions
regular-expressions.info - Explaining and comparing regular expression libraries and tools
UTS #18: Unicode Regular Expressions
icgrep - Ultra-fast modern grep search tool

Updated Thu Sept. 06 2018, 11:57 by cameron.

Simon Fraser University
Engaging the World

CourSys

Regular Expressions - Introduction