Character Class Multiplexing
The process of character class multiplexing is to find a minimal set of
character class basis streams from which all the character classes used in
a particular regular expression can be computed. For example, the regular expression
[a-z]+ingly!
has seven nominal character classes:
[a-z]
, [i]
, [n]
, [g]
, [l]
, [y]
, and [!]
.
By applying the multiplexing concept, however, a set of three character class bit
streams can be computed which represent all the possibilities.
Exclusive Character Class Set
The first step in character class multiplexing is to determine a set of
mutually-exclusive and collectively-exhaustive
character classes that represent the nominal character classes as well as
input characters that are not in any class. In this case, the class
of all characters not present in the expression is [^a-z!]
, while
the minimal set of mutually exclusive classes that include all the characters
in the expression comprises
[!]
, [a-fhjkmo-xz]
, [g]
, [i]
, [l]
, [n]
, and [y]
.
Multiplexed Encoding
Given a set of mutually exclusive and collectively exhaustive character classes of size \(N\), a \(\log_2 N\) bit encoding can be computed to represent the classes. Given the 8 classes of our running example, we can assign a 3-bit code to each class as follows.
- 000:
[^a-z!]
- 001:
[!]
- 010:
[a-fhjkmo-xz]
- 011:
[g]
- 100:
[i]
- 101:
[l]
- 110:
[n]
- 111:
[y]
Multiplexed Classes
In order to compute the final set of multiplexed basis streams, we can determine the appropriate character classes for each bit of the multiplexed encoding. Our running example then reduces to the following 3 character classes for the multiplexed encoding (little-endian bit numbering).
- bit 2:
[inly]
- bit 1:
[a-hjkm-z]
- bit 0:
[!gly]
Why Multiplex?
- A reduced number of streams represent all the possible character classes.
- More expensive operations can be implemented on the reduced stream set.
Advance
andMatchStar
are more expensive than bitwise logic.- Example:
Advance
on 7 streams: multiplex to 3 streams first.