Unicode Regular Expressions
Unicode
Unicode is a common coding conventions for characters in all the world's languages.
- Each character is assigned a unique codepoint - a numeric value in the range 0-0x10FFFF.
- Assignments and standards are managed by the Unicode consortium.
- 1,114,112 codepoints, most of them still unassigned.
- Agreement on the Unicode standard means that emails, web pages, text messages and many other textual information sources can be communicated across national and linguistic boundaries.
- Alphabetic characters:
- Latin
- Katakana: カタカナ
- Cyrillic: Кириллица
- Greek: ελληνικά
- Ideographs
- Chinese: 新年快乐 (Happy Chinese New Year, by the way!)
- Symbols
- Math: ∀ ∁ ∂ ∃ ∄ ∅ ∆ ∇ ∈ ∉
- Emoji:? ? ?
Encoding Forms
- Unicode codepoints can be encoded in different forms:
- UTF-8: 8-bit code units, 1 to 4 code units per codepoint
- UTF-16: 16-bit code units, 1 or 2 code units per codepoint
- UTF-16LE: byte order serialization of UTF-16 with low byte first
- UTF-32: 32-bit code units
Unicode Regular Expressions
Unicode Technical Standard 18 defines a series of requirements for Unicode regular expressions, at 3 levels of support.
- Basic Unicode Support - codepoint notation, properties, set operations
- Extended Unicode Support - grapheme clusters, full properties, equivalent forms
- Tailored Unicode Support - dependent on a specific locale.
Level 1: Basic Unicode Support
- Hex notation for codepoints
\u{2087}
for SUBSCRIPT SEVEN- ranges
[\u{2080}-\u{2089}]
, for SUBSCRIPT ZERO through NINE
- Properties
\p{Greek}
- class for all characters in Greek script\p{Whitespace}
- class for whitespace characters[\p{Letter}\p{Digit}]
- letters and digits in any script
- Set operations
[\p{Cyrillic}&&\p{Upper}]
- cyrillic upper case letters
- Unicode simple case folding for case-insensitive matching
- Unicode line breaks, word boundaries.
Level 2: Extended Unicode Support
- Grapheme clusters are sequences of Unicode codepoints that
may be considered as a single character.
- Typically, a base character plus combining marks.
- Example: o + horn + dot_below
- U+006F ( o ) LATIN SMALL LETTER O
- U+031B ( ◌̛ ) COMBINING HORN
- U+0323 ( ◌̣ ) COMBINING DOT BELOW
- Canonical equivalence: multiple sequences considered equivalent
- o + horn + dot_below
- U+006F ( o ) LATIN SMALL LETTER O
- U+031B ( ◌̛ ) COMBINING HORN
- U+0323 ( ◌̣ ) COMBINING DOT BELOW
- o + dot_below + horn
- U+006F ( o ) LATIN SMALL LETTER O
- U+0323 ( ◌̣ ) COMBINING DOT BELOW
- U+031B ( ◌̛ ) COMBINING HORN
- o-horn + dot_below
- U+01A1 ( ơ ) LATIN SMALL LETTER O WITH HORN
- U+0323 ( ◌̣ ) COMBINING DOT BELOW
- o-dot_below + horn
- U+1ECD ( ọ ) LATIN SMALL LETTER O WITH DOT BELOW
- U+031B ( ◌̛ ) COMBINING HORN
- o-horn-dot_below
- U+1EE3 ( ợ ) LATIN SMALL LETTER O WITH HORN AND DOT BELOW
- o + horn + dot_below
- Matching in terms of grapheme clusters
.
(the Any metacharacter) matches full grapheme clusters.\b{g}
is a zero-width assertion that matches at a grapheme cluster boundary- In grapheme cluster mode, subexpressions must match to grapheme cluster boundaries.
- Equivalently
\b{g}
is implicitly inserted at the end of every subexpression.
Symbolic Computing Problems for Unicode REs
Grapheme Cluster Boundaries
- Unicode Standard Annex 29: Text Segmentation defines the rules for text segmentation.
- The implementation of grapheme cluster boundaries is complex:
- For example, grapheme_clusters.cpp in icgrep.
- But grapheme clusters boundaries can often be simplified or eliminated in context.
- If the characters on both sides of a boundary are known, the boundary assertion can usually be eliminated.
a\b{g}b
simplifies toab
.a\b{g}\u{031B}
simplifies to ? (empty set - never matches).
- If characters on one side of a boundary are known, the boundary assertion can be simplified to just one case.
- re_contextual_simplification.cpp in icgrep.
- Based on last semester course project by CMPT 384 student.
- If the characters on both sides of a boundary are known, the boundary assertion can usually be eliminated.
Canonically Equivalent Matching
- A regular expression E matches a set of strings S.
- A regular expression satisfies the canonical equivalence property if
- for any string s in S, all canonical equivalents to s are also in S.
- Given an arbitrary regular expression E, compute an expression E' such that:
- any string s matched by E is also matched by E'
- for any string s matched by E, all canonically equivalent strings s' are matched by E'
- NFD is the Unicode Normal Form for fully decomposed sequences.
- All precomposed characters (such as U+01A1 ( ơ ) LATIN SMALL LETTER O WITH HORN) are decomposed into equivalent sequences of base characters plus marks.
- All sequences of marks are put in canonical order (Unicode canonical ordering algorithm).
- UTS #18 suggests that both the regular expression and the text being matched are put in NFD in order to look for canonical matches.
- Transforming large text files to NFD is an expensive process.
- Alternative: compute the all-equivalents RE.
- NFD form, plus all equivalents with precomposed characters.
- All orderings of the marks allowed by Unicode canonical ordering algoritm.
- Make an alternation of the orderings?
- Combinatorial explosion for sequences with many marks!
- Linear expansion: approach: one mark coverage expression for each mark M
\p{Mark}*M\p{Mark}*
- Embed each mark coverage in a lookbehind assertion
(?<=\p{Mark}*M\p{Mark}*)
- equivalence.cpp in icgrep.
Updated Thu July 25 2024, 11:37 by cameron.