Unicode Regular Expressions

Unicode

Unicode is a common coding conventions for characters in all the world's languages.

Each character is assigned a unique codepoint - a numeric value in the range 0-0x10FFFF.
Assignments and standards are managed by the Unicode consortium.
1,114,112 codepoints, most of them still unassigned.
Agreement on the Unicode standard means that emails, web pages, text messages and many other textual information sources can be communicated across national and linguistic boundaries.
Alphabetic characters:
- Latin
- Katakana: カタカナ
- Cyrillic: Кириллица
- Greek: ελληνικά
Ideographs
- Chinese: 新年快乐 (Happy Chinese New Year, by the way!)
Symbols
- Math: ∀ ∁ ∂ ∃ ∄ ∅ ∆ ∇ ∈ ∉
- Emoji:? ? ?

Unicode codepoints can be encoded in different forms:
- UTF-8: 8-bit code units, 1 to 4 code units per codepoint
- UTF-16: 16-bit code units, 1 or 2 code units per codepoint
- UTF-16LE: byte order serialization of UTF-16 with low byte first
- UTF-32: 32-bit code units

Unicode Technical Standard 18 defines a series of requirements for Unicode regular expressions, at 3 levels of support.

Hex notation for codepoints
- \u{2087} for SUBSCRIPT SEVEN
- ranges [\u{2080}-\u{2089}], for SUBSCRIPT ZERO through NINE
Properties
- \p{Greek} - class for all characters in Greek script
- \p{Whitespace} - class for whitespace characters
- [\p{Letter}\p{Digit}] - letters and digits in any script
Set operations
- [\p{Cyrillic}&&\p{Upper}] - cyrillic upper case letters
Unicode simple case folding for case-insensitive matching
Unicode line breaks, word boundaries.

Matching in terms of grapheme clusters
- . (the Any metacharacter) matches full grapheme clusters.
- \b{g} is a zero-width assertion that matches at a grapheme cluster boundary
- In grapheme cluster mode, subexpressions must match to grapheme cluster boundaries.
- Equivalently \b{g} is implicitly inserted at the end of every subexpression.

Unicode Standard Annex 29: Text Segmentation defines the rules for text segmentation.
The implementation of grapheme cluster boundaries is complex:
- For example, grapheme_clusters.cpp in icgrep.
But grapheme clusters boundaries can often be simplified or eliminated in context.
- If the characters on both sides of a boundary are known, the boundary assertion can usually be eliminated.
  - a\b{g}b simplifies to ab.
  - a\b{g}\u{031B} simplifies to ? (empty set - never matches).
- If characters on one side of a boundary are known, the boundary assertion can be simplified to just one case.
- re_contextual_simplification.cpp in icgrep.
  - Based on last semester course project by CMPT 384 student.

A regular expression E matches a set of strings S.
A regular expression satisfies the canonical equivalence property if
- for any string s in S, all canonical equivalents to s are also in S.
Given an arbitrary regular expression E, compute an expression E' such that:
- any string s matched by E is also matched by E'
- for any string s matched by E, all canonically equivalent strings s' are matched by E'

NFD is the Unicode Normal Form for fully decomposed sequences.
- All precomposed characters (such as U+01A1 ( ơ ) LATIN SMALL LETTER O WITH HORN) are decomposed into equivalent sequences of base characters plus marks.
- All sequences of marks are put in canonical order (Unicode canonical ordering algorithm).

UTS #18 suggests that both the regular expression and the text being matched are put in NFD in order to look for canonical matches.
- Transforming large text files to NFD is an expensive process.

Updated Thu July 25 2024, 11:37 by cameron.