Not logged in. Login

Unicode Regular Expressions

Unicode

Unicode is a common coding conventions for characters in all the world's languages.

  • Each character is assigned a unique codepoint - a numeric value in the range 0-0x10FFFF.
  • Assignments and standards are managed by the Unicode consortium.
  • 1,114,112 codepoints, most of them still unassigned.
  • Agreement on the Unicode standard means that emails, web pages, text messages and many other textual information sources can be communicated across national and linguistic boundaries.
  • Alphabetic characters:
    • Latin
    • Katakana: カタカナ
    • Cyrillic: Кириллица
    • Greek: ελληνικά
  • Ideographs
    • Chinese: 新年快乐 (Happy Chinese New Year, by the way!)
  • Symbols
    • Math: ∀ ∁ ∂ ∃ ∄ ∅ ∆ ∇ ∈ ∉
    • Emoji:? ? ?

Encoding Forms

  • Unicode codepoints can be encoded in different forms:
    • UTF-8: 8-bit code units, 1 to 4 code units per codepoint
    • UTF-16: 16-bit code units, 1 or 2 code units per codepoint
    • UTF-16LE: byte order serialization of UTF-16 with low byte first
    • UTF-32: 32-bit code units

Unicode Regular Expressions

Unicode Technical Standard 18 defines a series of requirements for Unicode regular expressions, at 3 levels of support.

  1. Basic Unicode Support - codepoint notation, properties, set operations
  2. Extended Unicode Support - grapheme clusters, full properties, equivalent forms
  3. Tailored Unicode Support - dependent on a specific locale.

Level 1: Basic Unicode Support

  • Hex notation for codepoints
    • \u{2087} for SUBSCRIPT SEVEN
    • ranges [\u{2080}-\u{2089}], for SUBSCRIPT ZERO through NINE
  • Properties
    • \p{Greek} - class for all characters in Greek script
    • \p{Whitespace} - class for whitespace characters
    • [\p{Letter}\p{Digit}] - letters and digits in any script
  • Set operations
    • [\p{Cyrillic}&&\p{Upper}] - cyrillic upper case letters
  • Unicode simple case folding for case-insensitive matching
  • Unicode line breaks, word boundaries.

Level 2: Extended Unicode Support

  • Grapheme clusters are sequences of Unicode codepoints that may be considered as a single character.
    • Typically, a base character plus combining marks.
    • Example: o + horn + dot_below
      • U+006F ( o ) LATIN SMALL LETTER O
      • U+031B ( ◌̛ ) COMBINING HORN
      • U+0323 ( ◌̣ ) COMBINING DOT BELOW
  • Canonical equivalence: multiple sequences considered equivalent
    • o + horn + dot_below
      • U+006F ( o ) LATIN SMALL LETTER O
      • U+031B ( ◌̛ ) COMBINING HORN
      • U+0323 ( ◌̣ ) COMBINING DOT BELOW
    • o + dot_below + horn
      • U+006F ( o ) LATIN SMALL LETTER O
      • U+0323 ( ◌̣ ) COMBINING DOT BELOW
      • U+031B ( ◌̛ ) COMBINING HORN
    • o-horn + dot_below
      • U+01A1 ( ơ ) LATIN SMALL LETTER O WITH HORN
      • U+0323 ( ◌̣ ) COMBINING DOT BELOW
    • o-dot_below + horn
      • U+1ECD ( ọ ) LATIN SMALL LETTER O WITH DOT BELOW
      • U+031B ( ◌̛ ) COMBINING HORN
    • o-horn-dot_below
      • U+1EE3 ( ợ ) LATIN SMALL LETTER O WITH HORN AND DOT BELOW
  • Matching in terms of grapheme clusters
    • . (the Any metacharacter) matches full grapheme clusters.
    • \b{g} is a zero-width assertion that matches at a grapheme cluster boundary
    • In grapheme cluster mode, subexpressions must match to grapheme cluster boundaries.
    • Equivalently \b{g} is implicitly inserted at the end of every subexpression.

Symbolic Computing Problems for Unicode REs

Grapheme Cluster Boundaries

  • Unicode Standard Annex 29: Text Segmentation defines the rules for text segmentation.
  • The implementation of grapheme cluster boundaries is complex:
  • But grapheme clusters boundaries can often be simplified or eliminated in context.
    • If the characters on both sides of a boundary are known, the boundary assertion can usually be eliminated.
      • a\b{g}b simplifies to ab.
      • a\b{g}\u{031B} simplifies to ? (empty set - never matches).
    • If characters on one side of a boundary are known, the boundary assertion can be simplified to just one case.
    • re_contextual_simplification.cpp in icgrep.
      • Based on last semester course project by CMPT 384 student.

Canonically Equivalent Matching

  • A regular expression E matches a set of strings S.
  • A regular expression satisfies the canonical equivalence property if
    • for any string s in S, all canonical equivalents to s are also in S.
  • Given an arbitrary regular expression E, compute an expression E' such that:
    • any string s matched by E is also matched by E'
    • for any string s matched by E, all canonically equivalent strings s' are matched by E'
  • NFD is the Unicode Normal Form for fully decomposed sequences.
    • All precomposed characters (such as U+01A1 ( ơ ) LATIN SMALL LETTER O WITH HORN) are decomposed into equivalent sequences of base characters plus marks.
    • All sequences of marks are put in canonical order (Unicode canonical ordering algorithm).
  • UTS #18 suggests that both the regular expression and the text being matched are put in NFD in order to look for canonical matches.
    • Transforming large text files to NFD is an expensive process.
  • Alternative: compute the all-equivalents RE.
    • NFD form, plus all equivalents with precomposed characters.
    • All orderings of the marks allowed by Unicode canonical ordering algoritm.
    • Make an alternation of the orderings?
      • Combinatorial explosion for sequences with many marks!
    • Linear expansion: approach: one mark coverage expression for each mark M
      • \p{Mark}*M\p{Mark}*
      • Embed each mark coverage in a lookbehind assertion (?<=\p{Mark}*M\p{Mark}*)
    • equivalence.cpp in icgrep.
Updated Thu July 25 2024, 11:37 by cameron.