Unicode and Internationalization in Software Engineering

Unicode

  • Unicode is a universal character encoding standard designed to incorporate the characters used in all the world's writing systems.
  • Unicode characters are identified by code points, unique numeric values in the range 0..0x10FFFF. (1,114,112 total code points)
  • Unicode is divided into 17 planes of 65,536 code points each.
  • Unicode 9.0 allocates 271,792 code points:
    • 128,237 assigned characters
    • 137,468 code points reserved for private use
    • 2,048 code points for surrogates
    • 66 noncharacter code points, e.g., 0xFFFE, 0xFFFF.
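
The structural figures above (total code points, planes, private use, surrogates, noncharacters) can be checked by direct enumeration. The following is a minimal Python sketch; the helper names are illustrative and not part of any library, and the ranges are those fixed by the Unicode Standard. The count of assigned characters depends on the Unicode version and is not derived here.

```python
# Illustrative helpers; the names are assumptions, not a library API.
MAX_CODE_POINT = 0x10FFFF


def plane_of(cp: int) -> int:
    """Return the plane number (0..16) that contains code point cp."""
    return cp >> 16


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_private_use(cp: int) -> bool:
    # BMP private use area, plus planes 15 and 16 minus their noncharacters.
    return 0xE000 <= cp <= 0xF8FF or (cp >= 0xF0000 and not is_noncharacter(cp))


all_code_points = range(MAX_CODE_POINT + 1)
total = len(all_code_points)
planes = plane_of(MAX_CODE_POINT) + 1
surrogates = sum(map(is_surrogate, all_code_points))
noncharacters = sum(map(is_noncharacter, all_code_points))
private_use = sum(map(is_private_use, all_code_points))

print(total, planes, surrogates, noncharacters, private_use)
# 1114112 17 2048 66 137468
```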

Unicode Code Point Issues

  • How does the software handle noncharacter code points? Is it possible that noncharacters will be erroneously processed as characters?
  • How does the system handle private use code points? Does it avoid mixing private use code points to which different groups have assigned different meanings? For example:
    • 0xF8FF for the Apple logo (Apple)
    • 0xF8FF for KLINGON MUMMIFICATION GLYPH (ConScript Unicode Registry)
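
One way to start probing these questions is to flag noncharacters and private use code points in input before it is processed further. The sketch below is an assumption-laden illustration: `audit_code_points` is a hypothetical helper, and Python's standard `unicodedata` module stands in for whatever property database the system under test uses. Detection is all it does; deciding which meaning a private use code point carries still requires agreement outside the text itself.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def audit_code_points(text: str) -> list:
    """Return (index, 'U+XXXX', kind) for each noncharacter or private use code point."""
    findings = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if is_noncharacter(cp):
            kind = "noncharacter"
        elif unicodedata.category(ch) == "Co":  # general category Co: private use
            kind = "private use"
        else:
            continue
        findings.append((index, f"U+{cp:04X}", kind))
    return findings


sample = "logo: \uF8FF, stray: \uFFFE"
for finding in audit_code_points(sample):
    print(finding)
```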

Unicode Encoding Formats

  • UTF-8 is a system for encoding Unicode using 8-bit code units, with each code point encoded as a sequence of 1 to 4 code units.
  • UTF-16 is a system for encoding Unicode using 16-bit code units. All the code points of the Basic Multilingual Plane (0..0xFFFF), other than the surrogates themselves, can be encoded as a single UTF-16 code unit. Characters in the supplementary planes are encoded as a surrogate pair, a sequence of 2 code units in the range 0xD800..0xDFFF.
  • UTF-32 allows every code point to be encoded as a single 32-bit code unit.
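
The differences between the three formats show up in the number of code units each needs for the same code point. The sketch below uses Python's built-in codecs to count code units and computes a surrogate pair by hand; the function names are illustrative only.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def code_unit_counts(ch: str) -> tuple:
    """Return the number of UTF-8, UTF-16 and UTF-32 code units for ch."""
    return (
        len(ch.encode("utf-8")),           # 8-bit units
        len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    )


for ch in ["A", "\u00E9", "\u20AC", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
# U+0041 (1, 1, 1)
# U+00E9 (2, 1, 1)
# U+20AC (3, 1, 1)
# U+1F600 (4, 2, 1)

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```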

UTF-8 Encoding Issues

Does the software correctly encode UTF-8? See the UTF-8 to UTF-16 Transcoder Testing case study for details.
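
As a rough illustration of the kinds of choices such testing might cover (this is not taken from the case study), the sketch below checks encoded lengths at each boundary between 1-, 2-, 3- and 4-unit sequences and confirms that several ill-formed byte sequences are rejected. Python's built-in `utf-8` codec stands in for the software under test, and the helper names are hypothetical.

```python
def expected_utf8_length(cp: int) -> int:
    """Number of UTF-8 code units required for a Unicode scalar value."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


# Values on each side of every length transition.
BOUNDARIES = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

# Byte sequences that a conforming decoder must reject.
MALFORMED = [
    b"\xC0\x80",          # overlong encoding of U+0000
    b"\xED\xA0\x80",      # encoded surrogate U+D800
    b"\xF4\x90\x80\x80",  # value above U+10FFFF
    b"\xE2\x82",          # truncated three-byte sequence
    b"\x80",              # continuation byte with no lead byte
]


def is_rejected(data: bytes) -> bool:
    """True if strict UTF-8 decoding of data fails."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return True
    return False


lengths_ok = all(
    len(chr(cp).encode("utf-8")) == expected_utf8_length(cp) for cp in BOUNDARIES
)
rejections_ok = all(is_rejected(data) for data in MALFORMED)
print(lengths_ok, rejections_ok)  # True True
```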

UTF-16 Encoding Issues

There are a number of quality issues for UTF-16 processing. (These issues may be used to generate choices for functional testing.)

  • Does the system recognize both UTF-16BE and UTF-16LE forms, based on the byte order mark 0xFEFF occurring at the head of a character stream?
  • Does the software allow code points in the range 0x10000..0x10FFFF to be encoded as surrogate pairs?
  • Does the software recognize that a high surrogate in the 0xD800..0xDBFF range must always be immediately followed by a low surrogate in the 0xDC00..0xDFFF range, and that a low surrogate may never appear except immediately after a high surrogate?
  • Does the system produce correct character counts, counting occurrences of a surrogate pair as a single character?
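
The checks above can be sketched as a small reference implementation against which a system's behavior might be compared. This is a minimal sketch under stated assumptions: `detect_byte_order` and `count_code_points` are hypothetical helpers, and input without a byte order mark is simply reported as undetermined.

```python
from typing import List, Optional


def detect_byte_order(data: bytes) -> Optional[str]:
    """Return 'utf-16-be' or 'utf-16-le' based on a leading BOM, else None."""
    if data[:2] == b"\xFE\xFF":
        return "utf-16-be"
    if data[:2] == b"\xFF\xFE":
        return "utf-16-le"
    return None


def count_code_points(units: List[int]) -> int:
    """Count code points in UTF-16 code units, rejecting unpaired surrogates."""
    count = 0
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate: needs a low one next
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at index {i}")
            i += 2                    # the pair counts as one character
        elif 0xDC00 <= unit <= 0xDFFF:  # low surrogate with no high before it
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            i += 1
        count += 1
    return count


# "a", U+1F600, "b" as little-endian UTF-16 with an explicit BOM.
data = b"\xFF\xFE" + "a\U0001F600b".encode("utf-16-le")
order = detect_byte_order(data)
payload = data[2:]
byteorder = "big" if order == "utf-16-be" else "little"
units = [int.from_bytes(payload[i:i + 2], byteorder) for i in range(0, len(payload), 2)]
print(order, len(units), count_code_points(units))  # utf-16-le 4 3
```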

Combining Marks and Precomposed Characters

For accented characters and other purposes, Unicode supports the concept of grapheme clusters: sequences of code points which, taken together, appear as a single logical character. Some of the possible combinations are also given dedicated code points of their own (precomposed characters). As a result, several different code point sequences can represent the same character, and all of them should be considered equivalent.

  • Does the software treat all of the following as equivalent?
  1. o + horn + dot_below
    1. 0x006F ( o ) LATIN SMALL LETTER O
    2. 0x031B ( ◌̛ ) COMBINING HORN
    3. 0x0323 ( ◌̣ ) COMBINING DOT BELOW
  2. o + dot_below + horn
    1. 0x006F ( o ) LATIN SMALL LETTER O
    2. 0x0323 ( ◌̣ ) COMBINING DOT BELOW
    3. 0x031B ( ◌̛ ) COMBINING HORN
  3. o-horn + dot_below
    1. 0x01A1 ( ơ ) LATIN SMALL LETTER O WITH HORN
    2. 0x0323 ( ◌̣ ) COMBINING DOT BELOW
  4. o-dot_below + horn
    1. 0x1ECD ( ọ ) LATIN SMALL LETTER O WITH DOT BELOW
    2. 0x031B ( ◌̛ ) COMBINING HORN
  5. o-horn-dot_below
    1. 0x1EE3 ( ợ ) LATIN SMALL LETTER O WITH HORN AND DOT BELOW
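
The five sequences above compare unequal as raw code point strings, but they become identical after Unicode normalization. The sketch below uses `unicodedata.normalize` from Python's standard library to show this; normalizing before comparison is one common way to get equivalence, not necessarily how a given system does it.

```python
import unicodedata

sequences = [
    "\u006F\u031B\u0323",  # o + horn + dot below
    "\u006F\u0323\u031B",  # o + dot below + horn
    "\u01A1\u0323",        # o-horn + dot below
    "\u1ECD\u031B",        # o-dot-below + horn
    "\u1EE3",              # precomposed o-horn-dot-below
]

raw_distinct = len(set(sequences))  # distinct strings before normalization
nfc_forms = {unicodedata.normalize("NFC", s) for s in sequences}
nfd_forms = {unicodedata.normalize("NFD", s) for s in sequences}

print(raw_distinct)                          # 5
print(nfc_forms == {"\u1EE3"})               # True: all compose to U+1EE3
print(nfd_forms == {"\u006F\u031B\u0323"})   # True: horn sorts before dot below
```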

Internationalization

  • Does the software support generalization of character types (e.g., letters, digits, symbols) by working with Unicode properties?
  • ICU (International Components for Unicode) is a toolkit that provides access to all the Unicode character properties.
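
To illustrate what working with Unicode properties looks like, the sketch below classifies characters by General Category rather than by ASCII range. It uses Python's standard `unicodedata` module as a stand-in for ICU, since it exposes only a subset of the properties ICU provides; `is_letter` and `digit_value` are hypothetical helper names.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if the General Category of ch is one of the letter categories (L*)."""
    return unicodedata.category(ch).startswith("L")


def digit_value(ch: str):
    """Return the decimal digit value of ch, or None if it is not a decimal digit."""
    return unicodedata.decimal(ch, None)


samples = ["A", "\u03A9", "\u0416", "7", "\u0663", "\u096B", "+", "\u20AC"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# An ASCII-only range test misses digits from other scripts.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
print("0" <= arabic_three <= "9", digit_value(arabic_three))  # False 3
```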

Does the software present appropriate locale-dependent information based on the Unicode CLDR?

Unicode Common Locale Data Repository

Security

Unicode Security Considerations

Updated Sun Sept. 10 2023, 10:41 by cameron.