Case Study: UTF-8 to UTF-16 Transcoder Testing

We examine the problem of testing a UTF-8 to UTF-16 validating transcoder to illustrate the following concepts in testing.

Functional Testing (Black-Box)
Equivalence Partitioning
Boundary-Value Analysis
Automated Test-Case Generation
Automated Test Execution
Automated Test Oracle

As this approach combines the black-box functional test case generation together with some white-box analysis of important test categories and choices, it is an example of grey-box testing.

This case study is illustrated by the test case generator and execution scripts in the http://u8u16.costar.sfu.ca/browser/QA/ directory.

Problem Statement: UTF-8 to UTF-16 Validation and Conversion

The problem is to convert valid, complete sequences of UTF-8 code units into their corresponding representation. UTF-8 is a variable length representation of Unicode of 1 to 4 bytes. UTF-16 is variable-length representation of 1 or 2 doublebytes. The following table shows conversion ranges of individual character code points.

Unicode	UTF-8	UTF-16
Codepoint	Byte Sequence	Doublebyte Sequence
00-0x7F	[`0x00-0x7F`]	[`0x0000-0x007F`]
0x80-0x7FF	[`0xC2-0xDF`] [`0x80-0xBF`]	[`0x0080-0x07FF`]
0x800-0xFFFF	[`0xE0-0xEF`] [`0x80-0xBF`] [`0x80-0xBF`]	[`0x07FF-0xFFFF]`
0x10000-0x10FFFF	[`0xF0-0xF4`] [`0x80-0xBF]` [`0x80-0xBF`] [`0x80-0xBF`]	[`0xD800-0xDBFF`] [`0xDC00-0xDFFF`]

Valid and Invalid UTF-8 Input Sequences

Some byte sequences do not represent valid UTF-8.

The prefix bytes 0xC0 and 0xC1 are illegal.
The prefix bytes 0xF5 through 0xFF are illegal.
Any prefix byte in the range 0xC2 through 0xDF must be followed by exactly one suffix byte in the range 0x80through 0xBF.
Any prefix byte in the range 0xE0 through 0xEF must be followed by exactly two suffix bytes in the range 0x80through 0xBF.
Any prefix byte in the range 0xF0 through 0xF4 must be followed by exactly three suffix bytes in the range 0x80through 0xBF.
A prefix byte 0xE0 requires that the range of the immediately next suffix byte be limited to the range 0xA0through 0xBF.
A prefix byte 0xED requires that the range of the immediately next suffix byte be limited to the range 0x80through 0x9F.
A prefix byte 0xF0 requires that the range of the immediately next suffix byte be limited to the range 0x90through 0xBF.
A prefix byte 0xF4 requires that the range of the immediately next suffix byte be limited to the range 0x80through 0x8F.
A suffix byte standing alone or beyond the specified number of suffix bytes for a given prefix byte is illegal.

Based on the requirements of UTF-8 to UTF-16 conversion, we take a functional testing approach, also known as black-box testing. This is in contrast to a structural testing approach in which we can see inside the box, often called white-box (or, sometimes, glass-box) testing.

Equivalence Partitioning

Equivalence partitioning is the technique of dividing up the domain of possible inputs into equivalence classes. Testing any member of the equivalence class is assumed to be a representative test point for the entire class.

We divide the UTF-8 inputs up into the equivalence classes based on the following ideas.

Correct instances of each length are each assumed to form an equivalence class.
Two-byte sequences with an illegal prefix byte 0xC0 or 0xC1 together with a legal suffix byte form an equivalence class., as do four-byte sequences with an illegal prefix byte 0xF5 through 0xFF (2 classes)
For each of the prefix bytes 0xE0, 0xED, 0xF0, and 0xF4, an otherwise correct three or four byte sequence that violates the constraint for the first suffix byte forms a class. (4 classes).
For each prefix length, a sequence with a non-suffix byte in each possible position forms an equivalence class. (6 classes).
A bare suffix byte forms an equivalence class.
Equivalence classes are defined for each possible sequence length that is too short for its prefix (6 classes).

Test Automation: Random Value Generation

Given the equivalence classes above, we can automate the generation of test sequences. A Python implementation of generating a random element from each erroneous and each incomplete UTF-8 sequence is given by the functions gen_UTF8_error_sequences and gen_UTF8_incomplete_sequences, respectively in the u8u16_testgen.py script.

Adding White-Box Boundary Value Analysis

Boundary-Value Analysis often looks at boundary cases associated with equivalence classes as ones that may need particular testing. For black-box testing, these boundaries are related to the natural boundaries of the problem specification.

However, we sometimes perform boundary-value analysis based on boundaries involving in the actual code being tested. This is a white-box testing technique. These boundaries may relate to boundaries of internal data structures, such as the size of particular buffers.

Because the UTF-8 to UTF-16 implementations being tested were based on parallel techniques with potential block and boundary value problems, it was desired to insert instances of each equivalence class at randomly chosen positions reflecting boundary problems.

At the beginning of the file.
Near the beginning of the file (positions 1 to 4).
Near the data word boundary (positions 29 to 31).
Near the 64-bit boundary (positions 60 to 64).
Near the SSE 128-bit boundary (positions 124 to 128).
After the first 128-bit block, but before the 2K buffer boundary.
Near 2K boundary (positions 2046, 2047).
Beyond the buffer boundary.

These categories were implemented as the prefix_groups list in the script.

Similarly, for each type of error at each boundary class, it was chosen to generate different length sequences of correct input past the error position in the following 4 classes.

No suffix.
One to 4 bytes of additional input.
Up to a full block of additional input (5 to 128 bytes).
Substantial further input (500 to 5000 bytes).

These categories were implemented as the suffix_groups list in the script.

When black-box and white-box techniques are combined together, the result is sometimes called grey-box testing.

Test Data Generation

Using the equivalence classes defined above, a test data generator was then implemented to produce test files with examples of erroneous input and incomplete input. In each case, a sample input file was constructed, together with an expected output file and an expected message file according to the conventions of a driver program written for the purpose.

Test Execution Harness and Oracle

With the expected output and error message files available, a simple shell script was written to perform the tests and produce test output and test message directory in parallel to the expected output and expected messages directory. A simple shell script was used and the standard diff invocation implemented the test oracle.

run_all script

Updated Tue Sept. 27 2016, 17:40 by cameron.

Simon Fraser University
Engaging the World

CourSys