Case Study: UTF-8 to UTF-16 Transcoder Testing
We examine the problem of testing a UTF-8 to UTF-16 validating transcoder to illustrate the following concepts in testing.
- Functional Testing (Black-Box)
- Equivalence Partitioning
- Boundary-Value Analysis
- Automated Test-Case Generation
- Automated Test Execution
- Automated Test Oracle
As this approach combines the black-box functional test case generation together with some white-box analysis of important test categories and choices, it is an example of grey-box testing.
This case study is illustrated by the test case generator and execution scripts in the http://u8u16.costar.sfu.ca/browser/QA/ directory.
Problem Statement: UTF-8 to UTF-16 Validation and Conversion
The problem is to convert valid, complete sequences of UTF-8 code units into their corresponding representation. UTF-8 is a variable length representation of Unicode of 1 to 4 bytes. UTF-16 is variable-length representation of 1 or 2 doublebytes. The following table shows conversion ranges of individual character code points.
Unicode | UTF-8 | UTF-16 |
---|---|---|
Codepoint | Byte Sequence | Doublebyte Sequence |
00-0x7F | [0x00-0x7F ] | [0x0000-0x007F ] |
0x80-0x7FF | [0xC2-0xDF ] [0x80-0xBF ] | [0x0080-0x07FF ] |
0x800-0xFFFF | [0xE0-0xEF ] [0x80-0xBF ] [0x80-0xBF ] | [0x07FF-0xFFFF] |
0x10000-0x10FFFF | [0xF0-0xF4 ] [0x80-0xBF] [0x80-0xBF ] [0x80-0xBF ] | [0xD800-0xDBFF ] [0xDC00-0xDFFF ] |
Valid and Invalid UTF-8 Input Sequences
Some byte sequences do not represent valid UTF-8.
- The prefix bytes
0xC0
and0xC1
are illegal. - The prefix bytes
0xF5
through0xFF
are illegal. - Any prefix byte in the range
0xC2
through0xDF
must be followed by exactly one suffix byte in the range0x80
through0xBF
. - Any prefix byte in the range
0xE0
through0xEF
must be followed by exactly two suffix bytes in the range0x80
through0xBF
. - Any prefix byte in the range
0xF0
through0xF4
must be followed by exactly three suffix bytes in the range0x80
through0xBF
. - A prefix byte
0xE0
requires that the range of the immediately next suffix byte be limited to the range0xA0
through0xBF
. - A prefix byte
0xED
requires that the range of the immediately next suffix byte be limited to the range0x80
through0x9F
. - A prefix byte
0xF0
requires that the range of the immediately next suffix byte be limited to the range0x90
through0xBF
. - A prefix byte
0xF4
requires that the range of the immediately next suffix byte be limited to the range0x80
through0x8F
. - A suffix byte standing alone or beyond the specified number of suffix bytes for a given prefix byte is illegal.
Based on the requirements of UTF-8 to UTF-16 conversion, we take a functional testing approach, also known as black-box testing. This is in contrast to a structural testing approach in which we can see inside the box, often called white-box (or, sometimes, glass-box) testing.
Equivalence Partitioning
Equivalence partitioning is the technique of dividing up the domain of possible inputs into equivalence classes. Testing any member of the equivalence class is assumed to be a representative test point for the entire class.
We divide the UTF-8 inputs up into the equivalence classes based on the following ideas.
- Correct instances of each length are each assumed to form an equivalence class.
- Two-byte sequences with an illegal prefix byte
0xC0
or0xC1
together with a legal suffix byte form an equivalence class., as do four-byte sequences with an illegal prefix byte0xF5
through0xFF
(2 classes) - For each of the prefix bytes
0xE0
,0xED
,0xF0
, and0xF4
, an otherwise correct three or four byte sequence that violates the constraint for the first suffix byte forms a class. (4 classes). - For each prefix length, a sequence with a non-suffix byte in each possible position forms an equivalence class. (6 classes).
- A bare suffix byte forms an equivalence class.
- Equivalence classes are defined for each possible sequence length that is too short for its prefix (6 classes).
Test Automation: Random Value Generation
Given the equivalence classes above, we can automate the generation of test sequences. A Python implementation of generating a random element from each erroneous and each incomplete UTF-8 sequence is given by the functions gen_UTF8_error_sequences and gen_UTF8_incomplete_sequences, respectively in the u8u16_testgen.py script.
Adding White-Box Boundary Value Analysis
Boundary-Value Analysis often looks at boundary cases associated with equivalence classes as ones that may need particular testing. For black-box testing, these boundaries are related to the natural boundaries of the problem specification.
However, we sometimes perform boundary-value analysis based on boundaries involving in the actual code being tested. This is a white-box testing technique. These boundaries may relate to boundaries of internal data structures, such as the size of particular buffers.
Because the UTF-8 to UTF-16 implementations being tested were based on parallel techniques with potential block and boundary value problems, it was desired to insert instances of each equivalence class at randomly chosen positions reflecting boundary problems.
- At the beginning of the file.
- Near the beginning of the file (positions 1 to 4).
- Near the data word boundary (positions 29 to 31).
- Near the 64-bit boundary (positions 60 to 64).
- Near the SSE 128-bit boundary (positions 124 to 128).
- After the first 128-bit block, but before the 2K buffer boundary.
- Near 2K boundary (positions 2046, 2047).
- Beyond the buffer boundary.
These categories were implemented as the prefix_groups list in the script.
Similarly, for each type of error at each boundary class, it was chosen to generate different length sequences of correct input past the error position in the following 4 classes.
- No suffix.
- One to 4 bytes of additional input.
- Up to a full block of additional input (5 to 128 bytes).
- Substantial further input (500 to 5000 bytes).
These categories were implemented as the suffix_groups list in the script.
When black-box and white-box techniques are combined together, the result is sometimes called grey-box testing.
Test Data Generation
Using the equivalence classes defined above, a test data generator was then implemented to produce test files with examples of erroneous input and incomplete input. In each case, a sample input file was constructed, together with an expected output file and an expected message file according to the conventions of a driver program written for the purpose.
Test Execution Harness and Oracle
With the expected output and error message files available, a simple shell script was written to perform the tests and produce test output and test message directory in parallel to the expected output and expected messages directory. A simple shell script was used and the standard diff
invocation implemented the test oracle.
- run_all script