Parallel Transcoding
Transcoding - introductory Cyrillic to Latin example
Notes and Resources
- This introductory part may be implemented in a brute-force manner as follows.
- For each of the 16 output bits of UTF-16, we define a character class. The members of this character class for bit position K are the Cyrillic input codepoints that whose bit changes at position K. The character class compiler algorithm is then applied.
- Using this method, characters not in the Cyrillic range will automatically be maintained.
- See the transcode_compiler.py for a prototype that addresses a similar transcoding problem.
- The character class compiler knows how to apply many optimizations, so it should be tried first. However, if there are patterns that can be identified that the character class compiler cannot detect, there might be further improvements that can be made.
Fixed vs. Variable-Length Transcoding
A fixed-length transcoding problem is one in which the output code units are in one-to-one correspondence with the input code units. For example, in the Cyrillic to Latin case using UTF-16, there is one UTF-16 output code unit for every UTF-16 input code unit. In this case transcoding using bit parallel methods is straightforward.
- Transform data to 16 parallel bit streams (two sets of 8 bit streams each for the high and low bytes of each code unit).
- Perform logical operations to calculate the corresponding bit streams for the output data.
- Perform inverse transposition to generate a UTF-16 output file that is precisely the same size as the original.
A variable-length transcoding problem is one in which there is not necessarily a one-to-one correspondence between the code units of the input data stream and those of the output data. UTF-8 to UTF-16 is a good example. In this case, input characters in the ASCII range are encoded with one UTF-8 byte and transformed to a single UTF-16 code unit. In the 0x80-0x7FF codepoint range, however, two UTF-8 bytes are used for each UTF-16 code unit. In the 0x800-0xFFFF codepoint range, three UTF-8 bytes are required for each codepoint, which may again be represented as a single UTF-16 code unit. Finally, in the 0x10000-0x10FFFF range, four UTF-8 bytes are needed for characters, producing two UTF-16 code units.
In UTF-8 to UTF-16 transcoding, the number of code units always stays the same or decreases. In this case, the transcoding problem requires the elimination of some code unit positions in the input when the corresponding output codepoint uses fewer code units. Parallel transcoding proceeds as follows.
- Transpose the input data to parallel bit stream form.
- For each instance in which m input code units are transcoded to a single code unit, produce shifted input streams that move bit data from the first m-1 positions forward to the final position of the code unit.
- Perform the calculations of output bit stream data using combinations of the input and shifted bit streams.
- Mark the first m-1 positions of the input code units for deletion.
- Perform a parallel deletion (parallel compress) on the calculated bit stream data so that retained positions are in one-to-one correspondence with required output data.
- Compute the output stream by inverse transposition.
In order for parallel deletion to be implemented efficiently, some of the deletion work may be performed using bitstream calculations, while other work is performed only after inverse transposition.
This process is illustrated in the u8u16
transcoder.