Hangul NFC Assignment
In this assignment, you will use Parabix methods to implement
a text transformation that turns Hangul decomposed character
sequences (NFD form) into the equivalent Hangul precomposed characters.
This is essentially the inverse of the process in the tools/transcoders/hangul-nfd.cpp program.
For reference, the rules for Hangul decomposition and composition are found in Section 3.12 of the Unicode Standard.
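As a scalar reference point, the arithmetic composition rule from Section 3.12 can be sketched in plain C++. This is only a minimal sketch for checking expected results one sequence at a time, not the Parabix implementation the assignment asks for. The helper names (isL, isV, isT, composeHangul) are hypothetical; the constants are the ones defined by the Unicode Standard.

```cpp
#include <cstdint>
#include <optional>

// Constants from the Hangul syllable composition rules (Unicode Standard, Section 3.12).
constexpr uint32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
constexpr uint32_t LCount = 19, VCount = 21, TCount = 28;

// Hypothetical helpers for illustration; they are not part of Parabix.
constexpr bool isL(uint32_t cp) { return cp >= LBase && cp < LBase + LCount; }
constexpr bool isV(uint32_t cp) { return cp >= VBase && cp < VBase + VCount; }
// TBase itself stands for "no trailing consonant", so T jamo start at TBase + 1.
constexpr bool isT(uint32_t cp) { return cp > TBase && cp < TBase + TCount; }

// Compose <L, V> (pass t == 0) or <L, V, T> into a precomposed syllable.
// Returns std::nullopt if the arguments are not a valid L, V, (T) combination.
std::optional<uint32_t> composeHangul(uint32_t l, uint32_t v, uint32_t t = 0) {
    if (!isL(l) || !isV(v) || (t != 0 && !isT(t))) return std::nullopt;
    uint32_t lIndex = l - LBase;                 // 0..18, fits in 5 bits
    uint32_t vIndex = v - VBase;                 // 0..20, fits in 5 bits
    uint32_t tIndex = (t == 0) ? 0 : t - TBase;  // 0..27, fits in 5 bits
    return SBase + (lIndex * VCount + vIndex) * TCount + tIndex;
}
```

For example, composeHangul(0x1112, 0x1161, 0x11AB) yields U+D55C, and composeHangul(0x1100, 0x1161) yields U+AC00. Each of the three index values fits in 5 bits.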
The following steps should be useful in carrying out this assignment.
- Make a copy of hangul-nfd.cpp called hangul-nfc.cpp and place it in the tools/transcoders directory.
- Edit the tools/transcoders/CMakeLists.txt file and duplicate the hangul-nfd entry as hangul-nfc.
- Edit the hangul-nfc.cpp file as follows.
  - Keep all the steps of transposing data and computing stream sets up to and including the step to create the U21 stream (21 basis bit streams for Unicode).
  - Create character class bit streams for Hangul L, V and T syllables.
  - Create a set of 5 bit streams to hold L, V and T index values.
  - Create a PabloKernel that computes the correct index values at L, V and T positions, and having zeroes at all other positions. (Note that the index value at L positions should be the L index value, the index value at the V position should be the V index value and the index value at the T position should be the T index value).
  - Create a PabloKernel LFT2NFC that calculates the precomposed replacement characters for the <L, V> and <L, V, T> sequences using the calculated index values. The input should be the U21 stream and the output should be a modified U21 stream, with changes only at the positions of recognized <L, V> and <L, V, T> sequences.
    - Using Advance operations, this calculation could be carried out at the V position of a <L, V> sequence or the T position of a <L, V, T> sequence.
    - Alternatively, using Lookahead operations, this calculation could be carried out at the L positions of sequences.
  - Use FilterByMask to determine a final set of 21 basis bit streams by filtering out the unused positions of the <L, V> and <L, V, T> sequences.
  - Transform to UTF8 basis bit streams by applying the same U21_to_UTF8 kernel as in the Hangul-NFD application.
  - Perform the inverse Parabix transform using P2SKernel and generate the output using StdOutKernel.
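The marker logic behind the Advance-based variant of these steps can be sketched with a toy model in which a single 64-bit word stands in for one bit stream. This is an illustration of the idea only, under stated assumptions: the names Stream, advance, lookahead, Markers and findSequences are hypothetical and are not the Parabix or Pablo API, and the model handles at most 64 code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: one word is one bit stream; bit i describes position i.
using Stream = uint64_t;

// Models a forward shift: every marker moves n positions later.
inline Stream advance(Stream s, unsigned n) { return s << n; }
// Models looking ahead: position i receives the bit from position i + n.
inline Stream lookahead(Stream s, unsigned n) { return s >> n; }

struct Markers {
    Stream lvAtV;   // V positions that complete an <L, V> pair
    Stream lvtAtT;  // T positions that complete an <L, V, T> triple
    Stream keep;    // positions that survive filtering
};

Markers findSequences(const std::vector<uint32_t>& cps) {
    Stream L = 0, V = 0, T = 0, all = 0;
    // Character class streams, built one position at a time for the toy model.
    for (std::size_t i = 0; i < cps.size() && i < 64; ++i) {
        const uint32_t c = cps[i];
        const Stream bit = Stream{1} << i;
        all |= bit;
        if (c >= 0x1100 && c <= 0x1112) L |= bit;       // 19 leading consonants
        else if (c >= 0x1161 && c <= 0x1175) V |= bit;  // 21 vowels
        else if (c >= 0x11A8 && c <= 0x11C2) T |= bit;  // 27 trailing consonants
    }
    Markers m{};
    m.lvAtV = advance(L, 1) & V;         // a V directly after an L
    m.lvtAtT = advance(m.lvAtV, 1) & T;  // a T directly after such a V
    // The result is written at the last position of each sequence, so the
    // L of every <L, V> and the V of every <L, V, T> become unused.
    const Stream unused = lookahead(m.lvAtV, 1) | lookahead(m.lvtAtT, 1);
    m.keep = all & ~unused;
    return m;
}
```

For the sequence <U+1112, U+1161, U+11AB, U+0041, U+1100, U+1161>, the keep mask in this model retains positions 2, 3 and 5. If the composition is instead carried out at L positions, the unused positions are the marked V and T positions themselves, that is, lvAtV | lvtAtT.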
Testing
You can use icgrep to create test files.
- To make a file of all Hangul precomposed LV and LVT characters.
bin/icgrep '\p{hst:lvt}|\p{hst:lv}' ../QA/All_good >hgl-lvlvt
- To make a smaller test file, limited to the first 5 matching lines.
bin/icgrep '\p{hst:lvt}|\p{hst:lv}' -m=5 ../QA/All_good >hgl-lv5
- Use hangul-nfd to convert to decomposed form.
- Use hangul-nfc to convert back to precomposed form.
- The result of these two steps should give you the original file.
- You can use uconv -x any-nfd and uconv -x any-nfc to check.
Updated Mon Jan. 20 2025, 17:18 by cameron.