Hangul NFC Assignment
In this assignment, you will use Parabix methods to implement
a text transformation that turns Hangul decomposed character
sequences (NFD form) into the equivalent Hangul precomposed characters.
This is essentially the inverse of the process in the tools/transcoders/hangul-nfd.cpp
program.
For reference, the rules for Hangul decomposition and composition are found in Section 3.12 of the Unicode Standard.
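The composition arithmetic from Section 3.12 is compact enough to state as scalar code. The sketch below is only a reference model for checking results, not Parabix code; the constants are those given in the standard, while the helper names composeHangul, decomposeHangul and HangulIndices are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Constants from Section 3.12 of the Unicode Standard.
constexpr uint32_t SBase = 0xAC00;  // first precomposed syllable
constexpr uint32_t LBase = 0x1100;  // first leading consonant
constexpr uint32_t VBase = 0x1161;  // first vowel
constexpr uint32_t TBase = 0x11A7;  // one before the first trailing consonant
constexpr uint32_t LCount = 19, VCount = 21, TCount = 28;

// Compose a syllable from index values (hypothetical helper name).
// tIndex == 0 means "no trailing consonant", i.e. an <L, V> sequence.
inline uint32_t composeHangul(uint32_t lIndex, uint32_t vIndex, uint32_t tIndex = 0) {
    assert(lIndex < LCount && vIndex < VCount && tIndex < TCount);
    return SBase + (lIndex * VCount + vIndex) * TCount + tIndex;
}

// Inverse mapping: recover the three index values from a precomposed syllable.
struct HangulIndices { uint32_t l, v, t; };

inline HangulIndices decomposeHangul(uint32_t syllable) {
    assert(syllable >= SBase && syllable < SBase + LCount * VCount * TCount);
    const uint32_t sIndex = syllable - SBase;
    return { sIndex / (VCount * TCount), (sIndex / TCount) % VCount, sIndex % TCount };
}
```

For example, the sequence <U+1111, U+1171, U+11B6> has index values 17, 16 and 15, and composeHangul(17, 16, 15) yields U+D4DB.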
The following steps should be useful in carrying out this assignment.
- Make a copy of hangul-nfd.cpp called hangul-nfc.cpp and place it in the tools/transcoders directory.
- Edit the tools/transcoders/CMakeLists.txt file and duplicate the hangul-nfd entry as hangul-nfc.
- Edit the hangul-nfc.cpp file as follows.
  - Keep all the steps of transposing data and computing stream sets up to and including the step to create the U21 stream (21 basis bit streams for Unicode).
  - Create character class bit streams for Hangul L, V and T jamo (leading consonants, vowels and trailing consonants).
  - Create a set of 5 bit streams to hold L, V and T index values. Five bits suffice because every index is below 32: L indices run from 0 to 18, V indices from 0 to 20 and T indices from 1 to 27.
  - Create a PabloKernel that computes the correct index values at L, V and T positions, with zeroes at all other positions. (The index value at an L position should be the L index value, at a V position the V index value, and at a T position the T index value.)
  - Create a PabloKernel LFT2NFC that calculates the precomposed replacement characters for the <L, V> and <L, V, T> sequences using the calculated index values. The input should be the U21 stream and the output should be a modified U21 stream, with changes only at the positions of recognized <L, V> and <L, V, T> sequences.
    - Using Advance operations, this calculation could be carried out at the V position of an <L, V> sequence or the T position of an <L, V, T> sequence.
    - Alternatively, using Lookahead operations, this calculation could be carried out at the L positions of sequences.
  - Use FilterByMask to determine a final set of 21 basis bit streams by filtering out the unused positions of the <L, V> and <L, V, T> sequences.
  - Transform to UTF8 basis bit streams by applying the same U21_to_UTF8 kernel as in the Hangul-NFD application.
  - Perform the inverse Parabix transform using P2SKernel and generate the output using StdOutKernel.
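The stream steps above can be modelled position by position in plain C++. The following is a minimal scalar sketch of the Lookahead variant, in which the result is written at the L position and the V and T positions are then filtered out. It is not Parabix or Pablo code. The function names isL, isV, isT, indexValue and composeSequence are hypothetical, and the character classes are assumed to be restricted to the composable jamo ranges of Section 3.12 (U+1100 to U+1112, U+1161 to U+1175, U+11A8 to U+11C2).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Character classes, assumed here to cover only the composable jamo ranges.
inline bool isL(uint32_t cp) { return cp >= 0x1100 && cp <= 0x1112; }
inline bool isV(uint32_t cp) { return cp >= 0x1161 && cp <= 0x1175; }
inline bool isT(uint32_t cp) { return cp >= 0x11A8 && cp <= 0x11C2; }

// The 5-bit index value at one position: the L, V or T index, zero elsewhere.
inline uint32_t indexValue(uint32_t cp) {
    if (isL(cp)) return cp - 0x1100;
    if (isV(cp)) return cp - 0x1161;
    if (isT(cp)) return cp - 0x11A7;  // T indices start at 1
    return 0;
}

// Scalar model of the Lookahead variant followed by mask filtering.
inline std::vector<uint32_t> composeSequence(const std::vector<uint32_t>& in) {
    const std::size_t n = in.size();
    std::vector<uint32_t> out(in);      // modified code point stream
    std::vector<bool> keep(n, true);    // filter mask: false marks unused positions
    for (std::size_t i = 0; i + 1 < n; ++i) {
        if (!isL(in[i]) || !isV(in[i + 1])) continue;
        const bool hasT = i + 2 < n && isT(in[i + 2]);
        const uint32_t t = hasT ? indexValue(in[i + 2]) : 0;
        // Write the precomposed syllable at the L position.
        out[i] = 0xAC00 + (indexValue(in[i]) * 21 + indexValue(in[i + 1])) * 28 + t;
        assert(out[i] >= 0xAC00 && out[i] <= 0xD7A3);
        keep[i + 1] = false;            // V position is no longer needed
        if (hasT) keep[i + 2] = false;  // nor is the T position
    }
    std::vector<uint32_t> filtered;
    for (std::size_t i = 0; i < n; ++i) {
        if (keep[i]) filtered.push_back(out[i]);
    }
    return filtered;
}
```

A model like this can serve as an oracle when debugging the kernels: the three arrays correspond to the modified U21 stream, the filter mask and the filtered output.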
Testing
You can use icgrep to create test files.
- To make a file of all Hangul precomposed LV and LVT characters.
bin/icgrep '\p{hst:lvt}|\p{hst:lv}' ../QA/All_good >hgl-lvlvt
- To make a file of 5 LV or LVT characters.
bin/icgrep '\p{hst:lvt}|\p{hst:lv}' -m=5 ../QA/All_good >hgl-lv5
- Use hangul-nfd to convert to decomposed form.
- Use hangul-nfc to convert back to precomposed form.
- The result of these two steps should give you the original file.
- You can use uconv -x any-nfd and uconv -x any-nfc to check.
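Small hand-made test cases can also be generated directly, which may help when a single failing syllable needs to be isolated. The sketch below builds matching decomposed and precomposed UTF-8 lines from index values. It is only an illustration: the helper names utf8ThreeByte, decomposedLine and precomposedLine are hypothetical, and the encoder handles only the three-byte UTF-8 range, which covers all Hangul jamo and syllables.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode a code point in U+0800..U+FFFF (excluding surrogates) as three UTF-8 bytes.
inline std::string utf8ThreeByte(uint32_t cp) {
    assert(cp >= 0x0800 && cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string s;
    s += static_cast<char>(0xE0 | (cp >> 12));
    s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char>(0x80 | (cp & 0x3F));
    return s;
}

// One decomposed sequence per line; tIndex == 0 produces an <L, V> sequence.
inline std::string decomposedLine(uint32_t lIndex, uint32_t vIndex, uint32_t tIndex) {
    assert(lIndex < 19 && vIndex < 21 && tIndex < 28);
    std::string line = utf8ThreeByte(0x1100 + lIndex) + utf8ThreeByte(0x1161 + vIndex);
    if (tIndex != 0) line += utf8ThreeByte(0x11A7 + tIndex);
    return line + "\n";
}

// The precomposed line expected after composition of the same index values.
inline std::string precomposedLine(uint32_t lIndex, uint32_t vIndex, uint32_t tIndex) {
    assert(lIndex < 19 && vIndex < 21 && tIndex < 28);
    return utf8ThreeByte(0xAC00 + (lIndex * 21 + vIndex) * 28 + tIndex) + "\n";
}
```

Writing decomposedLine output to a file and running hangul-nfc on it should then reproduce the corresponding precomposedLine output.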
Updated Mon Jan. 20 2025, 17:18 by cameron.