Hangul NFC Assignment

In this assignment, you will use Parabix methods to implement a text transformation that converts decomposed Hangul character sequences (NFD form) into the equivalent precomposed Hangul syllables (NFC form). This is essentially the inverse of the process implemented by the tools/transcoders/hangul-nfd.cpp program.

For reference, the rules for Hangul decomposition and composition are found in Section 3.12 of the Unicode Standard.
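
As a quick reference, the composition arithmetic from Section 3.12 can be sketched in scalar C++. The constants are those given in the standard; the function names (isL, isV, isT, composeLV, composeLVT) are illustrative only and are not part of Parabix or hangul-nfd.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Constants from Section 3.12 of the Unicode Standard.
constexpr uint32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
constexpr uint32_t LCount = 19, VCount = 21, TCount = 28;

// Character class tests for conjoining jamo (illustrative names).
constexpr bool isL(uint32_t cp) { return cp >= LBase && cp < LBase + LCount; }
constexpr bool isV(uint32_t cp) { return cp >= VBase && cp < VBase + VCount; }
// T index 0 means "no trailing consonant", so T jamo start at TBase + 1.
constexpr bool isT(uint32_t cp) { return cp > TBase && cp < TBase + TCount; }

// Compose an <L, V> pair into a precomposed LV syllable.
inline uint32_t composeLV(uint32_t L, uint32_t V) {
    assert(isL(L) && isV(V));
    const uint32_t LIndex = L - LBase;  // 0..18
    const uint32_t VIndex = V - VBase;  // 0..20
    return SBase + (LIndex * VCount + VIndex) * TCount;
}

// Compose an <L, V, T> triple into a precomposed LVT syllable.
inline uint32_t composeLVT(uint32_t L, uint32_t V, uint32_t T) {
    assert(isT(T));
    const uint32_t TIndex = T - TBase;  // 1..27
    return composeLV(L, V) + TIndex;
}
```

For example, composeLVT(0x1112, 0x1161, 0x11AB) gives 0xD55C, the syllable 한.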

The following steps should be useful in carrying out this assignment.

  • Make a copy of hangul-nfd.cpp called hangul-nfc.cpp and place it in the tools/transcoders directory.
  • Edit the tools/transcoders/CMakeLists.txt file and duplicate the hangul-nfd entry as hangul-nfc.
  • Edit the hangul-nfc.cpp file as follows.
    • Keep all the steps of transposing data and computing stream sets up to and including the step to create the U21 stream (21 basis bit streams for Unicode).
    • Create character class bit streams for Hangul L, V and T jamo.
    • Create a set of 5 bit streams to hold L, V and T index values. (The largest index value is the T index 27, so 5 bits suffice.)
    • Create a PabloKernel that computes the correct index values at L, V and T positions and zeroes at all other positions. (Note that the index value at an L position should be the L index value, the index value at a V position should be the V index value and the index value at a T position should be the T index value.)
    • Create a PabloKernel LFT2NFC that calculates the precomposed replacement characters for the <L, V> and <L, V, T> sequences using the calculated index values. The input should be the U21 stream and the output should be a modified U21 stream, with changes only at the positions of recognized <L, V> and <L, V, T> sequences.
      • Using Advance operations, this calculation could be carried out at the V position of an <L, V> sequence or the T position of an <L, V, T> sequence.
      • Alternatively, using Lookahead operations, this calculation could be carried out at the L positions of sequences.
    • Use FilterByMask to determine a final set of 21 basis bit streams by filtering out the unused positions of the <L, V> and <L, V, T> sequences.
    • Transform to UTF-8 basis bit streams by applying the same U21_to_UTF8 kernel as in the hangul-nfd application.
    • Perform the inverse Parabix transform using P2SKernel and generate the output using StdOutKernel.
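
The data flow of the steps above can be modelled on a plain sequence of code points. The following is a minimal scalar sketch, not Parabix code: per-position vectors stand in for the character class and index bit streams, the composed value is written at the L position (the Lookahead variant), and a keep mask plays the role of the FilterByMask step. The namespace hangul_ref and the names jamoIndex and composeHangul are hypothetical, chosen for illustration. A model like this can serve as a reference when checking the output of the kernels.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

namespace hangul_ref {

// Constants from Section 3.12 of the Unicode Standard.
constexpr uint32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
constexpr uint32_t LCount = 19, VCount = 21, TCount = 28;

inline bool isL(char32_t cp) { return cp >= LBase && cp < LBase + LCount; }
inline bool isV(char32_t cp) { return cp >= VBase && cp < VBase + VCount; }
inline bool isT(char32_t cp) { return cp > TBase && cp < TBase + TCount; }

// Index value at a position: L, V or T index as appropriate, zero elsewhere.
// Every result is at most 27, so it fits in 5 bits.
inline uint32_t jamoIndex(char32_t cp) {
    if (isL(cp)) return cp - LBase;
    if (isV(cp)) return cp - VBase;
    if (isT(cp)) return cp - TBase;
    return 0;
}

// Replace <L, V> and <L, V, T> sequences by precomposed syllables.
inline std::u32string composeHangul(const std::u32string & in) {
    const std::size_t n = in.size();
    std::vector<uint32_t> index(n);
    std::vector<bool> keep(n, true);  // stands in for the filter mask
    for (std::size_t i = 0; i < n; ++i) index[i] = jamoIndex(in[i]);

    std::u32string out = in;  // "modified U21": changed only at L positions
    for (std::size_t i = 0; i + 1 < n; ++i) {
        if (!(isL(in[i]) && isV(in[i + 1]))) continue;
        uint32_t s = SBase + (index[i] * VCount + index[i + 1]) * TCount;
        keep[i + 1] = false;  // the V position becomes unused
        if (i + 2 < n && isT(in[i + 2])) {
            s += index[i + 2];
            keep[i + 2] = false;  // the T position becomes unused
        }
        out[i] = static_cast<char32_t>(s);
    }

    // Filter out the unused positions.
    std::u32string result;
    for (std::size_t i = 0; i < n; ++i) {
        if (keep[i]) result.push_back(out[i]);
    }
    return result;
}

}  // namespace hangul_ref
```

Note that this sketch handles only the <L, V> and <L, V, T> sequences named in the assignment; jamo that are not part of such a sequence pass through unchanged.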

Testing

You can use icgrep to create test files.

  • To make a file of all Hangul precomposed LV and LVT characters:
bin/icgrep '\p{hst:lvt}|\p{hst:lv}' ../QA/All_good >hgl-lvlvt
  • To make a smaller file limited to 5 matches (note that the pattern matches both LV and LVT characters):
bin/icgrep '\p{hst:lvt}|\p{hst:lv}' -m=5 ../QA/All_good >hgl-lv5
  • Use hangul-nfd to convert to decomposed form.
  • Use hangul-nfc to convert back to precomposed form.
  • The result of these two steps should give you the original file.
  • You can use the ICU commands uconv -x any-nfd and uconv -x any-nfc to check your results.
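
The round-trip property that these tests rely on can also be checked arithmetically. The sketch below, in scalar C++ and independent of Parabix, decomposes every precomposed syllable with the Section 3.12 formulas and composes it again. The namespace roundtrip and the names Jamo, decompose, compose and countRoundTripFailures are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

namespace roundtrip {

// Constants from Section 3.12 of the Unicode Standard.
constexpr uint32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
constexpr uint32_t LCount = 19, VCount = 21, TCount = 28;
constexpr uint32_t NCount = VCount * TCount;  // 588
constexpr uint32_t SCount = LCount * NCount;  // 11172

// T == 0 is used here to mean "no trailing consonant" (an LV syllable).
struct Jamo { uint32_t L, V, T; };

// Decompose a precomposed syllable into its jamo.
inline Jamo decompose(uint32_t s) {
    assert(s >= SBase && s < SBase + SCount);
    const uint32_t SIndex = s - SBase;
    const uint32_t TIndex = SIndex % TCount;
    return { LBase + SIndex / NCount,
             VBase + (SIndex % NCount) / TCount,
             TIndex == 0 ? 0 : TBase + TIndex };
}

// Compose jamo back into a precomposed syllable.
inline uint32_t compose(const Jamo & j) {
    const uint32_t lv = SBase + ((j.L - LBase) * VCount + (j.V - VBase)) * TCount;
    return j.T == 0 ? lv : lv + (j.T - TBase);
}

// Count the syllables for which decompose-then-compose is not the identity.
inline uint32_t countRoundTripFailures() {
    uint32_t failures = 0;
    for (uint32_t s = SBase; s < SBase + SCount; ++s) {
        if (compose(decompose(s)) != s) ++failures;
    }
    return failures;
}

}  // namespace roundtrip
```

If the formulas are implemented correctly, countRoundTripFailures() returns 0 over all 11172 precomposed syllables.
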
Updated Mon Jan. 20 2025, 17:18 by cameron.