Unicode Statistics

Unicode Statistics Program

This project is to create a command-line tool, ustats, that analyzes one or more files to produce statistics about the Unicode characters found in those files.

Unicode Property Statistics

The ustats program should allow users to collect statistics associated with various Unicode properties, that is, attributes that the Unicode Standard assigns to characters. For example, every Unicode character has a Script property, which identifies the script the character belongs to, such as Greek, Cyrillic, Arabic or Common (characters used in many scripts). The command

ustats -prop=Script file1 file2 file3

should thus collect statistics about the Script property of every character in the input files. The output should be a table with one row for each Script value and one column for each input file, giving the number of characters in that file that have that value.
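
The table layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Parabix implementation: Python's standard unicodedata module does not expose the Script property, so General_Category (another enumerated property) stands in for it, and the names property_counts, property_table and sample are hypothetical.

```python
import unicodedata
from collections import Counter


def property_counts(text, prop=unicodedata.category):
    """Count how many characters in `text` carry each value of a property.

    `prop` maps one character to its property value. General_Category is
    used here as a stand-in for Script (an assumption of this sketch).
    """
    return Counter(prop(ch) for ch in text)


def property_table(texts, prop=unicodedata.category):
    """Return rows of (value, count_in_first_input, count_in_second_input, ...).

    `texts` maps a label, such as a file name, to its decoded contents.
    """
    per_file = {label: property_counts(text, prop) for label, text in texts.items()}
    values = sorted(set().union(*per_file.values()))
    # A Counter returns 0 for missing keys, so every row has one cell per input.
    return [(v, *(per_file[label][v] for label in texts)) for v in values]


# Two short in-memory strings stand in for the contents of input files.
sample = {"file1": "abc αβ", "file2": "Hello, мир!"}
for row in property_table(sample):
    print(row)
```

A property value that never occurs in a given input still gets a cell, with a count of 0, so every row has the same number of columns.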

Unicode Standard Annex #44: Unicode Character Database is the definitive source describing the various Unicode properties.

The Parabix system has a full database of Unicode properties built in. Subclasses of the PropertyObject class are set up for different kinds of properties, such as enumerated properties, string properties and binary properties. The ustats program need only work with the properties as implemented within the Parabix framework.

Explicit Character Classes

The ustats program should also support an option for explicit character class expressions. For example, the command

ustats --cc=[a-z]

could print a table with one row for each of the lowercase letters, showing the number of occurrences of each letter found in the input files.
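
As a rough sketch of per-member counting, the following Python uses the standard re module as a stand-in for the icgrep class syntax, which is an assumption: the two syntaxes agree on simple classes such as [a-z] but not in general. The function name class_member_counts is hypothetical.

```python
import re
from collections import Counter


def class_member_counts(cc_expr, text):
    """Count occurrences of each character in `text` that `cc_expr` matches.

    `cc_expr` is a bracketed class such as "[a-z]". Python's `re` syntax is
    assumed here in place of the icgrep syntax.
    """
    matcher = re.compile(cc_expr)
    return Counter(ch for ch in text if matcher.fullmatch(ch))


counts = class_member_counts("[a-z]", "Hello, World")
for ch in sorted(counts):
    print(ch, counts[ch])
```

This sketch lists only the class members that actually occur in the text. Producing a row for every member of the class, including those with a count of zero, would additionally require enumerating the class.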

Character class expressions should support the features of Unicode Regular Expressions, using the syntax supported by icgrep.

When printing table rows for members of a character class, the first column should, by default, be just the literal character. The option --char-display=codepoint should instead display the character's hexadecimal codepoint value, while the option --char-display=name should display the full Unicode name of the character.
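
The three display modes might be sketched as follows in Python. The function name display_char, the mode name "literal", the U+XXXX rendering of the codepoint, and the placeholder used for characters without a name are all assumptions of this sketch; the text above only asks for a hexadecimal value and the full Unicode name.

```python
import unicodedata


def display_char(ch, mode="literal"):
    """Render one character for the first column of a table row."""
    if mode == "literal":
        return ch
    if mode == "codepoint":
        # Hexadecimal codepoint, zero-padded to at least four digits.
        return f"U+{ord(ch):04X}"
    if mode == "name":
        # Some characters (for example control characters) have no name,
        # so fall back to a placeholder instead of raising ValueError.
        return unicodedata.name(ch, f"<unnamed U+{ord(ch):04X}>")
    raise ValueError(f"unknown display mode: {mode}")


for mode in ("literal", "codepoint", "name"):
    print(mode, display_char("α", mode))
```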

Output Format

The standard output format for ustats should be a plain text table with one line per table row and data presented in justified columns (so that the output is visually aligned when displayed in a monospace font). However, other output options should also be defined, such as the CSV (comma-separated values) format that can be read into spreadsheet programs. Other format options could include HTML, XML, JSON or Markdown.
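
To make the two main formats concrete, here is a minimal Python sketch that renders the same table as aligned plain text and as CSV. The function names and the sample counts are made up for illustration, and the choice to left-justify the first column and right-justify the counts is an assumption, since the text above only asks for justified columns.

```python
import csv
import io


def format_plain(header, rows):
    """Render a table as text: first column left-justified, counts right-justified."""
    table = [[str(cell) for cell in row] for row in [header, *rows]]
    # Column width is measured in code points, which matches display width
    # only for characters that occupy a single monospace cell.
    widths = [max(len(r[i]) for r in table) for i in range(len(header))]
    lines = []
    for r in table:
        first = r[0].ljust(widths[0])
        rest = [cell.rjust(w) for cell, w in zip(r[1:], widths[1:])]
        lines.append("  ".join([first, *rest]))
    return "\n".join(lines)


def format_csv(header, rows):
    """Render the same table as CSV text using the standard csv module."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample counts, for illustration only.
header = ["Script", "file1", "file2"]
rows = [("Greek", 12, 0), ("Latin", 340, 1025)]
print(format_plain(header, rows))
print(format_csv(header, rows), end="")
```

Using the csv module rather than joining fields with commas means values that themselves contain commas or quotes, such as a literal "," in a character column, are quoted correctly.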

Updated Sun Oct. 03 2021, 12:41 by cameron.
