Unicode Statistics
Unicode Statistics Program
This project is to create a command-line tool, ustats, that analyzes one or more files to produce statistics about the Unicode characters found in those files.
Unicode Property Statistics
The ustats program should allow users to collect statistics associated with various Unicode properties. In general, Unicode properties are properties of Unicode characters. For example, all Unicode characters have the Script property, which defines the linguistic script associated with a character, such as Greek, Cyrillic, Arabic or Common (characters used in many scripts). The command
ustats -prop=Script file1 file2 file3
should thus collect statistics about the Script property of every character in the input files. The output should be a table with one row for each Script value and one column for each input file, giving the number of characters with that property value found in the file.
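The shape of that table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Parabix implementation: Python's standard unicodedata module does not expose the Script property, so General_Category is used here as a stand-in enumerated property, and the function names property_counts and property_table are hypothetical.

```python
from collections import Counter
import unicodedata


def property_counts(text):
    """Count characters by property value.

    Assumption: General_Category stands in for Script, because the
    Python standard library has no Script lookup.
    """
    return Counter(unicodedata.category(ch) for ch in text)


def property_table(texts):
    """Build the table described above.

    texts maps a file label to its decoded contents. The result is
    (labels, rows): one row per property value seen in any input,
    one count per input file, in label order.
    """
    per_file = {label: property_counts(text) for label, text in texts.items()}
    labels = list(texts)
    values = sorted(set().union(*per_file.values()))
    rows = [(value, [per_file[label][value] for label in labels]) for value in values]
    return labels, rows
```

In a real tool the texts would come from the files named on the command line, for example by reading each one with open(path, encoding="utf-8"), and the property lookup would go through the Parabix property database rather than unicodedata.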
Unicode Standard Annex #44: Unicode Character Database is the definitive source describing various Unicode properties.
The Parabix system has a full database of Unicode properties built in.
Subclasses of the PropertyObject class are set up for different kinds of properties, such as enumerated properties, string properties and binary properties. The ustats program need only work with the properties as implemented within the Parabix framework.
Explicit Character Classes
The ustats program should also support an option that allows explicit character class expressions. For example, the command
ustats --cc=[a-z]
could print out a table with one row for each of the lowercase letters, showing the number of occurrences of each letter found in the input files.
Unicode character classes should support the features of Unicode Regular Expressions using the syntax supported by icgrep.
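A rough sketch of the per-member counting, assuming Python's re module as a stand-in for the icgrep class syntax (re does not support the full Unicode Regular Expressions feature set, such as property expressions). The helper names class_members and class_counts are hypothetical, and members are found by brute force over the code point range, which is simple but slow for the full range.

```python
import re
from collections import Counter


def class_members(cc, max_codepoint=0x10FFFF):
    """List every character up to max_codepoint that the class matches.

    Assumption: cc is a single character class in Python re syntax,
    not icgrep syntax.
    """
    pattern = re.compile(cc)
    return [chr(cp) for cp in range(max_codepoint + 1) if pattern.fullmatch(chr(cp))]


def class_counts(cc, texts, max_codepoint=0x10FFFF):
    """Return one row per class member: (character, [count per input text]).

    Members that never occur still get a row, with zero counts.
    """
    members = class_members(cc, max_codepoint)
    per_file = [Counter(text) for text in texts]
    return [(ch, [counts[ch] for counts in per_file]) for ch in members]
```

Enumerating the members up front is what gives a row to every letter in the class, including letters that do not appear in any input file.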
When printing table rows for members of a character class, the first column should be just the literal character, by default. However, the option --char-display=codepoint should instead display the character using its hexadecimal codepoint value, while the option --char-display=name should use the full Unicode name of the character instead.
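The three display modes could be handled by a small helper such as the one below. It is a sketch only: the function name display_char is hypothetical, the U+XXXX notation is one assumed way of writing the hexadecimal codepoint value, and names are taken from Python's unicodedata module rather than from Parabix.

```python
import unicodedata


def display_char(ch, mode="literal"):
    """Render a character for the first table column.

    mode mirrors the --char-display values: "codepoint", "name",
    or anything else for the default literal character.
    """
    if mode == "codepoint":
        # Assumed notation: U+ followed by at least four hex digits.
        return f"U+{ord(ch):04X}"
    if mode == "name":
        # Some code points (e.g. controls) have no name; fall back to a placeholder.
        return unicodedata.name(ch, f"<unnamed U+{ord(ch):04X}>")
    return ch
```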
Output Format
The standard output format for ustats should be a plain text table with one line per table row and data presented in justified columns (so that the output is visually aligned when displayed using a monospace font).
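One way to produce such justified columns is to compute the widest cell in each column and pad every cell to that width. The sketch below assumes a left-justified first column (the property value or character) and right-justified count columns; the function name format_table is hypothetical. It measures width with len(), which is only an approximation for wide or combining characters.

```python
def format_table(header, rows):
    """Format a header and data rows as an aligned plain text table."""
    table = [[str(cell) for cell in header]]
    table += [[str(cell) for cell in row] for row in rows]
    # Width of each column is the length of its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    lines = []
    for row in table:
        cells = [row[0].ljust(widths[0])]
        cells += [cell.rjust(width) for cell, width in zip(row[1:], widths[1:])]
        lines.append("  ".join(cells))
    return "\n".join(lines)
```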
However, other output options should also be defined, such as the CSV (comma-separated values) format that can be read into spreadsheet programs.
Other format options could include HTML, XML, JSON or Markdown.
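For the CSV option, the same header and rows can be passed through a CSV writer, which takes care of quoting cells that contain commas or quotes (relevant when the first column holds literal characters). This is a minimal sketch using Python's standard csv module; the function name to_csv is hypothetical.

```python
import csv
import io


def to_csv(header, rows):
    """Serialize a header and data rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping the table as a plain header-plus-rows structure until the last step means each additional format (HTML, XML, JSON or Markdown) only needs its own small serializer.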