The following represent proposed requirements on collation for Korean text in ISO/IEC 14651 and in the UCA (UTS #10: Unicode Collation Algorithm). The definitions, requirements, and desired features are numbered for reference.
The following abbreviations are used:
L | any leading Jamo |
V | any vowel Jamo |
T | any trailing Jamo |
Lv | any composite Hangul syllable character in the range U+AC00..U+D7A3 that is canonically equivalent to a sequence L + V |
Lvt | any composite Hangul syllable character in the range U+AC00..U+D7A3 that is canonically equivalent to a sequence L + V + T |
S | any standard Korean syllable-block (whether a composite character or a sequence of jamo or a mixture) |
X | any character that is not of the above |
In examples, subscripts will be used to indication specific letters, such as L1 vs L2.
Abbreviated Examples. We give examples below such as "V1 < V2". This is the common way to cite the ordering among these characters. However, these examples are really abbreviations, since the V's alone do not form complete syllables. What such examples stand for is really "xV1 < xV2", where x is the identical string on both sides, and contains characters such that the string forms a standard Korean syllable-block. It would make the exposition somewhat clumsier to always spell this out, which is why the variable x and the explanation of its contents are omitted in such examples.
D1. Cluster. A sequence consisting of one or more L's, one or more V's, or one or more T's is called a cluster. A cluster of all L's is a leading cluster; similarly there are vowel clusters and trailing clusters. Thus L1L2 is a leading cluster, and V1 is a vowel cluster, but L1L2V1 is not a cluster; it is a sequence of a leading cluster (L1L2) followed by a vowel cluster (V1).
D2. Standard Korean syllable-block. This is a sequence of letters that consists of a leading cluster followed by a vowel cluster, optionally then followed by a trailing cluster. In regular expression syntax, this is of the form: L+ V+ T*. It also includes any canonically equivalent sequence. For details on canonical equivalence, see the Unicode Standard, Chapter 3. Examples:
S1 = L1L2V1V2
S2 = L1V1T1T2
S3 = Lvt1
S4 = Lvt1T1
D3. Existing Syllables are those that actually occur in modern or ancient practice, which is a small subset of all theoretically possible syllables. The existing syllables include all syllables encoded as the modern composite 11,172 Hangul Syllables, plus documented ancient syllables. See Desired Features.
D4. CTT. In this document, this refers to the current weighting table for UCA and the current Common Tailorable Template Table (CTT) for ISO/IEC 14651, which are kept in synchronization by the respective organizations.
Note: Contractions, Thai reversal, and canonical equivalence are the only mechanisms that are stateful in the UCA architecture; otherwise each character is simply mapped to a sequence of weights without regard to context. ISO/IEC 14651 allows for implementations to use canonical equivalence but does not require it. It is silent on the issue of Thai reversal.
Note: There may be intermediate versions of the UCA for which there is no corresponding ISO/IEC 14651 table, because implementers require tables that are kept up to date with the latest version of Unicode and ISO/IEC 10646.
R1. Tailorability. It must be possible to tailor the UCA so as to meet the following requirements for Korean.
R2. Canonical Equivalence. Whenever text is canonically equivalent, it must sort the same. In particular, one string made up of HANGUL SYLLABLE characters will sort the same as a string made up of the equivalent JAMO characters. Example: the following sequences must sort the same:
<U+AC01 (각) HANGUL SYLLABLE GAG>
<U+1100 (ᄀ) HANGUL CHOSEONG KIYEOK, U+1161 (ᅡ) HANGUL JUNGSEONG A, U+11A8 (ᆨ) HANGUL JONGSEONG KIYEOK>
R3. Performance/Size. Whatever approach is taken, it must be possible to sort strings containing the composite 11,172 Hangul Syllables (which form the vast majority of Korean text) without undo performance/storage costs. In particular, when restricted to such text, sort keys must be reasonably short and incremental comparison reasonably fast, in comparison to current good implementations of Korean sorting. A reasonable limit is no more than about 3 times worse. See Performance.
R4. Existing Korean Syllable-Blocks. It is only important that existing standard Korean syllable-blocks and equivalent text sorts correctly, as long as for other characters the ordering is still determinate. That is:
R3a. There is no requirement on "degenerate" syllables (such as L1T1).
R3b. It is only a requirement to sort existing syllables, not all theoretically possible syllables. See Desired Features. It must, however, be possible to tailor additional syllables.
R5. Syllables as Primary Units. Standard Korean syllable-blocks sort as primary units, meaning that if two syllables are different, no further characters in the string make any difference (in ordering). Examples:
if S1 < S2, then S1x < S2y, where x and y are any other characters that would not extend the syllables.
R6. Clusters as Primary Units. Within a standard Korean syllable-block, elements of a cluster sort as primary units, meaning that if two such elements are different, no further characters in the syllable make any difference (in ordering). Examples:
if L1 < L2, then L1x < L1y, where x and y are any non-L.
if V1 < V2, then V1x < V1y, where x and y are any non-V.
if T1 < T2, then T1x < T1y, where x and y are any non-T.
R6. Shorter Clusters Sort First. A shorter, but otherwise identical, cluster sorts before a longer one, no matter what is later in the string. Examples:
L1x < L1L2, where x is any non-L
V1x < V1V2, where x is any non-V
T1x < T1T2, where x is any non-T
The current UCA/14651 architecture simply assigns a different primary weight to each L, V, T and other characters. It can also assign primary weights to sequences, called contractions.
Without any change in that architecture, the above requirements would have the following implications on the weights for those characters. However, these are only implications given the current architecture; under discussion is whether we should change the architecture.
There are no implications vis-à-vis weighting of L vs X. However, the CTT puts V and T before Katakana, Hiragana, and more importantly, CJK Ideographs. Without tailoring all V and T to be after all other characters that could occur right after a Hangul syllable, the current tables break the requirement that syllables sort as primary units. See Trailing Weights in the UCA for an example.
The following are features that are desired, if they can be accommodated without impacting the requirements.
DF1. Theoretical Syllables. Sort all theoretical syllables (not just existing syllables) according to the requirements. As a practical matter, this would imply that contractions could not be used as in (v) above, since it would not be possible to list all possible contractions.
DF2. Further Decompositions. (open issue) Some of the encoded Hangul Jamo can be decomposed further than canonical decomposition does. For example:
U+11AA (ᆪ) HANGUL JONGSEONG KIYEOK-SIOS =>
U+11A8 (ᆨ) HANGUL JONGSEONG KIYEOK +
U+11BA (ᆺ) HANGUL JONGSEONG SIOS
It may be desirable to have these sort as the same (on at least a primary and secondary level, perhaps also on a tertiary level).
DF3. Compatibility & Half-Width. (open issue) Map
compatibility and half-width Jamo more intelligently to standard Korean
syllable-blocks. The current weight table for compatibility and half-width Jamo
is stateless. It simply maps each such character to
the same weight as the
conjoining Jamo would have. It may be desireable to perform some additional
transformation, so that strings of compatibility or half-width Jamo sort the
same as the equivalent sequences of Hangul syllables. If so, the details of that
transformation need to be spelled out.
DF4. Default Sorting. The requirement R1 is that it be possible for a tailoring to meet the other requirements. It is also desirable that the standard CTT and architecture meet the other requirements, without tailoring being necessary.
We have to be careful of exactly what is involved in any change, especially involving a preprocessing step. Why is that? People around the world are extremely sensitive to the performance of collation. Unless "preprocessing" is designed correctly — so that in practice it can actually be done incrementally, and without large changes in architecture — it could slow down processing very significantly. And in practice, it would not be accepted by people in the market place.
The incremental nature is very important. Here is an example: in most production systems, there are two separate code paths: (1) sort key generation, and (2) direct string comparison.
The important features are that on average — in real data — a direct string comparison is something like 20 times faster than the production of a sort key (and may be even faster). This is because typical comparisons are actually decided in the first few characters, and the rest of the processing of the string is normally skipped. Unless a very large number of comparisons are made, the direct string comparison wins, and is commonly used. But incremental comparison is very performance sensitive; one can't even call a strlen() function on the text, since that screws up the performance.
The other feature is sort key length. If we end up with sort keys that are very significantly longer than what is done right now, it becomes a significant performance drain: in terms of the generation of the sort key, the extra storage it uses, the increase time to read sort keys in and out of memory, and so on. Since the same weight data is used in incremental processing, longer weights also mean slower incremental comparison.
We must be very careful of the implications for performance in everything we end up with for Hangul, since if it is too slow — and only marginally better as far as most Koreans are concerned — then it will end up being ignored (e.g. tailored away).