ISO/IEC JTC 1/SC22
Programming languages, their environments and
system software interfaces
Secretariat:
U.S.A. (ANSI)
ISO/IEC JTC 1/SC22 N3392
TITLE:
Summary of Voting on SC 22 N3352, Letter Ballot for FPDAM 1 to ISO/IEC 14651, International String Ordering and Comparison - Method for Comparing Character Strings and Description of a Common Tailorable Ordering Template
DATE ASSIGNED:
2002-04-29
SOURCE:
SC 22 Secretariat
BACKWARD POINTER:
SC 22 N3352
DOCUMENT TYPE:
Summary of Voting
PROJECT NUMBER:
1.22.30.02.02.01
STATUS:
WG 20 is requested to prepare a Disposition of Comments Report addressing the comments submitted on this ballot and to prepare the FDAM text for submission to ITTF.
The summary of voting is located at:
http://www.dkuug.dk/jtc1/sc22/def/n3392.htm
ACTION IDENTIFIER:
ACT
DUE DATE:
DISTRIBUTION:
PDF
CROSS REFERENCE:
DISTRIBUTION FORM:
HTML
Address reply to:
ISO/IEC JTC 1/SC22 Secretariat
Matt Deane
ANSI
25 West 43rd Street
New York, NY
10036
Telephone:
(212) 642-4992
Fax: (212) 840-2298
Email:
mdeane@ansi.org
SUMMARY OF VOTING ON
Letter Ballot Reference No: SC22 N3352
Circulated by: JTC 1/SC22
Circulation Date: 2001-12-14
Closing Date: 2002-04-14
SUBJECT: Summary of Voting on SC 22 N3352, Letter Ballot for FPDAM 1 to ISO/IEC 14651, International String Ordering and Comparison - Method for Comparing Character Strings and Description of a Common Tailorable Ordering Template
----------------------------------------------------------------------
The following responses have been received on the subject of approval:
"P" Members supporting approval without comment: 8
(Canada, Czech Republic, Denmark, Finland, Japan, Republic of Korea, Netherlands, United Kingdom)
"P" Members supporting approval with comments: 1
(Norway)
"P" Members not supporting approval: 1
(United States of America)
"P" Members abstaining: 1
(Switzerland)
"P" Members not voting: 13
(Austria, Belgium, Brazil, China, Egypt, France, Germany, Ireland, DPR of Korea, Romania, Russian Federation, Slovenia, Ukraine)
O-member Sweden approves with the attached comments.
___________ End of summary; NB comments follow ___________
Norway
Technical comments:
NOR.1: The template table needs to be aligned with IS 12199 and CEN/ENV 13710 European Ordering rules. This would mean that all Latin letters be sorted on level 1 as their base letters, and that special characters will be ignored at level 1.
NOR.2: Cyrillic letters need to be sorted in a fully deterministic way. Fully composed characters and corresponding character sequences should sort equally on level 3 but differently on level 4.
NOR.3: Control characters U0000..U001F and U007F..U009F shall have distinct sorting code on level 4 to make the sorting fully deterministic.
NOR.4: The template needs to specify a fully deterministic ordering, as also decided in previous ballots.
Sweden
1. There is a symbol generation bug for parenthesised Hàn in the draft new CTT: e.g.
<U3232> "<SFF40><SE709>";...
should be
<U3232> "<SFF40><TE709>";...
(Note S --> T.) The same applies to the other parenthesised Hàn.
2. None of the five-hexdigit S-symbols in the draft new CTT are declared as they should be.
3. The first level weighting should be split into groups according to SC22/WG20 N898R, String ordering weighting roadmap, so that Brahmic derived scripts and Hangul are collated according to collation clustering. Note that Hangul in addition needs some extra prehandling for this to work properly, but detailing that part has always been seen as out of scope for 14651.
4. The weights for Hangul letter cluster jamos should reflect the letter content of each cluster jamo. Similarly for the letter cluster Hangul compatibility letters. The weighting for the circled and parenthesised Hangul should be based on the (incomplete) syllabic content. See SC22/WG20 N891R, Hangul ordering rules, for details.
USA
The US votes NO with comments on FPDAM 1 to ISO/IEC 14651 (SC22/WG20 N890; L2/01-470). Its vote will be changed to YES if the following two problems are addressed. (Other than to address these two problems, the US prefers the weights that are present in the table.)
Due to production problems in generating the data tables, the following item TC4 from L2/01-330 was not implemented in the current data table, although it was accepted (see the disposition of comments on the PDAM, SC22/WG20 N882.) It needs to be implemented to prevent formal ordering problems and maintain synchronization with UCA.
TC4. Modify handling of secondaries for Numerics. These are to be weighted consistent with the approach used in other constructed secondaries (not involving an accent), such as in:
<U16AA> <S16A8>;"<BASE><VRNT1>";"<COMPAT><MIN>";<U16AA> % RUNIC LETTER AC A
Thus, the following example for a Mongolian digit
<U1811> <S0031>;<MONGL>;<MIN>;<U1811> % MONGOLIAN DIGIT ONE
will become
<U1811> <S0031>;"<BASE><MONGL>";"<MIN><MIN>";<U1811> % MONGOLIAN DIGIT ONE
The numeric script secondary symbols to which this should be applied are the following:
<NEGATIVE>
<SANSSERIF>
<NEGSANSSERIF>
<ARABIC>
<EXTARABIC>
<ETHPC>
<NAGAR>
<BENGL>
<BENGALINUMERATOR>
<GURMU>
<GUJAR>
<ORIYA>
<TAMIL>
<TELGU>
<KNNDA>
<MALAY>
<THAII>
<LAAOO>
<BODKA>
<MYANM>
<KHMER>
<MONGL>
<CJKVS>
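The rewrite requested by TC4 can be sketched as a line-by-line transformation of the table. This is a minimal illustration only: the function name, the regular expression, and the assumption that every affected entry carries a single `<MIN>` at level 3 (as in the Mongolian example above) are not taken from the ballot text.

```python
import re

# Numeric script secondary symbols listed in TC4.
NUMERIC_SECONDARIES = {
    "NEGATIVE", "SANSSERIF", "NEGSANSSERIF", "ARABIC", "EXTARABIC", "ETHPC",
    "NAGAR", "BENGL", "BENGALINUMERATOR", "GURMU", "GUJAR", "ORIYA", "TAMIL",
    "TELGU", "KNNDA", "MALAY", "THAII", "LAAOO", "BODKA", "MYANM", "KHMER",
    "MONGL", "CJKVS",
}

# Assumed entry shape: <Uxxxx> <Sxxxx>;<SECONDARY>;<MIN>;<Uxxxx> % comment
ENTRY = re.compile(r'^(<U[0-9A-F]+>\s+<S[0-9A-F]+>);<([A-Z0-9]+)>;<MIN>;(.*)$')

def expand_numeric_secondary(line):
    """Return the entry with "<BASE><X>" at level 2 and "<MIN><MIN>" at level 3.

    Lines that do not match the assumed shape, or whose secondary symbol is
    not in the TC4 list, are returned unchanged.
    """
    match = ENTRY.match(line)
    if match is None or match.group(2) not in NUMERIC_SECONDARIES:
        return line
    head, secondary, rest = match.groups()
    return f'{head};"<BASE><{secondary}>";"<MIN><MIN>";{rest}'

print(expand_numeric_secondary(
    "<U1811> <S0031>;<MONGL>;<MIN>;<U1811> % MONGOLIAN DIGIT ONE"))
```

Entries that already use a quoted multi-weight secondary, such as the Runic example, do not match the pattern and pass through untouched.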
Background. Look at the following example, with:
<U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A
<U00E1> <S0061>;"<BASE><AIGUT>";"<MIN><MIN>";<U00E1> % LATIN SMALL LETTER A WITH ACUTE
<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
<U0968> <S0032>;<NAGAR>;<MIN>;<U0968> % DEVANAGARI DIGIT TWO
The following shows how combinations of the first two and second two sort:
Letters | Sort Key
a2 | <S0061><S0032><BASE><BASE>...
a२ | <S0061><S0032><BASE><NAGAR>...
á२ | <S0061><S0032><BASE><NAGAR>...
á2 | <S0061><S0032><BASE><AIGUT><BASE>...
Notice that in the first two cases we get 2, then Devanagari 2; while in the second two cases we get the reverse. This is clearly wrong; the wrong secondary weights are being compared to one another. To prevent these cases, UCA is adding the following invariant:
For all collation elements,
3. All secondaries in non-ignorables must be strictly less than those in primary ignorables.
4. All tertiaries in primary ignorables must be strictly less than those in secondary ignorables.
In general, all Level N weights in Level N-1 ignorables must be strictly less than those in Level N-2 ignorables.
The accent in a-acute is a primary ignorable, and must thus have a secondary weight less than the secondary weight in Devanagari digit two. While there are different ways to achieve this, the easiest is to expand the Devanagari weight into:
<U0968> <S0032>;"<BASE><NAGAR>";"<MIN><MIN>";<U0968> % DEVANAGARI DIGIT TWO
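The misalignment of secondary weights can be reproduced with a small sketch. The numeric values below are hypothetical stand-ins for the symbolic weights, chosen so that <NAGAR> is greater than <AIGUT>; the actual values in the table are not given in this comment. All four strings share the same primary key, so only the level 2 key is compared, and Python tuple comparison stands in for the key comparison.

```python
# Hypothetical numeric stand-ins for the symbolic level 2 weights.
BASE, AIGUT, NAGAR = 1, 2, 3  # assumption: <NAGAR> is weighted above <AIGUT>

COMMON = {
    "a": [BASE],                # LATIN SMALL LETTER A
    "\u00e1": [BASE, AIGUT],    # LATIN SMALL LETTER A WITH ACUTE
    "2": [BASE],                # DIGIT TWO
}
# Devanagari digit two before and after the requested expansion.
BEFORE = {**COMMON, "\u0968": [NAGAR]}
AFTER = {**COMMON, "\u0968": [BASE, NAGAR]}

def level2_key(text, table):
    """Concatenate the level 2 weights of each character in order."""
    key = []
    for ch in text:
        key.extend(table[ch])
    return tuple(key)

def order(table):
    strings = ["a2", "a\u0968", "\u00e12", "\u00e1\u0968"]
    return sorted(strings, key=lambda s: level2_key(s, table))

print(order(BEFORE))  # a + Devanagari two lands after both a-acute strings
print(order(AFTER))   # both unaccented strings precede both accented ones
```

Under these assumed weights, the unexpanded entry lets <NAGAR> be compared against <AIGUT>, so the relative order of a and a-acute depends on which digit follows. With the expanded entry every non-ignorable starts its level 2 weights with <BASE>, and the accent is compared only against <BASE>.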
To maintain the synchronization between ISO/IEC 14651 and the Unicode Collation Algorithm, the US requests that the primary values for JUNGSEONG and JONGSEONG characters be made higher than any other weights in the Default table. In no case will this result in worse sorting results, and it does preserve synchronization.
<U20000>..<U2A6D6> <S20000>..<S2A6D6>;<BASE>;<MIN>;<U20000>..<U2A6D6> % Han Extension B
<U1160> <S1160>;<BASE>;<MIN>;<U1160> % HANGUL JUNGSEONG FILLER
....
<U11F9> <S11F9>;<BASE>;<MIN>;<U11F9> % HANGUL JONGSEONG YEORINHIEUH
<PLAIN> % Maximal level 4 weight
This change does not preclude adding descriptions of possible preprocessing steps with similar objectives, as some other national bodies may request.
Background. The UCA currently sorts Hangul as follows. ISO/IEC 14651 does the same, whenever NFD (decomposed) data is used, or when archaic Hangul syllables (requiring the use of Jamo) are used.
Case 1
1 | 가 | {HANGUL SYLLABLE GA}
2 | 각 | {HANGUL SYLLABLE GAG}
Notice that GAG comes after GA in Case 1. But in Case 2, it comes before. That is, the order of these two Hangul syllables is reversed when each is followed by a CJK character.
Case 2
2 | 각一 | {HANGUL SYLLABLE GAG}{U+4E00}
1 | 가一 | {HANGUL SYLLABLE GA}{U+4E00}
This is not acceptable: when two characters A and B have different primary order, appending another independent primary-weighted character C to each should not affect the ordering. (Independent means that AC and BC do not form contractions, do not interact in normalization, and are not subject to Thai rearrangement.)
Why does this happen? All characters are decomposed when sorting in UCA, to preserve canonical equivalence. (This is the logical procedure -- optimizations can be used as long as they have the same order.) This results in the following comparisons being made:
[Table not recoverable from the source: column-by-column comparison of the decomposed collation elements for Case 1 and Case 2]
Look at column 3 in Case 1 and 2.
The Unicode Technical Committee has considered this issue, and for a number of reasons has approved the following solution. In particular, this solution normally has no performance or sort-length impact on the UCA. Collation implementations are extremely sensitive as to both performance and sort-key length, so this is a very important feature. It also has the advantage of essentially no impact on the standard implementations, since it only changes three constants used in the UCA algorithm. The changes that have been approved for UTR #10: Unicode Collation Algorithm are:
1. In 7.1.3 Implicit Weights, an area of 1024* high primary weights is reserved, by changing the BASE weights from:
FFC0 | CJK Ideograph
FF80 | CJK Ideograph Extension A/B
FF40 | Any other code point
to
FBC0 | CJK Ideograph
FB80 | CJK Ideograph Extension A/B
FB40 | Any other code point
* 1024 is sufficient room, given that multiple primaries can always be used if necessary (as in 6.2 Large Weight Values).
2. In the Default Unicode Collation Element Table, the trailing Hangul characters are changed to have primary weights in the Fxxx range, e.g. FCE0..FD7E. These include:
1161 ; [.16E0.0020.0002.1161] # HANGUL JUNGSEONG A
....
11F9 ; [.1773.0020.0002.11F9] # HANGUL JONGSEONG YEORINHIEUH
3. Since the assignment of CJK Ideographs has changed, the dependent characters are modified, such as
U+3280 CIRCLED IDEOGRAPH ONE
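The room freed by the BASE change can be checked with a short sketch. The two-weight derivation used here (AAAA = BASE + (CP >> 15), BBBB = (CP & 0x7FFF) | 0x8000) is the implicit weight formula as given in UTR #10 rather than in this comment, and the dictionary and function names are illustrative only.

```python
# BASE constants before and after the approved change.
OLD_BASE = {"cjk": 0xFFC0, "cjk_ext": 0xFF80, "other": 0xFF40}
NEW_BASE = {"cjk": 0xFBC0, "cjk_ext": 0xFB80, "other": 0xFB40}

MAX_CODE_POINT = 0x10FFFF

def implicit_primaries(cp, base):
    """Derive the pair of implicit primary weights for a code point."""
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb

# Each BASE moves down by the same amount.
shift = OLD_BASE["cjk"] - NEW_BASE["cjk"]

# Highest lead weight any code point can receive with the new constants.
highest_new_lead = NEW_BASE["cjk"] + (MAX_CODE_POINT >> 15)

print(shift)                                                # 1024
print(hex(highest_new_lead))                                # 0xfbe1
print([hex(w) for w in implicit_primaries(0x4E00, NEW_BASE["cjk"])])
```

With the new constants no implicit lead weight exceeds FBE1, so the example range FCE0..FD7E for the trailing Hangul characters lies entirely above every implicit weight.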
Because of these changes, the JUNGSEONG and JONGSEONG characters are assigned primary weights in a high range, higher than any other characters. Thus the above Case 2 changes to:
Case 2b
  | Source | 1 | 2 | 3 | 4
1 | 가一 | {K} | {a} | {U} |
2 | 각一 | {K} | {a} | {k} | {U}
Because {a} and {k} now have high weights, higher than anything (e.g., {U}) that might follow them, the right order results. The only further issue is the case of multiple lead characters. The UCA and 14651 have mechanisms that can be called into play in this case, described in Section 3.1.1 Multiple Mappings. For example, suppose that the Hangul Syllable is of the form LLVT instead of LVT (this happens with archaic Hangul). If the LL is to be sorted as a unit, then it would require the addition of a contraction, so that the LL mapped to a single primary. If the second L is to be sorted as if it were trailing, then this would require a contraction-expansion, as described in 3.1.1. There are a small number of LL cases -- these can be easily tailored for environments requiring the sorting of archaic Hangul.
Note: Such a strategy can also be used for other languages. For any case where trailing characters in a sequence (grapheme cluster, conjunct, etc.) are given primary weights above any other characters, tailoring to high weights can produce the right results.
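The effect of raising the trailing jamo weights in Case 2 and Case 2b can be sketched as follows. The numeric primary weights are hypothetical and chosen only to place {a} and {k} either below or above the ideograph {U}; the element labels follow the Case 2b table.

```python
# Decomposed collation elements, labelled as in the Case 2b table.
GA = ["K", "a"]          # HANGUL SYLLABLE GA
GAG = ["K", "a", "k"]    # HANGUL SYLLABLE GAG
HAN = ["U"]              # U+4E00

# Hypothetical primary weights: trailing jamo below or above the ideograph.
OLD = {"K": 10, "a": 20, "k": 30, "U": 100}
NEW = {"K": 10, "a": 200, "k": 300, "U": 100}

def primary_key(elements, weights):
    """Primary sort key: one weight per collation element, in order."""
    return tuple(weights[e] for e in elements)

def ga_sorts_first(suffix, weights):
    """True if GA + suffix sorts before GAG + suffix at the primary level."""
    return primary_key(GA + suffix, weights) < primary_key(GAG + suffix, weights)

print(ga_sorts_first([], OLD), ga_sorts_first(HAN, OLD))  # True False
print(ga_sorts_first([], NEW), ga_sorts_first(HAN, NEW))  # True True
```

With the low weights, {k} is compared against {U} in the third position and loses, which reverses the order once the ideograph is appended. With the high weights, {k} outranks anything that can follow, so appending the ideograph leaves the order of GA and GAG unchanged.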