Comments on PDAM 1 on ISO/IEC 14651:2001.
Document Type: Working Group Document
Source: Kent Karlsson
Status: Expert Contribution
Date: 2001-10-03
1. Changes to clause 3 should be listed in the amendment:
* refer to ISO/IEC 10646-1:2000
* refer to ISO/IEC 10646-2:2001
Have a 'note' saying that those two together have the same repertoire and code point
assignments as Unicode 3.1. References to older editions of 10646 and its amendments
should be removed.
2. Annex B: All of the tailoring examples should refer to the new table.
3. Annex A: The table should be extended to cover all of Unicode 3.1. If, during
the processing of this amendment to 14651, amendment 1 to 10646-1:2000 is
approved, consideration should be given to cover also those characters.
4. The ordering of runic letters should follow the FUÞARK based ordering.
This is the ordering in which runic letters are traditionally presented. ABC-ordering
of the runic letters should be left to tailorings for specific purposes. Even though
ABC-ordering of runic letters may have been used during the late period of runic letter
usage, no evidence of this has been presented by runic experts, and would even so
be an influence of the then dominating latin script, not the order in which runes where
originally thought of. Further, when using runic letters as counters (for calendaric
calculations, at least), the FUÞARK order was used, much as ABC order of the
latin script is used for item list numbering.
5. All Han based characters (CJK URO, CJK ext A, CJK ext B), should be ordered
together, in the order mentioned (currently ext A comes before the URO).
6. All <MIN> subweightings at level 3 that correspond to (pseudo-) combining
characters should be changed to <BLK> (possibly name-changed to <ACC>), in order
to make the weighting system more transparent to a human reader. (NOTE THAT
<MIN> should NOT be read as 'third level MINimum weight', but as 'third level weight
for MINuscule (or caseless) letter/digit'; <BLK> (=<ACC>) is the minimum third level
weight.
7. US NB suggests to introduce a <MAX> weight; but since <MIN> is not minimum, the
name <MAX> would be misleading, making the reader think that <MIN> is mimumum.
Instead of <MAX> the name <LIG> is suggested.
8. Insert as a new annex E "Order-preserving subkey reduction"; moving the
bibliography to annex F. The text is attached at the end of these comments, and can
be obtained in electronic form from the Swedish NB. The kind of subkey reduction
described in (new) annex E is inherent in the "position" directive, and this annex
motivates why the "position" directive is sound. Further, annex E describe "high-level"
optimisations, not just implementation level optimisations.
9. The clauses in 14651 should be rearranged as follows (i.e. insert this as an
instruction to the editor, in the amendment text; as a basis for the next edition of 14651):
1. Scope
2. Conformance
3. Normative references
4. Definitions and symbols [from current 4 and 5]
5. Common Template Table: formation and interpretation
[current 6.3, except 6.3.5 and 6.3.6]
6. Declaration of a delta [current 6.4]
7. String comparison [current 6.1 and 6.2]
8. Conditions for collation equivalence [from current 6.3.5 and 6.3.6]
Annex A - Common Template Table [and its name, current 6.5]
[and then the other annexes: B to F]
The subclause divisions are intended to be kept, even though the subclauses are not
listed above.
This is a more logical exposition order of the contents of 14651, without making any
normative change.
10. The note for I 3 should use <BASE> instead of <BLANK>.
11. It has been suggested that the ordering after this amendment should not
be changed, INCLUDING that added characters are collated (via the CTT) at the end,
leaving better ordering of them to tailorings. Sweden does NOT support this LATTER
part of stability, since the CTT should give a good default ordering for characters
that are not of immediate concern to tailor.
The template table of collation weights has four levels. When applied to a string, each level nominally produces a subkey that is about as long, in number of weights, as the number of characters in the string itself. There are, however, a number of ways to reduce the size of the subkeys without changing the ordering as determined by the nominal key. This international standard neither specifies normatively, and even less requires the use of, any subkey reduction technique. However, for conformity, any key size reduction method must preserve the order between strings as determined by the nominal key produced by the given tailoring of the template table.
In order to illustrate what can be done in terms of subkey reduction, we here present two example reduction methods. Good implementations of these methods produce the reduced key directly, without producing the nominal key first.
Applied correctly to each nominal key, these reduction methods keep the order between the strings, since each is a strictly monotonically increasing mapping, i.e.,
if nominal_key1
< nominal_key2, then
reduce1(nominal_key1) < reduce1(nominal_key2), and
reduce2(nominal_key1) < reduce2(nominal_key2).
Each subkey reduction method that results in a strictly monotonically increasing mapping can be applied at any level. Moreover, different reduction methods can be applied at different levels, as long as it is done consistently for all keys. E.g., example method 1 below can be used for levels 2 and 4, while using example method 2 below for level 3, while no reduction method need be applied to level 1.
These example methods are not normative, and can probably be improved upon to reduce the subkey sizes further (e.g., one may first remove trailing weights that are minimal at levels 2 and 3).
This method can be applied for a single selected weight at each level this method is used, preferably a weight that is very commonly occurring at that level.
This method uses a count value, which may be stored in the subkey using different number of digits from what the weights uses at that level. The transformed value (see below) of the count need be not related to any of the weight values. In this illustration, we will use integer values between 00 and FF (in hexadecimal), the latter is the maxcntq. A practical implementation may of course use a wider range of values.
The reduction method works as follows, described in principle, not in implementation terms: Each maximal subrun of the for that level selected weight to be reduced, including subruns of length 0, is replaced by a transformed count value, as follows:
· Consider each subkey to be terminated by a less than minimal (at that level), or zero, weight.
· swaq is the selected reduction weight at level q.
· wb is the weight (including subkey terminator weight) that follows the (empty or not) maximal subrun, of the weight swaq, currently processed.
· cnt is the length (≥ 0) of the subrun currently processed.
· maxcntq is the value (≥ 0) that is the largest value that can be stored as a count in the subkey, given the number of digits used for the storage of a (transformed) count in a subkey at that level.
· guardcntq is a value ≥ 0 and ≤ maxcntq (e.g. maxcntq div 2 can be used as guardcntq).
· lowcntsizeq = guardcntq + 1.
· highcntsizeq = maxcntq − guardcntq + 1.
·
If swaq > wb: Replace the subrun with the
count/weight-pair (guardcntq, swaq) repeated
(cnt div lowcntsizeq) times, followed
by the count (cnt mod lowcntsizeq).
·
If swaq < wb: Replace the subrun with the
count/weight-pair (guardcntq, swaq) repeated
(cnt div highcntsizeq) times, followed
by the count (maxcntq − (cnt mod
highcntsizeq)).
Note that if the limits are set reasonably high, the count/weight-pairs will be repeated zero times in practical cases. In the other extreme, if maxcntq is zero, the subkey is not changed.
The reason this method works only for one weight per level is that subruns otherwise would cause ambiguous key reductions.
This reduction method can be applied to level 2, where <BASE> can be expected to be a very common weight. It can also be used for level 3, where <MIN> (minuscule, i.e. lowercase, or caseless) can be expected to be very common. This reduction method can also be applied to level 4, if one uses a tailoring that gives one very common weight on level 4. Note that if the selected weight is not very common in a string, the resulting key by this method may be longer than the nominal key, since empty subruns of the selected weight must be replaced by another subrun that is non-empty.
Below is an example reduction using this method on level 2 (selecting the weight <BASE>), on level 3 (selecting the weight <MIN>), and on level 4 (selecting a maximal weight called <PLAIN>; for a tailoring of the template table). For readability, we will in this example write <a>, <b>, etc. for the weights of the letters, <-> for <BASE>, <l> for <MIN>, <u> for <CAP>, <*> for <PLAIN>, <A> for <ACUTE>, <H> for the level 4 weight of a hyphen, and <P> for the level 4 weight of an apostrophe. The example character string is “Vice-Président’s”. The nominal key, according to the template table, can be expressed as a number in R, expressed as a hexadecimal fractional number; <v> etc. really stands for digit sequences; spaces are used for alignment for clarity, they are in no way part the actual key; the 0000 at the end of the subkeys is the subkey termination weight):
0.<v><i><c><e> <p><r><e> <s><i><d><e><n><t> <s> 0000
<-><-><-><-> <-><-><-><A><-><-><-><-><-><-> <-> 0000
<u><l><l><l> <u><l><l> <l><l><l><l><l><l> <l> 0000
<*><*><*><*><H><*><*><*> <*><*><*><*><*><*><P><*> 0000
Do the reduction as described above (note that here: <-> (and <l>) is smaller than any other level 2 (level 3) weight, but greater than 0 (the subkey termination weight), and <*> is greater than any other level 4 weight):
0.<v><i><c><e> <p><r><e> <s><i><d><e><n><t> <s> 0000
F8 <A> 07 0000
FF <u> FC <u> 0B 0000
04 <H> 09 <P> 01 0000
Note that “count” values are at the beginning and end of the reduced subkeys, as well as between each non-selected weight. This key is significantly shorter than the nominal key, in this example as well as for most (not all) other strings that one can expect to normally occur.
This method can be applied for a set of weights at each level, preferably the ones that are commonly used at that level.
This method also uses a count value, but the transformed weight value must have the same number of digits as all other weights at that level uses. The transformed values (see below) for the count must have values that are in the neighborhood of the weight in the subrun that is replaced. This neighborhood of a weight must not overlap with any other weights or neighborhoods of weights.
The reduction method works as follows, described in principle, not in implementation terms: Each non-empty maximal subrun of a weight selected for reduction is replaced as follows:
· Consider each subkey to be terminated by a less than minimal (at that level), or zero, weight.
· wa is the weight in the subrun.
· wb is the weight (including subkey terminator) that follows the non-empty maximal subrun of wa currently processed.
· cnt is the length (≥ 1) of the subrun currently processed.
· minnbhwa is the minimum value that constitutes the neighborhood of wa.
· maxnbhwa is the maximum value that constitutes the neighborhood of wa.
· lownbhsizewa = wa − minnbhwa + 1.
· highnbhsizewa = maxnbhwa − wa + 1.
· If wa > wb: Replace the subrun with the weight wa repeated (cnt div lownbhsizewa) times, and, if (cnt mod lownbhsizewa) is non-zero, followed by the weight (minnbhwa + (cnt mod lownbhsizewa) − 1).
· If wa < wb: Replace the subrun with the weight wa repeated (cnt div highnbhsizewa) times, and, if (cnt mod highnbhsizewa) is non-zero, followed by the weight (maxnbhwa − (cnt mod highnbhsizewa) + 1).
Note that if the neighborhood of wa consists of wa only, the subrun is not changed.
This reduction method can be applied to level 2, for <BASE>. It can also be used for level 3, e.g. for <MIN>, <CAP>, <HIRA>, and <KATA>. It can also be applied to level 4, for <PLAIN> (for a tailoring that uses <PLAIN> as a maximal weight at level 4). Note that even if the weight reduced is not common, the resulting subkey is never longer than the nominal subkey. A nontrivial neighborhood is needed around each of the selected weights for a shortening of the subkey to actually take place.
We do an example reduction using this method on level 2 (for <BASE>), on level 3 (for <MIN> and <CAP>), and on level 4 (for <PLAIN>). For this we make the following assumptions:
· <-> (level 2): weight: 0026; minimum – maximum of neighborhood: 0022 – 002A.
· <l> (level 3): weight: 0005; minimum – maximum of neighborhood: 0002 – 0007.
· <u> (level 3): weight: 0017; minimum – maximum of neighborhood: 0015 – 001A.
· <*> (level 4): weight: 0F80; minimum – maximum of neighborhood: 0F00 – 0FFE.
For the same example string as in the description of example reduction method 1, we still have the nominal key:
0.<v><i><c><e> <p><r><e> <s><i><d><e><n><t> <s> 0000
<-><-><-><-> <-><-><-><A><-><-><-><-><-><-> <-> 0000
<u><l><l><l> <u><l><l> <l><l><l><l><l><l> <l> 0000
<*><*><*><*><H><*><*><*> <*><*><*><*><*><*><P><*> 0000
Do the reduction as described above:
0.<v><i><c><e> <p><r><e> <s><i><d><e><n><t> <s> 0000
0024 <A>0028 0000
0015 0005 0015 0005 0005 0002 0000
0F03 <H>0F08 <P>0F00 0000
If the lower neighborhood of <l> had been a bit bigger, we could have used only one weight (or two), instead of three, for the sequence of nine <l> weights.
This key is also significantly shorter than the nominal key, and this method never lengthens the key, since every subrun replaced is replaced by one that is shorter or at most as long as the original subrun. This method can also be applied to several weights at each level, which example method 1 cannot. On the other hand, example method 2 is a bit more complex than example method 1, since it needs to keep track of non-overlapping neighborhoods around some of the weights.
In many cases, at level 2 and up, differing weights don’t convey a real difference for the ordering, since subkeys that are more significant may contain a necessary ordering difference already. This is especially common for the fourth level, where the Common Template Table has different weights for every collating element. Since there are usually other more significant differences, it is rarely this difference at the fourth level matters. Implementers are in such cases allowed to map several nominally different weights to the same actual weight, in such a way that the resulting ordering is unchanged, in order to enable better subkey reduction.