From domo@tsa.co.uk Fri Nov 16 20:34:32 1990
From: Dominic Dunlop <domo@tsa.co.uk>
X-Sequence: i18n@dkuug.dk 6
Errors-To: i18n-request@dkuug.dk
Date: Fri, 16 Nov 90 15:23:35 GMT
Message-Id: <1633.9011161523@tsa.co.uk>
In-Reply-To: Mark E Davis <Mark_E_Davis.PINKTEAM@gateway.qm.apple.com>
       "10646 Advantages" (Nov 15, 22:33)
X-Fax: +44 491 651751
X-Phone: +44 491 652590
X-Address: 9 The Forty, Cholsey, OXON OX10 9LH, U.K.
X-Organization: The Standard Answer Ltd.
X-Mailer: Mail User's Shell (7.1.2 7/11/90)
To: Mark_E_Davis.PINKTEAM@gateway.qm.apple.com, unicode@noddy.eng.sun.com,
        Internet_UniCore.PINKLINK@gateway.qm.apple.com
Subject: Re: 10646 Advantages
Cc: i18n@dkuug.dk
X-Charset: ASCII
X-Char-Esc: 29

[From "10646 Advantages" dated Nov 15]
> [Much cogent and coherent stuff deleted]
> ...
> 4. Correct collation order
> 10646 maintains the nationally mandated collation orders of Korean, Japanese
> and Chinese.  No tables are needed for collation.

Collation?!  Don't even think about collation!  It's not your problem
-- or 10646's.

Two more measured comments here (and I did read the A: section before
rushing in with them):

 1. If there is a nationally mandated collation order for Japanese, my
    colleagues and I on the ISO/IEC JTC1/SC22/WG15 rapporteur group on
    internationalization (see below for explanation) are not aware of
    it.  Our information is that, where collation orders exist at all,
    they tend to be proprietary and/or specific to particular
    application areas (telephone books, dictionaries, directories...).
    To define a single national collating order for Japan would seem to
    be as much a political problem as a technical one -- and it's one
    hell of a technical problem.

    If this perception, gained by working with technical experts from
    Japan, is misinformed or incomplete, we would appreciate being put
    right.

 2. ``[A complaint against ASCII is] that you cannot order a file very
    well by using the binary sequences for character representation.  Of
    course you can't!  The New York Telephone Company, if asked, might
    send you its multipage set of rules for ordering the names in
    telephone directories.  To think that the characters in a set
    should be grouped in a set by their usage (e.g. all arithmetic
    operators) is as futile as thinking that all vowels should lie next
    to each other on a keyboard, or that all keys should be laid out in
    alphabetical order.  No way!''

    Who said that?  R. W. Bemer, credited as ``the father of ASCII''
    in a letter published on pages 36-37 in Byte, volume 15, number 6,
    June 1990.

    In other words, correct collation was not even a goal in the choice
    of ASCII character encodings.  It should not be, and cannot be, a
    goal in the design of more complex encodings, particularly those
    which, like Unicode and most ISO coded character sets, can trace
    their ancestry to ASCII.

    This consideration makes Mark's 10646 advantage 4. a no-op --
    unless DIS 10646 claims to provide correct collation for Korean,
    Japanese and Chinese through simple arithmetic comparison of
    character encodings.  Such a claim would almost certainly be
    unfounded, and would be a mark against 10646.

    Looking at the September 1990 working draft of DIS 10646, I see no
    such claim -- or, indeed, any reference whatever to collation.
    This is not surprising, as SC2, the JTC1 subcommittee on character
    sets and information coding, regards collation as Somebody Else's
    Problem.  This is clearly the correct attitude to take if you are
    on a working group allocating encodings, but is unhelpful:  every
    other part of JTC1 seems also to regard collation as Somebody
    Else's Problem.

    The net effect of this is that the ISO POSIX working group (!) is
    currently running with the issue because it needs a solution: the
    UNIX shell and tools embody collation and related concepts
    (filename expansion and listing, the sort command, regular
    expressions), and a corresponding international standard must be
    internationally applicable.  Work in progress suggests that, by
    making up to four passes backwards and forwards through text,
    assigning different weights (including ``ignore'', ``high'' and
    ``low'') to each encoded character encountered on each pass, you
    can achieve useful real-world collation, although you probably
    can't do a telephone book sort even in New York, never mind Tokyo.

    Our work has been based primarily on encodings without the
    non-spacing diacritics (accents) of Unicode.  If it turns out that
    we can't accommodate these, we'll think again: the ability to
    handle Unicode is at the very least an important proof of concept
    for us.  (My feeling is that, compared to the handling of stateful
    encodings with locking shifts -- something else that we intend to
    accommodate -- non-spacing diacritics should be a piece of cake.)
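
To make the multi-pass idea concrete, here is a minimal sketch in
Python.  It is an illustration only, not the WG15 or UniForum design:
the use of three passes rather than four, the choice of weights, and
the names clusters(), sort_key() and WORDS are all assumptions made for
the example, and it copes only with simple Latin text.  Punctuation
gets the ``ignore'' weight, base letters are compared on a forward
pass, non-spacing diacritics on a backward pass (no accent sorts
``low''), and case breaks any remaining ties.

```python
import unicodedata


def clusters(text):
    """Split text into (base, marks) pairs after canonical decomposition.

    marks holds the non-spacing diacritics that follow each base character.
    """
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def sort_key(text):
    """Build a three-pass collation key (hypothetical weights, Latin only)."""
    # Weight "ignore": spaces and punctuation take no part in any pass.
    units = [(b, m) for b, m in clusters(text) if b.isalnum()]
    # Pass 1, forwards: base letters only, case and accents disregarded.
    pass1 = [b.lower() for b, _ in units]
    # Pass 2, backwards: diacritics; the empty string (no accent) is lowest.
    pass2 = [m for _, m in reversed(units)]
    # Pass 3, forwards: lower case (False) before upper case (True).
    pass3 = [b.isupper() for b, _ in units]
    return (pass1, pass2, pass3)


# Escapes keep this ASCII: \u00f4 is o-circumflex, \u00e9 is e-acute.
WORDS = ["c\u00f4t\u00e9", "Zebra", "cot\u00e9", "apple",
         "c\u00f4te", "Cote", "cote"]

by_code_point = sorted(WORDS)             # simple arithmetic comparison
by_weights = sorted(WORDS, key=sort_key)  # multi-pass weighted comparison
```

Comparing the two results shows the point: by_code_point puts every
upper-case word ahead of any lower-case one, while by_weights keeps the
accented and unaccented spellings of ``cote'' together between
``apple'' and ``Zebra''.  Note that ``co-op'' and ``coop'' get identical
keys here, which is exactly the sort of decision a real scheme would
have to make deliberately.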


Where am I coming from?  Clearly the ISO camp: I'm a delegate to
JTC1/SC22/WG15, the ISO POSIX working group, and the UK's designated
expert on internationalization -- a topic which starts out with coded
character sets and then gets worse.  The rapporteur group on
internationalization, where we ``experts'' hang around, works mainly on
the definition of ``national profiles'' -- sets of preferences which
mould a POSIX system to the needs of particular territories.  Much of
the groundwork on internationalization has been done by the UniForum
Technical Committee Subcommittee on Internationalization.  In
particular, it was UniForum which came up with the current collation
and regular expression handling.  (X/Open was heavily involved as
well.)

Clearly, ISO POSIX has to accommodate existing ISO-sanctioned encodings
-- although ISO 646 is giving us some grief because of the use by the
UNIX shell of characters in the national variant positions.  10646 is
coming down the pike, so we're looking at that too.  If you want to
help us look at Unicode, please keep us informed of relevant
developments -- perhaps by cross-posting where appropriate to our
public mail-list, i18n@dkuug.dk.  (Mail to i18n-request@dkuug.dk if you
want to join.)  (I18n is short for internationalization, a 20-letter
word.)
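
For anyone who has not met the ISO 646 problem mentioned above, the
following Python sketch shows its shape.  Treat it as a sketch under
assumptions: the list of national-use positions is given as I
understand ISO 646, the set of shell-special characters is partial and
purely illustrative, and the Danish replacement table is included only
as an example of a national variant, so check it against the standard
before relying on it.  The names NATIONAL_USE, SHELL_SPECIAL, DANISH,
shell_chars_at_risk() and as_displayed() are invented for the example.

```python
# Positions that ISO 646 allows national variants to reassign.
NATIONAL_USE = (0x23, 0x24, 0x40, 0x5B, 0x5C, 0x5D, 0x5E, 0x60,
                0x7B, 0x7C, 0x7D, 0x7E)

# Partial, illustrative set of characters the shell treats specially.
SHELL_SPECIAL = set("#$&*()[]{}|\\;'\"<>?`~")

# Example variant (assumed Danish): the brackets, backslash, braces and
# vertical bar are replaced by six national letters.
DANISH = {0x5B: "\u00c6", 0x5C: "\u00d8", 0x5D: "\u00c5",
          0x7B: "\u00e6", 0x7C: "\u00f8", 0x7D: "\u00e5"}


def shell_chars_at_risk():
    """Shell-special ASCII characters that sit in national-use positions."""
    return [chr(p) for p in NATIONAL_USE if chr(p) in SHELL_SPECIAL]


def as_displayed(command, variant):
    """Show how an ASCII command line renders on a national-variant terminal."""
    return command.translate(variant)
```

Under these assumptions, as_displayed("ls | sort", DANISH) shows a
national letter where the pipe should be, and most of the characters
returned by shell_chars_at_risk() are ones a shell user types every
day.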

Thanks.

-- 
Dominic Dunlop