SC22/WG20 N775
L2/00-308
Collection of reactions to the WG20 convenor's
"Personal thoughts about the future of WG20"
Part 2, from September 7 through September 13, 2000
Akio Kido suggested that I collect all reactions to my proposal about the future of WG20 in one document for easy reference. Due to the interest in this subject, it became a rather lengthy document and I decided to put a linked index in front of it – that allows you to go straight to the contribution that interests you. I did not do any formatting – please excuse it if the text in HTML does not look as good as it could, but I wanted to maintain the original form of the e-mails the way I received them.
The document got too long – I had to split it into parts:
Parts (SC22/WG20 | NCITS/L2 - UTC):
- Part 1, from August 30 – September 6, 2000
- Part 2, from September 6 on …
Index with the latest document on top (status September 13, 2000):
National Body | Name | Date | Supports
USA | Ken Whistler | 2000-09-13 | Y
USA | Ken Whistler | 2000-09-12 | Y
Norway | Keld Simonsen | 2000-09-12 | N
W3C | Martin Dürst | 2000-09-11 | Y
Norway | Keld Simonsen | 2000-09-07 | N
Canada | Glen Seeds | 2000-09-07 | N
France | Antoine Leca | 2000-09-07 | ?
Individual contributions on e-mail:
France, Antoine Leca, September 7, 2000
From: Kenneth Whistler [kenw@sybase.com]
>
> 3. Character Properties
>
> The most contentious issue regarding DTR 14652 is the effort to
> extend LC_CTYPE to cover the repertoire of ISO 10646-1. The contending
> positions effectively reflect a worldview divide among the participants
> regarding character properties:
>
> Position A: Character properties have not traditionally been covered
> by character encoding standards, and have not been viewed as the
> domain of the ISO committee responsible for encoding characters: SC2.
> Instead, character properties are an implementation issue, traditionally
> dealt with in the standards most directly concerned with character
> implementation -- namely the formal language standards -- and are
> dealt with in ISO by the working groups under SC22. In the context
> of 14652, the appropriate place to define character properties is
> LC_CTYPE, where the properties would be usable in a POSIX context as
> part of locale definitions.
May I point out that POSIX in this area provides just two things:
- a portable way to "formalize" LC_CTYPE (the localedef mechanism), which
is the very thing that PDTR 14652 is improving; this is covered by
Ken's previous discussion, as I see things;
- a mandatory implementation of the minimum subset, the "POSIX" locale,
which POSIX really inherited from Unix V7 and its successors, but
formally inherits from the C Standard.
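For readers unfamiliar with the localedef mechanism mentioned above, an LC_CTYPE definition in POSIX locale source format looks roughly like the following. This is an illustrative fragment only; the symbolic character names and the selection of classifications are chosen for the example, not taken from any of the standards under discussion:

```
LC_CTYPE
# Classify a few characters; <U0410> and <U0430> stand for
# CYRILLIC CAPITAL/SMALL LETTER A in the symbolic-name convention.
upper   <A>;<B>;<U0410>
lower   <a>;<b>;<U0430>
toupper (<a>,<A>);(<b>,<B>);(<U0430>,<U0410>)
END LC_CTYPE
```

A localedef utility compiles such a source file into the binary locale data that functions like iswupper() consult at run time.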
As such, one may also consider involving WG14. Furthermore, the new revision
of the C standard provides some support for the UCS. If these extensions
are used (and this is a prerequisite for them to be used in a POSIX context),
then it would be a natural extension in the next amendment/revision of the
C Standard to provide mandatory rules for the character properties: for
example, to require iswupper(L'\u0410') to return nonzero;
currently, this is not the case (nothing is required here).
<sidenote>
Furthermore, I had a discussion within the POSIX group some months ago.
As a result, in the "POSIX" locale, iswupper(L'\u0410') is expected to return 0.
</sidenote>
Canada, Glen Seeds, September 7, 2000
Sounds like we've started another really interesting thread.
My reactions on postings so far:
- I agree with those who say that trying to develop an API in a horizontal
group makes no sense. That should be left to the individual programming
language groups.
- I also agree that we should try to avoid invention wherever possible, and
search for existing practice that can be codified. Where invention is
unavoidable, we should try to keep it at as high a level as possible.
- I disagree that this topic is outside the proper province of programming
languages. All such specifications include libraries or similar facilities
that, while not central to the language syntax and semantics, are needed in
order to make the average programmer's life tenable, and to facilitate
common solutions to common problems. This includes things such as I/O and
string handling. i18n is another thing of this type.
- I agree that the single most important issue is enabling conformance to
10646. I don't agree that this is the only important issue. Handling of
other cultural conventions in a standard way is also extremely important.
- I don't agree that leaving this to individual vendors would be a
reasonable way to address this need. As a user, it costs my company a great
deal to have to work around the differences between vendors in this area,
and having standard solutions that they all conform to would be of
considerable value to us. I have to tell you that in the face of the absence
of this, we are adopting the same approach as was described for Metaphor: we
are forbidding use of the vendors' facilities, and implementing our own. We
are not at all happy about having to do this, and are critical of vendors'
slow adoption of things such as UTF-8 and 14651.
- I don't agree that the differences between the different approaches in
different programming languages are in the same category as the problem
above. However, I would like to make a point here that has not been made
yet:
The most significant objective that an i18n standardization group could
achieve is a specification for a minimum *set of cultural issues* where
conforming systems should support variability, and a standard way of
*interchanging the encoding rules* for that variability. This is where
international expertise is most needed and most effective. It is also where
existing vendors tend to have the fewest opinions and vested interests that
bog down the standardization process. The Austin Group in particular has
said that they are waiting for direction from ISO before doing any further
work in these areas.
(It's also the area where they get themselves into the most trouble on their
own. A good example of this is the current Austin Group discussion on
"collation order" versus "collation sequence" in regular expressions.)
We have already achieved a lot in this area in the form of 10646 and 14651.
It's unfortunate that the next step, 14652, failed to become anything beyond
a TR. As a user, I have a strong interest in seeing this work go forward. I
can't imagine a better place than SC22/WG20. The areas most affected are the
PL's and OS's, but none of them have the expertise to put together a
statement of generic issues and solutions.
Having said all that, I agree the lack of new progress shows that a review
is in order. We should ask the other groups, especially those in SC22, what
their concerns are in this area, and what sort of process they would buy
into that would allow us to move forward.
/glen
Norway, Keld Simonsen, September 7, 2000
On Wed, Sep 06, 2000 at 09:43:45AM -0400, Winkler, Arnold F wrote:
>
> From: Kenneth Whistler [kenw@sybase.com]
> Sent: Friday, September 01, 2000 4:40 PM
> Subject: Some technical issues regarding the future of SC22/WG20
>
> ================================================================
>
> Arnold Winkler has recently raised a number of issues regarding the future
> of SC22/WG20 and the standards that it maintains or has under
> development, for consideration at the upcoming SC22 plenary in Nara.
> Chief among the issues he raised is whether WG20 is now at the
> end of its useful life, and whether it should be sunsetted, with
> its various projects redistributed over time to other committees as
> appropriate for maintenance.
As I have written in another message, I see WG20 as just now
being able to get to work on real issues. WG20 has been in
the process of taking control over its subject, producing
standards that were the best in SC22 on the subject, but
in essence not much better than say, C, C++, Ada, Fortran, COBOL
or POSIX specifications.
There is a long way to go, if we want ISO standards to be leading in
the field, e.g. in the area of APIs. Current widespread APIs on the
market, like the Microsoft NT APIs or IBM ICU, have maybe 3 times as many
APIs as what we have worked on in WG20.
It is quite normal that ISO standards do not contain all functionality,
and for WG20 the work item was actually restricted to a specific,
quite small set of functionality when the NP was accepted.
> 1. Collation
>
> Furthermore, among the active participants in WG2 are the experts
> on collation (with implementation experience) who actually ended
> up authoring much of the content of 14651. Comparable experience is
> not obviously available in the SC22 committees other than WG20.
> Furthermore, because of the current close working relationship
> between WG2 and the Unicode Technical Committee, WG2 is also the
> best place to maintain a standard that should stay in synch with
> the Unicode Collation Algorithm maintained by the UTC, to prevent
> unanticipated "drift" between the two standards.
The argument that the sorting expertise is in SC2 is a myth.
I do not encounter sorting experts in SC2 beyond the ones I already
know in SC22. And some SC22 experts I rarely see in SC2.
Furthermore, it is important that there be a strong relation to SC22
producers, so that there is no "drift" from other SC22 sorting
specifications in this area, such as the POSIX or C specs.
> 3. Character Properties
>
> The most contentious issue regarding DTR 14652 is the effort to
> extend LC_CTYPE to cover the repertoire of ISO 10646-1. The contending
> positions effectively reflect a worldview divide among the participants
> regarding character properties:
>
> Position A: Character properties have not traditionally been covered
> by character encoding standards, and have not been viewed as the
> domain of the ISO committee responsible for encoding characters: SC2.
> Instead, character properties are an implementation issue, traditionally
> dealt with in the standards most directly concerned with character
> implementation -- namely the formal language standards -- and are
> dealt with in ISO by the working groups under SC22. In the context
> of 14652, the appropriate place to define character properties is
> LC_CTYPE, where the properties would be usable in a POSIX context as
> part of locale definitions.
>
> Position B: Character properties for the *universal* character set --
> namely ISO 10646 (= Unicode) are inherent to *characters*, and should
> *not* be defined in locales. The locale model and LC_CTYPE were an
> attempt to provide a mechanism for dealing with properties of characters
> in alternate encodings, but that model does not scale well for dealing
> with properties for the universal repertoire of 10646. Furthermore,
> it is inappropriate to assert that character properties are defined
> in locales, and are thus subject to locale-specific variation, since
> such a position would lead to inconsistent and inexplicable differences
> in application behavior, depending on locale, in ways that have
> no bearing on the usually understood issues of locale-specific
> formatting differences, etc. Because character properties are closely
> tied to the characters themselves, responsibility for defining them
> should belong with the character encoding committees, rather than
> with the language committees -- and thus in SC2, rather than SC22.
>
> It is clear that among the rather large community of implementers
> of 10646 (= Unicode), Position B has much more widespread support
> than Position A. Position A is, however, a vocally held minority
> opinion among those committed to the extension of the POSIX framework.
On the other hand, in the UNIX/POSIX/C circles Position A is much
more widespread. Position B is voiced very actively by a small
group of about 20 companies in the Unicode consortium.
In terms of machines actually employing the two different positions,
there are about 20 million or more in the UNIX/Linux community using
it in the Position A way, while Position B is standard only on
Windows 2000, which has fewer than 10 million systems installed.
However, the difference between Position A and B is in practice
not big. Most agree that attributes are associated with characters;
however, there are some culturally dependent character properties,
such as the Turkish mappings between uppercase and lowercase
for the letter "I" and the display of native digits.
> In point of actual fact, the *real* work on standardization of
> 10646 character properties is being done almost entirely
> by the Unicode Technical Committee, which for years now has been
> publishing machine-readable tables of character properties and
> associated technical reports that are in widespread implementation
> in many products. A very few character properties, most notably
> "combining" and "mirroring", are also formally maintained by SC2/WG2 in
> ISO 10646 itself, and those properties are tracked in parallel by
> the UTC.
There has also been a lot of work going on in POSIX circles,
with character properties for more than 20,000 characters
already defined in the POSIX.2 standard that was finished
in 1992. It is maybe a sign of how well researched the Unicode
specifications are that this fact is still unnoticed by prominent
Unicode people.
> On balance, it would seem far preferable to conclude that within
> JTC1 any responsibility for character properties should belong
> to SC2, rather than SC22. Once again, this is a matter of expertise
> regarding the huge number of characters in 10646. That expertise
> is in SC2, and not in SC22. And the implementation experience
> regarding character properties resides in the UTC, which has a
> firm working relationship with SC2, but no close ties to SC22.
Again, the existence of SC2 experts in this area is a myth.
I believe that Unicode has experts, but they are as well connected
to WG20 as to SC2, having C liaison status in both groups.
Furthermore the Unicode technical committee chairman, Arnold Winkler,
is the convener of WG20. No high-ranking Unicode officers have
the same level of office in SC2.
SC2 has for a long time said that they were only into the encoding
of characters, not the meaning. I still think this is a reasonable
approach.
> Regarding LC_CTYPE in particular, the maintenance or extension of
> LC_CTYPE should be remanded to WG15, along with all of DTR 14652,
> but with the following recommendations: Rather than attempting to
> independently extend LC_CTYPE definitions to cover 10646, a mechanism
> should be developed whereby POSIX implementations using LC_CTYPE
> can make use of the more widespread and better researched and
> reviewed character property definitions developed by the UTC, in
> cooperation with SC2/WG2's development of 10646. This should be
> done by *reference*, rather than by enumerating lists of characters
> in SC22 standards or TR's, because of the danger of those lists
> getting out of synch or introducing errors that cause interoperability
> problems. Furthermore, this practice of dealing with character
> properties by reference to UTC and/or SC2 developed standards
> for them, should be recommended to *all* the SC22 committees, as
> the generic way to deal with character properties in formal
> language standards.
As said before, POSIX specs are more widespread than Unicode's,
in terms of systems employing them, and it seems like they may be
better researched, as they have included Unicode specifications
in their research, while Unicode still to this date is unaware of
their bigger competitor...
>
> 4. Internationalization API Standard
>
The i18n API project is another WG20 project to take control of
the subject of i18n, to become masters of our own house.
It is admittedly not very advanced compared to some industry
APIs; this is partly due to SC22's decision to make a restricted
API. It does, however, offer more functionality than most
programming languages standardized in SC22, and it aims to take a
lead for SC22 standardization in this area.
> No one in WG20 but the project editor seems to be doing any active
> work to develop the API standard for internationalization, and the
> committee feedback to date has largely been that the quality of
> the drafts is poor. Fundamental questions regarding the nature
> of the API design have not been resolved. Furthermore, there has
> been a lot of hand-waving over the issue of how closely tied the
> proposed API is to the locale extension constructs of DTR 14652.
> The API under development for 15435 is locale-centric, in that
> it requires information in an "FDCC-set" defined a la DTR 14652,
> assuming API behavior will depend on that information, resident
> in some implementation-defined "database".
> Modern internationalization libraries have largely eschewed that
> kind of locale-centric design as too constrained, instead breaking up
> the problem of internationalization support into more modular
> designs that separate out different aspects of the problems
> involved.
Some modern i18n libraries still use locale-centric behaviour,
including POSIX compatible systems. As POSIX compatible
operating systems are the only major operating systems
gaining significant market shares these days, it cannot be all
that bad. The i18n system of POSIX furthermore has facilities
so that you can orchestrate your own localization, which is
a virtue of the model. It is only recently that these mechanisms
have been taken up, e.g. in Microsoft systems, while POSIX systems
have done this for years. Java also has very similar concepts,
although its proponents may maintain that it is completely different.
Seen from a user's perspective, i18n using the POSIX model works
very well, in my personal experience.
The POSIX model is extensible, and is the only ISO standardized
model.
> Furthermore, the proposed API standard aspires to platform
> independent design. That, however, inappropriately conflates the
> issue of designing appropriate behavior for internationalization
> with the problem of designing appropriately abstracted API's
> for that behavior on distinct platforms. In actual practice,
> implementers are tending to make use of available libraries that
> surface correct internationalization behavior (such as the
> ICU classes) and then writing whatever wrappers are necessary to
> abstract that behavior into their systems. The days of trying
> to define complex behavior via ISO API standards, to be rolled
> out by language compiler vendors in standard C libraries and such,
> are being overtaken by object-oriented design and software
> component models.
Portability across platforms is one of SC22's hallmarks,
and we achieve it well with other standards such as the programming
language standards. The situation described above is
just the one SC22 is set up to solve.
> At this point, WG20's project 15435 should just be abandoned as
> a well-intentioned but obsolete project that has no demonstrated
> need or support for its development.
The 15435 standard is primarily set up for other PL standards.
And furthermore, it is already implemented on major platforms
in major compilers (GNU C/C++).
> 6. Identifiers
WG20 was quite capable of producing the annex to TR 10176 on identifiers,
and quite successful in getting it adopted by the Programming
Languages. WG20 has thus demonstrated its capabilities in
this area, and there is no need to move the subject to somebody else.
WG20 even succeeded in getting Unicode to adopt the specifications.
> This entire issue, is, by the way, also of intense interest to
> the Database standards arena, where it is of direct relevance
> to the SQL standard, for example. So the SC22 working groups are
> not the only JTC1 groups with an interest in standard,
> interoperable results in this area for 10646 characters.
WG20 has liaison to the SQL WG, and furthermore acts as a focal
point for i18n for all of JTC 1, according to JTC 1 decisions.
Kind regards
Keld Simonsen
W3C, Martin Dürst, September 11, 2000
From: Martin J. Duerst [duerst@w3.org]
Sent: Monday, September 11, 2000 2:10 AM
To: John Hill, ISO/IEC JTC1 SC22 Chair
Cc: Lisa Rajchel, ISO/IEC JTC1 SC22 Secretariat at ANSI
Arnold Winkler, ISO/IEC JTC1 SC22 WG20 Convener
Type of document: Liaison Contribution
Subject: Future directions for WG20
For consideration at the Nara meeting of SC22
W3C herewith supports Arnold Winkler's recent proposal for the
future of SC22/WG20.
The experience with the internationalization of a wide range of
specifications at W3C strongly shows the following:
- The range of specifications with internationalization needs
extends far beyond programming languages and includes document
and data formats and protocols.
- Programming languages become more and more diverse, and most
of a program's internationalization functionality is handled
as part of libraries (input/output and user interface) where
diversity is even bigger than in the programming language core.
- Internationalization cannot be done in isolation, but needs to
be done by the committee responsible for the 'base' standard,
with the participation, contribution, and review from
internationalization experts. The main common base is the
universal character set (ISO/IEC 10646).
With respect to the current work items of SC22/WG20, our input
is as follows:
- Sorting/Collation Standard (14651): The standard itself is close
to completion, and should be completed by SC22/WG20. SC2/WG2 is the
optimal place for further work on the data needed for the standard.
- List of characters for identifiers (Appendix to TR 10176):
Again SC2/WG2 is the optimal place to extend this work to
newly encoded characters.
- API for Internationalization (15435): Given the large variance
across programming languages, and the increased importance of
libraries and user interface components, a general API for
internationalization is highly inappropriate.
- Registry for cultural conventions (ISO/IEC 15897): A good
documentation on cultural conventions is very helpful for
implementers of all kinds of information technology. In order
to be of real value, the registry should:
- Make the full information available on the World Wide Web.
- Accept incomplete contributions (e.g. when only part
of some cultural conventions are known or established).
- Provide a full revision history for official registrations.
- Accept contributions not only from the relevant national
bodies, but also from the general public (and e.g. label
them as 'not verified').
- Accept multiple contributions for the same locale
(and label them appropriately).
- Besides registered information, provide pointers to related
information elsewhere, in print or on the WWW.
Once the registry is set up appropriately, the task of
WG20 in this area can be considered completed.
The Type C Liaison between SC22/WG20 and the World Wide Web
Consortium (W3C), in particular the W3C Internationalization
Working Group (SC22 N3073) has been established to coordinate
internationalization issues between these two groups. Completion
of the current SC22/WG20 tasks as proposed by Arnold Winkler
and as discussed above, and transfer of the remaining
character-related responsibilities to SC2/WG2 completely satisfy the
needs of W3C and simplify the interaction between W3C and
ISO/IEC JTC1 in the area of internationalization, because
W3C has already established a liaison with SC2/WG2.
Yours sincerely, Martin J. Dürst.
Norway, Keld Simonsen, September 12, 2000
Arnold Winkler has recently raised a number of issues regarding the future
of SC22/WG20 and the standards that it maintains or has under
development, for consideration at the upcoming SC22 plenary in Nara.
Chief among the issues he raised is whether WG20 is now at the
end of its useful life, and whether it should be sunsetted, with
its various projects redistributed over time to other committees as
appropriate for maintenance.
However, I see WG20 as just now
being able to get to work on real issues. WG20 has been in
the process of taking control over its subject, producing
standards that were the best in SC22 on the subject, but
in essence not much better than say, C, C++, Ada, Fortran, COBOL
or POSIX specifications.
There is a long way to go, if we want truly internationalized,
portable applications, and ISO standards to be leading in
the field, here in the area of APIs. Current widespread APIs on the
market, like the Microsoft NT APIs or IBM ICU, have maybe 3 times as many
APIs as what we have worked on in WG20.
It is quite normal that ISO standards do not contain all functionality,
and for WG20 the work item was actually restricted to a specific,
quite small set of functionality when the NP was accepted.
In general, I think the standardization of APIs and formats for data
specifications is best done in SC22, which standardizes
libraries, and also interacts with the many ISO programming languages.
Moving WG20 activities into SC2, as Arnold Winkler proposes,
would be an error, IMHO.
APIs are not in the scope of SC2. Neither are sorting or
character attributes. And sorting and character attributes
have for a long time been an SC22 issue, viz. islower(), isupper(),
etc. in C and other programming languages.
In the following I will give some comments on each of WG20's projects.
1. Collation
The argument that the sorting expertise is in SC2 is a myth.
The only sorting expert I encounter in SC2, beyond the ones I already
know in SC22, is Michael Everson. And a number of SC22 experts who
always come to the WG20 meetings (at least during the last 2 years)
come less regularly to SC2 meetings; this includes Ken Whistler,
Marc Küster, Kent Karlsson, Takata-san, and myself.
3. Character Properties
One school of thought, represented foremost by Unicode people,
thinks that character properties, such as what is a letter, digit,
or special character, are inherent properties of the character itself
and cannot be changed, while another school thinks that character
properties may be culturally dependent, as per a C/C++/POSIX locale.
In terms of machines actually employing the two different positions,
there are about 20 million or more in the UNIX/Linux community using
it in the locale way, while the Unicode way is standard only on
Windows 2000, which has fewer than 10 million systems installed.
However, the difference between the two schools of thought is in practice
not big. Most agree that attributes are associated with characters in a
fixed way; however, there are some culturally dependent character
properties, such as the Turkish mappings between uppercase and lowercase
for the letter "I" and the display of native digits.
On character properties, there has been some work going on
in Unicode, but also work going on in POSIX circles,
with character properties for more than 20,000 characters
already defined in the POSIX.2 standard that was finished
in 1992. It seems that this work has to this date not been
noticed by prominent Unicode people.
It is also a myth that the experts in this area are foremost
in SC2. I believe that Unicode has experts, but they are as
well connected to WG20 as to SC2, having C liaison status in both
groups. Beyond the Unicode people I see very few experts in SC2 on
this matter. On the other hand there are experts in SC22, including
experts in the different language WGs, the POSIX WG, and myself.
That Unicode should be less connected to WG20 than to SC2 is for
me hard to understand, with Unicode having category C liaison in both
places, and furthermore the Unicode technical committee chairman,
Arnold Winkler, being the convener of WG20. No high-ranking Unicode
officers have the same level of office in SC2.
SC2 has for a long time said that they were only into the encoding
of characters, not the meaning. I still think this is a reasonable
approach.
3. Cultural conventions specification standard, TR 14652
As said before, POSIX specs are more widespread than Unicode's,
in terms of systems employing them, and it seems like they may be
better researched, as they have included Unicode specifications
in their research, while Unicode still to this date is unaware of
their bigger competitor...
4. Internationalization API Standard
Some modern i18n libraries use locale-centric behaviour,
including POSIX compatible systems. As POSIX compatible
operating systems are the only major operating systems
gaining significant market shares these days, it cannot be all
that bad. The i18n system of POSIX furthermore has facilities
so that you can orchestrate your own localization, which is
a virtue of the model. It is only recently that these mechanisms
have been taken up, e.g. in Microsoft systems, while POSIX systems
have done this for years. Java also has very similar concepts,
although its proponents may maintain that it is completely different.
Seen from a user's perspective, i18n using the POSIX model works
very well, in my personal experience.
The POSIX model is extensible, and is the only ISO standardized
model.
Portability across platforms is one of SC22's hallmarks,
and we achieve it well with other standards such as the programming
language standards. Also in the area of i18n, SC22 and JTC 1
should strive for application portability.
The 15435 standard is primarily set up for other PL standards.
And furthermore, it is already implemented on major platforms
in major compilers (GNU C/C++).
6. Identifiers
WG20 was quite capable of producing the annex to TR 10176 on identifiers,
and quite successful in getting it adopted by the Programming
Languages. WG20 has thus demonstrated its capabilities in
this area, and there is no need to move the subject to somebody else.
WG20 even succeeded in getting Unicode to adopt the specifications.
WG20 has liaison to many parties inside and outside of SC22,
including the SQL WG, and furthermore acts as a focal
point for i18n for all of JTC 1, according to JTC 1 decisions.
Kind regards
Keld Simonsen
USA, Ken Whistler, September 12, 2000
Keld responded to a number of the concerns I had surfaced on
behalf of the U.S. committee. Here are some countercomments
which may lead into the discussion which is sure to ensue during
the upcoming Malvern meeting of WG20.
> > From: Kenneth Whistler [kenw@sybase.com]
> > Sent: Friday, September 01, 2000 4:40 PM
> > Subject: Some technical issues regarding the future of SC22/WG20
> >
> > ================================================================
> >
> > Arnold Winkler has recently raised a number of issues regarding the future
> > of SC22/WG20 and the standards that it maintains or has under
> > development, for consideration at the upcoming SC22 plenary in Nara.
> > Chief among the issues he raised is whether WG20 is now at the
> > end of its useful life, and whether it should be sunsetted, with
> > its various projects redistributed over time to other committees as
> > appropriate for maintenance.
>
> As I have written in another message, I see WG20 as just now
> being able to get to work on real issues. WG20 has been in
> the process of taking control over its subject, producing
> standards that were the best in SC22 on the subject, but
> in essence not much better than say, C, C++, Ada, Fortran, COBOL
> or POSIX specifications.
This is, unfortunately, a sad commentary on the quality of the
I18N work coming out of WG20 to date, and I concur with Keld's
assessment!
>
> There is a long way to go, if we want ISO standards to be leading in
> the field, eg in the area of APIs. Current widespread APIs on the
> market like Microsoft NT APIs or IBM ICU have maybe 3 times as many
> APIs that what we have worked on in WG20.
...and much greater sophistication, as well as precision of definition.
And you neglected to mention Java in this list.
As for the presupposition here, that ISO standards should be leading
this field, see below. I agree with the essential assessment
that WG20 is *way* behind. But I differ with Keld in that I don't
think there is any feasible way for WG20 to do a decent job of
providing an I18N API standard.
>
> It is quite normal that ISO standards do not contain all functionality,
> and for WG20 the work item was actually restricted to a specific
> quite small set of functionality when the NP was accepted.
I don't think there was any "specific quite small set of
functionality" defined in the NP. All along, the coverage of
15435 has essentially been precisely what the editor intended
it to be; I see no evidence of principled direction from the
committee that set or constrained the initial scope of the
proposed standard.
>
> > 1. Collation
> >
> > Furthermore, among the active participants in WG2 are the experts
> > on collation (with implementation experience) who actually ended
> > up authoring much of the content of 14651. Comparable experience is
> > not obviously available in the SC22 committees other than WG20.
> > Furthermore, because of the current close working relationship
> > between WG2 and the Unicode Technical Committee, WG2 is also the
> > best place to maintain a standard that should stay in synch with
> > the Unicode Collation Algorithm maintained by the UTC, to prevent
> > unanticipated "drift" between the two standards.
>
> The argument that the sorting expertise is in SC2 is a myth.
> I do not encounter sorting experts in SC2 - beyond the ones I already
> know in SC22. And some SC22 experts I rarely see in SC2.
Perhaps this is a result of attending more to SC2 committee matters
per se, rather than to WG2 or its liaison relation to the UTC.
Here are some examples: 4 experts on Myanmar sorting issues
at WG2 in London; 1 expert on Tibetan sorting at WG2 in London,
and *megabytes* of Tibetan input on a UTC hosted discussion list;
input on Kannada sorting from an expert just last week at the
International Unicode Conference; numerous other Indic inputs
from Jeroen Hellingham and other experts on the Unicode discussion
lists; Chinese input on Yi sorting issues in WG2 in London,
Fukuoka, and Beijing; participation from Arabic and Syriac
experts; Joe Becker; Asmus Freytag; Tex Texin (implemented at
Progress); Gary Richards (implemented at NCR); implementers
from Oracle; the designers and implementers of sorting in Java;
the designers and implementers of sorting in the IBM ICU; and
last, but not least, the designers and implementers of ML sorting
at Microsoft.
Would you care to make a corresponding, explicit list of the
SC22 experts in sorting that you rarely see in SC2, and what
their contributions might be to solving issues that must be
faced in extending the 14651 tables to cover such scripts as
Myanmar, Khmer, Mongolian, and Yi?
> Furthermore it is important that there be a strong relation to SC22
> producers so there will not be a "drift" from other SC22 sorting
> specifications in this area, such as POSIX or C specs.
Cute, but irrelevant. The standards to maintain in synch now
are ISO 14651 (when it is published), and the Unicode Collation
Algorithm. Everything else constitutes defined deltas from those
standards, if you are talking about the tables to specify
ordering. If, on the other hand, you are talking about drift
in API's, that is also irrelevant, since my claim is that WG20
should not be making an API in this area.
>
> > 3. Character Properties
I'll pick up this topic separately.
> > 4. Internationalization API Standard
> >
> The i18n API project is another WG20 project to take control of
> the subject of i18n, to become masters of our own house.
> It is admittedly not very advanced, compared to some industry
> APIs; this is partly due to SC22's decision to make a restricted
> API. It does, however, provide more functionality than most
> programming languages standardized in SC22, and aims to take a
> lead for SC22 standardization in this area.
This point was addressed by the W3C contribution on this topic from
Martin Dürst.
It is one thing to set general direction and requirements for
internationalization of programming languages, as in TR 11017
and TR 10176, but it is quite another to set out to create and
standardize an API in this area. The approach that WG20 is taking
flies in the face of good practice in API design: it has no
clear set of requirements to begin with, it has no guiding
architecture for the specifics of the API, and it has no well-defined
relationship to the *particular* language standards it is supposedly
being developed for.
The editor of 15435 has been pointedly ignoring the message from
internationalization experts from the OS and tools vendors that
no such ISO standard is needed or desired, and instead seems to
be listening primarily to the GNU C/C++ developers and to
plaintive calls from other SC22 working groups hoping that
WG20 will *solve* their internationalization problems. Even the
Linux internationalization experts have rejected involvement with
15435, and that is a community that the editor indirectly keeps
pointing to in order to justify WG20 projects.
>
> > Modern internationalization libraries have largely eschewed that
> > kind of locale-centric design as too constrained, instead breaking up
> > the problem of internationalization support into more modular
> > designs that separate out different aspects of the problems
> > involved.
>
> Some modern i18n libraries still use locale-centric behaviour,
> including POSIX compatible systems. As POSIX compatible
> operating systems are the only major operating systems
> gaining significant market shares these days, it cannot be all
> that bad.
Excuse me, but unless you have something strange in mind, you are
talking here about the growth in popularity of Linux. But the
Linux I18N group has rejected 15435 as an approach to dealing
with internationalization of Linux. How is that an argument for WG20
continuing work on 15435?
> The i18n system of POSIX furthermore has facilities
> so that you can orchestrate your own localization, which is
> a virtue of the model. It is only recently that these mechanisms
> have been taken up, e.g., in Microsoft systems, while POSIX systems
> have done this for years. Java also has very similar concepts,
> although they may maintain that it is completely different.
> Seen from a user's perspective, i18n using the POSIX model works
> very well, in my personal experience.
I am afraid this is looking at the problem with rose-colored
microscopes.
It is generally acknowledged that Unix systems have the least
flexible internationalization, least complete localization, and
least advanced Unicode support of all the major platforms. That
doesn't mean the Unix implementers aren't working on it, but from
an end-user's point of view they don't hold a candle to what is
available on Microsoft, Apple, or Java platforms.
Sure the POSIX model lets a Unix *developer* "orchestrate your
own localization", but you have to be a programmer, a standards
reader, and a system administrator as well to do so on most
systems. Most Unix end users are simply enslaved to whatever
templates got rolled out by their company's system administrators,
and cannot change a damn thing on their own. Most *real* Unix
installations, as opposed to developer machines with Unixoids
playing with the source code, simply run out some defined list
of precompiled locales, and the SA establishes settings for
the installation scripts. Woe to any end user who actually tries
to create some non-standardized behavior by manipulating LC_XXX
environment settings on their own -- that usually just results in
some program refusing to run or dishing out error messages
about missing files or messages.
And the computer end user who actually knows enough to even
try manipulating LC_XXX environment values is already in
the 99th percentile of computer experts. Try talking to some
*real* end users sometime.
>
> The POSIX model is extensible, and is the only ISO standardized
> model.
Another sad state of affairs.
>
> > Furthermore, the proposed API standard aspires to platform
> > independent design. That, however, inappropriately conflates the
> > issue of designing appropriate behavior for internationalization
> > with the problem of designing appropriately abstracted API's
> > for that behavior on distinct platforms. In actual practice,
> > implementers are tending to make use of available libraries that
> > surface correct internationalization behavior (such as the
> > ICU classes) and then writing whatever wrappers are necessary to
> > abstract that behavior into their systems. The days of trying
> > to define complex behavior via ISO API standards, to be rolled
> > out by language compiler vendors in standard C libraries and such,
> > are being overtaken by object-oriented design and software
> > component models.
>
> Portability across platforms is one of SC22's hallmarks,
> and we achieve it well with other standards such as the programming
> language standards. The situation described above is
> just the one SC22 is set up to solve.
This is baloney.
The portability across platforms that SC22 aspires to (and largely
achieves) is portability of the source code for a particular
language (and its associated algorithmic semantics) across
platforms. This enables the building of conformant language
compilers on many platforms, and even cross-platform compilers
that merely substitute out machine-specific code-generation and
optimizer modules.
What I was alluding to is the issue of portability of API's across
different language standards, which is a whole different kettle
of fish. Even between C and C++, which are designed to be closely
compatible, you cannot take an object oriented C++ API and simply
"port" it to C -- the principles are just entirely different,
and C doesn't have the mechanisms to express an object-oriented
API (although you can try to emulate it with clever fakery). Now
consider trying to do the same thing for C++ and FORTRAN. It would
be ludicrous.
>
> > At this point, WG20's project 15435 should just be abandoned as
> > a well-intentioned but obsolete project that has no demonstrated
> > need or support for its development.
>
> The 15435 standard is primarily set up for other PL standards.
> And furthermore, it is already implemented on major platforms
> in major compilers (GNU C/C++)
"The 15435 standard ... is already implemented on major platforms..."
Pardon me if I do a double take on this particular one.
15435 has not yet even seen a draft that has been approved to go
out for a CD ballot. It is 2 years away from being a standard, even
if we all agreed on its content in November and decided to progress
it to a CD ballot.
And if the whole point of this exercise is to standardize some
practice in the GNU C/C++ compiler community, why isn't that
implementation clearly on the table, identified as such, with
the appropriate manpages and documentation, so that WG20 can
evaluate existing practice and its appropriateness for
standardization? Instead, WG20 has been treated to very bad
drafts in 15435 that have waffled all over the map about their
approach, and which have no clear relation to *any* implementation.
>
> > 6. Identifiers
>
> WG20 was quite capable of producing the annex on 10176 on identifiers
> and quite successful in getting it adopted by the Programming
> Languages. WG20 has thus demonstrated its capabilities in
> this area and there is no need to move the subject to somebody else.
> WG20 even succeeded in getting Unicode to adopt the specifications.
No, this is a misrepresentation of the facts.
WG20 succeeded in getting Unicode's attention regarding unprincipled
differences between what WG20 was recommending and what the Unicode
Consortium was recommending. Then there was joint work which resulted
in some changes to both (including an Amendment to 10176), so as
to minimize interoperability differences in the two approaches.
The UTC still recommends a superset of what TR 10176 suggests,
and TR 10176 has not yet addressed issues of normalization or
other specifics regarding use of extended identifiers on the
Internet.
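As a small editorial illustration of why normalization matters for extended identifiers (using Python's standard library, not anything specified in TR 10176 itself): two identifier spellings that render identically can differ at the code point level until normalized.

```python
import unicodedata

# Hypothetical identifier containing a compatibility character:
# U+FB01 LATIN SMALL LIGATURE FI versus the two letters "f" + "i".
ligature = "\ufb01le"   # "file" spelled with the fi ligature
plain = "file"

assert ligature != plain    # distinct code point sequences

# NFKC normalization folds the ligature into "fi", so the two
# spellings become the same identifier:
assert unicodedata.normalize("NFKC", ligature) == plain
```

A language standard that admits extended identifiers without specifying a normalization form leaves it open whether these two spellings name the same thing.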
>
> > This entire issue, is, by the way, also of intense interest to
> > the Database standards arena, where it is of direct relevance
> > to the SQL standard, for example. So the SC22 working groups are
> > not the only JTC1 groups with an interest in standard,
> > interoperable results in this area for 10646 characters.
>
> WG20 has liaison to the SQL WG, and furthermore acts as a focal
> point for i18n for all of JTC 1, according to JTC 1 decisions.
We all know there has been zero input in either direction between
the SQL WG and WG20. The input on internationalization in the
SQL WG has all been coming in from external connections -- through
communications between the internationalization experts and the
SQL experts in the database companies, largely. The internationalization
in SQL is the result of Jim Melton working in database companies,
not the result of Jim Melton talking to WG20.
JTC 1 may decide that WG20 *shall* act as a focal point for all
internationalization in JTC 1 committees, but that doesn't make it
happen.
--Ken
USA, Ken Whistler, September 13, 2000
Now I am going to take up Keld's assertions about character properties.
> > 3. Character Properties
> >
> > The most contentious issue regarding DTR 14652 is the effort to
> > extend LC_CTYPE to cover the repertoire of ISO 10646-1. The contending
> > positions effectively reflect a worldview divide among the participants
> > regarding character properties:
[snip]
> >
> > It is clear that among the rather large community of implementers
> > of 10646 (= Unicode), Position B has much more widespread support
> > than Position A. Position A is, however, a vocally held minority
> > opinion among those committed to the extension of the POSIX framework.
>
> On the other hand, in the UNIX/POSIX/C circles Position A is much
> more widespread. Position B is voiced very actively by a small
> group of about 20 companies in the Unicode consortium.
This "small group" includes Sun, IBM, HP, and Compaq, which companies,
between them, account for the majority of enterprise Unix installations.
It also includes Oracle, Sybase, NCR, IBM, Microsoft, and Progress, which
between them account for the vast majority of commercial database
installations, many of them running on Unix platforms -- including
Linux.
> In terms of machines actually employing the two different positions,
> there is about 20 million or more in the UNIX/Linux community using
> it in the Position A way, while Position B is only standard on
> Windows 2000, which has fewer than 10 million systems installed.
Well, this just goes to show, there are lies, damn lies, and then
there are statistics.
How about some counter-statistics...
Information from International Data Corporation (IDC), IT Forecaster, August 8, 2000.
Worldwide Client Operating Environment New License Shipment Shares 1999
Windows 87.7%
MacOS 5.0%
Linux 4.1%
Other 3.8%
Worldwide Server Operating Environment New License Shipment Shares 1999
Windows NT 36%
Linux 24%
NetWare 19%
Unix 15%
Host/Server 3%
Other 3%
Linux is growing rapidly in the low-end server market, it is true.
But a very large proportion of the Linux installations are running
web servers and/or file and print services, with no significant
front end user interaction. And on the web servers at least, the
installations are dishing up HTML pages, JavaScript, and Java apps
that are Unicode compliant. So even if the lowest level OS is
POSIX compliant, they are running layers of software that deal
with characters the Unicode way. Another example: Sybase ships its
database software for Linux platforms now (as do most other database
companies). That software supports character properties in databases
the Unicode way, and does not depend on POSIX-compliant localization
at the OS level to make decisions about how to treat data.
So simple-minded tossing out of numbers about X-million systems
installed as a way of supporting a particular technical approach
to defining character properties is nothing more than smoke and
mirrors.
>
> However, the difference between Position A and B is in practice
> not big. Most agree that attributes are associated to characters,
> however there are some culturally dependent character properties,
> such as the Turkish mappings between uppercase and lowercase
> for the letter "I"
The local specifics for Turkish case mapping are well known. All
this (and all other instances we know about) is documented in
SpecialCasing.txt on the Unicode website.
But case mapping is not even properly a "property" of characters --
it is a *relation* between pairs (or triples) of characters, and
only the most obvious of many such types of relations. The
relation between Hiragana and Katakana is another such relation.
And simply because there are some celebrated (and acknowledged)
locale-specific differences in case mapping for a few characters
does not mean that one therefore needs to specify the entire
apparatus of character property definition *inside* locale
definitions. That is definitely an instance of the tail wagging the
dog.
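The point that case mapping is a relation between character sequences, not a per-character attribute, can be checked directly; a minimal sketch in Python, whose str casing follows the Unicode default and SpecialCasing mappings (an editorial illustration, not part of either standard):

```python
# The default (locale-independent) Unicode lowercase mapping of
# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE is the two-character
# sequence "i" + U+0307 COMBINING DOT ABOVE: one character maps to
# two, so "lowercase of" cannot be a simple per-character property.
dotted_I = "\u0130"

assert dotted_I.lower() == "i\u0307"
assert len(dotted_I.lower()) == 2

# The Turkish tailoring (I-dot lowercases to plain "i"; dotless I
# pairs with "I") is a locale-specific *exception* recorded in
# SpecialCasing.txt; it does not require moving the whole character
# property apparatus inside locale definitions.
```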
> and display of native digits.
This is an artifact of incorrect old implementations of Arabic that
did not have proper encodings for characters. Those mistakes,
which are not necessary to duplicate in a Unicode implementation,
have no bearing on which committee is in a better position *now*
to specify character properties for 10646.
>
> > In point of actual fact, the *real* work on standardization of
> > 10646 character properties is being done almost entirely
> > by the Unicode Technical Committee, which for years now has been
> > publishing machine-readable tables of character properties and
> > associated technical reports that are in widespread implementation
> > in many products. A very few character properties, most notably
> > "combining" and "mirroring", are also formally maintained by SC2/WG2 in
> > ISO 10646 itself, and those properties are tracked in parallel by
> > the UTC.
>
> There has also been a lot of work going on in POSIX circles,
> with character properties for more that 20.000 characters
> already defined in the POSIX.2 standard that was finished
> in 1992. It is maybe a sign of how well researched the Unicode
> specifications are that this fact is still unnoticed by prominent
> Unicode people.
Well, now, let's just take a look, shall we?
ISO/IEC 9945-2:1993 (E) (= IEEE Std 1003.2-1992), in two volumes, right?
Part 2 is Shell and Utilities, and the majority of the normative
text constitutes the specification of the behavior of the shell,
and of all the POSIX utility programs. Other than a short few pages
about character set definition in general, the only significant
specification of character properties in the entire document can
be found in Annex G (informative) Sample National Profile, pp. 1063 -
1192. And guess what, that sample national profile is none other
than the *Danish* National Profile Example, authored by Keld.
Note also, that Annex G is *informative* in POSIX.2, though
it contains normative-sounding "shall" terminology that sounds
as if it is lifted from a DSA specification.
Annex G is effectively the source of the "i18n" FDCC-set definition
proposed in DTR 14562, minus its Danish-specific component, which
lives on, instead, in Annex B.1.3.3, the Sample FDCC-set
specification for Danish.
The Unicode participants in the WG20 work have assumed all along
that the "son of POSIX" work in 14652 represented extensions,
corrections, and emendations of any previous work. That would mean
that the 14652 drafts would be the more pertinent to consider in
comparison with current work on Unicode character properties. And
such would not conflict with the editor's own representations about
the status of the tables in the DTR 14652.
But since Keld claims that the Unicode character property work
is not well-researched, because it hasn't taken into account the
published POSIX.2 standard from 1992, maybe it does make sense
to skip past the 14652 drafts and go back to the earlier source.
The only mention of "more tha[n] 20.000 characters" in Annex G
can be found on p. 1066:
"The symbolic ellipsis benefits especially those locale definitions
with large character sets. For example there are about 6000 Kanji
characters in JIS X0208 {B26} and about 20 000 ideographic
characters (in a different order) in ISO/IEC 10646-1 {B13}. To
create a Japanese locale that can support JIS X0208 {B26} and
ISO/IEC 10646-1 {B13} code sets with code-value ellipses, two
separate charmaps and two separate locale definitions must be
created."
Note this is telling you how to create a shortcut representation
of a *charmap*, and doesn't in fact specify any character properties
for anything.
If we actually look for character properties per se, they are found
in the LC_CTYPE section of the Danish sample, pp. 1141 - 1148.
But the characters defined in the LC_CTYPE section depend on
the charmap itself, as specified in section G.6.1, starting on
p. 1152. The introduction to that charmap states: "Symbolic
character names are defined for about 1900 characters, covering
many coded character sets." By the way, notably *not* covering
10646-1:1993. So we are down quite a peg here, from a wild
claim about more than 20,000 characters, to an actual list
of "about 1900" characters.
And just as for the i18n repertoire in DTR 14652, this list is
seemingly arbitrarily culled from 10646-1:1993 to fit some
preconceived notion of what characters might be of particular
interest in Europe (or Denmark in particular, I suppose), but
with all kinds of omissions. Most Latin letters are included,
including those for Vietnamese, but not those for African
languages. Exactly one IPA character is in the list (ezh).
Greek spacing accents are included from the 1FXX block, but
not the rest of precomposed polytonic Greek. In addition to
8859-5 Cyrillic, 5 historic letters (but not all) are included
for OCS, plus one letter for Old Ukrainian, but no other
Cyrillic extensions. Hebrew but no Hebrew points. Basic
Arabic, but only 3 (arbitrary) Arabic extension letters, insufficient
to cover either Persian or Urdu. A bunch of fixed-width spaces,
but not the zero-width space. 4 arbitrarily picked currency signs
from the currency block -- but not all of them. Hiragana and
Katakana, but not halfwidth Katakana forms, nor sufficient
Asian symbols to cover any of the Asian standards. And so
on. In other words, a complete implementation hodge-podge,
full of all kinds of holes.
Well, so the repertoire is an arbitrarily chosen subset, but
what about the properties defined on that repertoire? Let's
take a look.
digit
Defines 0..9, but not the Arabic digits, which are in the repertoire
(unlike all the other digit sets from 10646-1). Presumably this
is not just an oversight, but is related to the claims about
culturally-specific implementation of national digit shapes.
But the fact remains that the Arabic digit characters in the
repertoire are not themselves given any character properties
in the LC_CTYPE definition -- and that just has to be the
wrong approach to those characters.
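For contrast, the Unicode character database assigns the Arabic-Indic digits full numeric properties independent of any locale; a quick check via Python's unicodedata module (an editorial illustration, not drawn from POSIX.2):

```python
import unicodedata

# U+0660..U+0669 ARABIC-INDIC DIGIT ZERO..NINE are category Nd
# (decimal digit) with numeric values 0..9, locale or no locale.
for offset in range(10):
    ch = chr(0x0660 + offset)
    assert unicodedata.category(ch) == "Nd"
    assert unicodedata.digit(ch) == offset

# Numeric parsing works on them out of the box:
assert int("\u0664\u0662") == 42
```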
blank
space
Both of these classes completely overlook the various fixed-width
space characters which are included in the charmap. Oops!
upper
Somehow this definition manages to miss the uppercase Roman
numerals and parenthesized letter compatibility characters whose
lowercase forms *are* listed in the lower class. Oops!
lower
This section incorrectly specifies small Hiragana and small Katakana
characters as belonging to this lower class. Oops!
alpha
This class incorrectly specifies the parenthesized letter and
circled letter compatibility characters as being in the alpha
class. Any parsing operation depending on isalpha() will get the
wrong answer in that case, since those characters are used as
bullet symbols, not as letters per se, and cannot form parts of
words. Oops!
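Each of these cases can be verified against the Unicode character database; again via Python's unicodedata, purely as an editorial illustration:

```python
import unicodedata

# Roman numerals are letterlike numbers (category Nl) that nonetheless
# carry case mappings: U+2160 ROMAN NUMERAL ONE lowercases to U+2170.
assert unicodedata.category("\u2160") == "Nl"
assert "\u2160".lower() == "\u2170"

# Small Hiragana U+3041 is category Lo (letter, no case), not lowercase:
assert unicodedata.category("\u3041") == "Lo"

# Circled "letters" are symbols (So), not letters, so a Unicode-based
# isalpha() correctly rejects them as word constituents:
assert unicodedata.category("\u24b6") == "So"   # CIRCLED LATIN CAPITAL A
assert not "\u24b6".isalpha()
```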
punct
This class follows the venerable (and incorrect) POSIX tradition
of conflating true punctuation with all other kinds of symbols
that happen not to be letters, spaces or digits. Included in
this list are Roman numerals (which are letterlike numeric
symbols), arrows, and math operators, for example. Also included
are the masculine and feminine ordinal symbols, which *do* form
parts of words, and which therefore should be part of the alpha
class, not the punct class. Clearly there is no particular
lesson to be gained here for Unicode character properties, except
a further demonstration that the "punct" class was misconceived
in POSIX in the first place.
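Unicode's General_Category values draw exactly the distinctions this class conflates; a brief check (illustrative Python, not part of either specification):

```python
import unicodedata

# True punctuation versus other symbol classes:
assert unicodedata.category("!") == "Po"        # punctuation proper
assert unicodedata.category("+") == "Sm"        # math operator, not punct
assert unicodedata.category("\u2190") == "Sm"   # LEFTWARDS ARROW

# The ordinal indicators are letters (Lo) that can occur inside words,
# so they belong with the alpha class, not with punct:
assert unicodedata.category("\u00aa") == "Lo"   # FEMININE ORDINAL INDICATOR
assert "\u00aa".isalpha()
```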
So what do we learn by doing the research in the POSIX.2 document?
Basically that the character properties defined there are of
very little elucidative value for Unicode as a whole. And that
the particular class definitions have obvious errors in them,
as well as being incomplete and out-of-date.
Not much of a model to rely on, I would say.
>
> > On balance, it would seem far preferable to conclude that within
> > JTC1 any responsibility for character properties should belong
> > to SC2, rather than SC22. Once again, this is a matter of expertise
> > regarding the huge number of characters in 10646. That expertise
> > is in SC2, and not in SC22. And the implementation experience
> > regarding character properties resides in the UTC, which has a
> > firm working relationship with SC2, but no close ties to SC22.
>
> Again, the existence of SC2 experts in this area is a myth.
Untrue.
For each new script that is encoded in 10646, WG2 (and UTC) depends
on information provided by experts on that script (or elicitable
from experts on that script) to help determine character properties
for those characters. Many of those experts can only participate
in this work through their national body, and may attend WG2
meetings, but not UTC meetings. They certainly don't come to
WG20 meetings.
> I believe that Unicode has experts, but they are as well connected
> to WG20 as to SC2, having C liaison status in both groups.
The people who actually maintain lists of character properties,
write technical reports about them, implement them in libraries
or languages, do tend to be in the UTC, rather than in WG2, for
sure. But they depend on the experts from WG2 (among other sources)
for the primary information about character behavior that is
required for newly encoded characters.
> Furthermore the Unicode technical committee chairman, Arnold Winkler,
> is the convener of WG20. No high-ranking Unicode officers have
> the same level of office in SC2.
Arnold is the *vice*-chair of the UTC, but that is just a quibble.
However, your claim about SC2 is misleading. Michel Suignard is
a technical director of Unicode, Inc., and he is editor of
10646-2. Mike Ksar is a member of the board of directors
of Unicode, Inc., and he is convenor of WG2. Asmus Freytag is
a VP of Unicode, Inc., and he is the UTC liaison officer to WG2.
The UTC has good, working lines of communication into both working
groups. Any attempt to decide where to deal with character properties
on an imagined difference in these lines of communication is doomed
to the dustheap.
> SC2 has for a long time said that they were only into the encoding
> of characters, not the meaning. I think still this is a reasonable
> approach.
This is an incorrect characterization of the current *facts* about
10646, which does include some character semantic specifications
(combining and mirroring). Furthermore, your assertion that it
is a reasonable approach, when it comes to consideration of the
UCS, is not shared by many of the people participating in this
effort.
> > Furthermore, this practice of dealing with character
> > properties by reference to UTC and/or SC2 developed standards
> > for them, should be recommended to *all* the SC22 committees, as
> > the generic way to deal with character properties in formal
> > language standards.
>
> As said before, POSIX specs are more widespread than Unicode's,
> in terms of systems employing them,
This claim is just ludicrous -- either in terms of systems or
in terms of the availability and use of the specifications.
> and it seems like they may be
> better researched, as they have included Unicode specifications
> in their research, while Unicode still to this date is unaware of
> their bigger competitor...
I'll leave others to draw the obvious conclusion here.
--Ken
Please find links to more reactions on the top of this document.