ISO/IEC JTC 1/SC34 N0387

ISO/IEC JTC 1/SC34

Information Technology --

Document Description and Processing Languages

Title: Community contribution: problems with ISO entity sets (ISO 8879 and 9573-13)
Source: David Carlisle, UK
Project: SGML ISO-8879
Public entity sets for mathematics and science ISO 9573-13
Project editor:
Status: Text of solicited community contribution
Action: For consideration of future SC34 activities
Date: 2003-04-02 16:10UTC
Summary: Real-world use of ISO entity sets has revealed issues that need to be addressed for the user community.
Distribution: SC34 and Liaisons
Refer to:
Supersedes:
Reply to: Dr. James David Mason
(ISO/IEC JTC1/SC34 Chairman)
Y-12 National Security Complex
Information Technology Services
Bldg. 9113 M.S. 8208
Oak Ridge, TN 37831-8208 U.S.A.
Telephone: +1 865 574-6973
Facsimile: +1 865 574-1896
E-mail: mailto:mxm@y12.doe.gov
http://www.y12.doe.gov/sgml/sc34/sc34oldhome.htm

Ms. Sara Hafele, ISO/IEC JTC 1/SC 34 Secretariat
American National Standards Institute
25 West 43rd Street
New York, NY 10036
Tel: +1 212 642-4937
Fax: +1 212 840-2298
E-mail: shafele@ansi.org

Problems with ISO entity sets (ISO 8879 and 9573-13)


This document is a personal note on problems encountered whilst working
with ISO entity sets, and in particular in attempting to produce
mappings of the ISO entities to Unicode to enable XML DTD declarations
to be made.  Much of this work has been undertaken as the Editor of the
W3C MathML DTD, but this is a personal note, produced in response to an
email request quoted at the end.


http://www.w3.org/Math/characters/

contains further information, including mappings to Unicode for all the
ISO 8879 and 9573-13 entity sets, whether or not they are used in
MathML.

DocBook has a similar page describing its mappings at

http://www.oasis-open.org/docbook/specs/wd-docbook-xmlcharent-0.3.html






Problem 1: Fixed set
====================

The first problem is that, with a fixed (and small) set of characters
"blessed" with names, the choice of characters, and of their names,
will always be essentially arbitrary. I don't offer any insight here,
but wanted to mention this "problem" first, as a desire to change the
set of named characters is one possible source of pressure to change
the entity sets. (I understand the 9573-13 isonum set was recently
extended to include euro, for example.)


Problem 2: Lack of canonical mapping to Unicode
===============================================

The lack of a canonical mapping to Unicode is currently a severe, almost
fatal, problem when trying to use the entity sets in an XML
context.

Traditionally the ISO sets were defined as SGML SDATA entities.
In this form a canonical mapping to Unicode was not required; in fact
Unicode might not be used at all. An SGML system could hold these
characters as "special character data" until the last minute, when some
suitable rendering could be used.

SDATA is not available in XML and this completely changes the situation.

Problem 2 splits into subproblems:


Problem 2a: Unicode needed in XML
=================================

As SDATA is not available to define the character entities, one only has
the choice of defining the entities to be (Unicode) character data or
some combination of elements and attributes. If the latter choice is
taken then the entity definitions are only usable in document types that
include the elements used. This severely restricts the use of the entity
set definitions as a public standard (and also prevents use of the
characters in attribute values). Thus, practically speaking, definitions
that map the entity names to one or more Unicode characters are the only
real possibility as a standard set of definitions usable across a range
of XML vocabularies.
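
As an illustration of what such definitions look like, here is a minimal
Python sketch that turns a name-to-code-point table into XML entity
declarations. The names SAMPLE_MAPPING and entity_declarations are
hypothetical, and the three sample mappings are simply commonly used
ones (acute and rang as quoted later in this note), not an authoritative
assignment.

```python
# Hypothetical sample table: entity name -> sequence of Unicode code points.
# A sequence is used because a name may map to more than one character.
# These values are illustrative only, not an official mapping.
SAMPLE_MAPPING = {
    "acute": (0x00B4,),  # ACUTE ACCENT
    "alpha": (0x03B1,),  # GREEK SMALL LETTER ALPHA
    "rang": (0x232A,),   # RIGHT-POINTING ANGLE BRACKET
}


def entity_declarations(mapping):
    """Return one XML <!ENTITY ...> declaration per name, sorted by name.

    Names such as amp and lt need extra escaping in a real DTD and are
    deliberately not handled by this sketch.
    """
    lines = []
    for name in sorted(mapping):
        # Numeric character references keep the generated file pure ASCII.
        value = "".join("&#x{:05X};".format(cp) for cp in mapping[name])
        lines.append('<!ENTITY {} "{}">'.format(name, value))
    return lines


if __name__ == "__main__":
    print("\n".join(entity_declarations(SAMPLE_MAPPING)))
```

Because the replacement text is plain character data, declarations of
this form do not depend on any particular element vocabulary.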

Note that after an SGML SDATA definition, all applications that process
the document know the original character was entered as a named entity
and can, if necessary, provide special handling of that named character.
In XML, however, after the first XML parse subsequent applications will
typically not be informed that a named entity was used. The use of the
entity will appear identical to the application as direct use of the
character (or use of numeric character references).

Also, as entity definitions have document scope, it is important that
all XML vocabularies that could conceivably be used together share the
same definitions. If MathML is used inside DocBook (or XHTML) then it is
not possible for &alpha; to mean one thing, specified by DocBook (or
XHTML), in the textual parts and another, specified by MathML, in the
mathematics. If it happens (and it does) that the DocBook and MathML DTDs
disagree on some definitions, then the meaning of an entity in a
combined MathML+DocBook DTD will be either the MathML one or the DocBook
one, depending on the technicalities of the way in which the combined DTD
was built from the two constituent parts. This has proved to be a major
source of complication.

There are certain arbitrary choices that have to
be made when assigning Unicode definitions for the names. The only way
to have a consistent set is for someone with authority to just state
a "good" mapping. This would not force everyone to use it if they knew
what they were doing and wanted to do their own thing, but it would be
a massive help to anyone trying to specify a vocabulary that uses the
ISO entity sets and wants to be compatible with another language.
ISO, as keeper of the entity set names and joint keeper of
Unicode/ISO 10646, seems to be the natural authority to promote such a
mapping table.
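
Two of the points above can be observed with a small Python sketch using
the standard library's expat-based parser: a named entity is
indistinguishable from the character itself after parsing, and when a
combined DTD declares the same name twice, the XML rule that the first
declaration encountered is binding decides the result. The three test
documents and the helper parsed_text are hypothetical, made up for this
illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: one uses a named entity declared in the internal
# subset, the other uses a numeric character reference directly.
WITH_ENTITY = '<!DOCTYPE doc [<!ENTITY alpha "&#x3B1;">]><doc>&alpha;</doc>'
WITH_NUMERIC = "<doc>&#x3B1;</doc>"

# Two conflicting declarations of phi, as might arise when two DTDs are
# combined. Which one takes effect depends only on declaration order.
CONFLICT = (
    "<!DOCTYPE doc ["
    '<!ENTITY phi "&#x3C6;">'  # GREEK SMALL LETTER PHI, declared first
    '<!ENTITY phi "&#x3D5;">'  # GREEK PHI SYMBOL, declared second
    "]><doc>&phi;</doc>"
)


def parsed_text(xml_source):
    """Parse the document and return the text content of its root element."""
    return ET.fromstring(xml_source).text


if __name__ == "__main__":
    # The application sees the same string either way.
    print(parsed_text(WITH_ENTITY) == parsed_text(WITH_NUMERIC))
    # The first declaration is the one that takes effect.
    print("{:04X}".format(ord(parsed_text(CONFLICT))))
```

Swapping the two phi declarations in CONFLICT changes the parsed
character, which is the kind of build-order dependence described above.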


Problem 2b: Characters missing in Unicode
=========================================

There are several characters in the ISO entity sets which have no
plausible mapping to Unicode. One I particularly miss is
jnodot in ISOAMSO.
In MathML I map this to "j", which seems to lose something in the
translation...
DocBook maps it to Unicode character "FFFD", the "replacement character",
which is just a technical way of saying it isn't supported at all.

There are other characters (character names) in this situation.
These cause difficulty in specifying any reliable conversion of "legacy"
SGML data into XML.  I believe that there are two possible solutions to
this problem:

  a) (much preferred) Unicode accepts supporting ISO entity sets as a
     requirement for the Unicode character sets and adds the few extra
     characters needed to Unicode x.y for some x.y > 3.2.
 
or failing that

  b) ISO deprecates (or removes) these problem characters from an
     updated version of 9573-13, together with some specific
     recommendations on what to do when faced with legacy documents
     using these characters. (If it's your own document, finding another
     character isn't hard, but if it is not your document, or you are a
     machine, some specific instructions would be good.)
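
To make the conversion difficulty concrete, here is a minimal Python
sketch of a converter that reports names with no faithful Unicode
mapping instead of silently substituting something. The tables,
the NO_FAITHFUL_MAPPING set, and the function convert_entity are all
hypothetical; they contain only the two treatments of jnodot described
above, plus one ordinary name for contrast.

```python
REPLACEMENT = "\uFFFD"  # REPLACEMENT CHARACTER

# Hypothetical policy tables, illustrating the two treatments of jnodot
# described in this note. Real tables would cover every entity name.
MATHML_STYLE = {"jnodot": "j", "acute": "\u00B4"}
DOCBOOK_STYLE = {"jnodot": REPLACEMENT, "acute": "\u00B4"}

# Names assumed (for this sketch) to have no faithful Unicode mapping.
NO_FAITHFUL_MAPPING = {"jnodot"}


def convert_entity(name, mapping):
    """Return (text, faithful) for an entity name.

    faithful is False when the name is unknown, maps to U+FFFD, or is
    listed as having only an approximate mapping, so that a tool
    converting legacy SGML can log the case rather than lose data silently.
    """
    text = mapping.get(name, REPLACEMENT)
    faithful = text != REPLACEMENT and name not in NO_FAITHFUL_MAPPING
    return text, faithful


if __name__ == "__main__":
    for label, table in (("MathML", MATHML_STYLE), ("DocBook", DOCBOOK_STYLE)):
        print(label, convert_entity("jnodot", table))
```

Either policy loses information; the flag only makes the loss visible to
whoever (or whatever) is doing the conversion.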



Problem 2c: Inconsistent mappings
=================================

As historically the ISO entity sets have not had a canonical mapping,
authors of vocabularies have had to produce their own, and human nature
being what it is, inconsistencies have arisen.


If the ISO WG agrees to promote a "recommended" mapping of the sets to
Unicode, this problem would (as far as possible) be gone. Conversations
in the bar with maintainers of several popular sets (MathML, DocBook,
HTML, etc.) indicate that while no one wants to introduce
incompatibilities by changing definitions too often, changing to a grand
unified scheme would be worth the pain. If the working group decides that
it cannot promote such a mapping, then it could at least deprecate the
names that currently cause problems. These include


circ

  Most vocabularies (and common expectation) map this to
  005e (CIRCUMFLEX ACCENT)
  to match similar mappings, e.g. acute to 00b4 (ACUTE ACCENT).
  HTML and XHTML have the somewhat eccentric mapping to
  02c6 (MODIFIER LETTER CIRCUMFLEX ACCENT).
  MathML2 (but not MathML 1), somewhat reluctantly, decided to follow HTML here.


asymp

  The HTML mapping is to a "double tilde"
   2248 (ALMOST EQUAL TO)

  The ISO mapping is normally to a "cupcap" symbol
  224d (EQUIVALENT TO)

  Interestingly, neither of these uses
  2243 (ASYMPTOTICALLY EQUAL TO)


  MathML2 decided to deprecate asymp but, to make it compatible with
  HTML (2248), it introduced a new asympeq entity mapping to 224d (the
  old definition of asymp). Perhaps ISOTECH should do likewise?


phi

  In MathML,
  the textual Greek set (ISOGRK1) uses
  3c6 (GREEK SMALL LETTER PHI) (entity phgr)

  and the mathematical Greek set (ISOGRK3) uses
  3d5 (GREEK PHI SYMBOL) (entity "phi" in ISO 9573-13, and "phis" in
  ISO 8879).

  HTML4, which predates Unicode 3 (where the description of the two phi
  characters was changed), defines phi to be
  3c6 (GREEK SMALL LETTER PHI).

  Again it would be best if a combined canonical mapping for phi and phis
  could be given, but failing that, deprecate phi and think of new names
  for the two phi variants.
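
The disagreements described for circ, asymp and phi can be tabulated
mechanically. The following Python sketch records the code points as
quoted in this note and lists the names on which the vocabularies
differ. The table QUOTED, its informal vocabulary labels, and the
function disagreements are hypothetical names for this illustration;
"ISO" stands for the mapping normally used for the ISO sets.

```python
import unicodedata

# Code points as quoted in this note; the labels are informal.
QUOTED = {
    "circ": {"most vocabularies": 0x005E, "HTML/XHTML": 0x02C6, "MathML2": 0x02C6},
    "asymp": {"ISO": 0x224D, "HTML": 0x2248, "MathML2": 0x2248},
    "phi": {"MathML (ISOGRK3)": 0x03D5, "HTML4": 0x03C6},
}


def disagreements(table):
    """Return {name: sorted distinct code points} for names that differ."""
    result = {}
    for name, by_vocabulary in table.items():
        distinct = sorted(set(by_vocabulary.values()))
        if len(distinct) > 1:
            result[name] = distinct
    return result


if __name__ == "__main__":
    for name, code_points in disagreements(QUOTED).items():
        for cp in code_points:
            # Print the official Unicode character name next to each value.
            print("{:6} {:04X} {}".format(name, cp, unicodedata.name(chr(cp))))
```

A recommended mapping would amount to reducing every entry in such a
table to a single value.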


rang (lang and related names)

  Currently DocBook, HTML, MathML and presumably most other vocabularies
  consistently map this to

  232a (RIGHT-POINTING ANGLE BRACKET)

  However, there are problems with this character, as it has a canonical
  normalisation to a bracket in the CJK block, and Unicode 3.2 introduced
  a new
  27e9 (MATHEMATICAL RIGHT ANGLE BRACKET).
  There is some pressure for MathML (at least) to use this, but that would
  introduce incompatibilities with everyone else. It would be easier if
  everyone (or at least HTML, TEI, DocBook, MathML, for example) could
  move together to a newly specified canonical ISO mapping. (Or have I
  said that already? :-)
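
The normalisation problem with 232a can be checked with Python's
standard unicodedata module. This is only a sketch: the constant and
function names are hypothetical, and the results reflect whatever
version of the Unicode data ships with the Python in use.

```python
import unicodedata

RANG = "\u232A"       # RIGHT-POINTING ANGLE BRACKET (the current mapping)
MATH_RANG = "\u27E9"  # MATHEMATICAL RIGHT ANGLE BRACKET (new in Unicode 3.2)


def describe(ch):
    """Return 'XXXX NAME' for a single character."""
    return "{:04X} {}".format(ord(ch), unicodedata.name(ch))


def after_nfc(ch):
    """Return the character as it appears after NFC normalisation."""
    return unicodedata.normalize("NFC", ch)


if __name__ == "__main__":
    for ch in (RANG, MATH_RANG):
        print(describe(ch), "->", describe(after_nfc(ch)))
```

Under normalisation 232a is replaced by the CJK bracket 3009, whereas
27e9 is left unchanged, so any tool that normalises text will silently
change documents that use the current mapping.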



Problem 3: Invisibility of ISO 9573-13 definitions
==================================================

Searching in Google (or anywhere else) for the ISO 9573-13 entity sets
will give lots of hits, but almost all of them are to files at NAG,
the W3C MathML pages, or other archived copies of MathML information.
That is, they come ultimately from me. This is a far from ideal
situation. I based my information on a file "ptext13.zip" which I
obtained from Robin Cover's SGML pages (or a page linked from there),
but I can no longer find that source and can find no other authoritative
source on any pages linked to the ISO activity. It seems strange (to say
the least) that a set of definitions of public character sets should not
be available to the public.

--end N0387--