| Title: | Community contribution: problems with ISO entity sets (ISO 8879 and 9573-13) |
| Source: | David Carlisle, UK |
| Project: | SGML ISO-8879; Public entity sets for mathematics and science ISO 9573-13 |
| Project editor: | |
| Status: | Text of solicited community contribution |
| Action: | For consideration of future SC34 activities |
| Date: | 2003-04-02 16:10UTC |
| Summary: | Real-world use of ISO entity sets has revealed issues that need to be addressed for the user community. |
| Distribution: | SC34 and Liaisons |
| Refer to: | |
| Supersedes: | |
| Reply to: | Dr. James David Mason (ISO/IEC JTC1/SC34 Chairman) Y-12 National Security Complex Information Technology Services Bldg. 9113 M.S. 8208 Oak Ridge, TN 37831-8208 U.S.A. Telephone: +1 865 574-6973 Facsimile: +1 865 574-1896 E-mail: mailto:mxm@y12.doe.gov http://www.y12.doe.gov/sgml/sc34/sc34oldhome.htm Ms. Sara Hafele, ISO/IEC JTC 1/SC 34 Secretariat American National Standards Institute 25 West 43rd Street New York, NY 10036 Tel: +1 212 642-4937 Fax: +1 212 840-2298 E-mail: shafele@ansi.org |
This document is a personal note on problems encountered whilst working
with ISO entity sets, and in particular in attempting to produce
mappings of the ISO entities to Unicode to enable XML DTD declarations
to be made. Much of this work has been undertaken as the Editor of the
W3C MathML DTD, but this is a personal note, produced in response to an
email request quoted at the end.
http://www.w3.org/Math/characters/
contains further information including mappings to Unicode for all the
ISO 8879 and 9573-13 entity sets, whether or not they are used in
MathML.
DocBook has a similar page describing its mappings at
http://www.oasis-open.org/docbook/specs/wd-docbook-xmlcharent-0.3.html
Problem 1: Fixed set
====================
The first problem that could be mentioned is that any fixed (and small)
set of characters "blessed" with names is, together with the names
chosen, always going to be an essentially arbitrary choice. I don't
offer any insight here but wanted to mention this
"problem" first as a desire to change the set of named characters is one
possible source of pressure to change the entity sets. (I understand
the 9573-13 isonum set was recently extended to include euro for example.)
Problem 2: Lack of canonical mapping to Unicode
===============================================
The lack of a canonical mapping to Unicode is currently a severe, almost
fatal, problem when trying to use the entity sets in an XML context.
Traditionally the ISO sets were defined as SGML SDATA entities.
In this form a canonical mapping to Unicode was not required. Indeed
Unicode might not be used at all: an SGML system could hold these
characters as "special character data" until the last minute, when some
suitable rendering could be used.
SDATA is not available in XML and this completely changes the situation.
Problem 2 splits into subproblems:
Problem 2a: Unicode needed in XML
=================================
As SDATA is not available to define the character entities, the only
choices are to define the entities as (Unicode) character data or as
some combination of elements and attributes. If the latter choice is
taken then the entity definitions are only usable in document types that
include the elements used. This severely restricts the use of the entity
set definitions as a public standard (and also prevents use of the
characters in attribute values). Thus, practically speaking, definitions
that map the entity names to one or more Unicode characters are the only
real possibility as a standard set of definitions usable across a range
of XML vocabularies.
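The point can be illustrated with a minimal Python sketch using only the
standard library. It assumes a made-up one-entity internal DTD subset; the
element name "p", the attribute "title" and the single alpha declaration are
chosen for illustration and are not taken from any ISO set file. An entity
defined as Unicode character data is usable in element content and in
attribute values, whatever the vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: alpha is declared as plain Unicode character data
# (a numeric character reference) in the internal DTD subset.
DOC = (
    '<!DOCTYPE p [<!ENTITY alpha "&#x3B1;">]>'
    '<p title="&alpha;-particle">&alpha; decay</p>'
)

root = ET.fromstring(DOC)

# The same definition serves both element content and an attribute value.
content = root.text
attribute = root.get("title")

print(ascii(content), ascii(attribute))
```

A definition built from elements and attributes could not be referenced in
the "title" attribute at all, since markup is not allowed in attribute values.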
Note that after an SGML SDATA definition all applications that process
the document know the original character was entered as a named entity
and can, if necessary, provide special handling of that named character.
In XML, however, after the first XML parse subsequent applications will
typically not be informed that a named entity was used. The use of the
entity will appear to the application identical to direct use of the
character (or use of numeric character references). Also, as entity
definitions have document scope, it is important that all XML
vocabularies that could conceivably be used together share the same
definitions. If MathML is used inside DocBook (or XHTML) then it is not
possible for &alpha; to mean one thing specified by DocBook (or XHTML)
in the textual parts and another specified by MathML, in the
mathematics. If it happens (and it does) that the DocBook and MathML DTDs
disagree on some definitions, then the meaning of an entity in a
combined MathML+DocBook DTD will be either the MathML one or the DocBook
one, depending on the technicalities of the way in which the combined DTD
was built from the two constituent parts. This has proved to be a major
source of complication. There are certain arbitrary choices that have to
be made when assigning Unicode definitions for the names. The only way
to have a consistent set is for someone with authority to just state
a "good" mapping. This would not force everyone to use it if they knew
what they were doing and wanted to do their own thing, but it would be
a massive help to anyone trying to specify a vocabulary that uses the
ISO entity sets and wants to be compatible with another language.
ISO, as keeper of the entity set names and joint keeper of
Unicode/ISO 10646, seems to be the natural authority to promote such a
mapping table.
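Both effects described in this section can be reproduced with a short Python
sketch (standard library only). The element name "m" and the two competing
phi declarations are assumptions for illustration; they stand in for a
"textual" and a "mathematical" DTD fragment that disagree. The sketch relies
on the XML 1.0 rule that when an entity is declared more than once the first
declaration encountered is binding.

```python
import xml.etree.ElementTree as ET


def parsed_text(doc: str) -> str:
    """Return the character data an application sees after the XML parse."""
    return ET.fromstring(doc).text


# 1. Named entity, numeric reference and direct character are
#    indistinguishable once parsed.
named = parsed_text('<!DOCTYPE m [<!ENTITY phi "&#x3C6;">]><m>&phi;</m>')
numeric = parsed_text('<m>&#x3C6;</m>')
direct = parsed_text('<m>\u03c6</m>')

# 2. Two hypothetical DTD fragments that disagree on phi.
TEXT_DECL = '<!ENTITY phi "&#x3C6;">'  # GREEK SMALL LETTER PHI
MATH_DECL = '<!ENTITY phi "&#x3D5;">'  # GREEK PHI SYMBOL


def combined(first: str, second: str) -> str:
    """Parse &phi; under a DTD subset built from two fragments, in order."""
    return parsed_text(f'<!DOCTYPE m [{first}{second}]><m>&phi;</m>')


text_first = combined(TEXT_DECL, MATH_DECL)
math_first = combined(MATH_DECL, TEXT_DECL)

print(ascii(named), ascii(numeric), ascii(direct))
print(ascii(text_first), ascii(math_first))
```

The meaning of the entity in the combined document depends only on which
fragment happened to be included first, not on where in the document the
reference appears.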
Problem 2b: Characters missing in Unicode
=========================================
There are several characters in the ISO entity sets which have no
plausible mapping to Unicode. One I particularly miss is
jnodot in ISOAMSO.
In MathML I map this to "j", which seems to lose something in the
translation...
DocBook maps it to Unicode character "FFFD", the "replacement character",
which is just a technical way of saying it isn't supported at all.
There are other characters (character names) in this situation.
These cause difficulty in specifying any reliable conversion of "legacy"
SGML data into XML. I believe that there are two possible solutions to
this problem:
a) (much preferred) Unicode accepts supporting ISO entity sets as a
requirement for the Unicode character sets and adds the few extra
characters needed to Unicode x.y for some x.y > 3.2.
or failing that
b) ISO deprecates (or removes) these problem characters from an
updated version of 9573-13, together with some specific
recommendations on what to do when faced with legacy documents
using these characters. (If it's your own document, finding another
character isn't hard, but if it is not your document, or you are a
machine, some specific instructions would be good.)
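To make concrete what "specific instructions" for a machine might look like,
here is a minimal Python sketch of a legacy-text converter with an explicit
policy for unmappable names. Everything in it is hypothetical: the two-entry
MAPPED table, the UNMAPPED set, the function name convert and the policy names
"error", "replace" and "keep" are assumptions for illustration, not a
recommendation from any standard.

```python
import re

# Hypothetical excerpt of a name-to-Unicode table; a real table would
# cover every name in the ISO sets.
MAPPED = {"alpha": "\u03b1", "acute": "\u00b4"}

# Names treated as having no plausible Unicode mapping (per this note).
UNMAPPED = {"jnodot"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def convert(text: str, on_unmapped: str = "error") -> str:
    """Replace entity references in legacy text by Unicode characters.

    on_unmapped selects the policy for names in UNMAPPED:
    "error" raises, "replace" emits U+FFFD, "keep" leaves the reference.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in MAPPED:
            return MAPPED[name]
        if name in UNMAPPED:
            if on_unmapped == "replace":
                return "\ufffd"
            if on_unmapped == "keep":
                return match.group(0)
            raise ValueError(f"no Unicode mapping for entity {name!r}")
        return match.group(0)  # unknown name: leave untouched

    return ENTITY_REF.sub(replace, text)


print(ascii(convert("&alpha; and &jnodot;", on_unmapped="replace")))
```

Making the policy an explicit argument means the loss of information is a
visible decision rather than a silent side effect of the mapping table.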
Problem 2c: Inconsistent mappings
=================================
As the ISO entity sets have historically not had a canonical mapping,
authors of vocabularies have had to produce their own and, human nature
being what it is, inconsistencies have arisen.
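Such inconsistencies are easy to detect mechanically once two tables exist.
The following Python sketch compares a small table of code points quoted in
this note against the HTML 4.01 table shipped in Python's standard library
(html.entities.name2codepoint). The NOTE_MAPPINGS table and the helper names
are assumptions for illustration; only the four names discussed in this note
are included.

```python
import unicodedata
from html.entities import name2codepoint  # HTML 4.01 entity table

# Code points quoted in this note for the non-HTML reading of each name.
# Illustrative only; not a complete or authoritative table.
NOTE_MAPPINGS = {
    "circ": 0x005E,   # most vocabularies
    "asymp": 0x224D,  # usual ISO reading
    "phi": 0x03D5,    # MathML, ISOGRK3
    "rang": 0x232A,   # DocBook, HTML and MathML agree
}


def describe(code_point: int) -> str:
    """Format a code point in the style used in this note."""
    return f"{code_point:04x} ({unicodedata.name(chr(code_point))})"


def conflicts(a: dict, b: dict) -> dict:
    """Return the names present in both tables whose code points differ."""
    shared = sorted(a.keys() & b.keys())
    return {name: (a[name], b[name]) for name in shared if a[name] != b[name]}


found = conflicts(NOTE_MAPPINGS, name2codepoint)
for name, (ours, html) in found.items():
    print(f"{name}: {describe(ours)} vs HTML {describe(html)}")
```

A published canonical table would let every vocabulary run a check of this
kind against a single reference instead of against each other.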
If the ISO WG agrees to promote a "recommended" mapping of the sets to
Unicode this problem would (as far as possible) be gone. Conversations
in the bar with maintainers of several popular sets (MathML, DocBook,
HTML, etc) indicate that while no one wants to introduce
incompatibilities by changing definitions too often, changing to a grand
unified scheme would be worth the pain. If the Working Group decided that
it cannot promote such a mapping then it could at least deprecate the
names that currently cause problems. These include
circ
Most vocabularies (and common expectation) map this to
005e (CIRCUMFLEX ACCENT)
to match similar mappings, e.g. acute to 00b4 (ACUTE ACCENT)
HTML and XHTML have the somewhat eccentric mapping to
02c6 (MODIFIER LETTER CIRCUMFLEX ACCENT)
MathML2 (but not MathML 1) somewhat reluctantly decided to follow HTML here.
asymp
The HTML mapping is to a "double tilde"
2248 (ALMOST EQUAL TO)
The ISO mapping is normally to a "cupcap" symbol
224d (EQUIVALENT TO)
Interestingly neither of these uses
2243 (ASYMPTOTICALLY EQUAL TO)
MathML2 decided to deprecate asymp but to make it compatible with
HTML (2248); it introduced a new asympeq entity mapping to 224d (the
old definition of asymp). Perhaps ISOTECH should do likewise?
phi
In MathML
the textual Greek set (ISOGRK1) uses
3c6 (GREEK SMALL LETTER PHI) (entity phgr)
the mathematical Greek set (ISOGRK3) uses
3d5 (GREEK PHI SYMBOL) (entity "phi" in ISO 9573-13, and "phis" in
ISO 8879)
HTML4, which predates Unicode 3 (where the description of the two phi
characters was changed), defines phi to be
3c6 (GREEK SMALL LETTER PHI)
Again it would be best if a combined canonical mapping for phi and phis
could be given but, failing that, deprecate phi and think of new names
for the two phi variants.
rang (lang and related names)
Currently DocBook, HTML, MathML and presumably most other vocabularies
consistently map this to
232a (RIGHT-POINTING ANGLE BRACKET)
However there are problems with this character, as it has a canonical
normalisation to a bracket in the CJK block, and Unicode 3.2 introduced
a new
27e9 (MATHEMATICAL RIGHT ANGLE BRACKET)
There is some pressure for MathML (at least) to use this, but that would
introduce incompatibilities with everyone else. It would be easier if
everyone (or at least HTML, TEI, DocBook and MathML, for example) could
move together to a newly specified canonical ISO mapping (Or have I
said that already:-)
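The normalisation problem with 232a can be observed directly with Python's
standard unicodedata module. This is a small sketch; the variable and
function names are assumptions for illustration. It applies canonical
normalisation (NFC) to the current mapping of rang and to the newer
mathematical bracket and reports what survives.

```python
import unicodedata

rang_current = "\u232a"  # RIGHT-POINTING ANGLE BRACKET
rang_math = "\u27e9"     # MATHEMATICAL RIGHT ANGLE BRACKET


def nfc_report(ch: str) -> tuple:
    """Return (input code point, NFC code point, NFC character name)."""
    normalized = unicodedata.normalize("NFC", ch)
    return (
        f"{ord(ch):04x}",
        f"{ord(normalized):04x}",
        unicodedata.name(normalized),
    )


current_report = nfc_report(rang_current)
math_report = nfc_report(rang_math)

print(current_report)
print(math_report)
```

Any tool in a processing chain that normalises its output will therefore
silently replace 232a by the CJK bracket, while 27e9 is left alone.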
Problem 3: Invisibility of ISO 9573-13 definitions
==================================================
Searching in Google (or anywhere else) for the ISO 9573-13 entity sets
will give lots of hits, but almost all of them are to files at NAG
or the W3C MathML pages, or other archived copies of MathML information.
That is, they come ultimately from me. This is a far from ideal
situation. I based my information on a file "ptext13.zip" which I
obtained from Robin Cover's SGML pages (or a page linked from there),
but I can no longer find that source and can find no other authoritative
source on any pages linked to the ISO activity. It seems strange (to say
the least) that a set of definitions of public character sets should not
be available to the public.
--end N0387--