Title: | Community contribution: problems with ISO entity sets (ISO 8879 and 9573-13) |
Source: | David Carlisle, UK |
Project: | SGML ISO-8879; Public entity sets for mathematics and science ISO 9573-13 |
Project editor: | |
Status: | Text of solicited community contribution |
Action: | For consideration of future SC34 activities |
Date: | 2003-04-02 16:10UTC |
Summary: | Real-world use of ISO entity sets has revealed issues that need to be addressed for the user community. |
Distribution: | SC34 and Liaisons |
Refer to: | |
Supersedes: | |
Reply to: | Dr. James David Mason (ISO/IEC JTC1/SC34 Chairman) Y-12 National Security Complex Information Technology Services Bldg. 9113 M.S. 8208 Oak Ridge, TN 37831-8208 U.S.A. Telephone: +1 865 574-6973 Facsimile: +1 865 574-1896 E-mail: mailto:mxm@y12.doe.gov http://www.y12.doe.gov/sgml/sc34/sc34oldhome.htm Ms. Sara Hafele, ISO/IEC JTC 1/SC 34 Secretariat American National Standards Institute 25 West 43rd Street New York, NY 10036 Tel: +1 212 642-4937 Fax: +1 212 840-2298 E-mail: shafele@ansi.org |
This document is a personal note on problems encountered whilst working with the ISO entity sets, in particular whilst attempting to produce mappings of the ISO entities to Unicode so that XML DTD declarations can be made. Much of this work has been undertaken as the Editor of the W3C MathML DTD, but this is a personal note, produced in response to an email request quoted at the end.

http://www.w3.org/Math/characters/ contains further information, including mappings to Unicode for all the ISO 8879 and 9573-13 entity sets, whether or not they are used in MathML. DocBook has a similar page describing its mappings at http://www.oasis-open.org/docbook/specs/wd-docbook-xmlcharent-0.3.html

Problem 1: Fixed set
====================

The first problem that could be mentioned is that, for any fixed (and small) set of characters that is "blessed" with names, the choice of characters and of their names will always be essentially arbitrary. I don't offer any insight here, but wanted to mention this "problem" first because a desire to change the set of named characters is one possible source of pressure to change the entity sets. (I understand the 9573-13 isonum set was recently extended to include euro, for example.)

Problem 2: Lack of canonical mapping to Unicode
===============================================

The lack of a canonical mapping to Unicode is currently a severe, almost fatal, problem when trying to use the entity sets in an XML context.

Traditionally the ISO sets were defined as SGML SDATA entities. In this form a canonical mapping to Unicode was not required. In fact Unicode might not be used at all: an SGML system could hold these characters as "special character data" until the last minute, when some suitable rendering could be used. SDATA is not available in XML, and this completely changes the situation.
Problem 2 splits into subproblems:

Problem 2a: Unicode needed in XML
=================================

As SDATA is not available to define the character entities, the only choices are to define each entity as (Unicode) character data or as some combination of elements and attributes. If the latter choice is taken, the entity definitions are only usable in document types that include the elements used. This severely restricts the use of the entity set definitions as a public standard (and also prevents use of the characters in attribute values). Thus, practically speaking, definitions that map the entity names to one or more Unicode characters are the only real possibility for a standard set of definitions usable across a range of XML vocabularies.

Note that with an SGML SDATA definition, all applications that process the document know that the original character was entered as a named entity and can, if necessary, provide special handling of that named character. In XML, however, after the first XML parse, subsequent applications will typically not be informed that a named entity was used. To the application, use of the entity appears identical to direct use of the character (or use of a numeric character reference).

Also, as entity definitions have document scope, it is important that all XML vocabularies that could conceivably be used together share the same definitions. If MathML is used inside DocBook (or XHTML), it is not possible for &alpha; to mean one thing, specified by DocBook (or XHTML), in the textual parts and another, specified by MathML, in the mathematics. If it happens (and it does) that the DocBook and MathML DTDs disagree on some definitions, then the meaning of an entity in a combined MathML+DocBook DTD will be either the MathML one or the DocBook one, depending on the technicalities of the way in which the combined DTD was built from its two constituent parts. This has proved to be a major source of complication.
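The two points above can be illustrated with a minimal Python sketch using the standard library parser. The tiny documents and the helper name parsed_text are hypothetical, made up for this illustration. The first three documents show that a named entity, a numeric character reference and the character itself are indistinguishable after the parse. The last shows that when one name is declared twice, as can happen when two DTDs are combined, the first declaration met is the one that binds.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element documents; only the way the character is written differs.
NAMED = '<!DOCTYPE doc [<!ENTITY alpha "&#x3B1;">]><doc>&alpha;</doc>'
NUMERIC = '<doc>&#x3B1;</doc>'
DIRECT = '<doc>\u03b1</doc>'

# Two declarations of the same name, as might arise when two DTDs are combined.
# XML binds the first declaration encountered, so declaration order decides the meaning.
COMBINED = (
    '<!DOCTYPE doc ['
    '<!ENTITY phi "&#x3C6;">'   # first declaration: GREEK SMALL LETTER PHI
    '<!ENTITY phi "&#x3D5;">'   # second declaration: GREEK PHI SYMBOL (ignored)
    ']><doc>&phi;</doc>'
)


def parsed_text(document):
    """Return the character data an application sees after the XML parse."""
    return ET.fromstring(document).text


if __name__ == "__main__":
    for label, doc in [("named", NAMED), ("numeric", NUMERIC), ("direct", DIRECT)]:
        print(label, "U+%04X" % ord(parsed_text(doc)))
    print("combined", "U+%04X" % ord(parsed_text(COMBINED)))
```

Nothing in the parsed result records that the name alpha was ever used, so an application downstream of the parser cannot give the named character special handling.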
There are certain arbitrary choices that have to be made when assigning Unicode definitions to the names. The only way to have a consistent set is for someone with authority simply to state a "good" mapping. This would not force everyone to use it if they knew what they were doing and wanted to do their own thing, but it would be a massive help to anyone trying to specify a vocabulary that uses the ISO entity sets and wanting to be compatible with another language. ISO, as keeper of the entity set names and co-keeper of Unicode/ISO 10646, seems to be the natural authority to promote such a mapping table.

Problem 2b: Characters missing in Unicode
=========================================

There are several characters in the ISO entity sets which have no plausible mapping to Unicode. One I particularly miss is jnodot in ISO AMSO. In MathML I map this to "j", which seems to lose something in the translation... DocBook maps it to Unicode character "FFFD", the "replacement character", which is just a technical way of saying it isn't supported at all.

There are other characters (character names) in this situation. These cause difficulty in specifying any reliable conversion of "legacy" SGML data into XML.

I believe that there are two possible solutions to this problem:

a) (much preferred) Unicode accepts supporting the ISO entity sets as a requirement for the Unicode character sets and adds the few extra characters needed to Unicode x.y for some x.y > 3.2.

or failing that

b) ISO deprecates (or removes) these problem characters from an updated version of 9573-13, together with some specific recommendations on what to do when faced with legacy documents using these characters. (If it's your own document, finding another character isn't hard, but if it is not your document, or you are a machine, some specific instructions would be good.)
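To make the conversion difficulty concrete, here is a minimal Python sketch of a legacy-text converter that reports names it cannot map instead of silently substituting something. The table ENTITY_TO_UNICODE, the pattern ENTITY_REF and the function convert are all hypothetical names for this illustration, and the table is deliberately tiny. It is not any vocabulary's actual mapping.

```python
import re

# Hypothetical, deliberately tiny mapping table for illustration only.
# jnodot is left out on purpose, standing in for a name with no agreed mapping.
ENTITY_TO_UNICODE = {
    "alpha": "\u03b1",  # GREEK SMALL LETTER ALPHA
    "acute": "\u00b4",  # ACUTE ACCENT
}

# A simplified pattern for entity references such as &alpha;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def convert(text, table=ENTITY_TO_UNICODE):
    """Replace mapped entity references; return the new text and the unmapped names."""
    unmapped = []

    def replace(match):
        name = match.group(1)
        if name in table:
            return table[name]
        unmapped.append(name)
        return match.group(0)  # leave the reference untouched for a human to resolve

    return ENTITY_REF.sub(replace, text), unmapped


if __name__ == "__main__":
    converted, problems = convert("&alpha; and &jnodot;")
    print(converted)
    print("unmapped:", problems)
```

Leaving the reference in place is only one possible policy. Mapping to "j" or to FFFD are others, which is exactly why a specific recommendation would help.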
Problem 2c: Inconsistent mappings
=================================

As historically the ISO entity sets have not had a canonical mapping, authors of vocabularies have had to produce their own and, human nature being what it is, inconsistencies have arisen. If the ISO WG agrees to promote a "recommended" mapping of the sets to Unicode, this problem would (as far as possible) be gone. Conversations in the bar with maintainers of several popular sets (MathML, DocBook, HTML, etc.) indicate that while no one wants to introduce incompatibilities by changing definitions too often, changing to a grand unified scheme would be worth the pain.

If the Working Group decides that it cannot promote such a mapping, it could at least deprecate the names that currently cause problems. These include:

circ

Most vocabularies (and common expectation) map this to 005e (CIRCUMFLEX ACCENT), matching similar mappings, e.g. acute to 00b4 (ACUTE ACCENT). HTML and XHTML have the somewhat eccentric mapping to 02c6 (MODIFIER LETTER CIRCUMFLEX ACCENT). MathML2 (but not MathML 1) somewhat reluctantly decided to follow HTML here.

asymp

The HTML mapping is to a "double tilde", 2248 (ALMOST EQUAL TO). The ISO mapping is normally to a "cupcap" symbol, 224d (EQUIVALENT TO). Interestingly, neither of these uses 2243 (ASYMPTOTICALLY EQUAL TO). MathML2 decided to deprecate asymp but to make it compatible with HTML (2248); it introduced a new asympeq entity mapping to 224d (the old definition of asymp). Perhaps ISOTECH should do likewise?
phi

In MathML the textual Greek set (ISOGRK1) uses 03c6 (GREEK SMALL LETTER PHI) for the entity phgr, while the mathematical Greek set (ISOGRK3) uses 03d5 (GREEK PHI SYMBOL) for the entity named "phi" in ISO 9573-13 and "phis" in ISO 8879. HTML4, which predates Unicode 3 (where the description of the two phi characters was changed), defines phi to be GREEK SMALL LETTER PHI. Again, it would be best if a combined canonical mapping for phi and phis could be given but, failing that, deprecate phi and think of new names for the two phi variants.

rang (lang and related names)

Currently DocBook, HTML, MathML and presumably most other vocabularies consistently map this to 232a (RIGHT-POINTING ANGLE BRACKET). However, there are problems with this character, as it has a canonical normalisation to a bracket in the CJK block, and Unicode 3.2 introduced a new 027E9 (MATHEMATICAL RIGHT ANGLE BRACKET). There is some pressure for MathML (at least) to use this, but that would introduce incompatibilities with everyone else. It would be easier if everyone (or at least HTML, TEI, DocBook and MathML, for example) could move together to a newly specified canonical ISO mapping. (Or have I said that already :-)

Problem 3: Invisibility of ISO 9573-13 definitions
==================================================

Searching in Google (or anywhere else) for the ISO 9573-13 entity sets will give lots of hits, but almost all of them are to files at NAG or the W3C MathML pages, or to other archived copies of MathML information. That is, they come ultimately from me. This is a far from ideal situation. I based my information on a file "ptext13.zip" which I obtained from Robin Cover's SGML pages (or a page linked from there), but I can no longer find that source and can find no other authoritative source on any pages linked to the ISO activity. It seems strange (to say the least) that a set of definitions of public character sets should not be available to the public.
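Returning to the disagreements listed under Problem 2c, they can be tabulated and checked mechanically. The Python sketch below is an illustration only. The table MAPPINGS contains just the code points quoted in this note, the key "usual" is a hypothetical label for "what most vocabularies or the usual ISO reading give", and the function names are made up. It lists the names on which the quoted vocabularies disagree, and flags any mapped character that Unicode canonical normalisation (NFC) would replace.

```python
import unicodedata

# Code points as quoted in this note; "usual" is a hypothetical label for the
# mapping most vocabularies (or the usual ISO interpretation) are said to use.
MAPPINGS = {
    "circ":  {"usual": 0x005E, "html": 0x02C6, "mathml2": 0x02C6},
    "asymp": {"usual": 0x224D, "html": 0x2248, "mathml2": 0x2248},
    "phi":   {"html": 0x03C6, "mathml2": 0x03D5},
    "rang":  {"usual": 0x232A, "html": 0x232A, "mathml2": 0x232A},
}


def disputed(mappings):
    """Return the sorted entity names whose quoted mappings are not all equal."""
    return sorted(n for n, by_vocab in mappings.items()
                  if len(set(by_vocab.values())) > 1)


def changed_by_nfc(mappings):
    """Return {name: (before, after)} for code points that NFC replaces."""
    result = {}
    for name, by_vocab in mappings.items():
        for code_point in set(by_vocab.values()):
            normalised = unicodedata.normalize("NFC", chr(code_point))
            if normalised != chr(code_point):
                result[name] = (code_point, ord(normalised))
    return result


if __name__ == "__main__":
    print("disputed:", disputed(MAPPINGS))
    for name, (before, after) in changed_by_nfc(MAPPINGS).items():
        print("%s: U+%04X %s normalises to U+%04X %s" % (
            name, before, unicodedata.name(chr(before)),
            after, unicodedata.name(chr(after))))
```

The rang case shows why agreement alone is not enough: every vocabulary quoted here uses the same code point, yet that code point does not survive normalisation.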
--end N0387--