SC22/WG20 N958
Problem statement for expressing DIN 5007 in 14651
Status: Expert contribution
Author: Marc Wilhelm Küster
Date: 2002-06-11
Action: For discussion
Umlaut and trema
DIN 5007, the long-established German standard for ordering, as well as
current practice in German libraries distinguish between two diacritics
of very similar (in fact, today mostly identical) appearance:
- the umlaut
- the trema
Both diacritics are encoded in the UCS as U0308, the COMBINING
DIAERESIS.
However, in a number of traditional library coding schemes and in
software with a long history, such as the *Tübingen System für
TextVerarbeitungs Programme (TUSTEP)*, the two are distinguished in
their encoding.
The two diacritics have very different roots, and traditional German
typography set them apart visually through the relative distance
between the two dots and sometimes through their diameter. That
distinction has, however, largely disappeared with the advent of
PostScript and is now almost obsolete.
Traditional ordering
This analysis may sound like a plea for the encoding of two separate
diacritics. This is not the case. A disunification of umlaut and trema
would, for a variety of reasons, be undesirable.
However, the current unification poses a problem for German ordering,
as DIN 5007 treats letters with an umlaut differently from letters with
a trema. In the ordering of entities that are not names, letters with
umlaut come directly after the respective base letter, whereas letters
with trema follow after many of the remaining diacritics. Hence, you
have a sequence of the type a ä á if ä is an a with umlaut, but you get
a á ä if the ä is an a with trema. This distinction is mandatory in
DIN 5007.
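
The two sequences can be illustrated with a minimal Python sketch. It
assumes a simplified two-level comparison (base letter first, diacritic
second); the numeric weights are hypothetical and are taken neither
from DIN 5007 nor from the 14651 template. Only the relative position
of the diaeresis and the acute accent matters here.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS, used for both umlaut and trema
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT

# Hypothetical secondary weights (illustration only).
WEIGHTS_IF_UMLAUT = {"": 0, DIAERESIS: 1, ACUTE: 2}  # diaeresis right after the base letter
WEIGHTS_IF_TREMA = {"": 0, ACUTE: 1, DIAERESIS: 2}   # diaeresis after the acute accent


def sort_key(letter, weights):
    """Return (base letter, diacritic weight) for a single accented letter."""
    decomposed = unicodedata.normalize("NFD", letter)
    base, mark = decomposed[0], decomposed[1:]
    return (base, weights[mark])


def order(letters, weights):
    """Sort single letters under the given reading of the diaeresis."""
    return sorted(letters, key=lambda letter: sort_key(letter, weights))


letters = ["\u00e1", "\u00e4", "a"]       # á, ä, a
print(order(letters, WEIGHTS_IF_UMLAUT))  # ['a', 'ä', 'á']
print(order(letters, WEIGHTS_IF_TREMA))   # ['a', 'á', 'ä']
```

The same input yields two different results, and the only thing that
selects between them is information that is not present in the encoded
text.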
Both versions of ä would, from the point of view of the UCS, be encoded
identically, namely as U00E4 or U0061 + U0308. Tailoring in 14651 can
operate only at the level of individual characters and character /
diacritic combinations. For this reason, there is no way to express
DIN 5007 as a profile of 14651. This is unfortunate and causes problems
in German libraries, especially in large research libraries.
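
That the two encodings are interchangeable can be checked with
Python's standard `unicodedata` module, as in the short sketch below.
The helper `code_points` is a hypothetical name introduced here only to
print code points in the notation used in this paper.

```python
import unicodedata

precomposed = "\u00e4"  # ä as a single code point (U00E4)
decomposed = "a\u0308"  # a (U0061) followed by COMBINING DIAERESIS (U0308)


def code_points(text):
    """List the code points of text in the U0308-style notation."""
    return ["U%04X" % ord(ch) for ch in text]


# Canonical normalization maps each spelling onto the other, so neither
# form can carry the umlaut / trema distinction.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

print(code_points(precomposed))                                # ['U00E4']
print(code_points(unicodedata.normalize("NFD", precomposed)))  # ['U0061', 'U0308']
```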
Analogous problems exist for the ordering of names.
Desired guidance
The author would like guidance from WG20 on how to handle this problem
within the 14651 framework. Such guidance could take the form of
either:
* a note in 14651 stating the best practice, or
* some other WG20 best practice document that can be readily
  referenced, or
* the resolution that WG20 has no views on this matter and leaves it
  entirely up to the national ordering standards to make provisions for
  this and similar cases.
A technical recommendation could work along the following lines:
In order to maintain the difference between a letter with umlaut and
the same letter with trema:
* Mark up the distinction between the two diacritics through a higher
  level protocol if this distinction is deemed necessary in a
  particular context.
* Decompose the string, at least with regard to the ambiguous letters.
* Map the markup + combining trema combination to a character in the
  private use area and treat that character as a combining diacritic
  for ordering purposes.
* Tailor the template on that assumption.
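
The four steps above could be sketched as follows. This is a minimal
illustration under stated assumptions, not a proposed implementation:
the `<trema>` element stands in for whatever higher level protocol is
used, U+E000 is an arbitrarily chosen private use character, and the
weights are hypothetical stand-ins for a real tailoring of the 14651
template. Combining marks other than the three listed are not handled.

```python
import re
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS, read as umlaut when unmarked
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
PUA_TREMA = "\ue000"  # hypothetical private use character meaning "trema"

# Hypothetical markup for the higher level protocol.
TREMA_MARKUP = re.compile(r"<trema>(.*?)</trema>")

# Hypothetical tailored secondary weights: umlaut < acute < trema.
SECONDARY = {DIAERESIS: 1, ACUTE: 2, PUA_TREMA: 3}


def prepare(marked_up):
    """Decompose the text and map marked-up diaereses to the PUA character."""
    def to_pua(match):
        decomposed = unicodedata.normalize("NFD", match.group(1))
        return decomposed.replace(DIAERESIS, PUA_TREMA)

    text = TREMA_MARKUP.sub(to_pua, marked_up)
    return unicodedata.normalize("NFD", text)


def sort_key(marked_up):
    """Build a simplified two-level key: base letters, then diacritic weights."""
    primary, secondary = [], []
    for ch in prepare(marked_up):
        if ch in SECONDARY:
            if secondary:  # attach the weight to the preceding base letter
                secondary[-1] = SECONDARY[ch]
        else:
            primary.append(ch.lower())
            secondary.append(0)
    return (primary, secondary)


entries = ["<trema>\u00e4</trema>", "\u00e1", "\u00e4", "a"]
ordered = sorted(entries, key=sort_key)
print(ordered)  # ['a', 'ä', 'á', '<trema>ä</trema>']
```

Because the private use character never leaves the ordering process,
the stored text itself remains encoded with U0308 in both cases.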
The author is open to any other suggestion. Ideally, such suggestions
should be generic enough to be applicable in comparable cases such as
may arise with regard to other cultural practices.