SC22/WG20 N958

 

Problem statement for expressing DIN 5007 in 14651

 

   Status: Expert contribution

   Author: Marc Wilhelm Küster

   Date:  2002-06-11

   Action: For discussion

  

 Umlaut and trema

 

       DIN 5007, the long-established German standard for ordering, as well as current

       practice in German libraries distinguishes between two diacritics of very

       similar - in fact today mostly identical - appearance:

       

        - the umlaut

       

        - the trema

       

        Both diacritics are encoded in the UCS as U+0308, the COMBINING DIAERESIS.

        However, a number of traditional library coding schemes and long-established

        software such as the *Tübinger System von Textverarbeitungs-Programmen

        (TUSTEP)* distinguish these two in their encoding.

       

        The two diacritics have very different roots, and traditional German

        typography set them apart visually through the relative distance

        of the two dots and sometimes through their diameter. That distinction has,

        however, largely disappeared with the advent of PostScript and is now

        almost obsolete.

       

  Traditional ordering

   

    This analysis may sound like a plea for the encoding of two separate

    diacritics. This is not the case. A disunification for umlaut and trema

    would, for a variety of reasons, be undesirable.

   

    However, the current unification poses a problem for German ordering, as

    DIN 5007 treats letters with an umlaut differently from letters with a trema. In

    the ordering of entities which are not names, letters with an umlaut come

    directly after the respective base letter, whereas letters with a trema follow

    after many of the remaining diacritics.

   

    Hence, you have a sequence of the type a ä á if ä is an a with umlaut,

    but you get a á ä if the ä is an a with trema. This distinction is mandatory

    in DIN 5007.

   

    Both versions of ä would, from the point of view of the UCS, be encoded

    identically, namely as U+00E4 or U+0061 followed by U+0308. Tailoring in 14651 can only

    be on the level of individual characters and character / diacritic

    combinations. For this reason, there is no way to express DIN 5007 as a

    profile of 14651. This is unfortunate and causes problems in German

    libraries, especially in large research libraries.
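
    The identity of the two encodings can be checked with a short Python
    sketch. It uses only the standard unicodedata module; the variable names
    are illustrative and not part of any standard. It shows that the
    precomposed and the decomposed spelling of ä are canonically equivalent,
    so the coded text itself carries no umlaut/trema distinction.

```python
import unicodedata

precomposed = "\u00E4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

# Canonical decomposition and composition map the two spellings onto each other.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)

print(nfd == decomposed)            # True
print(nfc == precomposed)           # True
print(unicodedata.name("\u0308"))   # COMBINING DIAERESIS
```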

   

    Analogous problems exist for the ordering of names.

   

Desired guidance

 

 The author would like guidance from WG20 on how to handle this problem within

 the 14651 framework. Such guidance could take the form of either:

 

  * a note in 14651 stating the best practice or

 

  * some other WG20 best practice document that can be readily referenced or

 

  * a resolution that WG20 has no view on this matter and leaves it entirely

  up to the national ordering standards to make provision for this and similar

  cases.

 

 A technical recommendation could work along the

 following lines:

 

 In order to maintain the difference between a letter with umlaut and the same

 letter with trema:

 

  * Mark up the distinction between the two diacritics through a higher level

  protocol if this distinction is deemed necessary in a particular context

 

  * Decompose the string at least with regard to the ambiguous letters

 

  * Map the markup + combining trema combination to a character in the private

  use area and treat that character as a combining diacritic for ordering

  purposes

 

  * Tailor the template on that assumption.
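
 The four steps above might be sketched in Python as follows. This is a toy
 illustration under stated assumptions, not an implementation of 14651 or of
 DIN 5007: the <trema>...</trema> markup, the private use code point U+E000,
 the weight table and the function names prepare and sort_key are all
 hypothetical choices made for this example. An unmarked diaeresis is read
 as an umlaut.

```python
import re
import unicodedata

DIAERESIS = "\u0308"   # COMBINING DIAERESIS, read as umlaut by default
TREMA_PUA = "\uE000"   # assumption: private use stand-in for a marked-up trema

# Assumption: the higher level protocol wraps a trema letter as <trema>ë</trema>.
TREMA_MARKUP = re.compile(r"<trema>(.*?)</trema>")

# Toy secondary weights: umlaut directly after the bare letter, trema after
# the other diacritics listed here. Not the actual 14651 template.
SECONDARY = {
    DIAERESIS: 1,   # umlaut
    "\u0301": 2,    # acute
    "\u0300": 3,    # grave
    "\u0302": 4,    # circumflex
    TREMA_PUA: 5,   # trema
}

def prepare(text):
    """Decompose the text and replace marked-up diaereses with TREMA_PUA."""
    def mark(match):
        inner = unicodedata.normalize("NFD", match.group(1))
        return inner.replace(DIAERESIS, TREMA_PUA)
    # NFD leaves the private use character untouched.
    return unicodedata.normalize("NFD", TREMA_MARKUP.sub(mark, text))

def sort_key(text):
    """Two-level key: base letters first, diacritic weights second.

    Only the marks in SECONDARY are handled; if a letter carries several
    of them, the last one wins. Both are simplifications of this sketch.
    """
    primary, secondary = [], []
    for ch in prepare(text):
        if ch in SECONDARY:
            if secondary:
                secondary[-1] = SECONDARY[ch]
        else:
            primary.append(ch.lower())
            secondary.append(0)
    return (primary, secondary)

# ä as umlaut sorts directly after a; ä as trema sorts after á.
print(sorted(["á", "ä", "a"], key=sort_key))
# ['a', 'ä', 'á']
print(sorted(["<trema>ä</trema>", "á", "a"], key=sort_key))
# ['a', 'á', '<trema>ä</trema>']
```

 The markup stays in the stored strings and is only resolved when the sort
 key is built, so the interchanged text remains plain UCS without private
 use characters.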

 

 The author is open to any other suggestion.

 

 Ideally, such suggestions should be generic enough to be applicable in comparable

 cases such as may arise with regard to other cultural practices.