CEN Guide to the Use of Character Sets in Europe (TC 304)

Source: TC304 PT on the Guide
Date: 26 Oct 1998
Status: This draft was prepared for presentation at a Plenary meeting of CEN/TC304 in Brussels in November 1998 and for discussion at an open meeting on 25 November 1998. It was also distributed on paper as document N852 to all CEN/TC304 members.

Guide to the use of character sets in Europe


Table of Contents

1 Introduction

2 Characters and their coding

3 The character handling model

4 International character sets

5 European character sets

6 Procurement issues

7 Procurement clauses

8 Other reference material

Annex A 8-bit character sets

Annex B The Universal Character Set (UCS)


1 Introduction

There exist today a large number of standards and related specifications concerning character repertoires and their coding. They take the form of public as well as industry standards and are intended for a wide range of applications and uses. This can be very confusing to the non-expert user and to people involved in procurement.

The user of IT systems normally does not have to concern himself with these types of standards. However, there may be situations where he has to be able to express his needs for certain character repertoires necessary for his work; it may also happen that he, when involved in work together with other parties using other systems, needs to be able to interpret other people's specifications given in the form of reference to standards.

The procurer of IT systems should be able to specify his requirements in the form of reference to established standards.

The purpose of this report is to give guidance to users and procurers by explaining the purposes and relationships of the public standards in this domain. The proprietary standards which also exist will be mentioned in the appropriate context in the form of references, but not described in detail. The guide concentrates on European issues; thus character set standards for non-European languages will not be treated.

The text is presented on two levels. The first level, contained in the body of the report, provides a general coverage of character repertoires, coding and uses. The second level, contained in the two annexes, provides much more detailed, tutorial information.

2 Characters and their coding

2.1 Characters for writing and control characters

For the presentation of written text we use letters, digits and punctuation marks. Often we also use special symbols such as currency signs. All of these are called characters, and the collection of characters for a specific purpose, such as the presentation of text in a specific language, is called in the standardisation context a character repertoire. The most common type of repertoire is of course the alphabet of a language, complemented by the ten digits and a set of special characters.

In Europe, there are some 160 languages, many of which share a large number of characters (something which is fortunate for the specification of character repertoires). In CEN/TC 304 there is a separate activity on providing a catalogue of the alphabets of indigenous languages, information on which can be found at http://www.stri.is/tc304/......

For IT processing purposes, it is necessary to indicate within a data stream where some action is required, e.g. a carriage return or new line. Such actions are called control functions and do not normally have a graphical representation. Some 14 types of control function have been defined. Some control functions, such as the carriage return, are represented by a single control character that is distinct from any graphic character, whilst others are represented by a sequence of characters with a special introducing control character at the beginning of the sequence. The subsequent characters in such a sequence may be either graphic or control characters.

2.2 Coding

In IT systems a character is represented by a numeric code. A character repertoire and its corresponding set of codes is called a coded character set or just character set.

A letter with a diacritical mark may sometimes be represented by two character codes, the base letter and the free standing diacritical mark itself.
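The two-code representation can be illustrated with a minimal sketch. It assumes Python and its standard unicodedata module, neither of which is referred to by this guide, and it uses the UCS convention in which a combining mark follows the base letter; other standards may order or code the pair differently.

```python
import unicodedata

composed = "\u00e9"  # é coded as a single character
# Canonical decomposition: base letter followed by a combining mark.
decomposed = unicodedata.normalize("NFD", composed)

codes = [f"U+{ord(ch):04X}" for ch in decomposed]
names = [unicodedata.name(ch) for ch in decomposed]
print(codes)  # ['U+0065', 'U+0301']
print(names)  # ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
```

Both forms denote the same letter, so a system that compares or searches text has to treat them as equivalent.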

Coded character sets are used for different purposes in computer systems and the code structure may be different in these cases. For instance, a coded character set used for interchange purposes often needs codes to be reserved for control characters so that they may be included in the interchange data stream. However, a coded character set used for processing purposes may not need such reserved areas which instead are often used to represent more graphic characters. Examples of the latter are PC code pages.

Early IT systems had severe size limitations. Therefore, the character codes had to be kept small. The earliest codes occupied 5 bits; later 7 and 8 bits have been used. These have provided the coding capacity for 32, 128 and 256 characters respectively. Even with an 8-bit representation it was not possible to support all European languages in a single coded character set. Thus coded character sets proliferated. As long as an application (and the character set it used) was restricted in use to a single country or geographical region, there was not a problem, since a working character set could be chosen to support the limited number of languages for that region. However, due to the requirements of international trade and the increase in travel, this limitation of code size has caused severe problems of application interoperability.
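The capacity figures and the resulting proliferation can be checked with a short sketch. It assumes Python and its built-in codecs for three parts of ISO/IEC 8859; the sample words are illustrative only.

```python
# Coding capacity of 5-, 7- and 8-bit codes.
capacity = {bits: 2 ** bits for bits in (5, 7, 8)}

# One sample word per language (illustrative choices).
samples = {
    "French": "\u00e9l\u00e8ve",                    # élève
    "Polish": "\u0141\u00f3d\u017a",                # Łódź
    "Greek": "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1",  # Ελλάδα
}
charsets = ["iso8859_1", "iso8859_2", "iso8859_7"]

def can_encode(text, charset):
    """Return True if every character of text exists in the 8-bit set."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

support = {
    cs: [lang for lang, word in samples.items() if can_encode(word, cs)]
    for cs in charsets
}
print(capacity)
print(support)
```

In this sketch each 8-bit set covers exactly one of the three sample words, which is the situation that led to many parallel character sets.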

In order to avoid a very large number of private character set specifications, many with overlapping scope and leading to major interoperability problems, standardisation activity has been needed with the aim of reducing the number of character set specifications. It should be noted that industry specifications have also been developed, most notably by IBM, Apple and Microsoft.

Another solution to the problem caused by the limitation of character set size is to increase the capacity of 7 and 8 bit codes by the introduction of code extension facilities. These have also been subject to standardisation.

Modern IT systems no longer have the earlier restrictions in size, and a solution is now available which uses a code space large enough to accommodate every language in the world. Nevertheless, since the old solutions seem likely to continue to exist until perhaps 2025, the old problem may remain acute for some time.

3 The character handling model

Figure 1 below illustrates the character handling model. It represents a simplified IT system which consists of two computer systems connected by a communications link. One computer system is shown as having an input capability whilst the other is shown as having an output capability. The purpose of this model is to show the different aspects of the handling of characters by users and computer systems and thus introduce basic concepts that will be used in the following sections of the report.

Figure 1 - Character Handling Model

3.1 The input function

The input function is the means by which character based information is entered into a computer system. Figure 1 uses a keyboard for input, but any device capable of entering character data may be used. For the user, the main issue is whether or not he can choose, as an option within the computer system, a character repertoire for input which is sufficient for his needs. For the procurer, the main issue is how to specify such a requirement.

Note that the representation of the input text on the monitor screen is a result of both the processing function, e.g. a word processor, and an output function (here: Coding OP1). Both must of course be able to fulfil the requirements of the user, or he will not be able to see on the screen the intended output, even if the keyboard input functions are satisfactory. More on that in the later clauses.

The main keyboard standard in this arena is ISO/IEC 9995-3. As far as this guide is concerned, keyboard standards are related to character set standards but are not central to the theme of this guide. CEN TC304 has a separate activity on European keyboard standardisation. Information on this activity may be found at http://www.stri.is/tc304/......

3.2 The processing function

The processing function provides for the manipulation of character based data according to the needs of an application.

Once input, the data is expressed in some internal computer system code (here: coding P1). In addition, other information may be associated with each character such as colour, emphasis level (e.g. bold, italic) and font (e.g. Times Roman). Such information is usually intended for some document processing function. Thus the system internal code structure may be quite complex. However, at its heart is the character code itself. Document handling and processing is outside the scope of this guide.

Most commercially available computer systems do not use standardised character sets for internal representation of character data, rather they use proprietary character sets or industry specifications. For the user, the main issue is whether or not he can choose, as an option within the computer system, a character set for processing which has a repertoire sufficient for his needs. For the procurer, the main issue is how to specify such a requirement.

The input repertoire may not contain all the characters of the processing repertoire. However, all characters in the input repertoire will be contained in the processing repertoire.
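Whether a repertoire is sufficient for a given need can be checked mechanically. The following is a minimal sketch in Python; the two repertoires and the helper name missing_characters are hypothetical and are not taken from any standard.

```python
def missing_characters(required, repertoire):
    """Return, sorted by code, the characters of `required` that `repertoire` lacks."""
    return sorted(set(required) - set(repertoire))

# Hypothetical repertoires: the input repertoire is contained in the
# processing repertoire, as the model requires.
input_repertoire = "abcdefghijklmnopqrstuvwxyz\u00e9\u00e8\u00f6"
processing_repertoire = input_repertoire + "\u0142\u017a\u03b1"

contained = not missing_characters(input_repertoire, processing_repertoire)

needed = "\u0142\u00f3d\u017a"  # the word "łódź"
lacking_for_input = missing_characters(needed, input_repertoire)
lacking_for_processing = missing_characters(needed, processing_repertoire)
print(contained, lacking_for_input, lacking_for_processing)
```

A procurer could apply the same kind of check to sample texts from each language that the system has to support.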

A particularly common requirement on the processing function is that it be able to order character based data. Standards for ordering, while related to character set standards, are not central to the theme of this guide. In CEN/TC 304, there is a separate activity on European standardisation of ordering, information on which may be found at http://www.stri.is/tc304/......

3.3 The Interchange function

The interchange function allows character based data to be exchanged between computer systems. Since the character sets for processing are generally defined by industry specifications, the processing character sets of two different computer systems between which data are to be exchanged are likely to be different. A character set for interchange is therefore needed (here: Coding IC). This is where character set standards become very relevant: they reduce the number of interchange character sets, which could otherwise be one for every possible pair of different computer systems. The user needs to be able to identify an interchange character set and the procurer needs to be able to specify requirements in this area, taking into account all foreseeable needs.

The main problem here is that there may not be a one to one correspondence between the characters in the character set for processing and the characters in the character set for interchange. In such cases less than trivial conversion functions are required at the boundaries between processing and interchange.

Two very different conversion functions may be identified here. First, there is the situation where a character in the processing repertoire is not contained in the interchange repertoire. A character conversion is required. The requirements on such a conversion function may vary. For instance, the requirement may be that it is reversible (i.e. it must be possible to reconstruct the original character from its converted form). An example of this would be where the character é is replaced by the HTML representation &eacute;. In another case, a non-reversible transformation may be all that is required. Here, characters that are defined in both character sets (i.e. Coding P1 and IC at one end and/or Coding IC and P2 at the other end) are passed through whilst other characters may be substituted by some common substitute character, or some approximation may be made (e.g. by replacing é by e and so on). Another requirement may be that, should a non-reversible substitution be made, it must be shown that this has taken place (e.g. by a marker where this has occurred).
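The three kinds of character conversion described above can be sketched as follows. The sketch assumes Python, takes the interchange repertoire to be the 7-bit IRV (ASCII), and uses hypothetical function names; it is an illustration, not a recommended implementation.

```python
import html
import unicodedata
from html.entities import codepoint2name

def to_entities(text):
    """Reversible: replace characters outside ASCII by HTML entities."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")  # escape the introducer so the result stays reversible
        elif ord(ch) < 128:
            out.append(ch)
        elif ord(ch) in codepoint2name:
            out.append(f"&{codepoint2name[ord(ch)]};")
        else:
            out.append(f"&#{ord(ch)};")
    return "".join(out)

def substitute(text, marker="?"):
    """Non-reversible: replace each character outside ASCII by a marker."""
    return "".join(ch if ord(ch) < 128 else marker for ch in text)

def approximate(text):
    """Non-reversible: drop diacritical marks, keeping the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = "\u00e9l\u00e8ve"  # élève
print(to_entities(word))  # &eacute;l&egrave;ve
print(substitute(word))   # ?l?ve
print(approximate(word))  # eleve
```

Only the first form can be undone by the receiver, here with html.unescape. The marker form at least shows where a substitution took place, whereas the approximation leaves no trace.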

The second type of conversion is where a character is contained in both the processing and the interchange repertoire but it is coded differently. Here a code conversion is needed. For instance, the processing repertoire may be coded according to a PC code page whilst the interchange repertoire may be coded according to ISO/IEC 10646-1.
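A code conversion of this kind may be sketched as follows. The sketch assumes Python, uses PC code page 850 as an example of a processing coding, and uses Python's utf-16-be codec as a stand-in for the two-octet form of ISO/IEC 10646-1 (for characters such as é the two give the same octets).

```python
def convert(data, source_coding, target_coding):
    """Re-code data: the characters stay the same, only their codes change."""
    return data.decode(source_coding).encode(target_coding)

pc_code = "\u00e9".encode("cp850")  # é in PC code page 850
ucs_code = convert(pc_code, "cp850", "utf-16-be")

print(pc_code.hex())   # 82
print(ucs_code.hex())  # 00e9
```

Because é exists in both repertoires, the conversion is lossless and can be reversed by swapping the two coding names.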

Often, both character and code conversions are required.

Another characteristic of conversion is whether or not the effects of the conversion are externally visible. A character conversion where, for instance, é is replaced by &eacute; in the sending system and &eacute; is replaced by é in the receiving system, is not externally visible. On the other hand, a character conversion of ö to oe in a sending system which is not reversed (since this conversion is not reversible) in the receiving system will become visible in the receiving system once oe is output to a screen.

Conversion is related to character set standards but is not central to the theme of this guide. In CEN/TC 304 there is a separate activity on conversion which is developing a model and is examining various conversion techniques. Information may be found at http://www.stri.is/tc304/......

Another type of conversion is needed when the output characters are represented in a different script from the input characters. Transliteration is then required. It should be noted that transliteration is dependent not only on the scripts concerned but also on the languages. Thus the rules for transliteration from Russian are different if the target language is German than if it is French.
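The dependence on the target language can be shown with a small sketch in Python. The two tables are illustrative assumptions covering only the letters of one name; they follow commonly seen German and French spellings and are not taken from any ISO/TC 46 standard.

```python
# Illustrative, incomplete tables (assumptions of this sketch).
TABLES = {
    "German": {
        "\u0427": "Tsch",  # Ч
        "\u0435": "e",     # е
        "\u0445": "ch",    # х
        "\u043e": "o",     # о
        "\u0432": "w",     # в
    },
    "French": {
        "\u0427": "Tch",
        "\u0435": "e",
        "\u0445": "kh",
        "\u043e": "o",
        "\u0432": "v",
    },
}

def transliterate(text, target_language):
    """Replace each character using the table for the target language."""
    table = TABLES[target_language]
    return "".join(table.get(ch, ch) for ch in text)

name = "\u0427\u0435\u0445\u043e\u0432"  # Чехов
print(transliterate(name, "German"))  # Tschechow
print(transliterate(name, "French"))  # Tchekhov
```

The same Cyrillic input yields two different Latin outputs, so the target language has to be known before the conversion is applied.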

In Europe there are five recognised scripts for the indigenous languages: Latin, Greek, Cyrillic, Armenian and Georgian. In CEN/TC 304 there is not at this time any activity devoted to transliteration. Standards in this area are developed by ISO/TC 46 (Information and Documentation).

Transliteration may take place at the boundary between the processing function and the interchange function or at the boundary between the processing function and the output function.

3.4 The output function

The output function is the reverse, or complement, of the input function. It is the process of converting the internal coded representations of the characters (here: Coding P2) to a visual representation on a display or hard copy device. The output character sets may be different depending on the output medium (here: Coding OP2 or OP3).

The handling of output to physical devices is usually an internal computer system function. Application programs, such as word processing packages, usually have the ability to control also the rendition of the output. This includes the use of fonts, both type and size, and also the use of various levels of emphasis and colour. In some cases, information which specifies particular values of these attributes is carried with the individual character codes right from the time of input. As already stated, these features are outside the scope of this guide.

The main problem is when the output character set is smaller than the character set for processing. The computer system software has to substitute one or more characters for those which cannot be represented by the output function.

If this causes loss of information (i.e. the conversion is not reversible), it is known as fall-back. The computer system may provide means by which this information can be viewed (for instance with a "reveal codes" function), but this is only of use if there is some indication on the output device that such a substitution has actually taken place (e.g. by the use of a specific substitute character, a special level of emphasis or colour).

Another method is a one to many character mapping from which the identity of the original character can be deduced. As an example, élève may be converted to e/le\ve. However, such a technique can cause problems with the tabular formatting of information.
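The one to many mapping in this example can be sketched as follows. The sketch assumes Python and a hypothetical two-entry table in which the acute accent becomes "/" and the grave accent becomes "\".

```python
import unicodedata

# Hypothetical fall-back table: combining mark -> ASCII stand-in.
MARKS = {"\u0301": "/", "\u0300": "\\"}

def fall_back(text):
    """Map each accented letter to its base letter followed by a stand-in."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(MARKS.get(ch, ch) for ch in decomposed)

word = "\u00e9l\u00e8ve"  # élève
converted = fall_back(word)
print(converted)                  # e/le\ve
print(len(word), len(converted))  # 5 7
```

The change in length from 5 to 7 characters is what disturbs tabular formatting. The original can only be deduced reliably if "/" and "\" do not otherwise occur in the text.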

Conversion is not central to the theme of this guide. In CEN TC304 there is a separate activity on conversion which is developing a model and is examining various conversion techniques. Information may be found at http://www.stri.is/tc304/......

3.5 Cultural issues

Each country or region in Europe (and elsewhere) has cultural conventions which affect the manner in which character sets are used by application programs. Such conventions include, but are not restricted to, ordering (already mentioned), numeric formatting, monetary formatting, date and time conventions, affirmative and negative answers, the use of special characters and personal name rules.

Cultural conventions primarily affect the processing function but there may be an impact on the other functions. They are related to the use of character set standards but are not central to the theme of this guide. In CEN TC304 there is a separate activity on cultural conventions which is developing and maintaining a registry of such conventions. Information may be found at http://www.stri.is/tc304/......

4 International character sets

This section describes standardised character sets (i.e. both repertoires and coding) used primarily for input (mainly for the specification of repertoires) and interchange. As already stated, for processing, most IT systems use proprietary standards.

Because of the need to combine different standards and also in order to make the standardisation itself more consistent and effective, a common platform standardising the principles for code structure, code extension, implementation and for registration has been established. The standards which define that platform are called infrastructure standards.

The standards in this area may be classified in the following way:

4.1 7 and 8 bit infrastructure standards

Note: a profile is a choice of options from one or more standards, constituting a more narrow form of specification than the base standard(s) it refers to.

The registration authority allocates values for the escape sequences to be used with the coded character set that is registered. The registration authority is the Information Processing Society of Japan/Information Technology Standards Commission of Japan (IPSJ/ITSCJ) and the register may be accessed on-line at http://www.itscj.ipsj.or.jp/ISO-IR/. All standardised coded character sets are registered but not all registered coded character sets are standardised. There are over 200 registrations. A coded character set may be identified in text, such as a procurement specification, by the sequence <ISO-IR nnn> where nnn is the registration number.
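References of the form <ISO-IR nnn> are easy to extract from a specification text by program. The following minimal sketch assumes Python; the sample sentence and the function name are hypothetical, and the registration numbers serve only as sample input.

```python
import re

# Matches the identification sequence <ISO-IR nnn>.
ISO_IR = re.compile(r"<ISO-IR (\d{1,3})>")

def registrations(text):
    """Return the registration numbers referred to in text, in order."""
    return [int(number) for number in ISO_IR.findall(text)]

clause = "The product shall support <ISO-IR 6> and <ISO-IR 100>."
print(registrations(clause))  # [6, 100]
```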

Most users and procurers should not concern themselves overmuch with ISO/IEC 2022:1994 or ISO/IEC 4873:1991 since their provisions are usually called up by the referring standards. For this reason, they need not be referred to directly in procurement specifications. However, the sophisticated user with a complex requirement who may be mixing and matching the use of registered character sets rather than standardised character sets will probably need to become familiar with them. In this case, the procurer will also need to make reference to ISO/IEC ISP 12070-1. This ISP is also relevant in procurement situations which include the proposed use of products using OSI protocols.

Most users and procurers should not need to refer to the registration standard ISO 2375. However, they may well need to know the identities of one or more registered character sets.

4.2 7 and 8 bit character set standards

The set of characters which does not include the optional characters is called the Invariant Set. There are default allocations of characters to these optional code positions and when these are used, the character set is called the International Reference Version (IRV). The IRV is registered as ISO-IR 6. The IRV was changed in the 1991 edition of the standard with the CURRENCY SIGN being replaced by the DOLLAR SIGN. The standard also defines a default set of control characters to be used where applicable. The IRV (1991) together with the default control character set is identical to ASCII (American Standard Code for Information Interchange), a name in common usage in the industry. This character set does not contain any letters with accents or diacritical marks. It is therefore unsuitable for use in representing many European languages.

Each part of ISO/IEC 8859:nnnn contains two code tables. The first table is identical to the graphic characters of the IRV of ISO/IEC 646:1991 coded in 8 bits. The second table is specific to that part. Together the two tables define a self contained 8-bit coded character set of up to 191 characters. There are currently 15 parts to ISO/IEC 8859 and these fall into two categories. In the first category, containing 9 parts, the second table contains extra Latin characters. The title of each of these parts is Latin Alphabet No.n where n lies in the range 1 to 9. In the second category, the second table contains characters from a non-Latin script. The title of each of these parts is Latin/XXX where XXX is the name of the second script. Each part of ISO/IEC 8859 supports a range of languages. These are listed in Annex A in the section on ISO/IEC 8859.
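The two-table structure can be examined with a short sketch. It assumes Python and its built-in codecs for Parts 1 and 7 of ISO/IEC 8859, and it assumes that the first table occupies code positions 20 to 7E (hexadecimal) and the second table positions A0 to FF.

```python
# First table: 95 graphic positions shared with the IRV of ISO/IEC 646.
low = bytes(range(0x20, 0x7F))
# Second table: 96 positions whose content is specific to each part.
high_positions = range(0xA0, 0x100)

maximum = len(low) + len(high_positions)

# The first table decodes identically in every part.
same_low = low.decode("iso8859_1") == low.decode("iso8859_7") == low.decode("ascii")

# The same code in the second table means different characters.
code = bytes([0xE1])
in_part_1 = code.decode("iso8859_1")  # á (Latin)
in_part_7 = code.decode("iso8859_7")  # α (Greek)
print(maximum, same_low, in_part_1, in_part_7)
```

The last two lines show why a receiver has to know which part is in use: the code itself does not say.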

In products supporting ISO/IEC 8859, the code extension techniques of ISO/IEC 2022:1994 are not used. However, since the code tables are all registered, any one of them can form part of larger coded character sets when the code extension techniques are used, but it is incorrect to refer to parts of ISO/IEC 8859:nnnn for such implementations.

4.3 The universal character set (UCS standard)

It is obvious that the limitations of both 7 and 8 bit coding create problems in a world where texts in very different languages increasingly need to be communicated between computer systems. Some years ago, the radical step was taken to create a standard with enough code capacity to allow the coding of all alphabets in the world within one framework.

The standard defines a character set and a code structure of up to 4 octets. The main design aim was to provide sufficient code capacity so that the alphabets of all the known languages in the world together with a large range of special characters could be accommodated. It has to be said that four octets is an overkill, but this was kept to cater for any possible future requirement. It was decided to concentrate at first on populating the lowest two octets of the code space and this space is known as the Basic Multilingual Plane (BMP).

Since publication, the standard has been subject to a number of amendments, most of which add further language support to the BMP. It should be noted that some characters may be combined to generate further characters. This has the effect of conserving code space, something which is particularly relevant when there is a base character and a range of derived characters with added marks (combining characters).

It is planned that the BMP will contain, with the exception of the Chinese and Japanese ideographs, the characters, including combining characters, needed to write all the known living languages of the world. It certainly already contains all the characters needed for the vast majority of European languages. For this reason, a two octet as well as a four octet representation is specified for use when the application environment only makes use of the BMP. Other coded representations are specified to cater for situations where more than the BMP is used but where better efficiency than use of the four octet form is needed. More detailed technical information may be found in Annex B.
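The difference between the coded representations can be seen in a minimal sketch. It assumes Python, whose utf-16-be and utf-32-be codecs are used here as stand-ins for the two octet and four octet forms (they give the same octets for the BMP characters shown). UTF-8 is included as one example of a further coded representation; treating it as one of the forms meant above is an assumption of this sketch.

```python
samples = ["A", "\u00e9", "\u0141", "\u03b1"]  # A, é, Ł, α: all in the BMP

forms = {}
for ch in samples:
    forms[ch] = {
        "two octet": ch.encode("utf-16-be").hex(),
        "four octet": ch.encode("utf-32-be").hex(),
        "utf-8": ch.encode("utf-8").hex(),
    }

for ch, coded in forms.items():
    print(f"U+{ord(ch):04X}", coded)
```

For BMP characters the four octet form simply prefixes the two octet form with two zero octets, which is why the shorter form is attractive when only the BMP is used.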

4.4 Control functions

5 European character sets

In order to provide specifications particularly suited for European applications, CEN has standardised implementations of ISO/IEC standards including subsets of the UCS standard.

The four repertoires for the Latin script are:

This standard specifies how these repertoires are coded using the extension mechanisms of ISO/IEC 2022:1994 and how they can be combined. A set of combination options is given. It should be noted that not all combinations are permitted.

MES-2 is defined as a non-fixed subset (MES-2A). This permits automatic inclusion in the subset of any new characters added to the BMP whose code points fall within the collections which define this subset. It is also defined as a fixed subset (MES-2B) which makes it invariant over time.
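The difference between a non-fixed and a fixed subset can be sketched as follows. The sketch assumes Python; the single collection range and the snapshot below are invented for illustration and are not the actual definition of MES-2A or MES-2B.

```python
# Hypothetical collection, given as an inclusive range of code points.
COLLECTIONS = [(0x20A0, 0x20CF)]

# Hypothetical snapshot of the characters allocated when the fixed
# subset was defined (here: the first twelve positions of the range).
FIXED_SUBSET = frozenset(range(0x20A0, 0x20AC))

def in_non_fixed_subset(code_point):
    """Non-fixed: anything allocated inside the collections belongs."""
    return any(low <= code_point <= high for low, high in COLLECTIONS)

def in_fixed_subset(code_point):
    """Fixed: only the characters listed at definition time belong."""
    return code_point in FIXED_SUBSET

later_addition = 0x20AC  # a character added to the range afterwards
print(in_non_fixed_subset(later_addition))  # True
print(in_fixed_subset(later_addition))      # False
```

A product conforming to the fixed subset never has to change, whereas one conforming to the non-fixed subset has to follow later additions to the BMP.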

6 Procurement issues

The procurement issues described here are related to specific character handling functions (see section 3).

6.1 Repertoires and code structures

The user does not need to be concerned with the coded representation of the input character set since a character, once input, has a code defined by the internal processing. What may be an issue here is the specification of the input character repertoire, which must be able to support the requirements of the user.

For the same reason, neither does the user need to be concerned with the coded representation of the output character set. Again, the repertoire needs to be sufficient to support his requirements. In particular, there should be font support for that repertoire where that is applicable. Fall-back may be an issue if the product cannot support the full output repertoire needed.

The procurement issues will centre around the interchange character set. As in the other cases, the repertoire needs to be sufficient to support the requirements of the user. The main issue will be what code structure to use. The industry is in a state of transition. The 8-bit code structure of ISO/IEC 2022:1994 has been around for some considerable time and is the safe, but limited, option. The future lies with the multi-octet code structure of ISO/IEC 10646-1:1993, but it may have limited availability across platforms. This guide recommends that new procurements should specify the multi-octet code structure wherever practical. However, there may be cases where the new system has to operate in an environment which overwhelmingly uses 8-bit coding. If so, that code structure should be used. Also, there may be situations where the new system has to operate with both code structures and then the availability of dual support may have to be considered.

Next is the choice of the repertoire. If the 8-bit code structure is chosen for the interchange function, this guide recommends that, in Europe, a repertoire(s) is chosen from those identified in EN 1923:1998. For Latin script applications, there is a choice of four repertoires. It is unlikely that the IVL or IL repertoires will be sufficient for European operation and a choice of these alone is deprecated. For Western European language application, the BL repertoire should be sufficient. If Eastern European languages have to be taken into account, then the LL8 repertoire is more applicable. The Greek (BG) and Cyrillic (BC) repertoires can be added where appropriate, but these cannot be combined with LL8. It should be noted, however, that there may be significant limitations in these repertoires for some minority language operation and for some specialist technical applications. For these cases, it is recommended to seek specialist advice in the choice of coded character sets outside EN 1923:1998.

If the multi-octet code structure is chosen, this guide recommends that one of the Multilingual European Subsets (MES) be selected for the interchange repertoire. The subset named "Repertoire of EN 1923 LL8" is almost identical to the LL8 repertoire of EN 1923:1998. MES-1 provides a much more comprehensive coverage for European languages, whilst MES-2 guarantees support for nearly all indigenous European languages, specifically including minority languages.

6.2 Conversion and Fall-back

It should be noted that the repertoire selection indicated above is a minimum requirement. The supplier may go further. Note that the choices of repertoires for the four character handling functions should not be made independently of each other. For instance, choice of a large processing repertoire such as MES-2 may not be sensible if the interchange repertoire is BL within the 8-bit code structure. It is recommended that, if possible, the same repertoire is used throughout.

If this is not possible, the procurer must take into account the conversion arrangements that are necessary. They may be analysed by considering the interfaces between the character handling functions in the model.

An example of this would be where someone in the UK needed to communicate with someone in Greece in Greek and the UK system did not support the Greek script.

6.3 Code structure interoperability

As has already been described, the mismatch of repertoire and coding on either side of the interfaces in the character handling model can cause interoperability problems which are overcome by various conversion functions. A more fundamental issue of interoperability arises over the use of the code structures themselves. Such issues arise during interchange at the Processing/Interchange interface. For both the 8-bit code structure and the UCS code structure, many options have to be defined for the receiver to be able to interpret the incoming data stream correctly. For example, in the 8-bit code structure, how does the receiver know which 8-bit character set is in use? As another example, in the UCS code structure, how does the receiver know which coding form is being used? As a final example, how does the receiver know which code structure is in use - 8-bit or UCS?

If the 8-bit code structure is being used during interchange, to increase the chances of interoperability, either an a priori agreement has to be in place over the character sets and the code structure options to be used, or the code extension facilities of ISO/IEC 2022:1994 must be used in the exchange to establish such an understanding. For the former, a higher level protocol may determine the agreement before the interchange of character data takes place - an example would be Internet Mail. For the latter, ISO/IEC 12070-1:1996 contains detailed conformance requirements on the use of ISO/IEC 2022:1994. As has already been indicated, products rarely support the code extension facilities of ISO/IEC 2022:1994, so in most cases the user has to rely on the establishment of a priori agreements which, in some cases, may be automated.
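The Internet Mail case can be sketched as follows: the message header names the character set, and the receiver uses that name to interpret the octets of the body. The sketch assumes Python and its standard email package; the header value, the one-octet body and the function name are illustrative.

```python
from email.message import Message

def declared_charset(content_type, default="us-ascii"):
    """Return the character set named in a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset(default)

body = b"\xe1"  # one octet; its meaning depends on the agreed character set

charset = declared_charset('text/plain; charset="ISO-8859-7"')
text = body.decode(charset)
print(charset, text)  # iso-8859-7 α

# The same octet under a different agreement is a different character.
other = body.decode(declared_charset("text/plain; charset=ISO-8859-1"))
print(other)  # á
```

When the header names no character set, the sketch falls back to a default, which is itself a form of a priori agreement.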

For the multi-byte UCS code structure, the following information is needed by the receiver:

Again, either an a priori agreement is needed or the designation, identification and signature facilities as specified in ISO/IEC 10646-1:1993 must be used. It is believed that the supply industry is favouring the use of signatures. A signature is a sequence of octets sent at the start of an interchange to signal what code format will be used and what octet ordering is going to be used for the transmission of each multi-octet code representing a character. The other two items in the list can be signalled with the use of Escape sequences but are less important for interoperability.
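A receiver's handling of a signature can be sketched as follows. The sketch assumes Python, whose codecs module provides the signature octets as constants, and it maps each signature to a Python codec name as a stand-in for the corresponding UCS form; the function name is hypothetical.

```python
import codecs

# Longer signatures first: the four octet little-endian signature
# begins with the same octets as the two octet little-endian one.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def read_with_signature(data):
    """Return (coding, text) if data starts with a known signature."""
    for signature, coding in SIGNATURES:
        if data.startswith(signature):
            return coding, data[len(signature):].decode(coding)
    return None, None  # no signature: an a priori agreement is needed

print(read_with_signature(b"\xfe\xff\x00\xe9"))          # ('utf-16-be', 'é')
print(read_with_signature(b"\xff\xfe\xe9\x00"))          # ('utf-16-le', 'é')
print(read_with_signature(b"\x00\x00\xfe\xff\x00\x00\x00\xe9"))
```

The check is a heuristic: two octet little-endian data whose first character after the signature is NUL would be mistaken for the four octet form.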

7 Procurement clauses

This section contains sample text that may be used within procurement specifications for products that need to support character set operation. The clauses given apply to general purpose products intended for commercial and administrative applications. They may be tailored to the needs of the specific procurement. If the procurement is for specialist applications which may have unusual requirements or may have restricted capabilities, it is recommended that the procurement officer seek expert advice.

7.1 Input character repertoire

"The product shall support the input repertoire(s) xxxx specified in EN 1923:1998." where xxxx is one or more of the following:

BL, LL8, BG, BC, BL & BG, BL & BC, BL & BG & BC.

(Note that this guide deprecates the use of the repertoires IVL and IL.)

"The product shall support the input repertoire(s) XXXX specified in CEN CWA nnn:1998." where XXXX is one or more of Repertoire of EN 1923 LL8, MES-1, MES-2A and MES-2B.

Note: If more than one input repertoire is chosen, the following clause should be included in the procurement specification.

"The product shall be configured for use of the specified input repertoires together with applications as follows:

Repertoire(s) xxxx together with application(s) zzzz." where xxxx are the repertoires as specified in preceding clauses and zzzz is (are) the name(s) of application(s) as appropriate. The latter sentence is to be repeated as necessary.

7.2 Output character repertoire

"The product shall support the output repertoire(s) xxxx specified in EN 1923:1998." where xxxx is one or more of the following:

BL, LL8, BG, BC, BL & BG, BL & BC, BL & BG & BC.

(Note that this guide deprecates the use of the repertoires IVL and IL.)

"The product shall support the output repertoire(s) XXXX specified in CEN CWA nnn:1998." where XXXX is one or more of Repertoire of EN 1923 LL8, MES-1, MES-2A and MES-2B.

Note: If more than one output repertoire is chosen, the following clause should be included in the procurement specification.

"The product shall be configured for use of the specified output repertoires together with applications as follows:

Repertoire(s) xxxx together with application(s) zzzz." where xxxx are the repertoires as specified in preceding clauses and zzzz is (are) the name(s) of application(s) as appropriate. The latter sentence is to be repeated as necessary.

"If the processing repertoire contains characters that cannot be rendered by the output repertoire, each such character should be represented in such a way as to indicate that it is not the original character."

or

"If the processing repertoire contains characters that cannot be rendered by the output repertoire, each such character should be represented in such a way as to indicate that it is not the original character and also in such a way as to make it possible for the end user to identify that original character, e.g. by way of identifying its coded representation."
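
The difference between the two alternative clauses can be sketched as follows. The output repertoire used here (the graphic characters of ISO/IEC 8859-1) is only a stand-in and is not one of the repertoires of EN 1923; the marker "[?]", the notation <U+xxxx> and the function name are likewise assumptions made for this illustration.

```python
SUBSTITUTE = "[?]"  # hypothetical marker meaning "not the original character"

# Stand-in output repertoire: the graphic characters of ISO/IEC 8859-1.
OUTPUT_REPERTOIRE = {chr(c) for c in range(0x20, 0x7F)} | {
    chr(c) for c in range(0xA0, 0x100)
}


def render_for_output(text, repertoire, identify_original=False):
    """Replace every character outside the repertoire by a visible fall-back."""
    rendered = []
    for ch in text:
        if ch in repertoire:
            rendered.append(ch)
        elif identify_original:
            # Second clause: the coded representation identifies the original.
            rendered.append(f"<U+{ord(ch):04X}>")
        else:
            # First clause: only signal that a substitution took place.
            rendered.append(SUBSTITUTE)
    return "".join(rendered)


print(render_for_output("Łódź", OUTPUT_REPERTOIRE))
print(render_for_output("Łódź", OUTPUT_REPERTOIRE, identify_original=True))
```

With the second form the end user can look up the coded representation and so recover the character that could not be rendered.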

7.3 Processing character repertoire

"The product shall support the processing repertoire(s) xxxx specified in EN 1923:1998." where xxxx is one or more of the following:

BL, LL8, BG, BC, BL & BG, BL & BC, BL & BG & BC.

(Note that this guide deprecates the use of the repertoires IVL and IL.)

"The product shall support the processing repertoire(s) XXXX specified in CEN CWA nnn:1998." where XXXX is one or more of Repertoire of EN 1923 LL8, MES-1, MES-2A and MES-2B.

Note: If more than one processing repertoire is chosen, the following clause should be included in the procurement specification.

"The product shall be configured for use of the specified processing repertoires together with applications as follows:

Repertoire(s) xxxx together with application(s) zzzz." where xxxx are the repertoires as specified in preceding clauses and zzzz is (are) the name(s) of application(s) as appropriate. The latter sentence is to be repeated as necessary.

7.4 Interchange character repertoire

"The product shall support the interchange repertoire(s) xxxx specified in EN 1923:1998." where xxxx is one or more of the following:

BL, LL8, BG, BC, BL & BG, BL & BC, BL & BG & BC.

(Note that this guide deprecates the use of the repertoires IVL and IL.)

"The product shall support the interchange repertoire(s) XXXX specified in CEN CWA nnn:1998." where XXXX is one or more of Repertoire of EN 1923 LL8, MES-1, MES-2A and MES-2B.

Note: If more than one interchange repertoire is chosen, the following clause should be included in the procurement specification.

"The product shall be configured for use of the specified interchange repertoires together with applications as follows:

Repertoire(s) xxxx together with application(s) zzzz." where xxxx are the repertoires as specified in preceding clauses and zzzz is (are) the name(s) of application(s) as appropriate. The latter sentence is to be repeated as necessary.

"If the processing repertoire contains characters that are not contained in the interchange repertoire, each such character should be represented during interchange in such a way as to indicate to the receiving system that it is not the original character."

or

"If the processing repertoire contains characters that are not contained in the interchange repertoire, each such character should be represented during interchange in such a way as to indicate to the receiving system that it is not the original character and also in such a way as to make it possible for the end user to identify that original character, e.g. by way of identifying its coded representation."
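
One possible way of meeting the second clause when converting to an 8-bit interchange code is sketched below, using the error handlers built into the Python codecs. The choice of ISO/IEC 8859-1 as the interchange code and the two notations shown are assumptions made for this illustration; the notation actually used would itself have to be agreed between sender and receiver.

```python
def encode_for_interchange(text, codec="iso8859-1", style="backslash"):
    """Encode text, replacing unencodable characters by their code values."""
    handlers = {
        "backslash": "backslashreplace",   # e.g. \u0141 (hexadecimal)
        "numeric": "xmlcharrefreplace",    # e.g. &#321; (decimal)
    }
    return text.encode(codec, errors=handlers[style])


print(encode_for_interchange("Łódź"))
print(encode_for_interchange("Łódź", style="numeric"))
```

In both cases the characters that belong to the interchange repertoire are sent unchanged, while each of the others is replaced by a sequence that identifies its coded representation.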

There may be cases where the interchange repertoire contains characters that are not contained in the processing repertoire. If any particular fall-back or other conversion functions are requested, one of the following clauses should be included.

"If the interchange repertoire contains characters that are not contained in the processing repertoire, each such character should be represented in the latter repertoire in such a way as to indicate to the processing function that it is not the original character."

or

"If the interchange repertoire contains characters that are not contained in the processing repertoire, each such character should be represented in the latter repertoire in such a way as to indicate to the processing function that it is not the original character and also in such a way as to make it possible for the end user to identify that original character, e.g. by way of identifying its coded representation."

7.5 Additional requirements when using the 8-bit code structure for interchange

If an interchange repertoire has been selected from EN 1923:1998 and the 8-bit code structure is required for interchange, the following clause should be included in the procurement specification.

"For the specified interchange repertoires, the product shall support the 8-bit code structure requirements as specified in EN 1923:1998."

If the user is particularly concerned about the achievement of interoperability in the 8-bit code structure environment, the following clause should be included in the procurement specification.

"The product shall satisfy the conformance requirements of ISO/IEC ISP 12070-1:1996 for operation of the 8-bit code structure."

7.6 Additional requirements when using the multi-byte UCS code structure for interchange

The following requirements are specified in order to create an environment which maximises interoperability. The approach taken is to ask for a minimum requirement on senders and a maximum requirement on receivers.

If the Repertoire of EN 1923 LL8 or MES-1 is chosen as the interchange repertoire, the following clause should be added to the procurement specification.

"For sending, the product shall support at least the level-1 operation using at least the UCS-2 form as specified in ISO/IEC 10646-1:1993."

If the MES-2A or MES-2B is chosen as the interchange repertoire, the following clause should be added to the procurement specification.

"For sending, the product shall support the level-3 operation using at least the UCS-2 form as specified in ISO/IEC 10646-1:1993."
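
The practical difference between the levels is that level 1 does not allow combining characters, whereas level 3 places no restriction on their use. The following sketch is an approximation only: it treats as combining every character that the Python unicodedata module places in a general category beginning with "M", which is an assumption and not the list given in ISO/IEC 10646-1, and it ignores any other level-1 restrictions. The function name is hypothetical.

```python
import unicodedata


def is_level_1(text: str) -> bool:
    """Approximate check that text contains no combining characters."""
    return not any(unicodedata.category(ch).startswith("M") for ch in text)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE as one character
combined = "e\u0301"     # LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT

print(is_level_1(precomposed))
print(is_level_1(combined))
```

A sender restricted to level 1 would have to send the first form; the second form requires a receiver that supports combining characters.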

Regardless of which repertoire is specified, the following clause should be included.

"For receiving, the product shall support the level-3 operation using at least the UCS-2 form, the UCS-4 form and the UTF-8 transformation format as specified in ISO/IEC 10646-1:1993."

"For sending, the product shall support at least the normal ordering of octets for each character sent as specified in ISO/IEC 10646-1:1993."

"For receiving, the product shall support both the normal ordering and reverse ordering of octets for each character sent as specified in ISO/IEC 10646-1:1993."
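
The effect of octet ordering can be shown with a small sketch for the UCS-2 form, in which each character is represented by two octets. In the normal ordering the more significant octet is sent first. The function names are hypothetical.

```python
def ucs2_octets(ch: str, ordering: str = "normal") -> bytes:
    """Return the two octets of one character in the requested ordering."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("character cannot be represented in the UCS-2 form")
    return code.to_bytes(2, "big" if ordering == "normal" else "little")


def ucs2_character(octets: bytes, ordering: str) -> str:
    """Interpret two octets as one character, assuming the given ordering."""
    return chr(int.from_bytes(octets, "big" if ordering == "normal" else "little"))


sent = ucs2_octets("Ł")                      # octets 01 41 in normal ordering
print(sent.hex(" "))
print(ucs2_character(sent, "normal"))        # the original character
print(hex(ord(ucs2_character(sent, "reverse"))))  # a different code value
```

A receiver that assumes the wrong ordering obtains a valid but entirely different character, which is why the ordering has to be agreed or signalled.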

"For sending, the product shall support the use of the appropriate signatures as specified in ISO/IEC 10646-1:1993."

For receiving, all the appropriate signatures should be accepted so that the receiver may understand what format is being used and in which order character octets are being sent. The following clause should be included in the procurement specification.

"For receiving, the product shall support the use of all signatures as specified in ISO/IEC 10646-1:1993."
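
A receiver that accepts all signatures might be sketched as follows. The sketch assumes that the Python codecs named in the table decode the corresponding UCS forms correctly for the characters of interest; the names SIGNATURES, receive and agreed_codec are hypothetical. The four-octet signatures are tested before the two-octet ones because the reverse-ordered UCS-4 signature begins with the same octets as the reverse-ordered UCS-2 signature.

```python
# (signature octets, code format, octet ordering, Python codec assumed equivalent)
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4", "normal", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "UCS-4", "reverse", "utf-32-le"),
    (b"\xef\xbb\xbf", "UTF-8", "not applicable", "utf-8"),
    (b"\xfe\xff", "UCS-2", "normal", "utf-16-be"),
    (b"\xff\xfe", "UCS-2", "reverse", "utf-16-le"),
]


def receive(data: bytes, agreed_codec=None):
    """Return (code format, octet ordering, text) for received octets."""
    for octets, code_format, ordering, codec in SIGNATURES:
        if data.startswith(octets):
            # Remove the signature, then decode the remainder accordingly.
            return code_format, ordering, data[len(octets):].decode(codec)
    if agreed_codec is None:
        raise ValueError("no signature found and no a priori agreement given")
    return "as agreed", "as agreed", data.decode(agreed_codec)


print(receive(b"\xff\xfe" + "Łódź".encode("utf-16-le")))
print(receive(b"\xef\xbb\xbf" + "Łódź".encode("utf-8")))
```

Where no signature is present, the sketch falls back on the a priori agreement passed in by the caller and otherwise refuses to guess.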

8 Other reference material

Manual: standards for the electronic interchange of personal data: Part 5. Character sets

ISBN 90-5414-019-4

Published by the Directorate of Departmental Relations and Provision of Information

Directorate-General for Public Information

Ministry of the Interior

P.O. Box 20011

2500 EA The Hague

Netherlands

This manual was produced for the Ministry of the Interior in the Netherlands in 1995. It provides an overview of character sets and their standards which is somewhat longer than this guide. It makes certain recommendations targeted at the use of character sets for administrative purposes within the public sector in the Netherlands. It provides an alternative view to some of the issues on the use of character sets and further detailed technical information.

Comparisons of Standardized Character Sets for Europe

ISBN 91-7220-275-0

Published by Statskontoret

The Swedish Agency for Administrative Development

Box 2280, 103 17 Stockholm

Sweden

Orders may be directed by e-mail to publikations.service@statskontoret.se

This document was produced for Statskontoret in Sweden in 1996. It provides, in tabular form, a detailed comparison of character sets, both standardised and industry-defined. The character sets are mainly 8-bit coded sets, and the coverage is comprehensive, including most of the Latin-based sets likely to be encountered in Europe. It is a useful source of code page specifications, especially for the industry code pages.

Note to the reviewer and to TC304: which other European publications known to members of TC304 could provide further reading and should be referenced? I have left out the EPHOS guidance since it is not liked by TC304 and the EWOS PT01 report since it is now rather old.

Note to the reviewer and to TC304: should we be referencing industry publications such as the one below?

IBM Character Data Representation Library

Character Data Representation Architecture

Level 2 - Reference

Publication Number SC09-1390-01

This document was published by IBM in 1993. To quote "The overall objective of CDRA is to define a method of assigning and preserving the meaning and rendering of coded graphic characters through various stages of processing and interchange." This is a comprehensive architectural document which provides a basis for consistent implementation of character set support across a range of IBM platforms. It is very detailed but will reward the dedicated reader who is intent on researching character set issues in depth from an implementation point of view.

Annex A - 2022 code structure reference

- the 2022 EWOS Web material brought up to date.

Annex B - 10646 code structure reference

- the 10646 Web material of Graham Dixon brought up to date.