CEN Guide to the Use of Character Sets in Europe - TC 304

8-Bit Character Sets - Concepts and terminology


Many of the 7-bit and 8-bit coded character sets in use today share a common structure. This structure, together with notation and terminology for referring to its various elements, is laid down in ISO/IEC 2022. Some familiarity with the main features of that structure is needed to read this guide. These are summarised in this section.

The Universal Multiple-Octet Coded Character Set (UCS) that has been developed more recently, specified in ISO/IEC 10646-1, lies outside the ISO/IEC 2022 structure. Detailed guidance on the structure of the UCS is found in another part of the Guide to the Use of Character Sets in Europe.


Basic principles of ISO/IEC 2022

The construction of a character code according to ISO/IEC 2022 is most simply explained by a mechanical analogy. It is like a typewriter that takes interchangeable typeheads ("golfballs"). A typewriter without a typehead can't actually type anything but its mechanisms are all in place. The non-printing keys, such as the space bar and backspace, still operate. It is only the printing characters that are missing.

The typehead itself is an inert object, but once placed on the typewriter then each key on the typewriter will print the character that is at a specific position on the typehead. Change the typehead and the typewriter prints different characters, but the relationship between keys and character positions does not change.

The role of the typewriter is taken in ISO/IEC 2022 by a code table. There is one code table for 7-bit codes and another for 8-bit codes. Each code table provides a linkage between character positions and bit combinations. Certain of these positions are already assigned, for the SPACE, DELETE and ESCAPE characters, but the vast majority of character positions are empty. The table is waiting for its equivalent of a typehead.

The role of the typehead is taken in ISO/IEC 2022 by a code element of graphic characters. Such a code element contains a pattern of graphic characters that can be overlaid on (part of) the empty code table. Once overlaid, it provides a graphic character at each of the overlaid positions. The combination of code table and code element completes (part of) the code; the character at a particular position is coded by the bit combination assigned to that position.

The next few paragraphs expand on this model in a more precise way.

Code tables

Layout and notation

ISO/IEC 2022 defines the structure of a code table separately for 7-bit and for 8-bit codes. A 7-bit code table consists of 128 positions arranged in 8 columns and 16 rows. An 8-bit code table consists of 256 positions arranged in 16 columns and 16 rows. The rows and columns are numbered starting from 0, and by convention a leading zero is included where necessary to make all row and column numbers have two (decimal) digits. A diagram to illustrate the 8-bit case is given below.

The notation xx/yy, e.g. 01/15, is used to label the table position that is in column xx and row yy. The same notation is used to identify a bit combination, with yy being the decimal number whose binary form consists of the least significant four bits of the bit combination and xx being similarly related to the most significant four (for an 8-bit code) or three (for a 7-bit code) bits. This notation provides a natural correspondence between positions in the code table and bit combinations of the code.
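The correspondence between the xx/yy notation and bit combinations can be sketched in a few lines of Python. This is only an illustration; the function names `position_label` and `bit_combination` are invented for this example and do not come from ISO/IEC 2022.

```python
def position_label(value: int, bits: int = 8) -> str:
    """Return the xx/yy label of a bit combination given as an integer."""
    if bits not in (7, 8):
        raise ValueError("only 7-bit and 8-bit codes are covered here")
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the code table")
    column = value >> 4   # most significant 3 bits (7-bit) or 4 bits (8-bit)
    row = value & 0x0F    # least significant 4 bits
    return f"{column:02d}/{row:02d}"


def bit_combination(label: str) -> int:
    """Return the integer value of the bit combination labelled xx/yy."""
    column, row = (int(part) for part in label.split("/"))
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must lie between 0 and 15")
    return (column << 4) | row
```

For instance, `position_label(0x1F)` gives the label 01/15 used as the example above, and `bit_combination("04/01")` gives the value 0x41.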

Structure

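The 8-bit code table is divided into four named areas:

CL area: columns 00 and 01 (32 positions)
GL area: columns 02 to 07 (96 positions)
CR area: columns 08 and 09 (32 positions)
GR area: columns 10 to 15 (96 positions)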

The 7-bit code table is similarly divided, but it only has CL and GL areas. The 8-bit code table is illustrated in this diagram:

[Figure: The 8-bit code table]

ISO/IEC 2022 requires that the bit combinations in the CL and CR areas shall be used to represent control functions or be left unused. Only those in the GL and GR areas may be used to represent graphic (printing) characters.
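As a rough sketch, the area to which a bit combination belongs can be derived from its column number alone. The example assumes the ISO/IEC 2022 division of the 8-bit table into CL (columns 00 and 01), GL (columns 02 to 07), CR (columns 08 and 09) and GR (columns 10 to 15). The function names are hypothetical.

```python
def area(value: int) -> str:
    """Name the area of the 8-bit code table that holds this bit combination."""
    if not 0 <= value <= 0xFF:
        raise ValueError("not an 8-bit value")
    column = value >> 4
    if column <= 1:
        return "CL"
    if column <= 7:
        return "GL"
    if column <= 9:
        return "CR"
    return "GR"


def may_represent_graphic(value: int) -> bool:
    """Only positions in the GL and GR areas may carry graphic characters."""
    return area(value) in ("GL", "GR")
```

In a 7-bit code only the first two cases can arise, since no value exceeds column 07.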

Certain characters have fixed assignments in both the 7-bit and 8-bit code tables, as follows:

SPACE (SP): position 02/00
DELETE (DEL): position 07/15
ESCAPE (ESC): position 01/11

These are also shown in the diagram. The reasons behind the assignments for SPACE and DELETE are described in the section on ASCII in the historical introduction.

Escape sequences

The third of the characters with fixed assignments is the ESCAPE character. This is also in the position in which it was put during the design of ASCII. However, the reason that the ESCAPE character is so important as to require permanent assignment is more recent.

As the development of coded character sets has progressed, there has been an increasing need to code control information that contains parameters. To achieve this, the concept of a control character such as CARRIAGE RETURN (CR) has given way to the more general concept of a control function. The coding of a control function is introduced by a distinctive bit combination, but it is followed by further bit combinations that pass parameter information in coded form. The syntax of these further combinations ensures that they are self-delimiting, i.e. that the end of the coding of the control function may be identified by a suitable algorithm. The ESCAPE character is one such introducer. The complete sequence of bit combinations that represents a control function coded in this manner is known as an escape sequence.

More details of the coding of escape sequences are given in the section of this guide on control functions.
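The self-delimiting property can be illustrated with a short Python sketch. It relies on a detail not stated in this section, namely the general ISO/IEC 2022 form of an escape sequence: ESCAPE, then zero or more intermediate bytes from column 02, then one final byte in the range 03/00 to 07/14. The function name is hypothetical and error handling is kept minimal.

```python
ESC = 0x1B  # position 01/11


def read_escape_sequence(data: bytes, start: int = 0) -> bytes:
    """Return the complete escape sequence that begins at data[start]."""
    if start >= len(data) or data[start] != ESC:
        raise ValueError("no ESCAPE character at this position")
    i = start + 1
    # Skip intermediate bytes (02/00 to 02/15), if any.
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1
    # The first byte in the range 03/00 to 07/14 ends the sequence.
    if i < len(data) and 0x30 <= data[i] <= 0x7E:
        return data[start:i + 1]
    raise ValueError("escape sequence is incomplete or malformed")
```

Called on the bytes `b"\x1b(Babc"`, the function returns the three bytes `b"\x1b(B"` and leaves `abc` to be interpreted as ordinary data.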

Code elements

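ISO/IEC 2022 constructs a complete code from a selection of the following code elements:

C0 and C1 code elements of control characters
G0, G1, G2 and G3 code elements of graphic characters
other control functions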

All of these elements may be present in either a 7-bit or an 8-bit code. Each of these types of code element will now be considered in more detail.

Code elements G0, G1, G2 and G3 of graphic characters

The code elements G1, G2 and G3 may each provide assignments for either 94 or 96 character positions. A set with 94 positions would provide assignments for positions 02/01 to 07/14 of the GL area or 10/01 to 15/14 of the GR area, i.e. excluding the positions assigned to SP and DEL in the GL area and the two corresponding shaded positions in the diagram above for the GR area. A set with 96 positions would provide assignments for all the 96 positions of either the GL or GR area. The code element G0 is similar but only the 94-position option is permitted.
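The difference between the two sizes of code element can be made concrete by listing the positions each one covers. This is a minimal sketch; the function name and the use of integer byte values are choices made for this illustration only.

```python
def element_positions(size: int, area: str) -> list[int]:
    """List the bit combinations covered by a 94- or 96-position code element."""
    base = {"GL": 0x20, "GR": 0xA0}[area]      # 02/00 or 10/00
    if size == 96:
        return list(range(base, base + 96))    # the whole area
    if size == 94:
        # Leave out the first and last position of the area
        # (SP and DEL in GL, 10/00 and 15/15 in GR).
        return list(range(base + 1, base + 95))
    raise ValueError("a code element has either 94 or 96 positions")
```

A 94-position element in GL therefore runs from 0x21 (02/01) to 0x7E (07/14), and a 96-position element in GR from 0xA0 (10/00) to 0xFF (15/15).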

Here is a diagram of a 94-position code element that is suitable for use as any of G0 to G3. It is in fact the ASCII character set:

[Figure: A 94-position code element]

The process described above as overlaying is known technically, in ISO/IEC 2022 terminology, as invocation. After being invoked, the code element concerned is said to have GL or GR shift status, as the case may be. Clearly at most two of the code elements G0 to G3 may be invoked simultaneously in an 8-bit code, one in each of the GL and GR areas, and at most one in a 7-bit code where there is no GR area. The mechanism of invocation is described in more detail below.

For the code element illustrated, when it has GL shift status the character "A" is represented by the bit combination 04/01. When it has GR shift status it is represented by the bit combination 12/01.
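The effect of shift status on coding can be sketched as follows. The example models the ASCII code element as a mapping from each character to its index (1 to 94) within the element, and adds that index to the first position of the area in which the element is invoked. The names `ASCII_ELEMENT` and `encode` are hypothetical.

```python
# Index of each graphic character within the 94-position element:
# "!" has index 1 and "~" has index 94.
ASCII_ELEMENT = {chr(0x20 + i): i for i in range(1, 95)}

AREA_BASE = {"GL": 0x20, "GR": 0xA0}   # positions 02/00 and 10/00


def encode(char: str, shift_status: str) -> int:
    """Return the bit combination for char when the element has the given shift status."""
    return AREA_BASE[shift_status] + ASCII_ELEMENT[char]


def label(value: int) -> str:
    """Format a bit combination in xx/yy notation."""
    return f"{value >> 4:02d}/{value & 0x0F:02d}"
```

With these definitions, `label(encode("A", "GL"))` yields 04/01 and `label(encode("A", "GR"))` yields 12/01, matching the values given above. The two bit combinations differ only in the most significant bit.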

More details of the use of the code elements of graphic characters are given elsewhere in this guide.

Code elements C0 and C1 of control characters

Control characters have a name and an identifying acronym, but no graphic representation. Examples of control characters are BACKSPACE (BS), BELL (BEL), START OF HEADING (SOH), SINGLE-SHIFT 2 (SS2) and ESCAPE (ESC). They are a special case of a more general concept, the control function, as explained above concerning escape sequences.

The code elements C0 and C1 each provide assignments to control characters for the 32 character positions of either the CL or CR area. If the code has a C0 code element then this is permanently invoked in the CL area. A C0 code element is required to have the ESCAPE character in position 01/11 so that its invocation does not affect the availability or coding of this control character.

If an 8-bit code has a C1 code element, it would normally be permanently invoked in the CR area. This is not possible for a 7-bit code since there is no CR area in the code table. Instead, the characters of the C1 code element are represented in a 7-bit code by means of an escape sequence. This representation is also permitted for an 8-bit code, as an alternative to invocation in the CR area. In a particular code, only one of the two alternatives is permitted. The choice should form part of the specification of an 8-bit code.
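The two alternative representations of a C1 control character can be sketched as below. The sketch relies on a detail not spelled out in this section: in ISO/IEC 2022 the escape sequence for a C1 character consists of ESCAPE followed by a single byte from column 04 or 05, which corresponds to the position in column 08 or 09 that the character would occupy in the CR area. The function names are hypothetical.

```python
ESC = 0x1B  # position 01/11


def c1_as_escape_sequence(c1_byte: int) -> bytes:
    """Convert a C1 character coded in the CR area to its two-byte escape sequence."""
    if not 0x80 <= c1_byte <= 0x9F:
        raise ValueError("not a position in the CR area")
    return bytes([ESC, c1_byte - 0x40])   # column 08/09 becomes column 04/05


def c1_from_escape_sequence(sequence: bytes) -> int:
    """Convert a two-byte escape sequence back to the CR-area bit combination."""
    if len(sequence) != 2 or sequence[0] != ESC or not 0x40 <= sequence[1] <= 0x5F:
        raise ValueError("not an escape sequence representing a C1 character")
    return sequence[1] + 0x40
```

As an example, assuming the C1 set of ISO/IEC 6429, in which SINGLE-SHIFT 2 occupies position 08/14, `c1_as_escape_sequence(0x8E)` returns ESCAPE followed by 04/14, i.e. the bytes `b"\x1bN"`.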

Other control functions

A code with a full range of Cn and Gn code elements requires access to a substantial number of control functions. ISO/IEC 2022 makes provision for control functions to meet the following needs, and others besides:

locking shifts
single shifts
dynamic designation of code elements
announcement of facilities

Announcement of facilities permits the choice of particular options to be notified to the remote party to the communication, such as whether the characters of the C1 code element are to be coded in an 8-bit code by means of escape sequences or by invocation of the element to the CR area.

Further information about control functions of each of the above types is given in the section of this guide on control functions. See the entries there on locking shifts, single shifts, dynamic designation or announcement functions as required.

Although some of these control functions may be coded by a single control character in either the C0 or C1 code elements, many of them require the use of an escape sequence. The set of available control functions constitutes the final element of a code specification.

The presence of the ESCAPE character in both the 7-bit and 8-bit code tables, before any other code elements are designated, permits all of the Cn and Gn code elements to be designated dynamically. Since Cn code elements do not require separate invocation, the control functions they provide are immediately available for use. These in turn can be used to invoke the Gn code elements as required. The combined effect of all available facilities is to permit a complete code specification to be communicated to a remote party if required.

Repertoire of a code

It is sometimes convenient to be able to refer to the set of characters that can be represented by a code, in a manner abstracted from the details of that representation. This set of characters is known as the repertoire of the code.

The concept of a repertoire is more subtle than it may seem to be at first. Certain character set standards permit two or more characters to be combined in specified ways to create new characters that belong to the repertoire but which are not themselves represented in the code. It is this distinction between representation in, and representation by, a code that causes the subtlety.

One means of combining two characters is by use of the BACKSPACE control character to superpose two images. This is still permitted in the 7-bit code of ISO/IEC 646, but not in more recent character set standards. A more recent technique is to specify that certain characters of a code are non-spacing, so that superposition may be achieved without the use of BACKSPACE. The most significant use of non-spacing characters is that of ISO/IEC 6937. Non-spacing characters are in fact only one example of the more general concept of a combining character.

Formal definitions

To ensure precision, the character set standards provide formal definitions of the terms that they use. The following extract from ISO/IEC 2022 gives the definitions of terms used above in this discussion of concepts and terminology.

bit combination:
An ordered set of bits used for the representation of characters.
byte:
A bit string that is operated upon as a unit.
[Note that this definition permits 7-bit, 8-bit and even 16-bit bytes, although common parlance uses the term exclusively for 8 bits. Character set standards use the term "octet" when a restriction to 8 bits is intended.]
coded character set; code:
A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their bit combinations.
code table:
A table showing the character allocated to each bit combination in a code.
code extension:
The techniques for the encoding of characters that are not included in the character set of a given code.
combining character:
A member of an identified subset of a coded character set, intended for combination with the preceding or following graphic character, or with a sequence of combining characters preceded or followed by a non-combining character.
control character:
A control function the coded representation of which consists of a single bit combination.
control function:
An action that affects the recording, processing, transmission or interpretation of data, and that has a coded representation consisting of one or more bit combinations.
to designate:
To identify a set of characters that are to be represented, in some cases immediately and in others on the occurrence of a further control function, in a prescribed manner.
escape sequence:
A string of bit combinations that is used for control purposes in code extension procedures. The first of these bit combinations represents the control function ESCAPE.
graphic character:
A character, other than a control function, that has a visual representation normally handwritten, printed or displayed, and that has a coded representation consisting of one or more bit combinations.
to invoke:
To cause a designated set of characters to be represented by one or more bit combinations of a coded character set.
repertoire:
A specified set of characters that are each represented by one or more bit combinations of a coded character set.
