ISO/IEC JTC1 SC22 WG21 P2491R0
Author: Jens Maurer
Target audience: SG16, LEWG
2021-11-15

P2491R0: Text encodings follow-up

1 Abstract

This paper discusses a number of seemingly small, but crucial design decisions in P1885 "Naming Text Encodings to Demystify Them" by Corentin Jabot and Peter Brett that, in the view of the author, go in the wrong direction.

In short,

2 Paper history

R0: initial revision

3 Use cases for the std::text_encoding facility

In all of these cases, the character set (i.e. the set of individual characters) supported in a specific context is indirectly defined by the encoding, but not explicitly specified.

4 Building blocks

Text encodings are used in a variety of situations:

4.1 C++ ordinary and wide literal encodings

[lex.charset] specifies in the current working draft:
A code unit is an integer value of character type (6.8.2). Characters in a character-literal [...] or in a string-literal are encoded as a sequence of one or more code units [...]; this is termed the respective literal encoding. The ordinary literal encoding is the encoding applied to an ordinary character or string literal. The wide literal encoding is the encoding applied to a wide character or string literal.
Then, [lex.string] p10.1 specifies
The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences (5.13.3), and universal-character-names (5.3) is encoded to a code unit sequence using the string-literal’s associated character encoding.
Thus, an encoding for ordinary and wide literals in C++ relates a sequence of characters with a sequence of integer values of the respective character type (char or wchar_t).

4.2 IANA list of character sets

The IANA (Internet Assigned Numbers Authority) maintains a registry of encodings (called "character sets") at https://www.iana.org/assignments/character-sets/character-sets.xhtml, as established by RFC 2978.

As described by RFC2978:

The term "charset" (referred to as a "character set" in previous versions of this document) is used here to refer to a method of converting a sequence of octets into a sequence of characters.
Thus, an encoding in the IANA registry relates a sequence of octets with a sequence of characters.

4.3 Unicode

Unicode provides the concept of an encoding form for the relationship between a sequence of characters (specifically, a sequence of code points) and a sequence of integer values. Unicode further provides the concept of an encoding scheme for the relationship between a sequence of characters and a sequence of octets.

Regrettably, the specified encoding forms and encoding schemes have overlapping naming; "UTF-16" refers both to an encoding form and an encoding scheme.

UTF-8

ISO 10646:2020 section 10.2 specifies the encoding form as follows:
UTF-8 is the UCS encoding form that assigns each UCS scalar value to an octet sequence of one to four octets, as specified in table 2.
The encoding scheme is defined as follows in section 11.2:
The UTF-8 encoding scheme serializes a UTF-8 code unit sequence in exactly the same order as the code unit sequence itself.
Thus, for UTF-8, the code units are octets and those octets also constitute the encoding scheme. This encoding does not depend on endianness (byte order in the object representation of an integer) at all.

UTF-16

ISO 10646:2020 section 10.3 specifies the encoding form called "UTF-16" as follows:
UTF-16 is the UCS encoding form that assigns each UCS scalar value to a sequence of one to two unsigned 16-bit code units, as specified in table 4.
The encoding scheme called "UTF-16" is specified in section 11.5 as follows:
The UTF-16 encoding scheme serializes a UTF-16 code unit sequence by ordering octets in a way that either the less significant octet precedes or follows the more significant octet. In the UTF-16 encoding scheme, the initial signature read as <FE FF> indicates that the more significant octet precedes the less significant octet, and <FF FE> the reverse. The signature is not part of the textual data. In the absence of signature, the octet order of the UTF-16 encoding scheme is that the more significant octet precedes the less significant octet.
The "initial signature" is otherwise known as a byte order mark (BOM).

The Unicode standard version 14.0.0 specifies in section 3.10:

UTF-16 encoding scheme: The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format.

[...]

In the UTF-16 encoding scheme, an initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark; it is used to distinguish between the two byte orders. An initial byte sequence <FE FF> indicates big-endian order, and an initial byte sequence <FF FE> indicates little-endian order. The BOM is not considered part of the content of the text.

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

Note the caveat of an undefined "higher-level protocol", which does not exist in ISO 10646.

In either standard, there are also encoding schemes UTF-16LE and UTF-16BE that do not interpret a signature (byte order mark) at all, but use the given big-endian or little-endian layout unconditionally.

UTF-32

The specification of UTF-32 is analogous to UTF-16. There is no provision for endianness other than big-endian or little-endian.

4.4 iconv

POSIX

iconv is a transcoding function specified by POSIX:
size_t iconv(iconv_t cd, char **restrict inbuf,
       size_t *restrict inbytesleft, char **restrict outbuf,
       size_t *restrict outbytesleft);
The conversion descriptor (the first argument) is created using iconv_open:
iconv_t iconv_open(const char *tocode, const char *fromcode);
with the following specification:
The iconv_open() function shall return a conversion descriptor that describes a conversion from the codeset specified by the string pointed to by the fromcode argument to the codeset specified by the string pointed to by the tocode argument. [...]

Settings of fromcode and tocode and their permitted combinations are implementation-defined.

As a non-normative note, iconv says:

The objects indirectly pointed to by inbuf and outbuf are not restricted to containing data that is directly representable in the ISO C standard language char data type. The type of inbuf and outbuf, char **, does not imply that the objects pointed to are interpreted as null-terminated C strings or arrays of characters. Any interpretation of a byte sequence that represents a character in a given character set encoding scheme is done internally within the codeset converters. For example, the area pointed to indirectly by inbuf and/or outbuf can contain all zero octets that are not interpreted as string terminators but as coded character data according to the respective codeset encoding scheme. The type of the data (char, short, long, and so on) read or stored in the objects is not specified, but may be inferred for both the input and output data by the converters determined by the fromcode and tocode arguments of iconv_open().
Thus,

GNU iconv

GNU iconv implements POSIX iconv as follows:

4.5 ICU

ICU also comes with an encoding converter; the list of supported aliases is published in the ICU documentation.

5 Special handling for UTF-16 and UTF-32

As described above, UTF-16 as an encoding scheme admits several different interpretations. P1885 elects to map the correct UTF16LE/BE encoding scheme identifier possibly returned from the std::text_encoding::wide_literal() function to UTF16. This is user-unfriendly for several reasons. Further, iconv, presumably one of the premier consumers of the object representation model (see below), was designed with the understanding that the encoding name also conveys the object type for each code unit (e.g. char or int or, presumably, wchar_t). That distinction is lost when both network data (in a char buffer) and wchar_t literals are expected to be described with the same std::text_encoding value.

It is conceivable to introduce a new enumerator UTF16NE that has the value of either of the existing enumerators UTF16LE or UTF16BE (as appropriate) and return that value from std::text_encoding::wide_literal() on e.g. Windows platforms. This approach, as well as an earlier approach in P1885 that returns either UTF16LE or UTF16BE, but never UTF16, would redundantly represent information about platform endianness in an unrelated part of the standard. Platform endianness should be handled exclusively by the existing targeted facility std::endian (see 26.5.8 [bit.endian]).

P1885 also elects to map UTF16LE/BE to UTF16 for the non-wide std::text_encoding::literal(). Since CHAR_BIT == 8 is required for this function, the ordinary literal encoding can never be UTF-16. If it were, two consecutive char elements would be used to represent a single code unit, but some char elements might have the value 0 without representing the null character. This is not a valid encoding per [lex.charset]. The mapping is thus superfluous for the result of std::text_encoding::literal().

Everything said above also applies analogously to UTF-32.

6 No special handling for UCS2

UCS-2 was effectively used on the Microsoft Windows (little-endian) platform for a decade or so before the platform switched to UTF-16.

The usage situation is approximately the same as that for UTF-16, yet P1885 does not even attempt to perform any mapping that could be viewed as removing endianness assumptions from the name. Adding to that, the IANA registry appears to define "UCS2" as big-endian, but does not make any allowance for a little-endian UCS-2 encoding scheme. This leaves the relevant (admittedly outdated) Microsoft Windows platforms conceptually unsupported.

7 Looking at the object representation breaks an abstraction barrier

The C++ object model carefully avoids considering the object representation. Where it must do so (e.g. for bit_cast), lots of care needs to be applied to properly deal with padding bits, partially uninitialized values, and other obscure situations.

I believe it is a mistake that P1885 talks about specifying the object representation by applying an encoding scheme. The object representation should never be in the focus of a user or the specification of a user-facing facility.

The following alternative model avoids talking about the object representation, naturally supports implementations with CHAR_BIT > 8 or with sizeof(wchar_t) == 1, and allows proper differentiation between literal encodings and network data.

Observations: This proposal does not limit the choice of encoding for the platform, but allows expressing all reasonable encodings even for fringe (but valid) abstract machine parameters such as CHAR_BIT >= 16 or sizeof(wchar_t) == 1.

8 Wording plan

Relative to P1885, the wording should be adjusted as follows: