In short, the std::text_encoding facility [...]

[lex.charset] specifies:

    A code unit is an integer value of character type (6.8.2). Characters in a character-literal [...] or in a string-literal are encoded as a sequence of one or more code units [...]; this is termed the respective literal encoding. The ordinary literal encoding is the encoding applied to an ordinary character or string literal. The wide literal encoding is the encoding applied to a wide character or string literal.

Then, [lex.string] p10.1 specifies:

    The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences (5.13.3), and universal-character-names (5.3) is encoded to a code unit sequence using the string-literal's associated character encoding.

Thus, an encoding for ordinary and wide literals in C++ relates a sequence of characters with a sequence of integer values of the respective character type (char or wchar_t).
As described by RFC 2978:

    The term "charset" (referred to as a "character set" in previous versions of this document) is used here to refer to a method of converting a sequence of octets into a sequence of characters.

Thus, an encoding in the IANA registry relates a sequence of octets with a sequence of characters.
Regrettably, the specified encoding forms and encoding schemes have overlapping naming; "UTF-16" refers both to an encoding form and an encoding scheme.
ISO/IEC 10646 defines the UTF-8 encoding form as follows:

    UTF-8 is the UCS encoding form that assigns each UCS scalar value to an octet sequence of one to four octets, as specified in table 2.

The encoding scheme is defined as follows in section 11.2:

    The UTF-8 encoding scheme serializes a UTF-8 code unit sequence in exactly the same order as the code unit sequence itself.

Thus, for UTF-8, the code units are octets, and those octets also constitute the encoding scheme. This encoding does not depend on endianness (byte order in the object representation of an integer) at all.
ISO/IEC 10646 defines the UTF-16 encoding form as follows:

    UTF-16 is the UCS encoding form that assigns each UCS scalar value to a sequence of one to two unsigned 16-bit code units, as specified in table 4.

The encoding scheme called "UTF-16" is specified in section 11.5 as follows:

    The UTF-16 encoding scheme serializes a UTF-16 code unit sequence by ordering octets in a way that either the less significant octet precedes or follows the more significant octet. In the UTF-16 encoding scheme, the initial signature read as <FE FF> indicates that the more significant octet precedes the less significant octet, and <FF FE> the reverse. The signature is not part of the textual data. In the absence of signature, the octet order of the UTF-16 encoding scheme is that the more significant octet precedes the less significant octet.

The "initial signature" is otherwise known as a byte order mark (BOM).
The Unicode standard version 14.0.0 specifies in section 3.10:

    UTF-16 encoding scheme: The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format. [...]

    In the UTF-16 encoding scheme, an initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark; it is used to distinguish between the two byte orders. An initial byte sequence <FE FF> indicates big-endian order, and an initial byte sequence <FF FE> indicates little-endian order. The BOM is not considered part of the content of the text.

    The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

Note the caveat of an undefined "higher-level protocol", which does not exist in ISO 10646.
In either standard, there are also encoding schemes UTF-16LE and UTF-16BE that do not interpret a signature (byte order mark) at all, but use the given big-endian or little-endian layout unconditionally.
iconv is a transcoding function specified by POSIX:

    size_t iconv(iconv_t cd, char **restrict inbuf, size_t *restrict inbytesleft,
                 char **restrict outbuf, size_t *restrict outbytesleft);

The conversion descriptor (the first argument) is created using iconv_open:

    iconv_t iconv_open(const char *tocode, const char *fromcode);

with the following specification:

    The iconv_open() function shall return a conversion descriptor that describes a conversion from the codeset specified by the string pointed to by the fromcode argument to the codeset specified by the string pointed to by the tocode argument. [...] Settings of fromcode and tocode and their permitted combinations are implementation-defined.
As a non-normative note, iconv says:

    The objects indirectly pointed to by inbuf and outbuf are not restricted to containing data that is directly representable in the ISO C standard language char data type. The type of inbuf and outbuf, char **, does not imply that the objects pointed to are interpreted as null-terminated C strings or arrays of characters. Any interpretation of a byte sequence that represents a character in a given character set encoding scheme is done internally within the codeset converters. For example, the area pointed to indirectly by inbuf and/or outbuf can contain all zero octets that are not interpreted as string terminators but as coded character data according to the respective codeset encoding scheme. The type of the data (char, short, long, and so on) read or stored in the objects is not specified, but may be inferred for both the input and output data by the converters determined by the fromcode and tocode arguments of iconv_open().

Thus, the data transcoded by iconv is not restricted to objects of type char; the char* parameters possibly point to objects of other integer types. P1885 maps the result of the std::text_encoding::wide_literal() function to UTF16.
This is user-unfriendly for the following reasons:

- A user cannot differentiate between UTF-16 network data (objects of type char; a BOM is expected to be present) and the UTF-16LE/BE text that is produced from a wide literal (objects of type wchar_t, without a BOM).

- iconv always creates a BOM when writing the UTF-16 encoding. If a user were to convert third-party text from e.g. UTF-8 to "UTF16" for use with std::wstring and string literals, BOMs are likely to end up in the middle of a string.

- iconv, presumably one of the premier consumers of the object representation model (see below), was designed with the understanding that the encoding name also conveys the object type for each code unit (e.g. char or int or, presumably, wchar_t). This distinction is lost when both network data (in a char buffer) and wchar_t literals are expected to be described with the same std::text_encoding value.
It is conceivable to introduce a new enumerator UTF16NE that has the value of either of the existing enumerators UTF16LE or UTF16BE (as appropriate) and return that value from std::text_encoding::wide_literal() on e.g. Windows platforms. This approach, as well as an earlier approach in P1885 that returns either UTF16LE or UTF16BE, but never UTF16, would redundantly represent information about platform endianness in an unrelated part of the standard. Platform endianness should be handled exclusively by the existing targeted facility std::endian (see 26.5.8 [bit.endian]).
P1885 also elects to map UTF16LE/BE to UTF16 for the non-wide std::text_encoding::literal(). Since CHAR_BIT == 8 is required for this function, the ordinary literal encoding can never be UTF-16. If it were, two consecutive char elements would be used to represent a single code unit, but some char elements might have the value 0 without representing the null character. This is not a valid encoding per [lex.charset]. The mapping is thus superfluous for the result of std::text_encoding::literal().
Everything said above also applies analogously to UTF-32.
The usage situation is approximately the same as that for UTF-16, yet P1885 does not even attempt to perform any mapping that could be viewed as removing endianness assumptions from the name. Adding to that, the IANA registry appears to define "UCS2" as big-endian, but does not make any allowance for a little-endian UCS-2 encoding scheme. This leaves the relevant (admittedly outdated) Microsoft Windows platforms conceptually unsupported.
When accessing the object representation of values (e.g. via std::bit_cast), lots of care needs to be applied to properly deal with padding bits, partially uninitialized values, and other obscure situations.
I believe it is a mistake that P1885 talks about specifying the object representation by applying an encoding scheme. The object representation should never be in the focus of a user or the specification of a user-facing facility.
The following alternative model avoids talking about the object representation, naturally supports implementations with CHAR_BIT > 8 or with sizeof(wchar_t) == 1, and allows proper differentiation between literal encodings and network data.

- An encoding relates a sequence of characters with a sequence of code unit values of a character type (char or wchar_t); each octet value of an IANA encoding is thus understood to be a code unit.
- A character type (char) is at least 8 bits and thus can hold the value of an octet.
- iconv (and likely other implementations) does not currently support the "WIDE.*" names. This can reasonably be expected to change when the names are standardized.
- [...] std::wstring if sizeof(wchar_t) > 1.
- [...] wchar_t strings if sizeof(wchar_t) == 1.
- [...] char strings if CHAR_BIT >= 16. There is no difference regarding the string literal encoding approach between a char with 16 bits and a wchar_t with 16 bits, regardless of whether the latter consists of one or two bytes.
- [...] CHAR_BIT >= 16 or sizeof(wchar_t) == 1.
- [...] std::text_encoding.
- [...] std::text_encoding::(wide_)literal() unless sizeof(char_type) == 1.
- [...] CHAR_BIT == 8 or sizeof(wchar_t) > 1 (if any).