ISO/IEC JTC1 SC22 WG21 N2249 = 07-0109 - 2007-04-19
Lawrence Crowl
This document replaces N2149 = 07-0009 - 2007-01-10.
Many users of C++ need to manipulate Unicode character strings. Unfortunately, there is no C++ standard means to do so.
The ISO C committee has addressed this issue extensively. See ISO/IEC TR 19769:2004 "Extensions for the programming language C to support new character data types" as described in draft report ISO/IEC JTC1 SC22 WG14 N1040 at http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf.
This proposal adopts their work, but with those changes necessary for effective use within C++. In particular, we propose new types to support overloading.
A separate proposal will address specializations for numeric_limits, character traits, basic strings, streams, and insertion operations.
See section 2.5 "Encoding Forms" in
The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0). The online version (printing prohibited) is at http://www.unicode.org/versions/Unicode5.0.0/.
See Annex C of ISO 10646-1, which is online at http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc.
See ISO/IEC 10646:2003, which is publicly available in several text and PDF files within a zip archive from http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip.
See UTF-8, UTF-16, UTF-32 & BOM.
The document ISO/IEC TR 19769 (WG14 N1040) provides motivation, new typedefs for the (at least) 16-bit and (at least) 32-bit character types, macros for reporting ISO 10646 encoding, character and string literals, mixed string concatenation, four library functions, and a new header with appropriate declarations.
The document ISO/IEC TR 19769 (WG14 N1040) can be adopted with few changes. Further changes are possible, but this proposal minimizes the changes to ensure maximum interoperability.
Define char16_t to be a distinct new type, that has the same size and representation as uint_least16_t. Likewise, define char32_t to be a distinct new type, that has the same size and representation as uint_least32_t.
[N1040 defined char16_t and char32_t as typedefs to uint_least16_t and uint_least32_t, which make overloading on these characters impossible.]
[The experiments on open-source software indicate that these identifiers are not commonly used, and when used, used in a manner consistent with the proposal.]
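The overloading argument can be illustrated with a minimal sketch. The function name kind is hypothetical and exists only for this illustration; the sketch assumes a compiler that already implements char16_t and char32_t as distinct types, as proposed here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical overload set: with typedefs, the first two declarations
// would collide with the last two; with distinct types, all four coexist.
inline const char* kind(char16_t)            { return "char16_t"; }
inline const char* kind(char32_t)            { return "char32_t"; }
inline const char* kind(std::uint_least16_t) { return "uint_least16_t"; }
inline const char* kind(std::uint_least32_t) { return "uint_least32_t"; }

// Distinct types, but the same size as the corresponding least-width types.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value, "distinct type");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value, "distinct type");
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t), "same size");
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "same size");
```

A call such as kind(u'a') selects the char16_t overload, while kind(std::uint_least16_t(1)) selects the integer overload.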
Add a new C++ header <cuchar> corresponding to the new C header <uchar.h>.
Clarify the handling of universal character names that do not fit within char16_t. In particular, the interaction with ISO 10646 UTF-16 is underspecified in the C proposal.
The C TR makes the encoding of char16_t and char32_t implementation-defined. It also provides macros to indicate whether or not the encoding is UTF. In contrast, this proposal requires UTF encoding.
To "Table 3 -- keywords", add char16_t and char32_t.
To the grammar, add
- character-literal:
- u' c-char-sequence '
- U' c-char-sequence '
In paragraph 1, edit
A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by one of the letters u, U, or L, as in u'y', U'z', or L'x', respectively. A character literal that does not begin with u, U, or L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.
To paragraph 2, edit
A character literal that begins with the letter u,
such as u'y',
is a character literal of type char16_t.
The value of a char16_t literal
containing a single c-char
is equal to its ISO 10646 code point value,
provided that the code point is representable with a single 16-bit code unit.
(That is, provided it is a basic multi-lingual plane code point.)
If the value is not representable within 16 bits,
the program is ill-formed.
A char16_t literal containing multiple c-chars
is ill-formed.
A character literal that begins with the letter U,
such as U'z',
is a character literal of type char32_t.
The value of a char32_t literal
containing a single c-char
is equal to its ISO 10646 code point value.
A char32_t literal containing multiple c-chars
is ill-formed.
A character literal that begins with the letter L,
such as L'x', is a wide-character literal.
A wide-character literal has type wchar_t.26)
The value of a wide-character literal containing a single c-char
has value equal to the numerical value of the encoding of the c-char
in the execution wide-character set.
The value of a wide-character literal containing multiple
c-chars is implementation-defined.
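The character literal rules above can be checked at compile time. The following is a minimal sketch, assuming an implementation of the proposed literals; the helper is_bmp is a hypothetical name introduced only for illustration.

```cpp
#include <cassert>
#include <type_traits>

// The prefix determines the type of the literal.
static_assert(std::is_same<decltype(u'y'), char16_t>::value, "u prefix gives char16_t");
static_assert(std::is_same<decltype(U'z'), char32_t>::value, "U prefix gives char32_t");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value,  "L prefix gives wchar_t");

// The value is the ISO 10646 code point.
static_assert(u'\u00E9' == 0x00E9, "BMP code point in a char16_t literal");
static_assert(U'\U0001F600' == 0x1F600, "non-BMP code point in a char32_t literal");

// Under the proposed wording, u'\U0001F600' is ill-formed, because the
// code point needs two 16-bit code units.

// Hypothetical helper: true if a code point fits in a single 16-bit code unit.
inline bool is_bmp(char32_t c) { return c <= 0xFFFF; }
```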
In paragraph 4, edit
The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for ordinary literals), char16_t (for literals prefixed by 'u'), char32_t (for literals prefixed by 'U'), or wchar_t (for wide literals).
To the grammar, add
- string-literal:
- u" s-char-sequence opt "
- U" s-char-sequence opt "
In paragraph 1, edit
A string literal is a sequence of characters (as defined in 2.13.2) surrounded by double quotes, optionally beginning with one of the letters u, U, or L, as in "...", u"...", U"..." or L"...", respectively. A string literal that does not begin with u, U, or L, is an ordinary string literal, also referred to as a narrow string literal. An ordinary string literal has type "array of n const char" and has static storage duration (3.7), where n is the size of the string as defined below, and is initialized with the given characters. A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type "array of n const char16_t" and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters. A single c-char may produce more than one char16_t in the form of surrogate pairs. A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type "array of n const char32_t" and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters. A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type "array of n const wchar_t" and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters.
In paragraph 3, replace
In translation phase 6 (2.1), adjacent string literals are concatenated. If a narrow string literal token is adjacent to a wide string literal token, the result is a wide string literal. If both string literals have the same prefix, the resulting concatenated string literal has that prefix. If one string literal has no prefix, it is treated as a string literal of the same prefix as the other operand. Any other concatenations are conditionally supported with implementation-defined behavior. Note that this concatenation is an interpretation, not a conversion. [ Example: Here are some examples of valid concatenations:

source       means      source       means      source       means
u"a" u"b"    u"ab"      U"a" U"b"    U"ab"      L"a" L"b"    L"ab"
u"a" "b"     u"ab"      U"a" "b"     U"ab"      L"a" "b"     L"ab"
"a" u"b"     u"ab"      "a" U"b"     U"ab"      "a" L"b"     L"ab"

-- end example ] Characters in concatenated strings are kept distinct. [ Example:
"\xA" "B"
contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB'). -- end example ]
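The concatenation rules can be exercised with a short sketch, assuming an implementation of the proposed prefixes. The variable name joined is illustrative only.

```cpp
#include <cassert>
#include <type_traits>

// Same prefix, or one operand without a prefix: the result keeps the prefix.
static_assert(std::is_same<decltype(u"a" u"b"), const char16_t (&)[3]>::value, "u + u");
static_assert(std::is_same<decltype(u"a" "b"),  const char16_t (&)[3]>::value, "u + narrow");
static_assert(std::is_same<decltype("a" U"b"),  const char32_t (&)[3]>::value, "narrow + U");
static_assert(std::is_same<decltype("a" L"b"),  const wchar_t (&)[3]>::value,  "narrow + L");

// Characters are kept distinct: two characters plus the terminator,
// not the single character '\xAB'.
static_assert(sizeof("\xA" "B") == 3, "\\xA and B stay separate");

// The unprefixed operand is interpreted as char16_t, not converted.
constexpr char16_t joined[] = u"a" "b";
```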
In paragraph 5, edit
Escape sequences and universal-character-names in string literals have the same meaning as in character literals (2.13.2), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. [Note: The size of a char16_t string literal is the number of code units, not the number of characters.] Within char32_t or char16_t literals, any universal-character-names must be within the range 0x0 to 0x10FFFF. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
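The size rules can be made concrete with one code point outside the basic multilingual plane. This is a minimal sketch, assuming an implementation of the proposed literals; code_units and pair are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of array elements in a string literal,
// including the terminating null.
template <typename T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) { return N; }

// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair.
constexpr char16_t pair[] = u"\U0001F600";

static_assert(code_units(U"\U0001F600") == 2, "one UTF-32 unit plus terminator");
static_assert(code_units(u"\U0001F600") == 3, "two UTF-16 units plus terminator");
static_assert(pair[0] == 0xD83D, "high surrogate");
static_assert(pair[1] == 0xDE00, "low surrogate");
```

The char16_t literal holds one character but three elements, which is the distinction the note draws between code units and characters.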
In paragraph 5, edit
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.1.1). Type wchar_t shall have the same size, signedness, and alignment requirements (3.9) as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <stdint.h>, called the underlying types.
The <stdint.h> header is from ISO C as proposed in document WG21 N1835 = 05-0095, and subsequently adopted into ISO/IEC TR 19768: C++ Library Extensions TR1.
In paragraph 7, edit
Types bool, char, char16_t, char32_t, wchar_t, and the signed and unsigned integer types are collectively called integral types.48) A synonym for integral type is integer type. The representations of integral types shall define values by use of a pure binary numeration system.49) ....
In paragraph 2, edit
A string literal (2.13.4) with no prefix, with u prefix, with U prefix, or with L prefix can be converted to an rvalue of type "pointer to char", "pointer to char16_t", "pointer to char32_t", or "pointer to wchar_t", respectively. In any case, the result is a pointer to the first element of the array. ....
In paragraph 1, edit
An rvalue of an integer type other than bool, char16_t, char32_t, or wchar_t whose integer conversion rank (4.13) is less than the rank of int can be converted to an rvalue of type int if int can represent all the values of the source type; otherwise, the source rvalue can be converted to an rvalue of type unsigned int.
In paragraph 2, edit
An rvalue of type char16_t, char32_t, or wchar_t (3.9.1) can be converted to an rvalue of the first of the following types that can represent all the values of its underlying type: int, unsigned int, long int, unsigned long int, long long int, or unsigned long long int. If none of the types in that list can represent all the values of its underlying type, an rvalue of type char16_t, char32_t, or wchar_t can be converted to an rvalue of its underlying type.
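The effect of the promotion rule can be observed through unary plus, which applies integral promotion to its operand. The following is a minimal sketch, assuming an implementation of the proposed types; the alias promoted_t is hypothetical. The exact promoted type depends on the platform, so the sketch checks only properties that the wording implies.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical alias: the type that results from integral promotion of T.
template <typename T>
using promoted_t = decltype(+T());

// The new character types are integral types.
static_assert(std::is_integral<char16_t>::value, "char16_t is integral");
static_assert(std::is_integral<char32_t>::value, "char32_t is integral");

// Promotion never yields the character type itself; it yields one of the
// listed integer types, or the underlying type.
static_assert(!std::is_same<promoted_t<char16_t>, char16_t>::value, "promoted");
static_assert(!std::is_same<promoted_t<char32_t>, char32_t>::value, "promoted");

// The promoted type must be able to hold every value of the source type.
static_assert(sizeof(promoted_t<char16_t>) >= sizeof(char16_t), "wide enough");
static_assert(sizeof(promoted_t<char32_t>) >= sizeof(char32_t), "wide enough");
```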
In paragraph 1, bullet 8, edit
The ranks of char16_t, char32_t, and wchar_t shall equal the rank of their underlying types (3.9.1).
In paragraph 10, bullet 4, footnote 59, edit
As a consequence, operands of type bool, char16_t, char32_t, wchar_t, or an enumerated type are converted to some integral type.
In paragraph 1, note 1, edit
[ Note: in particular, sizeof(bool), sizeof(char16_t), sizeof(char32_t), and sizeof(wchar_t) are implementation-defined.73) -- end note ]
To the grammar in paragraph 1, add
- simple-type-specifier:
- char16_t
- char32_t
To Table 8 "simple-type-specifiers and the types they specify", add
char16_t    "char16_t"
char32_t    "char32_t"
In paragraph 15, bullet 2, edit
If the destination type is an array of characters, an array of char16_t, an array of char32_t, or an array of wchar_t, and the initializer is a string literal, see 8.5.2.
In paragraph 1, edit
A char array (whether plain char, signed char, or unsigned char), char16_t array, char32_t array, or wchar_t array can be initialized by a string-literal (optionally enclosed in braces) with no prefix, with u prefix, with U prefix, or with L prefix, respectively; successive characters of the string-literal initialize the members of the array. ....
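A minimal sketch of the initialization rule follows, assuming an implementation of the proposed literals. The array names are illustrative only.

```cpp
#include <cassert>
#include <type_traits>

// Each array kind is initialized by a string literal with the matching prefix.
const char     narrow[] = "abc";
const char16_t a16[]    = u"abc";
const char32_t a32[]    = { U"abc" };   // braces are optional
const wchar_t  wide[]   = L"abc";

// A mismatched prefix, such as  char16_t bad[] = U"abc";  is ill-formed.

// Each array has three characters plus the terminating null.
static_assert(std::extent<decltype(narrow)>::value == 4, "narrow");
static_assert(std::extent<decltype(a16)>::value == 4, "char16_t");
static_assert(std::extent<decltype(a32)>::value == 4, "char32_t");
static_assert(std::extent<decltype(wide)>::value == 4, "wchar_t");
```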
In paragraph 3, note 1, edit
[ Note: the temporary object created for a throw-expression that is a string literal is never of type char*, char16_t*, char32_t*, or wchar_t*; that is, the special conversions for string literals from the types "array of const char", "array of const char16_t", "array of const char32_t", and "array of const wchar_t" to the types "pointer to char", "pointer to char16_t", "pointer to char32_t", and "pointer to wchar_t", respectively (4.2), are never applied to a throw-expression. -- end note ]
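The consequence of this note can be sketched as follows, assuming an implementation of the proposed literals. The function classify_throw is a hypothetical name; it reports which handler catches a thrown char16_t string literal.

```cpp
#include <cassert>

// Hypothetical helper: returns 1 if a pointer-to-non-const handler matches,
// 2 if the pointer-to-const handler matches, 3 otherwise.
inline int classify_throw() {
    try {
        throw u"oops";                    // exception object: const char16_t*
    } catch (char16_t*) {
        return 1;                         // never selected: const is not dropped
    } catch (const char16_t*) {
        return 2;                         // selected
    } catch (...) {
        return 3;
    }
}
```

Under the wording above, the call returns 2, because the array-to-pointer conversion yields a pointer to const char16_t and the special string literal conversion is not applied.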
In paragraph 4, edit
The strings components provide support for manipulating text represented as sequences of type char, sequences of type char16_t, sequences of type char32_t, sequences of type wchar_t, or sequences of any other "character-like" type. The localization components extend internationalization support for such text processing.
In paragraph 1, edit
character
in clauses 21, 22, and 27, means any object which, when treated sequentially, can represent text. The term does not only mean char, char16_t, char32_t, and wchar_t objects, but any value that can be represented by a type that provides the definitions specified in these clauses.
A char16-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type char16_t (3.9.1), optionally qualified by any combination of const and volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.
A null-terminated char16-character string, or NTC16S, is a char16-character sequence whose highest-addressed element with defined content has the value zero. [Footnote: Many of the objects manipulated by function signatures declared in <cuchar> are char16-character sequences or NTC16Ss.]
The length of an NTC16S is the number of elements that precede the terminating null char16 character. An empty NTC16S has a length of zero.
The value of an NTC16S is the sequence of values of the elements up to and including the terminating null character.
A static NTC16S is an NTC16S with static storage duration. [Footnote: A char16 string literal, such as u"abc", is a static NTC16S.]
A char32-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type char32_t (3.9.1), optionally qualified by any combination of const and volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.
A null-terminated char32-character string, or NTC32S, is a char32-character sequence whose highest-addressed element with defined content has the value zero. [Footnote: Many of the objects manipulated by function signatures declared in <cuchar> are char32-character sequences or NTC32Ss.]
The length of an NTC32S is the number of elements that precede the terminating null char32 character. An empty NTC32S has a length of zero.
The value of an NTC32S is the sequence of values of the elements up to and including the terminating null character.
A static NTC32S is an NTC32S with static storage duration. [Footnote: A char32 string literal, such as U"abc", is a static NTC32S.]
To table 12, add <cuchar>.
In paragraph 5, footnote 168, add <cuchar>.
Add paragraph 20,
Table 50 describes headers <cuchar> and <uchar.h>. The distinction is that <cuchar> defines the function names within namespace std and that <uchar.h> defines them at global scope.
Add Table 50,
Table 50 -- Headers <cuchar> and <uchar.h> synopsis

Type              Name(s)
Macro Names       __STDC_UTF_16__   __STDC_UTF_32__
Function Names    mbrtoc16   c16rtomb   mbrtoc32   c32rtomb
Add <cuchar> to table 38 under "Null-terminated sequence utilities".
The headers <cuchar> and <uchar.h> define macros and declare functions for use with at-least-16-bit and at-least-32-bit characters.
The headers <cuchar> and <uchar.h> shall define the macro __STDC_UTF_16__, and values of type char16_t shall be valid UTF-16 code units, as defined by ISO 10646.
The headers <cuchar> and <uchar.h> shall define the macro __STDC_UTF_32__, and values of type char32_t shall be valid UTF-32 code units, as defined by ISO 10646.
#include <cuchar>
size_t std::mbrtoc16(char16_t * pc16, const char * s, size_t n, mbstate_t * ps);
If s is a null pointer, the mbrtoc16 function is equivalent to the call:
mbrtoc16(NULL, "", 1, ps)
In this case, the values of the parameters pc16 and n are ignored.
If s is not a null pointer, the mbrtoc16 function inspects at most n bytes beginning with the byte pointed to by s to determine the number of bytes needed to complete the next multibyte character (including any shift sequences). If the function determines that the next multibyte character is complete and valid, it determines the value of the corresponding wide character and then, if pc16 is not a null pointer, stores that value in the object pointed to by pc16. If the corresponding wide character is the null wide character, the resulting state described is the initial conversion state.
Note: When n has at least the value of the MB_CUR_MAX macro, this case can only occur if s points at a sequence of redundant shift sequences (for implementations with state-dependent encodings).
#include <cuchar>
size_t std::c16rtomb(char * s, char16_t c16, mbstate_t * ps);
If s is a null pointer, the c16rtomb function is equivalent to the call
c16rtomb(buf, L'\0', ps)
where buf is an internal buffer.
If s is not a null pointer, the c16rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given by c16 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s. At most MB_CUR_MAX bytes are stored. If c16 is a null wide character, a null byte is stored, preceded by any shift sequence needed to restore the initial shift state; the resulting state described is the initial conversion state.
The c16rtomb function returns the number of bytes stored in the array object; this may be 0 (including any shift sequences). When c16 is not a valid wide character, an encoding error occurs: the function stores the value of the macro EILSEQ in errno and returns (size_t)(-1); the conversion state is unspecified.
#include <cuchar>
size_t std::mbrtoc32(char32_t * pc32, const char * s, size_t n, mbstate_t * ps);
If s is a null pointer, the mbrtoc32 function is equivalent to the call:
mbrtoc32(NULL, "", 1, ps)
In this case, the values of the parameters pc32 and n are ignored.
If s is not a null pointer, the mbrtoc32 function inspects at most n bytes beginning with the byte pointed to by s to determine the number of bytes needed to complete the next multibyte character (including any shift sequences). If the function determines that the next multibyte character is complete and valid, it determines the value of the corresponding wide character and then, if pc32 is not a null pointer, stores that value in the object pointed to by pc32. If the corresponding wide character is the null wide character, the resulting state described is the initial conversion state.
Note: When n has at least the value of the MB_CUR_MAX macro, this case can only occur if s points at a sequence of redundant shift sequences (for implementations with state-dependent encodings).
#include <cuchar>
size_t std::c32rtomb(char * s, char32_t c32, mbstate_t * ps);
If s is a null pointer, the c32rtomb function is equivalent to the call
c32rtomb(buf, L'\0', ps)
where buf is an internal buffer.
If s is not a null pointer, the c32rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given by c32 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s. At most MB_CUR_MAX bytes are stored. If c32 is a null wide character, a null byte is stored, preceded by any shift sequence needed to restore the initial shift state; the resulting state described is the initial conversion state.
The c32rtomb function returns the number of bytes stored in the array object; this may be 0 (including any shift sequences). When c32 is not a valid wide character, an encoding error occurs: the function stores the value of the macro EILSEQ in errno and returns (size_t)(-1); the conversion state is unspecified.
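The conversion functions described above can be exercised with a small round trip. The following is a minimal sketch, assuming a library that provides <cuchar> and a default "C" locale in which basic ASCII characters are single-byte multibyte characters; the helper round_trips is a hypothetical name, not part of the proposal.

```cpp
#include <cassert>
#include <climits>   // MB_LEN_MAX
#include <cstddef>
#include <cuchar>    // std::mbrtoc32, std::c32rtomb
#include <cwchar>    // std::mbstate_t

// Hypothetical helper: converts one single-byte multibyte character to
// char32_t and back, and reports whether the original byte is recovered.
inline bool round_trips(char ch) {
    const char src[1] = { ch };

    std::mbstate_t in_state = std::mbstate_t();   // initial conversion state
    char32_t c32 = 0;
    std::size_t consumed = std::mbrtoc32(&c32, src, 1, &in_state);
    if (consumed != 1) {
        return false;   // null character, incomplete input, or encoding error
    }

    char buf[MB_LEN_MAX];                         // upper bound on MB_CUR_MAX
    std::mbstate_t out_state = std::mbstate_t();
    std::size_t written = std::c32rtomb(buf, c32, &out_state);
    return written == 1 && buf[0] == ch;
}
```

The buffer uses MB_LEN_MAX because it is a compile-time constant that is never smaller than MB_CUR_MAX, the limit stated for the number of bytes stored.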
At the end of Subclause _lex.string: Change:, add
The type of a char16 string literal is changed from array of some-integer-type to array of const char16_t. The type of a char32 string literal is changed from array of some-integer-type to array of const char32_t.
Add section.
The types char16_t and char32_t are distinct types rather than typedefs to existing integral types.
Replace "18 C headers" with "18 C headers and 1 C technical report header".
To table 101, add
<uchar.h>