SC22/WG20
N992
Date: 2002-11-15
WG14 discussed document N969 at its April 2002 meeting and document N977 at its October 2002 meeting. During these discussions, the following basic criteria were considered important for outlining further discussion of additional character data types:
There is a consensus to call the new data types char16_t and char32_t. The names suggest that the widths of the new data types are well defined; the encoding of those data types is implementation-defined.
1.1 Simple approach with a prefix for literals
Using a one-letter prefix, similar to the notation L"str" for wide string literals, a string literal is written
u"str"
The literal is used to initialize an array of char16_t. The corresponding character constants are
u'c'
and have the type char16_t.
This proposal also covers a 32-bit type, using char32_t, U"str" and U'c'.
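The following is a minimal sketch of how the proposed literals might be used. It assumes a compiler that provides char16_t and char32_t through <uchar.h>, as C11 later standardized; this paper does not itself name a header. The names str16, str32, len16 and len32 are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>   /* assumption: char16_t and char32_t come from here (C11) */

/* Arrays initialized from the proposed string literal forms. */
const char16_t str16[] = u"str";
const char32_t str32[] = U"str";

/* The character constant u'c' is expected to have type char16_t. */
static_assert(sizeof(u'c') == sizeof(char16_t), "u'c' should be a char16_t");
static_assert(sizeof(U'c') == sizeof(char32_t), "U'c' should be a char32_t");

/* Hypothetical helper: count char16_t code units before the terminating zero. */
size_t len16(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: count char32_t code units before the terminating zero. */
size_t len32(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Because "str" contains only members of the basic character set, each array holds three code units plus a terminating zero, whatever encoding the implementation chooses.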
C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month." C99 subclause 6.4.5p5 specifies that wide string literals are initialized with a sequence of wide characters as defined by the mbstowcs function with an implementation-defined current locale.
There shall be a macro __STDC_UTF_16__ (or similar) to indicate that char16_t uses UTF-16. This also allows the use of UTF-16 in char16_t even if wchar_t uses a non-Unicode encoding. In certain cases the compile-time conversion to UTF-16 may be restricted to members of the basic character set and universal character names (\Unnnnnnnn and \unnnn) because for these the conversion to UTF-16 is defined unambiguously.
The encoding of char32_t can be defined in the same manner using __STDC_UTF_32__.
The encoding of the new data types and string literals becomes implementation-defined when the macro __STDC_UTF_nn__ is not defined.
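A program could test these macros before relying on a particular encoding. The sketch below assumes the macro names end up exactly as __STDC_UTF_16__ and __STDC_UTF_32__ (the text above says "or similar") and that the types come from <uchar.h>. The function and array names are hypothetical. It uses the universal character name \U00010000, whose UTF-16 form is the surrogate pair 0xD800 0xDC00.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>   /* assumption: char16_t and char32_t come from here (C11) */

/* U+10000 written as a universal character name, so the source encoding
   does not matter. */
const char16_t supp16[] = u"\U00010000";
const char32_t supp32[] = U"\U00010000";

#ifdef __STDC_UTF_32__
/* With UTF-32, one code unit plus the terminating zero. */
static_assert(sizeof supp32 == 2 * sizeof(char32_t), "expected one UTF-32 unit");
#endif

/* Returns 1 if the implementation promises UTF-16 for char16_t, else 0. */
int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the implementation promises UTF-32 for char32_t, else 0. */
int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the literals match the UTF-16 and UTF-32 forms of U+10000,
   0 if they do not, and -1 if either macro is absent, in which case the
   encoding is implementation-defined and nothing can be checked portably. */
int check_supplementary(void)
{
    if (!char16_is_utf16() || !char32_is_utf32())
        return -1;
    return sizeof supp16 / sizeof supp16[0] == 3   /* surrogate pair + zero */
        && supp16[0] == 0xD800
        && supp16[1] == 0xDC00
        && supp32[0] == 0x10000;
}
```

Returning -1 rather than failing reflects the rule above: without the macros, the program cannot assume anything about the code unit values.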
The new string literal formats (u"str" and U"str") should follow the same catenation rules as the existing L"str" strings: adjacent literals of the same format are catenated, and if one of the adjacent literals is a "narrow" string, the result is widened to the representation of the other string literal. Some examples:
u"a" u"b" → u"ab"    U"a" U"b" → U"ab"    L"a" L"b" → L"ab"
u"a" "b"  → u"ab"    U"a" "b"  → U"ab"    L"a" "b"  → L"ab"
"a" u"b"  → u"ab"    "a" U"b"  → U"ab"    "a" L"b"  → L"ab"
Any other catenations are implementation-defined (they might or might not be supported).
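The catenation rules can be exercised with a short sketch such as the one below. It again assumes the types come from <uchar.h>; the array names and the helper equal16 are hypothetical. Mixed catenations such as u"a" U"b" are deliberately left out, since the text above makes them implementation-defined.

```c
#include <assert.h>
#include <stddef.h>  /* wchar_t */
#include <uchar.h>   /* assumption: char16_t and char32_t come from here (C11) */

const char16_t same16[]  = u"a" u"b";  /* same format: u"ab" */
const char16_t mixed16[] = u"a" "b";   /* narrow literal is widened: u"ab" */
const char32_t mixed32[] = "a" U"b";   /* narrow literal is widened: U"ab" */
const wchar_t  mixedw[]  = L"a" "b";   /* existing C99 rule: L"ab" */

/* Both forms yield two code units plus the terminating zero. */
static_assert(sizeof mixed16 == sizeof same16, "widened result has the same size");

/* Hypothetical helper: returns 1 if two char16_t strings are identical. */
int equal16(const char16_t *a, const char16_t *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return *a == *b;
}
```

Comparing against u"ab" rather than against numeric values keeps the check valid even when the encoding of char16_t is implementation-defined.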
Last modified: Wed Nov 13 2002