ISO/IEC JTC1 SC22 WG21
N4267 / EWG 119
Richard Smith
richard@metafoo.co.uk
2014-11-05

Adding u8 character literals

Wording

Change in 2.14.3 (lex.ccon):

character-literal:
        ~~' c-char-sequence '~~
        ~~u' c-char-sequence '~~
        ~~U' c-char-sequence '~~
        ~~L' c-char-sequence '~~
        encoding-prefix_opt ' c-char-sequence '
encoding-prefix: one of
        u8 u U L

[…]

Change in 2.14.3 (lex.ccon) paragraph 1 and split it into two paragraphs:

A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by ~~one of the letters~~ u8, u, U, or L, as in u8'w', u'y', U'z', or L'x', respectively.
A character literal that does not begin with u8, u, U, or L is an ordinary character literal~~, also referred to as a narrow-character literal~~. An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

Drafting note: the term "narrow-character literal" was not used anywhere else in the standard, and confusingly sometimes referred to literals of non-narrow-character type.
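
Illustration (not part of the proposed wording): a minimal sketch of the ordinary character literal rules above, assuming a conforming C++ compiler. The function name check_ordinary_literal is hypothetical and exists only for this example. The multicharacter case is left as a comment because it is conditionally-supported and compilers commonly warn about it.

```cpp
#include <cassert>
#include <type_traits>

// An ordinary character literal with a single representable c-char has type char.
static_assert(std::is_same<decltype('x'), char>::value,
              "'x' is expected to have type char");
static_assert(sizeof('x') == 1, "char has size 1 by definition");

// Conditionally-supported multicharacter literal: type int,
// implementation-defined value. Left commented out on purpose.
// int m = 'ab';

// Hypothetical helper: checks that the literal's value survives
// storage in a char object, as its type implies.
inline void check_ordinary_literal() {
    const char c = 'x';
    assert(c == 'x');
    assert(static_cast<int>(c) == static_cast<int>('x'));
}
```

The value of 'x' itself is not asserted, since it depends on the execution character set.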

Change in 2.14.3 (lex.ccon) paragraph 2 and split it into four paragraphs:

A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is in the C0 Controls and Basic Latin Unicode block). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.
A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed. A char16_t literal containing multiple c-chars is ill-formed.
A character literal that begins with the letter U, such as U'z', is a character literal of type char32_t. The value of a char32_t literal containing a single c-char is equal to its ISO 10646 code point value. A char32_t literal containing multiple c-chars is ill-formed.
A character literal that begins with the letter L, such as L'x', is a wide-character literal. A wide-character literal has type wchar_t. [Footnote: …] The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined. [ Note: The type wchar_t is able to represent all members of the execution wide-character set (see 3.9.1). ]. The value of a wide-character literal containing multiple c-chars is implementation-defined.
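
Illustration (not part of the proposed wording): a sketch of how the four prefixed forms behave under the paragraphs above, assuming a compiler that implements u8 character literals. The helpers fits_in_u8_literal and fits_in_u_literal are hypothetical names introduced here to restate the range conditions; they are not proposed library or language features. The ill-formed cases are shown only as comments.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper: a code point fits in a UTF-8 character literal only if
// it needs a single UTF-8 code unit (C0 Controls and Basic Latin block).
constexpr bool fits_in_u8_literal(char32_t code_point) {
    return code_point <= 0x7F;
}

// Hypothetical helper: a code point fits in a char16_t literal only if it is
// representable in a single 16-bit code unit (basic multi-lingual plane).
constexpr bool fits_in_u_literal(char32_t code_point) {
    return code_point <= 0xFFFF;
}

// u8'w': value is the ISO 10646 code point. This wording gives it type char;
// the type is not asserted here because later standard revisions changed it.
static_assert(u8'w' == 0x77, "U+0077");
static_assert(sizeof(u8'w') == 1, "a single code unit");

// u'y' and U'z': types char16_t and char32_t, values are code points.
static_assert(std::is_same<decltype(u'y'), char16_t>::value, "char16_t");
static_assert(u'\u00E9' == 0xE9, "BMP code point fits in 16 bits");
static_assert(std::is_same<decltype(U'z'), char32_t>::value, "char32_t");
static_assert(U'\U0001F600' == 0x1F600u, "any code point fits in char32_t");

// L'x': type wchar_t; the value depends on the execution wide-character set.
static_assert(std::is_same<decltype(L'x'), wchar_t>::value, "wchar_t");

// Ill-formed under the wording above (left commented out):
// auto a = u8'\u00E9';     // needs two UTF-8 code units
// auto b = u'\U0001F600';  // outside the basic multi-lingual plane
// auto c = u8'ab';         // multiple c-chars

// Runtime restatement of the range conditions for the examples above.
inline void check_prefixed_literals() {
    assert(fits_in_u8_literal(U'w'));
    assert(!fits_in_u8_literal(U'\u00E9'));
    assert(fits_in_u_literal(U'\u00E9'));
    assert(!fits_in_u_literal(U'\U0001F600'));
}
```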

Change in 2.14.3 (lex.ccon) paragraph 4:

[…] The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for literals with no prefix)~~, char16_t (for literals prefixed by 'u'), char32_t (for literals prefixed by 'U'),~~ or wchar_t (for literals prefixed by 'L'). [ Note: If the value of a character literal prefixed by u, u8, or U is outside the range defined for its type, the program is ill-formed. ]

Change in 2.14.5 (lex.string):

[…]

~~encoding-prefix:~~
        ~~u8~~
        ~~u~~
        ~~U~~
        ~~L~~

[…]