Document Number: P1139R2
Date: 2019-02-18
Audience: SG16, CWG
Author: R. Martinho Fernandes
Reply-to: cpp@rmf.io
Review of some editorial fixes following the recent update of the normative reference to ISO 10646 has unearthed a series of wording issues around the subject. This paper intends to fix those issues by rewording relevant paragraphs.
This paper addresses all of the following issues:
For example, \U99004141 and \U00110000. Neither of these designates a code point in ISO 10646, but the standard is silent about this, which makes the behaviour undefined by omission.
This paper addresses this by making such uses ill-formed, maintaining consistency with the current treatment of surrogate values (\U0000D800 is already ill-formed).
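As a non-normative illustration, the proposed rule can be sketched as a predicate over the hexadecimal value of a universal-character-name. The name is_valid_ucn_value is hypothetical and exists only for this example; it appears neither in the standard nor in the proposed wording.

```cpp
#include <cstdint>

// Hypothetical helper (illustration only): reports whether the hexadecimal
// value of a universal-character-name would be accepted under the proposed
// rule. ISO/IEC 10646 code points lie in 0x0-0x10FFFF (inclusive), and
// surrogate code points occupy 0xD800-0xDFFF (inclusive).
constexpr bool is_valid_ucn_value(std::uint32_t value) {
    return value <= 0x10FFFFu && !(value >= 0xD800u && value <= 0xDFFFu);
}

// The two examples from the text, plus the surrogate case.
static_assert(!is_valid_ucn_value(0x99004141u), "not an ISO/IEC 10646 code point");
static_assert(!is_valid_ucn_value(0x00110000u), "one past the last code point");
static_assert(!is_valid_ucn_value(0x0000D800u), "surrogate code point");
```

Under this sketch, the out-of-range values are rejected in the same way as surrogate values, which is the consistency the paper aims for.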
This paper addresses this by removing the need for this term.
This paper changes all the relevant wording to use U+ notation.
This paper moves such explanations to non-normative text, and clarifies some existing explanations.
In this description, text that should be deleted is marked red and struck out; text that should be added is marked green and underlined. Apply these changes on top of the current draft, N4800.
Edit 5.3 [lex.charset], paragraph 2 as follows.
2 The universal-character-name construct provides a way to name other characters.
hex-quad:
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
The character designated by the universal-character-name \UNNNNNNNN \U00NNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN that has U+NNNNNN as a code point short identifier; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN that has U+NNNN as a code point short identifier.
If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800-0xDFFF, inclusive) If a universal-character-name does not correspond to a code point in ISO/IEC 10646 or if a universal-character-name corresponds to a surrogate code point, the program is ill-formed. Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00-0x1F or 0x7F-0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed. [Note: ISO/IEC 10646 code points are within the range 0x0-0x10FFFF (inclusive). A surrogate code point is a value in the range 0xD800-0xDFFF (inclusive). A control character is a character whose code point is in either of the ranges 0x0-0x1F or 0x7F-0x9F (both inclusive).—end note]
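The additional restriction on universal-character-names outside character and string literals can be illustrated with a short, non-normative sketch. All function names below are hypothetical. The sketch assumes an ASCII-compatible view of the basic source character set, in which its graphical members are the printable characters 0x20-0x7E other than '$', '@' and '`', and its remaining members (horizontal tab, vertical tab, form feed, new-line) already fall within the control character range. It also assumes that the value has already been checked to be a non-surrogate code point.

```cpp
#include <cstdint>

// Control characters: code points in 0x0-0x1F or 0x7F-0x9F (both inclusive).
constexpr bool is_control_character(std::uint32_t cp) {
    return cp <= 0x1Fu || (cp >= 0x7Fu && cp <= 0x9Fu);
}

// Assumption for this sketch: the graphical members of the basic source
// character set are printable ASCII (0x20-0x7E) except '$' (0x24),
// '@' (0x40) and '`' (0x60).
constexpr bool is_basic_source_graphic(std::uint32_t cp) {
    return cp >= 0x20u && cp <= 0x7Eu &&
           cp != 0x24u && cp != 0x40u && cp != 0x60u;
}

// Hypothetical check for a universal-character-name that appears outside
// the c-char-sequence, s-char-sequence, or r-char-sequence of a literal.
// Precondition (not checked here): cp is a non-surrogate code point.
constexpr bool is_allowed_outside_literal(std::uint32_t cp) {
    return !is_control_character(cp) && !is_basic_source_graphic(cp);
}
```

For instance, under these assumptions U+0041 (a member of the basic source character set) and U+0085 (a control character) are rejected outside literals, while U+00E9 is accepted.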
Edit 5.13.3 [lex.ccon], paragraph 3 as follows.
3 A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is in the C0 Controls and Basic Latin Unicode block) can be encoded as a single UTF-8 code unit [Note: that is, provided it is in the range 0x0-0x7F (inclusive)—end note]. If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.
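To illustrate why the note can state the condition as a plain range, the following non-normative sketch computes how many UTF-8 code units a code point needs. The names utf8_code_units and fits_utf8_character_literal are hypothetical, and the argument is assumed to be a valid, non-surrogate code point.

```cpp
#include <cstdint>

// Number of UTF-8 code units needed to encode a code point.
// Precondition (not checked here): cp is a non-surrogate code point
// in the range 0x0-0x10FFFF.
constexpr int utf8_code_units(std::uint32_t cp) {
    return cp <= 0x7Fu   ? 1
         : cp <= 0x7FFu  ? 2
         : cp <= 0xFFFFu ? 3
                         : 4;
}

// Hypothetical check mirroring the rule for UTF-8 character literals:
// only code points in 0x0-0x7F fit in a single code unit.
constexpr bool fits_utf8_character_literal(std::uint32_t cp) {
    return utf8_code_units(cp) == 1;
}
```

In this sketch, the code point of 'w' (0x77) fits in a single code unit, whereas U+00E9 needs two and would therefore make a UTF-8 character literal ill-formed.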
Edit 5.13.3 [lex.ccon], paragraph 4 as follows.
4 A character literal that begins with the letter u, such as u'x', is a character literal of type char16_t. The value of a char16_t character literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point value is representable with a single 16-bit code unit ([Note: that is, provided it is in the basic multi-lingual plane the range 0x0-0xFFFF (inclusive))—end note]. If the value is not representable with a single 16-bit code unit, the program is ill-formed. A char16_t character literal containing multiple c-chars is ill-formed.
Edit 5.13.5 [lex.string], paragraph 10 as follows.
10 A string-literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs. [Note: A surrogate pair is a representation for a single code point as a sequence of two 16-bit code units.—end note]
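As a non-normative illustration of how a single c-char can produce two char16_t values, the sketch below encodes a code point as UTF-16. The type Utf16Encoding and the function encode_utf16 are hypothetical names introduced for this example, and the argument is assumed to be a valid, non-surrogate code point.

```cpp
#include <cstdint>

// Result of encoding one code point as UTF-16 (hypothetical type).
struct Utf16Encoding {
    int count;          // number of code units used: 1 or 2
    char16_t units[2];  // units[1] is 0 when count == 1
};

// Precondition (not checked here): cp is a non-surrogate code point
// in the range 0x0-0x10FFFF.
constexpr Utf16Encoding encode_utf16(std::uint32_t cp) {
    return cp <= 0xFFFFu
        // Code points in 0x0-0xFFFF are represented by one code unit.
        ? Utf16Encoding{1, {static_cast<char16_t>(cp), u'\0'}}
        // Larger code points are split into a surrogate pair: the upper
        // ten bits of (cp - 0x10000) go into the high surrogate and the
        // lower ten bits into the low surrogate.
        : Utf16Encoding{2,
              {static_cast<char16_t>(0xD800u + ((cp - 0x10000u) >> 10)),
               static_cast<char16_t>(0xDC00u + ((cp - 0x10000u) & 0x3FFu))}};
}
```

With this sketch, U+1F600 yields the two code units 0xD83D and 0xDE00, so a char16_t string literal containing only that character would have a size of 3, counting the terminating null.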
Edit 19.8 [cpp.predefined], item (2.4) as follows.
(2.4) — __STDC_ISO_10646__
An integer literal of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier code point of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month.
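A program might consume this macro along the following lines. This is a non-normative sketch: YearMonth, decode_yyyymm, wchar_matches_iso10646 and iso10646_revision are hypothetical names, and only __STDC_ISO_10646__ itself comes from the standard.

```cpp
// Year and month extracted from a yyyymmL value (hypothetical type).
struct YearMonth {
    long year;
    long month;
};

// Splits a value of the form yyyymm into its year and month parts.
constexpr YearMonth decode_yyyymm(long value) {
    return YearMonth{value / 100, value % 100};
}

#ifdef __STDC_ISO_10646__
// wchar_t values equal the code points of the Unicode required set
// as of the revision named by the macro.
constexpr bool wchar_matches_iso10646 = true;
constexpr YearMonth iso10646_revision = decode_yyyymm(__STDC_ISO_10646__);
#else
// The implementation makes no such guarantee.
constexpr bool wchar_matches_iso10646 = false;
constexpr YearMonth iso10646_revision = YearMonth{0, 0};
#endif
```

For the example value 199712L, decode_yyyymm gives a year of 1997 and a month of 12.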