ISO/IEC JTC1 SC22 WG21 N2170 = 07-0030 - 2007-02-02
Lawrence Crowl
The current standard prohibits using universal character names to specify many characters, in particular the control characters (00-1F, 7F-9F) and the basic source characters (20-23: space and !"#; 25-3F: %&'()*+,-./0123456789:;<=>?; 41-5F: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_; 61-7E: abcdefghijklmnopqrstuvwxyz{|}~). By implication, the standard permits specifying the printable ASCII characters 24 ($), 40 (@), and 60 (`).
While the prohibition against basic source characters is generally not a significant problem, the prohibition against control characters within character and string literals causes programmers to fall back upon traditional escape sequences, which makes the code more platform-dependent.
For example, the high control characters of Unicode (80-9F) have code points with different meanings in windows-1252. In UTF-8, those points also have a different representation. For example, "\u0085" would be "\xC2\x85".
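The UTF-8 representation mentioned above can be checked with a small encoder. The following is a minimal sketch, not part of the proposal; the function name encode_utf8 is hypothetical, and the code assumes the input is a code point no greater than 10FFFF.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one code point as a UTF-8 byte sequence.
// Assumes cp <= 0x10FFFF; it does not reject surrogate values.
std::string encode_utf8(unsigned long cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encode_utf8(0x85) yields the two bytes C2 85, which is what a programmer must spell out by hand as "\xC2\x85" when "\u0085" is not permitted.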
The current C++ standard permits specification of universal characters within the range D800 through DFFF inclusive. These values do not identify characters, but rather identify half of surrogate pairs. The C 1999 standard prohibits specification of these values.
This problem is core issue number 558, and this paper proposes a solution to that issue.
The only potential need for values within this range is processing of strings. In those rare cases, use of direct numeric constants (e.g. 0xD83F) will suffice.
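As an illustration of string-processing code that needs surrogate values, the sketch below uses direct numeric constants rather than universal character names. The helper names are hypothetical and the code is an assumption about what such processing might look like, not text from the proposal.

```cpp
#include <cassert>

// Hypothetical helpers: the surrogate ranges are written as plain
// numeric constants, so no universal character name is needed.
inline bool is_high_surrogate(unsigned long v) { return v >= 0xD800 && v <= 0xDBFF; }
inline bool is_low_surrogate(unsigned long v)  { return v >= 0xDC00 && v <= 0xDFFF; }
inline bool is_surrogate(unsigned long v)      { return v >= 0xD800 && v <= 0xDFFF; }

// Combine a high/low surrogate pair into the code point it encodes.
inline unsigned long combine_surrogates(unsigned long high, unsigned long low) {
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For example, is_surrogate(0xD83F) is true, and combine_surrogates(0xD83D, 0xDE00) gives 0x1F600.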
We propose to lift the prohibitions on control and basic source universal character names within character and string literals. We propose to add prohibitions against surrogate values in all universal character names.
The existing wording in the phases of translation (2.1) and existing grammar for character (2.13.2) and string (2.13.4) literals prevents problems parsing literals because interpretation of the universal character names occurs after tokenization. Because the prohibitions remain outside of string literals, the existing parse is not affected.
In paragraph 2, edit as follows, where text within [+ +] is inserted and text within [- -] is deleted:

The universal-character-name construct provides a way to name other characters.

    hex-quad:
        hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
    universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal character name [+corresponds to a surrogate code point (in the range 0xD800-0xDFFF, inclusive), the program is ill-formed. Additionally, if the hexadecimal value for a universal character name outside a character or string literal+] [-is less than 0x20, or in the range 0x7F-0x9F (inclusive),-] [+corresponds to a control character (in either of the ranges 0x0-0x1F or 0x7F-0x9F, both inclusive)+] or [+to+] [-if the universal character name designates-] a character in the basic source character set, [-then-] the program is ill-formed.
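The proposed rule can be summarized as a small predicate. This is a minimal sketch under stated assumptions, not an implementation taken from any compiler: the function names are hypothetical, constexpr assumes C++11 or later, and the basic source character test relies on the ranges listed at the start of this paper (20-7E except 24, 40, and 60), with the basic source control characters already covered by the control character test.

```cpp
// Hypothetical predicates; v is the hexadecimal value of the
// universal character name.
constexpr bool is_surrogate_value(unsigned long v) {
    return v >= 0xD800 && v <= 0xDFFF;
}

constexpr bool is_control_value(unsigned long v) {
    return v <= 0x1F || (v >= 0x7F && v <= 0x9F);
}

// Printable basic source characters: 20-7E except $ (24), @ (40), ` (60).
constexpr bool is_printable_basic_source_value(unsigned long v) {
    return v >= 0x20 && v <= 0x7E && v != 0x24 && v != 0x40 && v != 0x60;
}

// True when the proposed wording makes the program ill-formed.
// Surrogates are rejected everywhere; control and basic source
// characters are rejected only outside character and string literals.
constexpr bool ucn_is_ill_formed(unsigned long v, bool inside_literal) {
    return is_surrogate_value(v) ||
           (!inside_literal &&
            (is_control_value(v) || is_printable_basic_source_value(v)));
}
```

Under this sketch, ucn_is_ill_formed(0x85, true) is false, so "\u0085" would be accepted in a string literal, while ucn_is_ill_formed(0x85, false) remains true, so the same name is still rejected in an identifier.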