Universal Character Names in Literals

ISO/IEC JTC1 SC22 WG21 N2170 = 07-0030 - 2007-02-02

Lawrence Crowl

Problem: Excessive Exclusion of Some Characters

The current standard prohibits using universal character names to specify many characters, in particular the control characters (00-1F, 7F-9F) and the basic source characters (20-23 !"#, 25-3F %&'()*+,-./0123456789:;<=>?, 41-5F ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_, 61-7E abcdefghijklmnopqrstuvwxyz{|}~). By implication, the standard permits specifying the printable ASCII characters 24 $, 40 @, and 60 `.

While the prohibition against basic source characters is generally not a significant problem, the prohibition against control characters within character and string literals causes programmers to fall back upon traditional escape sequences, which makes the code more platform-dependent.

For example, the code points of the high control characters of Unicode (80-9F) have different meanings in windows-1252: U+0085 is the NEXT LINE control character, whereas byte 85 in windows-1252 is the horizontal ellipsis. In UTF-8, those code points also have a different representation: "\u0085" would be encoded as "\xC2\x85".
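To make the representation difference concrete, the following is a minimal sketch of a UTF-8 encoder. The function name encode_utf8 and its empty-string error convention are assumptions made for illustration; they are not part of this proposal or of the standard library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (illustration only): encode one ISO/IEC 10646 code
// point as UTF-8. Surrogate values and values above 0x10FFFF yield an
// empty string.
std::string encode_utf8(unsigned long cp) {
    std::string out;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        return out;  // not an encodable character
    }
    if (cp < 0x80) {
        // One byte: identical to ASCII.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// The single byte written as "\x85" is not the UTF-8 form of U+0085.
inline void check_next_line_encoding() {
    assert(encode_utf8(0x85) == "\xC2\x85");
    assert(encode_utf8(0x85) != "\x85");
}
```

A programmer who writes "\x85" therefore commits to one execution character set, whereas "\u0085" would leave the choice of representation to the implementation.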

Problem: Excessive Inclusion of Some Values

The current C++ standard permits specification of universal character names within the range D800 through DFFF inclusive. These values do not identify characters; each identifies one half of a surrogate pair. The C 1999 standard prohibits specification of these values.

This problem is core issue number 558, and this paper proposes a solution to that issue.

The only potential need for values within this range is processing of strings. In those rare cases, use of direct numeric constants (e.g. 0xD83F) will suffice.
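As a sketch of such string processing, surrogate handling can be written entirely with numeric constants. The helper names below are hypothetical and chosen for illustration; the arithmetic is the usual UTF-16 surrogate-pair decoding.

```cpp
#include <cassert>

// Hypothetical helpers (illustration only) for code that inspects
// UTF-16 code units. No universal character name is needed.
inline bool is_high_surrogate(unsigned cu) {
    return cu >= 0xD800 && cu <= 0xDBFF;
}

inline bool is_low_surrogate(unsigned cu) {
    return cu >= 0xDC00 && cu <= 0xDFFF;
}

// Combine a high and a low surrogate into the code point they encode.
// Precondition: is_high_surrogate(hi) && is_low_surrogate(lo).
inline unsigned long combine_surrogates(unsigned hi, unsigned lo) {
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000UL
         + ((static_cast<unsigned long>(hi) - 0xD800UL) << 10)
         + (static_cast<unsigned long>(lo) - 0xDC00UL);
}
```

For instance, combine_surrogates(0xD83D, 0xDE00) yields 0x1F600, and the constant 0xD83F from the text above satisfies is_high_surrogate.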

Solution

We propose to lift the prohibitions on control and basic source universal character names within character and string literals. We propose to add prohibitions against surrogate values in all universal character names.

The existing wording in the phases of translation (2.1) and the existing grammar for character (2.13.2) and string (2.13.4) literals prevent problems parsing literals because interpretation of the universal character names occurs after tokenization. Because the prohibitions remain outside of character and string literals, the existing parse is not affected.

2.2 Character sets [lex.charset]

In paragraph 2, edit

The universal-character-name construct provides a way to name other characters.

hex-quad:
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal character name corresponds to a surrogate code point (in the range 0xD800-0xDFFF, inclusive), the program is ill-formed. Additionally, if the hexadecimal value for a universal character name outside a character or string literal ~~is less than 0x20, or in the range 0x7F-0x9F (inclusive),~~ corresponds to a control character (in either of the ranges 0x0-0x1F or 0x7F-0x9F, both inclusive) or to ~~if the universal character name designates~~ a character in the basic source character set, ~~then~~ the program is ill-formed.
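The edited rule can be read as a predicate on the hexadecimal value and on whether the universal character name appears inside a character or string literal. The following is a minimal sketch of that reading, not an implementation taken from any compiler; all function names are hypothetical. It covers only the conditions stated in the paragraph above and does not check, for example, whether the value exceeds 0x10FFFF.

```cpp
#include <cassert>

// Surrogate code points: ill-formed everywhere under the proposed wording.
inline bool is_surrogate(unsigned long v) {
    return v >= 0xD800 && v <= 0xDFFF;
}

// Control characters: 0x0-0x1F and 0x7F-0x9F, both inclusive.
inline bool is_control(unsigned long v) {
    return v <= 0x1F || (v >= 0x7F && v <= 0x9F);
}

// Graphic members of the basic source character set: 0x20-0x7E except
// 0x24 ($), 0x40 (@), and 0x60 (`). The control members (such as
// new-line) are already covered by is_control.
inline bool is_basic_source_graphic(unsigned long v) {
    return v >= 0x20 && v <= 0x7E && v != 0x24 && v != 0x40 && v != 0x60;
}

// Hypothetical predicate: true if a universal character name with value v
// is permitted by the edited paragraph.
inline bool ucn_is_permitted(unsigned long v, bool inside_literal) {
    if (is_surrogate(v)) {
        return false;  // ill-formed in all contexts
    }
    if (inside_literal) {
        return true;   // prohibitions lifted within literals
    }
    return !is_control(v) && !is_basic_source_graphic(v);
}
```

Under this reading, ucn_is_permitted(0x85, true) holds while ucn_is_permitted(0x85, false) does not, and ucn_is_permitted(0xD800, true) fails in both contexts.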