Doc. no.   N2295=07-0155
Date:        2007-06-23
Project:     Programming Language C++
Reply to:   Lawrence Crowl <lawrence at crowl.org>
                Beman Dawes <bdawes at acm.org>

Raw and Unicode String Literals; Unified Proposal

Introduction

Two papers, N2209, UTF-8 String Literals, by Lawrence Crowl, and N2146, Raw String Literals (Revision 1), by Beman Dawes, propose additional forms of string literals for C++. Both have been approved by the Evolution Working Group and are ready for processing by the Core Working Group. Both papers make changes to the same text in the Working Paper. This proposal unifies the changed wording to avoid race conditions in editing the text.

The motivation, discussion, and other details from the original proposals remain unchanged.

The proposed text below is the same as in the original papers, except:

The original raw string literal syntax allowed the 'R' that denotes a raw string literal either before or after other prefixes. Thus either LR or RL were valid. To reduce the combinatorial explosion caused by the addition of the u, U, and u8 prefixes, the R is now only valid following the other portion of a prefix. This is the same as in Python.
The original UTF-8 string literal wording made any source character set extensions to the basic source character set implementation-defined, but only for literals. It seemed awkward to make the source character set implementation-defined only within literals, so the wording was changed to apply to the entire source file. Non-normative encouragement to support all 16-bit ISO/IEC 10646 characters was added to promote uniformity of the physical source file character set. That is existing practice for compilers such as VC++.

Proposed Text

Change 1.7 The C++ memory model [intro.memory] as indicated:

The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain ~~any member of the basic execution character set~~ the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
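The byte requirement above can be checked at compile time. The following is a minimal sketch, not part of the proposed wording; the helper name `byte_holds_utf8_code_unit` is hypothetical and exists only for illustration.

```cpp
#include <cassert>
#include <climits>

// CHAR_BIT is the implementation-defined number of bits in a byte.
// A UTF-8 code unit needs eight bits, so a conforming byte has at least that many.
static_assert(CHAR_BIT >= 8, "a byte must be able to hold one UTF-8 code unit");

// Hypothetical helper: true if unsigned char can represent every
// eight-bit code unit value (0x00 through 0xFF).
bool byte_holds_utf8_code_unit() {
    return UCHAR_MAX >= 0xFF;
}
```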

Change 2.1 [lex.phases], paragraph 1 as indicated. (Note to reviewers: the ISO/IEC short name wording is the same as used in 2.2 Character sets [lex.charset] paragraph two.)

1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. [Note: Implementations are encouraged to accept as physical source file characters all the characters whose character short name in ISO/IEC 10646 is 0000NNNN. --end note] Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

Change 2.1 [lex.phases], paragraph 1, phase 5 as indicated:

5. Each source character set member~~, escape sequence~~, or universal-character-name in character literals and string literals, and escape sequence in character literals and regular string literals, is converted to the corresponding member of the execution character set (2.13.2, 2.13.4); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.¹⁷⁾

Change 2.13.4 String literals [lex.string] as indicated:

string-literal:
        "s-char-sequence_opt"
        u8"s-char-sequence_opt"
        u"s-char-sequence_opt"
        U"s-char-sequence_opt"
        L"s-char-sequence_opt"
        R"d-char-sequence_opt[r-char-sequence_opt]d-char-sequence_opt"
        u8R"d-char-sequence_opt[r-char-sequence_opt]d-char-sequence_opt"
        uR"d-char-sequence_opt[r-char-sequence_opt]d-char-sequence_opt"
        UR"d-char-sequence_opt[r-char-sequence_opt]d-char-sequence_opt"
        LR"d-char-sequence_opt[r-char-sequence_opt]d-char-sequence_opt"

s-char-sequence:
        s-char
        s-char-sequence s-char

s-char:
        any member of the source character set except the double-quote ", backslash \, or new-line character
        escape-sequence
        universal-character-name

r-char-sequence:
    r-char
    r-char-sequence r-char

r-char:
    any member of the source character set, except the right square bracket ]
                        when followed by the initial d-char-sequence, if present, followed by the double quote ".
    universal-character-name

d-char-sequence:
    d-char
    d-char-sequence d-char

d-char:
    any member of the source character set, except the left square bracket [, the right square bracket ],
                        or the control characters representing horizontal tab, vertical tab, form feed, or new-line.

A string literal is a sequence of characters (as defined in 2.13.2) surrounded by double quotes, optionally ~~beginning with one of the letters~~ prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"[...]", u8"...", u8R"**[...]**", u"...", uR"*@[...]*@", U"...", UR"zzz[...]zzz", L"...", or LR"[...]", respectively.

A string literal that does not have an R in the prefix is a regular string literal. A string literal that has an R in the prefix is a raw string literal. The terminating d-char-sequence of a raw string literal shall be the same sequence of characters as the initial d-char-sequence. The maximum length of d-char-sequence shall be 16 characters.

[Note: A source-file new-line in a raw string-literal results in a new-line in the resulting execution string-literal, unless preceded by a backslash. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char * p = R"[a\
b
c]";
assert(std::strcmp(p, "ab\nc") == 0);

--end note]
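For readers who want to try raw string literals with a current compiler, here is a minimal sketch. It assumes the C++11 spelling, in which parentheses delimit the raw characters where this paper uses square brackets. The backslash-newline case from the note is left out because standardized C++11 treats it differently. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Assumption: C++11 syntax R"delim( ... )delim" rather than the
// R"delim[ ... ]delim" form proposed in this paper.

// Backslashes are ordinary characters inside a raw string literal.
const char* raw_path() { return R"(C:\temp\new)"; }

// The same characters written as a regular string literal.
const char* regular_path() { return "C:\\temp\\new"; }

// A source-file new-line inside a raw string literal becomes a new-line
// in the resulting string.
const char* raw_two_lines() { return R"(a
b)"; }

// A delimiter (here "zzz") lets the body contain the sequence )" itself.
const char* raw_with_delimiter() { return R"zzz(say )" here)zzz"; }
```

The delimiter form is what makes the terminating sequence unambiguous: the literal ends only where the closing bracket is followed by the same delimiter and a double quote.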

A string literal that does not begin with u8, u, U, or L is an ordinary string literal, and is initialized with the given characters.

A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.

Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A~~n ordinary~~ narrow string literal has type “array of n const char”, where n is the size of the string as defined below, ~~it~~ and has static storage duration (3.7) ~~and is initialized with the given characters~~.
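A short sketch of how a UTF-8 string literal is initialized may help here. It uses a universal-character-name so that the result does not depend on the source file encoding. The function names are hypothetical, and the code avoids naming the element type because later revisions of the language (C++20) changed it from char to char8_t.

```cpp
#include <cassert>
#include <cstddef>

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE) encodes in UTF-8 as the two
// code units 0xC3 0xA9.

// Size in bytes of the literal, including the terminating null.
std::size_t utf8_literal_size() { return sizeof(u8"\u00e9"); }

// The i-th code unit of the literal, widened to unsigned char so the
// value can be compared without sign issues.
unsigned char utf8_code_unit(std::size_t i) {
    return static_cast<unsigned char>(u8"\u00e9"[i]);
}
```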

A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.
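The surrogate-pair remark can be illustrated with a character outside the Basic Multilingual Plane. This is a sketch only; the function names are hypothetical and the character U+1F600 is chosen arbitrarily.

```cpp
#include <cassert>
#include <cstddef>

// U+1F600 lies above U+FFFF, so UTF-16 represents it as the surrogate
// pair 0xD83D 0xDE00.

// Number of char16_t elements in the literal, including the terminator.
std::size_t u16_element_count() {
    return sizeof(u"\U0001F600") / sizeof(char16_t);
}

// The i-th char16_t element of the literal.
char16_t u16_element(std::size_t i) { return u"\U0001F600"[i]; }
```

One character in the source therefore yields two char16_t elements plus the terminating null.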

A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.

A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.

Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.

In translation phase 6 (2.1), adjacent string literals are concatenated. If both string literals have the same prefix, the resulting concatenated string literal has that prefix. If one string literal has no prefix, it is treated as a string literal of the same prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally supported with implementation-defined behavior. [ Note: This concatenation is an interpretation, not a conversion. —end note ] [ Example: Here are some examples of valid concatenations:

Table unchanged

—end example ] Characters in concatenated strings are kept distinct. [ Example:
"\xA" "B"

contains the two characters ’\xA’ and ’B’ after concatenation (and not the single hexadecimal character ’\xAB’). —end example ]
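The concatenation rules above can be checked with a small sketch. It is not part of the proposed wording, and the function names are hypothetical. The static assertions rely on `decltype` applied to a string literal yielding a reference to the array type.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// When one literal has no prefix, the result takes the other literal's prefix.
static_assert(std::is_same<decltype(L"a" "b"), const wchar_t (&)[3]>::value,
              "L\"a\" \"b\" is a wide string literal");
static_assert(std::is_same<decltype(u"a" "b"), const char16_t (&)[3]>::value,
              "u\"a\" \"b\" is a char16_t string literal");

// "\xA" "B" stays two characters; it does not merge into '\xAB'.
std::size_t concat_size() { return sizeof("\xA" "B"); }
char concat_char(std::size_t i) { return ("\xA" "B")[i]; }
```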

After any necessary concatenation, in translation phase 7 (2.1), ’\0’ is appended to every string literal so that programs that scan a string can find its end.

Escape sequences in regular string literals and universal-character-names in string literals have the same meaning as in character literals (2.13.2), except that the single quote ’ is representable either by itself or by the escape sequence \’, and the double quote " shall be preceded by a \. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’. The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u’\0’. [ Note: The size of a char16_t string literal is the number of code units, not the number of characters. —end note ] Within char32_t and char16_t literals, any universal-character-names must be within the range 0x0 to 0x10FFFF. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating ’\0’.
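The size rules in this paragraph can be worked through for each kind of literal. The sketch below assumes a hypothetical helper, `element_count`, that reports the array bound of a literal. The wide example is restricted to basic characters because the width of wchar_t is implementation-defined.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the number of elements in an array, which for a
// string literal includes the terminating null.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// char32_t: one element per character, plus the terminator: 2 + 1.
static_assert(element_count(U"a\U0001F600") == 3, "char32_t size");

// char16_t: one extra element for the surrogate pair: 2 + 1 + 1.
static_assert(element_count(u"a\U0001F600") == 4, "char16_t size");

// UTF-8: U+00E9 takes two code units: 1 + 2 + 1.
static_assert(element_count(u8"a\u00e9") == 4, "UTF-8 size");

// Wide: two characters plus the terminator.
static_assert(element_count(L"ab") == 3, "wide size");
```

In each case the count is in code units of the literal's element type, not in characters.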