ISO/IEC JTC1 SC22 WG21 N2209 = 07-0069 - 2007-03-08
Lawrence Crowl
This document replaces N2159 = 07-0019 - 2007-01-10.
Many users of C++ need to manipulate Unicode character strings. While N2149 New Character Types for C++ addresses most low-level issues, it does not provide a mechanism to ensure UTF-8 literals. For portable international code, the standard needs such a mechanism.
We propose to add a new lexical token for UTF-8 string literals. No new types or other language changes are required. In particular, we do not propose character literals.
Adoption of this paper requires all conforming implementations to have bytes of at least eight bits in size. We believe that all existing systems already conform.
Note that this paper does not presume adoption of N2149 New Character Types for C++, and some editorial merge will be necessary.
Likewise, this paper does not presume adoption of N2053 Raw String Literals, for which some editorial merge will also be necessary.
See section 2.5 "Encoding Forms" in
The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0). The online version (printing prohibited) is at http://www.unicode.org/versions/Unicode5.0.0/.
See Annex C of ISO 10646-1, which is online at http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc.
See ISO/IEC 10646:2003, which is publicly available in several text and PDF files within a zip archive from http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip.
See UTF-8, UTF-16, UTF-32 & BOM.
To paragraph 1, edit, replacing "any member of the basic execution character set" with "the eight-bit code units of the Unicode UTF-8 encoding form"
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
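The eight-bit requirement can be checked at compile time. The following is a minimal sketch, not part of the proposed wording; the helper name `byte_holds_utf8_code_unit` is hypothetical and introduced only for illustration.

```cpp
#include <climits>

// CHAR_BIT is the number of bits in a byte on this implementation.
// Under the edited wording, a conforming implementation would satisfy this.
static_assert(CHAR_BIT >= 8, "a byte must hold an eight-bit UTF-8 code unit");

// Hypothetical helper: true when every UTF-8 code unit value (0x00..0xFF)
// fits in a single byte.
constexpr bool byte_holds_utf8_code_unit() {
    return UCHAR_MAX >= 0xFF;
}
```

Because the C standard already requires `CHAR_BIT` to be at least 8, the assertion is expected to hold on existing systems, which matches the paper's belief that they already conform.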
To the grammar, edit
- string-literal:
  - " c-char-sequence_opt "
  - E" c-char-sequence_opt "
  - L" c-char-sequence_opt "
To paragraph 1, edit
A string literal is a sequence of characters (as defined in 2.13.2) surrounded by double quotes, optionally beginning with one of the letters E or L, as in "...", E"...", or L"...". A string literal that does not begin with E or L is an ordinary string literal, and is initialized with the given characters. A string literal that begins with E, such as E"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8. It is implementation-defined whether literals may contain more than members of the basic character set and universal character names (\Unnnnnnnn and \unnnn). Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type "array of n const char" and static storage duration (3.7), where n is the size of the string as defined below, and is initialized with the given characters. A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type "array of n const wchar_t" and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters.
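To illustrate what "initialized with the given characters as encoded in UTF-8" means for the array contents, the sketch below computes the code units for a single code point. It is an illustration under stated assumptions rather than part of the proposal: the function name `encode_utf8` is hypothetical, and the E prefix itself is not used because it is only proposed here.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 code units.
// Returns an empty string for surrogates and for values above U+10FFFF,
// which have no UTF-8 encoding.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // One code unit: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two code units: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate range
        // Three code units: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // Four code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Under the proposed wording, a literal such as E"\u00E9" would presumably hold the same code units as `encode_utf8(0xE9)`, that is 0xC3 0xA9, followed by the terminating null character, regardless of the implementation's ordinary execution character set.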
In paragraph 3, edit, replacing "a narrow" with "an ordinary"
In translation phase 6 (2.1), adjacent string literals are concatenated. If an ordinary string literal token is adjacent to a UTF-8 string literal token, the result is a UTF-8 string literal. If an ordinary string literal token is adjacent to a wide string literal token, the result is a wide string literal. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed.
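The concatenation rules can be summarized as a small decision function. This is a sketch for illustration only; the names `LiteralKind` and `concatenated_kind` are hypothetical, and the rule that two literals of the same kind keep that kind is an assumption read from the first sentence of the paragraph.

```cpp
// Hypothetical model of the literal kinds named in the paragraph above.
enum class LiteralKind { Ordinary, Utf8, Wide, IllFormed };

// Returns the kind of the literal that results from concatenating two
// adjacent string literal tokens, or IllFormed when the program is ill-formed.
inline LiteralKind concatenated_kind(LiteralKind a, LiteralKind b) {
    if (a == LiteralKind::IllFormed || b == LiteralKind::IllFormed)
        return LiteralKind::IllFormed;
    if (a == b) return a;                      // assumed: same kind is kept
    if (a == LiteralKind::Ordinary) return b;  // ordinary adopts UTF-8 or wide
    if (b == LiteralKind::Ordinary) return a;
    return LiteralKind::IllFormed;             // UTF-8 adjacent to wide
}
```

In this model an ordinary literal adopts the kind of its neighbour, while the UTF-8 and wide kinds never mix.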
Paragraph 5 already admits a multi-byte encoding of narrow string literals.
To paragraph 1, after the first sentence, add
Objects declared as characters (char) shall be large enough to store either one byte (1.7 [intro.memory]) or any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character.
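As a rough illustration of a char object storing one byte, the sketch below stores a UTF-8 code unit in a char and reads its value back. The helper name `code_unit_value` is hypothetical. The conversion of a value above 0x7F to a signed char is implementation-defined before C++20, so the round trip is an assumption that holds on common two's-complement implementations rather than a guarantee taken from the paper.

```cpp
// sizeof(char) is 1 by definition, so a char object occupies exactly one byte.
static_assert(sizeof(char) == 1, "char occupies one byte");

// Hypothetical helper: returns the code unit value (0..255 on eight-bit
// bytes) stored in a char object, independent of whether char is signed.
inline unsigned int code_unit_value(char c) {
    return static_cast<unsigned char>(c);
}

// Stores a UTF-8 code unit in a char object and reads its value back.
inline unsigned int round_trip(unsigned char unit) {
    char stored = static_cast<char>(unit);
    return code_unit_value(stored);
}
```

Reading through `unsigned char` avoids sign-extension surprises when code units such as 0xC3 are compared against integer constants.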