ISO/IEC JTC1 SC22 WG21 N2159 = 07-0019 - 2007-01-10
Lawrence Crowl
Many users of C++ need to manipulate Unicode character strings. While N2149 New Character Types for C++ addresses most low-level issues, it does not provide a mechanism to ensure UTF-8 literals. For portable international code, the standard needs such a mechanism.
We propose to add a new lexical token for UTF-8 string literals. No new types or other language changes are required. In particular, we do not propose character literals.
Note that this paper does not presume adoption of N2149; some editorial merge will be necessary.
Likewise, this paper does not presume adoption of N2053 Raw String Literals, for which some editorial merge will also be necessary.
See section 2.5 "Encoding Forms" in
The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0). The online version (printing prohibited) is at http://www.unicode.org/versions/Unicode5.0.0/.
See Annex C of ISO 10646-1, which is online at http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc.
See ISO/IEC 10646:2003, which is publicly available in several text and PDF files within a zip archive from http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip.
See the Unicode FAQ "UTF-8, UTF-16, UTF-32 & BOM".
To the grammar, add
- string-literal:
- E" c-char-sequence opt "
To paragraph 1, replace
optionally beginning with the letter L, as in "..." or L"..."
with
optionally beginning with one of the letters L or E, as in "...", L"...", or E"...", respectively
To paragraph 1, append
A string literal that begins with E, such as E"asdf", is a char string literal. The literal has the type "array of n const char", where n is the size of the string as defined below, and is initialized with the given characters encoded in UTF-8. It is implementation-defined whether such literals may contain characters other than members of the basic character set and universal character names (\Unnnnnnnn and \unnnn).
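To illustrate what "initialized with the given characters encoded in UTF-8" means at the byte level, the following is a minimal sketch of the encoding a compiler would apply to each character of an E"..." literal. The function name encode_utf8 is hypothetical and is not part of this proposal; the E prefix itself is proposed syntax, so the sketch computes the bytes at run time instead of using such a literal.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of the proposal: returns the UTF-8
// encoding of a single Unicode code point as a sequence of char.
std::string encode_utf8(std::uint32_t cp) {
    // Precondition: a valid code point that is not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {
        // One byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Under the proposed wording, a literal such as E"\u00E9" would be expected to hold the two bytes that encode_utf8(0xE9) returns, followed by the terminating null, and so to have the type "array of 3 const char".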
In paragraph 3, append
If any narrow string literal in the concatenation specifies UTF-8 encoding, the resulting string has UTF-8 encoding.
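The E prefix proposed here is not accepted by existing compilers, so the concatenation rule cannot be demonstrated directly. As an analogy only, the sketch below uses the u8 prefix that C++11 later standardized, which follows the same rule when a prefixed literal is concatenated with an unprefixed one. Treating u8 as a stand-in for E is an assumption of this example, and the helper names literal_bytes and concatenated_example are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies the bytes of a string literal, without its
// terminating null, into a std::string. It is a template so that it
// accepts both char (C++11 to C++17) and char8_t (C++20) elements.
template <typename CharT, std::size_t N>
std::string literal_bytes(const CharT (&lit)[N]) {
    std::string out;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        out.push_back(static_cast<char>(lit[i]));
    }
    return out;
}

// Only the first literal carries the UTF-8 prefix; the concatenated
// result as a whole is a UTF-8 string literal.
std::string concatenated_example() {
    return literal_bytes(u8"\u00E9" "x");
}
```

With the proposed syntax, the corresponding concatenation would be written E"\u00E9" "x", and the result would be expected to contain the same three bytes.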
Paragraph 5 already admits a multi-byte encoding of ordinary character string literals.