<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="preview__inner-2" style="padding: 10px 20px 573px;">
<div class="cl-preview-section">
<h1 id="make-char16_tchar32_t-string-literals-be-utf-1632">Make
char16_t/char32_t string literals be UTF-16/32</h1>
</div>
<div class="cl-preview-section">
<p>Document Number: P1041R0<br>
Date: 2018-04-24<br>
Audience: Evolution Working Group<br>
Reply-to: <a href="mailto:cpp@rmf.io">cpp@rmf.io</a></p>
</div>
<div class="cl-preview-section">
<h2 id="introduction">Introduction</h2>
</div>
<div class="cl-preview-section">
<p>C++11 introduced character types suitable for code units of
the UTF-16 and UTF-32 encoding forms, namely <code>char16_t</code>
and <code>char32_t</code>. Along with this, it also
introduced new string literals whose types are arrays of those
two character types, prefixed with <code>u</code> and <code>U</code>,
respectively. Finally, it introduced <em>UTF-8
string literals</em>, prefixed with <code>u8</code>, whose
types are arrays of <code>const char</code>. Of these three new
string literal types, only one has a guarantee about the
values that the elements of the array have; in other words,
only one has a guaranteed encoding form, the <em>UTF-8 string
literals</em>.</p>
</div>
<div class="cl-preview-section">
<p>The standard text hints that the <code>char16_t</code> and <code>char32_t</code>
string literals are intended to be encoded as UTF-16 and UTF-32,
respectively, but, unlike for <em>UTF-8 string
literals</em>, it never explicitly states such a requirement.</p>
</div>
<div class="cl-preview-section">
<h2 id="motivation">Motivation</h2>
</div>
<div class="cl-preview-section">
<p>In defining <code>char16_t</code> string literals
([lex.string]/10), the standard makes a mention of “surrogate
pairs”:</p>
</div>
<div class="cl-preview-section">
<blockquote>
<p>A string-literal that begins with <code>u</code>, such as
<code>u"asdf"</code>, is a <code>char16_t</code> string
literal. A <code>char16_t</code> string literal has type
“array of <em>n</em> <code>const char16_t</code>”, where <em>n</em>
is the size of the string as defined below; it is
initialized with the given characters. A single <em>c-char</em>
may produce more than one <code>char16_t</code> character
in the form of surrogate pairs.</p>
</blockquote>
</div>
<div class="cl-preview-section">
<p>Further down, when defining the size of <code>char16_t</code>
string literals ([lex.string]/15), there is another mention of
“surrogate pairs”:</p>
</div>
<div class="cl-preview-section">
<blockquote>
<p>The size of a <code>char16_t</code> string literal is the
total number of escape sequences, <em>universal-character-names</em>,
and other characters, plus one for each character requiring
a surrogate pair, plus one for the terminating <code>u'\0'</code>.
[<em>Note:</em> The size of a char16_t string literal is
the number of code units, not the number of characters. — <em>end
note</em>]</p>
</blockquote>
</div>
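<div class="cl-preview-section">
<p>The size rule quoted above can be checked at compile time. The
following is a minimal sketch rather than part of the proposed
wording: the constant names are illustrative, and U+20AC and
U+1F600 are chosen only as examples of a character inside and
outside the basic multilingual plane.</p>
<pre><code class="language-cpp">// Illustrative constants (hypothetical names, not taken from the standard).
// U+20AC fits in one 16-bit code unit; U+1F600 requires a surrogate pair.
constexpr auto euro16_size = sizeof(u"\u20AC") / sizeof(char16_t);
constexpr auto face16_size = sizeof(u"\U0001F600") / sizeof(char16_t);

// One code unit plus the terminating u'\0'.
static_assert(euro16_size == 2, "one code unit plus terminator");

// Two code units (the surrogate pair) plus the terminating u'\0'.
static_assert(face16_size == 3, "surrogate pair plus terminator");
</code></pre>
<p>The array size counts code units, so a single character in the
source can contribute two elements.</p>
</div>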
<div class="cl-preview-section">
<p>For <code>char32_t</code> string literals, the definition of
their size ([lex.string]/15) essentially limits the encoding
form used to one that doesn’t have more than one code unit per
character:</p>
</div>
<div class="cl-preview-section">
<blockquote>
<p>The size of a <code>char32_t</code> or wide string literal
is the total number of escape sequences, <em>universal-character-names</em>,
and other characters, plus one for the terminating <code>U'\0'</code>
or <code>L'\0'</code>.</p>
</blockquote>
</div>
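<div class="cl-preview-section">
<p>By contrast, the same kind of check for a <code>char32_t</code>
string literal shows one element per character regardless of the
plane the character belongs to. Again, this is only a sketch with
illustrative constant names.</p>
<pre><code class="language-cpp">// Illustrative constants (hypothetical names, not taken from the standard).
constexpr auto face32_size = sizeof(U"\U0001F600") / sizeof(char32_t);
constexpr auto mixed32_size = sizeof(U"a\u20AC\U0001F600") / sizeof(char32_t);

// One element for U+1F600 plus the terminating U'\0'.
static_assert(face32_size == 2, "one code unit plus terminator");

// Three characters, one element each, plus the terminating U'\0'.
static_assert(mixed32_size == 4, "three code units plus terminator");
</code></pre>
</div>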
<div class="cl-preview-section">
<p>Additionally, the standard constrains the range of <em>universal-character-names</em>
to the range that is supported by all of the UTF encoding
forms discussed here:</p>
</div>
<div class="cl-preview-section">
<blockquote>
<p>Within <code>char32_t</code> and <code>char16_t</code>
string literals, any <em>universal-character-names</em>
shall be within the range <code>0x0</code> to <code>0x10FFFF</code>.</p>
</blockquote>
</div>
<div class="cl-preview-section">
<p>All of these requirements, while never explicitly naming the
UTF-16 or UTF-32 encoding forms, strongly imply that these are
the encoding forms intended. Furthermore, it would be
questionable for an implementation to pick any other encoding
forms for these string literals: there is no well-known
encoding form that uses a concept named “surrogate pair” other
than UTF-16, and there is no well-known encoding form that
encodes each character as a single 32-bit code unit other than
UTF-32.</p>
</div>
<div class="cl-preview-section">
<p>In practice, all implementations use UTF-16 and UTF-32 for
these string literals. C++ should standardize this practice
and make these requirements explicit instead of just hinting
at them.</p>
</div>
<div class="cl-preview-section">
<h2 id="proposal">Proposal</h2>
</div>
<div class="cl-preview-section">
<p>This proposal renames “<code>char16_t</code> string literals”
and “<code>char32_t</code> string literals” to “UTF-16 string
literals” and “UTF-32 string literals”, to match the existing
“UTF-8 string literals”, and explicitly requires the object
representations of those literals to be the values that
correspond to the UTF-16 and UTF-32 (respectively) encodings
of the given characters.</p>
</div>
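<div class="cl-preview-section">
<p>As a rough illustration of what the explicit requirement would
guarantee, the sketch below compares the elements of a literal
against the UTF-16 surrogate arithmetic. The helper functions
<code>high_surrogate</code> and <code>low_surrogate</code> are
hypothetical names introduced only for this example, and they
assume a code point in the range U+10000 to U+10FFFF. The
assertions are expected to hold on implementations that already
encode these literals as UTF-16 and UTF-32.</p>
<pre><code class="language-cpp">// Hypothetical helpers, not part of the proposal. They assume that cp is a
// supplementary-plane code point (U+10000 to U+10FFFF).
constexpr char16_t high_surrogate(char32_t cp) {
    return static_cast&lt;char16_t&gt;(0xD800 + (cp - 0x10000) / 0x400);
}

constexpr char16_t low_surrogate(char32_t cp) {
    return static_cast&lt;char16_t&gt;(0xDC00 + (cp - 0x10000) % 0x400);
}

// UTF-16: U+1F600 is stored as a surrogate pair.
static_assert(u"\U0001F600"[0] == high_surrogate(0x1F600), "high surrogate");
static_assert(u"\U0001F600"[1] == low_surrogate(0x1F600), "low surrogate");

// UTF-16: a basic multilingual plane character is stored as its code point.
static_assert(u"\u20AC"[0] == 0x20AC, "single code unit");

// UTF-32: every character is stored as its code point.
static_assert(U"\U0001F600"[0] == 0x1F600, "single code unit");
</code></pre>
</div>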
<div class="cl-preview-section">
<h2 id="technical-specifications">Technical Specifications</h2>
</div>
<div class="cl-preview-section">
<ul>
<li>
<p>Change [lex.string]/10:</p>
<blockquote>
<p>A <em>string-literal</em> that begins with <code>u</code>,
such as <code>u"asdf"</code>, is a <del><code>char16_t</code>
string literal</del><ins><em>UTF-16 string literal</em></ins>.
A <del><code>char16_t</code> string literal</del><ins>UTF-16
string literal</ins> has type “array of <em>n</em> <code>const
char16_t</code>”, where <em>n</em> is the size of the
string as defined below; it is initialized with the
given characters. A single <em>c-char</em> may produce
more than one <code>char16_t</code> character in the
form of surrogate pairs.</p>
</blockquote>
</li>
<li>
<p>Change [lex.string]/11:</p>
<blockquote>
<p>A <em>string-literal</em> that begins with <code>U</code>,
such as <code>U"asdf"</code>, is a <del><code>char32_t</code>
string literal</del><ins><em>UTF-32 string literal</em></ins>.
A <del><code>char32_t</code> string literal</del><ins>UTF-32
string literal</ins> has type “array of <em>n</em> <code>const
char32_t</code>”, where <em>n</em> is the size of the
string as defined below; it is initialized with the
given characters.</p>
</blockquote>
</li>
<li>
<p>Insert a paragraph between [lex.string]/10 and /11:</p>
<blockquote>
<p><ins>For a UTF-16 string literal, each successive
element of the object representation has the value of
the corresponding code unit of the UTF-16 encoding of
the string.</ins></p>
</blockquote>
</li>
<li>
<p>Insert a paragraph between [lex.string]/11 and /12:</p>
<blockquote>
<p><ins>For a <em>UTF-32 string literal</em>, each
successive element of the object representation has
the value of the corresponding code unit of the UTF-32
encoding of the string.</ins></p>
</blockquote>
</li>
<li>
<p>Change [lex.ccon]/4:</p>
<blockquote>
<p>A character literal that begins with the letter <code>u</code>,
such as <code>u'x'</code>, is a character literal of
type <code>char16_t</code><ins>, known as a <em>UTF-16
character literal</em></ins>. The value of a <del><code>char16_t</code></del><ins>UTF-16</ins>
character literal containing a single <em>c-char</em>
is equal to its ISO 10646 code point value, provided
that the code point value is representable with a single
16-bit code unit (that is, provided it is in the basic
multi-lingual plane). If the value is not representable
with a single 16-bit code unit, the program is
ill-formed. A <del><code>char16_t</code></del><ins>UTF-16</ins>
character literal containing multiple <em>c-char</em>s
is ill-formed.</p>
</blockquote>
</li>
<li>
<p>Change [lex.ccon]/5:</p>
<blockquote>
<p>A character literal that begins with the letter <code>U</code>,
such as <code>U'y'</code>, is a character literal of
type <code>char32_t</code><ins>, known as a <em>UTF-32
character literal</em></ins>. The value of a <del><code>char32_t</code></del><ins>UTF-32</ins>
character literal containing a single <em>c-char</em>
is equal to its ISO 10646 code point value. A <del><code>char32_t</code></del><ins>UTF-32</ins>
character literal containing multiple <em>c-char</em>s
is ill-formed.</p>
</blockquote>
</li>
</ul>
</div>
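<div class="cl-preview-section">
<p>The character literal rules above can be illustrated with a
short sketch. The constant names are illustrative only; the
values follow from the requirement that a literal containing a
single <em>c-char</em> has its ISO 10646 code point value.</p>
<pre><code class="language-cpp">// Illustrative constants (hypothetical names, not taken from the standard).
constexpr char16_t euro_unit = u'\u20AC';      // in the basic multilingual plane
constexpr char32_t face_unit = U'\U0001F600';  // outside it, fine for char32_t

static_assert(euro_unit == 0x20AC, "value is the code point");
static_assert(face_unit == 0x1F600, "value is the code point");

// Ill-formed under [lex.ccon]/4, because U+1F600 is not representable
// with a single 16-bit code unit:
// constexpr char16_t bad_unit = u'\U0001F600';
</code></pre>
</div>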
<div class="cl-preview-section">
<h2 id="interaction-with-other-papers">Interaction with other
papers</h2>
</div>
<div class="cl-preview-section">
<p>Currently, the standard lacks a normative reference to
UTF-16 and UTF-32; however, it also lacks such a reference
for UTF-8. This paper assumes that this problem will be fixed for
all three encodings in another paper, potentially <a
href="https://github.com/sg16-unicode/sg16/blob/master/papers/D1025R0.md">D1025R0</a>
(<em>Update The Reference To The Unicode Standard</em>).</p>
</div>
<div class="cl-preview-section">
<p>This paper was also written so as to not conflict with <a
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r2.html">P0482R2</a>
(<em>char8_t: A type for UTF-8 characters and strings
(Revision 2)</em>).</p>
</div>
</div>
</body>
</html>