<html>
  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="preview__inner-2" style="padding: 10px 20px 573px;">
      <div class="cl-preview-section">
        <h1 id="make-char16_tchar32_t-string-literals-be-utf-1632">Make
          char16_t/char32_t string literals be UTF-16/32</h1>
      </div>
      <div class="cl-preview-section">
        <p>Document Number: P1041R0<br>
          Date: 2018-04-24<br>
          Audience: Evolution Working Group<br>
          Reply-to: <a href="mailto:cpp@rmf.io">cpp@rmf.io</a></p>
      </div>
      <div class="cl-preview-section">
        <h2 id="introduction">Introduction</h2>
      </div>
      <div class="cl-preview-section">
        <p>C++11 introduced character types suitable for code units of
          the UTF-16 and UTF-32 encoding forms, namely <code>char16_t</code>
          and <code>char32_t</code>. Along with these types, it
          introduced new string literals whose types are arrays of those
          two character types, prefixed with <code>u</code> and <code>U</code>,
          respectively. Finally, it introduced <em>UTF-8
            string literals</em>, prefixed with <code>u8</code>, whose
          types are arrays of <code>const char</code>. Of these three new
          kinds of string literal, only one has a guarantee about the
          values of the array elements; in other words,
          only one has a guaranteed encoding form: the <em>UTF-8 string
            literals</em>.</p>
      </div>
      <div class="cl-preview-section">
        <p>The standard text hints that the <code>char16_t</code> and <code>char32_t</code>
          string literals are intended to be encoded as UTF-16 and
          UTF-32, respectively, but, unlike for <em>UTF-8 string
            literals</em>, it never explicitly states such a requirement.</p>
      </div>
      <div class="cl-preview-section">
        <h2 id="motivation">Motivation</h2>
      </div>
      <div class="cl-preview-section">
        <p>In defining <code>char16_t</code> string literals
          ([lex.string]/10), the standard makes a mention of “surrogate
          pairs”:</p>
      </div>
      <div class="cl-preview-section">
        <blockquote>
          <p>A string-literal that begins with <code>u</code>, such as
            <code>u"asdf"</code>, is a <code>char16_t</code> string
            literal. A <code>char16_t</code> string literal has type
            “array of <em>n</em> <code>const char16_t</code>”, where <em>n</em>
            is the size of the string as defined below; it is
            initialized with the given characters. A single <em>c-char</em>
            may produce more than one <code>char16_t</code> character
            in the form of surrogate pairs.</p>
        </blockquote>
      </div>
      <div class="cl-preview-section">
        <p>Further down, when defining the size of <code>char16_t</code>
          string literals ([lex.string]/15), there is another mention of
          “surrogate pairs”:</p>
      </div>
      <div class="cl-preview-section">
        <blockquote>
          <p>The size of a <code>char16_t</code> string literal is the
            total number of escape sequences, <em>universal-character-names</em>,
            and other characters, plus one for each character requiring
            a surrogate pair, plus one for the terminating <code>u'\0'</code>.
            [<em>Note:</em> The size of a <code>char16_t</code> string literal is
            the number of code units, not the number of characters. — <em>end
              note</em>]</p>
        </blockquote>
      </div>
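      <div class="cl-preview-section">
        <p>As an illustration of this size rule, the following sketch
          checks the resulting array sizes at compile time. The variable
          names are hypothetical, and U+00E9 and U+1F600 are arbitrary
          sample characters chosen for this example; the latter lies
          outside the basic multilingual plane and therefore requires a
          surrogate pair.</p>

```cpp
// Hypothetical sample literals, written with universal-character-names
// so that the source file encoding does not matter.
constexpr char16_t bmp_sample[] = u"\u00E9";        // U+00E9: inside the BMP
constexpr char16_t astral_sample[] = u"\U0001F600"; // U+1F600: outside the BMP

// Number of array elements (code units), including the terminator.
constexpr int bmp_size = sizeof(bmp_sample) / sizeof(char16_t);
constexpr int astral_size = sizeof(astral_sample) / sizeof(char16_t);

// One character + terminating u'\0'.
static_assert(bmp_size == 2, "one code unit plus the terminator");
// One character + one extra for the surrogate pair + terminating u'\0'.
static_assert(astral_size == 3, "surrogate pair plus the terminator");
```

        <p>Both literals contain a single character, yet their sizes
          differ, which is what the note about code units describes.</p>
      </div>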
      <div class="cl-preview-section">
        <p>For <code>char32_t</code> string literals, the definition of
          their size ([lex.string]/15) essentially limits the encoding
          form used to one that doesn’t have more than one code unit per
          character:</p>
      </div>
      <div class="cl-preview-section">
        <blockquote>
          <p>The size of a <code>char32_t</code> or wide string literal
            is the total number of escape sequences, <em>universal-character-names</em>,
            and other characters, plus one for the terminating <code>U'\0'</code>
            or <code>L'\0'</code>.</p>
        </blockquote>
      </div>
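      <div class="cl-preview-section">
        <p>A corresponding sketch for <code>char32_t</code> string
          literals follows. Again, the variable names and the sample
          character U+1F600 are assumptions made for this example. No
          extra element is added, even for a character outside the basic
          multilingual plane.</p>

```cpp
// Hypothetical sample literal: one character outside the BMP.
constexpr char32_t astral_sample32[] = U"\U0001F600";

// Number of array elements (code units), including the terminator.
constexpr int astral_size32 = sizeof(astral_sample32) / sizeof(char32_t);

// One character + terminating U'\0'; no allowance for extra code units.
static_assert(astral_size32 == 2, "one code unit plus the terminator");
```
      </div>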
      <div class="cl-preview-section">
        <p>Additionally, the standard constrains the range of <em>universal-character-names</em>
          to the range that is supported by all of the UTF encoding
          forms discussed here:</p>
      </div>
      <div class="cl-preview-section">
        <blockquote>
          <p>Within <code>char32_t</code> and <code>char16_t</code>
            string literals, any <em>universal-character-names</em>
            shall be within the range <code>0x0</code> to <code>0x10FFFF</code>.</p>
        </blockquote>
      </div>
      <div class="cl-preview-section">
        <p>All of these requirements, while never explicitly naming the
          UTF-16 or UTF-32 encoding forms, strongly imply that these are
          the encoding forms intended. Furthermore, it would be
          questionable for an implementation to pick any other encoding
          forms for these string literals: there is no well-known
          encoding form that uses a concept named “surrogate pair” other
          than UTF-16, and there is no well-known encoding form that
          encodes each character as a single 32-bit code unit other than
          UTF-32.</p>
      </div>
      <div class="cl-preview-section">
        <p>In practice, all implementations use UTF-16 and UTF-32 for
          these string literals. C++ should standardize this practice
          and make these requirements explicit instead of just hinting
          at them.</p>
      </div>
      <div class="cl-preview-section">
        <h2 id="proposal">Proposal</h2>
      </div>
      <div class="cl-preview-section">
        <p>This proposal renames “<code>char16_t</code> string literals”
          and “<code>char32_t</code> string literals” to “UTF-16 string
          literals” and “UTF-32 string literals”, to match the existing
          “UTF-8 string literals”, and explicitly requires the object
          representations of those literals to be the values that
          correspond to the UTF-16 and UTF-32 (respectively) encodings
          of the given characters.</p>
      </div>
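      <div class="cl-preview-section">
        <p>On an implementation that already follows the practice
          described in the Motivation section, the proposed requirement
          could be checked roughly as sketched below. The helper
          functions <code>high_surrogate</code> and <code>low_surrogate</code>
          are hypothetical names introduced for this example; they
          compute the surrogate pair of the UTF-16 encoding form for a
          code point above U+FFFF. The sample characters are arbitrary.</p>

```cpp
// Hypothetical helpers (not part of any library): the two UTF-16 code
// units for a code point in the range U+10000 to U+10FFFF.
constexpr char16_t high_surrogate(char32_t cp) {
    return char16_t(0xD800 + ((cp - 0x10000) >> 10));
}
constexpr char16_t low_surrogate(char32_t cp) {
    return char16_t(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// The same two characters, U+00E9 and U+1F600, in both literal kinds.
constexpr char16_t utf16_sample[] = u"\u00E9\U0001F600";
constexpr char32_t utf32_sample[] = U"\u00E9\U0001F600";

// UTF-16: a BMP character is one code unit, U+1F600 is a surrogate pair.
static_assert(utf16_sample[0] == 0x00E9, "BMP character");
static_assert(utf16_sample[1] == high_surrogate(0x1F600), "high surrogate");
static_assert(utf16_sample[2] == low_surrogate(0x1F600), "low surrogate");
static_assert(utf16_sample[3] == 0, "terminator");

// UTF-32: every character is one code unit equal to its code point.
static_assert(utf32_sample[0] == 0x00E9, "first code point");
static_assert(utf32_sample[1] == 0x1F600, "second code point");
static_assert(utf32_sample[2] == 0, "terminator");
```

        <p>Under the current wording, nothing explicitly guarantees that
          these assertions hold; with the proposed wording they would be
          required to.</p>
      </div>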
      <div class="cl-preview-section">
        <h2 id="technical-specifications">Technical Specifications</h2>
      </div>
      <div class="cl-preview-section">
        <ul>
          <li>
            <p>Change [lex.string]/10:</p>
            <blockquote>
              <p>A <em>string-literal</em> that begins with <code>u</code>,
                such as <code>u"asdf"</code>, is a <del><code>char16_t</code>
                  string literal</del><ins><em>UTF-16 string literal</em></ins>.
                A <del><code>char16_t</code> string literal</del><ins>UTF-16
                  string literal</ins> has type “array of <em>n</em> <code>const
                  char16_t</code>”, where <em>n</em> is the size of the
                string as defined below; it is initialized with the
                given characters. A single <em>c-char</em> may produce
                more than one <code>char16_t</code> character in the
                form of surrogate pairs.</p>
            </blockquote>
          </li>
          <li>
            <p>Change [lex.string]/11:</p>
            <blockquote>
              <p>A <em>string-literal</em> that begins with <code>U</code>,
                such as <code>U"asdf"</code>, is a <del><code>char32_t</code>
                  string literal</del><ins><em>UTF-32 string literal</em></ins>.
                A <del><code>char32_t</code> string literal</del><ins>UTF-32
                  string literal</ins> has type “array of <em>n</em> <code>const
                  char32_t</code>”, where <em>n</em> is the size of the
                string as defined below; it is initialized with the
                given characters.</p>
            </blockquote>
          </li>
          <li>
            <p>Insert a paragraph between [lex.string]/10 and /11:</p>
            <blockquote>
              <p><ins>For a UTF-16 string literal, each successive
                  element of the object representation has the value of
                  the corresponding code unit of the UTF-16 encoding of
                  the string.</ins></p>
            </blockquote>
          </li>
          <li>
            <p>Insert a paragraph between [lex.string]/11 and /12:</p>
            <blockquote>
              <p><ins>For a <em>UTF-32 string literal</em>, each
                  successive element of the object representation has
                  the value of the corresponding code unit of the UTF-32
                  encoding of the string.</ins></p>
            </blockquote>
          </li>
          <li>
            <p>Change [lex.ccon]/4:</p>
            <blockquote>
              <p>A character literal that begins with the letter <code>u</code>,
                such as <code>u'x'</code>, is a character literal of
                type <code>char16_t</code><ins>, known as a <em>UTF-16
                    character literal</em></ins>. The value of a <del><code>char16_t</code></del><ins>UTF-16</ins>
                character literal containing a single <em>c-char</em>
                is equal to its ISO 10646 code point value, provided
                that the code point value is representable with a single
                16-bit code unit (that is, provided it is in the basic
                multi-lingual plane). If the value is not representable
                with a single 16-bit code unit, the program is
                ill-formed. A <del><code>char16_t</code></del><ins>UTF-16</ins>
                character literal containing multiple <em>c-char</em>s
                is ill-formed.</p>
            </blockquote>
          </li>
          <li>
            <p>Change [lex.ccon]/5:</p>
            <blockquote>
              <p>A character literal that begins with the letter <code>U</code>,
                such as <code>U'y'</code>, is a character literal of
                type <code>char32_t</code><ins>, known as a <em>UTF-32
                    character literal</em></ins>. The value of a <del><code>char32_t</code></del><ins>UTF-32</ins>
                character literal containing a single <em>c-char</em>
                is equal to its ISO 10646 code point value. A <del><code>char32_t</code></del><ins>UTF-32</ins>
                character literal containing multiple <em>c-char</em>s
                is ill-formed.</p>
            </blockquote>
          </li>
        </ul>
      </div>
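      <div class="cl-preview-section">
        <p>The character literal rules in [lex.ccon]/4 and /5 above can
          be illustrated with a short sketch. The variable names are
          hypothetical, and the sample characters are arbitrary. The
          ill-formed cases are left as comments so that the example
          still compiles.</p>

```cpp
// Hypothetical sample character literals.
constexpr char16_t e_acute16 = u'\u00E9';   // U+00E9 fits in one 16-bit code unit
constexpr char32_t e_acute32 = U'\u00E9';
constexpr char32_t emoji32 = U'\U0001F600'; // any code point fits in char32_t

// The value of each literal equals the ISO 10646 code point value.
static_assert(e_acute16 == 0x00E9, "BMP code point in a UTF-16 character literal");
static_assert(e_acute32 == 0x00E9, "BMP code point in a UTF-32 character literal");
static_assert(emoji32 == 0x1F600, "non-BMP code point in a UTF-32 character literal");

// Ill-formed: U+1F600 is not representable with a single 16-bit code unit.
// constexpr char16_t emoji16 = u'\U0001F600';

// Ill-formed: a UTF-32 character literal containing multiple c-chars.
// constexpr char32_t two_chars = U'ab';
```
      </div>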
      <div class="cl-preview-section">
        <h2 id="interaction-with-other-papers">Interaction with other
          papers</h2>
      </div>
      <div class="cl-preview-section">
        <p>Currently, the standard lacks a normative reference to
          UTF-16 and UTF-32; however, it also lacks such a reference
          for UTF-8. This paper assumes that this problem will be fixed for
          all three encodings in another paper, potentially <a
href="https://github.com/sg16-unicode/sg16/blob/master/papers/D1025R0.md">D1025R0</a>
          (<em>Update The Reference To The Unicode Standard</em>).</p>
      </div>
      <div class="cl-preview-section">
        <p>This paper was also written so as to not conflict with <a
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r2.html">P0482R2</a>
          (<em>char8_t: A type for UTF-8 characters and strings
            (Revision 2)</em>).</p>
      </div>
    </div>
  </body>
</html>