<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">Thanks, Steve.  Could you please attach
      this paper to the SG16 wiki at
      <a class="moz-txt-link-freetext" href="http://wiki.edg.com/bin/view/Wg21belfast/SG16">http://wiki.edg.com/bin/view/Wg21belfast/SG16</a>?<br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">Tom.<br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">On 11/2/19 9:44 AM, Steve Downey wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAJEGDKruxz-Y1-ZAw_qv5QzS0nWSPyh82cK+VimSbOUF8Uw8+g@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <h1 class="gmail-title" style="line-height:1;text-align:center">C++
          Identifier Syntax using Unicode Standard Annex 31</h1>
        <table
style="border:none;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em;float:right">
          <tbody>
            <tr>
              <td
                style="padding-left:1em;padding-right:1em;vertical-align:top">Document
                #:</td>
              <td
                style="padding-left:1em;padding-right:1em;vertical-align:top">D19149R0</td>
            </tr>
            <tr>
              <td
                style="padding-left:1em;padding-right:1em;vertical-align:top">Date:</td>
              <td
                style="padding-left:1em;padding-right:1em;vertical-align:top">2019-11-02</td>
            </tr>
            <tr>
              <td
                style="padding-left:1em;padding-right:1em;vertical-align:top">Project:</td>
              <td
                style="padding-left:1em;padding-right:1em;vertical-align:top">Programming
                Language C++<br>
                SG16<br>
                EWG<br>
                CWG<br>
              </td>
            </tr>
            <tr>
              <td
                style="padding-left:1em;padding-right:1em;vertical-align:top">Reply-to:</td>
              <td
                style="padding-left:1em;padding-right:1em;vertical-align:top">Steve
                Downey<br>
                &lt;<a href="mailto:sdowney@gmail.com" class="email"
                  style="text-decoration-line:none;color:rgb(65,131,196)"
                  moz-do-not-send="true">sdowney@gmail.com</a>, <a
                  href="mailto:sdowney2@bloomberg.net" class="email"
                  style="text-decoration-line:none;color:rgb(65,131,196)"
                  moz-do-not-send="true">sdowney2@bloomberg.net</a>&gt;<br>
              </td>
            </tr>
          </tbody>
        </table>
        <div
          style="color:rgb(0,0,0);font-family:serif;font-size:medium;clear:both">
          <h1 id="gmail-abstract" style="line-height:1"><span
              class="gmail-header-section-number"
              style="display:inline-block;min-width:35pt">1</span> Abstract</h1>
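          <p>In response to NL 029: Disallow zero-width and control
            characters.</p>
          <p>Adopt Unicode Standard Annex 31 as part of C++23:</p>
          <ul style="list-style-type:none;padding-left:2em">
            <li style="margin-top:0.6em;margin-bottom:0.6em">C++
              identifiers match the pattern (XID_Start + _) +
              XID_Continue*.</li>
            <li style="margin-top:0.6em;margin-bottom:0.6em">Portable
              source is required to be normalized as NFC.</li>
            <li style="margin-top:0.6em;margin-bottom:0.6em">Using
              unassigned code points is ill-formed.</li>
          </ul>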
          <h1 id="gmail-poll-before-discussion" style="line-height:1"><span
              class="gmail-header-section-number"
              style="display:inline-block;min-width:35pt">2</span> Poll
            before discussion</h1>
          <p>The current state, which allows control characters, ZWJ,
            and unassigned code points in C++ identifiers, is not a
            defect, is working as designed, and does not need to be
            addressed.</p>
          <h1
            id="gmail-addressing-identifiers-in-a-more-principled-ways"
            style="line-height:1"><span
              class="gmail-header-section-number"
              style="display:inline-block;min-width:35pt">3</span> Addressing
            identifiers in a more principled way</h1>
          <p><a href="https://unicode.org/reports/tr31/"
              style="text-decoration-line:none;color:rgb(65,131,196)"
              moz-do-not-send="true">UNICODE IDENTIFIER AND PATTERN
              SYNTAX</a> (UAX31) is an attempt to provide a normative way of
            specifying definitions of general-purpose identifiers for
            use in programming languages. It has evolved significantly
            over the years, especially since C++11 was specified. The
            characters allowed in identifiers, and the patterns, were
            not stable at the time of C++11, which is the last time
            identifiers were addressed in the standard. In addition, at
            that time, ISO was promulgating advice suggesting a list of
            code points as the recommended method for ISO standards to
            specify identifiers.</p>
          <p>Today the definitions in UAX31 can be used to provide
            stable definitions for programming language identifiers,
            with guarantees that an identifier will not be invalidated
            by later standards.</p>
          <p>Originally, UAX31 relied on the derived character
            properties ID_Start and ID_Continue. However, those
            properties were derived from fundamental properties that
            could change over time. The Unicode database now provides
            XID_Start and XID_Continue, which are based on the same
            characteristics but carry an additional stability
            guarantee. The Unicode database also provides an explicit
            classification of characters for both.</p>
          <p>The original definitions closely match the identifier
            syntax of C:</p>
          <table style="border:1px solid
black;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em">
            <colgroup><col style="width:0px"><col style="width:0px"></colgroup><thead><tr
                class="gmail-header" style="border-bottom:3px double
                black">
                <th
style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px
                  solid black">
                  <div><strong>Properties</strong></div>
                </th>
                <th
style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px
                  solid black">
                  <div><strong>General Description of Coverage</strong></div>
                </th>
              </tr>
            </thead><tbody>
              <tr class="gmail-odd" style="border-bottom:1px solid
                black">
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start</td>
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start
                  characters are derived from the Unicode
                  General_Category of uppercase letters, lowercase
                  letters, titlecase letters, modifier letters, other
                  letters, letter numbers, plus Other_ID_Start, minus
                  Pattern_Syntax and Pattern_White_Space code points.</td>
              </tr>
              <tr class="even" style="border-bottom:1px solid black">
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                </td>
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top">In
                  set notation:</td>
              </tr>
              <tr class="gmail-odd" style="border-bottom:1px solid
                black">
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                </td>
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
              </tr>
              <tr class="even" style="border-bottom:1px solid black">
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue</td>
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue
                  characters include ID_Start characters, plus
                  characters having the Unicode General_Category of
                  nonspacing marks, spacing combining marks, decimal
                  number, connector punctuation, plus Other_ID_Continue,
                  minus Pattern_Syntax and Pattern_White_Space code
                  points.</td>
              </tr>
              <tr class="gmail-odd" style="border-bottom:1px solid
                black">
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                </td>
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top">In
                  set notation:</td>
              </tr>
              <tr class="even" style="border-bottom:1px solid black">
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                </td>
                <td
                  style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
              </tr>
            </tbody>
          </table>
          <p>The X versions of the properties, XID_Start and
            XID_Continue, start from the same definitions but are
            guaranteed to be stable in subsequent versions of the
            Unicode Standard.</p>
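          <p>As a rough illustration of the proposed identifier pattern,
            (XID_Start + _) followed by XID_Continue*, the following
            C++ sketch checks a null-terminated sequence of code
            points. The names Range, xid_start_excerpt,
            xid_continue_extra_excerpt, is_xid_start, is_xid_continue,
            and is_identifier are hypothetical, and the range tables
            are a tiny hand-picked excerpt rather than the real Unicode
            data. A real implementation would presumably generate
            complete tables from the Unicode Character Database.</p>
<pre>
```cpp
// Sketch only: the ranges below are a small excerpt chosen for
// illustration, not the full XID_Start / XID_Continue property data.
struct Range { char32_t lo; char32_t hi; };

// Excerpt of XID_Start: ASCII letters and some Latin-1 letters.
constexpr Range xid_start_excerpt[] = {
    {U'A', U'Z'}, {U'a', U'z'}, {0x00C0, 0x00D6}, {0x00D8, 0x00F6},
};

// Excerpt of what XID_Continue adds on top of XID_Start:
// ASCII digits, underscore, and the combining marks U+0300..U+036F.
constexpr Range xid_continue_extra_excerpt[] = {
    {U'0', U'9'}, {U'_', U'_'}, {0x0300, 0x036F},
};

constexpr bool is_xid_start(char32_t c) {
    for (Range r : xid_start_excerpt)
        if (c >= r.lo && r.hi >= c) return true;
    return false;
}

constexpr bool is_xid_continue(char32_t c) {
    if (is_xid_start(c)) return true;
    for (Range r : xid_continue_extra_excerpt)
        if (c >= r.lo && r.hi >= c) return true;
    return false;
}

// Matches (XID_Start + _) followed by XID_Continue* over a
// null-terminated sequence of code points. The empty string is rejected
// because the terminator is neither XID_Start nor an underscore.
constexpr bool is_identifier(const char32_t* s) {
    if (!(is_xid_start(*s) || *s == U'_')) return false;
    for (++s; *s != U'\0'; ++s)
        if (!is_xid_continue(*s)) return false;
    return true;
}

static_assert(is_identifier(U"_tmp1"), "underscore may start an identifier");
static_assert(is_identifier(U"caf\u00E9"), "U+00E9 is in the letter excerpt");
static_assert(!is_identifier(U"1abc"), "a digit may not start an identifier");
static_assert(!is_identifier(U"a\u200Db"), "ZWJ (U+200D) is not XID_Continue");
```
</pre>
          <p>The checks are written as static_assert so the sketch
            verifies itself at compile time. The last one reflects the
            NL 029 concern: a zero-width joiner inside an identifier is
            rejected under this pattern.</p>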
          <h1 id="gmail-issues" style="line-height:1"><span
              class="gmail-header-section-number"
              style="display:inline-block;min-width:35pt">4</span> Issues</h1>
          <ul style="list-style-type:none;padding-left:2em">
            <li style="margin-top:0.6em;margin-bottom:0.6em">XID_Continue
              does not include ZWJ, which some scripts require</li>
            <li style="margin-top:0.6em;margin-bottom:0.6em">Does not
              exclude homoglyph attacks</li>
            <li style="margin-top:0.6em;margin-bottom:0.6em">Does not
              require the compiler to normalize identifiers</li>
            <li style="margin-top:0.6em;margin-bottom:0.6em">Does not
              allow emoji</li>
          </ul>
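          <p>To make the normalization and homoglyph issues concrete,
            the following C++ sketch compares code point sequences
            directly, as a compiler that does not normalize might. The
            names same_code_points, precomposed, decomposed, latin_a,
            and cyrillic_a are hypothetical and used only for
            illustration.</p>
<pre>
```cpp
// Compares two null-terminated code point sequences element by element,
// with no normalization applied.
constexpr bool same_code_points(const char32_t* a, const char32_t* b) {
    while (*a != U'\0' && *a == *b) { ++a; ++b; }
    return *a == *b;
}

// Two spellings that render identically:
constexpr char32_t precomposed[] = U"caf\u00E9";   // NFC: U+00E9 as one code point
constexpr char32_t decomposed[]  = U"cafe\u0301";  // NFD: 'e' followed by U+0301

// Without an NFC requirement these would be two distinct identifiers.
static_assert(!same_code_points(precomposed, decomposed),
              "unnormalized spellings differ as code point sequences");

// Homoglyphs are a separate problem that normalization does not solve:
constexpr char32_t latin_a    = U'a';       // U+0061 LATIN SMALL LETTER A
constexpr char32_t cyrillic_a = U'\u0430';  // U+0430 CYRILLIC SMALL LETTER A
static_assert(latin_a != cyrillic_a,
              "visually similar letters are different code points");
```
</pre>
          <p>Requiring NFC in portable source removes the first
            ambiguity. The second remains, because the two letters
            are distinct characters in every normalization form.</p>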
          <h1 id="gmail-history" style="line-height:1"><span
              class="gmail-header-section-number"
              style="display:inline-block;min-width:35pt">5</span> History</h1>
          <p>Using an explicit list of Unicode characters was considered
            a best practice for ISO standardization in TR 10176:2003
            Guidelines for the preparation of programming language
            standards.</p>
          <p>National body comment CA 24 for C++11:</p>
          <blockquote>
            <p>A list of issues related TR 10176:2003:</p>
            <ul style="list-style-type:none;padding-left:2em">
              <li style="margin-top:0.6em;margin-bottom:0.6em">“Combining
                characters should not appear as the first character of
                an identifier.” Reference: ISO/IEC TR 10176:2003 (Annex
                A) This is not reflected in FCD.</li>
              <li style="margin-top:0.6em;margin-bottom:0.6em">Restrictions
                on the first character of an identifier are not observed
                as recommended in TR 10176:2003. The inclusion of digits
                (outside of those in the basic character set) under
                identifier-nondigit is implied by FCD.</li>
              <li style="margin-top:0.6em;margin-bottom:0.6em">It is
                implied that only the “main listing” from Annex A is
                included for C++. That is, the list ends with the
                Special Characters section. This is not made explicit in
                FCD. Existing practice in C++03 as well as WG 14 (C, as
                of N1425) and WG 4 (COBOL, as of N4315) is to include a
                list in a normative Annex.</li>
              <li style="margin-top:0.6em;margin-bottom:0.6em">Specify
                width sensitivity as implied by C++03: is not the same
                as A. Case sensitivity is already stated in
                [lex.name].</li>
            </ul>
          </blockquote>
          <p>N3146 (2010-10-04) considered using UAX31, but at the time
            there were stability issues with identifiers, and the paper
            came down on the side of explicit whitelisting.</p>
          <p>The Unicode Standard has since made stability guarantees
            about identifiers, and created the XID_Start and
            XID_Continue properties to alleviate the stability concerns
            that existed in 2010.</p>
          <h1 id="gmail-wording" style="line-height:1"><span
              class="gmail-header-section-number"
              style="display:inline-block;min-width:35pt">6</span> Wording</h1>
          <p>Wording to follow based on SG16 and EWG guidance. There is
            much prior art to follow based on similar proposals and
            adoption in Rust and Swift.</p>
          <p>Explicit universal character names and code points are
            available for particular versions of the Unicode Standard
            from the published database, and could be added as an
            appendix.</p>
        </div>
      </div>
      <br>
    </blockquote>
    <p><br>
    </p>
  </body>
</html>