<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">On 11/3/19 2:39 AM, Yehezkel Bernat
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CA+CmpXukF_R4md9ejR2k=2UR_1hDa4uX1syfHg6GT3ZsTyXZPQ@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div class="gmail_default" style="font-size:small;color:#000000">I'm
          sorry if this isn't the right place/thread to ask it:</div>
      </div>
    </blockquote>
    This is a fine place to ask.<br>
    <blockquote type="cite"
cite="mid:CA+CmpXukF_R4md9ejR2k=2UR_1hDa4uX1syfHg6GT3ZsTyXZPQ@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_default" style="font-size:small;color:#000000">Why
          do we allow non-ASCII characters in identifiers at all?
          Wouldn't life be simpler if identifiers must include only
          ASCII alphanumeric characters?</div>
        <div class="gmail_default" style="font-size:small;color:#000000">I
          know I assumed it to be the case until lately (when I started
          reading the relevant papers here.)
        </div>
      </div>
    </blockquote>
    <p>This feature was added in C++11 when support for
      universal-character-name escapes was added.  I wasn't involved in
      the committee at the time, so I don't really know the history. 
      The relevant paper is N3146
      (<a class="moz-txt-link-freetext" href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm</a>).<br>
    </p>
    <blockquote type="cite"
cite="mid:CA+CmpXukF_R4md9ejR2k=2UR_1hDa4uX1syfHg6GT3ZsTyXZPQ@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_default" style="font-size:small;color:#000000"><br>
        </div>
        <div class="gmail_default" style="font-size:small;color:#000000">Or
          maybe Unicode was allowed in the past and now it's too late to
          change it?</div>
      </div>
    </blockquote>
    <br>
    Tom.<br>
    <blockquote type="cite"
cite="mid:CA+CmpXukF_R4md9ejR2k=2UR_1hDa4uX1syfHg6GT3ZsTyXZPQ@mail.gmail.com"><br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Sun, Nov 3, 2019 at 1:22 AM
          Steve Downey &lt;<a href="mailto:sdowney@gmail.com"
            moz-do-not-send="true">sdowney@gmail.com</a>&gt; wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div dir="auto">Will do. </div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Sat, Nov 2, 2019, 15:07
              Tom Honermann &lt;<a href="mailto:tom@honermann.net"
                target="_blank" moz-do-not-send="true">tom@honermann.net</a>&gt;
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px
              0.8ex;border-left:1px solid
              rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF">
                <div>Also, please clarify the document number.  I
                  suspect it should be D1949R0 (it looks like an extra
                  "1" may have snuck in there).</div>
                <div><br>
                </div>
                <div>Tom.<br>
                </div>
                <div><br>
                </div>
                <div>On 11/2/19 3:05 PM, Tom Honermann wrote:<br>
                </div>
                <blockquote type="cite">
                  <div>Thanks, Steve.  Could you please attach this
                    paper to the SG16 wiki at <a
                      href="http://wiki.edg.com/bin/view/Wg21belfast/SG16"
                      rel="noreferrer" target="_blank"
                      moz-do-not-send="true">http://wiki.edg.com/bin/view/Wg21belfast/SG16</a>?<br>
                  </div>
                  <div><br>
                  </div>
                  <div>Tom.<br>
                  </div>
                  <div><br>
                  </div>
                  <div>On 11/2/19 9:44 AM, Steve Downey wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <h1 style="line-height:1;text-align:center">C++
                        Identifier Syntax using Unicode Standard Annex
                        31</h1>
                      <table
style="border:none;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em;float:right">
                        <tbody>
                          <tr>
                            <td
                              style="padding-left:1em;padding-right:1em;vertical-align:top">Document
                              #:</td>
                            <td
                              style="padding-left:1em;padding-right:1em;vertical-align:top">D19149R0</td>
                          </tr>
                          <tr>
                            <td
                              style="padding-left:1em;padding-right:1em;vertical-align:top">Date:</td>
                            <td
                              style="padding-left:1em;padding-right:1em;vertical-align:top">2019-11-02</td>
                          </tr>
                          <tr>
                            <td
                              style="padding-left:1em;padding-right:1em;vertical-align:top">Project:</td>
                            <td
                              style="padding-left:1em;padding-right:1em;vertical-align:top">Programming
                              Language C++<br>
                              SG16<br>
                              EWG<br>
                              CWG<br>
                            </td>
                          </tr>
                          <tr>
                            <td
                              style="padding-left:1em;padding-right:1em;vertical-align:top">Reply-to:</td>
                            <td
                              style="padding-left:1em;padding-right:1em;vertical-align:top">Steve
                              Downey<br>
                              &lt;<a href="mailto:sdowney@gmail.com"
                                style="text-decoration-line:none;color:rgb(65,131,196)"
                                rel="noreferrer" target="_blank"
                                moz-do-not-send="true">sdowney@gmail.com</a>, <a
                                href="mailto:sdowney2@bloomberg.net"
                                style="text-decoration-line:none;color:rgb(65,131,196)"
                                rel="noreferrer" target="_blank"
                                moz-do-not-send="true">sdowney2@bloomberg.net</a>&gt;<br>
                            </td>
                          </tr>
                        </tbody>
                      </table>
                      <div
                        style="color:rgb(0,0,0);font-family:serif;font-size:medium;clear:both">
                        <h1
                          id="gmail-m_4891618660013739441m_2729222737077895551gmail-abstract"
                          style="line-height:1"><span
                            style="display:inline-block;min-width:35pt">1</span> Abstract</h1>
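                        <p>In response to NL 029: Disallow zero-width
                          and control characters.</p>
                        <p>Adopt Unicode Standard Annex 31 as part of
                          C++23:</p>
                        <ul
                          style="list-style-type:none;padding-left:2em">
                          <li
                            style="margin-top:0.6em;margin-bottom:0.6em">C++
                            identifiers match the pattern (XID_START + _ )
                            + XID_CONTINUE*.</li>
                          <li
                            style="margin-top:0.6em;margin-bottom:0.6em">Portable
                            source is required to be normalized as NFC.</li>
                          <li
                            style="margin-top:0.6em;margin-bottom:0.6em">Using
                            unassigned code points is ill-formed.</li>
                        </ul>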
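                        <p>As a rough illustration of the three
                          requirements in the abstract, the following
                          Python sketch checks a candidate identifier. It
                          assumes that Python's built-in
                          <code>str.isidentifier()</code>, which Python
                          documents as following an
                          XID_Start/XID_Continue grammar with the
                          underscore allowed as a start character, is an
                          acceptable stand-in for the proposed pattern. The
                          function name <code>check_identifier</code> is
                          hypothetical, and results depend on the Unicode
                          version bundled with the interpreter.</p>

```python
import unicodedata


def check_identifier(name):
    """Return a list of rule violations for a candidate identifier.

    Sketch only: str.isidentifier() stands in for the pattern
    (XID_START + _) + XID_CONTINUE*.
    """
    problems = []
    # Rule 1: first character is XID_Start or "_", the rest XID_Continue.
    if not name.isidentifier():
        problems.append("does not match the identifier pattern")
    # Rule 2: the spelling must already be in Normalization Form C.
    if unicodedata.normalize("NFC", name) != name:
        problems.append("not NFC-normalized")
    # Rule 3: no code points with General_Category Cn (unassigned).
    if any(unicodedata.category(ch) == "Cn" for ch in name):
        problems.append("contains an unassigned code point")
    return problems


print(check_identifier("_count1"))         # no violations
print(check_identifier("1count"))          # starts with a digit
print(check_identifier("e\u0301t\u00e9"))  # decomposed e + combining acute
```

                        <p>An empty list means the spelling passes all
                          three checks. Note that Python itself normalizes
                          identifiers with NFKC rather than requiring NFC,
                          so the second check is specific to this
                          sketch.</p>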
                        <h1
id="gmail-m_4891618660013739441m_2729222737077895551gmail-poll-before-discussion"
                          style="line-height:1"><span
                            style="display:inline-block;min-width:35pt">2</span> Poll
                          before discussion</h1>
                        <p>The current state, which allows control
                          characters, ZWJ, and unassigned code points in
                          C++ identifiers, is not a defect, is working as
                          designed, and does not need to be
                          addressed.</p>
                        <h1
id="gmail-m_4891618660013739441m_2729222737077895551gmail-addressing-identifiers-in-a-more-principled-ways"
                          style="line-height:1"><span
                            style="display:inline-block;min-width:35pt">3</span> Addressing
                          identifiers in a more principled way</h1>
                        <p><a href="https://unicode.org/reports/tr31/"
                            style="text-decoration-line:none;color:rgb(65,131,196)"
                            rel="noreferrer" target="_blank"
                            moz-do-not-send="true">UNICODE IDENTIFIER
                            AND PATTERN SYNTAX</a> (UAX31) is an attempt
                          to provide a normative way of specifying
                          definitions of general-purpose identifiers for
                          use in programming languages. It has evolved
                          significantly over the years, especially since
                          C++11 was specified. In particular, the
                          characters allowed in identifiers, and the
                          patterns, were not stable at the time of C++11,
                          which is the last time identifiers were
                          addressed in the standard. In addition, at that
                          time, ISO was promulgating advice suggesting a
                          list of code points as the recommended method
                          for ISO standards to specify identifiers.</p>
                        <p>Today the definitions in UAX31 can be used to
                          provide stable definitions for programming
                          language identifiers, with guarantees that an
                          identifier will not be invalidated by later
                          standards.</p>
                        <p>Originally, UAX31 relied on the derived
                          character properties ID_START and ID_CONTINUE.
                          However, those properties relied on fundamental
                          properties that could change over time. The
                          Unicode database now provides XID_START and
                          XID_CONTINUE, based on the same
                          characteristics but with an additional
                          stability guarantee, and provides an explicit
                          classification of both.</p>
                        <p>The original definitions closely match the
                          identifier syntax of C:</p>
                        <table style="border:1px solid
black;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em">
                          <colgroup><col style="width:0px"><col
                              style="width:0px"></colgroup><thead><tr
                              style="border-bottom:3px double black">
                              <th
style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px
                                solid black">
                                <div><strong>Properties</strong></div>
                              </th>
                              <th
style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px
                                solid black">
                                <div><strong>General Description of
                                    Coverage</strong></div>
                              </th>
                            </tr>
                          </thead><tbody>
                            <tr style="border-bottom:1px solid black">
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start</td>
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start
                                characters are derived from the Unicode
                                General_Category of uppercase letters,
                                lowercase letters, titlecase letters,
                                modifier letters, other letters, letter
                                numbers, plus Other_ID_Start, minus
                                Pattern_Syntax and Pattern_White_Space
                                code points.</td>
                            </tr>
                            <tr style="border-bottom:1px solid black">
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                              </td>
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top">In
                                set notation:</td>
                            </tr>
                            <tr style="border-bottom:1px solid black">
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                              </td>
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
                            </tr>
                            <tr style="border-bottom:1px solid black">
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue</td>
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue
                                characters include ID_Start characters,
                                plus characters having the Unicode
                                General_Category of nonspacing marks,
                                spacing combining marks, decimal number,
                                connector punctuation, plus
                                Other_ID_Continue, minus Pattern_Syntax
                                and Pattern_White_Space code points.</td>
                            </tr>
                            <tr style="border-bottom:1px solid black">
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                              </td>
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top">In
                                set notation:</td>
                            </tr>
                            <tr style="border-bottom:1px solid black">
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                              </td>
                              <td
                                style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
                            </tr>
                          </tbody>
                        </table>
                        <p>The X versions of the properties start out the
                          same but are guaranteed to remain stable in
                          subsequent versions of the Unicode Standard.</p>
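                        <p>The derivation described in the table can be
                          approximated in Python using only
                          <code>unicodedata.category()</code>. This is a
                          sketch under stated assumptions: the standard
                          library does not expose Other_ID_Start,
                          Other_ID_Continue, Pattern_Syntax, or
                          Pattern_White_Space, so those terms are omitted,
                          and the helper names are hypothetical.</p>

```python
import unicodedata

# General_Category values named in the table's descriptions.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch):
    """Approximate ID_Start by General_Category alone."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch):
    """Approximate ID_Continue: ID_Start plus marks, digits, connectors."""
    return approx_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


# Print category, start status, and continue status for a few samples.
for ch in ["a", "\u00e9", "7", "_", "\u0301", "+"]:
    print(repr(ch), unicodedata.category(ch),
          approx_id_start(ch), approx_id_continue(ch))
```

                        <p>In this approximation the underscore is
                          connector punctuation (Pc), so it counts as a
                          continue character but not as a start character.
                          That is consistent with the proposed pattern
                          listing _ explicitly alongside XID_START.</p>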
                        <h1
                          id="gmail-m_4891618660013739441m_2729222737077895551gmail-issues"
                          style="line-height:1"><span
                            style="display:inline-block;min-width:35pt">4</span> Issues</h1>
                        <ul
                          style="list-style-type:none;padding-left:2em">
                          <li
                            style="margin-top:0.6em;margin-bottom:0.6em">XID_CONTINUE
                            does not include ZWJ, which some scripts
                            require</li>
                          <li
                            style="margin-top:0.6em;margin-bottom:0.6em">Does
                            not exclude homoglyph attacks</li>
                          <li
                            style="margin-top:0.6em;margin-bottom:0.6em">Does
                            not require the compiler to normalize
                            identifiers</li>
                          <li
                            style="margin-top:0.6em;margin-bottom:0.6em">Does
                            not allow emoji</li>
                        </ul>
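                        <p>Some of these issues can be observed with a
                          short Python sketch. It assumes, as an
                          approximation only, that
                          <code>str.isidentifier()</code> reflects the
                          XID-based pattern; the variable names are
                          illustrative. ZWJ is left out because its
                          treatment may differ between Unicode
                          versions.</p>

```python
import unicodedata

latin_a = "a"            # LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # CYRILLIC SMALL LETTER A, visually similar
composed = "\u00e9"      # e with acute as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Homoglyphs: both spellings are valid but distinct identifiers.
print(latin_a.isidentifier(), cyrillic_a.isidentifier())  # prints True True
print(latin_a == cyrillic_a)                              # prints False

# Without normalization the two spellings of the same text differ.
print(composed == decomposed)                             # prints False
# After NFC normalization they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # prints True

# Emoji are symbols (General_Category So), so they are not XID_Start.
print("\U0001F600".isidentifier())                        # prints False
```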
                        <h1
                          id="gmail-m_4891618660013739441m_2729222737077895551gmail-history"
                          style="line-height:1"><span
                            style="display:inline-block;min-width:35pt">5</span> History</h1>
                        <p>Using an explicit list of Unicode characters
                          was considered a best practice for ISO
                          standardization in TR 10176:2003 Guidelines
                          for the preparation of programming language
                          standards.</p>
                        <p>National body comment CA 24 for C++11:</p>
                        <blockquote>
                          <p>A list of issues related to TR 10176:2003:</p>
                          <ul
                            style="list-style-type:none;padding-left:2em">
                            <li
                              style="margin-top:0.6em;margin-bottom:0.6em">“Combining
                              characters should not appear as the first
                              character of an identifier.” Reference:
                              ISO/IEC TR 10176:2003 (Annex A) This is
                              not reflected in FCD.</li>
                            <li
                              style="margin-top:0.6em;margin-bottom:0.6em">Restrictions
                              on the first character of an identifier
                              are not observed as recommended in TR
                              10176:2003. The inclusion of digits
                              (outside of those in the basic character
                              set) under identifier-nondigit is implied
                              by FCD.</li>
                            <li
                              style="margin-top:0.6em;margin-bottom:0.6em">It
                              is implied that only the “main listing”
                              from Annex A is included for C++. That is,
                              the list ends with the Special Characters
                              section. This is not made explicit in FCD.
                              Existing practice in C++03 as well as WG
                              14 (C, as of N1425) and WG 4 (COBOL, as of
                              N4315) is to include a list in a normative
                              Annex.</li>
                            <li
                              style="margin-top:0.6em;margin-bottom:0.6em">Specify
                              width sensitivity as implied by C++03:
                              \uFF21 is not the same as A. Case sensitivity is
                              already stated in [lex.name].</li>
                          </ul>
                        </blockquote>
                        <p>N3146 (2010-10-04) considered using UAX31,
                          but at the time there were stability issues
                          with identifiers, and the paper came down on
                          the side of explicit whitelisting.</p>
                        <p>The Unicode standard has since made stability
                          guarantees about identifiers, and created the
                          XID_START and XID_CONTINUE properties to
                          alleviate the stability concerns that existed
                          in 2010.</p>
                        <h1
                          id="gmail-m_4891618660013739441m_2729222737077895551gmail-wording"
                          style="line-height:1"><span
                            style="display:inline-block;min-width:35pt">6</span> Wording</h1>
                        <p>Wording to follow based on SG16 and EWG
                          guidance. There is much prior art to follow
                          based on similar proposals and adoption in
                          Rust and Swift.</p>
                        <p>Explicit universal character names and
                          code points are available for particular
                          versions of the Unicode Standard from the
                          published database, and could be included as
                          an appendix.</p>
                      </div>
                    </div>
                    <br>
                    <fieldset></fieldset>
                    <pre>_______________________________________________
SG16 Unicode mailing list
<a href="mailto:Unicode@isocpp.open-std.org" rel="noreferrer" target="_blank" moz-do-not-send="true">Unicode@isocpp.open-std.org</a>
<a href="http://www.open-std.org/mailman/listinfo/unicode" rel="noreferrer" target="_blank" moz-do-not-send="true">http://www.open-std.org/mailman/listinfo/unicode</a>
</pre>
                  </blockquote>
                  <p><br>
                  </p>
                  <br>
                </blockquote>
                <p><br>
                </p>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
      <br>
    </blockquote>
    <p><br>
    </p>
  </body>
</html>