<div dir="ltr"><div class="gmail_default" style="font-size:small;color:#000000">I&#39;m sorry if this isn&#39;t the right place/thread to ask this:</div><div class="gmail_default" style="font-size:small;color:#000000">Why do we allow non-ASCII characters in identifiers at all? Wouldn&#39;t life be simpler if identifiers could contain only ASCII alphanumeric characters?</div><div class="gmail_default" style="font-size:small;color:#000000">I assumed that was the case until recently, when I started reading the relevant papers here.

</div><div class="gmail_default" style="font-size:small;color:#000000"><br></div><div class="gmail_default" style="font-size:small;color:#000000">Or maybe Unicode was allowed in the past and now it&#39;s too late to change it?</div><div class="gmail_default" style="font-size:small;color:#000000"></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Nov 3, 2019 at 1:22 AM Steve Downey &lt;<a href="mailto:sdowney@gmail.com">sdowney@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">Will do. </div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Nov 2, 2019, 15:07 Tom Honermann &lt;<a href="mailto:tom@honermann.net" target="_blank">tom@honermann.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF">
    <div>Also, please clarify the document
      number.  I suspect it should be D1949R0 (it looks like an extra
      &quot;1&quot; may have snuck in there).</div>
    <div><br>
    </div>
    <div>Tom.<br>
    </div>
    <div><br>
    </div>
    <div>On 11/2/19 3:05 PM, Tom Honermann
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div>Thanks, Steve.  Could you please
        attach this paper to the SG16 wiki at <a href="http://wiki.edg.com/bin/view/Wg21belfast/SG16" rel="noreferrer" target="_blank">http://wiki.edg.com/bin/view/Wg21belfast/SG16</a>?<br>
      </div>
      <div><br>
      </div>
      <div>Tom.<br>
      </div>
      <div><br>
      </div>
      <div>On 11/2/19 9:44 AM, Steve Downey
        wrote:<br>
      </div>
      <blockquote type="cite">
        
        <div dir="ltr">
          <h1 style="line-height:1;text-align:center">C++ Identifier
            Syntax using Unicode Standard Annex 31</h1>
          <table style="border:none;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em;float:right">
            <tbody>
              <tr>
                <td style="padding-left:1em;padding-right:1em;vertical-align:top">Document
                  #:</td>
                <td style="padding-left:1em;padding-right:1em;vertical-align:top">D19149R0</td>
              </tr>
              <tr>
                <td style="padding-left:1em;padding-right:1em;vertical-align:top">Date:</td>
                <td style="padding-left:1em;padding-right:1em;vertical-align:top">2019-11-02</td>
              </tr>
              <tr>
                <td style="padding-left:1em;padding-right:1em;vertical-align:top">Project:</td>
                <td style="padding-left:1em;padding-right:1em;vertical-align:top">Programming
                  Language C++<br>
                  SG16<br>
                  EWG<br>
                  CWG<br>
                </td>
              </tr>
              <tr>
                <td style="padding-left:1em;padding-right:1em;vertical-align:top">Reply-to:</td>
                <td style="padding-left:1em;padding-right:1em;vertical-align:top">Steve
                  Downey<br>
                  &lt;<a href="mailto:sdowney@gmail.com" style="text-decoration-line:none;color:rgb(65,131,196)" rel="noreferrer" target="_blank">sdowney@gmail.com</a>, <a href="mailto:sdowney2@bloomberg.net" style="text-decoration-line:none;color:rgb(65,131,196)" rel="noreferrer" target="_blank">sdowney2@bloomberg.net</a>&gt;<br>
                </td>
              </tr>
            </tbody>
          </table>
          <div style="color:rgb(0,0,0);font-family:serif;font-size:medium;clear:both">
            <h1 id="gmail-m_4891618660013739441m_2729222737077895551gmail-abstract" style="line-height:1"><span style="display:inline-block;min-width:35pt">1</span> Abstract</h1>
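            <p>In response to NL 029: Disallow zero-width and control
              characters.</p>
            <p>Adopt Unicode Standard Annex 31 as part of C++23:</p>
            <ul style="list-style-type:none;padding-left:2em">
              <li style="margin-top:0.6em;margin-bottom:0.6em">That C++
                identifiers match the pattern (XID_START + _ ) +
                XID_CONTINUE*.</li>
              <li style="margin-top:0.6em;margin-bottom:0.6em">That portable
                source is required to be normalized as NFC.</li>
              <li style="margin-top:0.6em;margin-bottom:0.6em">That using
                unassigned code points is ill-formed.</li>
            </ul>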
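            <p>The three rules in the abstract can be sketched as a small
              checker. This is only an illustration in Python, not proposed
              wording or a compiler implementation. The function name
              <code>check_identifier</code> is hypothetical.
              <code>str.isidentifier()</code> is used as a stand-in for the
              pattern because Python defines its own identifiers in terms of
              XID_Start and XID_Continue, and the results depend on the
              Unicode version bundled with the Python interpreter.</p>

```python
import unicodedata


def check_identifier(name: str) -> list:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []

    # Unassigned code points have General_Category Cn in the Unicode database.
    if any(unicodedata.category(ch) == "Cn" for ch in name):
        problems.append("contains an unassigned code point")

    # Pattern rule: (XID_Start + _) followed by XID_Continue*.
    # Assumption: Python's own identifier rule is used as a stand-in here.
    if not name.isidentifier():
        problems.append("does not match (XID_Start + _) XID_Continue*")

    # Normalization rule: the spelling must already be in NFC.
    if unicodedata.normalize("NFC", name) != name:
        problems.append("is not NFC-normalized")

    return problems
```

            <p>For example, <code>check_identifier("caf\u00e9")</code>
              reports no problems, while the decomposed spelling
              <code>"cafe\u0301"</code> matches the pattern but fails the NFC
              rule.</p>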
            <h1 id="gmail-m_4891618660013739441m_2729222737077895551gmail-poll-before-discussion" style="line-height:1"><span style="display:inline-block;min-width:35pt">2</span> Poll
              before discussion</h1>
            <p>The current state, which allows control characters, ZWJ, and
              unassigned code points in C++ identifiers, is not a defect,
              is working as designed, and does not need to be
              addressed.</p>
            <h1 id="gmail-m_4891618660013739441m_2729222737077895551gmail-addressing-identifiers-in-a-more-principled-ways" style="line-height:1"><span style="display:inline-block;min-width:35pt">3</span> Addressing
              identifiers in a more principled way</h1>
            <p><a href="https://unicode.org/reports/tr31/" style="text-decoration-line:none;color:rgb(65,131,196)" rel="noreferrer" target="_blank">UNICODE IDENTIFIER AND PATTERN
                SYNTAX</a> (UAX31) is an attempt to provide a normative way of
              specifying general-purpose identifiers for
              use in programming languages. It has evolved significantly
              over the years, especially since C++11
              was specified. The characters
              allowed in identifiers, and the identifier patterns, were not stable
              at the time of C++11, which is the last time identifiers
              were addressed in the standard. In addition, at that time,
              ISO was recommending an explicit list of code
              points as the method for ISO standards to
              specify identifiers.</p>
            <p>Today the definitions in UAX31 can be used to provide
              stable definitions for programming language identifiers,
              with guarantees that an identifier will not be invalidated
              by later standards.</p>
            <p>Originally, UAX31 relied on the derived character
              properties ID_START and ID_CONTINUE. Those properties,
              however, are derived from fundamental properties that can
              change over time. The Unicode database now also provides
              XID_START and XID_CONTINUE, which are based on the same
              characteristics but carry an additional stability
              guarantee, and it explicitly lists the code points that
              have each property.</p>
            <p>The original definitions closely match the identifier
              syntax of C:</p>
            <table style="border:1px solid black;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em">
              <colgroup><col style="width:0px"><col style="width:0px"></colgroup><thead><tr style="border-bottom:3px double black">
                  <th style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px solid black">
                    <div><strong>Properties</strong></div>
                  </th>
                  <th style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px solid black">
                    <div><strong>General Description of Coverage</strong></div>
                  </th>
                </tr>
              </thead><tbody>
                <tr style="border-bottom:1px solid black">
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start</td>
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start
                    characters are derived from the Unicode
                    General_Category of uppercase letters, lowercase
                    letters, titlecase letters, modifier letters, other
                    letters, letter numbers, plus Other_ID_Start, minus
                    Pattern_Syntax and Pattern_White_Space code points.</td>
                </tr>
                <tr style="border-bottom:1px solid black">
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                  </td>
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top">In
                    set notation:</td>
                </tr>
                <tr style="border-bottom:1px solid black">
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                  </td>
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
                </tr>
                <tr style="border-bottom:1px solid black">
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue</td>
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue
                    characters include ID_Start characters, plus
                    characters having the Unicode General_Category of
                    nonspacing marks, spacing combining marks, decimal
                    number, connector punctuation, plus
                    Other_ID_Continue, minus Pattern_Syntax and
                    Pattern_White_Space code points.</td>
                </tr>
                <tr style="border-bottom:1px solid black">
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                  </td>
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top">In
                    set notation:</td>
                </tr>
                <tr style="border-bottom:1px solid black">
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
                  </td>
                  <td style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
                </tr>
              </tbody>
            </table>
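            <p>The General_Category part of these definitions can be
              approximated with Python's <code>unicodedata</code> module.
              This is a rough sketch only: it deliberately omits
              Other_ID_Start, Other_ID_Continue, and the subtraction of
              Pattern_Syntax and Pattern_White_Space, because the module
              does not expose those properties. The function names are
              hypothetical.</p>

```python
import unicodedata

# General_Category values named in the ID_Start description:
# uppercase, lowercase, titlecase, modifier and other letters, letter numbers.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Additional categories named in the ID_Continue description: nonspacing
# marks, spacing combining marks, decimal numbers, connector punctuation.
CONTINUE_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    """Approximate ID_Start from General_Category alone."""
    return unicodedata.category(ch) in START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    """Approximate ID_Continue: ID_Start plus marks, digits and connectors."""
    return (
        approx_id_start(ch)
        or unicodedata.category(ch) in CONTINUE_EXTRA_CATEGORIES
    )
```

            <p>Under this approximation the underscore, which is connector
              punctuation (Pc), is a continue character but not a start
              character. That is consistent with a C-like pattern having to
              add <code>_</code> to the start set explicitly.</p>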
            <p>The XID versions of the properties start from the same
              definitions but are guaranteed to be stable in subsequent
              versions of the Unicode Standard.</p>
            <h1 id="gmail-m_4891618660013739441m_2729222737077895551gmail-issues" style="line-height:1"><span style="display:inline-block;min-width:35pt">4</span> Issues</h1>
            <ul style="list-style-type:none;padding-left:2em">
              <li style="margin-top:0.6em;margin-bottom:0.6em">XID_CONTINUE
                does not include ZWJ, which some scripts require</li>
              <li style="margin-top:0.6em;margin-bottom:0.6em">Does not
                exclude homoglyph attacks</li>
              <li style="margin-top:0.6em;margin-bottom:0.6em">Does not
                require the compiler to normalize identifiers</li>
              <li style="margin-top:0.6em;margin-bottom:0.6em">Does not
                allow emoji</li>
            </ul>
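            <p>Each of these issues can be observed with a few string
              checks. The sketch below again uses Python's
              <code>str.isidentifier()</code> as a stand-in for the
              XID-based pattern, which is an assumption of the example
              rather than part of the proposal. The comparisons are plain
              string comparisons and do not involve Python's own parser.</p>

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# ZWJ is a format character (General_Category Cf), so a name containing it
# does not match the XID-based pattern.
zwj_allowed = ("a" + ZWJ + "b").isidentifier()

# Homoglyphs: LATIN SMALL LETTER A (U+0061) and CYRILLIC SMALL LETTER A
# (U+0430) look alike, are both valid identifiers, and are different names.
latin_a, cyrillic_a = "\u0061", "\u0430"
homoglyphs_both_valid = latin_a.isidentifier() and cyrillic_a.isidentifier()
homoglyphs_distinct = latin_a != cyrillic_a

# Normalization: the composed and decomposed spellings of e-acute both match
# the pattern but differ as code point sequences until they are normalized.
composed, decomposed = "\u00e9", "e\u0301"
both_spellings_valid = composed.isidentifier() and decomposed.isidentifier()
spellings_distinct = composed != decomposed
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Emoji: U+1F600 GRINNING FACE has General_Category So, which is in neither
# the start set nor the continue set.
emoji_allowed = "\U0001F600".isidentifier()
```

            <p>Here <code>zwj_allowed</code> and <code>emoji_allowed</code>
              are false, while the homoglyph and normalization pairs are
              each accepted as two distinct identifiers.</p>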
            <h1 id="gmail-m_4891618660013739441m_2729222737077895551gmail-history" style="line-height:1"><span style="display:inline-block;min-width:35pt">5</span> History</h1>
            <p>Using an explicit list of Unicode characters was
              considered a best practice for ISO standardization in TR
              10176:2003 Guidelines for the preparation of programming
              language standards.</p>
            <p>National body comment CA 24 for C++11:</p>
            <blockquote>
              <p>A list of issues related to TR 10176:2003:</p>
              <ul style="list-style-type:none;padding-left:2em">
                <li style="margin-top:0.6em;margin-bottom:0.6em">“Combining
                  characters should not appear as the first character of
                  an identifier.” Reference: ISO/IEC TR 10176:2003
                  (Annex A) This is not reflected in FCD.</li>
                <li style="margin-top:0.6em;margin-bottom:0.6em">Restrictions
                  on the first character of an identifier are not
                  observed as recommended in TR 10176:2003. The
                  inclusion of digits (outside of those in the basic
                  character set) under identifier-nondigit is implied by
                  FCD.</li>
                <li style="margin-top:0.6em;margin-bottom:0.6em">It is
                  implied that only the “main listing” from Annex A is
                  included for C++. That is, the list ends with the
                  Special Characters section. This is not made explicit
                  in FCD. Existing practice in C++03 as well as WG 14
                  (C, as of N1425) and WG 4 (COBOL, as of N4315) is to
                  include a list in a normative Annex.</li>
                <li style="margin-top:0.6em;margin-bottom:0.6em">Specify
                  width sensitivity as implied by C++03: \uFF21 is not the same
                  as A. Case sensitivity is already stated in [lex.name].</li>
              </ul>
            </blockquote>
            <p>N3146 (2010-10-04) considered using UAX31, but at the
              time there were stability issues with identifiers, and
              the paper came down on the side of explicit whitelisting.</p>
            <p>The Unicode standard has since made stability guarantees
              about identifiers, and created the XID_START and
              XID_CONTINUE properties to alleviate the stability
              concerns that existed in 2010.</p>
            <h1 id="gmail-m_4891618660013739441m_2729222737077895551gmail-wording" style="line-height:1"><span style="display:inline-block;min-width:35pt">6</span> Wording</h1>
            <p>Wording to follow based on SG16 and EWG guidance. There
              is much prior art to follow based on similar proposals and
              adoption in Rust and Swift.</p>
            <p>Explicit universal character names and code points are
              available for particular versions of the Unicode Standard
              from the published database, and could be included as an
              appendix.</p>
          </div>
        </div>
        <br>
        <fieldset></fieldset>
        <pre>_______________________________________________
SG16 Unicode mailing list
<a href="mailto:Unicode@isocpp.open-std.org" rel="noreferrer" target="_blank">Unicode@isocpp.open-std.org</a>
<a href="http://www.open-std.org/mailman/listinfo/unicode" rel="noreferrer" target="_blank">http://www.open-std.org/mailman/listinfo/unicode</a>
</pre>
      </blockquote>
      <p><br>
      </p>
      <br>
    </blockquote>
    <p><br>
    </p>
  </div>

</blockquote></div>
</blockquote></div>