<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 11/3/19 2:39 AM, Yehezkel Bernat
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CA+CmpXukF_R4md9ejR2k=2UR_1hDa4uX1syfHg6GT3ZsTyXZPQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_default" style="font-size:small;color:#000000">I'm
sorry if this isn't the right place/thread to ask it:</div>
</div>
</blockquote>
This is a fine place to ask.<br>
<blockquote type="cite"
cite="mid:CA+CmpXukF_R4md9ejR2k=2UR_1hDa4uX1syfHg6GT3ZsTyXZPQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_default" style="font-size:small;color:#000000">Why
do we allow non-ASCII characters in identifiers at all?
Wouldn't life be simpler if identifiers must include only
ASCII alphanumeric characters?</div>
<div class="gmail_default" style="font-size:small;color:#000000">I
know I assumed it to be the case until lately (when I started
reading the relevant papers here.)
</div>
</div>
</blockquote>
<p>This feature was added in C++11 when support for
universal-character-name escapes was added. I wasn't involved in
the committee at the time, so I don't really know the history.
The relevant paper is N3146, which was also published to WG14 as N1518
(<a class="moz-txt-link-freetext" href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm</a>).<br>
</p>
<blockquote type="cite"
cite="mid:CA+CmpXukF_R4md9ejR2k=2UR_1hDa4uX1syfHg6GT3ZsTyXZPQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_default" style="font-size:small;color:#000000"><br>
</div>
<div class="gmail_default" style="font-size:small;color:#000000">Or
maybe Unicode was allowed in the past and now it's too late to
change it?</div>
</div>
</blockquote>
<br>
Tom.<br>
<blockquote type="cite"
cite="mid:CA+CmpXukF_R4md9ejR2k=2UR_1hDa4uX1syfHg6GT3ZsTyXZPQ@mail.gmail.com"><br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun, Nov 3, 2019 at 1:22 AM
Steve Downey &lt;<a href="mailto:sdowney@gmail.com"
moz-do-not-send="true">sdowney@gmail.com</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="auto">Will do. </div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Nov 2, 2019, 15:07
Tom Honermann &lt;<a href="mailto:tom@honermann.net"
target="_blank" moz-do-not-send="true">tom@honermann.net</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div>Also, please clarify the document number. I
suspect it should be D1949R0 (it looks like an extra
"1" may have snuck in there).</div>
<div><br>
</div>
<div>Tom.<br>
</div>
<div><br>
</div>
<div>On 11/2/19 3:05 PM, Tom Honermann wrote:<br>
</div>
<blockquote type="cite">
<div>Thanks, Steve. Could you please attach this
paper to the SG16 wiki at <a
href="http://wiki.edg.com/bin/view/Wg21belfast/SG16"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://wiki.edg.com/bin/view/Wg21belfast/SG16</a>?<br>
</div>
<div><br>
</div>
<div>Tom.<br>
</div>
<div><br>
</div>
<div>On 11/2/19 9:44 AM, Steve Downey wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<h1 style="line-height:1;text-align:center">C++
Identifier Syntax using Unicode Standard Annex
31</h1>
<table
style="border:none;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em;float:right">
<tbody>
<tr>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Document
#:</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">D19149R0</td>
</tr>
<tr>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Date:</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">2019-11-02</td>
</tr>
<tr>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Project:</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Programming
Language C++<br>
SG16<br>
EWG<br>
CWG<br>
</td>
</tr>
<tr>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Reply-to:</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Steve
Downey<br>
&lt;<a href="mailto:sdowney@gmail.com"
style="text-decoration-line:none;color:rgb(65,131,196)"
rel="noreferrer" target="_blank"
moz-do-not-send="true">sdowney@gmail.com</a>, <a
href="mailto:sdowney2@bloomberg.net"
style="text-decoration-line:none;color:rgb(65,131,196)"
rel="noreferrer" target="_blank"
moz-do-not-send="true">sdowney2@bloomberg.net</a>&gt;<br>
</td>
</tr>
</tbody>
</table>
<div
style="color:rgb(0,0,0);font-family:serif;font-size:medium;clear:both">
<h1
id="gmail-m_4891618660013739441m_2729222737077895551gmail-abstract"
style="line-height:1"><span
style="display:inline-block;min-width:35pt">1</span> Abstract</h1>
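<p>In response to NL 029: Disallow zero-width
and control characters</p>
<p>Adopt Unicode Standard Annex 31 as part of C++23:</p>
<ul
style="list-style-type:none;padding-left:2em">
<li
style="margin-top:0.6em;margin-bottom:0.6em">That
C++ identifiers match the pattern
(XID_START + _ ) + XID_CONTINUE*.</li>
<li
style="margin-top:0.6em;margin-bottom:0.6em">That
portable source is required to be normalized
as NFC.</li>
<li
style="margin-top:0.6em;margin-bottom:0.6em">That
using unassigned code points is
ill-formed.</li>
</ul>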
<h1
id="gmail-m_4891618660013739441m_2729222737077895551gmail-poll-before-discussion"
style="line-height:1"><span
style="display:inline-block;min-width:35pt">2</span> Poll
before discussion</h1>
<p>The current state, which allows control
characters, ZWJ, and unassigned code points in
C++ identifiers, is not a defect, is
working as designed, and does not need to be
addressed.</p>
<h1
id="gmail-m_4891618660013739441m_2729222737077895551gmail-addressing-identifiers-in-a-more-principled-ways"
style="line-height:1"><span
style="display:inline-block;min-width:35pt">3</span> Addressing
identifiers in a more principled way</h1>
<p><a href="https://unicode.org/reports/tr31/"
style="text-decoration-line:none;color:rgb(65,131,196)"
rel="noreferrer" target="_blank"
moz-do-not-send="true">UNICODE IDENTIFIER
AND PATTERN SYNTAX</a> is an attempt to
provide a normative way of specifying
general-purpose identifiers for
use in programming languages. It has evolved
significantly over the years, especially
since C++11 was specified. In
particular, the characters allowed
in identifiers, and the patterns they form, were not
stable at the time of C++11, which is the last
time identifiers were addressed in the
standard. In addition, at that time ISO was
promulgating advice that recommended an explicit list of code
points as the method for ISO
standards to specify identifiers.</p>
<p>Today the definitions in UAX31 can be used to
provide stable definitions for programming
language identifiers, with guarantees that an
identifier will not be invalidated by later
standards.</p>
<p>Originally, UAX31 relied on the derived
character properties ID_START and
ID_CONTINUE. However, those properties depended
on more fundamental properties that could change
over time. The Unicode database now provides
XID_START and XID_CONTINUE, which are based on the same
characteristics but carry an additional
stability guarantee, and it classifies code points
explicitly under both.</p>
<p>The original definitions closely match the
identifier syntax of C:</p>
<table style="border:1px solid
black;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em">
<colgroup><col style="width:0px"><col
style="width:0px"></colgroup><thead><tr
style="border-bottom:3px double black">
<th
style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px
solid black">
<div><strong>Properties</strong></div>
</th>
<th
style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px
solid black">
<div><strong>General Description of
Coverage</strong></div>
</th>
</tr>
</thead><tbody>
<tr style="border-bottom:1px solid black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start
characters are derived from the Unicode
General_Category of uppercase letters,
lowercase letters, titlecase letters,
modifier letters, other letters, letter
numbers, plus Other_ID_Start, minus
Pattern_Syntax and Pattern_White_Space
code points.</td>
</tr>
<tr style="border-bottom:1px solid black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">In
set notation:</td>
</tr>
<tr style="border-bottom:1px solid black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
</tr>
<tr style="border-bottom:1px solid black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue
characters include ID_Start characters,
plus characters having the Unicode
General_Category of nonspacing marks,
spacing combining marks, decimal number,
connector punctuation, plus
Other_ID_Continue, minus Pattern_Syntax
and Pattern_White_Space code points.</td>
</tr>
<tr style="border-bottom:1px solid black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">In
set notation:</td>
</tr>
<tr style="border-bottom:1px solid black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
</tr>
</tbody>
</table>
<p>The X versions of the properties begin from the
same definitions, but are guaranteed to remain stable in subsequent
versions of the Unicode Standard.</p>
<h1
id="gmail-m_4891618660013739441m_2729222737077895551gmail-issues"
style="line-height:1"><span
style="display:inline-block;min-width:35pt">4</span> Issues</h1>
<ul
style="list-style-type:none;padding-left:2em">
<li
style="margin-top:0.6em;margin-bottom:0.6em">XID_CONTINUE
does not include ZWJ, which some scripts
require</li>
<li
style="margin-top:0.6em;margin-bottom:0.6em">Does
not exclude homoglyph attacks</li>
<li
style="margin-top:0.6em;margin-bottom:0.6em">Does
not require the compiler to normalize
identifiers</li>
<li
style="margin-top:0.6em;margin-bottom:0.6em">Does
not allow emoji</li>
</ul>
<h1
id="gmail-m_4891618660013739441m_2729222737077895551gmail-history"
style="line-height:1"><span
style="display:inline-block;min-width:35pt">5</span> History</h1>
<p>Using an explicit list of Unicode characters
was considered a best practice for ISO
standardization in TR 10176:2003 Guidelines
for the preparation of programming language
standards.</p>
<p>National body comment CA 24 for C++11:</p>
<blockquote>
<p>A list of issues related to TR 10176:2003:</p>
<ul
style="list-style-type:none;padding-left:2em">
<li
style="margin-top:0.6em;margin-bottom:0.6em">“Combining
characters should not appear as the first
character of an identifier.” Reference:
ISO/IEC TR 10176:2003 (Annex A) This is
not reflected in FCD.</li>
<li
style="margin-top:0.6em;margin-bottom:0.6em">Restrictions
on the first character of an identifier
are not observed as recommended in TR
10176:2003. The inclusion of digits
(outside of those in the basic character
set) under identifier-nondigit is implied
by FCD.</li>
<li
style="margin-top:0.6em;margin-bottom:0.6em">It
is implied that only the “main listing”
from Annex A is included for C++. That is,
the list ends with the Special Characters
section. This is not made explicit in FCD.
Existing practice in C++03 as well as WG
14 (C, as of N1425) and WG 4 (COBOL, as of
N4315) is to include a list in a normative
Annex.</li>
<li
style="margin-top:0.6em;margin-bottom:0.6em">Specify
width sensitivity as implied by C++03: \uFF21 is
not the same as A. Case sensitivity is
already stated in [lex.name].</li>
</ul>
</blockquote>
<p>N3146 (2010-10-04) considered using UAX31,
but because of stability issues with identifiers
at the time, it came down on the side of
an explicit whitelist.</p>
<p>The Unicode standard has since made stability
guarantees about identifiers, and created the
XID_START and XID_CONTINUE properties to
alleviate the stability concerns that existed
in 2010.</p>
<h1
id="gmail-m_4891618660013739441m_2729222737077895551gmail-wording"
style="line-height:1"><span
style="display:inline-block;min-width:35pt">6</span> Wording</h1>
<p>Wording to follow based on SG16 and EWG
guidance. There is much prior art to follow
based on similar proposals and adoption in
Rust and Swift.</p>
<p>Explicit lists of universal character names and
code points for a particular
Unicode version are available from the published database
and could be included as an appendix.</p>
</div>
</div>
<br>
<fieldset></fieldset>
<pre>_______________________________________________
SG16 Unicode mailing list
<a href="mailto:Unicode@isocpp.open-std.org" rel="noreferrer" target="_blank" moz-do-not-send="true">Unicode@isocpp.open-std.org</a>
<a href="http://www.open-std.org/mailman/listinfo/unicode" rel="noreferrer" target="_blank" moz-do-not-send="true">http://www.open-std.org/mailman/listinfo/unicode</a>
</pre>
</blockquote>
<p><br>
</p>
<br>
</blockquote>
<p><br>
</p>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br>
</blockquote>
<p><br>
</p>
</body>
</html>