<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Also, please clarify the document
number. I suspect it should be D1949R0 (it looks like an extra
"1" may have snuck in there).</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Tom.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 11/2/19 3:05 PM, Tom Honermann
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:20215e93-de36-674c-fba0-d0643b4d79f3@honermann.net">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div class="moz-cite-prefix">Thanks, Steve. Could you please
attach this paper to the SG16 wiki at <a
class="moz-txt-link-freetext"
href="http://wiki.edg.com/bin/view/Wg21belfast/SG16"
moz-do-not-send="true">http://wiki.edg.com/bin/view/Wg21belfast/SG16</a>?<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Tom.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 11/2/19 9:44 AM, Steve Downey
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJEGDKruxz-Y1-ZAw_qv5QzS0nWSPyh82cK+VimSbOUF8Uw8+g@mail.gmail.com">
<meta http-equiv="content-type" content="text/html;
charset=UTF-8">
<div dir="ltr">
<h1 class="gmail-title"
style="line-height:1;text-align:center">C++ Identifier
Syntax using Unicode Standard Annex 31</h1>
<table
style="border:none;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em;float:right">
<tbody>
<tr>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Document
#:</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">D19149R0</td>
</tr>
<tr>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Date:</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">2019-11-02</td>
</tr>
<tr>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Project:</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Programming
Language C++<br>
SG16<br>
EWG<br>
CWG<br>
</td>
</tr>
<tr>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Reply-to:</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">Steve
Downey<br>
&lt;<a href="mailto:sdowney@gmail.com" class="email"
style="text-decoration-line:none;color:rgb(65,131,196)"
moz-do-not-send="true">sdowney@gmail.com</a>, <a
href="mailto:sdowney2@bloomberg.net" class="email"
style="text-decoration-line:none;color:rgb(65,131,196)"
moz-do-not-send="true">sdowney2@bloomberg.net</a>&gt;<br>
</td>
</tr>
</tbody>
</table>
<div
style="color:rgb(0,0,0);font-family:serif;font-size:medium;clear:both">
<h1 id="gmail-abstract" style="line-height:1"><span
class="gmail-header-section-number"
style="display:inline-block;min-width:35pt">1</span> Abstract</h1>
<p>In response to NL 029 : Disallow zero-width and control
characters</p>
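<p>Adopt Unicode Standard Annex 31 as part of C++23, such that:</p>
<ul style="list-style-type:none;padding-left:2em">
<li style="margin-top:0.6em;margin-bottom:0.6em">C++
identifiers match the pattern (XID_START + _ ) +
XID_CONTINUE*.</li>
<li style="margin-top:0.6em;margin-bottom:0.6em">Portable
source is required to be normalized as NFC.</li>
<li style="margin-top:0.6em;margin-bottom:0.6em">Using
unassigned code points is ill-formed.</li>
</ul>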
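<p>The three rules in the abstract can be sketched as a small
checker. This is an illustration only, not proposed wording. It is
written in Python because Python's <code>str.isidentifier()</code>
is itself defined in terms of XID_Start, XID_Continue and the
underscore, which makes it a convenient stand-in. The function name
<code>follows_proposed_rules</code> is hypothetical, and the third
rule is applied here only to the identifier, not to the rest of the
source text.</p>

```python
import unicodedata


def follows_proposed_rules(name):
    """Check one identifier against the three rules listed in the abstract."""
    # Rule 1: (XID_Start + _) followed by XID_Continue*.
    # Assumption: str.isidentifier() is used as a stand-in, since Python's
    # lexical rule is also built on XID_Start, XID_Continue and '_'.
    if not name.isidentifier():
        return False
    # Rule 2: the spelling must already be in Normalization Form C.
    if unicodedata.normalize("NFC", name) != name:
        return False
    # Rule 3: no unassigned code points (General_Category Cn).
    return all(unicodedata.category(ch) != "Cn" for ch in name)
```

<p>Under these assumptions, <code>"caf\u00e9"</code> (precomposed)
is accepted, while <code>"cafe\u0301"</code> (decomposed) is
rejected because it is not in NFC, and <code>"a\u200db"</code> is
rejected because ZWJ is not in XID_Continue.</p>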
<h1 id="gmail-poll-before-discussion" style="line-height:1"><span
class="gmail-header-section-number"
style="display:inline-block;min-width:35pt">2</span> Poll
before discussion</h1>
<p>The current state, which allows control characters, ZWJ, and
unassigned code points in C++ identifiers, is not a defect,
is working as designed, and does not need to be
addressed.</p>
<h1
id="gmail-addressing-identifiers-in-a-more-principled-ways"
style="line-height:1"><span
class="gmail-header-section-number"
style="display:inline-block;min-width:35pt">3</span> Addressing
identifiers in a more principled way</h1>
<p><a href="https://unicode.org/reports/tr31/"
style="text-decoration-line:none;color:rgb(65,131,196)"
moz-do-not-send="true">UNICODE IDENTIFIER AND PATTERN
SYNTAX</a> (UAX31) is an attempt to provide a normative way of
specifying definitions of general-purpose identifiers for
use in programming languages. It has evolved significantly
over the years, particularly since C++11 was specified.
The characters allowed in identifiers, and the patterns,
were not stable at the time of C++11, which is the last
time identifiers were addressed in the standard. In
addition, at that time, ISO was promulgating advice
suggesting a list of code points as the recommended method
for ISO standards to specify identifiers.</p>
<p>Today the definitions in UAX31 can be used to provide
stable definitions for programming language identifiers,
with guarantees that an identifier will not be invalidated
by later standards.</p>
<p>Originally, UAX31 relied on the derived character
properties ID_START and ID_CONTINUE. However, those
properties were derived from fundamental properties that
could change over time. The Unicode database now provides
XID_START and XID_CONTINUE, based on the same
characteristics but with an additional stability
guarantee, and explicitly classifies characters under
both.</p>
<p>The original definitions closely match the identifier
syntax of C:</p>
<table style="border:1px solid
black;border-collapse:collapse;margin-left:auto;margin-right:auto;margin-top:0.8em">
<colgroup><col style="width:0px"><col style="width:0px"></colgroup><thead><tr
class="gmail-header" style="border-bottom:3px double
black">
<th
style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px
solid black">
<div><strong>Properties</strong></div>
</th>
<th
style="padding-left:1em;padding-right:1em;vertical-align:top;border-bottom:1px
solid black">
<div><strong>General Description of Coverage</strong></div>
</th>
</tr>
</thead><tbody>
<tr class="gmail-odd" style="border-bottom:1px solid
black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Start
characters are derived from the Unicode
General_Category of uppercase letters, lowercase
letters, titlecase letters, modifier letters, other
letters, letter numbers, plus Other_ID_Start, minus
Pattern_Syntax and Pattern_White_Space code points.</td>
</tr>
<tr class="even" style="border-bottom:1px solid black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">In
set notation:</td>
</tr>
<tr class="gmail-odd" style="border-bottom:1px solid
black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
</tr>
<tr class="even" style="border-bottom:1px solid black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">ID_Continue
characters include ID_Start characters, plus
characters having the Unicode General_Category of
nonspacing marks, spacing combining marks, decimal
number, connector punctuation, plus
Other_ID_Continue , minus Pattern_Syntax and
Pattern_White_Space code points.</td>
</tr>
<tr class="gmail-odd" style="border-bottom:1px solid
black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">In
set notation:</td>
</tr>
<tr class="even" style="border-bottom:1px solid black">
<td
style="padding-left:1em;padding-right:1em;vertical-align:top"><br>
</td>
<td
style="padding-left:1em;padding-right:1em;vertical-align:top">[\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
</tr>
</tbody>
</table>
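<p>The General_Category part of the two definitions in the table
can be approximated in a few lines. This is a rough sketch in
Python, not the normative derivation: the helper names
<code>approx_id_start</code> and <code>approx_id_continue</code> are
hypothetical, and the Other_ID_Start, Other_ID_Continue,
Pattern_Syntax and Pattern_White_Space adjustments are left out
because the standard <code>unicodedata</code> module does not expose
those properties.</p>

```python
import unicodedata

# General_Category values named in the ID_Start description:
# uppercase, lowercase, titlecase, modifier and other letters, letter numbers.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Additional categories named in the ID_Continue description:
# nonspacing marks, spacing combining marks, decimal numbers,
# connector punctuation.
CONTINUE_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch):
    """Approximate ID_Start using General_Category only (assumption)."""
    return unicodedata.category(ch) in START_CATEGORIES


def approx_id_continue(ch):
    """Approximate ID_Continue: ID_Start plus the extra categories."""
    if approx_id_start(ch):
        return True
    return unicodedata.category(ch) in CONTINUE_EXTRA_CATEGORIES
```

<p>With this approximation, the underscore (connector punctuation)
and ASCII digits are continue-only characters, which is why the
identifier pattern has to add the underscore to the start set
explicitly.</p>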
<p>The X versions of the properties, XID_Start and
XID_Continue, start from the same definitions but are
guaranteed to be stable in subsequent versions of the
Unicode Standard.</p>
<h1 id="gmail-issues" style="line-height:1"><span
class="gmail-header-section-number"
style="display:inline-block;min-width:35pt">4</span> Issues</h1>
<ul style="list-style-type:none;padding-left:2em">
<li style="margin-top:0.6em;margin-bottom:0.6em">XID_Continue
does not include ZWJ, which some scripts require</li>
<li style="margin-top:0.6em;margin-bottom:0.6em">Does not
exclude homoglyph attacks</li>
<li style="margin-top:0.6em;margin-bottom:0.6em">Does not
require the compiler to normalize identifiers</li>
<li style="margin-top:0.6em;margin-bottom:0.6em">Does not
allow emoji</li>
</ul>
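<p>The issues listed above can be observed with an XID-based
identifier check. The sketch below again assumes Python's
<code>str.isidentifier()</code> as a stand-in for the proposed
pattern; the function name <code>describe_issues</code> and the
sample strings are hypothetical and chosen only for illustration.</p>

```python
import unicodedata


def describe_issues():
    """Return one boolean observation per issue in the list above."""
    latin_a = "a"                # U+0061 LATIN SMALL LETTER A
    cyrillic_a = "\u0430"        # U+0430 CYRILLIC SMALL LETTER A
    precomposed = "caf\u00e9"    # e with acute as a single code point
    decomposed = "cafe\u0301"    # e followed by U+0301 COMBINING ACUTE ACCENT

    return {
        # ZWJ (U+200D) is not in XID_Continue, so this spelling is rejected.
        "zwj_accepted": "a\u200db".isidentifier(),
        # Two different code points that can look alike are both accepted.
        "homoglyph_accepted": (
            cyrillic_a.isidentifier() and cyrillic_a != latin_a
        ),
        # Both spellings are accepted and compare unequal unless something
        # normalizes them first.
        "unnormalized_pair_distinct": (
            precomposed.isidentifier()
            and decomposed.isidentifier()
            and precomposed != decomposed
            and unicodedata.normalize("NFC", decomposed) == precomposed
        ),
        # U+1F600 GRINNING FACE is a symbol, not XID_Start.
        "emoji_accepted": "\U0001F600".isidentifier(),
    }
```

<p>Under these assumptions the ZWJ and emoji entries come out
false, while the homoglyph and normalization entries come out true,
matching the four bullets.</p>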
<h1 id="gmail-history" style="line-height:1"><span
class="gmail-header-section-number"
style="display:inline-block;min-width:35pt">5</span> History</h1>
<p>Using an explicit list of Unicode characters was
considered a best practice for ISO standardization in TR
10176:2003 Guidelines for the preparation of programming
language standards.</p>
<p>National body comment CA 24 for C++11:</p>
<blockquote>
<p>A list of issues related TR 10176:2003:</p>
<ul style="list-style-type:none;padding-left:2em">
<li style="margin-top:0.6em;margin-bottom:0.6em">“Combining
characters should not appear as the first character of
an identifier.” Reference: ISO/IEC TR 10176:2003
(Annex A) This is not reflected in FCD.</li>
<li style="margin-top:0.6em;margin-bottom:0.6em">Restrictions
on the first character of an identifier are not
observed as recommended in TR 10176:2003. The
inclusion of digits (outside of those in the basic
character set) under identifer-nondigit is implied by
FCD.</li>
<li style="margin-top:0.6em;margin-bottom:0.6em">It is
implied that only the “main listing” from Annex A is
included for C++. That is, the list ends with the
Special Characters section. This is not made explicit
in FCD. Existing practice in C++03 as well as WG 14
(C, as of N1425) and WG 4 (COBOL, as of N4315) is to
include a list in a normative Annex.</li>
<li style="margin-top:0.6em;margin-bottom:0.6em">Specify
width sensitivity as implied by C++03: \uFF21 is not the same
as A. Case sensitivity is already stated in [lex.name].</li>
</ul>
</blockquote>
<p>N3146 (2010-10-04) considered using UAX31, but at the
time there were stability issues with identifiers, so it
came down on the side of explicit whitelisting.</p>
<p>The Unicode standard has since made stability guarantees
about identifiers, and created the XID_START and
XID_CONTINUE properties to alleviate the stability
concerns that existed in 2010.</p>
<h1 id="gmail-wording" style="line-height:1"><span
class="gmail-header-section-number"
style="display:inline-block;min-width:35pt">6</span> Wording</h1>
<p>Wording to follow based on SG16 and EWG guidance. There
is much prior art to follow based on similar proposals and
adoption in Rust and Swift.</p>
<p>Explicit universal character names and code points are
available for a particular version of the Unicode Standard
from the published database, and could be included as an
appendix.</p>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
SG16 Unicode mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Unicode@isocpp.open-std.org" moz-do-not-send="true">Unicode@isocpp.open-std.org</a>
<a class="moz-txt-link-freetext" href="http://www.open-std.org/mailman/listinfo/unicode" moz-do-not-send="true">http://www.open-std.org/mailman/listinfo/unicode</a>
</pre>
</blockquote>
</blockquote>
<p><br>
</p>
</body>
</html>