[SG16-Unicode] In response to NL029

Tom Honermann tom at honermann.net
Sun Nov 3 09:10:54 CET 2019


On 11/3/19 2:39 AM, Yehezkel Bernat wrote:
> I'm sorry if this isn't the right place/thread to ask it:
This is a fine place to ask.
> Why do we allow non-ASCII characters in identifiers at all? Wouldn't 
> life be simpler if identifiers must include only ASCII alphanumeric 
> characters?
> I know I assumed it to be the case until lately (when I started 
> reading the relevant papers here.)

This feature was added in C++11 when support for 
universal-character-name escapes was added.  I wasn't involved in the 
committee at the time, so I don't really know the history. The relevant 
paper is N3146 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm).

>
> Or maybe Unicode was allowed in the past and now it's too late to 
> change it?

Tom.
>
> On Sun, Nov 3, 2019 at 1:22 AM Steve Downey <sdowney at gmail.com> wrote:
>
>     Will do.
>
>     On Sat, Nov 2, 2019, 15:07 Tom Honermann <tom at honermann.net> wrote:
>
>         Also, please clarify the document number.  I suspect it should
>         be D1949R0 (it looks like an extra "1" may have snuck in there).
>
>         Tom.
>
>         On 11/2/19 3:05 PM, Tom Honermann wrote:
>>         Thanks, Steve.  Could you please attach this paper to the
>>         SG16 wiki at http://wiki.edg.com/bin/view/Wg21belfast/SG16?
>>
>>         Tom.
>>
>>         On 11/2/19 9:44 AM, Steve Downey wrote:
>>>
>>>
>>>           C++ Identifier Syntax using Unicode Standard Annex 31
>>>
>>>         Document #: 	D19149R0
>>>         Date: 	2019-11-02
>>>         Project: 	Programming Language C++
>>>         SG16
>>>         EWG
>>>         CWG
>>>         Reply-to: 	Steve Downey
>>>         <sdowney at gmail.com, sdowney2 at bloomberg.net>
>>>
>>>
>>>           1 Abstract
>>>
>>>         In response to NL 029 : Disallow zero-width and control
>>>         characters
>>>
>>>         Adopt Unicode Annex 31 as part of C++ 23:
>>>
>>>           * That C++ identifiers match the pattern (XID_START + _ ) +
>>>             XID_CONTINUE*.
>>>           * That portable source is required to be normalized as NFC.
>>>           * That using unassigned code points be ill-formed.
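
The three rules in the abstract can be sketched in a few lines of Python. This is only an illustration, not proposed wording: it assumes that Python's own identifier rule (str.isidentifier, which the Python documentation describes in terms of XID_Start, "_" and XID_Continue) is a close enough stand-in for the proposed pattern, and the function names are hypothetical. Results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def matches_proposed_pattern(name: str) -> bool:
    # Assumption: Python's identifier rule (XID_Start or "_", followed by
    # XID_Continue*) approximates the pattern in the abstract.
    return name.isidentifier()


def is_nfc(name: str) -> bool:
    # A string is in NFC if normalizing it to NFC leaves it unchanged.
    return unicodedata.normalize("NFC", name) == name


def has_unassigned(name: str) -> bool:
    # General category Cn covers unassigned code points (and noncharacters)
    # in the Unicode version this interpreter ships with.
    return any(unicodedata.category(ch) == "Cn" for ch in name)


def acceptable_identifier(name: str) -> bool:
    # Combine the three checks from the abstract.
    return (
        matches_proposed_pattern(name)
        and is_nfc(name)
        and not has_unassigned(name)
    )


if __name__ == "__main__":
    for candidate in ["_x", "caf\u00e9", "cafe\u0301", "1abc", "a-b"]:
        print(repr(candidate), acceptable_identifier(candidate))
```

Here "caf\u00e9" (precomposed) passes, while "cafe\u0301" (e followed by a combining acute accent) matches the pattern but fails the NFC check.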
>>>
>>>
>>>           2 Poll before discussion
>>>
>>>         The current state, allowing control characters, ZWJ, and
>>>         unassigned code points in C++ identifiers, is not a defect,
>>>         is working as designed, and does not need to be addressed.
>>>
>>>
>>>           3 Addressing identifiers in a more principled way
>>>
>>>         UNICODE IDENTIFIER AND PATTERN SYNTAX
>>>         <https://unicode.org/reports/tr31/> is an attempt to provide
>>>         a normative way of specifying definitions of general-purpose
>>>         identifiers for use in programming languages. It has evolved
>>>         significantly over the years, especially since C++11 was
>>>         specified. In particular, the characters that were allowed
>>>         in identifiers, and the patterns, were not stable at the
>>>         time of C++11, which is the last time identifiers were
>>>         addressed in the standard. In addition, at
>>>         that time, ISO was promulgating advice suggesting a list of
>>>         code points as the recommended method for ISO standards to
>>>         specify identifiers.
>>>
>>>         Today the definitions in UAX31 can be used to provide stable
>>>         definitions for programming language identifiers, with
>>>         guarantees that an identifier will not be invalidated by
>>>         later standards.
>>>
>>>         Originally, UAX31 relied on derived properties of
>>>         characters, ID_START and ID_CONTINUE. However, those
>>>         properties relied on fundamental properties that could
>>>         change over time. The Unicode database now provides
>>>         XID_START and XID_CONTINUE, based on the same
>>>         characteristics, but with an additional stability guarantee,
>>>         and gives an explicit classification of characters under both.
>>>
>>>         The original definitions closely match the identifier syntax
>>>         of C:
>>>
>>>         *Properties* and *General Description of Coverage*
>>>
>>>         ID_Start
>>>             ID_Start characters are derived from the Unicode
>>>             General_Category of uppercase letters, lowercase letters,
>>>             titlecase letters, modifier letters, other letters, letter
>>>             numbers, plus Other_ID_Start, minus Pattern_Syntax and
>>>             Pattern_White_Space code points.
>>>
>>>             In set notation:
>>>
>>>             [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>>>
>>>         ID_Continue
>>>             ID_Continue characters include ID_Start characters, plus
>>>             characters having the Unicode General_Category of
>>>             nonspacing marks, spacing combining marks, decimal number,
>>>             connector punctuation, plus Other_ID_Continue, minus
>>>             Pattern_Syntax and Pattern_White_Space code points.
>>>
>>>             In set notation:
>>>
>>>             [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>>>
>>>         The X versions of the properties start from the same
>>>         definitions, but are guaranteed stable in subsequent
>>>         versions of the Unicode standard.
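
To make the category-based derivation concrete, here is a rough Python sketch that classifies a character by its General_Category alone. It is an approximation under stated assumptions: it omits the Other_ID_Start and Other_ID_Continue additions and the Pattern_Syntax and Pattern_White_Space removals, so it does not reproduce the real ID_Start/ID_Continue or XID_* sets exactly. The function names are hypothetical.

```python
import unicodedata

# General categories behind \p{L} and \p{Nl}.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Categories that ID_Continue adds on top of ID_Start:
# nonspacing marks, spacing combining marks, decimal numbers,
# connector punctuation.
CONTINUE_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    # Approximation: ignores Other_ID_Start and the Pattern_* removals.
    return unicodedata.category(ch) in START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    # Approximation: ignores Other_ID_Continue and the Pattern_* removals.
    return (
        approx_id_start(ch)
        or unicodedata.category(ch) in CONTINUE_EXTRA_CATEGORIES
    )


if __name__ == "__main__":
    for ch in ["A", "\u2160", "0", "_", "\u0301", "\u200d", "+"]:
        print(
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            approx_id_start(ch),
            approx_id_continue(ch),
        )
```

Under this approximation "_" (category Pc) can continue an identifier but cannot start one, which is why the proposed pattern adds it to the start set explicitly.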
>>>
>>>
>>>           4 Issues
>>>
>>>           * XID_CONTINUE does not include ZWJ, which some scripts require
>>>           * Does not exclude homoglyph attacks
>>>           * Does not require the compiler to normalize identifiers
>>>           * Does not allow emoji
>>>
>>>
>>>           5 History
>>>
>>>         Using an explicit list of Unicode characters was considered
>>>         a best practice for ISO standardization in TR 10176:2003
>>>         Guidelines for the preparation of programming language
>>>         standards.
>>>
>>>         National body comment CA 24 for C++11:
>>>
>>>             A list of issues related to TR 10176:2003:
>>>
>>>               * “Combining characters should not appear as the first
>>>                 character of an identifier.” Reference: ISO/IEC TR
>>>                 10176:2003 (Annex A) This is not reflected in FCD.
>>>               * Restrictions on the first character of an identifier
>>>                 are not observed as recommended in TR 10176:2003.
>>>                 The inclusion of digits (outside of those in the
>>>                 basic character set) under identifier-nondigit is
>>>                 implied by FCD.
>>>               * It is implied that only the “main listing” from
>>>                 Annex A is included for C++. That is, the list ends
>>>                 with the Special Characters section. This is not
>>>                 made explicit in FCD. Existing practice in C++03 as
>>>                 well as WG 14 (C, as of N1425) and WG 4 (COBOL, as
>>>                 of N4315) is to include a list in a normative Annex.
>>>               * Specify width sensitivity as implied by C++03: is
>>>                 not the same as A. Case sensitivity is already
>>>                 stated in [lex.name].
>>>
>>>         N3146 (2010-10-04) considered using UAX31, but at the time
>>>         there were stability issues with identifiers, and the paper
>>>         came down on the side of explicit whitelisting.
>>>
>>>         The Unicode standard has since made stability guarantees
>>>         about identifiers, and created the XID_START and
>>>         XID_CONTINUE properties to alleviate the stability concerns
>>>         that existed in 2010.
>>>
>>>
>>>           6 Wording
>>>
>>>         Wording to follow based on SG16 and EWG guidance. There is
>>>         much prior art to follow based on similar proposals and
>>>         adoption in Rust and Swift.
>>>
>>>         Explicit universal character names and code points are
>>>         available for particular versions of the Unicode standard
>>>         from the published database, and could be included as an
>>>         appendix.
>>>
>>>
>>>         _______________________________________________
>>>         SG16 Unicode mailing list
>>>         Unicode at isocpp.open-std.org
>>>         http://www.open-std.org/mailman/listinfo/unicode

