[SG16-Unicode] In response to NL029

Tom Honermann tom at honermann.net
Sat Nov 2 20:05:42 CET 2019


Thanks, Steve.  Could you please attach this paper to the SG16 wiki at 
http://wiki.edg.com/bin/view/Wg21belfast/SG16?

Tom.

On 11/2/19 9:44 AM, Steve Downey wrote:
>
>
>   C++ Identifier Syntax using Unicode Standard Annex 31
>
> Document #: 	D19149R0
> Date: 	2019-11-02
> Project: 	Programming Language C++
> SG16
> EWG
> CWG
> Reply-to: 	Steve Downey
> <sdowney at gmail.com, sdowney2 at bloomberg.net>
>
>
>   1 Abstract
>
> In response to NL 029: Disallow zero-width and control characters
>
> Adopt Unicode Annex 31 as part of C++ 23:
>
>   * That C++ identifiers match the pattern
>     (XID_START + _ ) + XID_CONTINUE*.
>   * That portable source is required to be normalized as NFC.
>   * That using unassigned code points be ill-formed.
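
The proposed pattern can be sketched in a few lines. This is an illustration only, not proposed wording: it uses Python because its standard library exposes the relevant Unicode data, and it assumes that CPython's `str.isidentifier()` checks XID_Start (plus `_`) followed by XID_Continue, as its language reference describes. The helper name `is_uax31_identifier` is hypothetical.

```python
import unicodedata


def is_uax31_identifier(name: str) -> bool:
    """Hypothetical check: (XID_Start | '_') XID_Continue*, already in NFC."""
    # Assumption: str.isidentifier() tests XID_Start/XID_Continue (plus '_'),
    # which also rejects unassigned code points, since they carry neither
    # property.
    if not name.isidentifier():
        return False
    # The paper asks for NFC-normalized source; here that is approximated by
    # requiring that NFC normalization leaves the identifier unchanged.
    return unicodedata.normalize("NFC", name) == name


if __name__ == "__main__":
    for candidate in ["_x1", "caf\u00e9", "cafe\u0301", "1abc", "a\u200db"]:
        print(repr(candidate), is_uax31_identifier(candidate))
```

Under these assumptions `caf\u00e9` (precomposed) is accepted, while `cafe\u0301` (decomposed) matches the pattern but is rejected for not being in NFC.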
>
>
>   2 Poll before discussion
>
> The current state, allowing control characters, ZWJ, and unassigned
> code points in C++ identifiers, is not a defect, is working as
> designed, and does not need to be addressed.
>
>
>   3 Addressing identifiers in a more principled way
>
> UNICODE IDENTIFIER AND PATTERN SYNTAX (UAX31)
> <https://unicode.org/reports/tr31/> is an attempt to provide a
> normative way of specifying definitions of general-purpose identifiers
> for use in programming languages. It has evolved significantly over the
> years, particularly since C++11 was specified. The characters allowed
> in identifiers, and the patterns, were not stable at the time of C++11,
> which is the last time identifiers were addressed in the standard. In
> addition, at that time ISO was promulgating advice that recommended a
> list of code points as the method for ISO standards to specify
> identifiers.
>
> Today the definitions in UAX31 can be used to provide stable 
> definitions for programming language identifiers, with guarantees that 
> an identifier will not be invalidated by later standards.
>
> Originally, UAX31 relied on the derived character properties ID_START
> and ID_CONTINUE. However, those properties depended on fundamental
> properties that could change over time. The Unicode database now
> provides XID_START and XID_CONTINUE, based on the same
> characteristics but with an additional stability guarantee, and
> lists the members of both explicitly.
>
> The original definitions closely match the identifier syntax of C:
>
> Property      General Description of Coverage
> ------------  ------------------------------------------------------
> ID_Start      ID_Start characters are derived from the Unicode
>               General_Category of uppercase letters, lowercase
>               letters, titlecase letters, modifier letters, other
>               letters, letter numbers, plus Other_ID_Start, minus
>               Pattern_Syntax and Pattern_White_Space code points.
>
>               In set notation:
>
>               [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>
> ID_Continue   ID_Continue characters include ID_Start characters,
>               plus characters having the Unicode General_Category of
>               nonspacing marks, spacing combining marks, decimal
>               number, connector punctuation, plus Other_ID_Continue,
>               minus Pattern_Syntax and Pattern_White_Space code
>               points.
>
>               In set notation:
>
>               [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>
> The X versions of the properties (XID_Start, XID_Continue) start from
> the same definitions, but are guaranteed to be stable in subsequent
> versions of the Unicode Standard.
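
The General_Category part of those definitions can be approximated with Python's `unicodedata` module. This is a rough sketch only: it deliberately ignores Other_ID_Start, Other_ID_Continue, and the subtraction of Pattern_Syntax and Pattern_White_Space, none of which the standard library exposes, so it is not the real derived property. The names `approx_id_start` and `approx_id_continue` are hypothetical.

```python
import unicodedata

# General_Category values named in the ID_Start description:
# uppercase, lowercase, titlecase, modifier, other letters, letter numbers.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Additional categories named for ID_Continue: nonspacing marks, spacing
# combining marks, decimal numbers, connector punctuation.
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    """Approximate ID_Start from General_Category alone (assumption)."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    """Approximate ID_Continue from General_Category alone (assumption)."""
    return approx_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


if __name__ == "__main__":
    for ch in ["A", "1", "_", "\u0301", "\u200d"]:
        print(f"U+{ord(ch):04X}", approx_id_start(ch), approx_id_continue(ch))
```

Note that `_` (U+005F) is connector punctuation, so it falls under ID_Continue but not ID_Start, which is why the proposed pattern adds it to the start set explicitly.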
>
>
>   4 Issues
>
>   * XID_Continue does not include ZWJ, which some scripts require
>   * Does not exclude homoglyph attacks
>   * Does not require the compiler to normalize identifiers
>   * Does not allow emoji
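
Each of the four issues can be observed with a small experiment. As before, this is a sketch that assumes Python's `str.isidentifier()` reflects XID_Start/XID_Continue; the variable names are illustrative only.

```python
import unicodedata

# 1. ZWJ (U+200D) is not in XID_Continue, so it cannot appear in an identifier.
zwj_rejected = not "a\u200db".isidentifier()

# 2. Homoglyphs: Latin 'a' (U+0061) and Cyrillic 'а' (U+0430) are both valid
#    identifiers and look alike, yet they are distinct identifiers.
latin, cyrillic = "a", "\u0430"
homoglyphs_distinct = (
    latin.isidentifier() and cyrillic.isidentifier() and latin != cyrillic
)

# 3. Normalization: precomposed and decomposed forms both match the pattern
#    but compare unequal until one of them is normalized to NFC.
composed, decomposed = "\u00e9", "e\u0301"
differ_without_nfc = (
    composed.isidentifier()
    and decomposed.isidentifier()
    and composed != decomposed
)
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# 4. Emoji such as U+1F600 have General_Category So and are not XID_Start.
emoji_rejected = not "\U0001F600".isidentifier()

if __name__ == "__main__":
    print(zwj_rejected, homoglyphs_distinct, differ_without_nfc,
          equal_after_nfc, emoji_rejected)
```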
>
>
>   5 History
>
> Using an explicit list of Unicode characters was considered a best 
> practice for ISO standardization in TR 10176:2003 Guidelines for the 
> preparation of programming language standards.
>
> National body comment CA 24 for C++11:
>
>     A list of issues related to TR 10176:2003:
>
>       * “Combining characters should not appear as the first character
>         of an identifier.” Reference: ISO/IEC TR 10176:2003 (Annex A)
>         This is not reflected in FCD.
>       * Restrictions on the first character of an identifier are not
>         observed as recommended in TR 10176:2003. The inclusion of
>         digits (outside of those in the basic character set) under
>         identifier-nondigit is implied by FCD.
>       * It is implied that only the “main listing” from Annex A is
>         included for C++. That is, the list ends with the Special
>         Characters section. This is not made explicit in FCD. Existing
>         practice in C++03 as well as WG 14 (C, as of N1425) and WG 4
>         (COBOL, as of N4315) is to include a list in a normative Annex.
>       * Specify width sensitivity as implied by C++03: is not the same
>         as A. Case sensitivity is already stated in [lex.name].
>
> N3146 (2010-10-04) considered using UAX31, but at the time there were
> stability issues with identifiers, and the paper came down on the side
> of explicit whitelisting.
>
> The Unicode standard has since made stability guarantees about 
> identifiers, and created the XID_START and XID_CONTINUE properties to 
> alleviate the stability concerns that existed in 2010.
>
>
>   6 Wording
>
> Wording to follow based on SG16 and EWG guidance. There is much prior 
> art to follow based on similar proposals and adoption in Rust and Swift.
>
> Explicit universal character names and code points are available for
> particular versions of the Unicode Standard from the published
> database, and could be included as an annex.
>
>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode at isocpp.open-std.org
> http://www.open-std.org/mailman/listinfo/unicode

