[SG16-Unicode] In response to NL029

Corentin corentin.jabot at gmail.com
Sun Nov 3 09:32:21 CET 2019


On Sun, Nov 3, 2019, 08:39 Yehezkel Bernat <yehezkelshb at gmail.com> wrote:

> I'm sorry if this isn't the right place/thread to ask it:
> Why do we allow non-ASCII characters in identifiers at all? Wouldn't life
> be simpler if identifiers must include only ASCII alphanumeric characters?
> I know I assumed it to be the case until lately (when I started reading
> the relevant papers here.)
>
> Or maybe Unicode was allowed in the past and now it's too late to change
> it?
>

I think implementers do support it or want to support it.
But they don't necessarily do it right, and definitely not consistently, so I
personally think it's better to specify how to do it to ensure portability
and interoperability with other features such as reflection.

I do think using that feature needs to be done carefully but there are
certainly use cases for it.
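
To make the rule in the quoted paper below concrete (identifiers matching
(XID_START + _ ) + XID_CONTINUE*, with source normalized as NFC), here is a
minimal sketch in Python. It is not how any C++ compiler implements this. It
assumes that Python's str.isidentifier(), whose grammar is built on
XID_Start/XID_Continue, is a close enough stand-in for the pattern, and the
helper name is_proposed_identifier is made up for illustration.

```python
import unicodedata


def is_proposed_identifier(name: str) -> bool:
    """Hypothetical check: (XID_Start | '_') XID_Continue*, already in NFC."""
    # Assumption: str.isidentifier() stands in for the XID_Start/XID_Continue
    # pattern; CPython's identifier grammar is built on those properties.
    if not name.isidentifier():
        return False
    # Reject spellings that NFC normalization would change.
    return unicodedata.normalize("NFC", name) == name


if __name__ == "__main__":
    samples = ["caf\u00e9", "cafe\u0301", "a\u200db", "_x1", "1x"]
    for s in samples:
        print(ascii(s), is_proposed_identifier(s))
```

Under these assumptions the precomposed spelling is accepted, while the
decomposed spelling (not NFC) and the name containing ZWJ (U+200D, not
XID_Continue) are rejected. A real implementation would pin a specific
Unicode version.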

>
> On Sun, Nov 3, 2019 at 1:22 AM Steve Downey <sdowney at gmail.com> wrote:
>
>> Will do.
>>
>> On Sat, Nov 2, 2019, 15:07 Tom Honermann <tom at honermann.net> wrote:
>>
>>> Also, please clarify the document number.  I suspect it should be
>>> D1949R0 (it looks like an extra "1" may have snuck in there).
>>>
>>> Tom.
>>>
>>> On 11/2/19 3:05 PM, Tom Honermann wrote:
>>>
>>> Thanks, Steve.  Could you please attach this paper to the SG16 wiki at
>>> http://wiki.edg.com/bin/view/Wg21belfast/SG16?
>>>
>>> Tom.
>>>
>>> On 11/2/19 9:44 AM, Steve Downey wrote:
>>>
>>> C++ Identifier Syntax using Unicode Standard Annex 31
>>> Document #: D19149R0
>>> Date: 2019-11-02
>>> Project: Programming Language C++
>>> SG16
>>> EWG
>>> CWG
>>> Reply-to: Steve Downey
>>> <sdowney at gmail.com, sdowney2 at bloomberg.net>
>>> 1 Abstract
>>>
>>> In response to NL 029 : Disallow zero-width and control characters
>>>
>>> Adopt Unicode Annex 31 as part of C++ 23:
>>>
>>>    - That C++ identifiers match the pattern (XID_START + _ ) +
>>>    XID_CONTINUE*.
>>>    - That portable source is required to be normalized as NFC.
>>>    - That using unassigned code points is ill-formed.
>>> 2 Poll before discussion
>>>
>>> The current state, allowing control characters, ZWJ, and unassigned
>>> codepoints in C++ identifiers is not a defect, and is working as designed,
>>> and does not need to be addressed
>>> 3 Addressing identifiers in a more principled way
>>>
>>> UNICODE IDENTIFIER AND PATTERN SYNTAX
>>> <https://unicode.org/reports/tr31/> is an attempt to provide a
>>> normative way of specifying definitions of general-purpose identifiers for
>>> use in programming languages. It has evolved significantly over the years,
>>> especially since the time that C++11 was specified. In particular, the
>>> characters that were allowed in identifiers, and the patterns, were not
>>> stable at the time of C++11, which is the last time identifiers were
>>> addressed in the standard. In addition, at that time, ISO was promulgating
>>> advice suggesting a list of code points as the recommended method for ISO
>>> standards to specify identifiers.
>>>
>>> Today the definitions in UAX31 can be used to provide stable definitions
>>> for programming language identifiers, with guarantees that an identifier
>>> will not be invalidated by later standards.
>>>
>>> Originally, UAX31 relied on derived properties of characters, ID_START
>>> and ID_CONTINUE; however, those properties relied on fundamental properties
>>> that could change over time. The Unicode database now provides XID_START
>>> and XID_CONTINUE, based on the same characteristics but with an additional
>>> stability guarantee, and explicitly classifies code points under both.
>>>
>>> The original definitions closely match the identifier syntax of C:
>>>
>>> *Properties*: *General Description of Coverage*
>>>
>>> ID_Start: ID_Start characters are derived from the Unicode
>>> General_Category of uppercase letters, lowercase letters, titlecase
>>> letters, modifier letters, other letters, letter numbers, plus
>>> Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.
>>>
>>> In set notation:
>>>
>>> [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>>>
>>> ID_Continue: ID_Continue characters include ID_Start characters, plus
>>> characters having the Unicode General_Category of nonspacing marks, spacing
>>> combining marks, decimal number, connector punctuation, plus
>>> Other_ID_Continue, minus Pattern_Syntax and Pattern_White_Space code
>>> points.
>>>
>>> In set notation:
>>>
>>> [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]
>>>
>>>
>>> The X versions of the properties start from the same definitions, but
>>> are guaranteed to remain stable in subsequent Unicode standards.
>>> 4 Issues
>>>
>>>    - Continue does not include ZWJ, which some scripts require
>>>    - Does not exclude homoglyph attacks
>>>    - Does not require the compiler to normalize identifiers
>>>    - Does not allow emoji
>>>
>>> 5 History
>>>
>>> Using an explicit list of Unicode characters was considered a best
>>> practice for ISO standardization in TR 10176:2003 Guidelines for the
>>> preparation of programming language standards.
>>>
>>> National body comment CA 24 for C++11:
>>>
>>> A list of issues related to TR 10176:2003:
>>>
>>>    - “Combining characters should not appear as the first character of
>>>    an identifier.” Reference: ISO/IEC TR 10176:2003 (Annex A) This is not
>>>    reflected in FCD.
>>>    - Restrictions on the first character of an identifier are not
>>>    observed as recommended in TR 10176:2003. The inclusion of digits (outside
>>>    of those in the basic character set) under identifier-nondigit is implied by
>>>    FCD.
>>>    - It is implied that only the “main listing” from Annex A is
>>>    included for C++. That is, the list ends with the Special Characters
>>>    section. This is not made explicit in FCD. Existing practice in C++03 as
>>>    well as WG 14 (C, as of N1425) and WG 4 (COBOL, as of N4315) is to include
>>>    a list in a normative Annex.
>>>    - Specify width sensitivity as implied by C++03: \uFF21 is not the
>>>    same as A. Case sensitivity is already stated in [lex.name].
>>>
>>> N3146 (2010-10-04) considered using UAX31, but at the time there were
>>> stability issues with identifiers, and it came down on the side of an
>>> explicit whitelist.
>>>
>>> The Unicode standard has since made stability guarantees about
>>> identifiers, and created the XID_START and XID_CONTINUE properties to
>>> alleviate the stability concerns that existed in 2010.
>>> 6 Wording
>>>
>>> Wording to follow based on SG16 and EWG guidance. There is much prior
>>> art to follow based on similar proposals and adoption in Rust and Swift.
>>>
>>> Explicit universal character names and code points are available for
>>> particular Unicode standards from the published database, and could be
>>> included as an appendix.
>>>
>>> _______________________________________________
>>> SG16 Unicode mailing list
>>> Unicode at isocpp.open-std.org
>>> http://www.open-std.org/mailman/listinfo/unicode