[SG16-Unicode] [isocpp-core] Fwd: New Core Issue: [lex.name]/3.2 under-specifies "uppercase letter"
Mathias Stearn
redbeard0531+isocpp at gmail.com
Mon Oct 28 21:25:54 CET 2019
On Mon, Oct 28, 2019 at 12:58 PM Richard Smith <richardsmith at google.com>
wrote:
> On Mon, Oct 28, 2019 at 9:39 AM Mathias Stearn via Core <
> core at lists.isocpp.org> wrote:
>
>> Is it just uppercase letters in the basic source character set, or
>> anything considered an uppercase letter in the universal character set
>> after phase 1 transcoding and universal-character-name resolution? Or is
>> there some other definition of uppercase?
>>
>
> My interpretation:
>
> * We don't resolve universal-character-names; rather, we *form* them. (E.g.,
> int façade; is converted into int fa\u00e7ade;) So for example _Ç becomes
> _\u00c7, which doesn't start with an underscore followed by an uppercase
> letter (it's an underscore followed by a backslash).
>
I considered that, but it felt like an overly legalistic reading at the
time. It also seems to run counter to http://eel.is/c++draft/lex.name#1. On
the other hand, the first sentence there, "An identifier is an arbitrarily
long sequence of letters and digits.", is clearly incorrect because many of
the allowed code points (including all emoji) are neither letters nor digits.
That reading also seems vaguely counter to the "spirit" of
http://eel.is/c++draft/lex.phases#1.1.sentence-4, but I have no idea what
the normative impact of that sentence is. (I hope compilers' internal
encoding choices are not observable...)
I guess [lex] needs some cleanup in general.
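
To make the "form universal-character-names" reading concrete, here is a
minimal sketch of it. This is an illustration only, not taken from the
standard or from any compiler. The helper name form_ucns is hypothetical, and
for simplicity it leaves every code point below U+0080 alone, which is
slightly broader than the basic source character set.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: models phase 1 as "spell every non-basic character
// as a universal-character-name". Assumption: anything below U+0080 is
// treated as basic and copied through unchanged.
std::string form_ucns(const std::u32string& source) {
    std::string out;
    for (char32_t cp : source) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            // \uXXXX for BMP code points, \UXXXXXXXX otherwise.
            char buf[11];
            std::snprintf(buf, sizeof buf,
                          cp <= 0xFFFF ? "\\u%04x" : "\\U%08x",
                          static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

Under this model, form_ucns(U"_Ç") produces the six characters that follow
the underscore in _\u00c7, so the second character of the identifier is a
backslash and the "underscore followed by an uppercase letter" rule never
sees Ç.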
> * Unicode (to which we have a normative reference) defines uppercase, and
> we follow that, but we happen to only ever apply it to the basic source
> character set because of the above rewriting.
>
>
>> I have a slight preference for restricting to just A-Z so that it doesn't
>> require humans or tools to consult the Unicode data tables to decide if an
>> identifier is safe to use.
>>
>
> Regardless of how we express the rule, I agree with this direction.
>
>> Proposed resolution:
>>
>> Replace [lex.name]/3.2 with:
>>
>> Each identifier that contains a double underscore __ or begins with an
>> underscore followed by an uppercase <del>letter</del><ins>*nondigit*</ins>
>> is reserved to the implementation for any use.
>>
>
> ... and I think this is a fine wording improvement, whether or not we
> think it's formally necessary.
>
>
>> Alternatively we could either create a new grammar production for
>> uppercase *nondigit*s, or just say something like "one of the universal
>> characters in the range 0041-005A (A-Z)"
>>
>>
>
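
For reference, the A-Z-only direction discussed above can be checked without
any Unicode tables. The following is a rough sketch under that assumption;
is_reserved_identifier is a hypothetical name, and the function only models
the proposed [lex.name]/3.2 rule, not the other reserved-name rules.

```cpp
#include <string_view>

// Hypothetical helper: true if `name` would be reserved for any use under
// the proposed wording, reading "uppercase nondigit" as exactly A-Z.
constexpr bool is_reserved_identifier(std::string_view name) {
    // Spelled-out alphabet so the check does not rely on A-Z being
    // contiguous in the execution character set.
    constexpr std::string_view uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Any double underscore, anywhere in the identifier.
    if (name.find("__") != std::string_view::npos) return true;

    // A leading underscore followed by one of A-Z.
    return name.size() >= 2 && name[0] == '_' &&
           uppercase.find(name[1]) != std::string_view::npos;
}
```

With this check, _Foo and a__b are reserved, while _foo and an identifier
spelled _\u00c7 are not.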