[SG16-Unicode] NL 029 : Disallow zero-width and control characters

Corentin corentin.jabot at gmail.com
Fri Oct 25 09:09:56 CEST 2019


On Fri, 25 Oct 2019 at 08:58, Corentin <corentin.jabot at gmail.com> wrote:

>
>
> On Fri, Oct 25, 2019, 02:18 Zach Laine <whatwasthataddress at gmail.com>
> wrote:
>
>> Is this a real problem that is biting people right now?  Are people using
>> these characters in identifiers and causing great upheaval?  This seems of
>> the lowest possible priority to me, and not at all C++20-related.
>>
>
>
> Completely agree, with both of you.
> I would be deeply unsatisfied with a solution that would:
>
> * Not follow TR31 recommendations
> * Not address the fact that you can only have Unicode identifiers if the
> compiler knows that your file id
>

I sent the previous mail too fast, sorry about the noise.
As I was saying:

Completely agree with both of you.
I would be deeply unsatisfied with a solution that would:

* Not follow TR31 recommendations
* Not address the fact that you can only have Unicode identifiers if the
compiler knows that your file is UTF encoded (same issue as with u8 literals,
we talked about that - P1880)
* Fail to address concerns related to mangling if a normalization form is
not specified
* Fail to recognize that we will want to reflect on the names of these
things (std::meta::name_of), which would require reflection to be able to
deal with that, both in terms of providing a UTF-8 API and a specified
normalization form
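
To make the normalization and mangling points concrete, here is a minimal
sketch, not taken from any paper or implementation. It assumes the compiler
copies identifier bytes into the symbol name unchanged; same_symbol is a
hypothetical stand-in for the byte-wise comparison a linker would do, and the
two spellings of "café" are only illustrative.

```cpp
#include <string>

// "café" spelled two ways that render identically.
// NFC: U+00E9 (precomposed e-acute), UTF-8 bytes C3 A9.
const std::string cafe_nfc = "caf\xC3\xA9";
// NFD: U+0065 U+0301 (e + combining acute accent), UTF-8 bytes 65 CC 81.
const std::string cafe_nfd = "cafe\xCC\x81";

// Hypothetical stand-in for a linker: symbol names match only if the
// bytes match. No normalization is applied (that is the assumption here).
bool same_symbol(const std::string& a, const std::string& b) {
    return a == b;
}
```

Under that assumption same_symbol(cafe_nfc, cafe_nfd) is false, so a
declaration typed in one form and a definition typed in the other would not
refer to the same entity, and a name returned by reflection would have the
same ambiguity unless a form is specified.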

All of that requires careful consideration.
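
For reference, the kind of code points the NB comment is about can be flagged
with a check like the one below. This is only a sketch: the function names are
made up, the list covers the zero-width joiners and the directional controls
and nothing else, and it is not the TR31 rule (TR31 allows ZWJ/ZWNJ in some
contexts, which this deliberately ignores).

```cpp
#include <string>

// True for a few invisible format characters (illustrative list only).
constexpr bool is_format_control(char32_t cp) {
    return cp == 0x200C                     // ZERO WIDTH NON-JOINER
        || cp == 0x200D                     // ZERO WIDTH JOINER
        || cp == 0x200E || cp == 0x200F     // LRM, RLM
        || (cp >= 0x202A && cp <= 0x202E)   // LRE, RLE, PDF, LRO, RLO
        || (cp >= 0x2066 && cp <= 0x2069);  // LRI, RLI, FSI, PDI
}

// Hypothetical helper: does an identifier, already decoded to UTF-32,
// contain any of the code points above?
bool contains_format_control(const std::u32string& identifier) {
    for (char32_t cp : identifier) {
        if (is_format_control(cp)) return true;
    }
    return false;
}
```

A blanket rejection like this is the simple option; following TR31 would mean
looking at the surrounding characters before rejecting a joiner.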


Corentin




>
>> Zach
>>
>> On Thu, Oct 24, 2019 at 5:25 PM Steve Downey <sdowney at gmail.com> wrote:
>>
>>> SG16 has an NB comment to deal with! Tom has already scheduled it for
>>> Belfast. It's basically that the list of allowed code points has some
>>> interesting control characters like zero width joiners and RTL modifiers.
>>>
>>> https://github.com/cplusplus/nbballot/issues/28
>>>
>>> There's also an issue that JF raised earlier:
>>> https://github.com/sg16-unicode/sg16/issues/48
>>> Improve support for Unicode characters in identifiers
>>>
>>> Relevant unicode standard:
>>> https://unicode.org/reports/tr31/ UNICODE IDENTIFIER AND PATTERN SYNTAX
>>>
>>>
>>> Which is complicated because it allows things like identifiers written
>>> in Farsi, which require zwj for disambiguation, and suggests regex to
>>> detect particular allowed identifiers. It's fairly dense, and I haven't
>>> digested it yet, but it looks like there might be allowed ways to exclude
>>> that.
>>>
>>> Plus tailoring would be needed because C++ disallows some characters
>>> such as '$' which might otherwise be allowed. This is also discussed in
>>> TR31.
>>>
>>>
>>> My feeling on the comment is that it's not a new issue for C++20, so
>>> it's not clear that it has to be fixed for C++20. I believe it should be
>>> fixed, but it ought to be fixed in a principled manner, and that likely
>>> means TR31.
>>>
>>> We would also have to discuss if emoji are allowed in identifiers. TR31
>>> does not strictly disallow them. The TonyTable shall be interesting.
>>>
>>>
>>>
>>> _______________________________________________
>>> SG16 Unicode mailing list
>>> Unicode at isocpp.open-std.org
>>> http://www.open-std.org/mailman/listinfo/unicode
>>>
>>
>