<div dir="ltr"><div dir="ltr"></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Oct 26, 2019 at 4:29 PM Steve Downey <<a href="mailto:sdowney@gmail.com">sdowney@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">Building a static checker wouldn't be that hard, and would mean the compiler doesn't need to have a deep understanding of the unicode database. <div dir="auto"><br></div><div dir="auto">That's a reason to not use emoji, too, unfortunately. </div></div></blockquote><div><br></div><div>I don't particularly care which tool addresses security concerns, I'm just saying that IMO a tool should do so before the committee considers anything.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Oct 26, 2019, 11:37 JF Bastien <<a href="mailto:cxx@jfbastien.com" rel="noreferrer" target="_blank">cxx@jfbastien.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><br></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Oct 25, 2019 at 7:07 AM Steve Downey <<a href="mailto:sdowney@gmail.com" rel="noreferrer noreferrer" target="_blank">sdowney@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">We also should consider Unicode Technical Report #36 UNICODE SECURITY CONSIDERATIONS. Although my first thought was that allowing confusing characters in an identifier is just a developer causing problems for themselves, it is actually a problem in code review. 
Using punning names to disguise that a local `i` is not shadowing an outer scope `i` and using that to inject an exploitable buffer attack, for example. If I spend some black hat time, I could probably craft something even "better". </div></blockquote><div dir="auto"><br></div><div dir="auto">IMO: The above security considerations seem like something that compiler diagnostics should try out non-normatively before we try to standardize anything. </div><div dir="auto"><br></div><div dir="auto">I think TR31 should be some in the standard. One goal is that all compilers end up supporting Unicode the same way. The current situation is pretty silly. </div><div dir="auto"><br></div><div dir="auto"><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">TR31 has some discussion on normalization. I think canonicalization is probably the right thing to do, as anything else leads to tools lying to you without intending to. It should not matter how my editor decides to craft a letter with a diacritic, even if the source code takes a round trip through some rich text or word processor. This is an implementation burden, though. Really anything beyond the current white list is, in any case. <br><br>We'd also probably need to clarify that this means an even stronger requirement on the internal representation of source code. The input text has to be converted into code points (universal character names) and all of the operations we are talking about apply to that representation. Representation of the code points is implementation specific. <br><br>I'm going to end up writing this paper, aren't I. 
<br><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Oct 25, 2019 at 3:10 AM Corentin <<a href="mailto:corentin.jabot@gmail.com" rel="noreferrer noreferrer" target="_blank">corentin.jabot@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 25 Oct 2019 at 08:58, Corentin <<a href="mailto:corentin.jabot@gmail.com" rel="noreferrer noreferrer" target="_blank">corentin.jabot@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Oct 25, 2019, 02:18 Zach Laine <<a href="mailto:whatwasthataddress@gmail.com" rel="noreferrer noreferrer" target="_blank">whatwasthataddress@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Is this a real problem that is biting people right now? Are people using these characters in identifiers and causing great upheaval? 
This seems of the lowest possible priority to me, and not at all C++20-related.</div></blockquote></div></div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto">Completely agree, with both of you.</div><div dir="auto">I would be deeply unsatisfied with a solution that would:</div><div dir="auto"><br></div><div dir="auto">* Not follow TR31 recommendations</div><div dir="auto">* Not address the fact that you can only have Unicode identifiers if the compiler knows that your file id </div></div></blockquote><div><br></div><div>I sent the previous mail too fast, sorry about the noise.</div><div>As I was saying</div><div><br></div>Completely agree, with both of you.<br>I would be deeply unsatisfied with a solution that would:<br><br>* Not follow TR31 recommendations<br>* Not address the fact that you can only have Unicode identifiers if the compiler knows that your file is UTF encoded (same issue as u8 literals, we talked about that - P1880)</div><div class="gmail_quote">* Fail to address concerns related to mangling if a normalization form is not specified</div><div class="gmail_quote">* Fail to recognize that we will want to reflect on the name of these things (std::meta::name_of) and that would require reflection to be able to deal with that, both in terms of providing a UTF-8 API AND a specified normalization form <br><div><br></div><div>All of that requires careful consideration</div><div><br></div><div><br></div><div>Corentin</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div><div>Zach</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 24, 2019 at 5:25 PM Steve Downey <<a 
href="mailto:sdowney@gmail.com" rel="noreferrer noreferrer noreferrer" target="_blank">sdowney@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">SG16 has an NB comment to deal with! Tom has already scheduled it for Belfast. It's basically that the list of allowed code points has some interesting control characters like zero-width joiners and RTL modifiers. <br><br><a href="https://github.com/cplusplus/nbballot/issues/28" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/cplusplus/nbballot/issues/28</a><br><br>There's also an issue that JF raised earlier:<br><a href="https://github.com/sg16-unicode/sg16/issues/48" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/sg16-unicode/sg16/issues/48</a><br>Improve support for Unicode characters in identifiers <br><br>Relevant Unicode standard:<br><a href="https://unicode.org/reports/tr31/" rel="noreferrer noreferrer noreferrer" target="_blank">https://unicode.org/reports/tr31/</a> UNICODE IDENTIFIER AND PATTERN SYNTAX <br><br>Which is complicated because it allows things like identifiers written in Farsi, which requires ZWJ for disambiguation, and suggests regex to detect particular allowed identifiers. It's fairly dense, and I haven't digested it yet, but it looks like there might be allowed ways to exclude that. <br><br>Plus tailoring would be needed because C++ disallows some characters such as '$' which might otherwise be allowed. This is also discussed in TR31. <br><br><br>My feeling on the comment is that it's not a new issue for C++20, so it's not clear that it has to be fixed for C++20. I believe it should be fixed, but it ought to be fixed in a principled manner, and that likely means TR31. <br><br>We would also have to discuss whether emoji are allowed in identifiers. TR31 does not strictly disallow them. The TonyTable shall be interesting. 
<br><br><br><br></div>
_______________________________________________<br>
SG16 Unicode mailing list<br>
<a href="mailto:Unicode@isocpp.open-std.org" rel="noreferrer noreferrer noreferrer" target="_blank">Unicode@isocpp.open-std.org</a><br>
<a href="http://www.open-std.org/mailman/listinfo/unicode" rel="noreferrer noreferrer noreferrer noreferrer" target="_blank">http://www.open-std.org/mailman/listinfo/unicode</a><br>
</blockquote></div>
</blockquote></div></div></div>
</blockquote></div></div>
</blockquote></div>
</blockquote></div></div>
</blockquote></div>
</blockquote></div></div>
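<div><br></div><div>To make the normalization and static-checker points in this thread concrete, here is a minimal Python sketch. It is an illustration only, not anything proposed by the posters: the function names and the small deny-list are hypothetical, NFC is picked merely as an example of a normalization form, and a real checker would take its character sets from the TR31 and TR36 data rather than a hand-written set. It uses only the standard <code>unicodedata</code> module.</div>

```python
import unicodedata

# Hypothetical deny-list for illustration only: a few zero-width and
# bidirectional control characters of the kind the NB comment mentions.
SUSPICIOUS = {
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
}


def same_identifier(a, b):
    """True when two spellings are canonically equivalent (compared in NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def check_identifier(name):
    """Return a list of human-readable findings for one identifier."""
    findings = []
    # An identifier that changes under normalization was not stored in NFC.
    if unicodedata.normalize("NFC", name) != name:
        findings.append("not in NFC")
    # Report each deny-listed character with its code point and name.
    for ch in name:
        if ch in SUSPICIOUS:
            label = unicodedata.name(ch, "UNNAMED")
            findings.append("contains U+%04X %s" % (ord(ch), label))
    return findings


composed = "caf\u00e9"     # e-acute as a single code point
decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
```

<div>The two spellings differ code point by code point, yet <code>same_identifier(composed, decomposed)</code> treats them as one name, which is the "it should not matter how my editor decides to craft a letter with a diacritic" property. <code>check_identifier</code> is the kind of non-normative diagnostic a compiler or separate tool could try out first.</div>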