<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 14, 2019, 12:39 PM Niall Douglas <<a href="mailto:s_sourceforge@nedprod.com">s_sourceforge@nedprod.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Removed CC to Core, as per Tom's request.<br>
<br>
> I agree with you that reinterpreting all existing code overnight as<br>
> utf-8 would hinder the adoption of future c++ version enough that we<br>
> should probably avoid to do that, but maybe a slight encouragement to<br>
> use utf8 would be beneficial to everyone.<br>
<br>
I don't personally think it's a big ask for people to convert their<br>
source files into UTF-8 when they flip the compiler language standard<br>
version into C++ 23, *if they don't tell the compiler to interpret the<br>
source code in a different way*. As I mentioned in a previous post, even<br>
very complex multi-encoded legacy codebases can be upgraded via Python.<br>
Just invest the effort, upgrade your code, clear the tech debt. Same as<br>
everyone must do with every C++ standard version upgrade.<br>
<br>
Far more importantly, if the committee can assume unicode-clean source<br>
code going forth, that makes far more tractable lots of other problems<br>
such as how char string literals ought to be interpreted.<br>
<br>
Right now there is conflation in this discussion between two types of<br>
char string:<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">I don't think people (at least in SG16) are confused. The standard does conflate everything. I think that's why Tom asked about the names of these things to begin with.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
1. char strings which come from the runtime environment e.g. from<br>
argv[], which can be ANY arbitrary encoding, including arbitrary bits.<br>
<br>
2. char strings which come from the compile time environment with<br>
compiler-imposed expectations of encoding e.g. from __FILE__<br>
<br>
3. char strings which come from the compile time environment with<br>
arbitrary encoding and bits e.g. escaped characters inside string literals.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">2 and 3 will have the same encoding. (Which will utterly fail when we try to introduce Unicode identifiers and reflection.)</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
This conflation is not helping the discussion get anywhere useful<br>
quickly. For example, one obvious solution to the above is that string<br>
literals gain a type of char8_maybe_t if they don't contain anything<br>
UTF-8 unsafe, and char8_maybe_t can implicitly convert to char8_t or to<br>
char.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Maybe we already have enough literal types.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Various people have objected to my proposal on strawman grounds e.g. "my<br>
code would break". Firstly, if that is the case, your code is probably<br>
*already* broken, and "just happens" to work on your particular<br>
toolchain version. It won't be portable, in any case.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto">Agreed. But when I say these kinds of things, people make funny faces and get annoyed at having the brokenness of their code or of the standard pointed out. So this option seems out, especially on Windows, where the system encoding is not UTF-8.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Secondly, as Tom suggested, some sort of #pragma to indicate encoding is<br>
probably unavoidable in the long run in any case, because the<br>
preprocessor also needs to know encoding. Anybody who has wrestled with<br>
files #including files of differing encoding, but insufficiently<br>
different that the compiler can't auto-detect the disparate encoding,<br>
will know what I mean. Far worse happens again when macros with content<br>
from one encoding are expanded into files with different encoding.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">I don't see how the preprocessor factors into that; the mapping to the internal encoding is done before preprocessing.</div><div dir="auto"><br></div><div dir="auto">Also, a pragma doesn't help you when mixing EBCDIC and ASCII supersets.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
The current situation of letting everybody do what they want is a mess.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Strongly agree.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
That's what standardisation is for: imposition of order upon chaos.<br>
<br>
Just make the entire lot UTF-8! And let individual files opt-out if they<br>
want, or whole TUs if the user asks the compiler to do so, with the<br>
standard making it very clear that anything other than UTF-8 =<br>
implementation defined behaviour for C++ 23 onwards.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">That is the pragmatic long-term solution, but not the pragmatic short-term one. WG21 favors the latter, it seems.</div><div dir="auto"><br></div><div dir="auto">I would support such a thing. Other languages went there and it works great for them. Python, for example, will assume UTF-8 in the absence of an encoding pragma.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Niall<br>
_______________________________________________<br>
SG16 Unicode mailing list<br>
<a href="mailto:Unicode@isocpp.open-std.org" target="_blank" rel="noreferrer">Unicode@isocpp.open-std.org</a><br>
<a href="http://www.open-std.org/mailman/listinfo/unicode" rel="noreferrer noreferrer" target="_blank">http://www.open-std.org/mailman/listinfo/unicode</a><br>
</blockquote></div></div></div>