<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 6/23/19 5:17 AM, Corentin wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">Let's further assume
<div>
<ul>
<li>Unicode will not be replaced and superseded in our
lifetime</li>
</ul>
</div>
</div>
</blockquote>
Predictions have a spectacular rate of failure, but in this case, I
would not bet against you :)<br>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>
<ul>
<li>Unicode is the only character set able to handle
text, for it is the only encoding that is a strict superset of
all previously existing encodings and strives to encompass
all characters used by people</li>
</ul>
</div>
</div>
</blockquote>
Sure.<br>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>
<ul>
<li>Poor Unicode support at the language and system levels
has led to most developers having a poor understanding of
text and encoding (regardless of skill level)</li>
</ul>
</div>
</div>
</blockquote>
Poor Unicode support *might* be a contributing factor, but is hardly
the only one. Sometimes, poor support is a motivating factor for
getting educated.<br>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>
<ul>
<li>In many cases developers and engineers only care about
sequences of bytes they can pronounce rather than text,
and in these cases ASCII is sufficient</li>
</ul>
</div>
</div>
</blockquote>
Likewise, EBCDIC is sufficient on some systems.<br>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>
<ul>
<li>Systems and compiler vendors have a vested interest in
supporting Unicode which is already the most used encoding
in user-facing systems <a
href="https://en.wikipedia.org/wiki/Unicode#Adoption"
moz-do-not-send="true">https://en.wikipedia.org/wiki/Unicode#Adoption</a>
<br>
</li>
</ul>
</div>
</div>
</blockquote>
<p>Not necessarily. Some may take the view that Unicode support is
an external library problem.</p>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>I propose that we work towards making Unicode the only
supported _source_ character set - I realize this might take
time as far from all source files are encoded in a UTF
encoding, however Unicode is designed to make that possible.</div>
<div>This is also standard practice and both GCC and Clang will
assume a UTF-8 encoding</div>
</div>
</blockquote>
I don't see a need for the standard to impose a single supported
source character set; I think it is better to let the market drive
this. If a convergence occurs, that would be the appropriate time
for the standard to reflect it. It is appropriate for the standard
to lead in some cases, but in this case, there is considerable
history that prohibits a wholesale migration. I agree with the
general sentiment though; I do want to encourage and make migration
easier.<br>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div><br>
</div>
<div>In the meantime, I propose that:</div>
<div>
<ul>
<li>Source with characters outside of the basic source
character set embedded in a UTF-8, UTF-16 or UTF-32
character or string literal is ill-formed<b> if and only
if</b> the source/input character set is not Unicode.</li>
</ul>
</div>
</div>
</blockquote>
That would break existing code for, in my opinion, little gain. I
think the more pressing concern is a means of determining which
encoding to interpret a source file as.<br>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>
<ul>
<li>We put out a guideline recommending that source files
be UTF-8 encoded</li>
</ul>
</div>
</div>
</blockquote>
Are you suggesting a standing document? I don't see much benefit in
doing so.<br>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>
<ul>
<li>We put in the standard that compilers should assume
UTF-8 as the default input encoding for source files unless
implementation-defined out-of-band information says
otherwise (which would have no practical impact, but would
signal that we recommend supporting UTF-8)</li>
</ul>
</div>
</div>
</blockquote>
This would again break existing code. And the out-of-band
information used today by the Microsoft compiler is the current
locale (active code page).<br>
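<p>To illustrate why a locale-dependent default matters, here is a
small Python sketch (an illustration only, not a model of any
particular compiler). It decodes the single byte 0xE9 under three
assumed source encodings; the choice of Windows-1252, Windows-1251,
and UTF-8, and the helper name decode_as, are mine for the example.</p>
<pre>
```python
# Illustration only: one byte sequence interpreted under different
# assumed source encodings. The encodings below are example choices.
source_bytes = "\u00e9".encode("cp1252")  # b"\xe9", as a Windows-1252 editor would save it


def decode_as(data: bytes, encoding: str):
    """Decode strictly; return the text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Map each assumed encoding to what the same bytes would mean under it.
results = {enc: decode_as(source_bytes, enc) for enc in ("cp1252", "cp1251", "utf-8")}
```
</pre>
<p>The same byte is U+00E9 under Windows-1252, U+0439 under
Windows-1251, and is not valid UTF-8 on its own, so changing the
assumed default changes the meaning of existing files.</p>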
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>
<ul>
<li>We deprecate character and string literals and wide
character and string literals (char, wchar_t) whose source
representation contains characters not representable in the
basic execution character set or wide execution character
set respectively. We encourage implementers to emit a
warning in these cases. The intent is to avoid loss of
information when transcoding to the execution character
set. This matches existing practice</li>
</ul>
</div>
</div>
</blockquote>
<p>This is not existing practice in my experience, and I'm not sure
what techniques you have in mind for such encouragement. I think
I can get on board with such deprecation for the non-basic [wide]
(presumed) execution character set. E.g., I'd like to change the
implementation-defined and conditionally-supported behavior in
these clauses such that the program becomes ill-formed:<br>
</p>
<ul>
<li>[lex.phases]p1.5 (<a class="moz-txt-link-freetext" href="http://eel.is/c++draft/lex.phases#1.5">http://eel.is/c++draft/lex.phases#1.5</a>)</li>
<li>[lex.ccon]p2 (<a class="moz-txt-link-freetext" href="http://eel.is/c++draft/lex.ccon#2">http://eel.is/c++draft/lex.ccon#2</a>)</li>
<li>[lex.ccon]p6 (<a class="moz-txt-link-freetext" href="http://eel.is/c++draft/lex.ccon#6">http://eel.is/c++draft/lex.ccon#6</a>)</li>
</ul>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>
<div><br>
</div>
</div>
<div>The proposed changes hope to make it easier to use string
literals and Unicode string literals without loss of
information portably across platforms by capitalizing on
char8_t <br>
</div>
</div>
</blockquote>
Maybe I'm missing it, but I don't see how what is proposed above
helps capitalize on char8_t.<br>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div>They would standardize existing practice, match common
practice in other languages (Go, Rust, Swift, Python) and avoid
bugs related to loss of information when transcoding arbitrary
Unicode data to legacy encodings able to represent only a
very small subset of the characters defined by Unicode. <br>
</div>
</div>
</blockquote>
I'm not seeing the parallels here. I do agree with wanting to
change source-to-execution character set transcoding problems into
compile-time errors though.<br>
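<p>As a rough sketch of what turning a transcoding problem into an
error looks like, the Python below encodes a literal's text strictly
and rejects anything the target encoding cannot represent. The
function name transcode_literal and the use of Python codecs as
stand-ins for an execution character set are assumptions made for
illustration; this is not how any particular compiler is implemented.</p>
<pre>
```python
def transcode_literal(text: str, execution_encoding: str) -> bytes:
    """Encode a literal's text for the execution encoding, refusing any loss.

    Hypothetical helper: a Python codec stands in for the execution
    character set of a C++ implementation.
    """
    try:
        return text.encode(execution_encoding, errors="strict")
    except UnicodeEncodeError as exc:
        bad = text[exc.start:exc.end]
        raise ValueError(
            f"ill-formed: {bad!r} is not representable in {execution_encoding}"
        ) from exc


# For contrast, a lossy conversion silently substitutes a replacement byte.
lossy = "\u4e2d".encode("cp1252", errors="replace")
```
</pre>
<p>With errors="replace" the unrepresentable character quietly becomes
a question mark, which is the kind of information loss a hard error
would surface at translation time.</p>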
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div><br>
</div>
<div>It also makes it feasible to have Unicode identifiers in the
future, as proposed by JF <br>
</div>
</div>
</blockquote>
<p>We actually have them now.</p>
<ul>
<li>[lex.name]p1 (<a class="moz-txt-link-freetext" href="http://eel.is/c++draft/lex.name#1">http://eel.is/c++draft/lex.name#1</a>)</li>
</ul>
<p>I started on a related paper for the pre-Cologne mailing, but
didn't get it done in time. I hope to have it far enough along
to present at least some of the content at the SG16 evening
session in Cologne. One of the goals is to enable encodings to be
determined on a per-source-file basis. Two options will be
discussed:<br>
</p>
<ol>
<li>Specifying that implementations check for a Unicode BOM. BOMs
aren't particularly popular, but are sufficiently supported.</li>
<li>Specifying a #pragma directive that indicates the encoding.
This is similar to features in Python and the HTML spec.<br>
</li>
</ol>
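<p>A minimal sketch of how the two options could combine, written in
Python for brevity. The directive spelling (#pragma encoding "name"),
the 1024-byte search window, and the function name
detect_source_encoding are all hypothetical; no such pragma is
standardized, and the pragma search as written only works when the
file is in an ASCII-compatible encoding.</p>
<pre>
```python
import codecs
import re

# BOMs are checked with UTF-32 first so that the UTF-32 LE BOM
# (FF FE 00 00) is not mistaken for the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Hypothetical directive spelling, assumed for this example only.
_PRAGMA = re.compile(
    rb'^[ \t]*#[ \t]*pragma[ \t]+encoding[ \t]+"([A-Za-z0-9._-]+)"',
    re.MULTILINE,
)


def detect_source_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return an encoding name for a source file's raw bytes.

    Order of precedence in this sketch: BOM, then pragma, then default.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Only the start of the file is searched (arbitrary 1024-byte window).
    match = _PRAGMA.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```
</pre>
<p>The default parameter stands in for whatever out-of-band
information an implementation uses today, such as the active code
page.</p>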
<p>Tom.<br>
</p>
<blockquote type="cite"
cite="mid:CA+Om+Sh573ZvvjTkZ64uGmEGwNvB4oh1gTfUeRPTuxLv2gPPSA@mail.gmail.com">
<div dir="ltr">
<div><br>
</div>
<div>Looking forward to discussing these ideas further,</div>
<div>Corentin</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
SG16 Unicode mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Unicode@isocpp.open-std.org">Unicode@isocpp.open-std.org</a>
<a class="moz-txt-link-freetext" href="http://www.open-std.org/mailman/listinfo/unicode">http://www.open-std.org/mailman/listinfo/unicode</a>
</pre>
</blockquote>
</body>
</html>