<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="">
<div style="margin: 0px;"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">></span><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">Far
more importantly, if the committee can assume unicode-clean source </span><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">code going forth, that makes far more
tractable lots of other problems </span><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">such as how char string literals ought to be interpreted.</span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><br>
</span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">I don't think this actually matters for implementations. The standard can describe what
happens for Unicode and let implementations figure out what that means for the legacy encodings they target. An implementation on an EBCDIC machine, for example, can do an 'as if' notional conversion into UTF-8 for the purposes of following the standard's
rules.</span></div>
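<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><br>
</span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">A rough sketch of what I mean by a notional conversion, in Python only because it ships EBCDIC codecs. Code page 037 and the helper name as_if_utf8 are assumptions made for illustration; nothing here is what the standard or any particular compiler actually specifies.</span></div>
<pre>
```python
# Hypothetical sketch: the source bytes on disk are EBCDIC (code page 037 is
# an assumption), but the language rules are applied to the UTF-8 form.

def as_if_utf8(source_bytes, source_encoding="cp037"):
    """Return the UTF-8 bytes the standard's rules would notionally see."""
    return source_bytes.decode(source_encoding).encode("utf-8")


# The identifier "puts" as it would be stored on an EBCDIC (cp037) system.
ebcdic_source = "puts".encode("cp037")

# The same identifier after the notional conversion.
notional = as_if_utf8(ebcdic_source)
```
</pre>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">The bytes on disk differ from the notional UTF-8 bytes, but the rules only ever need to talk about the latter.</span></div>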
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><br>
</span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">(I've been saying for years that we should use IEEE 754 language for floats even though some machines
don't have that. This is very similar: describe the behavior you want and let implementations with special considerations get as close to that as is practical.)</span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><br>
</span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">The bigger problem is what happens to puts("some string literal") on such an EBCDIC machine
if the terminal is not expected to be UTF-8, or to comparisons of literals with argv when argv is not UTF-8 </span><span style="margin: 0px; display: inline !important;"><span title=":slight_smile:" style="margin: 0px; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">🙂.</span></span></div>
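<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><br>
</span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">To make the argv case concrete, here is a minimal sketch. The option string, the helper name matches_argument, and the choice of code page 037 for the environment are all hypothetical; the point is only that a byte-wise comparison succeeds or fails depending on which encoding the literal ended up in.</span></div>
<pre>
```python
# Hypothetical sketch: a byte-wise comparison (what strcmp would do) between
# a command-line argument and a string literal, under two literal encodings.

LITERAL = "--help"


def matches_argument(arg_bytes, literal, literal_encoding):
    """Compare the raw argument bytes with the literal as encoded in the binary."""
    return arg_bytes == literal.encode(literal_encoding)


# What an EBCDIC (cp037, assumed) environment would pass in argv[1].
argv1 = LITERAL.encode("cp037")

same_encoding = matches_argument(argv1, LITERAL, "cp037")
utf8_literal = matches_argument(argv1, LITERAL, "utf-8")
```
</pre>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">The comparison only matches when the literal is encoded the same way as the environment's argv, which is the part the source encoding discussion does not settle.</span></div>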
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><br>
</span><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">></span><span style="margin: 0px; display: inline !important;"><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">The
present implementation-defined interpretation of the byte sequence in</span><br style="">
<span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">>source files allows a default of "UTF-8 in strings, comments can use</span><br style="">
<span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">>arbitrary bytes" (which thus allows existing source files in a range of</span><br style="">
<span style="display: inline !important;"><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">></span></span><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">ASCII-compatible
8-bit character sets if the non-ASCII characters only</span><br style="">
<span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">>appear in comments, without needing to tell the compiler which character</span><br style="">
<span style="display: inline !important;"><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">></span></span><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">set
is being used). That approach (which is what GCC does by default)</span><br style="">
<span style="display: inline !important;"><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">></span></span><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">seems
more friendly to users with existing source files using various</span><br style="">
<span style="display: inline !important;"><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">></span></span><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">character
sets in comments than strictly requiring everything to be UTF-8</span><br style="">
<span style="display: inline !important;"><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">></span></span><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">(even
in comments) unless the compiler is explicitly told otherwise.</span></span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><span style="display: inline !important;"><br>
</span></span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><span style="display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">I don't think GCC's behavior here would be prevented
by the standard describing the program input in terms of UTF-8.</span></span></div>
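<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><br>
</span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">For reference, the kind of file under discussion looks like the sketch below: a Latin-1 byte inside a comment, with everything outside the comment being plain ASCII. The sample bytes and the helper name is_valid_utf8 are made up for illustration, and this does not model how GCC itself reads source files.</span></div>
<pre>
```python
# Hypothetical sample: a Latin-1 e-acute (0xE9) inside a comment, followed by
# a line of code whose string literal is plain ASCII (and so valid UTF-8).
source = b'/* caf\xe9 */\nputs("ok");\n'


def is_valid_utf8(data):
    """Return True if the bytes decode as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


comment_line, code_line = source.split(b"\n")[:2]

whole_file_ok = is_valid_utf8(source)
comment_ok = is_valid_utf8(comment_line)
code_ok = is_valid_utf8(code_line)
```
</pre>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">A strict UTF-8 reader rejects the file as a whole because of the comment, while the line that carries the string literal is fine on its own.</span></div>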
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important;"><br>
</span></div>
<div style="margin: 0px;"><span style="margin: 0px; display: inline !important; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">Billy3</span></div>
<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Liaison <liaison-bounces@lists.isocpp.org> on behalf of Niall Douglas via Liaison <liaison@lists.isocpp.org><br>
<b>Sent:</b> Wednesday, August 14, 2019 10:36 AM<br>
<b>To:</b> Niall Douglas via Liaison <liaison@lists.isocpp.org><br>
<b>Cc:</b> Niall Douglas <s_sourceforge@nedprod.com>; unicode@open-std.org <unicode@open-std.org><br>
<b>Subject:</b> Re: [wg14/wg21 liaison] [isocpp-core] Source file encoding (was: What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?)</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">> The present implementation-defined interpretation of the byte sequence in
<br>
> source files allows a default of "UTF-8 in strings, comments can use <br>
> arbitrary bytes" (which thus allows existing source files in a range of <br>
> ASCII-compatible 8-bit character sets if the non-ASCII characters only <br>
> appear in comments, without needing to tell the compiler which character <br>
> set is being used). That approach (which is what GCC does by default) <br>
> seems more friendly to users with existing source files using various <br>
> character sets in comments than strictly requiring everything to be UTF-8 <br>
> (even in comments) unless the compiler is explicitly told otherwise.<br>
<br>
I would find that choice unhelpful for tooling which processes C++<br>
source code. e.g. Python, which insists that text you feed it is either<br>
correct, or not text. And that's not unreasonable, either text is<br>
encoded correctly, or it is not.<br>
<br>
What do you think of my "all 7-bit clean ASCII" proposal? #pragma<br>
encoding (if supported by your C compiler) to opt out.<br>
<br>
Niall<br>
Link to this post: <a href="https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.isocpp.org%2Fliaison%2F2019%2F08%2F0022.php&amp;data=02%7C01%7Cbion%40microsoft.com%7C8603993bd2154496bc8f08d720dde475%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637014009707838598&amp;sdata=4BLCv%2FeepKePMWRaf6Da2IvGWIiZAhDBblsuju%2BOWGU%3D&amp;reserved=0">
https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.isocpp.org%2Fliaison%2F2019%2F08%2F0022.php&amp;data=02%7C01%7Cbion%40microsoft.com%7C8603993bd2154496bc8f08d720dde475%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637014009707838598&amp;sdata=4BLCv%2FeepKePMWRaf6Da2IvGWIiZAhDBblsuju%2BOWGU%3D&amp;reserved=0</a><br>
</div>
</span></font></div>
</body>
</html>