<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>Hey, Zach. The following are some items I had intended to
mention in our last telecon when we were discussing <a
moz-do-not-send="true"
href="https://github.com/tzlaine/small_wg1_papers/blob/master/P1879_please_dont_rewrite_my_string_literals.md">P1879R0</a>,
but we ran out of time. These are offered in the spirit of making
the paper the best it can be.</p>
<ol>
<li>I suggest referring to the MSVC <tt>/source-charset:utf-8</tt>
option instead of the <tt>/utf-8</tt> option since the issue
presented primarily concerns the assumed source file encoding.
The distinction also matters for the later comments
regarding omission of the <tt>u8</tt> prefix in order to retain
the exact code units from the source file; that behavior depends
on the source file encoding exactly matching the execution
encoding. If the two don't match, then
ordinary string literal contents will be transcoded similarly to
UTF literals.<br>
</li>
<li>I think it would be useful to expand on the MSVC behavior. In
particular, state that, by default, MSVC assumes the Active Code
Page for both the encoding of source files and the execution
character set, and that the particular values that you witnessed
at run-time were the result of the source files being decoded as
Windows-1252 and then transcoded to UTF-8. Specifically, the <tt>0xCF</tt>
code unit was interpreted as U+00CF {LATIN CAPITAL LETTER I WITH
DIAERESIS} and encoded as <tt>0xC3 0x8F</tt> and the <tt>0x82</tt>
code unit was interpreted as U+201A {SINGLE LOW-9 QUOTATION
MARK} and encoded as <tt>0xE2 0x80 0x9A</tt>.</li>
<li>Per the <a moz-do-not-send="true"
href="https://github.com/sg16-unicode/sg16-meetings#october-9th-2019">meeting
summary from the telecon</a>, it was suggested that, rather than
prohibiting use of UTF literals completely in non-UTF encoded
source files, we restrict the set of characters
that may be directly transcoded from the source file. This
would allow, for example, encoding U+03C2 {GREEK SMALL LETTER
FINAL SIGMA} using <tt>u8"\u03C2"</tt>, but not <tt>u8"ς"</tt>
(in non-UTF encoded source files). Unfortunately, it is not
obvious where to draw the line between the source encoded
characters that are allowed and those that are not. Some
possibilities follow
(these pretty much match what was discussed in the telecon):</li>
<ol>
<li>Restrict to source file characters from the basic source
character set. This solves the portability issue well, but is
pretty restrictive. '<tt>$</tt>' and '<tt>@</tt>' are not
members of the basic source character set so this approach
would require writing email addresses with an escape sequence
for the '<tt>@</tt>' sign: e.g., <tt>u8"tom\u0040honermann.net"</tt>.
Yuck (feel free to propose adding '<tt>@</tt>' to the basic
source character set!).</li>
<li>Restrict to characters that transcode to ASCII characters.
This solves the portability issue well for the MSVC compiler
since all of its supported source encodings are ASCII
derivatives (the compiler can diagnose any source file code
units with a value above <tt>0x7F</tt>). It doesn't solve
the issue well for EBCDIC code pages since they don't all
share a common set of code points that map to ASCII characters
(for example, in IBM-1047, <tt>0x5F</tt> maps to U+005E {CIRCUMFLEX
ACCENT} whereas in IBM-037, <tt>0x5F</tt> maps to U+00AC {NOT SIGN}).
Unfortunately, I don't think EBCDIC code pages have a common
subset equivalent to what ASCII is for Windows code pages; that
makes designing a solution that addresses EBCDIC considerably
more challenging (I think it is reasonable to not try to solve
this issue for EBCDIC).<br>
</li>
</ol>
<li>The proposed and alternatively discussed changes all break
backward compatibility. I think the paper should call this out
explicitly, ideally with some analysis of the anticipated
impact. In particular, it may be worth noting that UTF literals
are used on z/OS to obtain ASCII/Unicode strings needed for
interaction on the web.<br>
</li>
</ol>
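<p>In case it is useful for items 2 and 3, here is a rough Python
sketch of the transcoding described above. It assumes the source
file contained the UTF-8 bytes for U+03C2 (<tt>0xCF 0x82</tt>), and
it uses Python's <tt>cp1252</tt> and <tt>cp037</tt> codecs as
stand-ins for what the compilers do. The variable names and the <tt>has_non_ascii_code_units</tt>
helper are hypothetical and only for illustration.</p>
<pre>
```python
# Assumption: the source file stores U+03C2 as UTF-8, i.e. 0xCF 0x82.
source_bytes = "\u03c2".encode("utf-8")

# Step 1: the file is decoded as Windows-1252 (standing in for the
# Active Code Page), so each byte becomes its own character.
misread = source_bytes.decode("cp1252")

# Step 2: the contents of the u8 literal are encoded as UTF-8.
result = misread.encode("utf-8")

print(source_bytes.hex(" "))                   # cf 82
print([f"U+{ord(c):04X}" for c in misread])    # ['U+00CF', 'U+201A']
print(result.hex(" "))                         # c3 8f e2 80 9a


def has_non_ascii_code_units(data):
    """Hypothetical check for option 2 of item 3: report whether any
    source code unit lies outside the ASCII range 0x00..0x7F."""
    return any(unit not in range(0x80) for unit in data)


print(has_non_ascii_code_units(source_bytes))          # True
print(has_non_ascii_code_units(b"tom@honermann.net"))  # False

# The EBCDIC side: in IBM-037 the code unit 0x5F is U+00AC (NOT SIGN).
ebcdic_037 = bytes([0x5F]).decode("cp037")
print(f"U+{ord(ebcdic_037):04X}")                      # U+00AC
```
</pre>
<p>I only show IBM-037 for the EBCDIC case because, as far as I know,
Python's standard library does not ship an IBM-1047 codec.</p>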
<p>Was this paper submitted for the Belfast pre-meeting mailing?<br>
</p>
<p>Tom.<br>
</p>
</body>
</html>