<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 3/7/19 1:46 PM, Corentin wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CA+Om+Sg7AtK5xB1tFQ4s9MQ9yS1jr8-BqLxoRmmCV0_LNvis7w@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr"><br>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, 7 Mar 2019 at 16:59
Tom Honermann &lt;<a href="mailto:tom@honermann.net"
moz-do-not-send="true">tom@honermann.net</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">On 3/7/19
10:30 AM, Ben Boeckel wrote:<br>
> On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom Honermann
wrote:<br>
>> I don't know of any that use 32-bit code units for
file names.<br>
>><br>
>> I find myself thinking (as I so often do these days
much to the surprise<br>
>> of my past self), how do EBCDIC and z/OS fit in
here? If we stick to<br>
>> JSON and require the dependency file to be UTF-8
encoded, would all file<br>
>> names in these files be raw8 encoded and
effectively unreadable (by<br>
>> humans) on z/OS? Perhaps we could allow more
flexibility, but doing so<br>
>> necessarily invites locales into the discussion
(for those that are<br>
>> unaware, EBCDIC has code pages too). For example,
we could require that<br>
>> the selected locale match between the producers and
consumers of the<br>
>> file (UB if they don't) and permit use of the
string representation by<br>
>> transcoding from the locale interpreted physical
file name to UTF-8, but<br>
>> only if reverse-transcoding produces the same
physical file name,<br>
>> otherwise the appropriate raw format must be used.<br>
> I first tried saying "treat these strings as if they
were byte arrays"<br>
> with allowances for escaping `"` and `\`, but there was
pushback on the<br>
> previous thread about it. This basically makes a new
dialect of JSON<br>
> which is (usually) an error in existing
implementations. It would mean<br>
> that tools are implementing their own JSON parsers (or
even writers)…<br>
<br>
This isn't what I was suggesting. Rather, I was suggesting
that <br>
standard UTF-8 encoded JSON be used. On platforms where the <br>
interpretation of a file name may differ based on locale
settings, <br>
if the file name can be losslessly round-tripped to UTF-8 and <br>
back, then its UTF-8 encoding (transcoded from the locale)
would be used <br>
in the JSON file as a (well-formed) UTF-8/JSON string, even
though that <br>
string wouldn't reflect the exact code units of the file name.<br>
<br>
For example, consider a file name consisting of the octets {
0x86, 0x89, <br>
0x93, 0x85, 0x59 }. In EBCDIC code page 37, this denotes a
file name <br>
"fileß", but in EBCDIC code page 273 denotes "file~". The
suggestion <br>
then is, when generating the JSON file, if the current
locale setting is <br>
CP37, to use the UTF-8 encoded name "fileß" as a normal JSON
string. <br>
Tools consuming the file would then have to transcode the
UTF-8 provided <br>
name back to the original locale to open the file.<br>
<br>
Previously, I had suggested that the locales must match for
the producer <br>
and consumer and that it be UB otherwise (effectively
leading to file <br>
not found errors). However, I think it would be better to
store the <br>
encoding used to interpret the file name at generation time
(if it isn't <br>
UTF-8) in the file to allow tools to accurately reverse the
UTF-8 <br>
encoding. The supported encodings and the spelling of their
names <br>
would, of course, be implementation/platform defined.<br>
<br>
><br>
> Note that if you'd like to have a readable filename,
adding it as a<br>
> `_readable` key with a human-readable UTF-8 transcoding
to the filename<br>
> would be supported (see my message with the JSON schema
bits from<br>
> yesterday).<br>
<br>
That seems reasonable to me for file names that really can't
be <br>
represented as UTF-8, but seems like noise otherwise. In
other words, I <br>
think we should try to minimize use of raw8, raw16, etc.
where possible.<br>
</blockquote>
<div><br>
</div>
<div>Didn't we realize that we can't know the encoding of a
filename, so we cannot reliably decode it,</div>
<div>let alone in a round-trip-safe way, and that filenames
therefore can't be anything but bags of bytes?</div>
<div>At least on some platforms?</div>
</div>
</div>
</blockquote>
Yes. However, we do routinely present file names to humans, and that
requires interpreting them according to some encoding. The
challenge, of course, is choosing an encoding and deciding how to present
code unit sequences that are not valid in that encoding.<br>
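<p>As a small illustration of the presentation problem (a sketch only,
assuming Python's built-in `cp037` and `cp273` codecs stand in for the
platform's locale handling), the octets from the example quoted above
decode differently under each code page, and octets that are not valid in
the chosen encoding need some visible stand-in. The `backslashreplace`
error handler used here is just one possible choice, not something
proposed in this thread:</p>
<pre>
```python
# Octets from the EBCDIC example quoted earlier in this thread.
name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])

# The same bytes read differently depending on the assumed code page:
# the thread gives "file" + sharp s for CP37 and "file~" for CP273.
as_cp037 = name.decode("cp037")
as_cp273 = name.decode("cp273")

# Octets that are invalid in the chosen encoding still have to be shown
# somehow. Here 0xFF is not valid UTF-8, so it is rendered as an escape.
# (Assumption: "backslashreplace" is only one of several possible policies.)
bad_utf8 = b"file\xff"
shown = bad_utf8.decode("utf-8", errors="backslashreplace")

print(as_cp037, as_cp273, shown)
```
</pre>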
<blockquote type="cite"
cite="mid:CA+Om+Sg7AtK5xB1tFQ4s9MQ9yS1jr8-BqLxoRmmCV0_LNvis7w@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>The only hack I can think of is: assume an encoding using
some platform-dependent heuristic (locale, etc.) and round trip
the filename through UTF-8 and back; if the result is not bytewise</div>
<div>identical, base64 encode it and add a _readable key?</div>
</div>
</div>
</blockquote>
<p>Exactly (whether the fallback is base64, raw8, or raw16 isn't
significant). This is the approach I was trying to describe; you
did a better job of doing so :)<br>
</p>
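<p>To make the approach concrete, here is a minimal Python sketch under
stated assumptions. The function names `encode_filename` and
`decode_filename` and the keys `name`, `encoding`, and `raw_base64` are
hypothetical and are not taken from any schema in this thread; only the
`_readable` key was mentioned there. Python codec names stand in for the
implementation-defined encoding names:</p>
<pre>
```python
import base64
import json


def encode_filename(raw, encoding):
    """Producer side: build a JSON-ready entry for a file name given as bytes.

    Hypothetical layout: "name" (plus "encoding" when it is not UTF-8) if the
    name round-trips losslessly, otherwise a base64 fallback with a
    "_readable" hint for humans.
    """
    try:
        text = raw.decode(encoding)
        # Only use the string form if transcoding back yields the same bytes.
        if text.encode(encoding) == raw:
            entry = {"name": text}
            if encoding.lower() not in ("utf-8", "utf8"):
                entry["encoding"] = encoding
            return entry
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass
    return {
        "raw_base64": base64.b64encode(raw).decode("ascii"),
        "_readable": raw.decode(encoding, errors="replace"),
    }


def decode_filename(entry):
    """Consumer side: recover the physical file name bytes from an entry."""
    if "raw_base64" in entry:
        return base64.b64decode(entry["raw_base64"])
    return entry["name"].encode(entry.get("encoding", "utf-8"))


ebcdic = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
ok = encode_filename(ebcdic, "cp037")             # round-trips, stored as text
fallback = encode_filename(b"file\xff", "utf-8")  # invalid UTF-8, stored raw
doc = json.dumps([ok, fallback], ensure_ascii=False)
print(doc)
```
</pre>
<p>Recording the encoding next to the name is what lets a consumer running
under a different locale still reverse the transcoding.</p>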
<p>Tom.<br>
</p>
<blockquote type="cite"
cite="mid:CA+Om+Sg7AtK5xB1tFQ4s9MQ9yS1jr8-BqLxoRmmCV0_LNvis7w@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Tom.<br>
<br>
><br>
> --Ben<br>
<br>
<br>
_______________________________________________<br>
Modules mailing list<br>
<a href="mailto:Modules@lists.isocpp.org" target="_blank"
moz-do-not-send="true">Modules@lists.isocpp.org</a><br>
Subscription: <a
href="http://lists.isocpp.org/mailman/listinfo.cgi/modules"
rel="noreferrer" target="_blank" moz-do-not-send="true">http://lists.isocpp.org/mailman/listinfo.cgi/modules</a><br>
Link to this post: <a
href="http://lists.isocpp.org/modules/2019/03/0204.php"
rel="noreferrer" target="_blank" moz-do-not-send="true">http://lists.isocpp.org/modules/2019/03/0204.php</a><br>
</blockquote>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
Modules mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Modules@lists.isocpp.org">Modules@lists.isocpp.org</a>
Subscription: <a class="moz-txt-link-freetext" href="http://lists.isocpp.org/mailman/listinfo.cgi/modules">http://lists.isocpp.org/mailman/listinfo.cgi/modules</a>
Link to this post: <a class="moz-txt-link-freetext" href="http://lists.isocpp.org/modules/2019/03/0210.php">http://lists.isocpp.org/modules/2019/03/0210.php</a>
</pre>
</blockquote>
</body>
</html>