<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 7 Mar 2019 at 16:59 Tom Honermann <<a href="mailto:tom@honermann.net">tom@honermann.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 3/7/19 10:30 AM, Ben Boeckel wrote:<br>
> On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom Honermann wrote:<br>
>> I don't know of any that use 32-bit code units for file names.<br>
>><br>
>> I find myself thinking (as I so often do these days, much to the surprise<br>
>> of my past self): how do EBCDIC and z/OS fit in here? If we stick to<br>
>> JSON and require the dependency file to be UTF-8 encoded, would all file<br>
>> names in these files be raw8 encoded and effectively unreadable (by<br>
>> humans) on z/OS? Perhaps we could allow more flexibility, but doing so<br>
>> necessarily invites locales into the discussion (for those who are<br>
>> unaware, EBCDIC has code pages too). For example, we could require that<br>
>> the selected locale match between the producers and consumers of the<br>
>> file (UB if they don't) and permit use of the string representation by<br>
>> transcoding from the locale-interpreted physical file name to UTF-8, but<br>
>> only if reverse-transcoding produces the same physical file name;<br>
>> otherwise the appropriate raw format must be used.<br>
> I first tried saying "treat these strings as if they were byte arrays"<br>
> with allowances for escaping `"` and `\`, but there was pushback on the<br>
> previous thread about it. This basically makes a new dialect of JSON,<br>
> one which existing implementations (usually) treat as an error. It would<br>
> mean that tools end up implementing their own JSON parsers (or even writers)…<br>
<br>
This isn't what I was suggesting. Rather, I was suggesting that <br>
standard UTF-8 encoded JSON be used, but that, on platforms where the <br>
interpretation of the file name may differ based on locale settings, <br>
if the file name can be losslessly round-tripped to UTF-8 and <br>
back, the UTF-8 encoding of it (transcoded from the locale) be used <br>
in the JSON file as a (well-formed) UTF-8/JSON string, even though that <br>
name wouldn't reflect the exact code units of the file name.<br>
<br>
For example, consider a file name consisting of the octets { 0x86, 0x89, <br>
0x93, 0x85, 0x59 }. In EBCDIC code page 37, this denotes the file name <br>
"fileß", but in EBCDIC code page 273 it denotes "file~". The suggestion <br>
then is, when generating the JSON file, if the current locale setting is <br>
CP37, to use the UTF-8 encoded name "fileß" as a normal JSON string. <br>
Tools consuming the file would then have to transcode the provided UTF-8 <br>
name back to the original locale encoding to open the file.<br>
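<br>
(A minimal sketch of that round-trip test, in Python, with a hypothetical helper name and the octets above; the idea, not this particular code, is what's being suggested:)<br>
<pre>
# Sketch only: decode the raw name using the generator's locale encoding,
# re-encode it, and use the decoded string as a normal UTF-8 JSON string only
# if the original octets survive the round trip; otherwise fall back to a raw
# representation (raw8, etc.).
def utf8_name_or_none(raw_name: bytes, locale_encoding: str):
    try:
        text = raw_name.decode(locale_encoding)      # locale-interpreted name
    except UnicodeDecodeError:
        return None                                  # not representable: use raw8 instead
    if text.encode(locale_encoding) != raw_name:     # require a lossless round trip
        return None
    return text                                      # safe to store as a UTF-8 JSON string

name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
print(utf8_name_or_none(name, "cp037"))              # -> fileß
print(utf8_name_or_none(name, "cp273"))              # -> file~
</pre>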
<br>
Previously, I had suggested that the locales must match for the producer <br>
and consumer and that it be UB otherwise (effectively leading to file <br>
not found errors). However, I think it would be better to store the <br>
encoding used to interpret the file name at generation time (if it isn't <br>
UTF-8) in the file to allow tools to accurately reverse the UTF-8 <br>
encoding. The supported encodings and the spelling of their names <br>
would, of course, be implementation/platform defined.<br>
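<br>
(For illustration only, a hypothetical way to record that, with invented key names and an arbitrary spelling of the encoding name:)<br>
<pre>
import json

# Hypothetical sketch: record the generation-time encoding next to the
# transcoded name so a consumer can map the UTF-8 string back to the original
# locale encoding. The key names and the "IBM-037" spelling are invented here.
entry = {
    "source-path": "fileß",              # UTF-8 string transcoded from the CP37 octets
    "source-path-encoding": "IBM-037",   # implementation/platform-defined spelling
}
print(json.dumps(entry, ensure_ascii=False))
</pre>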
<br>
><br>
> Note that if you'd like to have a readable filename, adding it as a<br>
> `_readable` key with a human-readable UTF-8 transcoding of the filename<br>
> would be supported (see my message with the JSON schema bits from<br>
> yesterday).<br>
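<br>
(A purely hypothetical sketch of such an entry; the key names and the raw byte-array form below are invented, the actual shape is whatever the schema referenced above says:)<br>
<pre>
# Hypothetical only: pair the raw octets with a best-effort human-readable
# transcoding; the real key names are the ones in the schema referenced above.
entry = {
    "format": "raw8",                          # invented spelling of the raw byte form
    "data": [0x86, 0x89, 0x93, 0x85, 0x59],    # the octets from the example above
    "_readable": "fileß",                      # human-readable UTF-8 transcoding
}
</pre>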
<br>
That seems reasonable to me for file names that really can't be <br>
represented as UTF-8, but seems like noise otherwise. In other words, I <br>
think we should try to minimize use of raw8, raw16, etc... where possible.<br></blockquote><div><br></div><div>Didn't we realize that we can't know the encoding of a filename, and so we cannot reliably decode it,</div><div>even less in a round trip safe way and that as such filenames can't be anything but bags of bytes?</div><div>At least, on some platforms?</div><div><br></div><div>The only hack I can think of is: assume an encoding with some platform dependent heuristic (locale, etc), round trip the filename through utf-8 and back if it's not bytewise</div><div>identical, base64 encode it and add a _readable key?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Tom.<br>
<br>
><br>
> --Ben<br>
<br>
<br>
</blockquote></div></div>