<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 3/7/19 3:00 PM, Richard Smith via
Modules wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAGL0aWffi9_hDz3brL=Q6MLwwu9+BgVqA9-gdWXHf7kk7mowFw@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
        <div dir="ltr">On Thu, Mar 7, 2019 at 10:46 AM Corentin &lt;<a
            href="mailto:corentin.jabot@gmail.com"
            moz-do-not-send="true">corentin.jabot@gmail.com</a>&gt;
          wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
              <div dir="ltr" class="gmail_attr">On Thu, 7 Mar 2019 at
                16:59 Tom Honermann &lt;<a
                  href="mailto:tom@honermann.net" target="_blank"
                  moz-do-not-send="true">tom@honermann.net</a>&gt;
                wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">On 3/7/19 10:30 AM,
Ben Boeckel wrote:<br>
> On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom
Honermann wrote:<br>
>> I don't know of any that use 32-bit code
units for file names.<br>
>><br>
                >> I find myself thinking (as I so often do
                these days much to the surprise<br>
                >> of my past self), how do EBCDIC and z/OS
                fit in here?  If we stick to<br>
>> JSON and require the dependency file to be
UTF-8 encoded, would all file<br>
>> names in these files be raw8 encoded and
effectively unreadable (by<br>
>> humans) on z/OS? Perhaps we could allow more
flexibility, but doing so<br>
>> necessarily invites locales into the
discussion (for those that are<br>
>> unaware, EBCDIC has code pages too). For
example, we could require that<br>
>> the selected locale match between the
producers and consumers of the<br>
>> file (UB if they don't) and permit use of the
string representation by<br>
>> transcoding from the locale interpreted
physical file name to UTF-8, but<br>
>> only if reverse-transcoding produces the same
physical file name,<br>
>> otherwise the appropriate raw format must be
used.<br>
> I first tried saying "treat these strings as if
they were byte arrays"<br>
> with allowances for escaping `"` and `\`, but
there was pushback on the<br>
> previous thread about it. This basically makes a
new dialect of JSON<br>
> which is (usually) an error in existing
implementations. It would mean<br>
> that tools are implementing their own JSON
parsers (or even writers)…<br>
<br>
                This isn't what I was suggesting.  Rather, I was
                suggesting that <br>
                standard UTF-8 encoded JSON be used.  On
                platforms where the <br>
                interpretation of a file name may differ based on
                locale settings, <br>
                a file name that can be losslessly round-tripped
                to UTF-8 and <br>
                back would be stored as its UTF-8 encoding (transcoded from
                the locale), <br>
                as a (well-formed) UTF-8/JSON string,
                even though that <br>
                string wouldn't reflect the exact code units of the file
                name.<br>
<br>
For example, consider a file name consisting of the
octets { 0x86, 0x89, <br>
0x93, 0x85, 0x59 }. In EBCDIC code page 37, this
denotes a file name <br>
"fileß", but in EBCDIC code page 273 denotes "file~".
The suggestion <br>
then is, when generating the JSON file, if the current
locale setting is <br>
CP37, to use the UTF-8 encoded name "fileß" as a
normal JSON string. <br>
Tools consuming the file would then have to transcode
the UTF-8 provided <br>
name back to the original locale to open the file.<br>
<br>
Previously, I had suggested that the locales must
match for the producer <br>
and consumer and that it be UB otherwise (effectively
leading to file <br>
not found errors). However, I think it would be
better to store the <br>
encoding used to interpret the file name at generation
time (if it isn't <br>
UTF-8) in the file to allow tools to accurately
reverse the UTF-8 <br>
encoding. The supported encodings and the spelling of
their names <br>
would, of course, be implementation/platform defined.<br>
<br>
><br>
> Note that if you'd like to have a readable
filename, adding it as a<br>
> `_readable` key with a human-readable utf-8
transcoding to the filename<br>
> would be supported (see my message with the JSON
schema bits from<br>
> yesterday).<br>
<br>
That seems reasonable to me for file names that really
can't be <br>
represented as UTF-8, but seems like noise otherwise.
In other words, I <br>
think we should try to minimize use of raw8, raw16,
etc... where possible.<br>
</blockquote>
<div><br>
</div>
<div>Didn't we realize that we can't know the encoding
of a filename, and so we cannot reliably decode it,</div>
<div>even less in a round trip safe way and that as such
filenames can't be anything but bags of bytes?</div>
<div>At least, on some platforms?</div>
<div><br>
</div>
              <div>The only hack I can think of is: assume an encoding
                with some platform dependent heuristic (locale, etc) and
                round trip the filename through utf-8 and back; if the
                result is not bytewise identical, base64 encode it and
                add a _readable key?</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>As far as I'm aware (but someone please correct me if
z/OS or similar adds another wrinkle), there are exactly
three cases we need to deal with:</div>
<div><br>
</div>
<div>1) Platform paths are Unicode, encoded in UTF-8 in a
specific normalization form. The OS normalizes, possibly
case-folds, and rejects invalid encodings. (eg, Mac OS)</div>
<div>2) Platform paths are arbitrary sequences of 8-bit
values, with some reserved patterns (eg, no embedded nul
bytes), and no guaranteed intrinsic meaning or encoding.
There may be a platform convention for encoding, but it is
not enforced. (eg, Linux)</div>
<div>3) Platform paths are arbitrary sequences of 16-bit
values, with some reserved patterns (eg, no embedded nul
values, some reserved characters), and no guaranteed
intrinsic meaning or encoding. There may be a platform
convention for encoding, but it is not enforced. (eg,
Windows)</div>
<div><br>
</div>
<div>Case 1 is easy: paths are UTF-8, so we can represent them
as Unicode strings.</div>
<div>Case 3 is mostly easy: paths are by strong convention
UTF-16, so we can represent them as Unicode strings when
they are valid, and fall back to raw16 in the very rare
remaining cases.</div>
<div>Case 2 is trickier: while many are using UTF-8 as their
convention for file name encoding, it is not a
universally-adopted convention. Nonetheless, I propose we do
the same as in Case 3: if the file name happens to be a
valid UTF-8 encoding, assume that it is in fact UTF-8 and
present the path as a Unicode string. Otherwise, fall back
to raw8.</div>
<div><br>
</div>
<div>Does that work well enough in practice? (Remember that
these files are primarily for communication between tools,
not for humans to read.)</div>
</div>
</div>
</blockquote>
    <p>I think the above works well for ASCII-based platforms in most
      cases, though with some surprising or unfortunate results for
      Shift-JIS and GB18030 users.<br>
</p>
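    <p>To make the Shift-JIS concern concrete, here is a rough sketch
      (Python is used purely for illustration, and the two-byte file
      name is a made-up example). It assumes the "valid UTF-8 means
      UTF-8" rule proposed above and shows a name that is valid in both
      encodings but reads differently in each:</p>
<pre>
```python
# A hypothetical file name: two half-width katakana stored by a user whose
# platform convention is Shift-JIS.
raw_name = bytes([0xC3, 0xA9])

# What the user intended, assuming the Shift-JIS convention.
intended = raw_name.decode("shift_jis")

# What the "valid UTF-8 means UTF-8" rule would write into the JSON file.
presented = raw_name.decode("utf-8")

# A consumer that encodes the JSON string back to UTF-8 still recovers the
# original bytes, so the file can be opened; only the displayed name is off.
recovered = presented.encode("utf-8")

print(repr(intended), repr(presented), recovered == raw_name)
```
</pre>
    <p>So the tool-to-tool round trip survives in this case; the
      surprise is limited to what a human sees in the file.</p>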
<p>The point about this format being more for tools than humans is
well taken. Though I'm bringing up the topic in this thread, I'm
more concerned about it for module map file formats (which we
haven't discussed yet).<br>
</p>
    <p>The wrinkle with z/OS is that, while it fits the case 2 model, no
      file names will be (intended to be) UTF-8 encoded, so nearly every
      file name would end up represented with the raw8 format. In theory, an
      EBCDIC encoded file name can have code units that happen to form a
      valid UTF-8 sequence (which would result in a file name in the JSON
      file that doesn't look at all like the intended file name), but since
      all of the non-accented alphanumeric characters in EBCDIC have
      values above 0x7F, the chance of forming a valid UTF-8 sequence is
      low.</p>
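    <p>As a rough illustration of both halves of that claim (Python
      again, only as a sketch): the <code>classify</code> helper below
      is hypothetical and simply applies the case 2 rule, while
      <code>cp037</code> and <code>cp273</code> are the codec names
      Python happens to use for those EBCDIC code pages. The "Cz" name
      is a contrived example of the unlucky case.</p>
<pre>
```python
def classify(name_bytes):
    """Apply the case 2 rule: a string if valid UTF-8, otherwise raw8."""
    try:
        return ("string", name_bytes.decode("utf-8"))
    except UnicodeDecodeError:
        return ("raw8", list(name_bytes))

# The octets from my earlier example, read under two EBCDIC code pages.
ebcdic_name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
as_cp037 = ebcdic_name.decode("cp037")
as_cp273 = ebcdic_name.decode("cp273")

# Typical case: 0x86 is a UTF-8 continuation byte with no lead byte before
# it, so the name is not valid UTF-8 and falls back to raw8.
typical = classify(ebcdic_name)

# Unlucky case: "Cz" in code page 37 is the byte pair C3 A9, which is also
# a well-formed two-byte UTF-8 sequence.
unlucky = classify("Cz".encode("cp037"))

print(as_cp037, as_cp273, typical, unlucky)
```
</pre>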
    <p>We could take the approach of emitting both a display name (with
      implementation-dependent QoI, essentially Ben's "_readable"
      suggestion) and a raw code unit sequence. But doing that well
      invites locales back into the picture, so perhaps just
      dealing with locales is the better path forward anyway.<br>
</p>
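    <p>For comparison, here is a minimal sketch of what "dealing with
      locales" could look like, following my earlier suggestion of
      recording the encoding in the file. The key names
      (<code>name</code>, <code>encoding</code>, <code>raw8</code>,
      <code>_readable</code>) and the two helper functions are
      placeholders for illustration, not a proposed schema, and the
      encoding names are whatever the implementation chooses to
      support:</p>
<pre>
```python
import json

def describe(name_bytes, encoding):
    """Producer side: use a string only if it round-trips through `encoding`."""
    try:
        text = name_bytes.decode(encoding)
        if text.encode(encoding) == name_bytes:
            return {"name": text, "encoding": encoding}
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass
    # Fall back to the raw code units plus a best-effort display name.
    readable = name_bytes.decode(encoding, errors="replace")
    return {"raw8": list(name_bytes), "_readable": readable}

def recover(entry):
    """Consumer side: rebuild the physical file name bytes from an entry."""
    if "raw8" in entry:
        return bytes(entry["raw8"])
    return entry["name"].encode(entry["encoding"])

name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
entry = describe(name, "cp037")
print(json.dumps(entry), recover(entry) == name)
```
</pre>
    <p>The consumer never consults its own locale here; it relies only
      on the encoding name stored next to the string.</p>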
<p>Tom.<br>
</p>
</body>
</html>