[SG16-Unicode] P1689: Encoding of filenames for interchange

Tony V E tvaneerd at gmail.com
Fri Sep 6 22:09:23 CEST 2019


On Fri, Sep 6, 2019 at 3:52 PM Thiago Macieira <thiago at macieira.org> wrote:

> On Friday, 6 September 2019 10:49:56 PDT Tony V E wrote:
> > First of all
> >
> > It seems Option 2b is a superset of Option 2a, and is just more work for
> > everyone, with no work saved. ie Windows still needs to support
> > single-bytes, but can also use dual-bytes.
> > Are we encouraging Windows tools to *only* use dual-bytes and not support
> > single-bytes (ie not have full support)?  What's the benefit of 2b?
> > Can we narrow our choices by agreeing 2b isn't worthwhile?
>
> Indeed, it's a superset that spreads the pain by making everyone have to
> implement conversions, for the benefit of the case where a _WIN32 tool
> produces a file that is read by another _WIN32 tool: then it can do pass-
> through.
>
> > Now, overall, if I understand the discussion correctly:
> >
> > - if you encode the raw bytes (narrow or wide), you should add the
> > encoding as well (e.g. "EBCDIC", etc).
> > This implies every tool needs to support (and translate) every encoding,
> > or accept that we will have non-interoperable tools, platform specific
> > tools.
> > Also, is the set of encodings finite, or can I add the "TONY" encoding?
>
> There's no need to indicate which encoding was used because the options 2
> encode the raw bytes that are used with the filesystem API. The data is an
> opaque bag of bits.
>

but it is only valid if you use those bits with the same API and encoding
that they came from (if you don't know the encoding).
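
To make that concrete, here is a minimal sketch in Python (purely
illustrative, not from P1689; the file name, the choice of Latin-1 vs UTF-8,
and the helper `interpret` are all assumptions made up for the example). The
same opaque bytes either fail or silently turn into a different name once
they are read under an encoding other than the one they came from:

```python
def interpret(raw_name: bytes, encoding: str):
    """Decode raw file-name bytes under a guessed encoding; None if invalid."""
    try:
        return raw_name.decode(encoding)
    except UnicodeDecodeError:
        return None

NAME = "r\u00e9sum\u00e9.cpp"  # "résumé.cpp", a hypothetical source file

# Raw bytes as a (hypothetical) Latin-1 narrow API would hand them back.
latin1_bytes = NAME.encode("latin-1")
# Raw bytes for the same name from a (hypothetical) UTF-8 narrow API.
utf8_bytes = NAME.encode("utf-8")

# Same encoding on both ends: the bytes mean what the producer meant.
same_encoding = interpret(latin1_bytes, "latin-1")
# Consumer assumes UTF-8: the Latin-1 bytes are not even valid UTF-8.
wrong_guess_fails = interpret(latin1_bytes, "utf-8")
# Consumer assumes Latin-1: the UTF-8 bytes decode, but to another name.
wrong_guess_garbles = interpret(utf8_bytes, "latin-1")
```

In the last case nothing reports an error, which is the worrying part: the
consumer just ends up looking for a file that does not exist.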


> If you want to *display* that to the user, then converting to text is
> necessary. But all the tools that display file names have such
> functionality, since they already deal with file names obtained from the
> FS API.
>
> > - if you encode the raw bytes, there might still be cases not covered,
> > might need to fall back to UTF8.  It sounds like *no* answer will be
> > guaranteed to work.
>
> Which case could there be that the raw bytes fail but UTF-8 supports? I
> would think it's the other way around.
>

the case where the encoding changed.  Or the raw bytes are being used with
the wrong FS API.
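
A rough sketch of the "wrong FS API" case, again in Python and again only an
illustration under assumptions (the helper `as_narrow_api_sees_it` is
hypothetical and merely models a NUL-terminated `char*` parameter; it is not
a real filesystem call). Raw bytes captured from a wide, 2-byte-unit API
contain zero bytes, so a narrow API would stop reading almost immediately:

```python
def as_narrow_api_sees_it(raw_name: bytes) -> bytes:
    """Model a NUL-terminated char* argument: bytes after the first NUL are lost."""
    return raw_name.split(b"\x00", 1)[0]

# Raw bytes as a wide (UTF-16, little-endian) API would store "main.cpp".
wide_bytes = "main.cpp".encode("utf-16-le")

# The same bytes handed, unconverted, to a narrow API.
narrow_view = as_narrow_api_sees_it(wide_bytes)

# Only the original API (or a consumer that knows the encoding) recovers the name.
recovered = wide_bytes.decode("utf-16-le")
```

Here `narrow_view` is just the first byte of the name, while `recovered` is
the full name, so the raw bytes are only useful to a consumer that knows
which API they belong to.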


> > Are there systems where filenames *that developers use* can't be found
> > via UTF8?
>
> The problem is what happens when the locale isn't UTF-8, which is common
> enough when LC_ALL=C was set in the environment.
>
>
And how common is that (besides you :-)?


> But I repeat what I said: I am fine with Option 1 ("file names are text"),
> knowing that there are failure modes. This has been the case for Qt for
> two decades. We call those "filesystem corruption" and tell our users to
> go fix with a system tool.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
>    Software Architect - Intel System Software Products
>
>
>
>

-- 
Be seeing you,
Tony