[SG16-Unicode] P1689: Encoding of filenames for interchange

Thiago Macieira thiago at macieira.org
Sat Sep 7 00:23:12 CEST 2019


On Friday, 6 September 2019 15:01:39 PDT Tony V E wrote:
> > You're describing case (a), which again implies resolving the problem by
> > declaring the problem cases to be out of scope.
> 
> Well, I was imagining that the IDE kept or converted it in whatever format
> it wanted, but it read it in some native format, and has enough info about
> that native format to convert to UTF8.  *When* it actually does the
> conversion (ie when reading, or later when writing the SG15 file) doesn't
> matter (I think).

Those are two philosophies of what a file name is, and they match the two 
options of my OP:

1) file names are text, so I'll store them in my Unicode-capable class
2) file names are binary, so I'll store them in my byte array

The IDEs and text editors divide themselves into those two categories. You've 
assumed that only case 1 existed.

The failure modes differ too. In case 2, the IDE will fail to display the 
file name in graphical environments, since all the text shaping frameworks 
consume Unicode input. But it can display a placeholder indicating that the 
file name can't be shown, and the entry still exists in the program's 
memory.

In case 1, you can't even represent said file. The failure happened when 
listing the directory or reading from the socket, pipe or file that contained 
the encoded form.
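
A minimal sketch of the two failure modes, in Python and with made-up file 
names (the helper names load_as_text and display_label are hypothetical, not 
taken from any IDE). It assumes the OS hands back one name containing the 
byte 0xE9, which is not valid UTF-8 on its own:

```python
# Hypothetical directory listing, as raw bytes from the OS.
RAW_NAMES = [b"main.cpp", b"caf\xe9.cpp"]


def load_as_text(raw_names):
    """Case 1: names are text. Strict decoding fails while loading."""
    return [raw.decode("utf-8") for raw in raw_names]


def display_label(raw):
    """Case 2: names stay bytes; only the label shown on screen is lossy."""
    return raw.decode("utf-8", errors="replace")


# Case 2: every entry survives in memory, even the undisplayable one.
entries = list(RAW_NAMES)
labels = [display_label(raw) for raw in entries]

# Case 1: the listing itself cannot be represented.
try:
    load_as_text(RAW_NAMES)
    case1_failed = False
except UnicodeDecodeError:
    case1_failed = True
```

In case 2 the label carries U+FFFD as the placeholder, while the entry still 
holds the exact bytes needed to open the file.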

> > Why would you save it in UTF-8, knowing that the other tool that is going
> > to read could be under a different assumption of what codec to use?
> >
> > Why not instead save the same bag of bits that you received from the OS,
> > which you know the OS can use to refer back to the same file? The
> > environment has not changed during the run of the current application, so
> > it can perform back and forth translations from the bag of bits to the
> > internal representation, losslessly.
> 
> How do I know the environment hasn't changed when the other program (the
> reading one) runs?  The SG15 was written by one program, then _later_ read
> by another.

That's not what I meant. I meant that the environment hasn't changed within 
the same run of the process (at least, usually). I meant that if the 
conversion from "bag of bits" to Unicode text worked once, I can convert back 
and forth between them without loss.
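
To illustrate, a small Python sketch, assuming the process decodes file names 
with ISO-8859-1 for its whole lifetime (the codec choice and the file name 
are illustrative assumptions). As long as the same codec is used in both 
directions, the original bytes come back:

```python
# Assumption: the codec this process uses for file names, fixed for the run.
PROCESS_CODEC = "iso-8859-1"


def to_text(raw: bytes) -> str:
    # Strict decode: raises if the bytes are not valid in PROCESS_CODEC.
    return raw.decode(PROCESS_CODEC)


def to_bytes(text: str) -> bytes:
    # Inverse direction, using the same codec as to_text.
    return text.encode(PROCESS_CODEC)


raw_name = b"caf\xe9.cpp"            # bag of bits as received from the OS
text_name = to_text(raw_name)        # internal Unicode representation
round_tripped = to_bytes(text_name)  # what gets handed back to the OS
```

This relies on the codec having a single byte spelling per character; a 
codec with several byte sequences for one character would need a closer look.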

> Are these two programs even on the same OS, or do they just have access to
> the same files?

This is a case of declaring that there is no problem: we excluded networking 
from the scope. We probably exclude removable storage media too. If nothing 
else, the mount points or drive letters may change.

> > No. This is the failure mode: if the file name was stored in UTF-8 and I
> > don't know what the source used to decode the bag of bits to Unicode, I
> > can't be sure to reproduce the same bag of bits.
> 
> If I have the filename in unicode, and the original filename was
> unicode-able, do I need the same bag of bits, or does every OS have an API
> for "find this file, here's the unicode name"?

You need the same bag of bits. There's no OS that has "find this file by the 
Unicode name" (excepting the case where the bag of bits and the Unicode name 
are one and the same, of course).
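
A hedged Python sketch of that failure mode, assuming a writer running in an 
ISO-8859-1 environment and a reader running in a UTF-8 one (both environments 
and the file name are made up for illustration). The interchange file only 
carries the UTF-8 text, so the reader has to guess which codec turns it back 
into bytes:

```python
on_disk = b"caf\xe9.cpp"                    # the name as the OS stores it

# Writer: decodes with its own environment's codec, then saves UTF-8.
writer_text = on_disk.decode("iso-8859-1")
interchange = writer_text.encode("utf-8")   # what lands in the shared file

# Reader: recovers the text, then encodes with *its* environment's codec.
reader_text = interchange.decode("utf-8")
reader_guess = reader_text.encode("utf-8")  # reader assumes a UTF-8 locale

# Only a reader that knows the writer's codec reproduces the original bytes.
informed_reader = reader_text.encode("iso-8859-1")
```

Here reader_guess differs from on_disk, so handing it to the OS would not 
find the file; informed_reader matches, but only because the writer's codec 
was known.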

> > At which step(s) can things go wrong?
> > 
> > All of them, starting from the delineation of the problem space.
> 
> Yes, I'm wondering if we can make the problem space smaller, since
> developers and tools have lots of control over the filenames they use.

Yes, we can. That's Option 1: there is almost[*] no problem if you set your 
system up correctly, so any failures come down to filesystem corruption 
and/or an incorrect environment setup.

Qt has been doing that for 20 years, since Qt 2.0 introduced the Unicode-
capable QString.

[*] The only remaining issue is the perfectly valid case of setting LC_ALL=C 
in the environment for reading other tools' output. I would recommend just 
ignoring that.
-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products




