<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Sep 7, 2019, 12:23 AM Thiago Macieira <<a href="mailto:thiago@macieira.org">thiago@macieira.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Friday, 6 September 2019 15:01:39 PDT Tony V E wrote:<br>
> > You're describing case (a), which again implies resolving the problem by<br>
> > declaring the problem cases to be out of scope.<br>
> <br>
> Well, I was imagining that the IDE kept or converted it in whatever format<br>
> it wanted, but it read it in some native format, and has enough info about<br>
> that native format to convert to UTF-8. *When* it actually does the<br>
> conversion (ie when reading, or later when writing the SG15 file) doesn't<br>
> matter (I think).<br>
<br>
That's two philosophies of what a file name is, which matches the two options <br>
of my OP:<br>
<br>
1) file names are text, so I'll store them in my Unicode-capable class<br>
2) file names are binary, so I'll store them in my byte array<br>
<br>
The IDEs and text editors divide themselves into those two categories. You've <br>
assumed that only case 1 existed.<br>
<br>
The failure modes differ too. In case 2, the IDEs will fail to display the <br>
file name in graphical environments, since all the text shaping frameworks <br>
consume Unicode input. The IDE can still display a placeholder indicating <br>
that the file name can't be shown, and the entries in program memory <br>
still exist.<br>
<br>
In case 1, you can't even represent said file. The failure happened when <br>
listing the directory or reading from the socket, pipe or file that contained <br>
the encoded form.<br>
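<br>
A minimal sketch of the two failure modes, assuming Python and a POSIX-style <br>
byte-string file name. The helper names load_as_text and load_as_bytes are <br>
hypothetical and not taken from any IDE:<br>

```python
# Hypothetical helpers: two ways a tool might hold a file name it read from the OS.

RAW_NAME = b"caf\xe9.txt"  # Latin-1 bytes; not a valid UTF-8 sequence


def load_as_text(raw):
    """Case 1: file names are text. Strict decoding rejects the name outright."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # the file cannot be represented in the text-only model


def load_as_bytes(raw):
    """Case 2: file names are binary. Keep the bytes; decode only for display."""
    # U+FFFD stands in for each undecodable byte in the label shown to the user.
    display = raw.decode("utf-8", errors="replace")
    return raw, display


text_entry = load_as_text(RAW_NAME)
raw_entry, label = load_as_bytes(RAW_NAME)

print(text_entry)   # None: the entry is lost at load time
print(raw_entry)    # the original bytes are still available to open the file
print(label)        # display label with a placeholder character
```

In the byte-array model only the label degrades; the stored name can still be <br>
handed back to the OS unchanged.<br>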
<br>
> > Why would you save it in UTF-8, knowing that the other tool that is going<br>
> > to<br>
> > read could be under a different assumption of what codec to use?<br>
> > <br>
> > Why not instead save the same bag of bits that you received from the OS,<br>
> > which<br>
> > you know the OS can use to refer back to the same file? The environment<br>
> > has<br>
> > not changed during the run of the current application, so it can perform<br>
> > back<br>
> > and forth translations from the bag of bits to the internal<br>
> > representation,<br>
> > losslessly.<br>
> <br>
> How do I know the environment hasn't changed when the other program (the<br>
> reading one) runs? The SG15 was written by one program, then _later_ read<br>
> by another.<br>
<br>
That's not what I meant. I meant that the environment hasn't changed within <br>
the same run of the process (at least, usually). I meant that if the <br>
conversion from "bag of bits" to Unicode text worked once, I can convert back <br>
and forth between them without loss.<br>
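<br>
To illustrate, here is a rough sketch in Python, assuming the writer's <br>
environment used Latin-1 and the reader's uses UTF-8. The round_trip helper <br>
is hypothetical. The last line uses Python's surrogateescape error handler, <br>
which is what os.fsdecode relies on under POSIX, as one way to carry <br>
undecodable bytes through a text type:<br>

```python
def round_trip(raw, decode_with, encode_with):
    """Decode a raw file name with one codec, then re-encode with another."""
    return raw.decode(decode_with).encode(encode_with)


raw = b"caf\xe9.txt"  # the bag of bits the OS handed over (assumed Latin-1)

# Same process, same environment: the conversion is lossless in both directions.
same_process = round_trip(raw, "latin-1", "latin-1")

# A later tool that assumes a different codec produces different bytes.
other_tool = round_trip(raw, "latin-1", "utf-8")

# surrogateescape smuggles undecodable bytes through str and back unchanged.
escaped = raw.decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")

print(same_process == raw)  # True
print(other_tool == raw)    # False
print(escaped == raw)       # True
```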
<br>
> Are these two programs even on the same OS, or do they just have access to<br>
> the same files?<br>
<br>
This is a case of declaring that there is no problem: we excluded networking <br>
from the scope. We probably exclude removable storage media too. If nothing <br>
else, the mount points or drive letters may change.<br>
<br>
> > No. This is the failure mode: if the file name was stored in UTF-8 and I<br>
> > don't<br>
> > know what the source used to decode the bag of bits to Unicode, I can't be<br>
> > sure to reproduce the same bag of bits.<br>
> <br>
> If I have the filename in Unicode, and the original filename was<br>
> Unicode-able, do I need the same bag of bits, or does every OS have an API<br>
> for "find this file, here's the Unicode name"?<br>
<br>
You need the same bag of bits. There's no OS that has "find this file by the <br>
Unicode name" (excepting the case where the bag of bits and the Unicode name <br>
are one and the same, of course).<br>
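<br>
One way to see why the displayed name alone may not be enough, sketched in <br>
Python and assuming a file system that compares names byte for byte: the same <br>
visible name can be encoded as two different UTF-8 byte sequences, depending <br>
on the Unicode normalization form.<br>

```python
import unicodedata

name = "caf\u00e9.txt"  # shown to the user as a single accented name

# Precomposed form: U+00E9 is encoded as two bytes.
nfc = unicodedata.normalize("NFC", name).encode("utf-8")

# Decomposed form: "e" followed by combining acute accent U+0301.
nfd = unicodedata.normalize("NFD", name).encode("utf-8")

print(nfc)         # b'caf\xc3\xa9.txt'
print(nfd)         # b'cafe\xcc\x81.txt'
print(nfc == nfd)  # False: a byte-wise lookup treats these as different files
```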
<br>
> > At which step(s) can things go wrong?<br>
> > <br>
> > All of them, starting from the delineation of the problem space.<br>
> <br>
> Yes, I'm wondering if we can make the problem space smaller, since<br>
> developers and tools have lots of control over the filenames they use.<br>
<br>
Yes, we can. That's Option 1: there is almost[*] no problem if you set your <br>
system up correctly, so any failures are due to file system corruption and/or <br>
an incorrect environment setup.<br>
<br>
Qt has been doing that for 20 years, since Qt 2.0 introduced the<br>
Unicode-capable QString.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">And it works great!</div><div dir="auto"><br></div><div dir="auto">I want to reiterate that while there is value in the C++ standard library being as broad and generic as possible, the opposite is true for the tooling ecosystem.</div><div dir="auto"><br></div><div dir="auto">The only way to have a reliable ecosystem is to find a simple, sane, easy-to-support, useful set of features.</div><div dir="auto"><br></div><div dir="auto">Supporting non-displayable characters in build tools has no value. For anyone. "Someone might do that" is the reason we don't have nice things.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
[*] The only remaining issue is the perfectly valid case of setting LC_ALL=C <br>
in the environment for reading other tools' output. I would recommend just <br>
ignoring that.<br>
-- <br>
Thiago Macieira - thiago (AT) <a href="http://macieira.info" rel="noreferrer noreferrer" target="_blank">macieira.info</a> - thiago (AT) <a href="http://kde.org" rel="noreferrer noreferrer" target="_blank">kde.org</a><br>
Software Architect - Intel System Software Products<br>
<br>
<br>
<br>
_______________________________________________<br>
SG16 Unicode mailing list<br>
<a href="mailto:Unicode@isocpp.open-std.org" target="_blank" rel="noreferrer">Unicode@isocpp.open-std.org</a><br>
<a href="http://www.open-std.org/mailman/listinfo/unicode" rel="noreferrer noreferrer" target="_blank">http://www.open-std.org/mailman/listinfo/unicode</a><br>
</blockquote></div></div></div>