[SG16-Unicode] P1689: Encoding of filenames for interchange

Thiago Macieira thiago at macieira.org
Sat Sep 7 00:03:06 CEST 2019


On Friday, 6 September 2019 14:09:44 PDT Tony V E wrote:
> > The encoding is only needed when converting raw bytes to text. Since
> > there's no conversion, the raw bytes are passed through unmodified from
> > payload to filesystem API and from filesystem API to the payload.
> 
> If I know which API it was from, and have it available to me.  And the
> filesystem encoding hasn't changed since then.  Niall gives me the
> impression that can change. (Or is that only the display encoding that can
> change?)

The *native* API that you have access to. On Unix systems, that's the POSIX 
API: open(), opendir(), readdir(), etc. On Windows, that's the Win32 API 
(CreateFileW, FindNextFileW, etc.). I don't know whether the Windows kernel 
API is relevant.

The filesystem encoding never changes, since the bytes-on-disk that the FS 
used to store the name don't. What changes is how you interpret those bytes. 
And unfortunately, on Windows, the POSIX and C library APIs are emulation 
layers, and their interpretation can indeed change. That's why I am saying 
that Windows applications must not use the C and C++ standard API. 
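
To make the "same bytes, different interpretation" point concrete, here is a 
small Python sketch. The file name and the two codecs are picked purely for 
illustration; nothing here comes from a real system.

```python
def interpretations(raw: bytes) -> dict:
    """Decode the same stored bytes under two different assumptions.

    The bytes never change; only the text obtained from them does.
    """
    return {
        "latin-1": raw.decode("latin-1"),
        # Invalid UTF-8 sequences become U+FFFD, so this view is lossy.
        "utf-8": raw.decode("utf-8", errors="replace"),
    }


# Hypothetical name as stored by the FS: "résumé.txt" in Latin-1.
raw_name = b"r\xe9sum\xe9.txt"
views = interpretations(raw_name)
print(views["latin-1"])
print(views["utf-8"])
```

A tool that only ever hands raw_name back to the native API never has to pick 
between those two views.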

std::filesystem muddies the waters a little bit because it can call the native 
API on Windows and bypass the emulation layer. But having never used it (at 
all, ever), I simply can't offer an opinion on whether it can be used or how 
it can be safely used. Until someone provides an authoritative explanation, 
the ISO C++ paper will have to say "don't use the ISO C++ API".

> I know you listed all the rules for many scenarios (on Linux do..., on MS
> do...) but it seems a bit precarious to me.  What happens when a new FS API
> comes around, or some other OS, EBCDIC, etc?

Fair question.

> How portable do we want/need these interchange files to be?

We need it to be portable to other applications running on the same OS and we 
need a locale-independent method of transforming from the payload format to 
the FS API. On Unix, that's the identity transform. For Windows, it's the 
CESU-8 encoding of the 16-bit wchar_t string.
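
As an illustration only, the Windows side of that transform could be sketched 
in Python as below. The function names are made up for this example, and the 
decoder skips checks a real tool would want (it does not reject overlong 
forms, for instance). On Unix there is nothing to write, since the bytes pass 
through untouched.

```python
def cesu8_encode(units):
    """Encode 16-bit code units as CESU-8 bytes.

    Every unit is encoded on its own, so surrogates (paired or lone)
    each become a 3-byte sequence and are never combined.
    """
    out = bytearray()
    for u in units:
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit")
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)


def cesu8_decode(data):
    """Inverse of cesu8_encode: bytes back to a list of 16-bit code units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, u = 1, b
        elif b & 0xE0 == 0xC0:
            n, u = 2, b & 0x1F
        elif b & 0xF0 == 0xE0:
            n, u = 3, b & 0x0F
        else:
            raise ValueError("byte cannot start a 1- to 3-byte sequence")
        if i + n > len(data):
            raise ValueError("truncated sequence")
        for c in data[i + 1:i + n]:
            if c & 0xC0 != 0x80:
                raise ValueError("bad continuation byte")
            u = (u << 6) | (c & 0x3F)
        units.append(u)
        i += n
    return units


# U+1F600 is the surrogate pair D83D DE00 in UTF-16.
print(cesu8_encode([0xD83D, 0xDE00]).hex())
```

Because each unit is handled separately, a lone surrogate round-trips just as 
well as a proper pair, which is the property the payload needs.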

If you want a concession, here's one:

If the filename you obtained from the FS API was valid UTF of the width in 
question (UTF-8 on Unix, UTF-16 on Windows), then store it as a text string. 
Otherwise, store it as a byte array. Note how this only affects the producer. 
The consumer is still doing exactly what I outlined above: pass-through on 
Unix and CESU-8 decoding on Windows.
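
A rough Python sketch of what the producer side of that concession might look 
like. The function names and the ("text" / "bytes") tagging are assumptions 
made for the example, not part of any proposal. The fallback relies on 
Python's "surrogatepass" error handler to encode each surrogate on its own.

```python
import struct


def produce_unix(name: bytes):
    """Unix: text string if the bytes are valid UTF-8, else the raw bytes."""
    try:
        return ("text", name.decode("utf-8"))
    except UnicodeDecodeError:
        return ("bytes", name)


def produce_windows(units):
    """Windows: text string if the 16-bit units are well-formed UTF-16,
    else a byte array holding each unit encoded separately."""
    raw = struct.pack(f"<{len(units)}H", *units)
    try:
        # Strict decoding fails on unpaired surrogates.
        return ("text", raw.decode("utf-16-le"))
    except UnicodeDecodeError:
        per_unit = "".join(map(chr, units))
        return ("bytes", per_unit.encode("utf-8", "surrogatepass"))


print(produce_unix(b"caf\xc3\xa9.txt"))     # valid UTF-8
print(produce_unix(b"caf\xe9.txt"))         # Latin-1 byte, not valid UTF-8
print(produce_windows([0x61, 0xD800]))      # lone high surrogate
```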

I don't recommend this because the vast majority of file names *will* fall 
into this concession, meaning that 99%+ of the SG15 payload files created will 
use text strings. That means few tools will ever write the code for and test 
the corner cases. We get the #pragma once problem: it usually doesn't fail, 
but when it does, it's an unexpected failure, with little context, on the 
machine of a single person who wasn't the one who wrote the code that failed.

PS: the CBOR encoding difference between a text string and the byte array 
containing the UTF-8 encoding of that string is a single bit.
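
To see that bit, here is a minimal Python sketch of the two CBOR encodings. 
The helper names are invented for the example and it only covers 
definite-length strings. Byte strings are major type 2 and text strings are 
major type 3, so the initial bytes differ by 0x20.

```python
def cbor_head(major_type: int, length: int) -> bytes:
    """Initial byte(s) of a definite-length CBOR item."""
    top = major_type << 5
    if length < 24:
        return bytes([top | length])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if length < 1 << (8 * size):
            return bytes([top | info]) + length.to_bytes(size, "big")
    raise ValueError("length too large")


def cbor_bytes(data: bytes) -> bytes:
    return cbor_head(2, len(data)) + data      # major type 2: byte string


def cbor_text(text: str) -> bytes:
    utf8 = text.encode("utf-8")
    return cbor_head(3, len(utf8)) + utf8      # major type 3: text string


as_text = cbor_text("main.cpp")
as_bytes = cbor_bytes("main.cpp".encode("utf-8"))
# Count the bits that differ between the two encodings.
diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(as_text, as_bytes))
print(as_text.hex(), as_bytes.hex(), diff_bits)
```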

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products




