[SG16-Unicode] P1689: Encoding of filenames for interchange

Thiago Macieira thiago at macieira.org
Thu Sep 5 06:12:53 CEST 2019


Hello Ben, Brad

SG16 was asked to comment on P1689 and how file names can be encoded in a file 
format for exchange of information between different tools in a buildsystem. 
This is not SG16's official reply; it is my own opinion, which other members of 
SG16 asked me to write up after a discussion in our Slack channel.

This is outside of the current scope of the C++ standard, since the standard 
only admits that file names are a sequence of narrow characters not containing 
a NUL, but makes absolutely no determination of what those characters mean. If 
that were sufficient, you wouldn't have asked for an opinion, since all you'd 
need to do would be to encode those bytes somehow. We know that the standard 
is deficient in this area. Notably, it does not account for the fact that file 
names on Windows are actually stored on the file system and accessed in the 
low-level API using 16-bit wchar_t, instead of the 8-bit char of that platform.

Before I begin, let me say that this paper came as a surprise to those of us 
who are not familiar with SG15's workings. The paper describes a format, but 
does not explain what that format is for. Please revise this paper and try to 
answer some of these questions:

* what tools produce the file?
* what tools consume the file?
* how long is the file supposed to last? Is the file supposed to be committed 
   to version control?
* how far is the file supposed to be spread? That is, is networking in scope?
* what problem does this solve? Is it a new problem?
* what happens if we don't add this file?
* what other alternatives were considered, both as solutions and as file 
   formats?
* what happens if multiple tools do not agree on the view of the filesystem 
   (different root, different mountpoints, etc.)? How do you deal with this?

== Assumptions ==

Since the paper does not answer those questions above, I am making the 
following assumptions:

1) the file is an artifact of the build that is not meant to be committed to 
version control. Notably, this means that two builds of the same software are 
not supposed to share this file.

2) networking is not in scope. Distributed builds are considered an extension 
of the local system, so they don't count as networking. Distributed build 
tools need to emulate the file system of the originator.

3) different views of the filesystem are out of scope.

== File paths for interchange ==

I propose three options and I will let you choose which one you want. This 
section is only about the file paths ("payload") and is independent of the 
format of the file ("transport"), but I will make references to storing such 
payload in JSON.

The options are:
 - Option 1: file names are Unicode text
 - Option 2: file names are binary
    * 2a: file names are bytes only
    * 2b: file names can be bytes or words

=== Option 1: file names are Unicode text ===
a.k.a. "What could go wrong™?" option

File names and paths are a valid sequence of Unicode codepoints. This is true 
because a file is very often displayed to the user in a shell, command-prompt, 
graphical or text interface, etc. When that happens, file names *are* text. 
This option is what people *expect* to happen and is therefore the natural 
solution.

In JSON, this means file names are transmitted as Strings (RFC 8259 section 
7), encoded in UTF-8. In that scenario, you'd open a file name found in the 
payload the following ways:

a) on Windows, use c8strtowcs or c8srtoc16s and pass the result to _wfopen() 
or CreateFileW

b) on other systems, use SG16's proposed c8srtombs ("char8_t string to 
multibyte string") and pass the result to open() or fopen()

c) with Qt, if using QJsonDocument, then pass the string from 
QJsonValue::toString() to QFile.

Consequences:
1) easiest implementation. Codecs between UTF-8, UTF-16 and the narrow- and 
wide-character strings are everywhere.

2) on modern Unix systems, the locale codec is UTF-8, which means the 
implementation is even simpler. Tools can be designed to only support this 
environment and therefore perform a pass-through from UTF-8 payload directly 
to the filesystem and vice-versa.

3) only file names that can be decoded into a Unicode string are 
permissible. On Unix, anything that mbsrtoc8s fails to decode is 
unrepresentable and therefore should be considered filesystem corruption. 
Similarly for Windows: file names with improperly-paired surrogate code units 
are unrepresentable and therefore filesystem corruption.

4) changes in the encoding for the narrow- and/or wide-character sets are a 
failure mode and not supported. Notably, changing LANG or LC_ALL on Unix 
systems. This includes setting LC_ALL to "C", something a lot of tools do when 
they parse output from other tools, to ensure the output format they're 
parsing is stable.
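
Consequences 3 and 4 can be illustrated with a minimal sketch in Python. This 
is only an illustration under my own assumptions: the function names are 
hypothetical, the narrow encoding is passed in explicitly instead of being 
read from the locale, and None stands for "unrepresentable".

```python
from typing import Optional, Sequence


def narrow_to_payload(raw: bytes, codec: str = "utf-8") -> Optional[str]:
    """Decode a narrow (byte) file name strictly; None if it cannot be decoded."""
    try:
        return raw.decode(codec, errors="strict")
    except UnicodeDecodeError:
        return None


def wide_to_payload(units: Sequence[int]) -> Optional[str]:
    """Decode 16-bit code units as strict UTF-16; None on unpaired surrogates."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    try:
        return raw.decode("utf-16-le", errors="strict")
    except UnicodeDecodeError:
        return None


def payload_to_narrow(name: str, codec: str) -> Optional[bytes]:
    """Encode a payload string for a byte-based API; None if the codec cannot."""
    try:
        return name.encode(codec, errors="strict")
    except UnicodeEncodeError:
        return None
```

With these helpers, a Unix name containing the byte 0xFF or a Windows name 
ending in a lone 0xD800 yields None, and so does encoding a non-ASCII payload 
with the "ascii" codec, which is roughly what a tool running under LC_ALL=C 
would face.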

=== Option 2a: file names are bytes ===
a.k.a. "Windows developers feel the pain" option

For systems where the filesystem API is implemented using narrow characters 
(that is, bytes), the payload is the exact array of bytes that the API 
provided and accepts. For systems where the API is not using narrow 
characters, a lossless transformation to bytes is required. Transporting those 
bytes in JSON is done by either using Base64 in a JSON String or by using an 
array of JSON numbers.
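
As a rough sketch of the two JSON transports just mentioned, the Python below 
stores the same byte payload either as Base64 in a String or as an array of 
numbers. The key names "path-base64" and "path-bytes" are hypothetical; 
P1689 does not define them.

```python
import base64
import json


def to_json_base64(raw: bytes) -> str:
    """Store the bytes as Base64 text inside a JSON String."""
    return json.dumps({"path-base64": base64.b64encode(raw).decode("ascii")})


def to_json_array(raw: bytes) -> str:
    """Store the bytes as a JSON array of numbers in the range 0..255."""
    return json.dumps({"path-bytes": list(raw)})


def from_json(text: str) -> bytes:
    """Recover the exact bytes from either representation."""
    obj = json.loads(text)
    if "path-base64" in obj:
        return base64.b64decode(obj["path-base64"], validate=True)
    # bytes() raises ValueError if any number is outside 0..255.
    return bytes(obj["path-bytes"])
```

Both forms round-trip arbitrary bytes, including sequences such as 0xFF 0xFE 
that are not valid UTF-8 and so could not be placed in a JSON String as-is.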

The only system I know where the native filesystem API is not byte-based is 
Windows. So for Windows, the file names are transformed using CESU-8 / WTF-8, 
*not* UTF-8. That is, any surrogate code units found in the file name are 
stored as the 3-byte UTF-8 encoding of each, not the 4-byte encoding of the 
UTF-32 code point they're supposed to represent.

This solution is lossless and can represent all possible file names.
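
To make the transformation concrete, here is a minimal sketch in Python of the 
encoding as described above: every 16-bit unit, surrogate or not, is encoded 
on its own as 1 to 3 bytes. The function names are hypothetical, and the 
decoder assumes its input came from this encoder (it does not reject overlong 
forms).

```python
from typing import List, Sequence


def units_to_bytes(units: Sequence[int]) -> bytes:
    """Encode each 16-bit unit independently, surrogates included."""
    out = bytearray()
    for u in units:
        if not 0 <= u <= 0xFFFF:
            raise ValueError(f"not a 16-bit unit: {u!r}")
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)


def bytes_to_units(raw: bytes) -> List[int]:
    """Inverse of units_to_bytes: one 16-bit unit per 1- to 3-byte sequence."""
    units = []
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead < 0x80:
            size, value = 1, lead
        elif 0xC0 <= lead < 0xE0:
            size, value = 2, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            size, value = 3, lead & 0x0F
        else:
            # 4-byte UTF-8 lead bytes and stray continuation bytes land here.
            raise ValueError(f"unexpected byte 0x{lead:02x} at offset {i}")
        tail = raw[i + 1:i + size]
        if len(tail) != size - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        units.append(value)
        i += size
    return units
```

For example, the surrogate pair 0xD83D 0xDE00 becomes the six bytes ED A0 BD 
ED B8 80 instead of the four bytes a compliant UTF-8 encoder would produce, 
and a lone 0xD800 survives the round trip unchanged.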

To open such a file, you'd do:

a) on Windows, convert the byte array from CESU-8 / WTF-8 to WTF-16 
("potentially ill-formed UTF-16"), then pass the file name to _wfopen() or 
CreateFileW()

b) on other systems, pass the byte array directly to open() or fopen()

c) with Qt, convert the byte array from CESU-8 / WTF-8 to WTF-16 and pass the 
resulting QString to QFile

Consequences:
1) easiest for Unix, since it's pass-through. However, for Windows and other 
UTF-16-using APIs, there's a non-trivial hurdle. The implementation for CESU-8 
encoding and decoding is *not* provided in the standard library and is not 
usually found in Unicode libraries. In fact, using compliant UTF-8 encoders 
and decoders is *not* permitted in this solution.

=== Option 2b: file names are bytes or words ===
a.k.a. "spread the pain" option

This is an extension of option 2a. It admits that file names on Windows are 
actually composed of 16-bit units and permits those as the payload. So the 
file names are stored in the payload with a tag indicating whether the 
contents are 8-bit or 16-bit.

Native Windows tools therefore can perform pass-through, if the payload is 
stored 16-bit. The problem is that both 8- and 16-bit are allowed, which means 
all tools need to deal with both possibilities.

I) if the payload is stored as 8-bit, do as option 2a
II) if the payload is stored as 16-bit, then:
a) on Windows and with Qt, pass-through

b) on other systems, assume it's WTF-16 and encode as CESU-8, then pass to 
open() or fopen()

The reason Unix systems must also deal with 16-bit units is Cygwin and WSL. 
See the analysis below.
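
A byte-API consumer under option 2b could dispatch on the tag roughly as in 
the Python sketch below. The tag values 8 and 16 and the function name are my 
assumptions for illustration. The 16-bit branch leans on Python's 
"surrogatepass" error handler, which encodes each surrogate code point as 3 
bytes when the units are kept as separate characters.

```python
from typing import Sequence


def to_narrow_name(width: int, data: Sequence[int]) -> bytes:
    """Turn a tagged payload into the byte string for open() or fopen()."""
    if width == 8:
        # Case I: the payload already is the byte array (option 2a).
        return bytes(data)
    if width == 16:
        # Case II.b: encode every 16-bit unit on its own, surrogates included.
        if any(not 0 <= u <= 0xFFFF for u in data):
            raise ValueError("payload tagged 16-bit holds a wider value")
        text = "".join(chr(u) for u in data)
        return text.encode("utf-8", errors="surrogatepass")
    raise ValueError(f"unknown payload width: {width!r}")
```

A native Windows tool would need the mirror image of this function, passing 
16-bit payloads through and decoding 8-bit ones.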

== Windows Analysis ==
I can think of four relevant build environments for Windows, which form two 
distinct groups today, plus a theoretical third that currently does not exist:

1) native applications built with MSVC (ucrt.dll); _WIN32 is defined
2) native applications built with MinGW (crtdll.dll); _WIN32 is defined
3) Unix applications built with Cygwin / MSYS2; _WIN32 is not defined
4) Unix applications built for Linux, run under WSL; _WIN32 is not defined

It is conceivable that these four types of applications are all mixed together 
in a single build, so they could be sharing the same data that P1689 is meant 
to share. And CMake is the prime example of this: it can be any of the four, 
driving a make and a compiler that is any of the four too.

The three groups are:

a) Wide API available and narrow is ANSI (1 and 2 above)
b) Wide API is available and narrow is UTF-8 (theoretical)
c) no Wide API, narrow is UTF-8 (3 and 4)

Group c only has open() and fopen() available. Fortunately, the Cygwin/MSYS2 
runtime takes the narrow character input and converts it to wchar_t using 
UTF-8 (I don't know whether it's CESU-8), so those applications just work. For 
them, option 2a is a pass-through; option 2b requires the UTF-16 to UTF-8 
codec, then a pass-through; and option 1 admits the pass-through solution with 
an #ifdef.

Group b has both APIs available. For this group, pass-through is available in 
both options 2a and 2b, and it can take the shortcut on option 1.

For both groups b and c, Unix applications can be rebuilt on Windows with 
little to no porting.

Group a MUST NOT use _open() and fopen(). No exceptions. This means Unix 
applications must be ported to Windows in order to operate properly if 
compiled with those compilers, so that they will use _wfopen() or 
CreateFileW(). For those, pass-through is only possible under option 2b, if 
the payload is 16-bit.

== Transport ==
P1689 suggests using JSON. Below I compare it with a binary format (CBOR) in 
the context of the three options.

One thing SG16 is in complete agreement on is that if you go with JSON, you 
must obey RFC 8259: there must not be a BOM and the file must be encoded in 
UTF-8.
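
A consumer could enforce both requirements before parsing. The Python sketch 
below is one way to do it; the function name is hypothetical. The bytes are 
decoded explicitly because Python's json.loads(), when handed bytes, would 
otherwise auto-detect UTF-16 and UTF-32 as well.

```python
import json


def load_rfc8259(raw: bytes):
    """Parse JSON bytes, rejecting a BOM and anything that is not UTF-8."""
    if raw.startswith(b"\xef\xbb\xbf"):
        raise ValueError("byte order mark is not allowed")
    # Strict decoding raises UnicodeDecodeError (a ValueError) on non-UTF-8.
    return json.loads(raw.decode("utf-8", errors="strict"))
```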

Option 1) Since file names are text, JSON is actually well-placed and the file 
names are stored as JSON Strings. This is easy to debug in any UTF-8 capable 
text editor, though of course one that understands JSON is recommended. Most 
JSON APIs provide strings directly in UTF-8, so that content can be passed to 
the UTF-8 to locale encoder / decoder. CBOR also stores text strings as UTF-8, 
so the same ease of encoding and decoding to the locale is there.

Option 2a) File names are binary data, so they MUST NOT be stored as-is in 
JSON strings. I recommend either base64 in a string or an array of numbers. 
For this, a binary solution is better: CBOR has a type called "byte string", 
which can store binary data.
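
To give a feel for the difference, the Python sketch below hand-encodes a CBOR 
byte string (major type 2 in RFC 8949) and compares its size with the two JSON 
representations. The function names are hypothetical, and a real tool would 
more likely use a CBOR library than this hand-rolled header.

```python
import base64
import json


def cbor_byte_string(raw: bytes) -> bytes:
    """Encode raw as a definite-length CBOR byte string (major type 2)."""
    n = len(raw)
    if n < 24:
        head = bytes([0x40 | n])          # length fits in the initial byte
    elif n < 1 << 8:
        head = bytes([0x58, n])           # 1-byte length follows
    elif n < 1 << 16:
        head = b"\x59" + n.to_bytes(2, "big")
    elif n < 1 << 32:
        head = b"\x5a" + n.to_bytes(4, "big")
    else:
        head = b"\x5b" + n.to_bytes(8, "big")
    return head + raw


def encoded_sizes(raw: bytes) -> dict:
    """Size in bytes of the same payload in each representation."""
    return {
        "cbor": len(cbor_byte_string(raw)),
        "json_base64": len(json.dumps(base64.b64encode(raw).decode("ascii"))),
        "json_array": len(json.dumps(list(raw))),
    }
```

The CBOR form adds only a header of 1 to 9 bytes, Base64 adds about a third, 
and the array of numbers grows the most.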

Option 2b) is an extension of 2a. You store the payload the same way, except 
that you must also store a tag indicating whether the data was 8 or 16-bit. If 
using Base64, it must also indicate whether it's big-endian or little (this 
problem does not exist for an array of numbers). The same constraints apply to 
CBOR, and I do not recommend storing an array of numbers there, as that will 
double the space needed compared to a byte string and will be slower to 
encode and decode.
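
The endianness issue with Base64 can be seen in a short Python sketch. The 
function names and the "little" / "big" tag values are assumptions for 
illustration only.

```python
import base64
import struct
from typing import List, Sequence


def _fmt(count: int, byteorder: str) -> str:
    """struct format for `count` unsigned 16-bit values in the given order."""
    if byteorder not in ("little", "big"):
        raise ValueError(f"unknown byte order: {byteorder!r}")
    return ("<" if byteorder == "little" else ">") + f"{count}H"


def pack_units(units: Sequence[int], byteorder: str) -> str:
    """Serialize 16-bit units to bytes in the given order, then Base64."""
    raw = struct.pack(_fmt(len(units), byteorder), *units)
    return base64.b64encode(raw).decode("ascii")


def unpack_units(text: str, byteorder: str) -> List[int]:
    """Reverse pack_units; the caller must know the byte order used."""
    raw = base64.b64decode(text, validate=True)
    if len(raw) % 2:
        raise ValueError("odd number of bytes in a 16-bit payload")
    return list(struct.unpack(_fmt(len(raw) // 2, byteorder), raw))
```

The single unit 0x0061 packs to "YQA=" in little-endian and "AGE=" in 
big-endian, and reading the former with the wrong order yields 0x6100, which 
is why the order has to travel with the payload.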

This is it. I know this is a long email, but hopefully it helps you come to 
some conclusions.


-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products




