[SG16-Unicode] P1689: Encoding of filenames for interchange

Brad King brad.king at kitware.com
Fri Sep 6 15:38:45 CEST 2019


On 9/6/19 8:46 AM, Niall Douglas wrote:
> On 06/09/2019 05:18, Thiago Macieira wrote:
>> On Thursday, 5 September 2019 03:51:41 PDT Niall Douglas wrote:
>>> To solve the OP's problem, why doesn't P1689 simply store BOTH the
>>> UTF8-attempt and native filesystem encoding raw bytes edition of pathnames?
>>
>> That's what the paper currently proposes. My argument is that you should 
>> choose one only.
> 
> My reading of their paper was that they want to encode non-UTF8
> sequences into UTF8 paths in a JSON file. I don't think they should take
> that path, because it loses too much information.

We'll have to clarify the wording.  We propose two allowed representations:

- An array of integers, tagged with the size each value occupies in memory.
  This can represent an arbitrary binary sequence and is the general form.

  This variant also allows a "readable-name" field intended only for human
  consumption that is not meant for use in accessing the filesystem.
  It is optional and superfluous for tooling but useful for debugging.

- UTF-8.  This is allowed *only if a lossless round trip* is possible
  between the filesystem's native binary sequence and UTF-8.  E.g. on
  Windows we should not require the full general format to represent
  a simple path like "a.cxx" just because the filesystem APIs use wide chars.

  This is intended for the common use case of ASCII-only file paths to make
  the format simpler and more human readable (e.g. for debugging).  We then
  generalize beyond ASCII to allow any lossless UTF-8 round-trip (implying
  that the locale does not change).
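
As a rough illustration of how a producer might choose between the two
representations, here is a minimal Python sketch.  It assumes a POSIX-style
path held as raw bytes (so the value size is always 8 bits), and the key
names "unit-size" and "code-units" are placeholders invented for this
example, not the paper's wording.  Only "readable-name" comes from the
description above.

```python
import json


def encode_path(raw: bytes):
    """Return a JSON-ready value for a path given as raw 8-bit bytes."""
    try:
        # Strict decoding: any invalid UTF-8 sequence raises an error.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = None

    # Use the plain UTF-8 string only when the round trip is lossless.
    if text is not None and text.encode("utf-8") == raw:
        return text

    # General form: an integer array tagged with the value size.
    # "unit-size" and "code-units" are hypothetical key names.
    return {
        "unit-size": 8,
        "code-units": list(raw),
        # For humans only; never used to access the filesystem.
        "readable-name": raw.decode("utf-8", errors="replace"),
    }


def decode_path(value) -> bytes:
    """Recover the raw bytes from either representation."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return bytes(value["code-units"])


simple = encode_path(b"a.cxx")           # valid UTF-8: plain string
general = encode_path(b"caf\xe9.cxx")    # Latin-1 byte: integer array
restored = decode_path(json.loads(json.dumps(general)))
```

A consumer that only ever sees ASCII paths would then deal with plain JSON
strings, and the integer-array form would appear only for paths that cannot
survive the round trip.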

-Brad

