[SG16-Unicode] P1689: Encoding of filenames for interchange

Thiago Macieira thiago at macieira.org
Fri Sep 6 06:18:38 CEST 2019


On Thursday, 5 September 2019 03:51:41 PDT Niall Douglas wrote:
> To solve the OP's problem, why doesn't P1689 simply store BOTH the
> UTF8-attempt and native filesystem encoding raw bytes edition of pathnames?

That's what the paper currently proposes. My argument is that you should 
choose one only.

> The UTF8-attempt edition is where one takes the raw bytes in the native
> filesystem encoding, and converts it to UTF-8. Note that even on POSIX,
> filesystem paths are not necessarily in valid UTF-8, and ought to be
> treated as raw bytes if you want to be able to reopen the original file
> after encoding into JSON.
> 
> If the raw bytes edition of pathnames in the JSON file is present, it is
> used first during lookup. If lookup with the raw byte edition fails, or
> if it is not present in the JSON file, the UTF-8 edition is converted to
> the native filesystem encoding, and that is used.

Sorry Niall, I don't think this will work.

If the raw bytes edition is optional, then it means a valid payload can 
include only the UTF-8 representation in the JSON String. But that opens the 
possibility that two tools will disagree as to what file it represents. For 
example:
{
    "file": "/tmp/é.c"
}

$ ls -1ib *.c
5303210 \351.c
5303209 é.c

$ LC_ALL=en_US.ISO-8859-1 ls -1ib *.c | iconv -f latin1
5303209 é.c
5303210 é.c

Which of the two inodes is the JSON file referring to?
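
To make the ambiguity concrete, here is a minimal Python sketch (mine, not 
part of the paper) of what two readers could do with that same JSON string. 
It assumes one tool encodes names as UTF-8 and the other as ISO-8859-1; the 
variable names are only for illustration.

```python
import json

# The payload from the example above; \u00e9 is the JSON escape for "é".
payload = json.loads('{"file": "/tmp/\\u00e9.c"}')
name = payload["file"]

# Assumption: tool A runs in a UTF-8 locale, tool B in an ISO-8859-1 locale.
utf8_bytes = name.encode("utf-8")      # b'/tmp/\xc3\xa9.c'
latin1_bytes = name.encode("latin-1")  # b'/tmp/\xe9.c' (\351 in octal)

# Same text, two different byte strings, so two different files on disk.
print(utf8_bytes, latin1_bytes, utf8_bytes == latin1_bytes)
```

Both byte strings are valid file names on a POSIX file system, so the text 
alone does not say which of the two the producer meant.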

Using the UTF-8 encoded text is Option 1 in my proposal. I don't have a 
problem with it, but if it is adopted, implementers need to understand that 
the problems shown in the ls output above will happen (note that there is a 
second issue there).

If the raw form is mandatory, then the text form is superfluous. That covers 
both variants of Option 2, which differ only in which raw forms are required.
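
A rough Python sketch of the lookup order from the quoted suggestion shows 
why. The field name "file-raw" and its list-of-byte-values shape are 
hypothetical, chosen only for this illustration and not taken from the 
paper, and the sketch leaves out the "retry with the text form if the raw 
lookup fails" step.

```python
import os

def resolve(entry: dict) -> bytes:
    """Return the native path bytes for one hypothetical JSON entry."""
    raw = entry.get("file-raw")        # hypothetical field: list of byte values
    if raw is not None:
        return bytes(raw)              # exact bytes, no locale involved
    # Only reachable when the raw form is optional and absent.
    # os.fsencode uses the reader's filesystem encoding, so the result
    # can differ from one tool or locale to the next.
    return os.fsencode(entry["file"])

entry = {
    "file": "/tmp/\u00e9.c",
    "file-raw": [0x2F, 0x74, 0x6D, 0x70, 0x2F, 0xE9, 0x2E, 0x63],
}
print(resolve(entry))
```

Once "file-raw" is always present, the fallback branch is never taken, and 
the text form no longer affects which file gets opened.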

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products




