[SG16-Unicode] P1689: Encoding of filenames for interchange

Thiago Macieira thiago at macieira.org
Sat Sep 7 05:28:50 CEST 2019


On Friday, 6 September 2019 19:17:00 PDT Lyberta wrote:
> Thiago Macieira:
> > [*] The only remaining issue is the perfectly valid case of setting
> > LC_ALL=C in the environment for reading other tools' output. I would
> > recommend just ignoring that.
> 
> I think if the machine-readable output depends on locale, the author of
> the program seriously messed up.

Oh, I agree with you. The problem is that the standard C library (as extended
by POSIX) does not provide the API to make that happen *and* support
internationalisation. And that's assuming the tool even has a "machine
readable" format in the first place. In the Unix tradition, you just scrape
the output of tools.

$ du -sh
1,8G    .

Note the comma instead of dot?
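
To illustrate the gap: in C, printf() follows whatever setlocale() selected, while a C++ stream can carry its own locale. A tool written against iostreams could therefore keep localized output for people and still emit a fixed format for other programs. This is only a minimal sketch. CommaDecimal is a hypothetical facet standing in for a comma-decimal locale (so the example does not depend on which locales are installed), and format_with / format_for_machines are made-up helper names:

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: behaves like the classic locale except that the
// decimal separator is a comma, as in the du output above.
struct CommaDecimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Format a value under an explicitly chosen locale (per stream, not global).
std::string format_with(const std::locale& loc, double value) {
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}

// Format a value meant for another program: always the classic "C" locale,
// regardless of the global locale or the environment.
std::string format_for_machines(double value) {
    return format_with(std::locale::classic(), value);
}
```

With a locale built as std::locale(std::locale::classic(), new CommaDecimal), format_with gives "1,8" for 1.8, while format_for_machines gives "1.8". It does nothing for tools that only have one output format, which is the common case.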

$ find -ls
  4719721      4 drwxr-xr-x   3  tjmaciei users        4096 set  6 19:39 .
  4722472      4 -rw-r--r--   1  tjmaciei users        2927 set  6 19:39 ./generate.pl
  4719722      4 drwxr-xr-x   2  tjmaciei users        4096 jun 18 17:18 ./packages
  4719723   2228 -rw-r--r--   1  tjmaciei users     2280402 fev  8  2019 ./packages/freedesktop.org.xml
  4742041    236 -rw-r--r--   1  tjmaciei users      239063 fev  8  2019 ./packages/freedesktop.org.xml.zst
  4721630      4 -rw-r--r--   1  tjmaciei users        2391 set  6 19:39 ./generate.bat
  4722858      4 -rw-r--r--   1  tjmaciei users        1739 set  6 19:39 ./hexdump.ps1

Note the month names in Portuguese (in a date format that is neither valid
Portuguese nor English, because no one in their right mind would put the day
between the month and the year).

> Corentin:
> > Supporting non displayable characters in build tools has no value. For
> > anyone.  "Someone might do that" is the reason we don't have nice things.
> 
> 100% agree. If the user has non-UTF paths, the job of the build system
> is to show message "Mate, you shot yourself in the foot. Fix your file
> system." It's that simple.

This is the philosophy that Qt has adopted too: file names that cannot be 
decoded by the locale codec are filesystem corruption. The build tools do not 
need to support them.
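
For a tool whose locale codec is UTF-8, the "reject it" policy comes down to a well-formedness check on the file name before using it. The following is a minimal sketch of such a check, not Qt's actual implementation. is_valid_utf8 is a hypothetical helper name:

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `name` is well-formed UTF-8: no stray or missing
// continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(std::string_view name) {
    std::size_t i = 0;
    while (i < name.size()) {
        const unsigned char lead = static_cast<unsigned char>(name[i]);
        std::size_t extra;   // continuation bytes expected after the lead byte
        char32_t cp;         // code point being assembled
        char32_t lowest;     // smallest code point allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        lowest = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; lowest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; lowest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; lowest = 0x10000; }
        else return false;   // stray continuation byte or invalid lead byte
        if (name.size() - i - 1 < extra) return false;   // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(name[i + k]);
            if ((c & 0xC0) != 0x80) return false;        // not a continuation
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < lowest) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

A build tool following this philosophy would call the check on each path it receives and stop with a "fix your file system" message when it returns false, instead of trying to carry the undecodable bytes through.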

C++ might have to, but Niall has that well in hand.

> > int main(int argc, char *argv[])
> 
> So,
> 
> int main(std::span<std::unicode::text_view> args)
> 
> then?

I worked with Erich Keane to come up with a solution for this. I think we even 
had a discussion in one of the mailing lists.

But the input is not Unicode, it's file paths. On Unix, it is possible to pass
binary input on the command line. With some effort, you can even pass NULs to
specially crafted receiver applications. The std::filesystem API appears to
have a way to retrieve the native raw format, which some applications may need.
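
One way to get at that raw format is std::filesystem::path::native(). The sketch below assumes a POSIX system, where path::value_type is char and the bytes from argv are stored without decoding. On Windows native() returns a std::wstring and this would not compile as written. paths_from_args and raw_bytes_of are hypothetical helper names:

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Collect the command-line arguments (skipping argv[0]) as paths.
// On POSIX no decoding happens here: the bytes are stored as given.
std::vector<std::filesystem::path> paths_from_args(int argc, char* argv[]) {
    std::vector<std::filesystem::path> paths;
    for (int i = 1; i < argc; ++i)
        paths.emplace_back(argv[i]);
    return paths;
}

// POSIX-only assumption: native() is a std::string holding the raw bytes.
std::string raw_bytes_of(const std::filesystem::path& p) {
    return p.native();
}
```

Under that assumption an argument such as $'\xe9.ui' comes back from raw_bytes_of byte for byte, so it can still be handed to open() even though it is not valid UTF-8.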

Qt doesn't care about those. QCoreApplication::arguments() is a list of 
QStrings, decoded using QFile::decodeName. Binary data will be silently 
corrupted:

$ strace uic $'\xe9.ui' |& grep -aF .ui
execve("/home/tjmaciei/bin/uic", ["uic", "\351.ui"], 0x7fffa282edc8 /* 118 vars */) = 0
execve("/home/tjmaciei/obj/qt/qt5/qtbase/bin/uic", ["/home/tjmaciei/obj/qt/qt5/qtbase"..., "\351.ui"], 0x7ffc30f6ad60 /* 118 vars */) = 0
openat(AT_FDCWD, "\357\277\275.ui", O_RDONLY|O_CLOEXEC) = -1 ENOENT (Arquivo ou diretório inexistente)
write(2, "File '\357\277\275.ui' is not valid\n", 27File '�.ui' is not valid

Oh, we got errors in Portuguese. Let me set LC_ALL=C:

$ LC_ALL=C strace uic $'\xe9.ui' |& grep -aF .ui
execve("/home/tjmaciei/bin/uic", ["uic", "\351.ui"], 0x7ffd0cd9f448 /* 119 vars */) = 0
execve("/home/tjmaciei/obj/qt/qt5/qtbase/bin/uic", ["/home/tjmaciei/obj/qt/qt5/qtbase"..., "\351.ui"], 0x7ffe3a738e00 /* 119 vars */) = 0
openat(AT_FDCWD, "?.ui", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
write(2, "File '?.ui' is not valid\n", 25File '?.ui' is not valid

Right now I'm tempted to submit a patch that makes Qt assume that locale "C"
is actually "C.UTF-8".

And since we're on the subject of strace, see how it is not parseable without 
LC_ALL=C:

$ strace -c true
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24,09    0,000191          19        10         8 openat
 22,19    0,000176          25         7           mmap
 13,75    0,000109          13         8         7 stat
 11,85    0,000094          23         4           mprotect
  6,56    0,000052          52         1           munmap
  6,18    0,000049          49         1         1 access
  5,30    0,000042          42         1           brk
  2,65    0,000021          10         2           fstat
  2,52    0,000020          10         2           close
  1,89    0,000015          15         1           execve
  1,64    0,000013          13         1           read
  1,39    0,000011          11         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0,000793                    39        16 total

Note the commas for the percentages and times (except for the 100.00!).
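
To make the scraping problem concrete, here is a minimal sketch of a parser for one numeric field. It assumes the same kind of hypothetical CommaDecimal facet as a stand-in for the comma-decimal locale, and parse_field is a made-up helper name. A field counts as parsed only if the whole text is consumed:

```cpp
#include <locale>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical facet: classic locale, but with ',' as the decimal separator.
struct CommaDecimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Parse `text` as a double under `loc`. Returns std::nullopt if no number
// could be read or if anything is left over after the number.
std::optional<double> parse_field(const std::string& text, const std::locale& loc) {
    std::istringstream in(text);
    in.imbue(loc);
    double value = 0;
    if (!(in >> value))
        return std::nullopt;                       // no number at all
    if (in.peek() != std::char_traits<char>::eof())
        return std::nullopt;                       // trailing text, e.g. ",09"
    return value;
}
```

Under the classic locale, "24,09" is rejected because parsing stops at the comma, while "100.00" is accepted. Under the comma-decimal locale it is the other way around. So no single locale setting on the reading side copes with the table above, which is why running the tool with LC_ALL=C is the practical answer.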

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products




