[SG16-Unicode] P1689: Encoding of filenames for interchange

Thiago Macieira thiago at macieira.org
Sat Sep 7 03:07:48 CEST 2019


On Friday, 6 September 2019 16:33:03 PDT Niall Douglas wrote:
> >     I'm interpreting this in two cases:
> >      1) on Unix, the bag of 8-bit bytes obtained from the FS API can be
> >         decoded using UTF-8
> >      2) on Windows, the bag of 16-bit words can be decoded using UTF-16,
> >         which means I can encode it to 8-bit with UTF-8
> 
> You're excluding ANSI on Windows. 

Yes, intentionally.
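
To make case 2 concrete, here is a minimal, portable sketch of turning a sequence of 16-bit units into UTF-8. The names utf16_to_utf8 and append_utf8 are hypothetical, not from any library. It assumes the input really is well-formed UTF-16 and reports failure (rather than guessing) on an unpaired surrogate, which Windows file names are allowed to contain.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Append one Unicode scalar value to `out` as UTF-8.
static void append_utf8(std::string &out, char32_t cp)
{
    // Precondition: cp is a scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: returns std::nullopt if `in` is not well-formed
// UTF-16 (an unpaired high or low surrogate).
std::optional<std::string> utf16_to_utf8(std::u16string_view in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {            // high surrogate
            if (i + 1 >= in.size())
                return std::nullopt;                 // truncated pair
            char32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return std::nullopt;                 // not followed by a low surrogate
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
            ++i;                                     // consumed two units
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return std::nullopt;                     // stray low surrogate
        }
        append_utf8(out, u);
    }
    return out;
}
```

What a tool should do when this returns std::nullopt is a policy question the sketch deliberately leaves open.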

> I keep bringing it up, because:
> 
> int main(int argc, char *argv[])
> {
>   std::filesystem::path(argv[1]);
>   ...
> 
> ... involves a conversion of the system narrow encoding, which is locale
> dependent, to the filesystem native encoding, which on Windows is
> currently incorrectly defined by the standard to only ever be UTF-16
> wchar_t. This is still the case even when _UNICODE is defined. And there
> is a ton of build tooling out there which works with char arrays,
> including on Windows.

The mistake was to use argv. If you're on Windows and you want to deal with 
proper file names on the command-line, call GetCommandLineW and get the actual 
command-line.
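
As a rough sketch of what I mean, assuming you link against Shell32 for CommandLineToArgvW: the helper name utf8_arguments is made up for illustration, and the non-Windows branch simply assumes argv is already UTF-8 (case 1 above).

```cpp
#include <string>
#include <utility>
#include <vector>

#ifdef _WIN32
#  include <windows.h>
#  include <shellapi.h>   // CommandLineToArgvW; link with Shell32
#endif

// Hypothetical helper: returns the program arguments as UTF-8 strings.
// On Windows it ignores argc/argv and re-reads the wide command line.
std::vector<std::string> utf8_arguments(int argc, char *argv[])
{
    std::vector<std::string> args;
#ifdef _WIN32
    (void)argc;
    (void)argv;
    int wargc = 0;
    wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (!wargv)
        return args;                      // could not parse the command line
    for (int i = 0; i < wargc; ++i) {
        // First call: ask for the required size, including the NUL.
        int n = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                    wargv[i], -1, nullptr, 0, nullptr, nullptr);
        if (n <= 0) {                     // e.g. unpaired surrogate
            args.clear();
            break;
        }
        std::string s(static_cast<std::size_t>(n), '\0');
        WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                            wargv[i], -1, &s[0], n, nullptr, nullptr);
        s.pop_back();                     // drop the NUL counted in n
        args.push_back(std::move(s));
    }
    LocalFree(wargv);
#else
    // Assumption: the 8-bit arguments are already UTF-8.
    for (int i = 0; i < argc; ++i)
        args.emplace_back(argv[i]);
#endif
    return args;
}
```

A main() would call utf8_arguments(argc, argv) once at startup and never look at argv again.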

> It's all well and good for Thiago etc to say "you must use wmain()". I
> think P1689 must be a taker when it comes to persuading existing build
> tooling to use their interchange format. If they're using char arrays,
> if they're using main() not wmain(), you need to support that.

Indeed, the proposal for Option 2 is specifically that if _WIN32 is defined, 
you must use the W API. The ANSI API is banned, including argv and fopen.

Interestingly, Cygwin/MSYS2 and WSL have shown that it's possible to fix this 
on Windows. It requires no kernel modification, just a different C runtime. I 
don't claim it's easy, only that there is a solution. It does need to be 
coupled with deprecating and banning the ANSI API.

> Otherwise they're either going to corrupt your JSON on non-US locales,
> which upsets developers. Or they're going to extend your JSON to have
> been correct in the first place. Or they're going to use their own
> interchange format, and say in the docs "don't use the standard JSON
> format, it's broken".

If you corrupt the JSON file, then your JSON encoder is broken in the first 
place or you misused the API.

void json_add_string(JsonWriter *, const char *utf8String);

If you pass non-UTF-8 there, you made a mistake. It's a bug in your code. Use 
mbsrntoc8s().
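
If you want to catch that bug at the call site, a minimal sketch of a well-formedness check could look like the following. The name is_valid_utf8 is hypothetical; the idea is to run it (or assert on it) before handing a string to something like json_add_string. It rejects overlong forms, surrogates and values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool is_valid_utf8(std::string_view s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) {                  // ASCII, one byte
            ++i;
            continue;
        }
        std::size_t len;                  // total bytes in this sequence
        char32_t cp;                      // code point being assembled
        char32_t min;                     // smallest value allowed for len
        if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07; min = 0x10000;
        } else {
            return false;                 // stray continuation or invalid lead byte
        }
        if (s.size() - i < len)
            return false;                 // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80)
                return false;             // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min)
            return false;                 // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                 // out of range or surrogate
        i += len;
    }
    return true;
}
```

A debug build could simply do assert(is_valid_utf8(utf8String)) at the top of the writer function; that turns silent file corruption into an immediate failure in the caller that passed locale-encoded bytes.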

> I have not currently decided what LLFIO will do on this. I really hate
> the ANSI APIs. But Billy O'Neal gave me a very convincing motivating
> use case:
> 
> int main(int argc, char *argv[])
> {
>   auto fh = file({}, argv[1]);
> 
> If LLFIO calls the ANSI API here, this "just works" even on Shift-JIS
> and all the other weird legacy encodings Windows supports.
> 
> I still haven't brought myself to implement the support, though.

Convert from ANSI on creation.

If that makes it impossible to have an allocation-free class, then an 
allocation-free class is impossible.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products




