[SG16-Unicode] It's Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

Thiago Macieira thiago at macieira.org
Wed May 1 00:11:57 CEST 2019


On Tuesday, 30 April 2019 08:53:35 PDT Tom Honermann wrote:
> > This means filenames on VFAT and NTFS *do* have an encoding. You cannot
> > use arbitrary binary file names since those wouldn't convert to UTF-16
> > and couldn't be saved.
> 
> This is not quite correct.  Windows, at least, does permit creating
> files with names that are invalid UTF-16 as mentioned above.  This
> allows arbitrary binary file names, just with 16-bit code units.

Indeed, but we were arguing about the Unix API, especially the Linux
implementation, where you have no access to a 16-bit API. So you simply can't
save a file called "\xff" on a VFAT filesystem if it was mounted with the
default (iocharset=utf-8).
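
To make the "\xff" case concrete, here is a minimal sketch of a UTF-8
well-formedness check. It is not the kernel's code; is_valid_utf8 and
utf8_examples are hypothetical names used only for illustration. It shows
why a name consisting of the byte 0xFF has no UTF-16 counterpart: 0xFF can
never appear in well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Illustrative helper (an assumption, not kernel code): returns true if `s`
// is well-formed UTF-8, rejecting overlong forms, surrogates and values
// above U+10FFFF.
bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;   // length of the sequence announced by `lead`
        char32_t cp = 0;       // code point being decoded
        char32_t min_cp = 0;   // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;     // stray continuation byte, or 0xF8..0xFF

        if (len > s.size() - i) return false;          // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;      // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                 // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate code point
        if (cp > 0x10FFFF) return false;               // beyond Unicode range
        i += len;
    }
    return true;
}

// Small self-check of the cases discussed in this thread.
void utf8_examples() {
    assert(!is_valid_utf8("\xff"));      // the file name from the example above
    assert(is_valid_utf8("\xc3\xa9"));   // U+00E9, a valid two-byte sequence
}
```

A name that fails this kind of check cannot be converted to UTF-16, which
is the situation described above.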

> > Quite frankly, you shouldn't choose any iocharset= different from UTF-8,
> > since there could be file names on disk that wouldn't convert and
> > couldn't be represented.
> 
> Arguably, WTF-8 [1] is a better choice as it can convert and represent
> all VFAT and NTFS file names (though I wouldn't mind if Microsoft were
> to start requiring well-formed UTF-16 file names).

And the implementation might well work like that, so that the 8-bit API
presented to the VFS layer and userspace can represent all filenames found on
disk, so long as you choose "iocharset=utf-8". Choosing anything else may mean
some files do not get listed, since their names can't be represented in the
first place.
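
For illustration, here is a rough sketch of what a WTF-8-style conversion
from 16-bit code units looks like. I am not claiming any kernel does it this
way; to_wtf8 and wtf8_examples are hypothetical names. Proper surrogate
pairs become ordinary four-byte UTF-8, while a lone surrogate is kept as a
three-byte sequence instead of being rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Illustrative helper (an assumption, not an existing API): encodes possibly
// ill-formed UTF-16 as WTF-8.
std::string to_wtf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        // Combine a high surrogate with a following low surrogate, if any.
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // Unpaired surrogates land here and are encoded like any other
            // BMP code point; this is the difference from strict UTF-8.
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Small self-check: a lone high surrogate and a well-formed pair.
void wtf8_examples() {
    const std::u16string lone(1, char16_t(0xD800));
    assert(to_wtf8(lone) == "\xED\xA0\x80");

    const std::u16string pair{char16_t(0xD83D), char16_t(0xDE00)}; // U+1F600
    assert(to_wtf8(pair) == "\xF0\x9F\x98\x80");
}
```

Note that the three-byte result for the lone surrogate is not valid UTF-8,
so anything in userspace that validates file names strictly would still
trip over it.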

Conclusion: you really need UTF-8 these days.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products