[SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

Thiago Macieira thiago at macieira.org
Wed Aug 14 19:58:42 CEST 2019


On Tuesday, 13 August 2019 15:29:29 PDT keld at keldix.com wrote:
> > I guess you never used Windows?
> 
> I have not done much programming on Windows systems. I have sometimes
> made a living administering them. What are the problems wrt this?

Two problems:

1) There are actually three different active character sets: the DOS 
codepage, the 8-bit "ANSI" codepage, and the 16-bit wide-character encoding. 
For everyone except Niall, the 16-bit encoding is always UTF-16 (he'll tell 
you it's actually raw 16-bit units, with no surrogate interpretation).

The DOS and 8-bit "ANSI" codepages are different 8-bit encodings. But I think 
we can leave the DOS codepage in the past, since it's much less relevant these 
days.

That leaves the problem that the 8-bit encoding is *not* UTF-8, for the vast 
majority of people. I read somewhere that Vietnamese Windows uses UTF-8, but 
for almost everyone else it's usually a Windows-specific encoding. The one 
used by English Windows is CP1252, which mostly matches ISO-8859-1, but 
encodes different things in the 0x80-0x9F range.
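To make the CP1252 vs. ISO-8859-1 difference concrete, here is a minimal sketch (the helper names are mine, not any real API). In ISO-8859-1 the bytes 0x80-0x9F are the C1 control characters; CP1252 reassigns most of them to printable punctuation:

```cpp
#include <cstdint>

// Decode one byte to a Unicode code point under each encoding.
char32_t cp1252_to_codepoint(std::uint8_t b)
{
    // CP1252 assignments for 0x80-0x9F (0 = byte is unassigned).
    static const char32_t high[32] = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178,
    };
    if (b >= 0x80 && b <= 0x9F)
        return high[b - 0x80];
    return b; // everywhere else CP1252 agrees with ISO-8859-1
}

char32_t latin1_to_codepoint(std::uint8_t b)
{
    return b; // ISO-8859-1 maps each byte straight to U+0000..U+00FF
}
```

For example, byte 0x93 is the C1 control U+0093 in Latin-1 but the left double quotation mark U+201C in CP1252; outside 0x80-0x9F the two agree.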

The big problem with this is that the entire C API, like fopen() and printf(), 
and the POSIX-imported API like _open(), uses the 8-bit "ANSI" encoding. 
Since C++ builds on those, we are similarly affected. This also means that 
fopen() cannot open all files in the system, main()'s argv does not receive 
the full command line, etc.
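The usual workaround is to treat all paths as UTF-8 in portable code and, on Windows, convert to UTF-16 and call the wide API. A minimal sketch (fopen_utf8 is a name I made up; the fixed-size buffers are a simplification):

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Open a file whose name is given in UTF-8. On Windows, the narrow
// fopen() would run the name through the 8-bit "ANSI" codepage, so we
// convert to UTF-16 ourselves and use _wfopen() instead.
std::FILE *fopen_utf8(const char *utf8_path, const char *mode)
{
#ifdef _WIN32
    wchar_t wpath[1024], wmode[16];
    // MultiByteToWideChar(CP_UTF8, ...) performs the UTF-8 -> UTF-16 step.
    if (!MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wpath, 1024) ||
        !MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16))
        return nullptr;
    return _wfopen(wpath, wmode);
#else
    // On POSIX systems file names are opaque bytes; passing UTF-8
    // straight through is the norm.
    return std::fopen(utf8_path, mode);
#endif
}
```

(argv needs the same treatment on Windows, via GetCommandLineW/CommandLineToArgvW.)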

2) MSVC has the "traditional" interpretation of the source and execution 
charsets. Unlike GCC and Clang, it does not pass source bytes through 
unchanged into narrow string literals. And since wide-character literals 
are fairly common due to the 16-bit W API, the chances of mojibake are 
considerable.
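The mojibake mechanism can be sketched like this: a compiler that believes the source is Latin-1/CP1252 decodes each byte as one character, so the UTF-8 bytes 0xC3 0xB8 ("ø") become the two characters U+00C3 U+00B8 ("Ã¸") in a wide string literal (decode_as_latin1 is an illustrative helper, not any real compiler API):

```cpp
#include <string>

// Model of a compiler reading a source file under the wrong
// assumption that it is Latin-1: every byte is its own character.
std::u32string decode_as_latin1(const std::string &bytes)
{
    std::u32string out;
    for (unsigned char b : bytes)
        out.push_back(b); // Latin-1: byte value == code point
    return out;
}
```

The programmer wrote one character (U+00F8); the wide literal ends up holding two (U+00C3 U+00B8).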

Worse, because the entire source file is read using the system's 8-bit ANSI 
encoding, you can produce uncompilable sources with *comments*. For example, 
if Corentin's friend Bjørn had in his source:

	// Copyright (C) 2019 Bjørn Bjørnsen

Then his friend Yamada Tarō with a Japanese Windows might not be able to 
compile the file, because the byte sequence for ø (whether UTF-8, Latin-1 or 
Latin-9) is not valid in the Japanese codepage. I'm not making this up. We had 
this problem in Qt because of a copyright line (the ä in "Klarälvdalens 
Datakonsult AB", and no, ä is not "ae" in Swedish). Note how I did not use ©.
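A sketch of why the comment breaks, assuming the Japanese codepage is CP932 (Microsoft's Shift-JIS variant), whose lead bytes are 0x81-0x9F and 0xE0-0xFC: a lead byte forces the decoder to consume the *next* byte as well, so the rest of the line is corrupted or becomes an invalid sequence.

```cpp
// CP932 lead-byte test: these bytes start a two-byte sequence, so the
// decoder swallows the byte that follows them.
bool is_cp932_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}
```

Latin-1 "ø" is the single byte 0xF8, which is a CP932 lead byte, so a Japanese-locale compiler pairs it with the following 'r' of "Bjørn". Plain ASCII bytes are never lead bytes, which is why pure-ASCII comments are safe.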

This means that compiling with MSVC's /utf-8 option is the only sane 
alternative, but it is not the default.
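For reference, enabling it looks like this (build-configuration fragment; "mytarget" is a placeholder for your own target name):

```
rem On the command line:
cl /utf-8 /EHsc main.cpp

# In CMake, applied only when the compiler is MSVC:
target_compile_options(mytarget PRIVATE "$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")
```

/utf-8 sets both the source and execution charsets to UTF-8 in one go.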

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products




