[SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

Thiago Macieira thiago at macieira.org
Thu Aug 15 16:54:51 CEST 2019


On Wednesday, 14 August 2019 21:27:56 PDT Tom Honermann wrote:
> > I *want* UTF-8, because we have a lot of code that does like
> > 
> > 	QString("é")
> > 
> > And our rule is that source code is encoded UTF-8, therefore I expect this
> > constructor to be passed a 2-byte string containing 0xc3, 0xa9.
> 
> I don't understand why, just because the source file is UTF-8 encoded,
> that you would expect the string to be UTF-8 at run-time. I can
> understand *wanting* UTF-8, just not the implication that such desire is
> based on the source encoding.

Because it works on Unix and for people using any Windows where the compiler 
effectively makes a byte copy of the source to the literal. Think about it: 
using CP1251, 1252, etc., the compiler decodes the source into UTF-16 using 
something like MultiByteToWideChar, processes, then writes the strings into 
the .obj file using a WideCharToMultiByte equivalent. That means the UTF-8 
sequence above (bytes 0xc3 0xa9) does get written into the .obj file as 0xc3 
0xa9, which is correct UTF-8.

That has worked since time immemorial and continues to, today.
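
A minimal sketch of the byte-copy effect described above, in standard C++ without Qt. It assumes ISO-8859-1 as a stand-in for a single-byte Windows code page (for the two bytes in question, 0xc3 and 0xa9, CP1252 maps to the same code points). The helper names decodeLatin1, encodeLatin1 and roundTripKeepsUtf8Bytes are hypothetical and only illustrate the round trip; they are not what any compiler actually calls.

```cpp
#include <cassert>
#include <string>

// Decode bytes as ISO-8859-1: every byte becomes the UTF-16 code unit
// with the same numeric value (assumed stand-in for the compiler's
// "source code page to UTF-16" step).
std::u16string decodeLatin1(const std::string &bytes)
{
    std::u16string out;
    for (unsigned char c : bytes)
        out.push_back(static_cast<char16_t>(c));
    return out;
}

// Encode UTF-16 back to ISO-8859-1; anything above U+00FF becomes '?'
// (assumed stand-in for the "UTF-16 to execution code page" step).
std::string encodeLatin1(const std::u16string &text)
{
    std::string out;
    for (char16_t c : text)
        out.push_back(c <= 0xFF ? static_cast<char>(static_cast<unsigned char>(c)) : '?');
    return out;
}

// The UTF-8 bytes for U+00E9 are misread as two Latin-1 characters
// (U+00C3, U+00A9) and then written back out as the same two bytes.
bool roundTripKeepsUtf8Bytes()
{
    const std::string source = "\xc3\xa9";
    const std::u16string internal = decodeLatin1(source);
    assert(internal.size() == 2);
    return encodeLatin1(internal) == source;
}
```

The compiler never "understands" the é in this scenario; the literal is still correct UTF-8 at run time only because decoding and re-encoding with the same single-byte code page cancel out.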

I admit this is a Western-centric view, since it's highly likely the sequence 
isn't valid Shift-JIS (is that what Windows uses in Japan?). In order to have 
cross-platform code, we'd have had to write QString("\xc3\xa9") and for our 
own sources, we did. But our limitation shouldn't be imposed on those who 
weren't under the same constraints.

And there was no alternative.

> > This is what
> > GCC, Clang and ICC (at least on Linux and macOS) will do. I need
> > interoperability of the source code with the cross-platform API.
> 
> gcc has -finput-charset and -fexec-charset that match the MSVC options,
> but is UTF-8 by default.  Clang only supports UTF-8.  I don't know about
> ICC.
> 
> Since C++11, I would have written the above as `QString(u8"é")` rather
> than requiring that the (presumed) execution encoding be set to UTF-8.

Because the codebases in question are much older than the ability to write 
u8"" in UTF-8 sources. Saying "C++11" here is a red herring, since we need 
compilers to support it and we need to be able to require those compilers. The 
compiler support happened with the /source-charset option, which was added in 
MSVC 2015 Update 2 (my commit log says we enabled it in Qt in Jan 2017). And we 
didn't drop MSVC 2013 until Qt 5.11, released in March 2018.

So you see, we've had little more than a year of being able to use u8"". But 
the requirement that sources be UTF-8 is much older than that. We made that 
change when we changed the QString constructor from the local 8 bit encoding 
to UTF-8 and that happened in mid 2012, before we could even require C++11. 

And be glad we didn't begin using u8"", since that would have broken with 
C++20 and char8_t. If we had had a large codebase using u8"", SG16 would have 
had to make a different choice regarding the hard break that the introduction 
of char8_t is. At least that change is post MS's adoption of SG10's 
feature-test macros.
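
A small sketch of the break in question, in standard C++: the element type of a u8"" literal is char before C++20 and char8_t from C++20 on, so an overload taking const char * (as the QString constructor discussed here does) would stop accepting u8"é". The names U8Element and u8LiteralIsPlainChar are hypothetical; __cpp_char8_t is the standard feature-test macro for the language feature.

```cpp
#include <cassert>
#include <type_traits>

// Element type of a u8"" literal under the language mode in effect.
using U8Element =
    std::remove_cv<std::remove_reference<decltype(u8"x"[0])>::type>::type;

// True when u8"" literals are still arrays of plain char (pre-C++20 rules).
constexpr bool u8LiteralIsPlainChar()
{
    return std::is_same<U8Element, char>::value;
}

#if defined(__cpp_char8_t)
// char8_t is enabled: u8"" no longer converts to const char *.
static_assert(!u8LiteralIsPlainChar(), "u8 literals should be char8_t here");
#else
// char8_t is not enabled: u8"" is an ordinary char array with UTF-8 content.
static_assert(u8LiteralIsPlainChar(), "u8 literals should be char here");
#endif
```

Code that has to build in both modes can branch on the same macro, which is the kind of detection the feature-test macros make possible.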

> > And if you did:
> > 	QFile f("é.txt");
> > 	f.open();
> > 
> > It would call CreateFile((wchar_t[]){0xe9, '.', 't', 'x', 't'}, ...),
> > which is the expected behaviour.
> 
> That looks to me like the expected behavior in either the case that
> QFile works on execution encoding (and /execution-charset is set or
> defaulted to Windows-1252) or if QFile requires UTF-8 (and
> /execution-charset:utf-8 is specified).

QFile takes a QString input, so it knows nothing about the execution encoding 
on Windows (on Unix, it does convert from UTF-16 back to the local 8-bit 
encoding, including proper NFD on macOS). My point is that the sequence above, 
through the implicit QString, opens the file that was expected. 

The difference between that and 

	FILE *f = _wfopen(L"é.txt", L"r");

is that the Qt-based one works whether you had the compiler's source-charset 
setting configured correctly to match the source's encoding or not, at least 
in locales where the compiler effectively byte-copied the source. And since 
you *couldn't* configure it to UTF-8 until January 2017, that means the source 
above simply couldn't have been written until very recently.

And remember that we had working code in all encodings since 2012 with

	QFile f("\xc3\xa9.txt");

Previously, since 2003, you had to write

	QFile f(QString::fromUtf8("\xc3\xa9.txt"));
or
	QFile f(QString::fromLatin1("\xe9.txt"));
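
To make the equivalence of those two spellings concrete, here is a sketch in standard C++ (not Qt's implementation). The helpers fromUtf8Sketch and fromLatin1Sketch are hypothetical stand-ins for the decoding step only; the UTF-8 decoder handles 1 to 3 byte sequences and substitutes U+FFFD for anything else rather than validating properly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Latin-1: each byte is the code point of the same value.
std::u16string fromLatin1Sketch(const std::string &bytes)
{
    std::u16string out;
    for (unsigned char c : bytes)
        out.push_back(static_cast<char16_t>(c));
    return out;
}

// Simplified UTF-8 decoder covering the BMP (1 to 3 byte sequences).
// Anything it does not recognise becomes U+FFFD; this is not a validator.
std::u16string fromUtf8Sketch(const std::string &bytes)
{
    std::u16string out;
    const std::size_t n = bytes.size();
    for (std::size_t i = 0; i < n;) {
        const unsigned b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {
            out.push_back(static_cast<char16_t>(b0));
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < n) {
            const unsigned b1 = static_cast<unsigned char>(bytes[i + 1]);
            out.push_back(static_cast<char16_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < n) {
            const unsigned b1 = static_cast<unsigned char>(bytes[i + 1]);
            const unsigned b2 = static_cast<unsigned char>(bytes[i + 2]);
            out.push_back(static_cast<char16_t>(
                ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)));
            i += 3;
        } else {
            out.push_back(static_cast<char16_t>(0xFFFD));
            i += 1;
        }
    }
    return out;
}

// Both spellings of the file name decode to the same UTF-16 text.
bool bothSpellingsAgree()
{
    return fromUtf8Sketch("\xc3\xa9.txt") == fromLatin1Sketch("\xe9.txt");
}
```

Because both literals use only hex escapes and ASCII, the bytes in the object file do not depend on what the compiler assumes about the source encoding.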

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products




