[SG16-Unicode] [isocpp-core] Source file encoding

Tom Honermann tom at honermann.net
Wed Aug 14 15:37:35 CEST 2019


On 8/14/19 5:00 AM, Corentin wrote:
>
>
> On Wed, Aug 14, 2019, 4:17 AM Tom Honermann via Core
> <core at lists.isocpp.org> wrote:
>
>     Niall, this is again off topic for this thread.  But now that you put
>     this out there, I feel obligated to respond.  But please start a new
>     thread with a different set of mailing lists if you wish to continue
>     this any further; this is not a CWG issue.
>
>     On 8/13/19 12:03 PM, Niall Douglas via Liaison wrote:
>     > On 13/08/2019 15:27, Herring, Davis via Core wrote:
>     >>> Is it politically feasible for C++ 23 and C 2x to require
>     >>> implementations to default to interpreting source files as
>     >>> either (i) 7 bit ASCII or (ii) UTF-8? To be specific, char
>     >>> literals would thus be either 7 bit ASCII or UTF-8.
>     >> We could specify the source file directly as a sequence of ISO
>     >> 10646 abstract characters, or even as a sequence of UTF-8 code
>     >> units, but the implementation could choose to interpret the disk
>     >> file to contain KOI-7 N1 with some sort of escape sequences for
>     >> other characters.  You might say "That's not UTF-8 on disk!", to
>     >> which the implementation replies "That's how my operating system
>     >> natively stores UTF-8." and the standard replies "What's a disk?".
>     > I think that's an unproductive way of looking at the situation.
>     >
>     > I'd prefer to look at it this way:
>     >
>     >
>     > 1. How much existing code gets broken if, when recompiled as
>     > C++ 23, the default is now to assume UTF-8 input unless the input
>     > is obviously not that?
>     *All* code built on non-ASCII platforms, some amount of code
>     (primarily in regions outside the US) that is currently built with
>     the Microsoft compiler and encoded according to the Windows Active
>     Code Page for that region, and source code encoded in Shift-JIS or
>     GB18030.
>     >
>     > (My guess: a fair bit of older code will break, but almost all of it
>     > will never be compiled as C++ 23)
>
>     I think you'll need to find a way to measure the breakage if you
>     want to pursue such a change.
>
>     Personally, I don't think this is the right approach, as adding
>     more assumptions about encodings seems likely to lead to even more
>     problems. My preference is to focus on explicit solutions like
>     adding an encoding pragma, similar to what is done in Python and
>     HTML and to existing practice for IBM's xlC compiler
>     (https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/zos_pragma_filetag.htm).
>
>
>
> Except that all cross-platform (Windows, Linux, Mac) code ever
> written - which includes all of GitHub, etc. - already uses ASCII or
> UTF-8. Most internal code already avoids characters outside the basic
> character set, because developers know they are not portable.
I lack confidence that this is true, so citation needed please.  I know 
that Shift-JIS (for example) is still in use and we hear that from 
Microsoft representatives.  Regardless, I think it is a mistake to 
assume that cross-platform code is more important than code that is 
written for specific platforms.
>
> So while I find the idea of a pragma interesting, I question whether
> it is the right default. I do not want to have to do that to 100% of
> the code I have or will ever write.

It would certainly be the wrong default if we were doing a clean room 
design.  But we are evolving a language that has been around for several 
decades and that inherits from a language that was around for 
considerably longer.

>
> It doesn't mean a pragma would not be helpful for people working on
> an old code base so they can transition away from code page encodings
> if they are, e.g., a Windows-only shop. I think it would very much be.
>
> I think it would also be useful to encourage utf8 by default even if 
> that would have no impact whatsoever on existing toolchains.
I agree. I strongly think the right approach is:

 1. Keep source file encoding implementation defined.
 2. Introduce the pragma option to explicitly specify per-source-file
    encoding.
 3. Encourage implementors to provide options to default the assumed
    source file encoding to UTF-8 (in practice, most already provide
    this).
 4. Encourage projects to pass /source-file-encoding-is-utf-8 (however
    spelled) to their compiler invocations.

That approach approximates the "right" default fairly closely if (4) is 
followed (which may be an existing trend).
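
As a rough illustration of (2) and (4): the pragma spelling below is
hypothetical (C++ has no such pragma today; I am modeling it on IBM
xlC's existing #pragma filetag), while the flags in the trailing
comments are the spellings MSVC and GCC/Clang actually accept.

    // Hypothetical per-file declaration for (2); not a real C++ pragma.
    // IBM xlC's existing equivalent is #pragma filetag("IBM-1047").
    #pragma encoding("UTF-8")

    // With the encoding pinned in the file itself, this literal's bytes
    // mean the same thing no matter what the compiler's default is:
    const char* greeting = "grüße";  // "ü" (U+00FC) is UTF-8 bytes 0xC3 0xBC

    // For (4), the existing per-invocation options, however spelled:
    //   MSVC:      cl /source-charset:utf-8 ...
    //   GCC/Clang: g++ -finput-charset=UTF-8 ...

The attraction of (2) over (4) alone is that the declaration travels
with the file, which is what gradual, per-source-file adoption (as
discussed further down) requires.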

>
>
> But at the same time, it seems it would be beneficial to restrict
> features that require Unicode - including literals and identifiers
> outside of the basic character sets - to Unicode source files.
> The intent is that making such a program ill-formed (no diagnostic
> required) encourages a warning, which I really want to have when the
> compiler is not interpreting my UTF-8 source as UTF-8.
I strongly disagree with this.  I think you are conflating two distinct 
things (source file encoding and support for Unicode) as a proxy to get 
a diagnostic that, in practice, would not be reliable.
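
To see why it can't be reliable: nearly every byte sequence is valid in
some single-byte code page, so a compiler that guesses the wrong
encoding usually has nothing to diagnose. A minimal sketch, assuming the
snippet below sits in a UTF-8 encoded source file:

    // In UTF-8, "é" (U+00E9) is stored as the two bytes 0xC3 0xA9.
    // A compiler assuming Windows-1252 reads those same bytes as the
    // perfectly valid two-character sequence "Ã©" - no error, no warning.
    const char* s = "é";

    // The literal's contents (and, after conversion to the execution
    // character set, possibly even its length) silently differ depending
    // on the assumed source encoding, with no diagnostic either way.
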
>
>
> You could argue that people on Windows
> can just compile with /source-charset:utf-8, which yes they can and
> should (it's standard practice in Qt, vcpkg, etc), but avoiding
> potentially lossy re-encoding due to a wrong presumption about how a
> text file was encoded would help people write portable code, with the
> assurance that the compiler would not misinterpret their intent
> silently.
>
> I agree with you that reinterpreting all existing code overnight as
> UTF-8 would hinder the adoption of future C++ versions enough that we
> should probably avoid doing that, but maybe a slight encouragement to
> use UTF-8 would be beneficial to everyone.
>
> I agree with Niall, people in NA/Europe underestimate the extent of 
> the issue with source encoding.

I agree with this.  But I think there is a reverse underestimation as 
well - that being the extent to which people outside English-speaking 
regions use non-UTF-8 encodings. IBM/Windows code pages and the ISO-8859 
series of character sets have a long history.  I think there is good 
reason to believe they are still in use, particularly in older code bases.

Tom.

>
>
>
>
>     >
>     >
>     > 2. How much do we care if code containing non-UTF8 high bit
>     > characters in its string literals breaks when the compiler
>     > language version is set to C++ 23 or higher?
>     >
>     > (My opinion: people using non-ASCII in string literals without an
>     > accompanying unit test to verify the compiler is doing what you
>     > assumed deserve to experience breakage)
>
>     Instead of non-ASCII, I think you mean characters outside the basic
>     source character set.
>
>     Testing practices have varied widely over time and across projects.
>     I don't think it is acceptable to consider it OK for other people's
>     code to break because it wasn't developed to your standards.
>
>     >
>     >
>     > 3. What is the benefit to the ecosystem if the committee
>     > standardises Unicode source files moving forwards?
>     >
>     > (My opinion: people consistently underestimate the benefit if they
>     > live in North America and work only with North American source
>     > code. I've had contracts in the past where a full six weeks of my
>     > life went on attempting mostly lossless up-conversions from
>     > multiple legacy encoded source files into UTF-8 source files.
>     > Consider that most, but not all, use of high bit characters in
>     > string literals is typically for testing that i18n code works
>     > right in various borked character encodings, so yes, fun few
>     > weeks. And by the way, there is an *amazing* Python module full of
>     > machine learning heuristics for lossless upconverting legacy
>     > encodings to UTF-8, it saved me a ton of work)
>     I agree we need to provide better means for handling source file
>     encodings.  But this all-or-nothing approach strikes me as very
>     costly. Many applications are composed from multiple projects.
>     Improving support for UTF-8 encoded source files will require means
>     to adopt them gradually.  That means that there will be scenarios
>     where a single TU is built from differently encoded source files.
>     We need a more fine-grained solution.
>     >
>     >
>     > But all the above said:
>     >
>     > 4. Is this a productive use of committee time, when it would
>     > displace other items?
>     >
>     > (My opinion: No, probably not, we have much more important stuff
>     > before WG21 for C++ 23. However I wouldn't say the same for WG14;
>     > personally, I think there is a much bigger bang for the buck over
>     > there. Hence I ask here for objections; if none, I'll ask WG14
>     > what they think of the idea)
>
>     I think this is a productive use of SG16's time.  I don't think it
>     is a productive use of the rest of the committee's time until we
>     have a proposal to offer.
>
>     Tom.
>
>     >
>     >
>     > Niall
>     > _______________________________________________
>     > Liaison mailing list
>     > Liaison at lists.isocpp.org
>     > Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/liaison
>     > Link to this post: http://lists.isocpp.org/liaison/2019/08/0009.php
>
>
>     _______________________________________________
>     Core mailing list
>     Core at lists.isocpp.org
>     Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>     Link to this post: http://lists.isocpp.org/core/2019/08/7045.php
>
