[SG16-Unicode] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

Tom Honermann tom at honermann.net
Wed Aug 14 04:39:23 CEST 2019


On 8/13/19 8:35 AM, Corentin Jabot wrote:
>
> Chiming in with my favorite solution:
>
>   * Forbid lossy source -> presumed execution encoding conversion
>     (already ill-formed in GCC but not MSVC)
>
I think this may be reasonable.
>
>   * Forbid u8/u16/u32 literals in non-Unicode-encoded files
>
I don't understand this at all.  u8/u16/u32 specify the encoding to be 
used at run-time.  The source file encoding isn't relevant at all (as 
Steve noted, source file characters are converted to internal encoding).
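
A minimal sketch of that point: the literal prefixes (spelled u8, u, and U in the language) select the encoding of the resulting code units, independent of how the source file is stored. The characters are written as universal-character-names so the source file encoding plays no part at all. The helper name code_units is made up for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE), written as a
// universal-character-name rather than as a raw source character.
static_assert(code_units(u8"\u00E9") == 2, "UTF-8: two code units");
static_assert(code_units(u"\u00E9") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u00E9") == 1, "UTF-32: one code unit");

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one code unit");

// The first UTF-8 code unit of U+00E9 is 0xC3. The cast makes this work
// whether u8 literals have element type char (C++17) or char8_t (C++20).
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3,
              "UTF-8 lead byte of U+00E9");
```

Every assertion holds regardless of the encoding the source file was saved in, since only basic source characters appear in the literals.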
>
>   * Expose the "presumed execution encoding" (= "narrow/wide character
>     literal encoding") as a consteval function returning the name as
>     specified by IANA:
>     https://www.iana.org/assignments/character-sets/character-sets.txt
>
This may be useful, but needs more justification (preferably in the form 
of a paper).
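
For discussion purposes only, here is a rough sketch of what such a query might look like. Nothing here is a proposed or existing interface: the function names are hypothetical, constexpr stands in for consteval so the sketch compiles as C++17, and the detection is a crude probe that can only recognize UTF-8. It also assumes U+00E9 is representable in the narrow literal encoding.

```cpp
// Hypothetical probe: does the narrow literal encoding encode U+00E9
// as the two UTF-8 bytes 0xC3 0xA9?
constexpr bool narrow_literals_look_like_utf8() {
    constexpr const char probe[] = "\u00E9";
    return sizeof(probe) == 3
        && static_cast<unsigned char>(probe[0]) == 0xC3
        && static_cast<unsigned char>(probe[1]) == 0xA9;
}

// Hypothetical query returning an IANA-registered name when the probe
// recognizes the encoding, and a placeholder otherwise.
constexpr const char* presumed_execution_encoding_name() {
    return narrow_literals_look_like_utf8() ? "UTF-8" : "unknown";
}
```

A real facility would need the compiler to report the encoding directly; a probe like this cannot distinguish among the many non-UTF-8 encodings.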

> I would expect changing the encoding of char would break everything... 
> I'd leave char and wchar_t mostly alone and start clean on char8_t.
I agree, but I don't think that will be sufficient.  Not all projects 
are going to adopt char8_t.  A substantial portion, especially on 
Linux/UNIX systems, will choose to continue using UTF-8 with char.  I 
think we're going to have to provide Unicode support for char and 
char8_t (and char16_t, and perhaps char32_t).
>
> Anyhow, I agree with Tom that the names are not indicative.
> How about: "narrow/wide character literal encoding" ?

"execution encoding" has a long history in both WG14 and WG21 (though 
not POSIX I think) and that makes me reluctant to try and challenge it.  
In the Slack discussion, I think Steve Downey probably hit on the right 
approach: provide a formal definition of it.  I think we *might* be 
successful in using "execution encoding" to apply to both the 
compile-time and run-time encodings by extending the term with specific 
qualifiers; e.g., "presumed execution encoding" and 
"run-time/system/native execution encoding".

Tom.

>
>
>
>
> On Tue, 13 Aug 2019 at 10:39, Niall Douglas
> <s_sourceforge at nedprod.com> wrote:
>
>     Before progressing with a solution, can I ask the question:
>
>     Is it politically feasible for C++23 and C2x to require
>     implementations to default to interpreting source files as either
>     (i) 7-bit ASCII or (ii) UTF-8? To be specific, char literals would
>     thus be either 7-bit ASCII or UTF-8.
>
>     (The reason for the 7-bit ASCII is that it is a perfect subset of
>     UTF-8, and that C very much wants to retain the language being
>     implementable in a small code base, i.e. without UTF-8 support.
>     Note the qualifier "default" as well.)
>
>     An answer to the above would determine how best to solve your
>     issue, Tom, I think. As much as we all expect IBM et al. to veto
>     such a proposal, one never gets anywhere without asking first.
>
>     Niall
>
>     On 13/08/2019 03:25, Tom Honermann wrote:
>     > I agree with this (mostly), but would prefer not to discuss it
>     > further in this thread.  The only reason I included the filesystem
>     > references is because the wording there uses "native" for an
>     > encoding that is related to (though distinct from) the encodings
>     > referenced in the codecvt and ctype wording, where "native" is also
>     > used.  This suggests that "native" serves (or should serve) a role
>     > in naming these run-time encodings, or is a source of conflation
>     > (or both).
>     >
>     > Tom.
>     >
>     > On 8/12/19 5:08 PM, Niall Douglas wrote:
>     >>>   1. [fs.path.type.cvt]p1 <http://eel.is/c++draft/fs.path.type.cvt#1>:
>     >>>      (though the definition provided here appears to be specific
>     >>>      to path names).
>     >>>      "The /native encoding/ of an ordinary character string is the
>     >>>      operating system dependent current encoding for path names.
>     >>>      The /native encoding/ for wide character strings is the
>     >>>      implementation-defined execution wide-character set encoding."
>     >> We discussed the problems with the choice of normative wording in
>     >> http://eel.is/c++draft/fs.class.path#fs.path.cvt, if you remember,
>     >> during SG16's discussion of filesystem::path_view.
>     >>
>     >> The problem is that filesystem paths have different encoding and
>     >> interpretation per-path-component i.e. for a path
>     >>
>     >> /A/B/C/D
>     >>
>     >> ... A, B, C and D may each have its own, individual, encoding and
>     >> interpretation depending on the mount points and filesystems
>     >> configured on the current system. This is not what is suggested by
>     >> the current normative wording, which appears to think that some
>     >> mapping exists between C++ paths and OS kernel paths.
>     >>
>     >> There *is* a mapping, but it is 100% C++-side. The OS kernel
>     >> generally consumes arrays of bytes.
>     >>
>     >> A more correct normative wording would more clearly separate these
>     >> two kinds of path representation. OS kernel paths are arrays of
>     >> `byte`, but with certain implementation-defined byte sequences not
>     >> permitted. C++ paths can be in char, wchar_t, char8_t, char16_t,
>     >> char32_t etc, and there are well defined conversions between those
>     >> C++ paths and the array of bytes supplied to the OS kernel. The
>     >> standard can say nothing useful about how the OS kernel may
>     >> interpret the byte array C++ supplies to it.
>     >>
>     >> If path_view starts the standards track, I'll need to propose a
>     >> document fixing up http://eel.is/c++draft/fs.class.path#fs.path.cvt
>     >> in any case. But to come back to your original question, I think
>     >> that you ought to split off filesystem paths from everything else,
>     >> consider them separate, and then I think you'll find it much easier
>     >> to make the non-path normative wording more consistent.
>     >>
>     >> Niall
>     >> _______________________________________________
>     >> SG16 Unicode mailing list
>     >> Unicode at isocpp.open-std.org
>     >> http://www.open-std.org/mailman/listinfo/unicode
>     >
>     >
>
>

