[SG16-Unicode] Abstract and notes for D1859R0: Standard terminology for execution character set encodings

Tom Honermann tom at honermann.net
Mon Sep 9 05:16:38 CEST 2019


On 9/8/19 12:02 PM, Steve Downey wrote:
> Character repertoire sounds good, and I will eventually learn to spell 
> it. Character set is definitely terminology from the pre-Unicode 
> times, and unfortunately tends to merge the repertoire and encoding, 
> https://www.iana.org/assignments/character-sets/character-sets.xhtml

I think I was a little overzealous earlier in stating that Unicode uses 
"character repertoire" as I described.  I looked again and don't find 
that term formally defined in the standard. However, "repertoire" is 
used throughout the standard in ways that I believe are consistent with 
my description.  I wasn't able to find an alternative formal term.

The way I've been thinking about it is that a "character repertoire" 
describes a set of /abstract characters/ (a formal Unicode term) and a 
"character set" describes a set of /encoded characters/ (a formal 
Unicode term) that associate each /abstract character/ member of a 
"character repertoire" with a /code point/ (a formal Unicode term) 
within a /codespace/ (a formal Unicode term).  See sections 2.4 and 3.4 
of Unicode 12 and uses of the word "repertoire" within those chapters.  
The Unicode standard does use the term "character set", but I didn't 
find a formal definition.
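
To make the distinction concrete, here is a rough sketch of how I picture the two notions as data. Nothing in it comes from the Unicode standard or from the paper: the type aliases and function names (Repertoire, CodedCharacterSet, repertoire_of, unicode_sample, ebcdic037_sample) are hypothetical, and the EBCDIC code page 037 values are included only as an example of a second character set over the same repertoire.

```cpp
#include <map>
#include <set>
#include <string>

// A character repertoire: a set of abstract characters, identified here
// purely by name. No numeric values are involved. (Hypothetical model.)
using Repertoire = std::set<std::string>;

// A character set in the sense used above: each abstract character of a
// repertoire is associated with a code point in some codespace.
using CodedCharacterSet = std::map<std::string, char32_t>;

// Forget the code points and keep only the abstract characters.
Repertoire repertoire_of(const CodedCharacterSet& cs) {
    Repertoire names;
    for (const auto& entry : cs) {
        names.insert(entry.first);
    }
    return names;
}

// Three abstract characters with their Unicode code points.
CodedCharacterSet unicode_sample() {
    return {{"LATIN CAPITAL LETTER A", 0x0041},
            {"DIGIT ZERO", 0x0030},
            {"LEFT PARENTHESIS", 0x0028}};
}

// The same three abstract characters with their EBCDIC code page 037 values.
CodedCharacterSet ebcdic037_sample() {
    return {{"LATIN CAPITAL LETTER A", 0xC1},
            {"DIGIT ZERO", 0xF0},
            {"LEFT PARENTHESIS", 0x4D}};
}
```

In this model, repertoire_of(unicode_sample()) and repertoire_of(ebcdic037_sample()) compare equal, while the two character sets themselves do not. That is the sense in which the repertoire and the code point assignment are separate things.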

>
> Basic source character set is defined in [lex.charset] 
> http://eel.is/c++draft/lex.charset#def:character_set,basic_source
Yes, and it defines a character repertoire.  "Physical source file 
characters" is the closest I've found to a term that describes the 
actual implementation-defined source character set.
>
> I'd like to get away from "execution encoding" because it conflates 
> the presumed encoding and the one selected by the current locale. Now, 
> admittedly, everyone conflates these and it's a source of error and 
> mojibake, but perhaps with better words it would be easier to teach.
I agree.  I like "dynamic encoding" because it accurately reflects the 
reality that the encoding can be changed dynamically (by calls to 
std::setlocale).
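
A small sketch of what "dynamic" means in practice, using only std::setlocale and std::mbrtowc. The helper name decoded_length is hypothetical, and which locale names are installed varies by system, so the function reports failure rather than assuming a locale exists.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <string>

// Counts how many characters `bytes` decodes to under the encoding of the
// named locale and stores the result in `count`. Returns false if the locale
// is not available or the bytes are not a valid, complete sequence in that
// locale's encoding. (Hypothetical helper, not a standard facility.)
bool decoded_length(const char* locale_name, const char* bytes,
                    std::size_t& count) {
    assert(locale_name != nullptr && bytes != nullptr);

    // Remember the current locale so it can be restored afterwards.
    const char* current = std::setlocale(LC_ALL, nullptr);
    const std::string saved = current ? current : "C";

    // Switching the locale is what switches the encoding mbrtowc uses.
    if (std::setlocale(LC_ALL, locale_name) == nullptr) {
        return false;  // locale unavailable; the locale was not changed
    }

    bool ok = true;
    std::size_t n = 0;
    std::mbstate_t state{};
    std::size_t remaining = std::strlen(bytes);
    while (remaining > 0) {
        const std::size_t used =
            std::mbrtowc(nullptr, bytes, remaining, &state);
        if (used == static_cast<std::size_t>(-1) ||
            used == static_cast<std::size_t>(-2)) {
            ok = false;  // invalid or truncated multibyte sequence
            break;
        }
        if (used == 0) {
            break;  // decoded a null character; stop counting
        }
        bytes += used;
        remaining -= used;
        ++n;
    }

    std::setlocale(LC_ALL, saved.c_str());
    if (ok) {
        count = n;
    }
    return ok;
}
```

The same byte sequence can give different answers depending on the locale passed in. For example, the two bytes "\xC3\xA9" decode to a single character where a UTF-8 locale such as "C.UTF-8" is available, whereas the result under other locales depends on the implementation. The literal that produced those bytes was fixed at compile time; only the interpretation moves.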
>
> As to UB. I'd like, if possible, to avoid creating new UB classes. 
> Some things should probably be ill-formed, like unencodable 
> characters. Others fall into existing UB, like specifying an inline 
> string literal with two different encodings. Reading a string with the 
> wrong encoding, I think, should be at worst unspecified, unless for 
> some reason your decoder has UB, in which case it's the decoder's 
> problem, not the incorrect or mixed encoding issue. That said, I'd 
> defer to Core on this.
Wherever Core says we can get away with unspecified, I'm all for it.
>
> Internal encoding is required to preserve distinct universal character 
> names and treat all representations of the same universal character 
> the same. So, the standard effectively requires Unicode, but in terms 
> of observables.

Agreed, I don't think anything is accomplished by trying to prescribe 
implementation details.
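
For what it's worth, the observable side of this can be checked at compile time without knowing anything about the compiler's internal representation. The following is only a sketch of the kind of guarantee being discussed; the function name ucn_matches_direct_spelling is hypothetical.

```cpp
// A universal-character-name and the directly spelled character denote the
// same character, so they must compare equal in every kind of literal,
// whatever encoding the implementation uses internally.
static_assert('\u0041' == 'A', "ordinary character literal");
static_assert(L'\u0041' == L'A', "wide character literal");
static_assert(u'\u0041' == u'A', "char16_t character literal");
static_assert(U'\u0041' == U'A', "char32_t character literal");

// For the Unicode character types the encoding is fixed, so the code unit
// values themselves are observable and portable.
static_assert(u'\u00E9' == 0x00E9, "UTF-16 code unit for U+00E9");
static_assert(U'\U0001F600' == 0x1F600, "UTF-32 code unit for U+1F600");
static_assert(sizeof(u8"\u00E9") == 3, "two UTF-8 code units plus the null");

// The same comparison for string literals of each kind.
constexpr bool ucn_matches_direct_spelling() {
    return "\u0041"[0] == "A"[0] &&
           L"\u0041"[0] == L"A"[0] &&
           u8"\u0041"[0] == u8"A"[0] &&
           u"\u0041"[0] == u"A"[0] &&
           U"\u0041"[0] == U"A"[0];
}

static_assert(ucn_matches_direct_spelling(), "UCN and direct spelling agree");
```

None of these checks say how the translator stores characters between phases, only what a program can observe, which is all the wording needs to constrain.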

Tom.

>
>
>
> On Sun, Sep 8, 2019 at 5:39 AM Corentin Jabot <corentinjabot at gmail.com 
> <mailto:corentinjabot at gmail.com>> wrote:
>
>
>
>     On Sun, 8 Sep 2019 at 05:46, Tom Honermann <tom at honermann.net
>     <mailto:tom at honermann.net>> wrote:
>
>         On 9/5/19 9:41 PM, Steve Downey wrote:
>>         Because I needed to circulate what I'm doing for Belfast,
>>         I've thrown together an abstract for the paper we've
>>         peripherally discussed about modernizing and tightening the
>>         specification around encodings of characters generally, and
>>         the source and execution character sets.
>>
>>         "
>>         This document proposes new standard terms for the various
>>         encodings for character and string literals, and the
>>         encodings associated with some character types. It also
>>         proposes that the wording used for [lex.charset], [lex.ccon],
>>         [lex.string], and [basic.fundamental] 8 be modified to
>>         reflect the new terminology. This paper does not intend to
>>         propose any changes that would require changes in any
>>         currently conforming implementation.
>>         "
>>
>>         I'm hoping to have some preliminary work by the next telecon.
>>         The direction I'm thinking is that both Source and Execution
>>         Character Set are descriptions of the abstract characters,
>>         selected from 10646, that must be present to support C++.
>>         Encodings, both source and execution, are implementation
>>         defined. I would like to introduce terminology to describe
>>         the encoding used when translating narrow and wide character
>>         and string literals. I'd also like to make it explicit
>>         somewhere up front that there are associated encodings for
>>         some, but not all, character types. This is mentioned now in
>>         filesystem, but should be moved to a section with wider
>>         scope. The encoding for `char` and `wchar_t` is controlled by
>>         `locale`. The encoding for the unicode character types is
>>         fixed. The encoding used for literals was chosen at compile
>>         time, and is implementation defined. If locale and that
>>         encoding conflict, behavior is unspecified. Combining TUs
>>         with different encodings is in general unspecified, unless it
>>         results in an ODR violation.
>         This all sounds great.  My only question is behavior being
>         unspecified vs undefined.  It seems challenging to get away
>         with making it only unspecified.
>
>
>     Specifically, I'd like something along the line of:
>     If a character literal contains a c-char that does not have the same
>     representation in the character literal encoding (aka "presumed"
>     execution encoding) and the execution encoding, the behavior is
>     undefined.
>
>
>
>>
>>         Some possible terms:
>>         {"",Narrow,Wide} Literal Encoding - encoding on char and
>>         string literals
>>         Dynamic Encoding - encoding implied by locale
>>         *Character Set - A set of abstract characters ( Latin Capital
>>         letter A, Digit Zero, Left Parenthesis ...)
>         Unicode uses "character repertoire" for abstract sets of
>         characters.  I favor following suit there.
>
>
>     +1 to sticking to Unicode terms
>
>>         *Basic Character Set - minimum required to be encoded
>>         *Extended Character Set - what can be encoded
>>         *Source Character Set - must be encodable in C++ source
>         I don't think "source character set" is defined today.  The
>         closest we get is "Physical source file characters" in
>         [lex.phases]p1 <http://eel.is/c++draft/lex.phases#1.1>.
>>         *Execution Character Set - Source + control characters
>
>
>     Be careful not to break that code
>     https://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers
>     More seriously, I think it would be beneficial (necessary even) to
>     have a source character encoding / character repertoire.
>
>
>     I wonder if we could specify that the internal character
>     repertoire is Unicode. It kinda has to be already; we could make that clearer.
>
>
>     I would also propose
>
>     Universal Character Name -> Unicode Code point
>     (character name should be reserved to the \N proposal)
>
>
>>
>>         * Current terms, with what I think the actual meanings are today.
>>
>>
>         I think these are good.  With these, there is no need for a
>         term like "execution encoding", correct? At compile-time,
>         "literal encoding" encodes "execution character set"
>         characters, and at run-time, "dynamic encoding" encodes
>         "extended character set" characters, yes?
>
>     I prefer "execution" to dynamic
>
>         I like that this doesn't stray far from the existing terms.
>
>         Tom.
>
>         _______________________________________________
>         SG16 Unicode mailing list
>         Unicode at isocpp.open-std.org <mailto:Unicode at isocpp.open-std.org>
>         http://www.open-std.org/mailman/listinfo/unicode
>
