[SG16-Unicode] Abstract and notes for D1859R0: Standard terminology for execution character set encodings

Tom Honermann tom at honermann.net
Sun Sep 8 05:46:20 CEST 2019


On 9/5/19 9:41 PM, Steve Downey wrote:
> Because I needed to circulate what I'm doing for Belfast, I've thrown 
> together an abstract for the paper we've peripherally discussed about 
> modernizing and tightening the specification around encodings of 
> characters generally, and the source and execution character sets.
>
> "
> This document proposes new standard terms for the various encodings 
> for character and string literals, and the encodings associated with 
> some character types. It also proposes that the wording used for 
> [lex.charset], [lex.ccon], [lex.string], and [basic.fundamental] 8 be 
> modified to reflect the new terminology. This paper does not intend to 
> propose any changes that would require changes in any currently 
> conforming implementation.
> "
>
> I'm hoping to have some preliminary work by the next telecon. The 
> direction I'm thinking is that both Source and Execution Character Set 
> are descriptions of the abstract characters, selected from 10646, that 
> must be present to support C++. Encodings, both source and execution, 
> are implementation defined. I would like to introduce terminology to 
> describe the encoding used when translating narrow and wide character 
> and string literals. I'd also like to make it explicit somewhere up 
> front that there are associated encodings for some, but not all, 
> character types. This is mentioned now in filesystem, but should be 
> moved to a section with wider scope. The encoding for `char` and 
> `wchar_t` is controlled by `locale`. The encoding for the Unicode 
> character types is fixed. The encoding used for literals was chosen at 
> compile time, and is implementation defined. If locale and that 
> encoding conflict, behavior is unspecified. Combining TUs with 
> different encodings is in general unspecified, unless it results in an 
> ODR violation.
This all sounds great.  My only question is whether the behavior should 
be unspecified or undefined.  It seems challenging to get away with 
making it only unspecified.
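
To make the fixed vs. implementation-defined distinction concrete, here 
is a minimal sketch (not taken from the paper; the helper `code_units` 
and the constant `narrow_units` are names made up for illustration). 
The `char16_t` and `char32_t` checks hold on any conforming 
implementation, while the narrow literal's length depends on the 
implementation-defined literal encoding, so nothing is asserted about 
its exact value.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a UTF-16 string literal,
// excluding the terminating null.
template <std::size_t N>
constexpr std::size_t code_units(const char16_t (&)[N]) { return N - 1; }

// char16_t literals are always UTF-16: U+00E9 is a single code unit and
// U+1F600 is encoded as the surrogate pair D83D DE00.
static_assert(u"\u00E9"[0] == 0x00E9, "U+00E9 is one UTF-16 code unit");
static_assert(code_units(u"\U0001F600") == 2, "U+1F600 needs a surrogate pair");
static_assert(u"\U0001F600"[0] == 0xD83D, "high surrogate");
static_assert(u"\U0001F600"[1] == 0xDE00, "low surrogate");

// char32_t literals are always UTF-32: one code unit per code point.
static_assert(U'\U0001F600' == 0x1F600, "UTF-32 code unit equals code point");

// Narrow literal: the number of bytes depends on the implementation-defined
// literal encoding (for example 2 with UTF-8, 1 with Latin-1), so the
// value is recorded here rather than asserted.
constexpr std::size_t narrow_units = sizeof("\u00E9") - 1;
```

An implementation whose narrow literal encoding cannot represent U+00E9 
may diagnose or substitute the last literal, which is the kind of 
variation the proposed terminology has to describe.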
>
> Some possible terms:
> {"",Narrow,Wide} Literal Encoding - encoding of character and string 
> literals
> Dynamic Encoding - encoding implied by locale
> *Character Set - A set of abstract characters (Latin Capital Letter 
> A, Digit Zero, Left Parenthesis, ...)
Unicode uses "character repertoire" for abstract sets of characters.  I 
favor following suit there.
> *Basic Character Set - minimum required to be encoded
> *Extended Character Set - what can be encoded
> *Source Character Set - must be encodable in C++ source
I don't think "source character set" is defined today.  The closest we 
get is "Physical source file characters" in [lex.phases]p1 
<http://eel.is/c++draft/lex.phases#1.1>.
> *Execution Character Set - Source + control characters
>
> * Current terms, with what I think the actual meanings are today.
>
>
I think these are good.  With these, there is no need for a term like 
"execution encoding", correct?  At compile-time, "literal encoding" 
encodes "execution character set" characters, and at run-time, "dynamic 
encoding" encodes "extended character set" characters, yes?
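
A rough sketch of that split, assuming nothing beyond the standard C 
library (the function name `count_chars_in_locale` is made up for 
illustration): the bytes of a narrow literal are fixed by the literal 
encoding at compile time, but how many characters they represent at 
run time is decided by the dynamic encoding of whichever locale is 
active when they are decoded.

```cpp
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: counts the characters in `bytes` when they are
// interpreted in the encoding of the named locale (the "dynamic encoding").
// Returns -1 if the locale is unavailable or the bytes are not valid in it.
long count_chars_in_locale(const char* bytes, const char* locale_name) {
    // Copy the current locale name; the returned pointer may be invalidated
    // by the next call to setlocale.
    const std::string saved = std::setlocale(LC_ALL, nullptr);
    if (std::setlocale(LC_ALL, locale_name) == nullptr) {
        return -1;  // locale not available; the current locale is unchanged
    }
    std::mbstate_t state = std::mbstate_t();
    const char* src = bytes;
    // With a null destination, mbsrtowcs only counts the wide characters
    // the conversion would produce.
    const std::size_t n = std::mbsrtowcs(nullptr, &src, 0, &state);
    std::setlocale(LC_ALL, saved.c_str());  // restore the previous locale
    return n == static_cast<std::size_t>(-1) ? -1 : static_cast<long>(n);
}
```

Passing the same literal with different locale names shows the 
mismatch case: bytes produced by a UTF-8 literal encoding for a 
non-ASCII character would typically count as one character in a UTF-8 
locale but may be rejected or counted differently elsewhere. Locale 
names other than "C" are platform specific, so none is assumed here.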

I like that this doesn't stray far from the existing terms.

Tom.
