[SG16-Unicode] Abstract and notes for D1859R0: Standard terminology for execution character set encodings
Tom Honermann
tom at honermann.net
Sun Sep 8 05:46:20 CEST 2019
On 9/5/19 9:41 PM, Steve Downey wrote:
> Because I needed to circulate what I'm doing for Belfast, I've thrown
> together an abstract for the paper we've peripherally discussed about
> modernizing and tightening the specification around encodings of
> characters generally, and the source and execution character sets.
>
> "
> This document proposes new standard terms for the various encodings
> for character and string literals, and the encodings associated with
> some character types. It also proposes that the wording used for
> [lex.charset], [lex.ccon], [lex.string], and [basic.fundamental] 8 be
> modified to reflect the new terminology. This paper does not intend to
> propose any changes that would require changes in any currently
> conforming implementation.
> "
>
> I'm hoping to have some preliminary work by the next telecon. The
> direction I'm thinking is that both Source and Execution Character Set
> are descriptions of the abstract characters, selected from 10646, that
> must be present to support C++. Encodings, both source and execution,
> are implementation defined. I would like to introduce terminology to
> describe the encoding used when translating narrow and wide character
> and string literals. I'd also like to make it explicit somewhere up
> front that there are associated encodings for some, but not all,
> character types. This is mentioned now in filesystem, but should be
> moved to a section with wider scope. The encoding for `char` and
> `wchar_t` is controlled by `locale`. The encoding for the unicode
> character types is fixed. The encoding used for literals was chosen at
> compile time, and is implementation defined. If locale and that
> encoding conflict, behavior is unspecified. Combining TUs with
> different encodings is in general unspecified, unless it results in an
> ODR violation.
This all sounds great. My only question is behavior being unspecified
vs undefined. It seems challenging to get away with making it only
unspecified.
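
To make the conflict concrete, here is a minimal sketch of what happens when
bytes produced under one encoding are interpreted under another. It assumes a
literal encoding of UTF-8 and a run-time interpretation of ISO-8859-1. The
helpers `decode_latin1`, `decode_utf8_basic`, `sample_literal_bytes`, and
`check_mismatch` are hypothetical names invented for illustration; they are
not part of any proposal or library, and the UTF-8 decoder deliberately
handles only 1- and 2-byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode bytes as ISO-8859-1, where every byte is
// exactly one code point.
std::u32string decode_latin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
    return out;
}

// Hypothetical helper: decode bytes as UTF-8, handling only 1- and 2-byte
// sequences. This is enough for the illustration; it is not a validating
// decoder.
std::u32string decode_utf8_basic(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {
            out.push_back(b0);
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < bytes.size()) {
            unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
            out.push_back(static_cast<char32_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            ++i;
        } else {
            out.push_back(U'\uFFFD');  // anything else: replacement character
        }
    }
    return out;
}

// Assumption: the bytes a compiler with a UTF-8 literal encoding would store
// for U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
inline std::string sample_literal_bytes() { return std::string("\xC3\xA9"); }

// The same two bytes yield one character under UTF-8 and two unrelated
// characters under ISO-8859-1.
inline void check_mismatch() {
    const std::string bytes = sample_literal_bytes();
    assert(decode_utf8_basic(bytes) == std::u32string(U"\u00E9"));
    assert(decode_latin1(bytes) == std::u32string(U"\u00C3\u00A9"));
}
```

Nothing in the bytes themselves records which interpretation was intended,
which is why I wonder whether "unspecified" is a strong enough description.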
>
> Some possible terms:
> {"",Narrow,Wide} Literal Encoding - encoding of character and string literals
> Dynamic Encoding - encoding implied by locale
> *Character Set - A set of abstract characters (Latin Capital Letter
> A, Digit Zero, Left Parenthesis, ...)
Unicode uses "character repertoire" for abstract sets of characters. I
favor following suit there.
> *Basic Character Set - minimum required to be encoded
> *Extended Character Set - what can be encoded
> *Source Character Set - must be encodable in C++ source
I don't think "source character set" is defined today. The closest we
get is "Physical source file characters" in [lex.phases]p1
<http://eel.is/c++draft/lex.phases#1.1>.
> *Execution Character Set - Source + control characters
>
> * Current terms, with what I think the actual meanings are today.
>
>
I think these are good. With these, there is no need for a term like
"execution encoding", correct? At compile-time, "literal encoding"
encodes "execution character set" characters, and at run-time, "dynamic
encoding" encodes "extended character set" characters, yes?
I like that this doesn't stray far from the existing terms.
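
As a small illustration of the compile-time side, the sketch below counts the
code units in a few literals. The `code_units` helper and the two constants
are hypothetical names used only here. The Unicode literal prefixes have fixed
encodings, so their counts can be asserted; the narrow and wide counts depend
on the implementation-defined literal encodings, so they are only computed.
It assumes the implementation can represent U+00E9 in its narrow literal
encoding.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// Characters are written as universal-character-names so the encoding of
// this source file does not affect the result.
static_assert(code_units(u8"\u00E9") == 2, "UTF-8: two code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one code unit");

// Narrow and wide literal encodings are implementation-defined, so these
// values can be inspected but not portably asserted.
constexpr std::size_t narrow_units = code_units("\u00E9");
constexpr std::size_t wide_units = code_units(L"\U0001F600");
```

None of these values can change at run time, whatever locale is later
installed.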
Tom.