<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 9/5/19 9:41 PM, Steve Downey wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJEGDKrtozGfAtzp0GewXKDdWm7AYrHgNBUgzC1tga33byLEtA@mail.gmail.com">
<div dir="ltr">Because I needed to circulate what I'm doing for
Belfast, I've thrown together an abstract for the paper we've
peripherally discussed about modernizing and tightening the
specification around encodings of characters generally, and the
source and execution character sets. <br>
<br>
"<br>
This document proposes new standard terms for the various
encodings for character and string literals, and the encodings
associated with some character types. It also proposes that the
wording used for [lex.charset], [lex.ccon], [lex.string], and
[basic.fundamental] 8 be modified to reflect the new
terminology. This paper does not intend to propose any changes
that would require changes in any currently conforming
implementation.<br>
"<br>
<br>
I'm hoping to have some preliminary work by the next telecon.
The direction I'm thinking is that both Source and Execution
Character Set are descriptions of the abstract characters,
selected from 10646, that must be present to support C++.
Encodings, both source and execution, are implementation
defined. I would like to introduce terminology to describe the
encoding used when translating narrow and wide character and
string literals. I'd also like to make it explicit somewhere up
front that there are associated encodings for some, but not all,
character types. This is mentioned now in filesystem, but should
be moved to a section with wider scope. The encoding for `char`
and `wchar_t` is controlled by `locale`. The encoding for the
Unicode character types is fixed. The encoding used for literals
is chosen at compile time and is implementation-defined. If
locale and that encoding conflict, behavior is unspecified.
Combining TUs with different encodings is in general unspecified,
unless it results in an ODR violation. <br>
</div>
</blockquote>
This all sounds great. My only question is whether the behavior
should be unspecified or undefined. It seems challenging to get away
with making it only unspecified.<br>
<blockquote type="cite"
cite="mid:CAJEGDKrtozGfAtzp0GewXKDdWm7AYrHgNBUgzC1tga33byLEtA@mail.gmail.com">
<div dir="ltr"><br>
Some possible terms:<br>
{"",Narrow,Wide} Literal Encoding - encoding on char and string
literals<br>
Dynamic Encoding - encoding implied by locale<br>
*Character Set - A set of abstract characters (Latin Capital
Letter A, Digit Zero, Left Parenthesis, ...)<br>
</div>
</blockquote>
Unicode uses "character repertoire" for abstract sets of
characters. I favor following suit there.<br>
<blockquote type="cite"
cite="mid:CAJEGDKrtozGfAtzp0GewXKDdWm7AYrHgNBUgzC1tga33byLEtA@mail.gmail.com">
<div dir="ltr">*Basic Character Set - minimum required to be
encoded<br>
*Extended Character Set - what can be encoded<br>
*Source Character Set - must be encodable in C++ source<br>
</div>
</blockquote>
I don't think "source character set" is defined today. The closest
we get is "Physical source file characters" in <a
moz-do-not-send="true"
href="http://eel.is/c++draft/lex.phases#1.1">[lex.phases]p1</a>.<br>
<blockquote type="cite"
cite="mid:CAJEGDKrtozGfAtzp0GewXKDdWm7AYrHgNBUgzC1tga33byLEtA@mail.gmail.com">
<div dir="ltr">*Execution Character Set - Source + control
characters<br>
<br>
* Current terms, with what I think the actual meanings are
today.<br>
<br>
<br>
</div>
</blockquote>
<p>I think these are good. With these, there is no need for a term
like "execution encoding", correct? At compile-time, "literal
encoding" encodes "execution character set" characters, and at
run-time, "dynamic encoding" encodes "extended character set"
characters, yes?</p>
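<p>As a minimal sketch of the compile-time half of that split (my own
illustration, not wording from the paper; the helper name
<code>narrow_units_for_e_acute</code> is hypothetical): the encodings
of <code>u8</code>, <code>u</code>, and <code>U</code> literals are
fixed, so their code units can be checked portably, while the narrow
literal encoding is implementation-defined, so the code unit count for
U+00E9 can only be observed, not asserted. This assumes the
implementation's narrow literal encoding can represent U+00E9.</p>
<pre>
```cpp
// Sketch only: illustrates the terminology discussed above, not proposed wording.

// The encodings of u8, u and U literals are fixed (UTF-8, UTF-16, UTF-32),
// so these checks hold on every conforming implementation.
static_assert(sizeof(u8"\u00E9") == 3,
              "U+00E9 is two UTF-8 code units plus the terminator");
static_assert(u"\u00E9"[0] == 0x00E9,
              "U+00E9 is a single UTF-16 code unit");
static_assert(sizeof(u"\U0001F600") == 3 * sizeof(char16_t),
              "U+1F600 is a surrogate pair plus the terminator");
static_assert(U"\U0001F600"[0] == 0x1F600,
              "U+1F600 is a single UTF-32 code unit");

// Members of the basic character set occupy one code unit in a narrow literal.
static_assert(sizeof("A") == 2, "one code unit plus the terminator");

// The narrow literal encoding is implementation-defined and chosen at
// translation time. This hypothetical helper reports how many code units the
// implementation used for U+00E9: for example 2 under UTF-8, 1 under ISO-8859-1.
constexpr int narrow_units_for_e_acute() {
    return static_cast<int>(sizeof("\u00E9")) - 1;
}
```
</pre>
<p>The dynamic encoding cannot be checked this way, since it depends
on the locale in effect at run time.</p>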
<p>I like that this doesn't stray far from the existing terms.<br>
</p>
<p>Tom.<br>
</p>
</body>
</html>