<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">On 9/8/19 12:02 PM, Steve Downey wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAJEGDKrpg0y0bg13aJjr1-idow8JU6R84DZM5c_oL9ti9c1Lzg@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">Character repertoire sounds good, and I will
        eventually learn to spell it. Character set is
        definitely terminology from pre-Unicode times, and
        unfortunately tends to merge the repertoire and encoding, <a
href="https://www.iana.org/assignments/character-sets/character-sets.xhtml"
          moz-do-not-send="true">https://www.iana.org/assignments/character-sets/character-sets.xhtml</a><br>
      </div>
    </blockquote>
    <p>I think I was a little overzealous earlier in stating that
      Unicode uses "character repertoire" as I described.  I looked
      again and don't find that term formally defined in the standard. 
      However, "repertoire" is used throughout the standard in ways that
      I believe are consistent with my description.  I wasn't able to
      find an alternative formal term.<br>
    </p>
    <p>The way I've been thinking about it is that a "character
      repertoire" describes a set of <i>abstract characters</i> (a
      formal Unicode term) and a "character set" describes a set of <i>encoded
        characters</i> (a formal Unicode term) that associate each <i>abstract
        character</i> member of a "character repertoire" with a <i>code
        point</i> (a formal Unicode term) within a <i>codespace</i> (a
      formal Unicode term).  See sections 2.4 and 3.4 of Unicode 12 and
      uses of the word "repertoire" within those chapters.  The Unicode
      standard does use the term "character set", but I didn't find a
      formal definition.<br>
    </p>
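    <p>To make the distinction concrete, here is a minimal sketch, for
      illustration only.  The type and function names are hypothetical,
      and the code points are my reading of ASCII and EBCDIC code page
      037.  The same three-member repertoire appears in two character
      sets that assign different code points to each member.</p>
    <pre>
```cpp
#include <cassert>
#include <cstring>

// Hypothetical illustration type; not a name from the C++ or Unicode standards.
struct EncodedCharacter {
    const char* abstract_character;  // identified here by its Unicode name
    unsigned code_point;             // position within the set's codespace
};

// Two character sets covering the same three-member repertoire.
const EncodedCharacter ascii_subset[] = {
    {"LATIN CAPITAL LETTER A", 0x41},
    {"DIGIT ZERO", 0x30},
    {"LEFT PARENTHESIS", 0x28},
};
const EncodedCharacter ebcdic037_subset[] = {
    {"LATIN CAPITAL LETTER A", 0xC1},
    {"DIGIT ZERO", 0xF0},
    {"LEFT PARENTHESIS", 0x4D},
};

// Returns the code point assigned to the named abstract character, or -1 if
// the character is not a member of the set's repertoire.
long code_point_of(const EncodedCharacter* set, int count, const char* name) {
    for (int i = 0; i != count; ++i) {
        if (std::strcmp(set[i].abstract_character, name) == 0) {
            return set[i].code_point;
        }
    }
    return -1;
}

// Same repertoire membership, different code points.
void check_example() {
    assert(code_point_of(ascii_subset, 3, "DIGIT ZERO") == 0x30);
    assert(code_point_of(ebcdic037_subset, 3, "DIGIT ZERO") == 0xF0);
    assert(code_point_of(ascii_subset, 3, "EURO SIGN") == -1);
}
```
    </pre>
    <p>The repertoire is what remains if the code_point member is
      ignored; the character set is the full pairing.</p>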
    <blockquote type="cite"
cite="mid:CAJEGDKrpg0y0bg13aJjr1-idow8JU6R84DZM5c_oL9ti9c1Lzg@mail.gmail.com">
      <div dir="ltr"><br>
        Basic source character set is defined in [lex.charset] <a
          href="http://eel.is/c++draft/lex.charset#def:character_set,basic_source"
          moz-do-not-send="true">http://eel.is/c++draft/lex.charset#def:character_set,basic_source</a><br>
      </div>
    </blockquote>
    Yes, and it defines a character repertoire.  "Physical source file
    characters" is the closest I've found to a term that describes the
    actual implementation-defined source character set.
    <blockquote type="cite"
cite="mid:CAJEGDKrpg0y0bg13aJjr1-idow8JU6R84DZM5c_oL9ti9c1Lzg@mail.gmail.com">
      <div dir="ltr"><br>
        I'd like to get away from "execution encoding" because it
        conflates the presumed encoding and the one selected by the
        current locale. Now, admittedly, everyone conflates these and
        it's a source of error and mojibake, but perhaps with better
        words it would be easier to teach. <br>
      </div>
    </blockquote>
    I agree.  I like "dynamic encoding" because it accurately reflects
    the reality that the encoding can be changed dynamically (by calls
    to std::setlocale).<br>
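    <p>A small sketch of what "dynamic" means in practice.  The helper
      name is hypothetical, and which locale names are available depends
      on the system.  The helper switches the global locale and reports
      MB_CUR_MAX, which reflects the multibyte encoding currently in
      effect.</p>
    <pre>
```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>

// Hypothetical helper: switch the global C locale and report the maximum
// number of bytes per multibyte character for the encoding now in effect.
// Returns 0 if the requested locale is not available.
std::size_t max_bytes_per_char_in(const char* locale_name) {
    if (std::setlocale(LC_ALL, locale_name) == nullptr) {
        return 0;  // request failed; the previous locale stays in effect
    }
    return MB_CUR_MAX;  // depends on LC_CTYPE of the current locale
}

// The "C" locale uses a single-byte encoding.
void check_c_locale() {
    assert(max_bytes_per_char_in("C") == 1);
}
```
    </pre>
    <p>With a UTF-8 locale such as "en_US.UTF-8", where one is
      installed, the same call should report a larger value, while the
      encoding of the program's narrow literals stays whatever was
      chosen at compile time.</p>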
    <blockquote type="cite"
cite="mid:CAJEGDKrpg0y0bg13aJjr1-idow8JU6R84DZM5c_oL9ti9c1Lzg@mail.gmail.com">
      <div dir="ltr"><br>
        As to UB. I'd like, if possible, to avoid creating new UB
        classes. Some things should probably be ill-formed, like
        unencodable characters. Others fall into existing UB, like
        specifying an inline string literal with two different
        encodings. Reading a string with the wrong encoding, I think,
        should be at worst unspecified, unless for some reason your
        decoder has UB, in which case it's the decoder's problem, not the
        incorrect or mixed encoding issue. That said, I'd defer to Core
        on this. <br>
      </div>
    </blockquote>
    Wherever Core says we can get away with unspecified, I'm all for it.<br>
    <blockquote type="cite"
cite="mid:CAJEGDKrpg0y0bg13aJjr1-idow8JU6R84DZM5c_oL9ti9c1Lzg@mail.gmail.com">
      <div dir="ltr"><br>
        Internal encoding is required to preserve distinct universal
        character names and treat all representations of the same
        universal character the same. So, the standard effectively
        requires Unicode, but in terms of observables. <br>
      </div>
    </blockquote>
    <p>Agreed.  I don't think anything is accomplished by trying to
      prescribe implementation details.<br>
    </p>
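    <p>As a rough illustration of "in terms of observables", the sketch
      below spells the same universal character two ways and checks that
      the results are indistinguishable.  It assumes a compiler whose
      char16_t and char32_t literals are UTF-16 and UTF-32 (required
      since C++20, and true of the implementations I know of before
      that).  The names are hypothetical.</p>
    <pre>
```cpp
#include <cassert>

// Two spellings of U+00E9 as a universal-character-name.
constexpr char32_t short_form = U'\u00E9';
constexpr char32_t long_form  = U'\U000000E9';

// Whichever spelling was used, the observable result is the same character.
static_assert(short_form == long_form, "same universal character");

// Outside the BMP, a UTF-16 literal needs a surrogate pair:
// two code units plus the terminating null.
static_assert(sizeof(u"\U0001F600") == 3 * sizeof(char16_t), "surrogate pair");

// Run-time check of the same property (hypothetical helper name).
bool spellings_agree() {
    return short_form == long_form && short_form == 0xE9;
}
```
    </pre>
    <p>Nothing here reveals how the compiler represents characters
      internally; only the resulting values are visible.</p>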
    <p>Tom.<br>
    </p>
    <blockquote type="cite"
cite="mid:CAJEGDKrpg0y0bg13aJjr1-idow8JU6R84DZM5c_oL9ti9c1Lzg@mail.gmail.com">
      <div dir="ltr"><br>
        <br>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Sun, Sep 8, 2019 at 5:39 AM
          Corentin Jabot &lt;<a href="mailto:corentinjabot@gmail.com"
            moz-do-not-send="true">corentinjabot@gmail.com</a>&gt;
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div dir="ltr">
            <div dir="ltr"><br>
            </div>
            <br>
            <div class="gmail_quote">
              <div dir="ltr" class="gmail_attr">On Sun, 8 Sep 2019 at
                05:46, Tom Honermann &lt;<a
                  href="mailto:tom@honermann.net" target="_blank"
                  moz-do-not-send="true">tom@honermann.net</a>&gt;
                wrote:<br>
              </div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">
                <div bgcolor="#FFFFFF">
                  <div
class="gmail-m_1528546080172065353gmail-m_7194630821368723447moz-cite-prefix">On
                    9/5/19 9:41 PM, Steve Downey wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <div dir="ltr">Because I needed to circulate what
                      I'm doing for Belfast, I've thrown together an
                      abstract for the paper we've peripherally
                      discussed about modernizing and tightening the
                      specification around encodings of characters
                      generally, and the source and execution character
                      sets. <br>
                      <br>
                      "<br>
                      This document proposes new standard terms for the
                      various encodings for character and string
                      literals, and the encodings associated with some
                      character types. It also proposes that the wording
                      used for [lex.charset], [lex.ccon], [lex.string],
                      and [basic.fundamental] 8 be modified to reflect
                      the new terminology. This paper does not intend to
                      propose any changes that would require changes in
                      any currently conforming implementation.<br>
                      "<br>
                      <br>
                      I'm hoping to have some preliminary work by the
                      next telecon. The direction I'm thinking is that
                      both Source and Execution Character Set are
                      descriptions of the abstract characters, selected
                      from 10646, that must be present to support C++.
                      Encodings, both source and execution, are
                      implementation-defined. I would like to introduce
                      terminology to describe the encoding used when
                      translating narrow and wide character and string
                      literals. I'd also like to make it explicit
                      somewhere up front that there are associated
                      encodings for some, but not all, character types.
                      This is mentioned now in filesystem, but should be
                      moved to a section with wider scope. The encoding
                      for `char` and `wchar_t` is controlled by
                      `locale`. The encoding for the Unicode character
                      types is fixed. The encoding used for literals was
                      chosen at compile time, and is
                      implementation-defined. If locale and that encoding
                      conflict, behavior is unspecified. Combining TUs with
                      different encodings is in general unspecified,
                      unless it results in an ODR violation. <br>
                    </div>
                  </blockquote>
                  This all sounds great.  My only question is behavior
                  being unspecified vs undefined.  It seems challenging
                  to get away with making it only unspecified.<br>
                </div>
              </blockquote>
              <div><br>
              </div>
              <div>Specifically, I'd like something along the lines of:</div>
              <div>If a character literal contains a c-char that does not
                have the same representation in the character literal
                encoding (aka the "presumed" execution encoding) and the
                execution encoding, the behavior is undefined.</div>
              <div><br>
              </div>
              <div><br>
              </div>
              <div><br>
              </div>
              <div> </div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">
                <div bgcolor="#FFFFFF">
                  <blockquote type="cite">
                    <div dir="ltr"><br>
                      Some possible terms:<br>
                      {"",Narrow,Wide} Literal Encoding - encoding on
                      char and string literals<br>
                      Dynamic Encoding - encoding implied by locale<br>
                      *Character Set - A set of abstract characters
                      (Latin Capital Letter A, Digit Zero, Left
                      Parenthesis, ...)<br>
                    </div>
                  </blockquote>
                  Unicode uses "character repertoire" for abstract sets
                  of characters.  I favor following suit there.<br>
                </div>
              </blockquote>
              <div><br>
              </div>
              <div>+1 to sticking to Unicode terms </div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">
                <div bgcolor="#FFFFFF">
                  <blockquote type="cite">
                    <div dir="ltr">*Basic Character Set - minimum
                      required to be encoded<br>
                      *Extended Character Set - what can be encoded<br>
                      *Source Character Set - must be encodable in C++
                      source<br>
                    </div>
                  </blockquote>
                  I don't think "source character set" is defined
                  today.  The closest we get is "Physical source file
                  characters" in <a
                    href="http://eel.is/c++draft/lex.phases#1.1"
                    target="_blank" moz-do-not-send="true">[lex.phases]p1</a>.<br>
                  <blockquote type="cite">
                    <div dir="ltr">*Execution Character Set - Source +
                      control characters<br>
                    </div>
                  </blockquote>
                </div>
              </blockquote>
              <div><br>
              </div>
              <div>Be careful not to break that code <a
href="https://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers"
                  target="_blank" moz-do-not-send="true">https://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers</a></div>
              <div>More seriously, I think it would be beneficial
                (necessary even) to have a source character encoding /
                character repertoire.</div>
              <div><br>
              </div>
              <div><br>
              </div>
              <div>I wonder if we could specify that the
                internal character repertoire is Unicode. It kinda has
                to be already, so we could make that clearer.</div>
              <div><br>
              </div>
              <div><br>
              </div>
              <div>I would also propose</div>
              <div><br>
              </div>
              <div>Universal Character Name -&gt; Unicode Code point<br>
              </div>
              <div>("character name" should be reserved for the \N
                proposal)</div>
              <div><br>
              </div>
              <div><br>
              </div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">
                <div bgcolor="#FFFFFF">
                  <blockquote type="cite">
                    <div dir="ltr"> <br>
                      * Current terms, with what I think the actual
                      meanings are today.<br>
                      <br>
                      <br>
                    </div>
                  </blockquote>
                  <p>I think these are good.  With these, there is no
                    need for a term like "execution encoding", correct? 
                    At compile-time, "literal encoding" encodes
                    "execution character set" characters, and at
                    run-time, "dynamic encoding" encodes "extended
                    character set" characters, yes?</p>
                </div>
              </blockquote>
              <div>I prefer "execution" to "dynamic".</div>
              <div> </div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">
                <div bgcolor="#FFFFFF">
                  <p>I like that this doesn't stray far from the
                    existing terms.<br>
                  </p>
                  <p>Tom.<br>
                  </p>
                </div>
                _______________________________________________<br>
                SG16 Unicode mailing list<br>
                <a href="mailto:Unicode@isocpp.open-std.org"
                  target="_blank" moz-do-not-send="true">Unicode@isocpp.open-std.org</a><br>
                <a
                  href="http://www.open-std.org/mailman/listinfo/unicode"
                  rel="noreferrer" target="_blank"
                  moz-do-not-send="true">http://www.open-std.org/mailman/listinfo/unicode</a><br>
              </blockquote>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
    <p><br>
    </p>
  </body>
</html>