<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">On 3/8/19 10:31 AM, Mathias Stearn
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAH4rMhgz1YCoqDm8-RuxrtYn1-x_AmJRqz=d0M8p9a=z_ygr7g@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="auto">
        <div dir="ltr">
          <div dir="ltr">
            <div dir="ltr"><br>
            </div>
            <br>
            <div class="gmail_quote">
              <div dir="ltr" class="gmail_attr">On Thu, Mar 7, 2019 at
                7:19 PM Tom Honermann &lt;<a
                  href="mailto:tom@honermann.net" target="_blank"
                  rel="noreferrer" moz-do-not-send="true">tom@honermann.net</a>&gt;
                wrote:</div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">
                <div bgcolor="#FFFFFF">
                  <p>I think the committee currently has a UTF-8 bias
                    that doesn't necessarily reflect the global C++
                    community.  We don't have much representation from
                    Japan or China, where, as I understand it, Shift-JIS
                    and GB18030 still have significant usage.  We also
                    have few, if any, z/OS users on the committee
                    outside of IBM representatives.</p>
                </div>
              </blockquote>
              <div>Not to be dismissive, but z/OS developers are a tiny
                subset of C++ developers.</div>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    This is true, but they also serve an important market and already
    face challenges from being in a more niche space.  If we can
    reasonably make things easier for them, I think we should.<br>
    <blockquote type="cite"
cite="mid:CAH4rMhgz1YCoqDm8-RuxrtYn1-x_AmJRqz=d0M8p9a=z_ygr7g@mail.gmail.com">
      <div dir="auto">
        <div dir="ltr">
          <div dir="ltr">
            <div class="gmail_quote">
              <div>Even when targeting Z series hardware (as we have for
                a few years now), there is the option of using Linux,
                which seems to be a fully supported platform.</div>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <p>Linux on Z is great, but not helpful for those who have actual
      z/OS requirements.</p>
    <blockquote type="cite"
cite="mid:CAH4rMhgz1YCoqDm8-RuxrtYn1-x_AmJRqz=d0M8p9a=z_ygr7g@mail.gmail.com">
      <div dir="auto">
        <div dir="ltr">
          <div dir="ltr">
            <div class="gmail_quote">
              <div>If supporting z/OS makes the experience worse or more
                complicated for other users, then I think the best
                option for the broader ecosystem is to leave it out of
                scope for the TR. That platform can offer an equivalent
                mechanism that better fits its eccentricities. I want to
                point out that EBCDIC seems to be the only remaining
                encoding that isn't an ASCII superset (Shift-JIS
                replaces two ASCII characters, but they don't matter
                for our purposes), so to support it we would be taking
                on substantial additional complexity that is only needed
                for that one niche platform.<br>
              </div>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    I don't consider anything we've discussed so far to add substantial
    additional complexity.  In fact, what we've discussed is also
    relevant to ASCII platforms.<br>
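    <p>To illustrate the ASCII-superset distinction you mention (a
      minimal sketch only, assuming ordinary character literals reflect
      the narrow execution encoding, which is the case on the
      implementations I'm aware of): checks like the following hold for
      UTF-8, Latin-1, Shift-JIS, and GB18030 builds, but fail on EBCDIC
      code pages, where '/' is 0x61 and 'A' is 0xC1.<br>
    </p>
    <pre>// Hypothetical probe, not part of any proposal: verify that the narrow
// execution encoding is an ASCII superset for a few representative characters.
// These assertions pass for ASCII-superset encodings (UTF-8, Latin-1,
// Shift-JIS, GB18030) and fail for EBCDIC code pages.
static_assert('A' == 0x41, "narrow encoding is not an ASCII superset");
static_assert('/' == 0x2F, "narrow encoding is not an ASCII superset");
static_assert('0' == 0x30, "narrow encoding is not an ASCII superset");
</pre>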
    <blockquote type="cite"
cite="mid:CAH4rMhgz1YCoqDm8-RuxrtYn1-x_AmJRqz=d0M8p9a=z_ygr7g@mail.gmail.com">
      <div dir="auto">
        <div dir="ltr">
          <div dir="ltr">
            <div class="gmail_quote">
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">
                <div bgcolor="#FFFFFF">
                  <p>UTF-8 dominates the web; no one questions that.
                    But within the C++ ecosystem, I don't think UTF-8
                    dominates to a similar degree, at least not outside
                    of the US and Europe.  I wish I had data to back
                    that up.</p>
                </div>
              </blockquote>
              <div>From <a href="http://www.tomazos.com/actcd16.pdf"
                  target="_blank" rel="noreferrer"
                  moz-do-not-send="true">http://www.tomazos.com/actcd16.pdf</a>:
                "We executed standard C++ translation phase 1 through 3
                on the source files assuming a UTF­8encoding. We found
                that 99.0% of the source files tokenized successfully.
                Of the remaining1.0% the majority of the errors were
                decoding problems (most likely from ISO­8859 /
                Latin1encoding)"</div>
              <div><br>
              </div>
              <div>This was a scan of all C and C++ packages in Ubuntu.
                While that obviously only represents the open-source,
                Unix-targeting subset of the C++ community, it seems to
                imply that for that sub-community UTF-8 (and its ASCII
                subset) dominates the source content. On top of that, I
                would expect file names to have even fewer non-ASCII
                characters than file content, since it is common to
                limit non-ASCII characters to comments and strings.<br>
              </div>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <p>For that subset, I agree; those results match my expectations.
      Worth noting that the survey doesn't answer the question of what
      might break if characters outside the ASCII range were introduced
      into that 99% of source files.  For example, those files aren't
      necessarily consumed as UTF-8.<br>
    </p>
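    <p>As a minimal sketch of why (illustrative only, not tied to the
      paper's methodology): a pure-ASCII byte sequence reads the same
      under UTF-8, Latin-1, Shift-JIS, and GB18030, so "tokenizes as
      UTF-8" doesn't tell us which encoding the producers and consumers
      of those files actually assume; only bytes at or above 0x80 force
      the question.<br>
    </p>
    <pre>// Illustrative only: separates files whose bytes are encoding-agnostic
// (pure ASCII) from files whose interpretation depends on the assumed encoding.
#include &lt;string_view&gt;

constexpr bool is_ascii_only(std::string_view bytes) {
  for (unsigned char c : bytes)
    if (c &gt;= 0x80) return false;  // a high byte forces an encoding choice
  return true;
}

static_assert(is_ascii_only("int main() { return 0; }"));  // same bytes under any ASCII superset
static_assert(!is_ascii_only("// r\xE9sum\xE9"));          // 0xE9 is Latin-1 'é', ill-formed as UTF-8
</pre>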
    <p>Tom.<br>
    </p>
  </body>
</html>