<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">On 9/9/19 3:47 PM, Corentin wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CA+Om+ShjBpvr0DmTbpSjNjbWkCvwWa2y-vFZOyUh8B_tUDfVxQ@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div dir="ltr"><br>
        </div>
        <br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Mon, 9 Sep 2019 at 21:29,
            Tom Honermann &lt;<a href="mailto:tom@honermann.net"
              moz-do-not-send="true">tom@honermann.net</a>&gt; wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div bgcolor="#FFFFFF">
              <div class="gmail-m_948892757684974694moz-cite-prefix">On
                9/9/19 3:26 AM, Corentin wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div dir="ltr"><br>
                  </div>
                  <div dir="auto">On Mon, Sep 9, 2019, 4:34 AM Tom
                    Honermann &lt;<a href="mailto:tom@honermann.net"
                      target="_blank" moz-do-not-send="true">tom@honermann.net</a>&gt;
                    wrote:<br>
                    <div class="gmail_quote">
                      <blockquote class="gmail_quote" style="margin:0px
                        0px 0px 0.8ex;border-left:1px solid
                        rgb(204,204,204);padding-left:1ex">
                        <div bgcolor="#FFFFFF">
                          <p>My preferred direction for exploration is a
                            future extension that enables opt-in to
                            field widths that are encoding dependent
                            (and therefore locale dependent for char and
                            wchar_t).  For example (using 'L' appended
                            to the width; 'L' doesn't conflict with the
                            existing type options):<br>
                          </p>
                          <p><tt>std::format("{:3L}", "\xC3\x81"); //
                              produces "\xC3\x81\x20\x20"; 3 EGCs.</tt></p>
                        </div>
                      </blockquote>
                      <div>std::format("{:3L}", "ch"); what does that
                        produce?</div>
                    </div>
                  </div>
                </div>
              </blockquote>
              "ch " (one trailing space).  The implied constraint with
              respect to literals is that they must be compatible with
              whatever the locale-dependent encoding is.  If your
              question was intended to ask whether transliteration
              should occur here or whether "ch" might be presented with
              a ligature, well, that is yet another dimension of why
              field widths don't really work for aligning text in
              general (they work just fine for characters for which one
              code unit == one code point == one glyph that can be
              presented in a monospace font).<br>
            </div>
          </blockquote>
          <div><br>
          </div>
          <div>See <a
              href="https://en.wikipedia.org/wiki/Slovak_orthography"
              moz-do-not-send="true">https://en.wikipedia.org/wiki/Slovak_orthography</a></div>
        </div>
      </div>
    </blockquote>
    Ah, digraphs.  Unicode doesn't provide general support for digraphs,
    so whether "ch" represents the individual Slovak "c" and "h"
    characters or the single letter "ch" is not apparent.  If "c" and "h"
    were intended, then
    <span class="nowrap"><span class="monospaced">U+034F { COMBINING GRAPHEME JOINER }</span></span>
    could be used to indicate that (the name of this joiner is a
    misnomer).
    <span class="nowrap"><span class="monospaced">U+200C { ZERO WIDTH NON-JOINER }</span></span>
    and
    <span class="nowrap"><span class="monospaced">U+200D { ZERO WIDTH JOINER }</span></span>
    could be used to control ligation, but they don't help to determine
    which character is intended.  I tend to think Unicode is deficient
    in this area, but I'm no expert in it.  Regardless, this is more
    support for field widths being insufficient for display alignment.<br>
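    <p>To make the digraph point concrete, here is a minimal sketch,
      assuming the text is well-formed UTF-8.  The helper names
      (utf8_code_point_count, separate_with_cgj) are hypothetical and
      exist only for this illustration.  It shows that inserting the
      invisible U+034F between "c" and "h" changes both the code unit
      count and the code point count, so neither measure tracks what is
      displayed.</p>

```cpp
#include <cstddef>
#include <string>

// UTF-8 encoding of U+034F COMBINING GRAPHEME JOINER (two code units).
const char cgj[] = "\xCD\x8F";

// Counts code points, ASSUMING s is well-formed UTF-8: every byte that is
// not a continuation byte (10xxxxxx) starts a new code point.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: joins two letters with a CGJ between them to mark
// them as separate letters rather than a digraph.
std::string separate_with_cgj(const std::string& a, const std::string& b) {
    std::string out(a);
    out += cgj;
    out += b;
    return out;
}
```

    <p>With these helpers, "ch" is 2 code units and 2 code points, while
      separate_with_cgj("c", "h") is 4 code units and 3 code points, even
      though both would normally occupy the same space on screen.</p>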
    <blockquote type="cite"
cite="mid:CA+Om+ShjBpvr0DmTbpSjNjbWkCvwWa2y-vFZOyUh8B_tUDfVxQ@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_quote">
          <div> </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div bgcolor="#FFFFFF">
              <blockquote type="cite">
                <div dir="ltr">
                  <div dir="auto">
                    <div class="gmail_quote">
                      <div>Locale specifiers should only affect region
                        specific rules, not whether something is
                        interpreted as bytes or not <br>
                      </div>
                    </div>
                  </div>
                </div>
              </blockquote>
              Ideally I agree, but that isn't the reality we are faced
              with.<br>
            </div>
          </blockquote>
          <div><br>
          </div>
          <div>I feel like we are completely talking past each other, and I
            am sorry I am not making my point clear.<br>
          </div>
          <div>Yes, the encoding is currently derived from the locale;
            no, it does not have to be.</div>
          <div><br>
          </div>
          <div>It is possible to answer the question "what is the
            encoding of the current process" without pulling in the
            &lt;locale&gt; header.</div>
          <div>Pulling in the &lt;locale&gt; header does NOT give you that
            information.</div>
        </div>
      </div>
    </blockquote>
    I don't see how the &lt;locale&gt; header is relevant here.  The
    standard doesn't have to answer the question of where the locale
    information comes from.  LANG=C vs LANG=C.UTF-8 isn't (currently)
    reflected in &lt;locale&gt;.<br>
    <blockquote type="cite"
cite="mid:CA+Om+ShjBpvr0DmTbpSjNjbWkCvwWa2y-vFZOyUh8B_tUDfVxQ@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_quote">
          <div>And yes, on some systems (Linux), it is attached to the
            idea of locale.</div>
        </div>
      </div>
    </blockquote>
    All POSIX systems and Windows.<br>
    <blockquote type="cite"
cite="mid:CA+Om+ShjBpvr0DmTbpSjNjbWkCvwWa2y-vFZOyUh8B_tUDfVxQ@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>It is important to separate the two when dealing with
            Unicode</div>
        </div>
      </div>
    </blockquote>
    We're not dealing solely with Unicode here.  We're discussing char
    and wchar_t which may or may not (depending on platform and locale)
    indicate a Unicode or non-Unicode encoding.  I don't see a way to
    separate them today.<br>
    <blockquote type="cite"
cite="mid:CA+Om+ShjBpvr0DmTbpSjNjbWkCvwWa2y-vFZOyUh8B_tUDfVxQ@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div> </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div bgcolor="#FFFFFF">
              <blockquote type="cite">
                <div dir="ltr">
                  <div dir="auto">
                    <div class="gmail_quote">
                      <blockquote class="gmail_quote" style="margin:0px
                        0px 0px 0.8ex;border-left:1px solid
                        rgb(204,204,204);padding-left:1ex">
                        <div bgcolor="#FFFFFF">
                          <p><tt> </tt></p>
                          <p>But again, I'm far from convinced that this
                            is actually useful since EGCs don't suffice
                            to ensure an aligned result anyway as nicely
                            described in Henri's post (<a
                              href="https://hsivonen.fi/string-length"
                              rel="noreferrer" target="_blank"
                              moz-do-not-send="true">https://hsivonen.fi/string-length</a>).</p>
                        </div>
                      </blockquote>
                      <div>Agreed, but I think you know that code units
                        are the least useful option in this case, and I am
                        concerned about choosing a bad option to make a
                        fix easy.</div>
                      <div> </div>
                    </div>
                  </div>
                </div>
              </blockquote>
              <p>I didn't propose code units in order to make an easy
                fix.  The intent was to choose the best option given the
                trade-offs involved.  None of code units, code points,
                scalar values, or EGCs would result in reliable
                alignment, and most uses of such alignment (e.g., via
                printf) occur in situations where characters outside
                the basic source character set are unlikely to appear
                [citation needed], so I felt that avoiding the locale
                dependency was the more important goal.</p>
            </div>
          </blockquote>
          <div>I think the user intent is more important. I don't want
            an emoji to be considered 17 width units, to quote Henri's
            post.</div>
          <div>EGCs are the least bad approximation.<br>
          </div>
        </div>
      </div>
    </blockquote>
    I guess that is one place we disagree.<br>
    <blockquote type="cite"
cite="mid:CA+Om+ShjBpvr0DmTbpSjNjbWkCvwWa2y-vFZOyUh8B_tUDfVxQ@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>But stating that the char overload is bytes and the
            upcoming char8_t one is text would be okay, I think. Maybe,
            even if surprising.<br>
          </div>
        </div>
      </div>
    </blockquote>
    <p>And this is another.  Repeating what I stated earlier, if the
      current locale has a UTF-8 encoding, I would be disappointed if
      the following two calls produced different string contents: </p>
    <p><tt>std::format(  "{:3}",   "\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }<br>
      </tt><tt>std::format(u8"{:3}", u8"\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }</tt></p>
    <p>Perhaps it would be helpful to enumerate what we expect to be
      portable uses of field widths.  My personal take is that they are
      useful to specify widths for fields where the content is
      restricted to members of the basic source character set, where we
      already have a guarantee that each character can be represented
      with one code unit.  That is sufficient to allow field widths to
      portably work as expected (assuming a monospace font if display is
      relevant) for formatting of arithmetic and pointer types, as none
      of those require characters outside of the basic source character
      set.  It is also sufficient for character and string literals
      restricted to the basic source character set.  I think it is
      reasonable to require some other means of achieving alignment for
      text in general.  Those restrictions make the distinction between
      code units, code points, scalar values, and EGCs meaningless in
      the context of field widths.<br>
    </p>
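    <p>As a rough illustration of that restriction, here is a minimal
      sketch that pads by code units and checks whether a string stays
      within printable ASCII (0x20 through 0x7E), which I am using here
      as an assumed stand-in for the printable members of the basic
      source character set.  The helper names (is_printable_ascii,
      pad_code_units) are hypothetical; this is not how std::format is
      specified or implemented.</p>

```cpp
#include <cstddef>
#include <string>

// True if every char is printable ASCII (0x20..0x7E).  For such strings,
// one code unit == one code point == one EGC, so every way of measuring
// the width gives the same answer.
bool is_printable_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c < 0x20 || c > 0x7E) return false;
    }
    return true;
}

// Hypothetical helper: left-aligns s in a field of `width` code units by
// appending spaces; strings already at least that long are left unchanged.
std::string pad_code_units(const std::string& s, std::size_t width) {
    std::string out(s);
    if (out.size() < width) out.append(width - out.size(), ' ');
    return out;
}
```

    <p>For "42" and a width of 5, the result is "42" followed by three
      spaces under any of the four measures.  For "\xC3\x81" and a width
      of 3, padding by code units appends only one space, which is where
      the measures start to disagree.</p>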
    <p>Tom.</p>
  </body>
</html>