<div dir="ltr"><div dir="ltr"><br></div><div dir="auto"><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Sep 9, 2019, 4:34 AM Tom Honermann &lt;<a href="mailto:tom@honermann.net" target="_blank">tom@honermann.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF">
    <div class="gmail-m_-3313212866199587940m_-7380627940264454650moz-cite-prefix">On 9/8/19 7:05 PM, Zach Laine wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">
        <div dir="ltr">On Sun, Sep 8, 2019 at 3:00 PM Tom Honermann via
          Lib &lt;<a href="mailto:lib@lists.isocpp.org" rel="noreferrer" target="_blank">lib@lists.isocpp.org</a>&gt; wrote:<br>
        </div>
        <div class="gmail_quote">
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div dir="auto"><br>
              <div dir="ltr">On Sep 8, 2019, at 2:46 PM, Corentin via
                Lib &lt;<a href="mailto:lib@lists.isocpp.org" rel="noreferrer" target="_blank">lib@lists.isocpp.org</a>&gt;
                wrote:<br>
                <br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div dir="ltr">
                    <div dir="ltr"><br>
                    </div>
                    <br>
                    <div class="gmail_quote">
                      <div dir="ltr" class="gmail_attr">On Sun, 8 Sep
                        2019 at 19:30, Tom Honermann &lt;<a href="mailto:tom@honermann.net" rel="noreferrer" target="_blank">tom@honermann.net</a>&gt;
                        wrote:<br>
                      </div>
                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                        <div bgcolor="#FFFFFF">
                          <div class="gmail-m_-3313212866199587940m_-7380627940264454650gmail-m_3952312726224711374gmail-m_4045717672081106664moz-cite-prefix">On
                            9/8/19 12:40 PM, Corentin wrote:<br>
                          </div>
                          <blockquote type="cite">
                            <div dir="ltr">
                              <div dir="ltr"><br>
                              </div>
                              <br>
                              <div class="gmail_quote">
                                <div dir="ltr" class="gmail_attr">On
                                  Sun, 8 Sep 2019 at 18:12, Tom
                                  Honermann &lt;<a href="mailto:tom@honermann.net" rel="noreferrer" target="_blank">tom@honermann.net</a>&gt;
                                  wrote:<br>
                                </div>
                                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                  <div bgcolor="#FFFFFF">
                                    <div class="gmail-m_-3313212866199587940m_-7380627940264454650gmail-m_3952312726224711374gmail-m_4045717672081106664gmail-m_1796657059973223044moz-cite-prefix">On
                                      9/8/19 6:00 AM, Corentin via Lib
                                      wrote:<br>
                                    </div>
                                    <blockquote type="cite">
                                      <div dir="ltr">
                                        <div dir="ltr"><br>
                                        </div>
                                        <br>
                                        <div class="gmail_quote">
                                          <div dir="ltr" class="gmail_attr">On Sun, 8
                                            Sep 2019 at 11:17, Corentin
                                            &lt;<a href="mailto:corentin.jabot@gmail.com" rel="noreferrer" target="_blank">corentin.jabot@gmail.com</a>&gt;
                                            wrote:<br>
                                          </div>
                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                            <div dir="ltr">
                                              <div dir="ltr"><br>
                                              </div>
                                              <br>
                                              <div class="gmail_quote">
                                                <div dir="ltr" class="gmail_attr">On
                                                  Sun, 8 Sep 2019 at
                                                  09:52, Billy O&#39;Neal
                                                  (VC LIBS) &lt;<a href="mailto:bion@microsoft.com" rel="noreferrer" target="_blank">bion@microsoft.com</a>&gt;
                                                  wrote:<br>
                                                </div>
                                                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                                  <div>
                                                    <div class="gmail-m_-3313212866199587940m_-7380627940264454650gmail-m_3952312726224711374gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396WordSection1">
                                                      <p class="MsoNormal">&gt;
                                                        I agree that
                                                        EGCS is the best
                                                        option. That
                                                        doesn&#39;t drag
                                                        locale</p>
                                                      <p class="MsoNormal"> </p>
                                                      <p class="MsoNormal">Because
                                                        we don’t get to
                                                        assume that
                                                        we’re talking
                                                        about Unicode at
                                                        all, it
                                                        absolutely drags
                                                        in locale.</p>
                                                    </div>
                                                  </div>
                                                </blockquote>
                                                <div><br>
                                                </div>
                                                <div>Sorry, I should
                                                  have been more
                                                  specific.</div>
                                                <div>There is a
                                                  non-tailored Unicode
                                                  EGCS boundary
                                                  algorithm (but it can
                                                  be tailored).</div>
                                                <div>I didn&#39;t mean to
                                                  imply that text
                                                  manipulation can be
                                                  done without knowing
                                                  its encoding, and I never
                                                  use &quot;locale&quot; to mean
                                                  encoding.</div>
                                                <div><br>
                                                </div>
                                                <div>EGCS are only
                                                  defined for text whose
                                                  character repertoire
                                                  is Unicode; other
                                                  encodings deal with
                                                  code points.</div>
                                              </div>
                                            </div>
                                          </blockquote>
                                          <div><br>
                                          </div>
                                          <div><br>
                                          </div>
                                          <div>To be clear, what matters
                                            about whether the EGC
                                            algorithm is required to be
                                            tailored is that tailoring,
                                            for all intents and purposes,
                                            requires</div>
                                          <div>ICU or something
                                            with CLDR, which restricts
                                            the platforms on which this
                                            can be implemented.<br>
                                          </div>
                                        </div>
                                      </div>
                                    </blockquote>
                                    <p>Tailoring is not relevant to this
                                      discussion.</p>
                                  </div>
                                </blockquote>
                                <div>It is - see <a href="https://unicode.org/reports/tr29/" rel="noreferrer" target="_blank">https://unicode.org/reports/tr29/</a> &quot;ch&quot;
                                  is 2 EGCS in most locales but in
                                  Slovak it&#39;s 1. I don&#39;t make the rules
                                  :D</div>
                              </div>
                            </div>
                          </blockquote>
                          It isn&#39;t relevant in determining how we
                          resolve this issue.  If the resolution is that
                          field widths are measured in EGCs, then we&#39;ve
                          already decided that the width is locale
                          dependent and tailoring becomes an
                          implementation detail.<br>
                        </div>
                      </blockquote>
                      <div><br>
                      </div>
                      <div>No, format decided to be locale-independent
                        (for good reason) and applying locale specific
                        behavior implicitly would be against that.</div>
                      <div>I&#39;m arguing for encoding-specific behavior.</div>
                    </div>
                  </div>
                </div>
              </blockquote>
              <div><br>
              </div>
              You seem to be missing the point that, for char and
              wchar_t, the encoding can’t be known (in general) without
              consulting the locale. Again, LANG=C vs LANG=C.UTF-8. 
              <div><br>
              </div>
              <div>Tom. </div>
            </div>
          </blockquote>
          <div><br>
          </div>
          <div>Tom, you seem to be missing the point that std::format
            does no such consultation!  It is locale-agnostic.  It is
            assumed to be char-based, not Windows-1252, not UTF-8, not
            even ASCII.</div>
        </div>
      </div>
    </blockquote>
    That is exactly my point!  And why my proposed resolution was to
    specify width in terms of code units.<br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>This means that the definition of width as being a CU is
            the de facto status quo.  I&#39;m suggesting that later on, we
            pull a fast one and specify that we meant that it should
            have been UTF-8-based instead of char-based.  This may mean
            that we need to add a char8_t overload, or it may be
            palatable to just change the current interface&#39;s contract. 
            I assume the former will be necessary, since people tend to
            hate silent contract changes (with good reason).<br>
          </div>
        </div>
      </div>
    </blockquote>
    <p>Victor&#39;s fmtlib implementation already effectively does what you
      suggest.  See
<a class="gmail-m_-3313212866199587940m_-7380627940264454650moz-txt-link-freetext" href="https://github.com/fmtlib/fmt/commit/38325248e5310ddbea41390974e496e8495f7324" rel="noreferrer" target="_blank">https://github.com/fmtlib/fmt/commit/38325248e5310ddbea41390974e496e8495f7324</a>.</p>
    <p>I think this isn&#39;t a good state to be in though.  If the current
      locale has a UTF-8 encoding, I would be disappointed if the
      following two calls produced different string contents:</p>
    <p><tt>std::format(  &quot;{:3}&quot;,   &quot;\xC3\x81&quot;); // U+00C1</tt><tt> { </tt><tt>LATIN
        CAPITAL LETTER A WITH ACUTE }<br>
      </tt><tt>std::format(u8&quot;{:3}&quot;, u8&quot;\xC3\x81&quot;); // U+00C1</tt><tt> {
      </tt><tt>LATIN CAPITAL LETTER A WITH ACUTE }</tt></p>
    <p>If the width is code units for the char based overload and EGCs
      for the char8_t based one, then the first will produce
      &quot;\xC3\x81\x20&quot; (one inserted space) and the second
      &quot;\xC3\x81\x20\x20&quot; (two inserted spaces).  I think users would
      find that surprising.<br></p></div></blockquote><div><br></div><div>I think we are going there; we will have to if we take the code units route.</div><div>It matches a discussion I recall we had, probably at Kona, that at the moment fmt is more of a bytes formatting library, with the expectation that the u8 overload would format text.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          
          <div>So, if we do nothing, we get what you want.  If we
            *specify* that CUs are the width, we color the future debate
            about the Unicode-aware version in a Unicode-unfriendly
            direction.</div></div></div></blockquote></div></blockquote><div>+1<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><blockquote type="cite"><div dir="ltr"><div class="gmail_quote">
        </div>
      </div>
    </blockquote>
    <p>If we do nothing, we are in the situation where different
      implementors may do different things.</p></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
    <p>My preferred direction for exploration is a future extension that
      enables opt-in to field widths that are encoding dependent (and
      therefore locale dependent for char and wchar_t).  For example
      (using &#39;L&#39; appended to the width; &#39;L&#39; doesn&#39;t conflict with the
      existing type options):<br>
    </p>
    <p><tt>std::format(&quot;{:3L}&quot;, &quot;\xC3\x81&quot;); // produces
        &quot;\xC3\x81\x20\x20&quot;; 3 EGCs.</tt></p></div></blockquote><div>std::format(&quot;{:3L}&quot;, &quot;ch&quot;); what does that produce?</div><div>Locale specifiers should only affect region-specific rules, not whether something is interpreted as bytes or not.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><p><tt>
      </tt></p>
    <p>But again, I&#39;m far from convinced that this is actually useful
      since EGCs don&#39;t suffice to ensure an aligned result anyway as
      nicely described in Henri&#39;s post (<a href="https://hsivonen.fi/string-length" rel="noreferrer" target="_blank">https://hsivonen.fi/string-length</a>).</p></div></blockquote><div>Agreed, but I think you know that code units are the least useful option in this case, and I am concerned about choosing a bad option just to make the fix easy.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
    <p>Tom.<br>
    </p>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>Zach</div>
          <div><br>
          </div>
        </div>
      </div>
    </blockquote>
    <p><br>
    </p>
  </div>

</blockquote></div></div>
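<div><br></div>
<div>To make the width disagreement in this thread concrete, here is a minimal sketch of the two ways of measuring the U+00C1 example. It is not std::format or fmt code: the helper names count_code_units, count_code_points and padding_for are hypothetical, and code points stand in for EGCs only because real EGC segmentation needs the Unicode data tables from UAX #29. For a single precomposed character such as U+00C1 the code point count and the EGC count agree. The input is assumed to be valid UTF-8.</div>
<pre>
```cpp
// Count UTF-8 code units (bytes) up to the terminating NUL.
constexpr unsigned count_code_units(const char* s) {
    unsigned n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// Count code points, assuming valid UTF-8: every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
// This is only a stand-in for EGC counting, not UAX #29 segmentation.
constexpr unsigned count_code_points(const char* s) {
    unsigned n = 0;
    for (; *s != '\0'; ++s) {
        if ((*s & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Number of fill characters needed to reach `width` for a measured length.
constexpr unsigned padding_for(unsigned measured, unsigned width) {
    return width > measured ? width - measured : 0;
}
```
</pre>
<div>Under these assumptions, padding_for(count_code_units(&quot;\xC3\x81&quot;), 3) is 1 and padding_for(count_code_points(&quot;\xC3\x81&quot;), 3) is 2, which matches the one-space and two-space results quoted above.</div>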
</div>