<div dir="ltr"><div dir="ltr"><br></div><div dir="auto"><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Sep 9, 2019, 4:34 AM Tom Honermann <<a href="mailto:tom@honermann.net" target="_blank">tom@honermann.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div class="gmail-m_-3313212866199587940m_-7380627940264454650moz-cite-prefix">On 9/8/19 7:05 PM, Zach Laine wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">On Sun, Sep 8, 2019 at 3:00 PM Tom Honermann via
Lib <<a href="mailto:lib@lists.isocpp.org" rel="noreferrer" target="_blank">lib@lists.isocpp.org</a>> wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="auto"><br>
<div dir="ltr">On Sep 8, 2019, at 2:46 PM, Corentin via
Lib <<a href="mailto:lib@lists.isocpp.org" rel="noreferrer" target="_blank">lib@lists.isocpp.org</a>>
wrote:<br>
<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun, 8 Sep
2019 at 19:30, Tom Honermann <<a href="mailto:tom@honermann.net" rel="noreferrer" target="_blank">tom@honermann.net</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div class="gmail-m_-3313212866199587940m_-7380627940264454650gmail-m_3952312726224711374gmail-m_4045717672081106664moz-cite-prefix">On
9/8/19 12:40 PM, Corentin wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Sun, 8 Sep 2019 at 18:12, Tom
Honermann <<a href="mailto:tom@honermann.net" rel="noreferrer" target="_blank">tom@honermann.net</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div class="gmail-m_-3313212866199587940m_-7380627940264454650gmail-m_3952312726224711374gmail-m_4045717672081106664gmail-m_1796657059973223044moz-cite-prefix">On
9/8/19 6:00 AM, Corentin via Lib
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun, 8
Sep 2019 at 11:17, Corentin
<<a href="mailto:corentin.jabot@gmail.com" rel="noreferrer" target="_blank">corentin.jabot@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Sun, 8 Sep 2019 at
09:52, Billy O'Neal
(VC LIBS) <<a href="mailto:bion@microsoft.com" rel="noreferrer" target="_blank">bion@microsoft.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div class="gmail-m_-3313212866199587940m_-7380627940264454650gmail-m_3952312726224711374gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396WordSection1">
<p class="MsoNormal">>
I agree that
EGCS is the best
option. That
doesn't drag
locale</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Because
we don’t get to
assume that
we’re talking
about Unicode at
all, it
absolutely drags
in locale.</p>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Sorry, I should
have been more
specific.</div>
<div>There is a
non-tailored Unicode
EGCS boundary
algorithm (though it can
be tailored).</div>
<div>I didn't mean to
imply that text
manipulation can be
done without knowing
its encoding, and I never
use "locale" to mean
encoding. </div>
<div><br>
</div>
<div>EGCS are only
defined for text whose
character repertoire
is Unicode; other
encodings deal with
code points.</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<div>To be clear, whether
the EGC algorithm is
required to be tailored
matters because tailoring,
for all intents and purposes,
requires</div>
<div>ICU or something
with CLDR, which restricts
the platforms on which this
can be implemented. <br>
</div>
</div>
</div>
</blockquote>
<p>Tailoring is not relevant to this
discussion.</p>
</div>
</blockquote>
<div>It is - see <a href="https://unicode.org/reports/tr29/" rel="noreferrer" target="_blank">https://unicode.org/reports/tr29/</a>. "ch"
is 2 EGCS in most locales, but in
Slovak it's 1. I don't make the rules
:D</div>
</div>
</div>
</blockquote>
It isn't relevant in determining how we
resolve this issue. If the resolution is that
field widths are measured in EGCs, then we've
already decided that the width is locale
dependent and tailoring becomes an
implementation detail.<br>
</div>
</blockquote>
<div><br>
</div>
<div>No, format was designed to be locale-independent
(for good reason), and applying locale-specific
behavior implicitly would go against that.</div>
<div>I'm arguing for encoding-specific behavior.</div>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
You seem to be missing the point that, for char and
wchar_t, the encoding can’t be known (in general) without
consulting the locale. Again, LANG=C vs LANG=C.UTF-8.
<div><br>
</div>
<div>Tom. </div>
</div>
</blockquote>
<div><br>
</div>
<div>Tom, you seem to be missing the point that std::format
does no such consultation! It is locale-agnostic. It is
assumed to be char-based, not Windows-1252, not UTF-8, not
even ASCII.</div>
</div>
</div>
</blockquote>
That is exactly my point! And why my proposed resolution was to
specify width in terms of code units.<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>This means that measuring width in CUs is
the de facto status quo. I'm suggesting that later on, we
pull a fast one and specify that we meant that it should
have been UTF-8-based instead of char-based. This may mean
that we need to add a char8_t overload, or it may be
palatable to just change the current interface's contract.
I assume the former will be necessary, since people tend to
hate silent contract changes (with good reason).<br>
</div>
</div>
</div>
</blockquote>
<p>Victor's fmtlib implementation already effectively does what you
suggest. See
<a class="gmail-m_-3313212866199587940m_-7380627940264454650moz-txt-link-freetext" href="https://github.com/fmtlib/fmt/commit/38325248e5310ddbea41390974e496e8495f7324" rel="noreferrer" target="_blank">https://github.com/fmtlib/fmt/commit/38325248e5310ddbea41390974e496e8495f7324</a>.</p>
<p>I think this isn't a good state to be in though. If the current
locale has a UTF-8 encoding, I would be disappointed if the
following two calls produced different string contents:</p>
<p><tt>std::format( "{:3}", "\xC3\x81"); // U+00C1</tt><tt> { </tt><tt>LATIN
CAPITAL LETTER A WITH ACUTE }<br>
</tt><tt>std::format(u8"{:3}", u8"\xC3\x81"); // U+00C1</tt><tt> {
</tt><tt>LATIN CAPITAL LETTER A WITH ACUTE }</tt></p>
<p>If the width is code units for the char based overload and EGCs
for the char8_t based one, then the first will produce
"\xC3\x81\x20" (one inserted space) and the second
"\xC3\x81\x20\x20" (two inserted spaces). I think users would
find that surprising.<br></p></div></blockquote><div><br></div><div>I think we are going there - we will have to if we take the code units route.</div><div>It matches a discussion I recall we had, probably at Kona: at the moment fmt is more of a byte-formatting library, with the expectation that a u8 overload would format text.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div>So, if we do nothing, we get what you want. If we
*specify* that CUs are the width, we color the future debate
about the Unicode-aware version in a Unicode-unfriendly
direction.</div></div></div></blockquote></div></blockquote><div>+1<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><blockquote type="cite"><div dir="ltr"><div class="gmail_quote">
</div>
</div>
</blockquote>
<p>If we do nothing, we are in a situation where different
implementors may do different things.</p></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
<p>My preferred direction for exploration is a future extension that
enables opt-in to field widths that are encoding dependent (and
therefore locale dependent for char and wchar_t). For example
(using 'L' appended to the width; 'L' doesn't conflict with the
existing type options):<br>
</p>
<p><tt>std::format("{:3L}", "\xC3\x81"); // produces
"\xC3\x81\x20\x20"; 3 EGCs.</tt></p></div></blockquote><div>std::format("{:3L}", "ch"); what does that produce?</div><div>Locale specifiers should only affect region-specific rules, not whether something is interpreted as bytes. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><p><tt>
</tt></p>
<p>But again, I'm far from convinced that this is actually useful,
since EGCs don't suffice to ensure an aligned result anyway, as
nicely described in Henri's post (<a href="https://hsivonen.fi/string-length" rel="noreferrer" target="_blank">https://hsivonen.fi/string-length</a>).</p></div></blockquote><div>Agreed, but I think you know that code units are the least useful option in this case, and I am concerned about choosing a bad option just to make a fix easy.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
<p>Tom.<br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>Zach</div>
<div><br>
</div>
</div>
</div>
</blockquote>
<p><br>
</p>
</div>
</blockquote></div></div>
</div>
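<p>For reference, the padding arithmetic discussed in the thread for "\xC3\x81" with a field width of 3 can be reproduced with a small sketch. This is not how std::format or fmt is implemented. The helper names (width_in_code_units, width_in_code_points, pad_right) are hypothetical, and counting UTF-8 code points is used here only as a stand-in for EGC counting, which happens to agree for the single precomposed character U+00C1 but not in general.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: width measured in char code units (bytes).
std::size_t width_in_code_units(const std::string& s) {
    return s.size();
}

// Hypothetical helper: width measured in UTF-8 code points.
// Assumes the input is valid UTF-8; continuation bytes (10xxxxxx)
// are skipped. This is NOT an EGC segmentation algorithm.
std::size_t width_in_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: pad on the right with spaces until the
// measured width reaches the requested field width.
std::string pad_right(const std::string& s,
                      std::size_t measured,
                      std::size_t field_width) {
    std::string out = s;
    if (measured < field_width) {
        out.append(field_width - measured, ' ');
    }
    return out;
}
```

<p>With the code-unit measure, the two-byte input gets one space of padding. With the code-point measure, it gets two, which is the discrepancy between the char and char8_t overloads described above.</p>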