<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, 8 Sep 2019 at 19:30, Tom Honermann <<a href="mailto:tom@honermann.net">tom@honermann.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div class="gmail-m_4045717672081106664moz-cite-prefix">On 9/8/19 12:40 PM, Corentin wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun, 8 Sep 2019 at 18:12,
Tom Honermann <<a href="mailto:tom@honermann.net" target="_blank">tom@honermann.net</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div class="gmail-m_4045717672081106664gmail-m_1796657059973223044moz-cite-prefix">On
9/8/19 6:00 AM, Corentin via Lib wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun, 8 Sep 2019
at 11:17, Corentin <<a href="mailto:corentin.jabot@gmail.com" target="_blank">corentin.jabot@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun, 8
Sep 2019 at 09:52, Billy O'Neal (VC LIBS)
<<a href="mailto:bion@microsoft.com" target="_blank">bion@microsoft.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396WordSection1">
<p class="MsoNormal">> I agree that
EGCS is the best option. That doesn't
drag locale</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Because we don’t
get to assume that we’re talking about
Unicode at all, it absolutely drags in
locale.</p>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Sorry, I should have been more specific.</div>
<div>There is a non-tailored Unicode EGCS
boundary algorithm (though it can be tailored).</div>
<div>I didn't mean to imply that text
manipulation can be done without knowing its
encoding, and I never use "locale" to mean
encoding.</div>
<div><br>
</div>
<div>EGCS are only defined for text whose
character repertoire is Unicode; other
encodings deal with code points.</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<div>To be clear, whether the EGC
algorithm is required to be tailored matters because
tailoring, for all intents and purposes,
requires ICU or something with CLDR, which restricts the
platforms on which this can be implemented.<br>
</div>
</div>
</div>
</blockquote>
<p>Tailoring is not relevant to this discussion.</p>
</div>
</blockquote>
<div>It is relevant; see <a href="https://unicode.org/reports/tr29/" target="_blank">https://unicode.org/reports/tr29/</a>: "ch"
is 2 EGCS in most locales, but in Slovak it's 1. I don't make
the rules :D</div>
</div>
</div>
</blockquote>
It isn't relevant in determining how we resolve this issue. If the
resolution is that field widths are measured in EGCs, then we've
already decided that the width is locale dependent and tailoring
becomes an implementation detail.<br></div></blockquote><div><br></div><div>No, std::format was designed to be locale-independent (for good reason), and implicitly applying locale-specific behavior would go against that.</div><div>I'm arguing for encoding-specific behavior.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>The locale dependency stems from the encoding itself
being dependent on locale. Again, LANG=C vs
LANG=C.UTF-8. If the specified behavior is encoding
dependent (as it would have to be for field width to be
a count of any of code points, scalar values, or EGCs),
then it is also locale dependent (for char and
wchar_t). Thus there is a trade off:</p>
<ol>
<li>Either the behavior is locale dependent in which
case, field widths could be specified such that they
count code points, scalar values, or EGCs when the
locale selects a Unicode encoding (and something else
for non-Unicode encodings), or</li>
<li>The behavior is not locale dependent in which case,
field widths can only be specified in terms of code
units.<br>
</li>
</ol>
</div>
</blockquote>
<div><br>
</div>
<div>Agreed, but let me rephrase:</div>
<div><br>
</div>
<div>Either a string is text, and therefore we need to know
its encoding, or it is a sequence of bytes (in the case of
char).</div>
<div>I have an opinion about what we are dealing with in this
context :D</div>
</div>
</div>
</blockquote>
<p>So your preference is for trade off #1 above and the cost is that
<tt>std::format</tt> is no longer locale insensitive even in the
cases where a <tt>std::locale</tt> argument is not provided.</p></div></blockquote><div>It would be _encoding_ sensitive.</div><div>It would not change, for example, the decimal separator.</div><div><br></div><div>When Unicode is involved, and even when it is not, I think it is important not to conflate locale and encoding, even if C more or less amalgamates the two and derives one from the other.</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
<p>Since I don't think field width works for alignment, even if EGCs
are used (see Henri's post - <a class="gmail-m_4045717672081106664moz-txt-link-freetext" href="https://hsivonen.fi/string-length" target="_blank">https://hsivonen.fi/string-length</a>), I
prefer trade off #2.<br>
</p>
<p>Tom.<br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Recall that, unless there is a call to <tt>std::setlocale</tt>,
all C and C++ processes start with the locale set to <tt>"C"</tt></p>
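<p>A minimal sketch of how the startup locale can be observed. The helper name <tt>current_c_locale</tt> is hypothetical; it only queries the current C locale and does not change it.</p>

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: returns the name of the current C locale.
// Passing nullptr to std::setlocale queries the locale without modifying it.
inline std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}
```

<p>Called before any <tt>std::setlocale</tt> call that changes the locale, this is expected to return <tt>"C"</tt>, regardless of what <tt>LANG</tt> is set to in the environment.</p>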
</div>
</blockquote>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p> </p>
<p>Tom.<br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396WordSection1">
<p class="MsoNormal"> </p>
<p class="MsoNormal">Billy3</p>
<p class="MsoNormal"> </p>
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" color="#000000" face="Calibri,
sans-serif"><b>From:</b> Lib <<a href="mailto:lib-bounces@lists.isocpp.org" target="_blank">lib-bounces@lists.isocpp.org</a>>
on behalf of Corentin via Lib <<a href="mailto:lib@lists.isocpp.org" target="_blank">lib@lists.isocpp.org</a>><br>
<b>Sent:</b> Saturday, September 7,
2019 11:08:25 PM<br>
<b>To:</b> Library Working Group <<a href="mailto:lib@lists.isocpp.org" target="_blank">lib@lists.isocpp.org</a>><br>
<b>Cc:</b> Corentin <<a href="mailto:corentin.jabot@gmail.com" target="_blank">corentin.jabot@gmail.com</a>>;
Victor Zverovich <<a href="mailto:victor.zverovich@gmail.com" target="_blank">victor.zverovich@gmail.com</a>>;
Tom Honermann <<a href="mailto:tom@honermann.net" target="_blank">tom@honermann.net</a>>;
<a href="mailto:unicode@isocpp.open-std.org" target="_blank">unicode@isocpp.open-std.org</a>
<<a href="mailto:unicode@open-std.org" target="_blank">unicode@open-std.org</a>><br>
<b>Subject:</b> Re: [isocpp-lib] New
issue: Are std::format field widths
code units, code points, or something
else?</font>
<div> </div>
</div>
<div>
<div dir="auto">
<div><br>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Sun, Sep 8, 2019, 5:30 AM Tom
Honermann via Lib <<a href="mailto:lib@lists.isocpp.org" target="_blank">lib@lists.isocpp.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334moz-cite-prefix">On
9/7/19 10:44 PM, Victor
Zverovich wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>> <span class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Is
field width measured
in code units, code
points, or something
else?</span></div>
<div><span class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><br>
</span></div>
<div><span class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"></span>I
think the main
consideration here is
that width should be
locale-independent by
default for consistency
with the rest of
std::format's design.</div>
</div>
</blockquote>
I agree with that goal, but...<br>
<blockquote type="cite">
<div dir="ltr">
<div>If we can say that
width is measured in
grapheme clusters or
code points based on the
execution encoding (or
whatever the standardese
term) without querying
the locale then I
suggest doing so.</div>
</div>
</blockquote>
I don't know how to do that.
From my response to Zach, if
code units aren't used, then
behavior should be different
for LANG=C vs LANG=C.UTF-8.<br>
<blockquote type="cite">
<div dir="ltr">
<div>I have slight
preference for grapheme
clusters since those
correspond to
user-perceived
characters, but only
have implementation
experience with code
points (this is what
both the fmt library and
Python do).<br>
</div>
</div>
</blockquote>
<p>I would definitely vote for
EGCs over code points. I
think code points are
probably the worst of the
options since they make the
results dependent on Unicode
normalization form.<br>
</p>
</div>
</blockquote>
</div>
</div>
<div dir="auto"><br>
</div>
<div dir="auto">I disagree. Code units
are the worst option. For me, anything
involving code units is a big red
flag. I agree that EGCS is the best
option. That doesn't drag locale,
though it might be a bit involved for
implementers in C++20. </div>
<div dir="auto">I don't think specifying
EGCS for Unicode text and codepoints
otherwise would be too difficult.
Implementation might be a bit
challenging on some platforms in the
C++20 time frame, but they could
fall back to codepoints in the
meantime. Not perfect, but I think we
need a good long-term solution
rather than a bad short-term one.</div>
<div dir="auto"><br>
</div>
<div dir="auto">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Tom.<br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div><span class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><br>
</span></span></div>
<div><span class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Cheers,</span></span></div>
<div><span class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span class="gmail-m_4045717672081106664gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Victor</span></span></div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Sat, Sep 7, 2019 at 5:13
PM Tom Honermann via Lib
<<a href="mailto:lib@lists.isocpp.org" rel="noreferrer" target="_blank">lib@lists.isocpp.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p><a href="http://eel.is/c++draft/format#string.std-7" rel="noreferrer" target="_blank">[format.string.std]p7</a>
states:</p>
<blockquote type="cite">
<p>The <i>positive-integer</i>
in <i>width</i>
is a decimal
integer defining
the minimum field
width. If <i>width</i>
is not specified,
there is no
minimum field
width, and the
field width is
determined based
on the content of
the field.</p>
</blockquote>
<p>Is field width
measured in code
units, code points,
or something else?</p>
<p>Consider the
following example
assuming a UTF-8
locale:<br>
</p>
<p><tt>std::format("{}",
"\xC3\x81");
// U+00C1</tt><tt>
{ </tt><tt>LATIN
CAPITAL LETTER A
WITH ACUTE }</tt><br>
<tt>std::format("{}",
"\x41\xCC\x81");
// U+0041 U+0301 {
</tt><tt>LATIN
CAPITAL LETTER A }
{ </tt><tt>COMBINING
ACUTE ACCENT }<br>
</tt></p>
<p>In both cases, the
arguments encode the
same user-perceived
character (Á). The
first uses two UTF-8
code units to encode
a single code point
that represents a
single glyph using a
composed Unicode
normalization form.
The second uses
three code units to
encode two code
points that
represent the same
glyph using a
decomposed Unicode
normalization form.</p>
<p>How is the field
width determined?
If measured in code
units, the first has
a width of 2 and the
second of 3. If
measured in code
points, the first
has a width of 1 and
the second of 2. If
measured in grapheme
clusters, both have
a width of 1. Is
the determination
locale dependent?</p>
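<p>A minimal sketch of the first two measures, assuming well-formed UTF-8 input. The helper names <tt>count_code_units</tt> and <tt>count_code_points</tt> are hypothetical and not part of any proposal. Counting grapheme clusters is omitted because it needs the Unicode segmentation data described in UAX #29.</p>

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of UTF-8 code units, i.e. one per byte.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: number of code points, assuming s is well-formed UTF-8.
// Every byte that is not a continuation byte (10xxxxxx) starts a code point.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// U+00C1, precomposed form
static_assert(count_code_units("\xC3\x81") == 2);
static_assert(count_code_points("\xC3\x81") == 1);

// U+0041 U+0301, decomposed form
static_assert(count_code_units("\x41\xCC\x81") == 3);
static_assert(count_code_points("\x41\xCC\x81") == 2);
```

<p>The <tt>static_assert</tt>s reproduce the numbers above: 2 versus 3 code units, and 1 versus 2 code points, for the same user-perceived character.</p>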
<p><b>Proposed
resolution:</b></p>
<p>Field widths are
measured in code
units and are not
locale dependent.
Modify <a href="http://eel.is/c++draft/format#string.std-7" rel="noreferrer" target="_blank">[format.string.std]p7</a> as follows:</p>
<blockquote type="cite">
<p>The <i>positive-integer</i>
in <i>width</i>
is a decimal
integer defining
the minimum field
width. If <i>width</i>
is not specified,
there is no
minimum field
width, and the
field width is
determined based
on the content of
the field. <b><font color="#33cc00">Field width is measured in code units. Each byte of a
multibyte
character
contributes to
the field
width.</font></b><br>
</p>
</blockquote>
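<p>To illustrate what the proposed wording would imply for padding, here is a rough sketch. <tt>pad_code_units</tt> is a hypothetical helper, not the <tt>std::format</tt> implementation; it left-aligns a string in a field whose minimum width is counted in code units.</p>

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: left-align s in a field of at least `width`
// code units, padding with spaces. Each byte of a multibyte character
// counts toward the width.
inline std::string pad_code_units(std::string_view s, std::size_t width) {
    std::string out(s);
    if (out.size() < width) {
        out.append(width - out.size(), ' ');
    }
    return out;
}
```

<p>With a width of 4, the precomposed form <tt>"\xC3\x81"</tt> receives two spaces of padding and the decomposed form <tt>"\x41\xCC\x81"</tt> receives one, even though both display as the same character. The result is locale independent, but the padded fields do not line up visually.</p>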
<p>(<i>code unit</i>
is not formally
defined in the
standard. Most uses
occur in UTF-8 and
UTF-16 specific
contexts, but <a href="http://eel.is/c++draft/lex.ext#5" rel="noreferrer" target="_blank">[lex.ext]p5</a>
uses it in an
encoding agnostic
context.)<br>
</p>
<p>Tom.<br>
</p>
</div>
_______________________________________________<br>
Lib mailing list<br>
<a href="mailto:Lib@lists.isocpp.org" rel="noreferrer" target="_blank">Lib@lists.isocpp.org</a><br>
Subscription: <a href="https://lists.isocpp.org/mailman/listinfo.cgi/lib" rel="noreferrer" target="_blank">https://lists.isocpp.org/mailman/listinfo.cgi/lib</a><br>
Link to this post: <a href="http://lists.isocpp.org/lib/2019/09/13440.php" rel="noreferrer" target="_blank">http://lists.isocpp.org/lib/2019/09/13440.php</a><br>
</blockquote>
</div>
</blockquote>
<p><br>
</p>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
<br>
</blockquote>
<p><br>
</p>
</div>
</blockquote>
</div>
</div>
</blockquote>
<p><br>
</p>
</div>
</blockquote></div></div>