<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, 8 Sep 2019 at 18:12, Tom Honermann <<a href="mailto:tom@honermann.net">tom@honermann.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div class="gmail-m_1796657059973223044moz-cite-prefix">On 9/8/19 6:00 AM, Corentin via Lib
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun, 8 Sep 2019 at 11:17,
Corentin <<a href="mailto:corentin.jabot@gmail.com" target="_blank">corentin.jabot@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun, 8 Sep 2019 at
09:52, Billy O'Neal (VC LIBS) <<a href="mailto:bion@microsoft.com" target="_blank">bion@microsoft.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396WordSection1">
<p class="MsoNormal">> I agree that EGCS is the
best option. That doesn't drag locale</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Because we don’t get to
assume that we’re talking about Unicode at all,
it absolutely drags in locale.</p>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Sorry, I should have been more specific.</div>
<div>There is a non-tailored Unicode EGCS boundary
algorithm (though it can be tailored).</div>
<div>I didn't mean to imply that text manipulation can
be done without knowing its encoding, and I never use
"locale" to mean encoding. </div>
<div><br>
</div>
<div>EGCS are only defined for text whose character
repertoire is Unicode; other encodings deal with
codepoints.</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<div>To be clear, whether the EGC algorithm is required to be
tailored matters because tailoring, for all intents and
purposes, requires ICU or something else with CLDR data,
which restricts the platforms on which this can be implemented.<br>
</div>
</div>
</div>
</blockquote>
<p>Tailoring is not relevant to this discussion.</p></div></blockquote><div>It is relevant; see <a href="https://unicode.org/reports/tr29/">https://unicode.org/reports/tr29/</a>: "ch" is 2 EGCS in most locales, but in Slovak it is 1. I don't make the rules :D</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
<p>The locale dependency stems from the encoding itself being
dependent on locale. Again, LANG=C vs LANG=C.UTF-8. If the
specified behavior is encoding dependent (as it would have to be
for field width to be a count of any of code points, scalar
values, or EGCs), then it is also locale dependent (for char and
wchar_t). Thus there is a trade off:</p>
<ol>
<li>Either the behavior is locale dependent, in which case field
widths could be specified such that they count code points,
scalar values, or EGCs when the locale selects a Unicode
encoding (and something else for non-Unicode encodings), or</li>
<li>The behavior is not locale dependent, in which case field
widths can only be specified in terms of code units.<br></li></ol></div></blockquote><div><br></div><div>Agreed, but let me rephrase:</div><div><br></div><div>Either a string is text, and therefore we need to know its encoding, or it is a sequence of bytes (in the case of char).</div><div>I have an opinion about what we are dealing with in this context :D</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
<p>Recall that, unless there is a call to <tt>std::setlocale</tt>,
all C and C++ processes start with the locale set to <tt>"C"</tt></p></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><p>
</p>
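<p>A minimal sketch of the point about the startup locale, assuming a hosted implementation. The helper name <tt>current_c_locale</tt> is hypothetical; it only queries the global C locale and does not change it.</p>

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: returns the name of the current global C locale.
// Passing a null pointer as the locale argument queries without modifying.
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? std::string(name) : std::string();
}
```

<p>Before any other call to <tt>std::setlocale</tt>, this reports <tt>"C"</tt>. A program that wants the environment's locale (for example LANG=C.UTF-8) has to opt in, typically with <tt>std::setlocale(LC_ALL, "")</tt>.</p>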
<p>Tom.<br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396WordSection1">
<p class="MsoNormal"> </p>
<p class="MsoNormal">Billy3</p>
<p class="MsoNormal"> </p>
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" color="#000000" face="Calibri, sans-serif"><b>From:</b>
Lib <<a href="mailto:lib-bounces@lists.isocpp.org" target="_blank">lib-bounces@lists.isocpp.org</a>>
on behalf of Corentin via Lib <<a href="mailto:lib@lists.isocpp.org" target="_blank">lib@lists.isocpp.org</a>><br>
<b>Sent:</b> Saturday, September 7, 2019
11:08:25 PM<br>
<b>To:</b> Library Working Group <<a href="mailto:lib@lists.isocpp.org" target="_blank">lib@lists.isocpp.org</a>><br>
<b>Cc:</b> Corentin <<a href="mailto:corentin.jabot@gmail.com" target="_blank">corentin.jabot@gmail.com</a>>;
Victor Zverovich <<a href="mailto:victor.zverovich@gmail.com" target="_blank">victor.zverovich@gmail.com</a>>;
Tom Honermann <<a href="mailto:tom@honermann.net" target="_blank">tom@honermann.net</a>>;
<a href="mailto:unicode@isocpp.open-std.org" target="_blank">unicode@isocpp.open-std.org</a>
<<a href="mailto:unicode@open-std.org" target="_blank">unicode@open-std.org</a>><br>
<b>Subject:</b> Re: [isocpp-lib] New issue: Are
std::format field widths code units, code
points, or something else?</font>
<div> </div>
</div>
<div>
<div dir="auto">
<div><br>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun,
Sep 8, 2019, 5:30 AM Tom Honermann via Lib
<<a href="mailto:lib@lists.isocpp.org" target="_blank">lib@lists.isocpp.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334moz-cite-prefix">On
9/7/19 10:44 PM, Victor Zverovich
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>> <span class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Is
field width measured in code
units, code points, or something
else?</span></div>
<div><span class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><br>
</span></div>
<div><span class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"></span>I
think the main consideration here
is that width should be
locale-independent by default for
consistency with the rest of
std::format's design.</div>
</div>
</blockquote>
I agree with that goal, but...<br>
<blockquote type="cite">
<div dir="ltr">
<div>If we can say that width is
measured in grapheme clusters or
code points based on the execution
encoding (or whatever the
standardese term is) without querying
the locale then I suggest doing
so.</div>
</div>
</blockquote>
I don't know how to do that. From my
response to Zach, if code units aren't
used, then behavior should be different
for LANG=C vs LANG=C.UTF-8.<br>
<blockquote type="cite">
<div dir="ltr">
<div>I have a slight preference for
grapheme clusters since those
correspond to user-perceived
characters, but only have
implementation experience with
code points (this is what both the
fmt library and Python do).<br>
</div>
</div>
</blockquote>
<p>I would definitely vote for EGCs over
code points. I think code points are
probably the worst of the options
since it makes the results dependent
on Unicode normalization form.<br>
</p>
</div>
</blockquote>
</div>
</div>
<div dir="auto"><br>
</div>
<div dir="auto">I disagree. Code units are the
worst option. For me, anything involving code
units is a big red flag. I agree that EGCS is
the best option. That doesn't drag locale, though it
might be a bit involved for implementers in
C++20. </div>
<div dir="auto">I don't think specifying EGCS for
Unicode text and codepoints otherwise would
be too difficult. Implementation might be a
bit challenging on some platforms in the C++20
time frame, but they could fall back to
codepoints in the meantime. Not perfect, but I
think we need a good long-term solution rather
than a bad short-term one.</div>
<div dir="auto"><br>
</div>
<div dir="auto">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Tom.<br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div><span class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><br>
</span></span></div>
<div><span class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Cheers,</span></span></div>
<div><span class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span class="gmail-m_1796657059973223044m_-5900481427510438976gmail-m_-7176513910300778324gmail-m_-1423556694114109396m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Victor</span></span></div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Sat, Sep 7, 2019 at 5:13 PM Tom
Honermann via Lib <<a href="mailto:lib@lists.isocpp.org" rel="noreferrer" target="_blank">lib@lists.isocpp.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p><a href="http://eel.is/c++draft/format#string.std-7" rel="noreferrer" target="_blank">[format.string.std]p7</a>
states:</p>
<blockquote type="cite">
<p>The <i>positive-integer</i>
in <i>width</i> is a
decimal integer defining the
minimum field width. If
<i>width</i> is not
specified, there is no
minimum field width, and the
field width is determined
based on the content of the
field.</p>
</blockquote>
<p>Is field width measured in
code units, code points, or
something else?</p>
<p>Consider the following
example assuming a UTF-8
locale:<br>
</p>
<p><tt>std::format("{}",
"\xC3\x81"); // U+00C1</tt><tt>
{ </tt><tt>LATIN CAPITAL
LETTER A WITH ACUTE }</tt><br>
<tt>std::format("{}",
"\x41\xCC\x81"); // U+0041
U+0301 { </tt><tt>LATIN
CAPITAL LETTER A } {
</tt><tt>COMBINING ACUTE
ACCENT }<br>
</tt></p>
<p>In both cases, the arguments
encode the same user-perceived
character (Á). The first uses
two UTF-8 code units to encode
a single code point that
represents a single glyph
using a composed Unicode
normalization form. The
second uses three code units
to encode two code points that
represent the same glyph using
a decomposed Unicode
normalization form.</p>
<p>How is the field width
determined? If measured in
code units, the first has a
width of 2 and the second of
3. If measured in code
points, the first has a width
of 1 and the second of 2. If
measured in grapheme clusters,
both have a width of 1. Is
the determination locale
dependent?</p>
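<p>The counts above can be checked with a small sketch. It assumes the input is well-formed UTF-8; the function names <tt>count_code_units</tt> and <tt>count_code_points</tt> are hypothetical. Grapheme clusters are deliberately left out, since counting them needs the Unicode segmentation rules and character data rather than a simple byte scan.</p>

```cpp
#include <cstddef>
#include <string_view>

// UTF-8 code units are bytes, so the count is just the length in bytes.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Code points, assuming well-formed UTF-8: every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

<p>For <tt>"\xC3\x81"</tt> this gives 2 code units and 1 code point; for <tt>"\x41\xCC\x81"</tt> it gives 3 code units and 2 code points, matching the figures in the paragraph above.</p>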
<p><b>Proposed resolution:</b></p>
<p>Field widths are measured in
code units and are not locale
dependent. Modify <a href="http://eel.is/c++draft/format#string.std-7" rel="noreferrer" target="_blank">
[format.string.std]p7</a> as
follows:</p>
<blockquote type="cite">
<p>The <i>positive-integer</i>
in <i>width</i> is a
decimal integer defining the
minimum field width. If
<i>width</i> is not
specified, there is no
minimum field width, and the
field width is determined
based on the content of the
field.
<b><font color="#33cc00">Field
width is measured in
code units. Each byte
of a multibyte character
contributes to the field
width.</font></b><br>
</p>
</blockquote>
<p>(<i>code unit</i> is not
formally defined in the
standard. Most uses occur in
UTF-8 and UTF-16 specific
contexts, but
<a href="http://eel.is/c++draft/lex.ext#5" rel="noreferrer" target="_blank">
[lex.ext]p5</a> uses it in
an encoding agnostic context.)<br>
</p>
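<p>To make the consequence of the proposed wording concrete, here is a sketch of padding to a minimum width measured in code units. The helper <tt>pad_to_code_units</tt> is hypothetical and is not how any implementation of <tt>std::format</tt> is written; it only mimics left alignment with a fill character.</p>

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: left-align `field` in at least `width` code units
// (bytes), appending `fill` as needed. Longer fields are returned unchanged.
std::string pad_to_code_units(std::string_view field, std::size_t width,
                              char fill = ' ') {
    std::string out(field);
    if (out.size() < width) {
        out.append(width - out.size(), fill);
    }
    return out;
}
```

<p>With a width of 4, the precomposed <tt>"\xC3\x81"</tt> receives two fill characters and the decomposed <tt>"\x41\xCC\x81"</tt> receives one. Both results are 4 bytes long, yet on a display that renders each as a single Á they would likely occupy different numbers of columns.</p>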
<p>Tom.<br>
</p>
</div>
_______________________________________________<br>
Lib mailing list<br>
<a href="mailto:Lib@lists.isocpp.org" rel="noreferrer" target="_blank">Lib@lists.isocpp.org</a><br>
Subscription: <a href="https://lists.isocpp.org/mailman/listinfo.cgi/lib" rel="noreferrer noreferrer" target="_blank">
https://lists.isocpp.org/mailman/listinfo.cgi/lib</a><br>
Link to this post: <a href="http://lists.isocpp.org/lib/2019/09/13440.php" rel="noreferrer noreferrer" target="_blank">
http://lists.isocpp.org/lib/2019/09/13440.php</a><br>
</blockquote>
</div>
</blockquote>
<p><br>
</p>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
<br>
</blockquote>
<p><br>
</p>
</div>
</blockquote></div></div>