<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 9/7/19 10:44 PM, Victor Zverovich
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CANawtxY2avFFj7xgjYKMpcFDA=ViuyhmYufmZ0fLmw+G3dNAvA@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div>> <span
class="gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Is
field width measured in code units, code points, or
something else?</span></div>
<div><span
class="gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><br>
</span></div>
<div><span
class="gmail-m_-1131282094399464115m_5127634081229612262gmail-im"></span>I
think the main consideration here is that width should be
locale-independent by default for consistency with the rest of
std::format's design.</div>
</div>
</blockquote>
I agree with that goal, but...<br>
<blockquote type="cite"
cite="mid:CANawtxY2avFFj7xgjYKMpcFDA=ViuyhmYufmZ0fLmw+G3dNAvA@mail.gmail.com">
<div dir="ltr">
<div>If we can say that width is measured in grapheme clusters
or code points based on the execution encoding (or whatever
the standardese term is) without querying the locale, then I
suggest doing so.</div>
</div>
</blockquote>
I don't know how to do that. As I noted in my response to Zach, if
code units aren't used, then the behavior should differ between
LANG=C and LANG=C.UTF-8.<br>
<blockquote type="cite"
cite="mid:CANawtxY2avFFj7xgjYKMpcFDA=ViuyhmYufmZ0fLmw+G3dNAvA@mail.gmail.com">
<div dir="ltr">
<div>I have a slight preference for grapheme clusters since those
correspond to user-perceived characters, but I only have
implementation experience with code points (this is what both
the fmt library and Python do).<br>
</div>
</div>
</blockquote>
<p>I would definitely vote for extended grapheme clusters (EGCs)
over code points. I think code points are probably the worst of
the options since they make the results dependent on the Unicode
normalization form.<br>
</p>
<p>Tom.<br>
</p>
<blockquote type="cite"
cite="mid:CANawtxY2avFFj7xgjYKMpcFDA=ViuyhmYufmZ0fLmw+G3dNAvA@mail.gmail.com">
<div dir="ltr">
<div><span
class="gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span
class="gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><br>
</span></span></div>
<div><span
class="gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span
class="gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Cheers,</span></span></div>
<div><span
class="gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span
class="gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Victor</span></span></div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Sep 7, 2019 at 5:13 PM
Tom Honermann via Lib &lt;<a
href="mailto:lib@lists.isocpp.org" moz-do-not-send="true">lib@lists.isocpp.org</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p><a href="http://eel.is/c++draft/format#string.std-7"
target="_blank" moz-do-not-send="true">[format.string.std]p7</a>
states:</p>
<blockquote type="cite">
<p>The <i>positive-integer</i> in <i>width</i> is a
decimal integer defining the minimum field width. If <i>width</i>
is not specified, there is no minimum field width, and
the field width is determined based on the content of
the field.</p>
</blockquote>
<p>Is field width measured in code units, code points, or
something else?</p>
<p>Consider the following example assuming a UTF-8 locale:<br>
</p>
<p><tt>std::format("{}", "\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }</tt><br>
<tt>std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }</tt><br>
</p>
<p>In both cases, the arguments encode the same
user-perceived character (Á). The first uses two UTF-8
code units to encode a single code point that represents a
single glyph using a composed Unicode normalization form.
The second uses three code units to encode two code points
that represent the same glyph using a decomposed Unicode
normalization form.</p>
<p>How is the field width determined? If measured in code
units, the first has a width of 2 and the second of 3. If
measured in code points, the first has a width of 1 and
the second of 2. If measured in grapheme clusters, both
have a width of 1. Is the determination locale dependent?</p>
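<p>The code unit and code point counts above can be checked with a
minimal sketch like the following. The function names
<tt>code_units</tt> and <tt>code_points</tt> are hypothetical and are
not part of <tt>std::format</tt>; the code point count assumes
well-formed UTF-8 input. Counting grapheme clusters is not shown
because it requires Unicode segmentation data.</p>

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: UTF-8 code units are bytes, so the count is the size.
constexpr std::size_t code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: counts code points, assuming s is well-formed UTF-8.
// Every byte that is not a continuation byte (10xxxxxx) starts a code point.
constexpr std::size_t code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Composed form: U+00C1.
static_assert(code_units("\xC3\x81") == 2);
static_assert(code_points("\xC3\x81") == 1);

// Decomposed form: U+0041 U+0301.
static_assert(code_units("\x41\xCC\x81") == 3);
static_assert(code_points("\x41\xCC\x81") == 2);
```

<p>Both measurements differ between the two normalization forms even
though the user-perceived character is the same.</p>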
<p><b>Proposed resolution:</b></p>
<p>Field widths are measured in code units and are not
locale dependent. Modify <a
href="http://eel.is/c++draft/format#string.std-7"
target="_blank" moz-do-not-send="true">[format.string.std]p7</a>
as follows:</p>
<blockquote type="cite">
<p>The <i>positive-integer</i> in <i>width</i> is a
decimal integer defining the minimum field width. If <i>width</i>
is not specified, there is no minimum field width, and
the field width is determined based on the content of
the field. <b><font color="#33cc00">Field width is
measured in code units. Each byte of a multibyte
character contributes to the field width.</font></b><br>
</p>
</blockquote>
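<p>As a rough illustration of what the proposed wording would imply
for padding, here is a sketch that pads a string to a minimum field
width measured in code units. The helper <tt>pad_code_units</tt> is
hypothetical and is not how any <tt>std::format</tt> implementation is
known to work; it assumes left alignment and a space as the fill
character.</p>

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of std::format): left-align s in a field of
// at least min_width, where width is measured in code units (bytes).
std::string pad_code_units(std::string_view s, std::size_t min_width) {
    std::string out(s);
    if (out.size() < min_width) {
        // Each byte of a multibyte character already counts toward the width,
        // so only the remaining difference is filled with spaces.
        out.append(min_width - out.size(), ' ');
    }
    return out;
}
```

<p>Under this rule, with a minimum width of 4, the composed form
<tt>"\xC3\x81"</tt> would receive two fill characters and the
decomposed form <tt>"\x41\xCC\x81"</tt> would receive one, so the two
results would have the same size in bytes but would likely occupy a
different number of columns when displayed.</p>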
<p>(<i>code unit</i> is not formally defined in the
standard. Most uses occur in UTF-8 and UTF-16 specific
contexts, but <a href="http://eel.is/c++draft/lex.ext#5"
target="_blank" moz-do-not-send="true">[lex.ext]p5</a>
uses it in an encoding agnostic context.)<br>
</p>
<p>Tom.<br>
</p>
</div>
Link to this post: <a
href="http://lists.isocpp.org/lib/2019/09/13440.php"
rel="noreferrer" target="_blank" moz-do-not-send="true">http://lists.isocpp.org/lib/2019/09/13440.php</a><br>
</blockquote>
</div>
</blockquote>
</body>
</html>