<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:"Yu Gothic";
        panose-1:2 11 4 0 0 0 0 0 0 0;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:"\@Yu Gothic";
        panose-1:2 11 4 0 0 0 0 0 0 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
.MsoChpDefault
        {mso-style-type:export-only;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style>
<div class="WordSection1">
<p class="MsoNormal">&gt; I agree that EGCs are the best option. That doesn't drag in locale<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Because we don’t get to assume that we’re talking about Unicode at all, it absolutely drags in locale.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Billy3</p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Lib &lt;lib-bounces@lists.isocpp.org&gt; on behalf of Corentin via Lib &lt;lib@lists.isocpp.org&gt;<br>
<b>Sent:</b> Saturday, September 7, 2019 11:08:25 PM<br>
<b>To:</b> Library Working Group &lt;lib@lists.isocpp.org&gt;<br>
<b>Cc:</b> Corentin &lt;corentin.jabot@gmail.com&gt;; Victor Zverovich &lt;victor.zverovich@gmail.com&gt;; Tom Honermann &lt;tom@honermann.net&gt;; unicode@isocpp.open-std.org &lt;unicode@open-std.org&gt;<br>
<b>Subject:</b> Re: [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?</font>
<div> </div>
</div>
<div>
<div dir="auto">
<div><br>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun, Sep 8, 2019, 5:30 AM Tom Honermann via Lib &lt;<a href="mailto:lib@lists.isocpp.org">lib@lists.isocpp.org</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div class="m_-5342112777345943334moz-cite-prefix">On 9/7/19 10:44 PM, Victor Zverovich wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>> <span class="m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im">
Is field width measured in code units, code points, or something else?</span></div>
<div><span class="m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><br>
</span></div>
<div><span class="m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"></span>I think the main consideration here is that width should be locale-independent by default for consistency with the rest of std::format's design.</div>
</div>
</blockquote>
I agree with that goal, but...<br>
<blockquote type="cite">
<div dir="ltr">
<div>If we can say that width is measured in grapheme clusters or code points based on the execution encoding (or whatever the standardese term is) without querying the locale, then I suggest doing so.</div>
</div>
</blockquote>
I don't know how to do that. From my response to Zach, if code units aren't used, then behavior should be different for LANG=C vs LANG=C.UTF-8.<br>
<blockquote type="cite">
<div dir="ltr">
<div>I have a slight preference for grapheme clusters, since those correspond to user-perceived characters, but I only have implementation experience with code points (this is what both the fmt library and Python do).<br>
</div>
</div>
</blockquote>
<p>I would definitely vote for EGCs over code points. I think code points are probably the worst of the options, since they make the results dependent on Unicode normalization form.<br>
</p>
</div>
</blockquote>
</div>
</div>
<div dir="auto"><br>
</div>
<div dir="auto">I disagree. Code units are the worst option. For me, anything involving code units is a big red flag. I agree that EGCs are the best option. That doesn't drag in locale, though it might be a bit involved for implementers in the C++20 time frame. </div>
<div dir="auto">I don't think specifying EGCs for Unicode text and code points otherwise would be too difficult. Implementation might be a bit challenging on some platforms in the C++20 time frame, but they could fall back to code points in the meantime. It's not perfect, but I think we need a good long-term solution rather than a bad short-term one.</div>
<div dir="auto"><br>
</div>
<div dir="auto">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<p></p>
<p>Tom.<br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div><span class="m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span class="m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><br>
</span></span></div>
<div><span class="m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span class="m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Cheers,</span></span></div>
<div><span class="m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im"><span class="m_-5342112777345943334gmail-m_-1131282094399464115m_5127634081229612262gmail-im">Victor</span></span></div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib &lt;<a href="mailto:lib@lists.isocpp.org" target="_blank" rel="noreferrer">lib@lists.isocpp.org</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p><a href="http://eel.is/c++draft/format#string.std-7" target="_blank" rel="noreferrer">[format.string.std]p7</a>
states:</p>
<p></p>
<blockquote type="cite">
<p>The <i>positive-integer</i> in <i>width</i> is a decimal integer defining the minimum field width. If
<i>width</i> is not specified, there is no minimum field width, and the field width is determined based on the content of the field.</p>
</blockquote>
<p>Is field width measured in code units, code points, or something else?</p>
<p>Consider the following example assuming a UTF-8 locale:<br>
</p>
<p><tt>std::format("{}", "\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }</tt><br>
<tt>std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }</tt></p>
<p>In both cases, the arguments encode the same user-perceived character (Á). The first uses two UTF-8 code units to encode a single code point that represents a single glyph using a composed Unicode normalization form. The second uses three code units to
encode two code points that represent the same glyph using a decomposed Unicode normalization form.</p>
<p>How is the field width determined? If measured in code units, the first has a width of 2 and the second of 3. If measured in code points, the first has a width of 1 and the second of 2. If measured in grapheme clusters, both have a width of 1. Is the
determination locale dependent?</p>
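<p>The three counts above can be reproduced with a minimal sketch, assuming the arguments are well-formed UTF-8 held in <tt>char</tt>. All function names here are hypothetical. The cluster count is only a crude approximation that treats U+0300..U+036F as combining marks; real extended grapheme cluster segmentation follows UAX #29 and needs Unicode data tables.</p>

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Decode UTF-8 into code points. Assumes well-formed input; no validation
// is attempted, which is acceptable only for a sketch.
std::vector<char32_t> decode_utf8(std::string_view s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length is derived from the lead byte.
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? b : (b & (0xFFu >> (len + 1)));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Width in code units: one per char (one per byte for UTF-8).
std::size_t width_in_code_units(std::string_view s) { return s.size(); }

// Width in code points: one per decoded scalar value.
std::size_t width_in_code_points(std::string_view s) {
    return decode_utf8(s).size();
}

// Approximate width in grapheme clusters: a code point in the Combining
// Diacritical Marks block (U+0300..U+036F) extends the previous cluster.
// This is NOT full UAX #29 segmentation.
std::size_t width_in_clusters_approx(std::string_view s) {
    std::size_t n = 0;
    bool first = true;
    for (char32_t cp : decode_utf8(s)) {
        bool combining = cp >= 0x300 && cp <= 0x36F;
        if (first || !combining) ++n;
        first = false;
    }
    return n;
}
```

<p>With these helpers, <tt>"\xC3\x81"</tt> measures 2, 1, and 1, while <tt>"\x41\xCC\x81"</tt> measures 3, 2, and 1, matching the numbers given above.</p>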
<p><b>Proposed resolution:</b></p>
<p>Field widths are measured in code units and are not locale dependent. Modify <a href="http://eel.is/c++draft/format#string.std-7" target="_blank" rel="noreferrer">[format.string.std]p7</a> as follows:</p>
<p></p>
<blockquote type="cite">
<p>The <i>positive-integer</i> in <i>width</i> is a decimal integer defining the minimum field width. If
<i>width</i> is not specified, there is no minimum field width, and the field width is determined based on the content of the field.
<b><font color="#33cc00">Field width is measured in code units. Each byte of a multibyte character contributes to the field width.</font></b><br>
</p>
</blockquote>
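<p>To illustrate what the proposed wording would imply, here is a minimal sketch of padding to a minimum field width counted in code units. <tt>pad_by_code_units</tt> is a hypothetical helper written for this example, not part of <tt>std::format</tt>, and it assumes <tt>char</tt> strings where one code unit is one byte.</p>

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: left-align `field` in at least `min_width` code units,
// appending `fill` as needed. Width is simply the number of char elements,
// so every byte of a multibyte character counts toward the width.
std::string pad_by_code_units(std::string_view field,
                              std::size_t min_width,
                              char fill = ' ') {
    std::string out(field);
    if (out.size() < min_width)
        out.append(min_width - out.size(), fill);
    return out;
}
```

<p>Under this rule, padding the composed form <tt>"\xC3\x81"</tt> to width 4 appends two fill characters, while the decomposed form <tt>"\x41\xCC\x81"</tt> gets only one, even though both display as the same character.</p>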
<p>(<i>code unit</i> is not formally defined in the standard. Most uses occur in UTF-8 and UTF-16 specific contexts, but
<a href="http://eel.is/c++draft/lex.ext#5" target="_blank" rel="noreferrer">
[lex.ext]p5</a> uses it in an encoding agnostic context.)<br>
</p>
<p>Tom.<br>
</p>
</div>
_______________________________________________<br>
Lib mailing list<br>
<a href="mailto:Lib@lists.isocpp.org" target="_blank" rel="noreferrer">Lib@lists.isocpp.org</a><br>
Subscription: <a href="https://lists.isocpp.org/mailman/listinfo.cgi/lib" rel="noreferrer" target="_blank">
https://lists.isocpp.org/mailman/listinfo.cgi/lib</a><br>
Link to this post: <a href="http://lists.isocpp.org/lib/2019/09/13440.php" rel="noreferrer" target="_blank">
http://lists.isocpp.org/lib/2019/09/13440.php</a><br>
</blockquote>
</div>
</blockquote>
<p><br>
</p>
</div>
Link to this post: <a href="http://lists.isocpp.org/lib/2019/09/13446.php" rel="noreferrer" target="_blank">
http://lists.isocpp.org/lib/2019/09/13446.php</a><br>
</blockquote>
</div>
</div>
</div>
</div>
</body>
</html>