[SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

Tom Honermann tom at honermann.net
Wed Sep 11 22:06:39 CEST 2019


On 9/11/19 3:32 PM, Marshall Clow wrote:
> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib 
> <lib at lists.isocpp.org> wrote:
>
>     [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
>     states:
>
>>     The /positive-integer/ in /width/ is a decimal integer defining
>>     the minimum field width.  If /width/ is not specified, there is
>>     no minimum field width, and the field width is determined based
>>     on the content of the field.
>>
>     Is field width measured in code units, code points, or something else?
>
>     Consider the following example assuming a UTF-8 locale:
>
>     std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>     std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>
>     In both cases, the arguments encode the same user-perceived
>     character (Á).  The first uses two UTF-8 code units to encode a
>     single code point that represents a single glyph using a composed
>     Unicode normalization form.  The second uses three code units to
>     encode two code points that represent the same glyph using a
>     decomposed Unicode normalization form.
>
>     How is the field width determined?  If measured in code units, the
>     first has a width of 2 and the second of 3. If measured in code
>     points, the first has a width of 1 and the second of 2.  If
>     measured in grapheme clusters, both have a width of 1.  Is the
>     determination locale dependent?
>
>
>
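To make the counts quoted above concrete, here is a minimal sketch of how the first two measures could be computed. The helper names `code_units` and `code_points` are hypothetical, and the code point count assumes well-formed UTF-8 input. Counting grapheme clusters needs the Unicode segmentation data and is not attempted here.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: the number of UTF-8 code units is just the byte count.
std::size_t code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: the number of code points, ASSUMING `s` is
// well-formed UTF-8. Every code point contributes exactly one byte that
// is not a continuation byte (bit pattern 10xxxxxx), so count those.
std::size_t code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With these, "\xC3\x81" measures 2 code units and 1 code point, and "\x41\xCC\x81" measures 3 code units and 2 code points, matching the numbers in the question.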
> (Coming late to the party)
> Let's ask a different question.
>
>           std::string s = "/* some content */";
>           std::ostringstream oss;
>           oss << std::setw(22) << s;
>           std::string result1 = oss.str();
>           std::string result2 = std::format("{:22}", s);
>
> What can we say about the contents of "result1" and "result2"?
> Are they the same? Does it matter what the contents of `s` are?

Excellent questions.

I really want them to be the same (at least by default; additional 
opt-in support for locale/encoding sensitive alignment strikes me as 
potentially reasonable, assuming identification of compelling use cases).
I don't think the contents of `s` should matter (without additional opt-in).
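
For reference, a small sketch of what the iostream side of that comparison does today. The helper name `pad_with_setw` is hypothetical. The sketch only covers `std::setw`; whether `std::format` should produce the same bytes is the open question here, so it is deliberately left out.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: pad `s` to `width` using the iostream machinery.
// For std::string, the inserter compares `width` against s.size(), i.e.
// the number of char elements (code units), and by default puts the
// fill characters in front (right alignment).
std::string pad_with_setw(const std::string& s, int width) {
    std::ostringstream oss;
    oss << std::setw(width) << s;
    return oss.str();
}
```

With a width of 4, "\xC3\x81" receives two fill characters and "\x41\xCC\x81" receives one, so both results are 4 bytes long even though the two inputs display as the same character. That is the behavior `result1` exhibits, independent of what `s` encodes.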

Tom.

>
> -- Marshall

