[SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

Victor Zverovich victor.zverovich at gmail.com
Sun Sep 8 04:44:46 CEST 2019


> Is field width measured in code units, code points, or something else?

I think the main consideration here is that width should be
locale-independent by default, for consistency with the rest of
std::format's design. If we can say that width is measured in grapheme
clusters or code points based on the execution encoding (or whatever the
standardese term is) without querying the locale, then I suggest doing so. I
have a slight preference for grapheme clusters, since those correspond to
user-perceived characters, but I only have implementation experience with
code points (which is what both the fmt library and Python do).
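As a concrete illustration of the code-point measurement that Python (and fmt)
use, here is a small Python sketch built on the two byte sequences from Tom's
example; it simply restates the code-unit/code-point counts from the discussion
and shows how Python's built-in formatting pads each form:

```python
# The two encodings of the same user-perceived character (Á):
composed = b"\xC3\x81".decode("utf-8")        # U+00C1, precomposed form
decomposed = b"\x41\xCC\x81".decode("utf-8")  # U+0041 U+0301, decomposed form

# Code units (UTF-8 bytes) differ: 2 vs. 3.
assert len(composed.encode("utf-8")) == 2
assert len(decomposed.encode("utf-8")) == 3

# Code points differ too: 1 vs. 2.
assert len(composed) == 1
assert len(decomposed) == 2

# Python's format() measures field width in code points, so the same
# user-perceived character receives different amounts of padding:
assert format(composed, "<3") == composed + "  "   # 1 code point -> 2 pad spaces
assert format(decomposed, "<3") == decomposed + " "  # 2 code points -> 1 pad space
```

Neither measurement pads the two forms identically; only a grapheme-cluster
measure would treat them the same.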

Cheers,
Victor

On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib <lib at lists.isocpp.org>
wrote:

> [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7> states:
>
> The *positive-integer* in *width* is a decimal integer defining the
> minimum field width.  If *width* is not specified, there is no minimum
> field width, and the field width is determined based on the content of the
> field.
>
> Is field width measured in code units, code points, or something else?
>
> Consider the following example assuming a UTF-8 locale:
>
> std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
> std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>
> In both cases, the arguments encode the same user-perceived character
> (Á).  The first uses two UTF-8 code units to encode a single code point
> that represents a single glyph using a composed Unicode normalization
> form.  The second uses three code units to encode two code points that
> represent the same glyph using a decomposed Unicode normalization form.
>
> How is the field width determined?  If measured in code units, the first
> has a width of 2 and the second of 3.  If measured in code points, the
> first has a width of 1 and the second of 2.  If measured in grapheme
> clusters, both have a width of 1.  Is the determination locale dependent?
>
> *Proposed resolution:*
>
> Field widths are measured in code units and are not locale dependent.
> Modify [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
> as follows:
>
> The *positive-integer* in *width* is a decimal integer defining the
> minimum field width.  If *width* is not specified, there is no minimum
> field width, and the field width is determined based on the content of the
> field.  *Field width is measured in code units.  Each byte of a multibyte
> character contributes to the field width.*
>
> (*code unit* is not formally defined in the standard.  Most uses occur in
> UTF-8 and UTF-16 specific contexts, but [lex.ext]p5
> <http://eel.is/c++draft/lex.ext#5> uses it in an encoding agnostic
> context.)
>
> Tom.
> _______________________________________________
> Lib mailing list
> Lib at lists.isocpp.org
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2019/09/13440.php
>
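Under the proposed resolution, width would be counted in code units. A minimal
sketch of that measurement in Python, assuming UTF-8 code units and a
hypothetical `pad_to` helper (not part of any library or of std::format
itself):

```python
def pad_to(s: str, width: int) -> str:
    """Left-align s in a field whose width is measured in UTF-8 code units."""
    units = len(s.encode("utf-8"))  # count bytes, not code points
    return s + " " * max(0, width - units)

# The precomposed form occupies 2 code units and the decomposed form 3,
# so the same user-perceived character (Á) pads differently:
composed = "\u00C1"      # U+00C1
decomposed = "A\u0301"   # U+0041 U+0301
print("[" + pad_to(composed, 4) + "]")    # 2 units -> 2 spaces of padding
print("[" + pad_to(decomposed, 4) + "]")  # 3 units -> 1 space of padding
```

This is locale-independent, as the proposed wording requires, at the cost of
making the padding depend on the normalization form of the input.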