[SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

Tom Honermann tom at honermann.net
Sun Sep 8 05:30:48 CEST 2019


On 9/7/19 10:44 PM, Victor Zverovich wrote:
> > Is field width measured in code units, code points, or something else?
>
> I think the main consideration here is that width should be 
> locale-independent by default for consistency with the rest of 
> std::format's design.
I agree with that goal, but...
> If we can say that width is measured in grapheme clusters or code 
> points based on the execution encoding (or whatever the standardese 
> term) without querying the locale then I suggest doing so.
I don't know how to do that.  As noted in my response to Zach, if code units 
aren't used, then behavior should differ between LANG=C and LANG=C.UTF-8.
> I have a slight preference for grapheme clusters since those correspond 
> to user-perceived characters, but only have implementation experience 
> with code points (this is what both the fmt library and Python do).

I would definitely vote for EGCs (extended grapheme clusters) over code 
points.  I think code points are probably the worst of the options since 
they make the results dependent on Unicode normalization form.
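
To make the normalization concern concrete, here is a minimal sketch (not part of std::format) that counts UTF-8 code units and code points for the two encodings of Á used in the example quoted below. The helper names count_code_units and count_code_points are hypothetical, and the code point counter assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of UTF-8 code units, which is one per byte.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: number of code points, assuming well-formed UTF-8.
// Every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Precomposed form: U+00C1
static_assert(count_code_units("\xC3\x81") == 2, "two code units");
static_assert(count_code_points("\xC3\x81") == 1, "one code point");

// Decomposed form: U+0041 U+0301
static_assert(count_code_units("\x41\xCC\x81") == 3, "three code units");
static_assert(count_code_points("\x41\xCC\x81") == 2, "two code points");
```

Both counts change when the same user-perceived character is re-normalized. Counting grapheme clusters would give 1 in both cases, but that requires Unicode segmentation data and is not shown here.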

Tom.

>
> Cheers,
> Victor
>
> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib 
> <lib at lists.isocpp.org> wrote:
>
>     [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
>     states:
>
>>     The /positive-integer/ in /width/ is a decimal integer defining
>>     the minimum field width.  If /width/ is not specified, there is
>>     no minimum field width, and the field width is determined based
>>     on the content of the field.
>>
>     Is field width measured in code units, code points, or something else?
>
>     Consider the following example assuming a UTF-8 locale:
>
>     std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>     std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>
>     In both cases, the arguments encode the same user-perceived
>     character (Á).  The first uses two UTF-8 code units to encode a
>     single code point that represents a single glyph using a composed
>     Unicode normalization form. The second uses three code units to
>     encode two code points that represent the same glyph using a
>     decomposed Unicode normalization form.
>
>     How is the field width determined?  If measured in code units, the
>     first has a width of 2 and the second of 3.  If measured in code
>     points, the first has a width of 1 and the second of 2.  If
>     measured in grapheme clusters, both have a width of 1.  Is the
>     determination locale dependent?
>
>     *Proposed resolution:*
>
>     Field widths are measured in code units and are not locale
>     dependent. Modify [format.string.std]p7
>     <http://eel.is/c++draft/format#string.std-7> as follows:
>
>>     The /positive-integer/ in /width/ is a decimal integer defining
>>     the minimum field width.  If /width/ is not specified, there is
>>     no minimum field width, and the field width is determined based
>>     on the content of the field. *Field width is measured in code
>>     units.  Each byte of a multibyte character contributes to the
>>     field width.*
>>
>     (/code unit/ is not formally defined in the standard.  Most uses
>     occur in UTF-8 and UTF-16 specific contexts, but [lex.ext]p5
>     <http://eel.is/c++draft/lex.ext#5> uses it in an encoding agnostic
>     context.)
>
>     Tom.
>
>     Link to this post: http://lists.isocpp.org/lib/2019/09/13440.php
>

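The effect of the proposed resolution quoted above can be sketched with a small stand-alone helper. This is an illustration under stated assumptions, not the std::format implementation: fill_count_code_units and pad_code_units are hypothetical names, the field is left-aligned, and the fill character is a space.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: number of fill characters needed when the field
// width is measured in code units (bytes, for UTF-8).
constexpr std::size_t fill_count_code_units(std::string_view field,
                                            std::size_t min_width) {
    return field.size() < min_width ? min_width - field.size() : 0;
}

// Hypothetical helper: left-align `field` and pad it with spaces until it
// occupies at least `min_width` code units.
std::string pad_code_units(std::string_view field, std::size_t min_width) {
    std::string out(field);
    out.append(fill_count_code_units(field, min_width), ' ');
    return out;
}

// With a minimum width of 4, the precomposed form gets two fill characters
// and the decomposed form gets one.
static_assert(fill_count_code_units("\xC3\x81", 4) == 2, "precomposed");
static_assert(fill_count_code_units("\x41\xCC\x81", 4) == 1, "decomposed");
```

Under this rule both padded results are exactly 4 bytes long, but the two spellings of the same user-perceived character receive different amounts of padding, so they would likely not line up on a display that renders the combining sequence as one glyph.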