[SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

Tom Honermann tom at honermann.net
Sun Sep 8 18:12:07 CEST 2019


On 9/8/19 6:00 AM, Corentin via Lib wrote:
>
>
> On Sun, 8 Sep 2019 at 11:17, Corentin <corentin.jabot at gmail.com> wrote:
>
>
>
>     On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS)
>     <bion at microsoft.com> wrote:
>
>         > I agree that EGCS is the best option. That doesn't drag locale
>
>         Because we don’t get to assume that we’re talking about
>         Unicode at all, it absolutely drags in locale.
>
>
>     Sorry, I should have been more specific.
>     There is a non-tailored Unicode EGCS boundary algorithm (though it
>     can be tailored).
>     I didn't mean to imply that text manipulation can be done without
>     knowing its encoding, and I never use "locale" to mean encoding.
>
>     EGCS are only defined for text whose character repertoire is
>     Unicode; other encodings deal with codepoints.
>
>
>
> To be clear, whether the EGC algorithm is required to be tailored
> matters because tailoring, for all intents and purposes, requires
> ICU or something with CLDR, which restricts the platforms on which
> this can be implemented.

Tailoring is not relevant to this discussion.

The locale dependency stems from the encoding itself being dependent on
locale.  Again, compare LANG=C with LANG=C.UTF-8.  If the specified
behavior is encoding dependent (as it would have to be for field width
to be a count of code points, scalar values, or EGCs), then it is also
locale dependent (for char and wchar_t).  Thus there is a trade-off:

 1. Either the behavior is locale dependent, in which case field widths
    could be specified such that they count code points, scalar values,
    or EGCs when the locale selects a Unicode encoding (and something
    else for non-Unicode encodings), or
 2. The behavior is not locale dependent, in which case field widths can
    only be specified in terms of code units.
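
To make the distinction concrete, here is a minimal sketch (not proposed
wording) that counts the two strings from the original issue in code
units and in code points.  The helper names code_units and
utf8_code_points are hypothetical, and the code point count assumes the
input is well-formed UTF-8, which is exactly the kind of encoding
assumption that option 2 avoids.

```cpp
#include <cstddef>
#include <string_view>

// Code units: for char-based strings this is simply the number of chars,
// regardless of locale or encoding.
constexpr std::size_t code_units(std::string_view s) {
    return s.size();
}

// Code points, ASSUMING s is well-formed UTF-8: count every byte that is
// not a continuation byte (bit pattern 10xxxxxx).
constexpr std::size_t utf8_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// U+00C1: two code units, one code point.
static_assert(code_units("\xC3\x81") == 2);
static_assert(utf8_code_points("\xC3\x81") == 1);

// U+0041 U+0301: three code units, two code points.
static_assert(code_units("\x41\xCC\x81") == 3);
static_assert(utf8_code_points("\x41\xCC\x81") == 2);
```

Counting EGCs would additionally require the Unicode grapheme cluster
boundary rules and their data tables, so it is omitted from this sketch.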

Recall that, unless there is a call to std::setlocale, all C and C++ 
processes start with the locale set to "C".
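
As a small illustration, assuming a hosted implementation, the current
locale can be queried without changing it by passing a null pointer to
std::setlocale.  The helper name current_c_locale is hypothetical.

```cpp
#include <clocale>
#include <string>

// Returns the name of the current global C locale without modifying it.
// A null locale argument asks std::setlocale to report, not to set.
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}
```

Called before any other std::setlocale call, this is expected to return
"C".  A program only picks up the environment's locale (for example,
LANG=C.UTF-8) after it calls std::setlocale(LC_ALL, "").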

Tom.

>
>
>
>         Billy3
>
>         ------------------------------------------------------------------------
>         *From:* Lib <lib-bounces at lists.isocpp.org> on behalf of Corentin
>         via Lib <lib at lists.isocpp.org>
>         *Sent:* Saturday, September 7, 2019 11:08:25 PM
>         *To:* Library Working Group <lib at lists.isocpp.org>
>         *Cc:* Corentin <corentin.jabot at gmail.com>; Victor Zverovich
>         <victor.zverovich at gmail.com>; Tom Honermann
>         <tom at honermann.net>; unicode at isocpp.open-std.org
>         <unicode at open-std.org>
>         *Subject:* Re: [isocpp-lib] New issue: Are std::format field
>         widths code units, code points, or something else?
>
>
>         On Sun, Sep 8, 2019, 5:30 AM Tom Honermann via Lib
>         <lib at lists.isocpp.org> wrote:
>
>             On 9/7/19 10:44 PM, Victor Zverovich wrote:
>>             > Is field width measured in code units, code points, or
>>             something else?
>>
>>             I think the main consideration here is that width should
>>             be locale-independent by default for consistency with the
>>             rest of std::format's design.
>             I agree with that goal, but...
>>             If we can say that width is measured in grapheme clusters
>>             or code points based on the execution encoding (or
>>             whatever the standardese term) without querying the
>>             locale then I suggest doing so.
>             I don't know how to do that.  From my response to Zach, if
>             code units aren't used, then behavior should be different
>             for LANG=C vs LANG=C.UTF-8.
>>             I have slight preference for grapheme clusters since
>>             those correspond to user-perceived characters, but only
>>             have implementation experience with code points (this is
>>             what both the fmt library and Python do).
>
>             I would definitely vote for EGCs over code points.  I
>             think code points are probably the worst of the options
>             since they make the results dependent on Unicode
>             normalization form.
>
>
>         I disagree. Code units are the worst option. For me, anything
>         involving code units is a big red flag. I agree that EGCS is
>         the best option. That doesn't drag locale, though it might be
>         a bit involved for implementers in C++20.
>         I don't think specifying EGCS for Unicode text and codepoints
>         otherwise would be too difficult. Implementation might be
>         a bit challenging on some platforms in the C++20 time frame, but
>         they could fall back to codepoints in the meantime. Not perfect,
>         but I think we need a good long-term solution rather than a
>         bad short-term one.
>
>             Tom.
>
>>
>>             Cheers,
>>             Victor
>>
>>             On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib
>>             <lib at lists.isocpp.org> wrote:
>>
>>                 [format.string.std]p7
>>                 <http://eel.is/c++draft/format#string.std-7>
>>                 states:
>>
>>>                 The /positive-integer/ in /width/ is a decimal
>>>                 integer defining the minimum field width.  If
>>>                 /width/ is not specified, there is no minimum field
>>>                 width, and the field width is determined based on
>>>                 the content of the field.
>>>
>>                 Is field width measured in code units, code points,
>>                 or something else?
>>
>>                 Consider the following example assuming a UTF-8 locale:
>>
>>                 std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>>                 std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>>
>>                 In both cases, the arguments encode the same
>>                 user-perceived character (Á).  The first uses two
>>                 UTF-8 code units to encode a single code point that
>>                 represents a single glyph using a composed Unicode
>>                 normalization form.  The second uses three code units
>>                 to encode two code points that represent the same
>>                 glyph using a decomposed Unicode normalization form.
>>
>>                 How is the field width determined?  If measured in
>>                 code units, the first has a width of 2 and the second
>>                 of 3.  If measured in code points, the first has a
>>                 width of 1 and the second of 2.  If measured in
>>                 grapheme clusters, both have a width of 1.  Is the
>>                 determination locale dependent?
>>
>>                 *Proposed resolution:*
>>
>>                 Field widths are measured in code units and are not
>>                 locale dependent. Modify [format.string.std]p7
>>                 <http://eel.is/c++draft/format#string.std-7>
>>                 as follows:
>>
>>>                 The /positive-integer/ in /width/ is a decimal
>>>                 integer defining the minimum field width.  If
>>>                 /width/ is not specified, there is no minimum field
>>>                 width, and the field width is determined based on
>>>                 the content of the field. *Field width is measured
>>>                 in code units.  Each byte of a multibyte character
>>>                 contributes to the field width.*
>>>
>>                 (/code unit/ is not formally defined in the
>>                 standard.  Most uses occur in UTF-8 and UTF-16
>>                 specific contexts, but [lex.ext]p5
>>                 <http://eel.is/c++draft/lex.ext#5>
>>                 uses it in an encoding agnostic context.)
>>
>>                 Tom.
>>
>>                 _______________________________________________
>>                 Lib mailing list
>>                 Lib at lists.isocpp.org
>>                 Subscription:
>>                 https://lists.isocpp.org/mailman/listinfo.cgi/lib
>>                 Link to this post:
>>                 http://lists.isocpp.org/lib/2019/09/13440.php
>>
>
>             _______________________________________________
>             Lib mailing list
>             Lib at lists.isocpp.org
>             Subscription:
>             https://lists.isocpp.org/mailman/listinfo.cgi/lib
>             Link to this post:
>             http://lists.isocpp.org/lib/2019/09/13446.php
>
>
> _______________________________________________
> Lib mailing list
> Lib at lists.isocpp.org
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2019/09/13453.php

