[SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

Billy O'Neal (VC LIBS) bion at microsoft.com
Sun Sep 8 09:52:35 CEST 2019


> I agree that EGCS is the best option. That doesn't drag locale

Because we don’t get to assume that we’re talking about Unicode at all, it absolutely drags in locale.

Billy3

________________________________
From: Lib <lib-bounces at lists.isocpp.org> on behalf of Corentin via Lib <lib at lists.isocpp.org>
Sent: Saturday, September 7, 2019 11:08:25 PM
To: Library Working Group <lib at lists.isocpp.org>
Cc: Corentin <corentin.jabot at gmail.com>; Victor Zverovich <victor.zverovich at gmail.com>; Tom Honermann <tom at honermann.net>; unicode at isocpp.open-std.org <unicode at open-std.org>
Subject: Re: [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?



On Sun, Sep 8, 2019, 5:30 AM Tom Honermann via Lib <lib at lists.isocpp.org> wrote:
On 9/7/19 10:44 PM, Victor Zverovich wrote:
> Is field width measured in code units, code points, or something else?

I think the main consideration here is that width should be locale-independent by default for consistency with the rest of std::format's design.
I agree with that goal, but...
If we can say that width is measured in grapheme clusters or code points based on the execution encoding (or whatever the standardese term) without querying the locale then I suggest doing so.
I don't know how to do that.  From my response to Zach, if code units aren't used, then behavior should be different for LANG=C vs LANG=C.UTF-8.
I have slight preference for grapheme clusters since those correspond to user-perceived characters, but only have implementation experience with code points (this is what both the fmt library and Python do).

I would definitely vote for EGCs over code points.  I think code points are probably the worst of the options since it makes the results dependent on Unicode normalization form.

I disagree. Code units are the worst option. For me, anything involving code units is a big red flag. I agree that EGCS is the best option. That doesn't drag locale, though it might be a bit involved for implementers in the C++20 time frame.
I think specifying EGCS for Unicode text and code points otherwise wouldn't be too difficult. Implementation might be a bit challenging on some platforms in the C++20 time frame, but they could fall back to code points in the meantime. Not perfect, but I think we need a good long-term solution rather than a bad short-term one.


Tom.

Cheers,
Victor

On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib <lib at lists.isocpp.org> wrote:

[format.string.std]p7 <http://eel.is/c++draft/format#string.std-7> states:

The positive-integer in width is a decimal integer defining the minimum field width.  If width is not specified, there is no minimum field width, and the field width is determined based on the content of the field.

Is field width measured in code units, code points, or something else?

Consider the following example assuming a UTF-8 locale:

std::format("{}", "\xC3\x81");     // U+00C1        { LATIN CAPITAL LETTER A WITH ACUTE }
std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }

In both cases, the arguments encode the same user-perceived character (Á).  The first uses two UTF-8 code units to encode a single code point that represents a single glyph using a composed Unicode normalization form.  The second uses three code units to encode two code points that represent the same glyph using a decomposed Unicode normalization form.

How is the field width determined?  If measured in code units, the first has a width of 2 and the second of 3.  If measured in code points, the first has a width of 1 and the second of 2.  If measured in grapheme clusters, both have a width of 1.  Is the determination locale dependent?
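
To make the three candidate measurements concrete, here is a minimal C++ sketch that computes each one for the two arguments above. The function names are hypothetical and are not part of std::format or any library. The code assumes well-formed UTF-8 input, and the grapheme cluster count is only a rough approximation that folds code points in U+0300..U+036F (Combining Diacritical Marks) into the preceding cluster. A conforming implementation would follow the full segmentation rules of Unicode Standard Annex #29.

```cpp
#include <cstddef>
#include <string_view>

// Width in code units: for UTF-8 held in char, this is the byte count.
std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Width in code points, assuming well-formed UTF-8: every byte that is
// not a continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Approximate width in grapheme clusters (ASSUMPTION: only U+0300..U+036F
// are treated as extending the previous cluster; this is not UAX #29).
std::size_t count_clusters_approx(std::string_view s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) == 0x80) continue;  // skip continuation bytes
        bool combining = false;
        // U+0300..U+036F all encode as two-byte sequences, so only those
        // need to be decoded for this check.
        if ((c & 0xE0) == 0xC0 && i + 1 < s.size()) {
            unsigned char d = static_cast<unsigned char>(s[i + 1]);
            char32_t cp = ((c & 0x1F) << 6) | (d & 0x3F);
            combining = cp >= 0x0300 && cp <= 0x036F;
        }
        // A combining mark with nothing before it still forms a cluster.
        if (!combining || n == 0) ++n;
    }
    return n;
}
```

For "\xC3\x81" these functions return 2, 1, and 1 respectively; for "\x41\xCC\x81" they return 3, 2, and 1, matching the widths listed above.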

Proposed resolution:

Field widths are measured in code units and are not locale dependent. Modify [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7> as follows:

The positive-integer in width is a decimal integer defining the minimum field width.  If width is not specified, there is no minimum field width, and the field width is determined based on the content of the field.  Field width is measured in code units.  Each byte of a multibyte character contributes to the field width.
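
As an illustration of what this wording would imply, the following sketch pads a string to a minimum width counted in code units. The helper pad_by_code_units is hypothetical; it models the proposed rule with left alignment and space fill, and is not a description of how any std::format implementation behaves.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: left-align s in a field of at least `width`
// code units, filling with spaces. Every byte of a multibyte character
// counts toward the width, as in the proposed wording.
std::string pad_by_code_units(std::string_view s, std::size_t width) {
    std::string out(s);
    if (out.size() < width) {
        out.append(width - out.size(), ' ');
    }
    return out;
}
```

With a width of 4, the composed form "\xC3\x81" receives two fill characters and the decomposed form "\x41\xCC\x81" receives one, so the same user-perceived character would occupy a different number of display columns depending on its normalization form.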

(code unit is not formally defined in the standard.  Most uses occur in UTF-8 and UTF-16 specific contexts, but [lex.ext]p5 <http://eel.is/c++draft/lex.ext#5> uses it in an encoding agnostic context.)

Tom.

_______________________________________________
Lib mailing list
Lib at lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
Link to this post: http://lists.isocpp.org/lib/2019/09/13440.php


_______________________________________________
Lib mailing list
Lib at lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
Link to this post: http://lists.isocpp.org/lib/2019/09/13446.php