[SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

Tom Honermann tom at honermann.net
Mon Sep 9 04:34:41 CEST 2019


On 9/8/19 7:05 PM, Zach Laine wrote:
> On Sun, Sep 8, 2019 at 3:00 PM Tom Honermann via Lib
> <lib at lists.isocpp.org> wrote:
>
>
>     On Sep 8, 2019, at 2:46 PM, Corentin via Lib
>     <lib at lists.isocpp.org> wrote:
>
>>
>>
>>     On Sun, 8 Sep 2019 at 19:30, Tom Honermann
>>     <tom at honermann.net> wrote:
>>
>>         On 9/8/19 12:40 PM, Corentin wrote:
>>>
>>>
>>>         On Sun, 8 Sep 2019 at 18:12, Tom Honermann
>>>         <tom at honermann.net> wrote:
>>>
>>>             On 9/8/19 6:00 AM, Corentin via Lib wrote:
>>>>
>>>>
>>>>             On Sun, 8 Sep 2019 at 11:17, Corentin
>>>>             <corentin.jabot at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>                 On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS)
>>>>                 <bion at microsoft.com> wrote:
>>>>
>>>>                     > I agree that EGCS is the best option. That
>>>>                     doesn't drag locale
>>>>
>>>>                     Because we don’t get to assume that we’re
>>>>                     talking about Unicode at all, it absolutely
>>>>                     drags in locale.
>>>>
>>>>
>>>>                 Sorry, I should have been more specific.
>>>>                 There is a non-tailored Unicode EGCS boundary
>>>>                 algorithm (though it can be tailored).
>>>>                 I didn't mean to imply that text manipulation can
>>>>                 be done without knowing its encoding, and I never
>>>>                 use "locale" to mean encoding.
>>>>
>>>>                 EGCS are only defined for text whose character
>>>>                 repertoire is Unicode; other encodings deal with
>>>>                 code points.
>>>>
>>>>
>>>>
>>>>             To be clear, whether the EGC algorithm is required to
>>>>             be tailored matters because tailoring, for all
>>>>             intents and purposes, requires ICU or something with
>>>>             CLDR, which restricts the platforms on which this
>>>>             can be implemented.
>>>
>>>             Tailoring is not relevant to this discussion.
>>>
>>>         It is - see https://unicode.org/reports/tr29/ "ch" is 2 EGCS
>>>         in most locales but in Slovak it's 1. I don't make the rules :D
>>         It isn't relevant in determining how we resolve this issue. 
>>         If the resolution is that field widths are measured in EGCs,
>>         then we've already decided that the width is locale dependent
>>         and tailoring becomes an implementation detail.
>>
>>
>>     No, format decided to be locale-independent (for good reason) and
>>     applying locale specific behavior implicitly would be against that.
>>     I'm arguing for encoding-specific behavior.
>
>     You seem to be missing the point that, for char and wchar_t, the
>     encoding can’t be known (in general) without consulting the
>     locale. Again, LANG=C vs LANG=C.UTF-8.
>
>     Tom.
>
>
> Tom, you seem to be missing the point that std::format does no such
> consultation!  It is locale-agnostic.  It is assumed to be char-based,
> not Windows 1252, not UTF-8, not even ASCII.
That is exactly my point!  And why my proposed resolution was to specify 
width in terms of code units.
>
> This means that the definition of width as being a CU is the de facto 
> status quo.  I'm suggesting that later on, we pull a fast one and 
> specify that we meant that it should have been UTF-8-based instead of 
> char-based.  This may mean that we need to add a char8_t overload, or 
> it may be palatable to just change the current interface's contract. I 
> assume the former will be necessary, since people tend to hate silent 
> contract changes (with good reason).

Victor's fmtlib implementation already effectively does what you 
suggest.  See 
https://github.com/fmtlib/fmt/commit/38325248e5310ddbea41390974e496e8495f7324.

I think this isn't a good state to be in though.  If the current locale 
has a UTF-8 encoding, I would be disappointed if the following two calls 
produced different string contents:

std::format(  "{:3}",   "\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
std::format(u8"{:3}", u8"\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }

If the width is code units for the char-based overload and EGCs for the
char8_t-based one, then the first will produce "\xC3\x81\x20" (one
inserted space, since U+00C1 already occupies two UTF-8 code units) and
the second "\xC3\x81\x20\x20" (two inserted spaces, since U+00C1 is a
single EGC).  I think users would find that surprising.
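
To make the difference concrete, here is a minimal sketch of the two
padding rules.  It does not use std::format; pad_by_code_units,
pad_by_code_points and count_code_points are hypothetical helper names
made up for illustration.  It also assumes well-formed UTF-8 input and
counts code points as a stand-in for EGCs, which only agrees with the
real EGC count for simple cases like U+00C1 (a combining sequence would
need the UAX #29 segmentation rules).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: pad with spaces until the string has at least
// `width` code units (bytes, for a char-based string).
std::string pad_by_code_units(std::string_view s, std::size_t width) {
    std::string out(s);
    if (s.size() < width) out.append(width - s.size(), ' ');
    return out;
}

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumption: the input is well-formed UTF-8.
std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: pad with spaces until the string has at least
// `width` code points.  This approximates EGC-based width only when
// every EGC is a single code point.
std::string pad_by_code_points(std::string_view s, std::size_t width) {
    std::string out(s);
    const std::size_t n = count_code_points(s);
    if (n < width) out.append(width - n, ' ');
    return out;
}
```

With a width of 3 and the input "\xC3\x81", the first helper appends one
space and the second appends two, which is the mismatch described above.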

>
> So, if we do nothing, we get what you want.  If we *specify* that CUs 
> are the width, we color the future debate about the Unicode-aware 
> version in a Unicode-unfriendly direction.

If we do nothing, we are in the situation where different implementors 
may do different things.

My preferred direction for exploration is a future extension that 
enables opt-in to field widths that are encoding dependent (and 
therefore locale dependent for char and wchar_t).  For example (using 
'L' appended to the width; 'L' doesn't conflict with the existing type 
options):

std::format("{:3L}", "\xC3\x81"); // produces "\xC3\x81\x20\x20"; 3 EGCs.

But again, I'm far from convinced that this is actually useful since 
EGCs don't suffice to ensure an aligned result anyway as nicely 
described in Henri's post (https://hsivonen.fi/string-length).

Tom.

>
> Zach
>
