[SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

Tom Honermann tom at honermann.net
Sun Sep 8 02:31:16 CEST 2019


On 9/7/19 8:27 PM, Tony V E wrote:
> I think we would want it to be measured in glyphs.
I agree that would be ideal, but...
> Are you suggesting code points because glyphs are too hard?
I don't know how to achieve that.  Field width doesn't really work for
alignment unless one assumes a monospace font.  We could measure in
terms of extended grapheme clusters (EGCs), but the width of an EGC has
changed over time (e.g., family emoji).  That makes alignment dependent
on both display properties and Unicode version.  And, of course, this
would drag in locale dependence as well.
> Should we specify glyphs anyhow and leave it to QoI?

Perhaps we could (I'm not sure how to specify that), but then we end up 
with the locale dependency, at least for char and wchar_t (which is all 
that is supported right now).

Tom.

>
> *From: *Tom Honermann via Lib
> *Sent: *Saturday, September 7, 2019 8:13 PM
> *To: *Library Working Group; unicode at isocpp.open-std.org
> *Reply To: *lib at lists.isocpp.org
> *Cc: *Tom Honermann
> *Subject: *[isocpp-lib] New issue: Are std::format field widths code 
> units, code points, or something else?
>
>
> [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7> states:
>
>> The /positive-integer/ in /width/ is a decimal integer defining the 
>> minimum field width.  If /width/ is not specified, there is no 
>> minimum field width, and the field width is determined based on the 
>> content of the field.
>>
> Is field width measured in code units, code points, or something else?
>
> Consider the following example assuming a UTF-8 locale:
>
> std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
> std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>
> In both cases, the arguments encode the same user-perceived character 
> (Á).  The first uses two UTF-8 code units to encode a single code 
> point that represents a single glyph using a composed Unicode 
> normalization form.  The second uses three code units to encode two 
> code points that represent the same glyph using a decomposed Unicode 
> normalization form.
>
> How is the field width determined?  If measured in code units, the 
> first has a width of 2 and the second of 3.  If measured in code 
> points, the first has a width of 1 and the second of 2.  If measured 
> in grapheme clusters, both have a width of 1.  Is the determination 
> locale dependent?
>
> *Proposed resolution:*
>
> Field widths are measured in code units and are not locale dependent. 
> Modify [format.string.std]p7 
> <http://eel.is/c++draft/format#string.std-7> as follows:
>
>> The /positive-integer/ in /width/ is a decimal integer defining the 
>> minimum field width.  If /width/ is not specified, there is no 
>> minimum field width, and the field width is determined based on the 
>> content of the field. *Field width is measured in code units.  Each 
>> byte of a multibyte character contributes to the field width.*
>>
> (/code unit/ is not formally defined in the standard. Most uses occur 
> in UTF-8 and UTF-16 specific contexts, but [lex.ext]p5 
> <http://eel.is/c++draft/lex.ext#5> uses it in an encoding agnostic 
> context.)
>
> Tom.
>
>
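
For reference, the code unit and code point counts in the quoted example
can be checked with a minimal sketch like the one below.  The helper
names (width_in_code_units, width_in_code_points, pad_code_units) are
hypothetical and are not standard library functions, and the code point
count assumes well-formed UTF-8 input.  Counting grapheme clusters is
left out because it needs Unicode segmentation data.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helpers for illustration only; none of these are part of
// the standard library or of the std::format specification.

// Width in code units: for a char string holding UTF-8, the byte count.
constexpr std::size_t width_in_code_units(std::string_view s) {
    return s.size();
}

// Width in code points, assuming s is well-formed UTF-8: count every byte
// that is not a continuation byte (continuation bytes match 10xxxxxx).
constexpr std::size_t width_in_code_points(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Left-align s in a field of at least min_width, measuring the content in
// code units, which is what the proposed resolution would specify.
inline std::string pad_code_units(std::string_view s, std::size_t min_width) {
    std::string out(s);
    if (out.size() < min_width) {
        out.append(min_width - out.size(), ' ');
    }
    return out;
}
```

With a minimum width of 4 measured in code units, the composed form
receives two fill characters and the decomposed form receives one, so
two strings that render identically occupy different numbers of display
columns.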
