[SG16-Unicode] New issue: Are std::format field widths code units, code points, or something else?

Tom Honermann tom at honermann.net
Sun Sep 8 02:13:12 CEST 2019


[format.string.std]p7 <http://eel.is/c++draft/format#string.std-7> states:

> The /positive-integer/ in /width/ is a decimal integer defining the 
> minimum field width.  If /width/ is not specified, there is no minimum 
> field width, and the field width is determined based on the content of 
> the field.
>
Is field width measured in code units, code points, or something else?

Consider the following example assuming a UTF-8 locale:

std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }

In both cases, the arguments encode the same user-perceived character 
(Á).  The first uses two UTF-8 code units to encode a single code point 
that represents a single glyph using a composed Unicode normalization 
form.  The second uses three code units to encode two code points that 
represent the same glyph using a decomposed Unicode normalization form.

How is the field width determined?  If measured in code units, the first 
has a width of 2 and the second of 3.  If measured in code points, the 
first has a width of 1 and the second of 2.  If measured in grapheme 
clusters, both have a width of 1.  Is the determination locale dependent?
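
To make the first two counts concrete, here is a minimal sketch that computes them for both arguments. The helper names code_units and code_points are hypothetical and are not part of any std::format wording; the code point count assumes well-formed UTF-8 input and performs no validation.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of UTF-8 code units (bytes) in s.
constexpr std::size_t code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: number of code points in s, assuming s is
// well-formed UTF-8. Every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
constexpr std::size_t code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Composed form: U+00C1.
static_assert(code_units("\xC3\x81") == 2, "two code units");
static_assert(code_points("\xC3\x81") == 1, "one code point");

// Decomposed form: U+0041 U+0301.
static_assert(code_units("\x41\xCC\x81") == 3, "three code units");
static_assert(code_points("\x41\xCC\x81") == 2, "two code points");
```

Counting grapheme clusters is deliberately left out of the sketch, since it requires the Unicode segmentation data and cannot be done by inspecting byte patterns alone.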

*Proposed resolution:*

Field widths are measured in code units and are not locale dependent. 
Modify [format.string.std]p7 
<http://eel.is/c++draft/format#string.std-7> as follows:

> The /positive-integer/ in /width/ is a decimal integer defining the 
> minimum field width.  If /width/ is not specified, there is no minimum 
> field width, and the field width is determined based on the content of 
> the field. *Field width is measured in code units.  Each byte of a 
> multibyte character contributes to the field width.*
>
(/code unit/ is not formally defined in the standard.  Most uses occur 
in UTF-8 and UTF-16 specific contexts, but [lex.ext]p5 
<http://eel.is/c++draft/lex.ext#5> uses it in an encoding agnostic context.)
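
As an illustration of what the proposed wording implies for padding, here is a rough sketch of a minimum-width fill measured in code units. The function pad_code_units is a hypothetical stand-in, not the std::format implementation; it assumes a space fill character and left alignment, which is the default alignment for string arguments.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: pad s on the right with spaces until it occupies
// at least `width` code units. Each byte of a multibyte character counts
// toward the width, as in the proposed resolution.
std::string pad_code_units(std::string_view s, std::size_t width) {
    std::string out(s);
    if (out.size() < width) {
        out.append(width - out.size(), ' ');
    }
    return out;
}
```

With a width of 4, the composed form ("\xC3\x81") receives two spaces of padding and the decomposed form ("\x41\xCC\x81") receives one, so the two results have the same length in bytes but would not line up in a display that renders each as a single character.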

Tom.
