[SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

Tom Honermann tom at honermann.net
Sun Sep 8 05:25:43 CEST 2019


On 9/7/19 9:11 PM, Zach Laine wrote:
> On Sat, Sep 7, 2019 at 7:31 PM Tom Honermann via Lib 
> <lib at lists.isocpp.org> wrote:
>
>     On 9/7/19 8:27 PM, Tony V E wrote:
>>     I think we would want it to be measured in glyphs.
>     I agree that would be ideal, but...
>
>
> Stop right there.  If that's ideal, let's do that.  Or at least, let's 
> leave room for it to be done at some point. Specifying CUs now 
> prevents the ideal from ever being realized.
There are other options.  For example, a future extension could allow 
specifying what units are to be used for field width.
>
>>     Are you suggesting code points because glyphs are too hard?
>     I don't know how to achieve that.  Field width doesn't really work
>     for alignment unless one assumes a monospace font.  We could
>     measure in terms of extended grapheme clusters, but EGC width has
>     changed over time (e.g., family emoji).  That makes alignment
>     dependent on both display properties and Unicode version.  And, of
>     course, this would drag in locale dependence as well.
>
>
> If you just count N=EGCs, you get the "right" answer.  If your 
> terminal shows more or less than N characters, get a new terminal.  
> What I mean by this is that there should be no consideration of fonts.
I see field width as either indicating storage (number of code units) or 
alignment.  The number of user-perceived characters is not useful for 
aligning text unless a monospace font is assumed. Therefore, storage 
seems like the more useful measurement.  This also aligns with 
format_to_n and formatted_size which, unless I'm mistaken, work in code 
units.  (It would be nice to clarify the wording for these as well; what 
is meant by "number of characters in the character representation"?)
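
As a rough illustration of how far the candidate units can diverge,
here is a sketch only, assuming well-formed UTF-8 input.  The names
count_code_units and count_code_points are hypothetical helpers, not
part of std::format or of any proposal:

```cpp
#include <cstddef>
#include <string>

// Number of code units in a UTF-8 string: its size in bytes.
std::size_t count_code_units(const std::string& utf8) {
    return utf8.size();
}

// Number of code points, ASSUMING well-formed UTF-8: every byte that
// is not a continuation byte (bit pattern 10xxxxxx) starts a new
// code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For "e" followed by U+0301 COMBINING ACUTE ACCENT ("e\xCC\x81") this
gives 3 code units and 2 code points, while a reader perceives one
character.  The family emoji mentioned above (U+1F468 U+200D U+1F469
U+200D U+1F467) is 18 code units and 5 code points.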
>
> As for the need for a locale, I don't get that.  Grapheme breaking is 
> simple, and requires no locale info.  Do you mean Unicode data?  
> Picking a version and sticking with it should be sufficient.  No 
> system that I know of has multiple Unicode versions to pick from 
> programmatically.
For char and wchar_t, encoding is locale dependent.  Think POSIX LANG=C 
(probably ASCII or ISO-8859-1) vs LANG=C.UTF-8.
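
To make that concrete: the same character can occupy a different
number of code units depending on the encoding the locale selects.  A
minimal sketch, with hypothetical helper names and assuming the
argument is a valid Unicode scalar value:

```cpp
#include <cstddef>

// Code units (bytes) needed to encode one code point as UTF-8.
// ASSUMES cp is a valid Unicode scalar value.
std::size_t utf8_code_units(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Code units needed in ISO-8859-1: one byte for U+0000..U+00FF.
// Returns 0 when the code point is not representable.
std::size_t latin1_code_units(char32_t cp) {
    return cp <= 0xFF ? 1 : 0;
}
```

U+00E9 is 1 code unit in ISO-8859-1 but 2 in UTF-8, so if width is
counted in code units, the same text would be padded differently
under the two locales.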
>
> Zach
>
