[SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?
Tom Honermann
tom at honermann.net
Sun Sep 8 05:25:43 CEST 2019
On 9/7/19 9:11 PM, Zach Laine wrote:
> On Sat, Sep 7, 2019 at 7:31 PM Tom Honermann via Lib
> <lib at lists.isocpp.org> wrote:
>
> On 9/7/19 8:27 PM, Tony V E wrote:
>> I think we would want it to be measured in glyphs.
> I agree that would be ideal, but...
>
>
> Stop right there. If that's ideal, let's do that. Or at least, let's
> leave room for it to be done at some point. Specifying CUs now
> prevents the ideal from ever being realized.
There are other options. For example, a future extension could allow
specifying what units are to be used for field width.
>
>> Are you suggesting code points because glyphs are too hard?
> I don't know how to achieve that. Field width doesn't really work
> for alignment unless one assumes a monospace font. We could
> measure in terms of extended grapheme clusters (EGCs), but EGC width has
> changed over time (e.g., family emoji). That makes alignment
> dependent on both display properties and Unicode version. And, of
> course, this would drag in locale dependence as well.
>
>
> If you just count N=EGCs, you get the "right" answer. If your
> terminal shows more or less than N characters, get a new terminal.
> What I mean by this is that there should be no consideration of fonts.
I see field width as either indicating storage (number of code units) or
alignment. The number of user perceived characters is not useful for
aligning text unless a monospace font is assumed. Therefore, storage
seems like the more useful measurement. This also aligns with
format_to_n and formatted_size which, unless I'm mistaken, work in code
units. (It would be nice to clarify the wording for these as well; what
is meant by "number of characters in the character representation"?)
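
To make the storage interpretation concrete, here is a minimal sketch of how
the padding differs depending on the unit chosen. It assumes well-formed UTF-8
in a char string; the helper names (count_code_units, count_code_points,
pad_right) are made up for illustration and are not proposed wording or the
behavior of any existing std::format implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Width in UTF-8 code units (bytes); this is what size() reports.
std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Width in code points, ASSUMING s is well-formed UTF-8: count every
// byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Left-align s in a field of `width` units. `measured` is the width of
// s in whichever unit the caller picked (hypothetical helper).
std::string pad_right(std::string_view s, std::size_t measured,
                      std::size_t width) {
    // No unit discussed here can exceed the code unit count.
    assert(measured <= s.size());
    std::string out(s);
    if (measured < width) {
        out.append(width - measured, ' ');
    }
    return out;
}
```

For "caf\xC3\xA9" (U+00E9 takes two code units) and a field width of 6,
measuring in code units appends one space and yields 6 bytes of storage,
while measuring in code points appends two spaces and yields 7 bytes. Only
the code unit measurement makes the result size predictable from the width.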
>
> As for the need for a locale, I don't get that. Grapheme breaking is
> simple, and requires no locale info. Do you mean Unicode data?
> Picking a version and sticking with it should be sufficient. No
> system that I know of has multiple Unicode versions to pick from
> programmatically.
For char and wchar_t, encoding is locale dependent. Think POSIX LANG=C
(probably ASCII or ISO-8859-1) vs LANG=C.UTF-8.
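
As a rough illustration of that dependence, the following sketch asks a POSIX
system which codeset a named locale uses. It relies on nl_langinfo from
<langinfo.h>, which is POSIX rather than ISO C++, and codeset_for is a
hypothetical helper name. The strings returned vary by platform, and a locale
such as C.UTF-8 may not be installed at all.

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>  // POSIX only; not part of ISO C++

// Returns the codeset name for the named locale, or an empty string if
// that locale is not available. Side effect: changes the global
// LC_CTYPE locale and does not restore it.
std::string codeset_for(const char* locale_name) {
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return {};
    }
    return nl_langinfo(CODESET);
}
```

Calling codeset_for("C") and codeset_for("C.UTF-8") on the same machine will
typically report different codesets, so the same char sequence has to be
decoded differently before anything other than code units can be counted.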
>
> Zach
>