[SG16-Unicode] P0645 Text Formatting review followup

Tom Honermann tom at honermann.net
Mon Jul 23 06:04:48 CEST 2018


SG16 discussed this topic some more at our last meeting.  Notes are 
available at:
- https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#july-11th-2018

Victor suggested defining field widths using an encoding-agnostic 
concept that maps to extended grapheme clusters (EGCs) for Unicode 
encodings.  For other encodings, it would presumably map in some 
implementation-defined way, probably one-to-one with code points.  I 
like this idea.  During the discussion, an experiment was suggested: 
take Eric Niebler's range-v3 calendar example [1] and modify it to print 
emojis for holidays, for example substituting U+1F384 Christmas Tree for 
December 25th:

         October              November              December
               1  2  3   1  2  3  4  5  6  7         1  2  3  4  5
   4  5  6  7  8  9 10   8  9 10 11 12 13 14   6  7  8  9 10 11 12
  11 12 13 14 15 16 17  15 16 17 18 19 20 21  13 14 15 16 17 18 19
  18 19 20 21 22 23 24  22 23 24 25  🦃 27 28  20 21 22 23 24  🎄 26
  25 26 27 28 29 30  🎃  29 30                 27 28 29 30 31

This email is formatted to use a fixed width font for the calendar data 
above.  On my system, each emoji consumes more than one (but less than 
two!) columns of output, breaking the intended presentation.  This 
presumably occurs because fixed width variants of these characters are 
not available.  This makes me wonder how useful EGCs will be as field 
width units in practice.
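
To make the mismatch concrete, here is a rough Python sketch (not part of 
the experiment discussed at the meeting).  It pads a calendar cell to 
three units the way Python's format() does, which counts code points, and 
compares that with an estimated column count.  The helper name 
estimated_columns is hypothetical, and the "two columns for East Asian 
Wide or Fullwidth characters" rule is an assumption about how a terminal 
might behave; it ignores combining marks, ZWJ sequences, and the font 
actually in use.

```python
import unicodedata


def estimated_columns(text):
    """Estimate terminal columns: 2 for East Asian Wide/Fullwidth, else 1.

    Assumption: this is only a heuristic. It ignores combining marks,
    ZWJ sequences, and whatever the renderer's font really does.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


tree = "\U0001F384"  # U+1F384 CHRISTMAS TREE

for cell in ("25", tree):
    padded = format(cell, ">3")  # Python pads by code point count
    print(
        repr(padded),
        "code points:", len(padded),
        "UTF-8 code units:", len(padded.encode("utf-8")),
        "estimated columns:", estimated_columns(padded),
    )
```

Both cells come out three code points long, but the emoji cell is 
estimated at four columns rather than three, so the row no longer lines 
up even before font rendering enters the picture.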

Mark Davis recently posted the following link to the Unicode mailing 
list (not SG16's).  Among a number of other interesting topics, it 
discusses rendering of emojis as a single user-perceived character vs. 
multiple user-perceived characters, which is relevant to the discussion 
of family emojis.
- https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview

With regard to interpretation of fill characters, I think there needs to 
be a requirement that the fill "character" consume exactly one unit of 
field width, however that is defined.
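
A minimal Python sketch of that requirement, assuming width is counted 
in something close to EGCs.  Python's standard library has no UAX #29 
segmentation, so approx_cluster_count below is a deliberately crude 
stand-in (combining marks and ZWJ-joined code points extend the previous 
cluster); it is not conformant and, for instance, does not handle emoji 
modifiers or regional indicator pairs.  Both function names are 
hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def approx_cluster_count(text):
    """Crude approximation of grapheme cluster counting (NOT UAX #29)."""
    count = 0
    prev = ""
    for ch in text:
        extends = (
            unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining mark
            or ch == ZWJ                                     # the joiner itself
            or prev == ZWJ                                   # joined by a ZWJ
        )
        # The first code point always starts a cluster.
        if not extends or prev == "":
            count += 1
        prev = ch
    return count


def pad_right_aligned(text, width, fill=" "):
    """Right-align text in width units, rejecting fills that are not one unit."""
    if approx_cluster_count(fill) != 1:
        raise ValueError("fill must consume exactly one unit of field width")
    missing = max(0, width - approx_cluster_count(text))
    return fill * missing + text


# Decomposed o + U+0308 counts as one unit, so one fill is added.
print(repr(pad_right_aligned("o\u0308", 2)))
```

With this rule, the padding arithmetic stays simple: the number of fills 
is just the requested width minus the content width, whatever unit is 
chosen.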

Tom.

[1]: 
https://github.com/ericniebler/range-v3/blob/master/example/calendar.cpp

On 07/08/2018 05:24 PM, Corentin wrote:
> The *implementation* complexity of using grapheme clusters (there is 
> also a runtime complexity) would only come from the order in which 
> things are standardized.
> I think a grapheme cluster iterator is something SG16 wants, and the 
> day we have that, it would be easy to add it to fmt (since the wording 
> would be well defined, and the implementor would have to implement 
> support for grapheme clusters anyway).
>
> That's why it's probably wise to ignore all charX_t overloads and 
> specializations (for both parameters and format string), until such a 
> time that we can use these basic building blocks and common definition.
>
> Otherwise, I think you are probably right that grapheme clusters for 
> Unicode strings (those using charX_t) make sense.
>
> Except of course this is completely in the hands of the renderer. Your 
> family emoji renders as 4 emojis because my computer probably lacks 
> the appropriate font.
> We must accept that we cannot provide a way for the value of width to 
> match anything that will be rendered, i.e. if people use it for text 
> alignment, it will never be right.
> The way graphical software deals with that is to rely on font 
> metrics, i.e. it computes a width from the actual font used to do the 
> rendering.
>
>
> 4/ For me (I think it was lost in the chat), the semantics of n 
> should depend on the value_type of the output iterator/function.
>
> Corentin
>
>
>
> On Sun, Jul 8, 2018 at 23:04, Victor Zverovich 
> <victor.zverovich at gmail.com> wrote:
>
>     Just a small followup on our discussion of P0645 Text Formatting
>     during the previous meeting.
>
>     1. Interpretation of width with multibyte encodings and combining
>     characters.
>
>     P0645R2 currently doesn't specify the units of width. Possible
>     options are (from lower to higher abstraction level):
>
>     * Code units
>     * Code points
>     * Grapheme clusters
>
>     Python 3 uses code points as can be seen from the following example:
>
>     >>> o = b'\x6F\xCC\x88'.decode('utf8')
>     >>> o
>     'ö'
>     >>> '{:>2}'.format(o)
>     'ö' # note missing space
>     >>> o = b'\xC3\xB6'.decode('utf8')
>     >>> o
>     'ö'
>     >>> '{:>2}'.format(o)
>     ' ö'
>
>     I have a slight preference for grapheme clusters because, according
>     to Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION
>     <http://unicode.org/reports/tr29/>, they correspond
>     to “user-perceived characters” (at least that seems to be the
>     intention; whether they are successful in that is another question).
>
>     Zach provided an example of "👨\u200D👩\u200D👧\u200D👦", where
>     \u200D is a zero-width joiner (ZWJ), rendered as a single glyph
>     representing a family "👨‍👩‍👧‍👦". However, if I interpret the
>     following part of
>     http://www.unicode.org/reports/tr29/tr29-29.html#GB10 correctly:
>
>     > Do not break within emoji modifier sequences or emoji zwj sequences.
>
>     this is not a problem and "👨\u200D👩\u200D👧\u200D👦" will
>     constitute a single grapheme cluster.
>
>     That said, using grapheme clusters as width units may add
>     significant complexity to the implementation for minor benefit,
>     so I'm fine going with code points, especially since there is an
>     established example of doing this (Python) and it's already an
>     improvement over stdio & iostreams.
>
>     2. Interpretation of fill.
>
>     It seems there was general agreement that fill should be a code
>     point, but please let me know if you have other ideas.
>
>     3. There was a question about signed and unsigned char. I checked
>     and there is no special handling for these types, which means that
>     they are treated as integral types; only char and wchar_t are
>     treated specially as character types (and later charN_t will be
>     added).
>
>     4. Interpretation of n in format_to_n.
>
>     There was no agreement on whether n should be specified in code
>     units or code points. An argument in favor of code units is that n
>     often gives the output buffer size. On the other hand, using code
>     points would be more consistent with width.
>
>     I plan to add support for specifying width and fill as code points
>     in fmt (Zach gave some useful pointers on how to do that, thanks!)
>     and will report back with any user feedback.
>
>     Cheers,
>     Victor
>     _______________________________________________
>     SG16 Unicode mailing list
>     Unicode at isocpp.open-std.org
>     http://www.open-std.org/mailman/listinfo/unicode

