<div dir="ltr">The *implementation* complexity of using grapheme clusters (there is also a runtime complexity) would only come from the order in which things are standardized. <div>I think a grapheme cluster iterator is something SG16 wants, and the day we have that, it would be easy to add it to fmt (since the wording would be well defined, and implementors would have to support grapheme clusters anyway).</div><div><br></div><div>That's why it's probably wise to ignore all charX_t overloads and specializations (for both parameters and format strings) until we can use these basic building blocks and common definitions.</div><div><br></div><div>Otherwise, I think you are probably right that grapheme clusters make sense for Unicode strings (those using charX_t).</div><div><br></div><div>Except, of course, that this is completely in the hands of the renderer. Your family emoji renders as 4 emojis here, probably because my computer lacks the appropriate font.</div><div>We must accept that we cannot provide a way for the value of width to match what will actually be rendered, i.e. if people use it for text alignment, it will never be right.</div><div>The way graphical software deals with that is to rely on font metrics, i.e. it computes a width from the actual font used to do the rendering.</div><div><br></div><div><br></div><div>4/ For me (I think it was lost in the chat), the semantics of N should depend on the value_type of the output iterator/function.</div><div><br></div><div>Corentin </div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr">On Sun, Jul 8, 2018 at 23:04, Victor Zverovich <<a href="mailto:victor.zverovich@gmail.com">victor.zverovich@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Just a small followup on our discussion of P0645 Text Formatting during the previous meeting.<br><br>1. 
Interpretation of width with multibyte encodings and combining characters.<br><br>P0645R2 currently doesn't specify the units of width. Possible options are (from lower to higher abstraction level):<br><br>* Code units<br>* Code points<br>* Grapheme clusters<br><br>Python 3 uses code points as can be seen from the following example:<br><br>>>> o = b'\x6F\xCC\x88'.decode('utf8')<br>>>> o<br>'ö'<br>>>> '{:>2}'.format(o)<br>'ö' # note missing space<br>>>> o = b'\xC3\xB6'.decode('utf8')<br>>>> o<br>'ö'<br>>>> '{:>2}'.format(o)<br>' ö'<br><br>I have slight preference to grapheme clusters because according to <a href="http://unicode.org/reports/tr29/" target="_blank">Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION</a> they correspond to “user-perceived characters” (at least that seems to be the intention, whether they are successful in that is another question).<br><br>Zach provided an example of <span style="font-size:small;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">"👨\u200D👩\u200D👧\u200D👦", <span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">where \u200D is a zero-width joiner (ZWJ),</span> rendered as a single glyph representing a family <span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">"👨👩👧👦". 
However, if I interpret the following part of <a href="http://www.unicode.org/reports/tr29/tr29-29.html#GB10" target="_blank">http://www.unicode.org/reports/tr29/tr29-29.html#GB10</a> correctly:</span></span><div><br></div><div>> Do not break within emoji modifier sequences or emoji zwj sequences.</div><div><br></div><div>this is not a problem and <span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">"👨\u200D👩\u200D👧\u200D👦" will constitute a single grapheme cluster.</span></div><div><br></div><div>That said, making grapheme clusters width units may add significant complexity to the implementation with minor benefits, so I'm fine going with code points especially since there is an established example of doing this (Python) and it's already an improvement over stdio & iostreams.</div><div><div><br></div><div>2. Interpretation of fill.</div><div><br></div><div>It seems there was a general agreement that fill should be a code point but please let me know if you have other ideas.</div><div><br></div><div>3. There was a question about <span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">signed and unsigned char. I checked and there is no special handling for these types which means that they are treated as integral types, only char and wchar_t are treated specially as character types (and later charN_t will be added).</span></div><div><br></div><div>4. Interpretation of n in format_to_n.</div><div><br></div><div>There was no agreement whether n should be specified in code units or code points. An argument in favor of code units is that n often gives the output buffer size. 
On the other hand, using code points would be more consistent with width.</div><div><br></div><div>I plan to add support for specifying width and fill as code points in fmt (<span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Zach gave some useful pointers on how to do that, thanks!) and will report back with any user feedback.</span></div><div><br></div></div><div>Cheers,</div><div>Victor</div></div>
_______________________________________________<br>
SG16 Unicode mailing list<br>
<a href="mailto:Unicode@isocpp.open-std.org" target="_blank">Unicode@isocpp.open-std.org</a><br>
<a href="http://www.open-std.org/mailman/listinfo/unicode" rel="noreferrer" target="_blank">http://www.open-std.org/mailman/listinfo/unicode</a><br>
</blockquote></div>
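<div>For reference, the three candidate width units from item 1 of the quoted message can be compared with a short Python sketch. The function names are made up for illustration, and the grapheme cluster count is only a rough approximation (combining marks attach to the previous code point, and a ZWJ glues the next code point on). It is not the full UAX #29 algorithm, which a real implementation would need.</div>

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner


def code_units_utf8(s: str) -> int:
    """Number of UTF-8 code units (bytes)."""
    return len(s.encode("utf-8"))


def code_units_utf16(s: str) -> int:
    """Number of UTF-16 code units (2 bytes each, no BOM)."""
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    """Number of code points; this is what Python 3's len() counts."""
    return len(s)


def approx_grapheme_clusters(s: str) -> int:
    """Rough approximation of grapheme clusters, NOT full UAX #29.

    Assumption: marks (categories Mn, Me, Mc) extend the previous
    cluster, and any code point that follows a ZWJ joins it too.
    """
    count = 0
    joined = False  # True when the previous code point was a ZWJ
    for ch in s:
        if ch == ZWJ:
            joined = True
            continue
        extends = unicodedata.category(ch) in ("Mn", "Me", "Mc")
        if not extends and not (joined and count > 0):
            count += 1
        joined = False
    return count


decomposed = "o\u0308"   # 'o' followed by COMBINING DIAERESIS
precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

for name, text in [("decomposed", decomposed),
                   ("precomposed", precomposed),
                   ("family", family)]:
    print(name,
          code_units_utf8(text),
          code_units_utf16(text),
          code_points(text),
          approx_grapheme_clusters(text))
```

<div>Under these assumptions the decomposed and precomposed forms differ in code units and code points but both count as one cluster, which is why the Python example in the quoted message pads them differently. None of these counts says anything about the rendered width, which still depends on the font.</div>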