<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">SG16 discussed this topic some more at
our last meeting. Notes are available at:<br>
-
<a class="moz-txt-link-freetext" href="https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#july-11th-2018">https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#july-11th-2018</a><br>
<br>
Victor suggested defining field widths in terms of an
encoding-agnostic concept that maps to extended grapheme clusters
(EGCs) for Unicode encodings. For other encodings, it would
presumably map in some implementation-defined way, probably
one-to-one to code points. I like this idea. During the discussion, an
experiment was suggested: take <a moz-do-not-send="true"
href="https://github.com/ericniebler/range-v3/blob/master/example/calendar.cpp">Eric
Niebler's range-v3 calendar example</a> [1] and modify it to
print emojis for holidays. For example, to substitute U+1F384
Christmas Tree for December 25th:<br>
<br>
<pre>
       October                November               December
              1  2  3    1  2  3  4  5  6  7          1  2  3  4  5
  4  5  6  7  8  9 10    8  9 10 11 12 13 14    6  7  8  9 10 11 12
 11 12 13 14 15 16 17   15 16 17 18 19 20 21   13 14 15 16 17 18 19
 18 19 20 21 22 23 24   22 23 24 25  🦃 27 28   20 21 22 23 24  🎄 26
 25 26 27 28 29 30  🎃   29 30                  27 28 29 30 31
</pre>
<br>
This email is formatted to use a fixed-width font for the calendar
data above. On my system, each emoji is rendered so that it
consumes more than one (but less than two!) columns of output,
breaking the intended presentation. This
presumably occurs because fixed-width variants of these characters
are not available. This makes me wonder how useful EGCs will be as
field width units in practice.<br>
<br>
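As a rough illustration of the gap between counting characters and counting rendered columns, here is a minimal Python sketch. It assumes the common terminal convention that East Asian Wide and Fullwidth characters take two columns; the helper names display_columns and pad_left are hypothetical, and actual rendering still depends on the font and renderer.<br>
<pre>
```python
import unicodedata

def display_columns(text):
    """Estimate terminal columns using the East Asian Width property."""
    columns = 0
    for ch in text:
        # Combining marks and format characters (such as ZWJ) take no column.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide and Fullwidth characters conventionally take two columns.
        columns += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return columns

def pad_left(text, width):
    """Right-align text in a field measured in estimated columns."""
    return " " * max(0, width - display_columns(text)) + text

tree = "\U0001F384"  # U+1F384 CHRISTMAS TREE

print(repr("{:>3}".format(tree)))  # padded by code points: two spaces
print(repr(pad_left(tree, 3)))     # padded by estimated columns: one space
print(repr(pad_left("25", 3)))     # ordinary digits: one space
```
</pre>
The estimate is only a heuristic: a renderer without a two-column glyph for the emoji can still break the alignment.<br>
<br>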
Mark Davis recently posted the following link to the (non-SG16)
Unicode mailing list. It discusses, among a number of other
interesting topics, rendering of emojis as a single user-perceived
character vs. multiple user-perceived characters. This is
relevant to the discussion of family emojis.<br>
-
<a class="moz-txt-link-freetext" href="https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview">https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview</a><br>
<br>
With regard to interpretation of fill characters, I think there
needs to be a requirement that the fill "character" consume
exactly one unit of field width, however that is defined.<br>
<br>
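For comparison, Python's format specification already enforces a similar rule, with the unit being one code point. The sketch below uses a hypothetical helper, try_fill, to show that a precomposed "ö" is accepted as a fill while the decomposed two-code-point form is rejected; it illustrates existing Python behavior, not proposed wording.<br>
<pre>
```python
def try_fill(fill):
    """Right-align "x" in a field of width 3 using the given fill string.

    Returns None when Python rejects the format specification.
    """
    try:
        return format("x", fill + ">3")
    except ValueError:
        return None

precomposed = "\u00F6"  # one code point: LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"  # two code points: "o" + COMBINING DIAERESIS

print(repr(try_fill("*")))          # ASCII fill is accepted
print(repr(try_fill(precomposed)))  # single code point fill is accepted
print(repr(try_fill(decomposed)))   # two code point fill is rejected
```
</pre>
<br>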
Tom.<br>
<br>
[1]:
<a class="moz-txt-link-freetext" href="https://github.com/ericniebler/range-v3/blob/master/example/calendar.cpp">https://github.com/ericniebler/range-v3/blob/master/example/calendar.cpp</a><br>
<br>
On 07/08/2018 05:24 PM, Corentin wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CA+Om+SgCSnjNqJpK=H1=ZLHwSRXNQH25TUWxsSMwCxUKK=cQMQ@mail.gmail.com">
<div dir="ltr">The *implementation* complexity of using
grapheme clusters (there is also a runtime complexity) would
only come from the order in which things are standardized.
<div>I think a grapheme cluster iterator is something SG16 wants,
and the day we have that, it would be easy to add it to fmt
(since the wording would be well defined, and the implementor
would have to implement support for grapheme clusters anyway).</div>
<div><br>
</div>
<div>That's why it's probably wise to ignore all charX_t
overloads and specializations (for both parameters and format
strings) until such time as we can use these basic
building blocks and common definitions.</div>
<div><br>
</div>
<div>Otherwise, I think you are probably right that grapheme
clusters make sense for Unicode strings (those using charX_t).</div>
<div><br>
</div>
<div>Except, of course, that this is completely in the hands of the
renderer. Your family emoji renders as 4 emojis because my
computer probably lacks the appropriate font.</div>
<div>We must accept that we cannot provide a way for the value
of width to match what will actually be rendered; that is, if
people use it for text alignment, it will never be right.</div>
<div>The way graphical software deals with that is to rely
on font metrics; that is, it computes a width from the actual
font used to do the rendering.</div>
<div><br>
</div>
<div><br>
</div>
<div>4/ For me (I think it was lost in the chat), the
semantics of N should depend on the value_type of the output
iterator/function.</div>
<div><br>
</div>
<div>Corentin </div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Sun, Jul 8, 2018 at 23:04, Victor Zverovich
&lt;<a href="mailto:victor.zverovich@gmail.com"
moz-do-not-send="true">victor.zverovich@gmail.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">Just a small followup on our discussion of
P0645 Text Formatting during the previous meeting.<br>
<br>
1. Interpretation of width with multibyte encodings and
combining characters.<br>
<br>
P0645R2 currently doesn't specify the units of width.
Possible options are (from lower to higher abstraction
level):<br>
<br>
* Code units<br>
* Code points<br>
* Grapheme clusters<br>
<br>
Python 3 uses code points as can be seen from the following
example:<br>
<br>
>>> o = b'\x6F\xCC\x88'.decode('utf8')<br>
>>> o<br>
'ö'<br>
>>> '{:>2}'.format(o)<br>
'ö' # note missing space<br>
>>> o = b'\xC3\xB6'.decode('utf8')<br>
>>> o<br>
'ö'<br>
>>> '{:>2}'.format(o)<br>
' ö'<br>
<br>
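The difference between the two results above comes down to code point counts, which the following short sketch makes explicit. It uses only the standard unicodedata module; the variable names are illustrative. It also shows that NFC normalization maps the decomposed form onto the precomposed one, so the padding depends on the normalization form of the input.<br>
<pre>
```python
import unicodedata

decomposed = b"\x6F\xCC\x88".decode("utf8")  # "o" + U+0308 COMBINING DIAERESIS
precomposed = b"\xC3\xB6".decode("utf8")     # U+00F6

# Python measures width in code points, so the two forms pad differently.
print(len(decomposed), len(precomposed))
print(repr("{:>2}".format(decomposed)))   # no padding added
print(repr("{:>2}".format(precomposed)))  # one space of padding

# After NFC normalization both forms are the same single code point.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)
```
</pre>
<br>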
I have a slight preference for grapheme clusters because,
according to <a href="http://unicode.org/reports/tr29/"
target="_blank" moz-do-not-send="true">Unicode Standard
Annex #29 UNICODE TEXT SEGMENTATION</a>, they correspond
to “user-perceived characters” (at least that seems to be
the intention; whether they succeed in that is
another question).<br>
<br>
Zach provided an example of <span
style="font-size:small;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">"👨\u200D👩\u200D👧\u200D👦", <span
style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">where
\u200D is a zero-width joiner (ZWJ),</span> rendered as
a single glyph representing a family <span
style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">"👨👩👧👦".
However, if I interpret the following part of <a
href="http://www.unicode.org/reports/tr29/tr29-29.html#GB10"
target="_blank" moz-do-not-send="true">http://www.unicode.org/reports/tr29/tr29-29.html#GB10</a>
correctly:</span></span>
<div><br>
</div>
<div>> Do not break within emoji modifier sequences or
emoji zwj sequences.</div>
<div><br>
</div>
<div>this is not a problem and <span
style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">"👨\u200D👩\u200D👧\u200D👦"
will constitute a single grapheme cluster.</span></div>
<div><br>
</div>
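<div>To make the three candidate units concrete for the family emoji, here is a small Python sketch. The code unit and code point counts are exact. The approx_clusters function is a hypothetical, deliberately simplified stand-in that only handles combining marks and ZWJ; it is not an implementation of the UAX #29 rules.</div>
<pre>
```python
import unicodedata

ZWJ = "\u200D"
# MAN, WOMAN, GIRL, BOY joined by zero-width joiners.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

def approx_clusters(text):
    """Very rough grapheme cluster count (assumption: NOT full UAX #29).

    Combining marks attach to the previous character, and a ZWJ glues the
    following character onto the current cluster.
    """
    count = 0
    join_next = False
    for ch in text:
        if ch == ZWJ:
            join_next = True
            continue
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue
        if join_next and count > 0:
            join_next = False
            continue
        join_next = False
        count += 1
    return count

print(len(family.encode("utf-8")))           # UTF-8 code units
print(len(family.encode("utf-16-le")) // 2)  # UTF-16 code units
print(len(family))                           # code points
print(approx_clusters(family))               # approximate grapheme clusters
```
</pre>
<div><br>
</div>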
<div>That said, making grapheme clusters the unit of width may add
significant complexity to the implementation for minor
benefit, so I'm fine with going with code points, especially
since there is an established precedent for doing this
(Python) and it's already an improvement over stdio &
iostreams.</div>
<div>
<div><br>
</div>
<div>2. Interpretation of fill.</div>
<div><br>
</div>
<div>It seems there was general agreement that fill
should be a code point, but please let me know if you
have other ideas.</div>
<div><br>
</div>
<div>3. There was a question about <span
style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">signed
and unsigned char. I checked, and there is no special
handling for these types, which means that they are
treated as integral types; only char and wchar_t are
treated specially as character types (and later
charN_t will be added).</span></div>
<div><br>
</div>
<div>4. Interpretation of n in format_to_n.</div>
<div><br>
</div>
<div>There was no agreement whether n should be specified
in code units or code points. An argument in favor of
code units is that n often gives the output buffer size.
On the other hand, using code points would be more
consistent with width.</div>
<div><br>
</div>
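<div>The trade-off can be sketched in Python by truncating the same string under each interpretation of n. Both helper names are hypothetical, and UTF-8 is assumed as the output encoding. A code unit limit always fits the buffer but may have to drop a partially written code point, whereas a code point limit can need more code units than n.</div>
<pre>
```python
def truncate_code_units(text, n):
    """Keep at most n UTF-8 code units, dropping any split trailing code point."""
    return text.encode("utf-8")[:n].decode("utf-8", errors="ignore")

def truncate_code_points(text, n):
    """Keep at most n code points, regardless of encoded size."""
    return text[:n]

text = "h\u00E9llo"  # "é" takes two UTF-8 code units

by_units = truncate_code_units(text, 2)
by_points = truncate_code_points(text, 2)

print(repr(by_units), len(by_units.encode("utf-8")))
print(repr(by_points), len(by_points.encode("utf-8")))
```
</pre>
<div><br>
</div>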
<div>I plan to add support for specifying width and fill
as code points in fmt (<span
style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Zach
gave some useful pointers on how to do that, thanks!)
and will report back with any user feedback.</span></div>
<div><br>
</div>
</div>
<div>Cheers,</div>
<div>Victor</div>
</div>
_______________________________________________<br>
SG16 Unicode mailing list<br>
<a href="mailto:Unicode@isocpp.open-std.org" target="_blank"
moz-do-not-send="true">Unicode@isocpp.open-std.org</a><br>
<a href="http://www.open-std.org/mailman/listinfo/unicode"
rel="noreferrer" target="_blank" moz-do-not-send="true">http://www.open-std.org/mailman/listinfo/unicode</a><br>
</blockquote>
</div>
</blockquote>
<p><br>
</p>
</body>
</html>