<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 10/09/2018 01:39 AM, Markus Scherer
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr">On Mon, Oct 8, 2018 at 7:45 PM Tom Honermann
<<a href="mailto:tom@honermann.net"
moz-do-not-send="true">tom@honermann.net</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<div class="m_-4949948655659317562moz-cite-prefix">On
10/08/2018 12:38 PM, Markus Scherer wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">> <span
style="color:rgb(0,0,0);font-family:sans-serif;font-size:medium">ICU
supports customization of its internal code unit
type, but </span><code
class="m_-4949948655659317562gmail-highlight"><span
style="color:rgb(153,0,85)">char16_t</span></code><span
style="color:rgb(0,0,0);font-family:sans-serif;font-size:medium"> is
used by default, following ICU’s adoption of C++11.</span><br>
<div><br>
</div>
<div>Not quite... ICU supports customization of its
code unit type <u><i>for C APIs</i></u>.
Internally, and in C++ APIs, we switched to
char16_t. And because that broke call sites, we
mitigated where we could with overloads and shim
classes.</div>
</div>
</blockquote>
<br>
Ah, thank you for the correction. If we end up submitting
a revision of the paper, I'll include this correction. I
had checked the ICU sources (<tt>include/unicode/umachine.h</tt>)
and verified that the <tt>UChar</tt> typedef was
configurable, but I didn't realize that configuration was
limited to C code.<br>
</div>
</blockquote>
<div><br>
</div>
<div>We limited it to C API by doing s/UChar/char16_t/g in C++
API, except where we replaced a raw pointer with a shim
class. So you won't see "UChar" in C++ API any more at all.</div>
<div><br>
</div>
<div>Internally to compiling ICU itself, we kept UChar in
existing code (so that we didn't have to change tens of
thousands of lines) but fixed it to be a typedef for
char16_t.</div>
<div><br>
</div>
<div>Unfortunately, if UChar is configured != char16_t, you
need casts or cast helpers for using C APIs from C++ code.</div>
</div>
</div>
</blockquote>
<br>
I see, thanks for the detail. The C standard defines a (very) few
functions in terms of the C <tt>char16_t</tt> typedef (<tt>mbrtoc16</tt>,
<tt>c16rtomb</tt>). Within C++, those functions are exposed in the
<tt>std</tt> namespace as though they were declared with the C++
builtin <tt>char16_t</tt> type. Has there been much consideration
for similarly exposing ICU's C APIs to C++ consumers? (This
technique is not without complexities. For example, attempting to
take the address of an overloaded function without a cast may be
ambiguous. I'm just curious how much this or similar techniques
were explored and what the conclusions were.)<br>
<br>
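A minimal sketch of the address-of complication mentioned above. The function <tt>demo_strlen</tt> and the alias <tt>Char16Len</tt> are made up for illustration and are not ICU or standard library names.<br>
<pre>
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical C-style API declared with a configurable 16-bit code unit type.
// (Not an ICU function; the name is an assumption for this sketch.)
inline std::size_t demo_strlen(const std::uint16_t* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// C++ overload exposing the same operation for char16_t strings.
inline std::size_t demo_strlen(const char16_t* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Function pointer type naming the char16_t overload.
using Char16Len = std::size_t (*)(const char16_t*);

inline Char16Len pick_char16_overload() {
    // `auto p = &demo_strlen;` would not compile here: with two overloads
    // and no target type, the compiler cannot choose between them.
    return static_cast<Char16Len>(&demo_strlen);
}
```
</pre>
Ordinary calls such as <tt>demo_strlen(u"Hi")</tt> resolve without help; only taking the address needs the explicit target type.<br>
<br>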
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">It would be
interesting to get more perspective on how and why ICU
evolved like it did. What was the motivation for ICU to
switch to <tt>char16_t</tt>? Were the anticipated
benefits realized despite the perhaps unanticipated
complexities?</div>
</blockquote>
<div><br>
</div>
<div>We assumed that C++ code was going to adopt char16_t and
maybe std::u16string, and we wanted it to be easy for ICU to
work with those types.</div>
</div>
</div>
</blockquote>
<br>
Perhaps that will still happen :)<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>In particular, the string literals weighed heavily. For
the most part, we can ignore the standard library when it
comes to Unicode, but the previous lack of real UTF-16
string literals was extremely inconvenient. We used to have
all kinds of static const UChar arrays with numeric
initializer lists, or init-once code for setting up string
"constants", even when they contained only ASCII characters.</div>
</div>
</div>
</blockquote>
<br>
I remember doing similarly back in the day :)<br>
<br>
I also remember looking forward to C99 compound literals so as to
avoid the statics:<br>
<br>
<tt>typedef unsigned short UChar;</tt><tt><br>
</tt><tt>typedef const UChar UTF16_LITERAL[];</tt><tt><br>
</tt><tt>void use(const UChar*);</tt><tt><br>
</tt><tt>void f() {</tt><tt><br>
</tt><tt> use((UTF16_LITERAL){ 0x48 /*H*/, 0x69 /*i*/, 0 });</tt><tt><br>
</tt><tt>}</tt><br>
<br>
I prefer real literals :)<br>
<br>
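For comparison, a small sketch of the two spellings side by side, assuming a C++11 compiler. The names <tt>kHiArray</tt>, <tt>kHiLiteral</tt>, and <tt>same_code_units</tt> are invented for this example.<br>
<pre>
```cpp
#include <cassert>
#include <string>

// Older style: a UTF-16 "constant" spelled out as numeric code units.
static const char16_t kHiArray[] = { 0x48 /*H*/, 0x69 /*i*/, 0 };

// C++11 style: a real UTF-16 string literal, including the terminator.
static const char16_t kHiLiteral[] = u"Hi";

// Returns true when both spellings hold the same code units.
inline bool same_code_units() {
    return std::u16string(kHiArray) == std::u16string(kHiLiteral);
}
```
</pre>
Both arrays hold three code units, but only the second stays readable once the text is longer than a couple of characters.<br>
<br>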
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>Now that we can use u"literals" we managed to clean up
some of our code, and new library code and especially new
unit test code benefits greatly.</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">If Windows were to
suddenly sprout Win32 interfaces defined in terms of <tt>char16_t</tt>,
would the pain be substantially relieved?</div>
</blockquote>
<div><br>
</div>
<div>No. Many if not most of our users are not on Windows, or
at least not only on Windows. UTF-16 is fairly widely used.</div>
<div><br>
</div>
<div>Anyway, I doubt that Windows will do that. Operating
systems want to never break code like this, and these would
all be duplicates.</div>
<div>Although I suppose they could do it as a header-only
shim.</div>
</div>
</div>
</blockquote>
<br>
I've never heard of any plans to add such interfaces. I was just
curious whether they would be helpful if they were added. I suspect
they would help Windows users, but perhaps not exceptionally
so.<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>Microsoft was pretty unhappy with this change in ICU.
They went with it because they were early in their
integration of ICU into Windows.</div>
<div><br>
</div>
<div>They also have somewhat fewer problems: I believe they
concluded that the aliasing trick was so developer-hostile
that they decided never to optimize based on it, at least
for the types involved. I don't think our aliasing barrier
is defined on Windows.</div>
</div>
</div>
</blockquote>
<br>
I can understand that. It might make sense for us to consider
allowing <tt>reinterpret_cast<char8_t*>(char_pointer_expression)</tt>
to not be undefined behavior, at least as a deprecated feature. We
could actually specify this since the underlying type of <tt>char8_t</tt>
would be the same everywhere (unlike <tt>char16_t</tt>).<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>If u"literals" had just been uint16_t* without a new
type, then we could have used string literals without
changing API and breaking call sites, on most platforms
anyway. And if uint16_t==wchar_t on Windows, then that would
have been fine, too.</div>
</div>
</div>
</blockquote>
<br>
How would that have been fine on Windows? The reinterpret casts
would still have been required. I suspect the alias barrier would
still be needed for non-Microsoft compilers on Windows.<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>Note: Of course there are places where we use uint16_t*
binary data, but there is never any confusion whether a
function works with binary data vs. a string. You just
wouldn't use the same function or name for unrelated
operations.</div>
<div><br>
</div>
<div>Note also: While most of ICU works with UTF-16, we do
have some UTF-8 functions. We distinguish the two with
different function names, such as in <a
href="http://icu-project.org/apiref/icu4c/classicu_1_1CaseMap.html"
moz-do-not-send="true">class CaseMap</a> (toLower() vs.
utf8ToLower()).</div>
<div><br>
</div>
<div>If we had operations that worked on both UTF-8 and some
other charset, we would also use different names.</div>
</div>
</div>
</blockquote>
<br>
This may be where we have differing perspectives. The trend in
modern C++ is towards generic code and overloading plays an
important role there.<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">Are code bases that
use ICU on non-Windows platforms (slowly) migrating from <tt>uint16_t</tt>
to <tt>char16_t</tt>?<br>
</div>
</blockquote>
<div><br>
</div>
<div>I don't remember what Chromium and Android ended up
doing. You could take a look at their code.</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<blockquote type="cite">
<div dir="ltr">
<div>If you do want a distinct type, why not just
standardize on uint8_t? Why does it need to be a new
type that is distinct from that, too?<br>
</div>
</div>
</blockquote>
Lyberta provided one example; we do need to be able to
overload or specialize on character vs integer types.</div>
</blockquote>
<div><br>
</div>
<div>I don't find the examples so far convincing. Overloading
on primitive types to distinguish between UTF-8 vs. one or
more legacy charsets seems both unnecessary and like bad
practice. Explicit naming of things that are different is
good.</div>
</div>
</div>
</blockquote>
<br>
Lyberta provided one example, but there are others, such as
serialization and logging libraries. Consider a modern JSON
library; it is convenient to be able to write code like the
following and have it just work.<br>
<br>
<tt>json_object player;</tt><br>
<tt>uint16_t scores[] = { 16, 27, 13 };</tt><br>
<tt>player["id"] = 42;</tt><br>
<tt>player["name"] = std::u16string(u"Skipper McGoof");</tt><br>
<tt>player["nickname"] = u"Goofy"; // stores a string</tt><br>
<tt>player["scores"] = scores; // stores an array of numbers</tt><br>
<br>
Note that the above works because <tt>uint16_t</tt> is effectively
never defined in terms of a character type. That isn't true for <tt>uint8_t</tt>,
which is typically a typedef for <tt>unsigned char</tt>.<br>
<br>
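A rough sketch of why that distinction matters for overloading. The <tt>classify</tt> functions are hypothetical stand-ins for the imaginary <tt>json_value::operator=</tt> overloads, not part of any real JSON library.<br>
<pre>
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>

// char16_t is a distinct type in C++, so it is never the same as uint16_t.
static_assert(!std::is_same<char16_t, std::uint16_t>::value,
              "char16_t and uint16_t are different types");

// Hypothetical classifier: text arrives as char16_t data.
inline std::string classify(const char16_t*) { return "string"; }
inline std::string classify(const std::u16string&) { return "string"; }

// Arrays of uint16_t are treated as numbers, never as text.
template <std::size_t N>
std::string classify(const std::uint16_t (&)[N]) { return "array of numbers"; }

// Plain integers are stored as a single number.
inline std::string classify(int) { return "number"; }
```
</pre>
The same split cannot be written for <tt>uint8_t</tt> and <tt>unsigned char</tt> on platforms where they are one type, which is the gap a distinct <tt>char8_t</tt> is meant to close.<br>
<br>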
Other examples come up in language binding libraries like sol2 where
it is desirable to map native types across language boundaries.<br>
<br>
Having different types for character data makes the above possible
without having to hard-code for specific string types. In the
concepts enabled world that we are moving into, this enables us to
write concepts like the following that can then be used to constrain
functions intended to work only on string-like types.<br>
<br>
<tt>template<typename T><br>
concept Character = AnySameUnqualified<T, char, wchar_t,
char8_t, char16_t, char32_t>;<br>
template<typename T></tt><tt><br>
</tt><tt>concept String = Range<T> &&
Character<ValueType<T>>;</tt><br>
<br>
For the imaginary JSON example above, we might then write:<br>
<br>
<tt>template<String S></tt><br>
<tt>json_value& json_value::operator=(const S& s) {</tt><br>
<tt> to_utf8_string(s);</tt><br>
<tt> return *this;</tt><br>
<tt>}</tt><br>
<tt>template<Character C></tt><br>
<tt>json_value& json_value::operator=(const C* s) {</tt><br>
<tt> to_utf8_string(s);</tt><br>
<tt> return *this;</tt><br>
<tt>}</tt><br>
<tt>template<Number T, std::size_t N></tt><br>
<tt>json_value& json_value::operator=(const T (&a)[N]) {</tt><br>
<tt> to_array(a);</tt><br>
<tt> return *this;</tt><br>
<tt>}</tt><br>
<br>
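The same dispatch can be approximated without concepts. The following is only a sketch using C++11 type traits; <tt>is_character</tt> and <tt>store_kind</tt> are invented names, and <tt>char8_t</tt> is left out so that it compiles before C++20.<br>
<pre>
```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Trait listing the built-in character types (char8_t omitted on purpose).
template <typename T> struct is_character : std::false_type {};
template <> struct is_character<char> : std::true_type {};
template <> struct is_character<wchar_t> : std::true_type {};
template <> struct is_character<char16_t> : std::true_type {};
template <> struct is_character<char32_t> : std::true_type {};

// Selected for pointers to character data.
template <typename C>
typename std::enable_if<is_character<C>::value, std::string>::type
store_kind(const C*) { return "string"; }

// Selected for arrays whose elements are arithmetic but not characters.
template <typename T, std::size_t N>
typename std::enable_if<std::is_arithmetic<T>::value &&
                        !is_character<T>::value, std::string>::type
store_kind(const T (&)[N]) { return "array"; }
```
</pre>
A string literal matches only the first overload because its element type is a character type, while an array of <tt>unsigned short</tt> matches only the second.<br>
<br>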
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>What makes sense to me is that "char" can be signed, and
that's bad for dealing with non-ASCII characters.</div>
</div>
</div>
</blockquote>
<br>
Yes, yes it is :)<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div>In ICU, when I get to actual UTF-8 processing, I tend to
either cast each byte to uint8_t or cast the whole pointer
to uint8_t* and call an internal worker function.</div>
<div>Somewhat ironically, the fastest way to test for a UTF-8
trail byte is via the opposite cast, testing if
(int8_t)b<-0x40.</div>
</div>
</div>
</blockquote>
<br>
Assuming a two's complement representation, which we're nearly set to
be able to assume in C++20 (<a class="moz-txt-link-freetext" href="http://wg21.link/p0907">http://wg21.link/p0907</a>)!<br>
<br>
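To make the trail byte trick concrete, here is a small sketch comparing it with the usual bit-mask test. The function names are made up, and the signed version assumes the conversion to <tt>int8_t</tt> wraps as it does on two's complement platforms.<br>
<pre>
```cpp
#include <cassert>
#include <cstdint>

// Conventional test: UTF-8 trail bytes have the bit pattern 10xxxxxx.
inline bool is_trail_mask(std::uint8_t b) { return (b & 0xC0) == 0x80; }

// Signed-comparison test: bytes 0x80..0xBF become -128..-65 as int8_t,
// which are exactly the values below -0x40 (-64).
// Assumption: the conversion wraps modulo 2^8 (two's complement).
inline bool is_trail_signed(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}

// Checks that the two tests agree for every possible byte value.
inline bool tests_agree() {
    for (int v = 0; v <= 0xFF; ++v) {
        const std::uint8_t b = static_cast<std::uint8_t>(v);
        if (is_trail_mask(b) != is_trail_signed(b)) return false;
    }
    return true;
}
```
</pre>
The signed form needs a single comparison instead of a mask plus a comparison.<br>
<br>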
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>This is why I said it would be much simpler if the "char"
default could be changed to be unsigned.</div>
<div>I realize that non-portable code that assumes a signed
char type would then need the opposite command-line option
that people now use to force it to unsigned.</div>
</div>
</div>
</blockquote>
<br>
I haven't thought about this enough yet to have a sense of how big a
change this would be.<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">Since <tt>uint8_t</tt>
is conditionally supported, we can't rely on its existence
within the standard (we'd have to use <tt>unsigned char</tt>
or <tt>uint_least8_t</tt> instead).<br>
</div>
</blockquote>
<div><br>
</div>
<div>I seriously doubt that there is a platform that keeps up
with modern C++ and does not have a real uint8_t.</div>
</div>
</div>
</blockquote>
<br>
That may be. Removing the conditionally supported qualification
might be a possibility these days. I'm really not sure.<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>ICU is one of the more widely portable libraries (or was,
until we adopted C++11 and left some behind) and would
likely fail royally if the uint8_t and uint16_t types we are
using were actually wider than advertised and revealed
larger values etc. Since ICU is also widely used, that would
break a lot of systems. But no one has ever reported a bug
(or request for porting patches) related to non-power-of-2
integer types.</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">I think there is value
in maintaining consistency with <tt>char16_t</tt> and <tt>char32_t</tt>.
<tt>char8_t</tt> provides the missing piece needed to
enable a clean, type safe, external vs internal encoding
model that allows use of any of UTF-8, UTF-16, or UTF-32
as the internal encoding, that is easy to teach, and that
facilitates generic libraries like text_view that work
seamlessly with any of these encodings.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Maybe. I don't see the need to use the same function
names for a variety of legacy charsets vs. UTF-8.<br>
</div>
</div>
</div>
</blockquote>
<br>
I do. Again, primarily for writing generic code. I expect the need
to do so to increase in modern C++.<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>20 years ago I wrote a set of macros that looked the same
but had versions for UTF-8, UTF-16, and UTF-32. I briefly
thought we could make (some of?) ICU essentially switchable
between UTFs. I quickly learned that any real, non-trivial
code you would want to write for either of them wants to be
specific to that UTF, especially when people want text
processing to be fast. (You can see remnants of this
youthful folly in ICU's unicode/utf_old.h header file.)</div>
</div>
</div>
</blockquote>
<br>
I agree that when you get down to actually manipulating the text,
you effectively need (chunks of) contiguous storage and encoding
specific support and at that point, the desire to overload or
specialize drops significantly. The advantages in being able to
deduce an encoding or overload/specialize appear at higher levels of
abstraction - in code that only needs to recognize and direct text
to the right low level function.<br>
<br>
Tom.<br>
<br>
<blockquote type="cite"
cite="mid:CAN49p6pixKYgb6cX3ZsgVhzVF-FqP3Yus5aW2utp6Ac3jFQM-A@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>Best regards,</div>
<div>markus</div>
</div>
</div>
</blockquote>
<p><br>
</p>
</body>
</html>