<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Tue, Oct 9, 2018 at 8:57 PM Tom Honermann &lt;<a href="mailto:tom@honermann.net">tom@honermann.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<div class="m_3185301157471047293moz-cite-prefix">The C standard defines a (very) few
functions in terms of the C <tt>char16_t</tt> typedef (<tt>mbrtoc16</tt>,
<tt>c16rtomb</tt>). Within C++, those functions are exposed in the
<tt>std</tt> namespace as though they were declared with the C++
builtin <tt>char16_t</tt> type. Has there been much consideration
for similarly exposing ICU's C APIs to C++ consumers?</div></div></blockquote><div><br></div><div>C++ code calls ICU C APIs all the time.</div><div>People use C APIs because they can be binary stable, and they want to be able to link with multiple versions of the ICU DLL.</div><div><br></div><div>People who call C++ APIs either tightly control DLL versions or link everything statically.</div><div><br></div><div>It would be really nice if it were feasible to provide a stable C++ API from a shared library.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF"><div class="m_3185301157471047293moz-cite-prefix">(This
technique is not without complexities. For example, attempting to
take the address of an overloaded function without a cast may be
ambiguous. I'm just curious how much this or similar techniques
were explored and what the conclusions were)<br></div></div></blockquote><div><br></div><div>Not sure what the question is.</div><div>There is of course no overloading on C APIs.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF"><div class="m_3185301157471047293moz-cite-prefix"></div>
<blockquote type="cite"><div dir="ltr"><div class="gmail_quote"><div>If u"literals" had just been uint16_t* without a new
type, then we could have used string literals without
changing API and breaking call sites, on most platforms
anyway. And if uint16_t==wchar_t on Windows, then that would
have been fine, too.<br></div>
</div>
</div>
</blockquote>
<br>
How would that have been fine on Windows? The reinterpret casts
would still have been required.<br></div></blockquote><div><br></div><div>Why? If the two types had been typedefs of each other, there would not need to be any casts.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF">Lyberta provided one example, but there are others. For example,
serialization and logging libraries. Consider a modern JSON
library; it is convenient to be able to write code like the
following that just works.<br>
<br>
<tt><tt>json_object player;</tt></tt><br>
<tt><tt><tt>uint16_t scores[] = { 16, 27, 13 };<br>
</tt>player["id"] = 42;<br>
</tt>player["name"] = std::u16string("Skipper McGoof");<br>
player["nickname"] = u"Goofy"; // stores a string<br>
player["scores"] = scores; // stores an array of numbers.<br>
</tt><br>
Note that the above works because <tt>uint16_t</tt> is effectively
never defined in terms of a character type.</div></blockquote><div><br></div><div>Sure, but that feels like cherry-picking: you introduce one new type for one specific kind of thing (a pointer to certain units holding a string), but every other kind of data that is a vector of essentially the same base units is still indistinguishable -- you wouldn't be able to tell scores from coordinates from other lists of numbers, etc.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF">Having different types for character data makes the above possible
without having to hard-code for specific string types. In the
concepts-enabled world that we are moving into, this enables us to
write concepts like the following that can then be used to constrain
functions intended to work only on string-like types.<br></div></blockquote><div><br></div><div>I take your word for it. I know nothing about "concepts".</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF"><blockquote type="cite"><div dir="ltr"><div class="gmail_quote"><div>In ICU, when I get to actual UTF-8 processing, I tend to
either cast each byte to uint8_t or cast the whole pointer
to uint8_t* and call an internal worker function.</div>
<div>Somewhat ironically, the fastest way to test for a UTF-8
trail byte is via the opposite cast, testing if
(int8_t)b&lt;-0x40.</div>
</div>
</div>
</blockquote>
<br>
Assuming a two's complement representation, which we're nearly set to
be able to assume in C++20 (<a class="m_3185301157471047293moz-txt-link-freetext" href="http://wg21.link/p0907" target="_blank">http://wg21.link/p0907</a>)!<br></div></blockquote><div><br></div><div>Well, this is nice! Especially</div></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div class="gmail_quote"><div><em style="color:rgb(0,0,0);font-family:sans-serif;font-size:medium">Change</em><span style="color:rgb(0,0,0);font-family:sans-serif;font-size:medium"> Right-shift is an arithmetic right shift which performs sign-extension.</span></div></div></blockquote><div class="gmail_quote"><div>which should get static-analysis tools off our backs.</div><div><br></div><div>It is only because those tools complained about code where we use arithmetic right shifts that I had to write a macro that does the normal (signed&gt;&gt;num_bits) on normal compilers, and a manual sign extension when compiling for static analysis...</div><div>I don't think it's been an issue on any real compiler. All machines that anyone ever ported ICU to seem to use two's-complement integers of 8/16/32/... bits.</div><div><br></div><div>markus</div></div></div>
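<div>For readers who want the two tricks discussed above in concrete form, here is a minimal sketch, not ICU's actual code. The names <tt>isUtf8TrailByte</tt> and <tt>shiftRightArithmetic</tt> are hypothetical and made up for illustration. The first function assumes that converting a byte to <tt>int8_t</tt> wraps modulo 256 (guaranteed from C++20, and true in practice on two's-complement machines before that). The second shows one possible way to get sign extension without shifting a negative value, which is the kind of thing a static-analysis fallback might do.</div>

```cpp
#include <cstdint>

// Hypothetical helper: true if b is a UTF-8 trail byte (0x80..0xBF).
// Reinterpreted as int8_t, those bytes are -128..-65, i.e. below -0x40 (-64),
// so a single signed comparison replaces the usual mask-and-compare.
// Assumes the uint8_t -> int8_t conversion wraps (two's complement).
inline bool isUtf8TrailByte(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}

// Hypothetical helper: arithmetic (sign-extending) right shift that never
// shifts a negative value. For negative input, ~value is non-negative, so
// the shift is fully defined; complementing again restores the sign bits.
// Expects 0 <= numBits < 32.
inline std::int32_t shiftRightArithmetic(std::int32_t value, int numBits) {
    return value < 0 ? ~(~value >> numBits) : (value >> numBits);
}
```

<div>On compilers where <tt>&gt;&gt;</tt> on a signed value already sign-extends, the second function should give the same result as the plain shift; it only spells out the sign extension so that no negative value is ever shifted.</div>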