[SG16-Unicode] Draft SG16 direction paper

Tom Honermann tom at honermann.net
Wed Oct 17 06:16:38 CEST 2018


On 10/16/2018 05:58 PM, Markus Scherer wrote:
> On Tue, Oct 9, 2018 at 8:57 PM Tom Honermann <tom at honermann.net> wrote:
>
>     The C standard defines a (very) few functions in terms of the C
>     char16_t typedef (mbrtoc16, c16rtomb). Within C++, those functions
>     are exposed in the std namespace as though they were declared with
>     the C++ builtin char16_t type.  Has there been much consideration
>     for similarly exposing ICU's C APIs to C++ consumers?
>
>
> C++ code calls ICU C APIs all the time.

Of course, sorry, I wasn't very clear with that question.  Let me try 
again.  I was responding to this quote:

 > Unfortunately, if UChar is configured != char16_t, you need casts or
 > cast helpers for using C APIs from C++ code.

The question is, effectively, whether consideration has been given to 
providing cast helpers in a manner similar to how standard C++ provides 
access to standard C functions; e.g., by exposing cast helpers in a C++ 
namespace.  More concretely, whether something like the following has 
been considered:

    U_STABLE UChar * U_EXPORT2
    u_strchr(const UChar *s, UChar c);

    #if defined(__cplusplus)
    namespace icu {
       char16_t * U_EXPORT2
       u_strchr(const char16_t *s, char16_t c);
    }
    #endif /* __cplusplus */

Note that, on at least some platforms, there are ways to avoid writing 
a separate definition for the namespace-scoped signature when the 
functions have compatible calling conventions.
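
To make the idea concrete, here is a minimal sketch with an explicit
wrapper definition. It is not ICU code: LegacyUChar, legacy_strchr and
the namespace sketch are hypothetical stand-ins for a C API whose unit
type is configured as something other than char16_t. It assumes that
both unit types have the same size and representation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a C API unit type that is not char16_t.
typedef std::uint16_t LegacyUChar;

// Hypothetical C function: find the first occurrence of c in s.
extern "C" LegacyUChar *legacy_strchr(const LegacyUChar *s, LegacyUChar c) {
    for (;; ++s) {
        if (*s == c) return const_cast<LegacyUChar *>(s);
        if (*s == 0) return nullptr;
    }
}

namespace sketch {
// C++-only overload: the casts live here instead of at every call site.
// Caveat: reading char16_t storage through a LegacyUChar pointer is not
// sanctioned by the strict aliasing rules. This sketch relies on the
// usual implementation behaviour, as cast helpers of this kind do.
inline char16_t *legacy_strchr(const char16_t *s, char16_t c) {
    static_assert(sizeof(char16_t) == sizeof(LegacyUChar),
                  "code unit sizes must match");
    return reinterpret_cast<char16_t *>(
        ::legacy_strchr(reinterpret_cast<const LegacyUChar *>(s),
                        static_cast<LegacyUChar>(c)));
}
}  // namespace sketch

// Usage: a u"literal" can be passed without a cast at the call site.
inline void check_wrapper() {
    const char16_t *text = u"abc";
    assert(sketch::legacy_strchr(text, u'b') == text + 1);
    assert(sketch::legacy_strchr(text, u'z') == nullptr);
}
```

The C entry point stays untouched, so binary stability of the C API is
not affected; only C++ consumers see the extra overload.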

> People use C APIs because they can be binary stable, and they want to 
> be able to link with multiple versions of the ICU DLL.

Indeed.

>
> People who call C++ APIs either tightly control DLL versions or link 
> everything statically.

Despite not wanting to...

>
> It would be really nice if it was feasible to provide stable C++ API 
> from a shared library.

but having to because of this :)

>
>     (This technique is not without complexities.  For example,
>     attempting to take the address of an overloaded function without a
>     cast may be ambiguous.  I'm just curious how much this or similar
>     techniques were explored and what the conclusions were)
>
>
> Not sure what the question is.
> There is of course no overloading on C APIs.

Hopefully I've clarified this above.
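
For the address-of complexity mentioned in the quoted text, a small
sketch may help. The overload pair below is hypothetical and declared
in a single namespace for brevity; the point is only that the name
alone no longer identifies one function once two signatures share it.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {
enum class UnitKind { Integer, Utf16 };

// Hypothetical overloads that differ only in the code unit type.
inline UnitKind unit_kind(const std::uint16_t *) { return UnitKind::Integer; }
inline UnitKind unit_kind(const char16_t *) { return UnitKind::Utf16; }
}  // namespace sketch

// auto fp = &sketch::unit_kind;  // ill-formed: no overload can be chosen

using Char16Fn = sketch::UnitKind (*)(const char16_t *);

// Option 1: the type of the initialized pointer selects the overload.
inline Char16Fn select_by_target_type() {
    Char16Fn fp = &sketch::unit_kind;
    return fp;
}

// Option 2: an explicit cast selects the overload.
inline Char16Fn select_by_cast() {
    return static_cast<Char16Fn>(&sketch::unit_kind);
}

inline void check_overload_selection() {
    assert(select_by_target_type()(u"x") == sketch::UnitKind::Utf16);
    assert(select_by_cast()(u"x") == sketch::UnitKind::Utf16);
}
```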

>
>>     If u"literals" had just been uint16_t* without a new type, then
>>     we could have used string literals without changing API and
>>     breaking call sites, on most platforms anyway. And if
>>     uint16_t==wchar_t on Windows, then that would have been fine, too.
>
>     How would that have been fine on Windows?  The reinterpret casts
>     would still have been required.
>
>
> Why? If the two types had been typedefs of each other, there would 
> not need to be any casts.

I overlooked your mention of uint16_t==wchar_t.  However, uint16_t was 
added in C99 and I suspect it would have already been too late to define 
it as wchar_t when u"literals" were adopted.  Additionally, that would 
have resulted in the same problems that we now face with int8_t commonly 
being defined in terms of a character type.
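
The int8_t problem can be seen with stream insertion. This is a small
sketch; the helper name streamed is made up for illustration, and the
behaviour for int8_t depends on whether the implementation defines it
as a character type (common, but not guaranteed).

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical helper: format a value the way operator<< would.
template <typename T>
std::string streamed(T value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

// True where int8_t is a typedef of a character type.
constexpr bool int8_is_character_type =
    std::is_same<std::int8_t, signed char>::value ||
    std::is_same<std::int8_t, char>::value;

inline void check_streaming() {
    // int16_t is an ordinary integer type, so the number is printed.
    assert(streamed(std::int16_t{65}) == "65");
    // Where int8_t is a character type, the same value prints as 'A'.
    assert(streamed(std::int8_t{65}) ==
           (int8_is_character_type ? "A" : "65"));
}
```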

>
>     Lyberta provided one example, but there are others.  For example,
>     serialization and logging libraries.  Consider a modern JSON
>     library; it is convenient to be able to write code like the
>     following that just works.
>
>     json_object player;
>     uint16_t scores[] = { 16, 27, 13 };
>     player["id"] = 42;
>     player["name"] = std::u16string("Skipper McGoof");
>     player["nickname"] = u"Goofy"; // stores a string
>     player["scores"] = scores;     // stores an array of numbers.
>
>     Note that the above works because uint16_t is effectively never
>     defined in terms of a character type.
>
>
> Sure, but that feels like cherry-picking: You introduce one new type 
> for one specific kind of thing (a pointer to certain units holding a 
> string), but every other data that's a vector of essentially the same 
> base units is still not distinguishable -- you wouldn't be able to 
> distinguish scores from coordinates from other lists of numbers etc.

That is a fair criticism.  The trend is to improve the ability to 
distinguish such unit kinds.  We see this in the C++20 std::chrono 
library and other libraries like https://github.com/nholthaus/units.  
C++11 user-defined literals (despite some usability issues) are intended 
to help in this respect.  Where we have core language features (e.g., 
string literals), I think it is reasonable to be able to differentiate 
them without having to further decorate them.
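
As an illustration of what the distinct type buys in the quoted JSON
example, here is a minimal sketch. It is not taken from any real JSON
library: Value, Kind and make_value are hypothetical names. The two
overloads can coexist only because char16_t and uint16_t are different
types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Kind { String, NumberArray };

struct Value {
    Kind kind;
    std::size_t length;  // code units for strings, elements for arrays
};

// Selected for u"literals" and other null-terminated UTF-16 text.
inline Value make_value(const char16_t *text) {
    std::size_t n = 0;
    while (text[n] != 0) ++n;
    return Value{Kind::String, n};
}

// Selected for arrays of 16-bit numbers.
template <std::size_t N>
Value make_value(const std::uint16_t (&)[N]) {
    return Value{Kind::NumberArray, N};
}

inline void check_make_value() {
    std::uint16_t scores[] = {16, 27, 13};
    assert(make_value(u"Goofy").kind == Kind::String);
    assert(make_value(scores).kind == Kind::NumberArray);
}
```

If u"literals" had been arrays of uint16_t, both calls would have the
same argument type apart from the length, and an interface like this
could not tell text from numbers without extra decoration.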

Tom.

>
>     Having different types for character data makes the above possible
>     without having to hard-code for specific string types.  In the
>     concepts enabled world that we are moving into, this enables us to
>     write concepts like the following that can then be used to
>     constrain functions intended to work only on string-like types.
>
>
> I take your word for it. I know nothing about "concepts".
>
>>     In ICU, when I get to actual UTF-8 processing, I tend to either
>>     cast each byte to uint8_t or cast the whole pointer to uint8_t*
>>     and call an internal worker function.
>>     Somewhat ironically, the fastest way to test for a UTF-8 trail
>>     byte is via the opposite cast, testing if (int8_t)b<-0x40.
>
>     Assuming a two's complement representation, which we're nearly
>     set to be able to assume in C++20 (http://wg21.link/p0907)!
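
The trail byte test quoted above can be sketched as follows. The
function names are made up for illustration. The narrowing conversion
to int8_t is implementation-defined before C++20 for values above 0x7F,
so the sketch assumes the usual modular (two's complement) behaviour.

```cpp
#include <cassert>
#include <cstdint>

// UTF-8 trail bytes are 0x80..0xBF. Reinterpreted as a signed 8-bit
// value they become -128..-65, which is everything below -0x40 (-64).
inline bool is_utf8_trail(std::uint8_t b) {
    return static_cast<std::int8_t>(b) < -0x40;
}

// The conventional mask-and-compare form, for comparison.
inline bool is_utf8_trail_masked(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}

inline void check_trail_bytes() {
    assert(is_utf8_trail(0x80) && is_utf8_trail(0xBF));
    assert(!is_utf8_trail(0x7F) && !is_utf8_trail(0xC0));
}
```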
>
>
> Well, this is nice! Especially
>
>     /Change/ Right-shift is an arithmetic right shift which performs
>     sign-extension.
>
> which should get static-analysis tools off our backs.
>
> Only because those have complained about code where we use arithmetic 
> right shifts did I have to make a macro that does the normal 
> (signed>>num_bits) on normal compilers, and a manual sign extension 
> when compiling for static analysis...
> I don't think it's been an issue on any real compiler. All machines 
> that anyone ever ported ICU to seem to use two's-complement integers 
> of 8/16/32/... bits.
>
> markus
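
The workaround described in the quoted text might look roughly like the
sketch below. This is not ICU's actual macro: SKETCH_STATIC_ANALYSIS
and the function names are hypothetical, and the shift count is assumed
to be in the range 0..31.

```cpp
#include <cassert>
#include <cstdint>

// Manual sign extension that never right-shifts a negative value.
// For negative x, ~x is non-negative, so shifting it is well defined;
// complementing the result restores the sign-extended bits.
inline std::int32_t shift_right_manual(std::int32_t x, int n) {
    return x >= 0 ? (x >> n) : ~(~x >> n);
}

// Hypothetical switch: manual form for static analysis builds, plain
// shift elsewhere. The plain shift of a negative value is
// implementation-defined before C++20 and arithmetic in practice.
inline std::int32_t shift_right_arithmetic(std::int32_t x, int n) {
#if defined(SKETCH_STATIC_ANALYSIS)
    return shift_right_manual(x, n);
#else
    return x >> n;
#endif
}

inline void check_shift() {
    assert(shift_right_manual(-8, 1) == -4);
    assert(shift_right_manual(-1, 5) == -1);
    assert(shift_right_manual(17, 2) == 4);
}
```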

