[SG16-Unicode] Draft SG16 direction paper

Tom Honermann tom at honermann.net
Tue Oct 9 04:45:30 CEST 2018


On 10/08/2018 12:38 PM, Markus Scherer wrote:
> > ICU supports customization of its internal code unit type, but
> > char16_t is used by default, following ICU’s adoption of C++11.
>
> Not quite... ICU supports customization of its code unit type for C
> APIs. Internally, and in C++ APIs, we switched to char16_t. And
> because that broke call sites, we mitigated where we could with
> overloads and shim classes.

Ah, thank you for the correction.  If we end up submitting a revision of 
the paper, I'll include it.  I had checked the ICU sources 
(include/unicode/umachine.h) and verified that the UChar typedef was 
configurable, but I didn't realize that configuration was limited to the 
C APIs.

>
> This was all quite painful.

I believe that.  I discovered the U_ALIASING_BARRIER macro used to work 
around the fact that, for example, accessing a char16_t buffer through a 
pointer obtained via reinterpret_cast<const wchar_t*> results in 
undefined behavior.  The need for such heroics is somewhat reduced for 
char8_t, since char and unsigned char are allowed to alias char8_t 
(though not the other way around).
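
To make the aliasing rules concrete, here is a minimal sketch of my own 
(not ICU code); it assumes a Windows-like target where wchar_t holds 
UTF-16 code units, and a compiler with char8_t support:

    #include <cstddef>

    // wchar_t and char16_t may have identical representations here, but
    // they are distinct types that do not alias each other.
    std::size_t bad_length(const char16_t* s) {
        // The cast itself is permitted; reading through p is undefined
        // behavior because wchar_t glvalues may not access char16_t
        // objects.
        const wchar_t* p = reinterpret_cast<const wchar_t*>(s);
        std::size_t n = 0;
        while (p[n] != L'\0') ++n;  // UB without an aliasing barrier
        return n;
    }

    // By contrast, unsigned char (and char) may read objects of any
    // type, including char8_t, so this direction is well defined.
    unsigned char first_byte(const char8_t* s) {
        return *reinterpret_cast<const unsigned char*>(s);  // OK
    }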

It would be interesting to get more perspective on how and why ICU 
evolved like it did.  What was the motivation for ICU to switch to 
char16_t? Were the anticipated benefits realized despite the perhaps 
unanticipated complexities?  If Windows were to suddenly sprout Win32 
interfaces defined in terms of char16_t, would the pain be substantially 
relieved?  Are code bases that use ICU on non-Windows platforms (slowly) 
migrating from uint16_t to char16_t?

>
> As for char8_t, I realize that you think the benefits outweigh the costs.
> I asked some C++ experts about the potential for performance gains 
> from better optimizations; one responded with a skeptical note.

This is something I would like to get more data on.  I've looked and 
I've asked, but so far haven't found any research that attempts to 
quantify the optimization opportunities lost because char is permitted 
to alias any type.  I've heard claims that the cost is significant, but 
have not seen data to support them.  The benefits of type-based alias 
analysis (TBAA) in general are not disputed, and it seems reasonable to 
conclude that there is a lost opportunity when TBAA cannot be applied 
fully to accesses made through char. But whether that opportunity is 
large or small I really don't know. In theory, we could use the current 
support in gcc and Clang for char8_t to explore this further; the sketch 
below shows the kind of code one might measure.
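
As an illustration (my own sketch, assuming the implementation applies 
TBAA to char8_t stores), this is the classic pattern where char's 
may-alias-anything rule forces pessimistic code:

    // The compiler must assume the store through dst might modify *len,
    // so *len is reloaded on every iteration.
    void fill_char(char* dst, const int* len) {
        for (int i = 0; i < *len; ++i)
            dst[i] = 0;
    }

    // char8_t is a distinct type with no blanket aliasing permission,
    // so the compiler may hoist the load of *len out of the loop.
    void fill_char8(char8_t* dst, const int* len) {
        for (int i = 0; i < *len; ++i)
            dst[i] = 0;
    }

Comparing the generated code (and measuring) for pairs like this would 
be one way to start quantifying the effect.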

>
> If you do want a distinct type, why not just standardize on uint8_t? 
> Why does it need to be a new type that is distinct from that, too?

Lyberta provided one example; we do need to be able to overload or 
specialize on character versus integer types.  Since uint8_t is an 
optional typedef, we can't rely on its existence within the standard 
(we'd have to use unsigned char or uint_least8_t instead), and on most 
implementations it is just an alias of unsigned char, so it doesn't give 
us a distinct type to overload on anyway.
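
A quick sketch of my own to illustrate (the print overloads are 
hypothetical, not from the thread):

    #include <cstdint>

    // On typical implementations uint8_t is a typedef of unsigned char,
    // so these would declare the same function, not two overloads:
    //   void print(unsigned char c);
    //   void print(std::uint8_t c);   // redeclaration, not an overload

    // char8_t is a distinct type, so "UTF-8 code unit" and "small
    // integer" can be distinguished by overloading or specialization:
    void print(unsigned char c);  // raw byte / integer
    void print(char8_t c);        // UTF-8 code unit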

I think there is value in maintaining consistency with char16_t and 
char32_t. char8_t provides the missing piece needed for a clean, 
type-safe model of external versus internal encodings: one that allows 
any of UTF-8, UTF-16, or UTF-32 as the internal encoding, is easy to 
teach, and facilitates generic libraries like text_view that work 
seamlessly with all three encodings.
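
For instance (a hypothetical sketch of my own; this is not text_view's 
actual interface), with all three code unit types being distinct, 
generic code can select the encoding from the character type alone:

    // Each UTF code unit type maps to exactly one encoding.
    template <typename CharT> struct default_encoding;  // primary
    template <> struct default_encoding<char8_t> {
        static constexpr const char* name = "UTF-8";
    };
    template <> struct default_encoding<char16_t> {
        static constexpr const char* name = "UTF-16";
    };
    template <> struct default_encoding<char32_t> {
        static constexpr const char* name = "UTF-32";
    };

    // Generic code is written once and works with any of the three
    // internal encodings, with no runtime tagging needed.
    template <typename CharT>
    const char* encoding_name(const CharT*) {
        return default_encoding<CharT>::name;
    }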

Tom.

>
> Best regards,
> markus

