[SG16-Unicode] Questions about some corner cases of proposed std::basic_text encoding implementation

Ansel Sermersheim ansel at copperspice.com
Sat Nov 2 06:16:14 CET 2019


Hello all,

This email is an attempt to summarize for the mailing list some areas of 
concern I had after JeanHeyd's very helpful and explanatory presentation 
at CppCon regarding some of the current thinking on standardizing the 
Unicode story in C++. I hope these concerns are either unfounded or
have been rendered moot by developments since our conversation.
Nevertheless, I thought it would be beneficial to bring them up to this
group for consideration.

1) There was some discussion about whether or not char32_t is guaranteed 
to be a Unicode Code Point. JeanHeyd pointed me to 
https://wg21.link/p1041, which makes it clear that for string literals 
at least this is guaranteed.

However, this is not sufficiently specified for all cases. For instance,
a GB 18030 encoding *must* use code points in the PUA. If a string
literal contains a PUA code point, how can you know its intended
interpretation? Making this a compile error seems problematic, but the
right answer is not clear to me.
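
To make the concern concrete, here is a minimal sketch. The helper names
is_private_use and is_scalar_value are made up for illustration and come
from no proposal; only the three range boundaries are taken from the
Unicode Standard. It shows that a char32_t can pass every validity check
while its meaning still depends on a private agreement that the type
cannot carry.

```cpp
// Hypothetical helpers, for illustration only.

// True if cp lies in one of the three Private Use Areas.
constexpr bool is_private_use(char32_t cp) noexcept {
    return (cp >= 0xE000 && cp <= 0xF8FF)        // BMP Private Use Area
        || (cp >= 0xF0000 && cp <= 0xFFFFD)      // Supplementary PUA-A (plane 15)
        || (cp >= 0x100000 && cp <= 0x10FFFD);   // Supplementary PUA-B (plane 16)
}

// True if cp is a Unicode scalar value: in range and not a surrogate.
constexpr bool is_scalar_value(char32_t cp) noexcept {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

For U+E000 both functions return true, so validity alone says nothing
about which PUA convention (GB 18030 or otherwise) a literal follows.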

2) The issue of PUA usage also comes up in the implementation of
Encoding Objects. It seems likely that the current direction will
require some third-party library to handle encodings other than the
main UTF ones. That seems reasonable. But without a standard mechanism
that at least enumerates other common interpretations, and lets
third-party libraries declare which of them they support, there will be
a combinatorial explosion of mutually incompatible encodings.

3) Similarly, and with often overlapping concerns, a standardized way
for encodings to declare which version of Unicode they support is quite
important. It's also not clear how some of the round-trip encodings can
possibly be fully specified in the type system. For example, how could I
properly encode "UTF-8 Unicode version 10" text containing emoji into
"UTF-16 Unicode version 5" text that uses the PUA to represent the
emoji, for display on OS X 10.7?
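
As a rough sketch of what the type system can and cannot say here,
assume hypothetical tag types that carry only a Unicode major version.
The names utf8_tag, utf16_tag and repertoire_preserved are invented for
this example and are not part of any proposal. Because Unicode never
removes an assigned character, converting to an equal or newer version
keeps the repertoire. The reverse direction can be flagged at compile
time, but the PUA mapping needed to make it display correctly remains a
runtime policy.

```cpp
// Hypothetical encoding tags that record a Unicode major version.
template <int UnicodeMajor>
struct utf8_tag {
    static constexpr int unicode_major = UnicodeMajor;
};

template <int UnicodeMajor>
struct utf16_tag {
    static constexpr int unicode_major = UnicodeMajor;
};

// True when every character assigned in From's Unicode version is also
// assigned in To's version, i.e. the target version is not older.
template <class From, class To>
constexpr bool repertoire_preserved() noexcept {
    return To::unicode_major >= From::unicode_major;
}
```

Here repertoire_preserved<utf8_tag<10>, utf16_tag<5>>() is false, which
flags the conversion but says nothing about which PUA assignments the
target platform expects.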

4) The behavior of std::basic_text with respect to null termination is 
valid but seems potentially risky. As I understand it, std::basic_text 
will be null terminated if the underlying container is the default 
std::basic_string. This seems likely to result in encoding 
implementations which inadvertently assume null termination on their 
operands. Our work on early versions of the CsString library persuaded
us that optional null termination is a source of some really obscure
buffer-overrun bugs, and we eventually elected to force null
termination for all strings.
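
A small illustration of the failure mode, as a sketch rather than
anything taken from the std::basic_text design: both functions below are
hypothetical UTF-8 code point counters. The first honors the explicit
length; the second assumes a terminator. On a std::string_view that
covers only part of a larger buffer they disagree, and on a buffer with
no terminator at all the second would read out of bounds.

```cpp
#include <cstddef>
#include <string_view>

// Counts UTF-8 code points by counting every byte that is not a
// continuation byte (10xxxxxx). Uses only the explicit length.
constexpr std::size_t count_code_points(std::string_view s) noexcept {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Same counting rule, but scans until a '\0' byte. This is only correct
// when the caller guarantees null termination.
constexpr std::size_t count_code_points_until_nul(const char* p) noexcept {
    std::size_t n = 0;
    for (; *p != '\0'; ++p) {
        if ((static_cast<unsigned char>(*p) & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For a view of the first five bytes of "hello world", count_code_points
returns 5, while count_code_points_until_nul called on the view's data()
returns 11 because it runs on to the end of the underlying literal.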

Thanks for reading, and I hope these comments help inform the eventual
standard,

Ansel Sermersheim


