[SG16-Unicode] Questions about some corner cases of proposed std::basic_text encoding implementation

Lyberta lyberta at lyberta.net
Sat Nov 2 13:11:00 CET 2019


Ansel Sermersheim:
> 1) There was some discussion about whether or not char32_t is guaranteed 
> to be a Unicode Code Point. JeanHeyd pointed me to 
> https://wg21.link/p1041, which makes it clear that for string literals 
> at least this is guaranteed.

Yes, char32_t is a bad type, along with char8_t and char16_t: none of them
guarantees that a value is valid Unicode. For that reason I'm proposing
strong types with proper guarantees:

https://github.com/Lyberta/cpp-unicode/blob/master/Fundamental.md
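
To illustrate what "strong types with proper guarantees" could mean, here is
a minimal sketch assuming C++17. The name scalar_value and its interface are
hypothetical and are not taken from the linked proposal; the only fixed fact
is the definition of a Unicode scalar value (U+0000..U+D7FF and
U+E000..U+10FFFF).

```cpp
#include <cstdint>
#include <optional>

// Hypothetical strong type (not the proposal's API): an object of this
// type can only ever hold a Unicode scalar value.
class scalar_value {
public:
    // True for U+0000..U+D7FF and U+E000..U+10FFFF.
    static constexpr bool is_valid(std::uint32_t v) noexcept {
        const bool surrogate = v >= 0xD800 && v <= 0xDFFF;
        return v <= 0x10FFFF && !surrogate;
    }

    // Checked factory: returns an empty optional rather than
    // constructing an invalid value.
    static constexpr std::optional<scalar_value> from(std::uint32_t v) noexcept {
        if (!is_valid(v)) return std::nullopt;
        return scalar_value(v);
    }

    constexpr std::uint32_t value() const noexcept { return v_; }

private:
    // Private, so the invariant cannot be bypassed from outside.
    constexpr explicit scalar_value(std::uint32_t v) noexcept : v_(v) {}
    std::uint32_t v_;
};
```

With a type like this, a function taking scalar_value never has to re-check
its argument, which a plain char32_t parameter cannot promise.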

You can also put ill-formed Unicode in string literals via numeric escape
sequences. This is bad too.
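
A short sketch of the problem, assuming a compiler that accepts hexadecimal
escapes up to the width of the code unit type (as current mainstream
compilers do). The helper is_scalar_value is hypothetical and exists only
for this illustration.

```cpp
// A lone high surrogate: not well-formed UTF-16, yet an accepted literal.
constexpr char16_t lone_surrogate[] = u"\xD800";

// A value above U+10FFFF: not a code point at all, yet an accepted literal.
constexpr char32_t out_of_range[] = U"\x110000";

// Hypothetical helper: true if v is a Unicode scalar value.
constexpr bool is_scalar_value(char32_t v) noexcept {
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}

// Both literals compile, and neither holds a scalar value.
static_assert(!is_scalar_value(lone_surrogate[0]), "lone surrogate");
static_assert(!is_scalar_value(out_of_range[0]), "beyond U+10FFFF");
```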

> 
> However, this is not sufficiently specified for all cases. For instance, 
> a GB 18030 encoding *must* use codepoints in the PUA. If a string 
> literal contains a PUA code point, how can you know the interpretation? 
> Making this a compile error seems problematic, but the right answer is 
> not clear to me.

This can probably be solved by a custom instance of
std::unicode::character_database.
> 
> 2) The issue of PUA usage also comes up in the implementation of 
> Encoding Objects. It seems likely that the current direction will 
> necessitate some third party library to handle encodings other than the 
> main UTF ones. That seems reasonable. But without some sort of standard 
> mechanism that at least enumerates other common interpretations, and 
> allows third party libraries to declare their support for such, there 
> will be a combinatorial explosion of mutually incompatible encodings.

I think providing conversions to and from Unicode scalar values is enough.
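
One way to read this: if every encoding only converts to and from scalar
values, any pair of encodings can be bridged through that common pivot
without the two knowing about each other. A minimal sketch, assuming C++17;
the names latin1_encoding, ascii_encoding, and transcode_unit are
hypothetical and do not come from any proposal.

```cpp
#include <optional>

// Hypothetical encoding object for ISO-8859-1 (Latin-1).
struct latin1_encoding {
    using code_unit = unsigned char;

    // Every Latin-1 byte maps directly to U+0000..U+00FF.
    static constexpr std::optional<char32_t> decode(code_unit b) noexcept {
        return static_cast<char32_t>(b);
    }
    // Scalar values above U+00FF have no Latin-1 representation.
    static constexpr std::optional<code_unit> encode(char32_t sv) noexcept {
        if (sv > 0xFF) return std::nullopt;
        return static_cast<code_unit>(sv);
    }
};

// Hypothetical encoding object for 7-bit ASCII.
struct ascii_encoding {
    using code_unit = unsigned char;

    // Bytes above 0x7F are not valid ASCII.
    static constexpr std::optional<char32_t> decode(code_unit b) noexcept {
        if (b > 0x7F) return std::nullopt;
        return static_cast<char32_t>(b);
    }
    static constexpr std::optional<code_unit> encode(char32_t sv) noexcept {
        if (sv > 0x7F) return std::nullopt;
        return static_cast<code_unit>(sv);
    }
};

// Converts one code unit by going through a scalar value. Neither
// encoding needs to know about the other.
template <class From, class To>
std::optional<typename To::code_unit>
transcode_unit(typename From::code_unit u) {
    const auto sv = From::decode(u);
    if (!sv) return std::nullopt;
    return To::encode(*sv);
}
```

Both encodings here use one code unit per character; a multi-byte encoding
would need a decode step that consumes a range instead of a single unit.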

> 
> 3) By a similar construction and often overlapping concerns, the 
> availability of a standardized way for encodings to declare which 
> version of Unicode they support is quite important. It's also not clear
> how some of the round trip encodings can possibly be fully specified in 
> the type system. For example, how could I properly encode "UTF-8 Unicode 
> version 10" text containing emoji into "UTF-16 Unicode version 5" text 
> using the PUA for representation for display on OS X 10.7?

Handling different versions of Unicode and different PUA interpretations is
a job for std::unicode::character_database.

> 
> 4) The behavior of std::basic_text with respect to null termination is 
> valid but seems potentially risky. As I understand it, std::basic_text 
> will be null terminated if the underlying container is the default 
> std::basic_string. This seems likely to result in encoding 
> implementations which inadvertently assume null termination on their 
> operands. Our work on early versions of the CsString library persuaded 
> us that optional null termination is the source of some really obscure 
> bugs of the buffer overrun variety, and we eventually elected to force 
> null termination for all strings.

I think null termination is just bad design. Pointer + length is the way
to go.
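
A small sketch of the difference, using std::string_view (C++17) as the
standard pointer + length type. The function names and the raw buffer are
made up for this illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// Finds the length by scanning for a terminator. It stops at the first
// NUL, and would read past the buffer if there were none.
inline std::size_t length_by_terminator(const char* s) {
    return std::strlen(s);
}

// Takes the length from an explicit (pointer, length) pair. The contents
// of the buffer are never inspected.
inline std::size_t length_by_view(std::string_view s) {
    return s.size();
}

// Five code units with an embedded NUL and no trailing terminator.
constexpr char raw[] = {'a', 'b', '\0', 'c', 'd'};
```

Here length_by_terminator(raw) yields 2, silently dropping "cd", while
length_by_view(std::string_view(raw, sizeof raw)) yields 5.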
