[SG16-Unicode] code_unit_sequence and code_point_sequence
Tom Honermann
tom at honermann.net
Wed Jun 20 18:13:41 CEST 2018
On 06/20/2018 05:34 AM, keld at keldix.com wrote:
> On Tue, Jun 19, 2018 at 09:52:05PM -0400, Tom Honermann wrote:
>> On 06/19/2018 04:19 PM, Lyberta wrote:
>>> keld at keldix.com:
>>>> Is your code point advisory the same as codepoints in 10646/Unicode, also
>>>> called characters in 10646?
>>> Yes. A code point is an unsigned 32-bit integer with a value in the
>>> range 0-10FFFF (hexadecimal). Modern C and C++ have the type char32_t,
>>> which is most suitable for holding code points.
>>>
>>>> And why not just treat these as 32-bit wchar-t?
>>>> I believe this is what we do in C.
>>> Because the wide execution character set is implementation-defined. So far
>>> nobody has expressed an opinion in favor of changing that, and Windows
>>> violates the standard by having a 16-bit wchar_t.
>> Technically, Windows doesn't violate the standard by having a 16-bit
>> wchar_t. It violates the standard by using a wide execution character
>> set that defines code points that do not fit in its (16-bit) wchar_t
>> type. We have an issue (https://github.com/sg16-unicode/sg16/issues/9)
>> to track modifying the standard to enable Microsoft's implementation to
>> be conforming.
> I believe that using a 16-bit wchar_t to handle UCS characters in a UTF-16 form is a violation of the
> C++ standard. You need to do some processing of surrogates, which is not portable to
> other platforms, and is against specs for wchar_t.
I think we are agreeing. Specifically, it violates [lex.ccon]p6
(http://eel.is/c++draft/lex.ccon#6) and [basic.fundamental]p5
(http://eel.is/c++draft/basic.fundamental#5).
>
> I do not think this obsoletes wchar_t; that some people use it wrongly
> should not lead to its obsoletion.
I agree with this sentiment.
>
> Using a 16-bit wchar_t is OK if you restrict yourself to only a 16-bit subset of UCS.
I don't disagree, but for modern applications, limiting support to the
BMP is a pretty significant restriction. And modern applications need
to work on Windows and interact with the wchar_t based Win32 UTF-16 APIs.
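
To make the BMP restriction concrete, here is a minimal sketch of UTF-16 encoding. The helper name encode_utf16 is hypothetical (it is not part of the standard library or Win32), and the sketch assumes its input is a valid Unicode scalar value, so it does no error checking. It only illustrates that a code point above U+FFFF cannot be held in a single 16-bit code unit and must be split into a surrogate pair.

```cpp
#include <vector>

// Hypothetical helper, for illustration only: encode one code point as UTF-16.
// Assumes cp is a valid Unicode scalar value (no validation is performed).
std::vector<char16_t> encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // BMP code point: fits in one 16-bit code unit.
        return { static_cast<char16_t>(cp) };
    }
    // Supplementary code point: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000;
    return { static_cast<char16_t>(0xD800 + (cp >> 10)),     // high (lead) surrogate
             static_cast<char16_t>(0xDC00 + (cp & 0x3FF)) };  // low (trail) surrogate
}
```

For example, U+20AC encodes as the single code unit 0x20AC, while U+1F600 encodes as the pair 0xD83D 0xDE00. Code that treats each 16-bit wchar_t as one character handles only the first case.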
>
> I am happy to have a specific type to handle code points that are defined to have
> UCS code point values. I just note that I think APIs to handle such a type would need to
> have exactly the same functionality as for handling wchar_t entities.
If I'm reading this correctly, it sounds like you are expressing a
preference that text interfaces should be consistently provided for
char, wchar_t, char16_t, char32_t (and char8_t). If so, I agree.
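
As a rough sketch of what "consistently provided" could look like, a single function template can serve every character type. The name code_unit_count is hypothetical; the only library facility assumed is std::char_traits<CharT>::length, which the standard library specializes for char, wchar_t, char16_t, and char32_t (char8_t would depend on that proposal being adopted).

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration: one interface that works for every character type.
// Counts code units (not code points) in a null-terminated sequence.
template <typename CharT>
std::size_t code_unit_count(const CharT* s) {
    return std::char_traits<CharT>::length(s);
}
```

The same call works for "abc", L"abc", u"abc", and U"abc". Note that the result is in code units: u"\U0001F600" yields 2 (a surrogate pair), whereas U"\U0001F600" yields 1.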
Tom.