[SG16-Unicode] Strong code unit types

Tom Honermann tom at honermann.net
Wed Dec 5 15:15:17 CET 2018


On 12/5/18 8:05 AM, Steve Downey wrote:
> `codepoint` also, which is probably "just" a char32_t?

No, I think a type that isn't convertible from code unit types is 
desirable.  (I'm interpreting your response as implying that 'codepoint' 
would just be a type alias of 'char32_t' as opposed to a distinct strong 
type.)

Consider the std::isalnum example we discussed this week.  The 
problem was that it was being called with code unit values, but its 
parameter type means something more like a code point.  Code like the 
following is well-formed and follows current recommendations for correct 
use of std::isalnum, but is nevertheless incorrect for multibyte 
encodings that reuse valid leading code unit values as trailing code 
unit values (e.g., Shift-JIS).

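#include <cctype>

void f(const char *s) {
   while (*s) {
     // Each code unit is classified on its own, even when it is the
     // trailing code unit of a multibyte character.
     if (std::isalnum(static_cast<unsigned char>(*s++))) {
       ...
     }
   }
}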
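
To make the failure concrete, here is a minimal sketch that applies the 
same byte-at-a-time pattern to a Shift-JIS string.  The helper name 
count_alnum_code_units and the test string are hypothetical and only for 
illustration; the sketch assumes the default "C" locale.

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical helper (not from the original post): counts the code units
// that std::isalnum accepts, using the same loop shape as f() above.
std::size_t count_alnum_code_units(const char *s) {
    std::size_t n = 0;
    while (*s) {
        if (std::isalnum(static_cast<unsigned char>(*s++))) {
            ++n;
        }
    }
    return n;
}

// Shift-JIS encoding of KATAKANA LETTER A (U+30A2): lead 0x83, trail 0x41.
// The trailing code unit has the same value as ASCII 'A'.
const char katakana_a_sjis[] = "\x83\x41";
```

In the "C" locale, count_alnum_code_units(katakana_a_sjis) reports one 
alphanumeric code unit, because the trailing 0x41 is classified as if it 
were the letter 'A' rather than as part of a two-byte character.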

Use of a distinct type for code points that is not implicitly 
convertible from a code unit type prevents these kinds of problems.
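
As a rough sketch of what such a type could look like (assuming C++14 or 
later; the names code_point and is_alnum are hypothetical and not a 
proposed interface), construction can be limited to an explicit 
conversion from char32_t, with every other source type rejected at 
compile time:

```cpp
#include <type_traits>

// Hypothetical strong type; name and interface are illustrative only.
class code_point {
public:
    // Only char32_t is accepted, and only via explicit construction.
    constexpr explicit code_point(char32_t v) noexcept : value_(v) {}

    // Deleted catch-all: an exact-match template wins overload resolution
    // for char, unsigned char, int, char16_t, etc., and is then rejected.
    template <class T>
    code_point(T) = delete;

    constexpr char32_t value() const noexcept { return value_; }

private:
    char32_t value_;
};

// Hypothetical classifier that takes a code point, not a code unit.
// ASCII-only for brevity; a real one would consult Unicode properties.
constexpr bool is_alnum(code_point cp) noexcept {
    const char32_t c = cp.value();
    return (c >= U'0' && c <= U'9') ||
           (c >= U'A' && c <= U'Z') ||
           (c >= U'a' && c <= U'z');
}

// No implicit conversion from any code unit type, including char32_t.
static_assert(!std::is_convertible<char, code_point>::value, "");
static_assert(!std::is_convertible<char32_t, code_point>::value, "");
// Even explicit construction from a non-char32_t code unit is rejected.
static_assert(!std::is_constructible<code_point, unsigned char>::value, "");
static_assert(std::is_constructible<code_point, char32_t>::value, "");
```

With this sketch, is_alnum(static_cast<unsigned char>(*s++)) no longer 
compiles, so the caller has to decode code units into a code point 
before classifying it.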

Tom.

>
> On Wed, Dec 5, 2018, 01:40 Tom Honermann <tom at honermann.net> wrote:
>
>     On 12/4/18 11:17 PM, Lyberta wrote:
>     > This is something that hit me recently. Why are we using fundamental
>     > types for code units? CppCon 2018 is full of people saying that we
>     > should migrate to strong types, that std::size_t should have been a
>     > struct, etc.
>     The primary reason for using fundamental types for code units is that
>     those are the types used for character and string literals.
>     >
>     > I propose we add strong types for code units:
>     >
>     > * utf8_code_unit
>     > * utf16_code_unit
>     > * utf32_code_unit
>     >
>     > These will hold char8,16,32_t inside of them respectively but will
>     > not allow the invalid values such as >245 for UTF-8, surrogates
>     > and >0x10FFFF for UTF-32, etc.
>     > This will guarantee that all code units are valid and will allow
>     > us to write much faster code because we will never need to check
>     > for invalid values.
>
>     The downside of such validating types is the validation overhead.
>
>     I am in favor of introducing strong types for code points.
>
>     Tom.
