<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 12/5/18 8:05 AM, Steve Downey wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJEGDKrpmxtKE2HwZ4T4k48j11RqX78JD5Sn6Nw9dCXo4Hc0RQ@mail.gmail.com">
<div dir="auto">`codepoint` also, which is probably "just" a
char32_t? <br>
</div>
</blockquote>
<p>No, I think a type that isn't convertible from code unit types is
desirable. (I'm interpreting your response as implying that
'codepoint' would just be a type alias of 'char32_t' as opposed to
a distinct strong type.)<br>
</p>
<p>I'm thinking about the <tt>std::isalnum</tt> example we discussed
this week. The problem was that it was being called with code
unit values, but its parameter type means something more like a
code point. Code like the following is well-formed and follows
current recommendations for correct use of <tt>std::isalnum</tt>,
but is nevertheless incorrect for multibyte encodings that reuse
valid leading code unit values as trailing code unit values (e.g.,
Shift-JIS).<br>
</p>
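<p><tt>#include &lt;cctype&gt;</tt><tt><br>
</tt><tt><br>
</tt><tt>void f(const char *s) {</tt><tt><br>
</tt><tt>&nbsp; while (*s) {</tt><tt><br>
</tt><tt>&nbsp;&nbsp;&nbsp; if (std::isalnum(static_cast&lt;unsigned
char&gt;(*s++))) {</tt><tt><br>
</tt><tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ...</tt><tt><br>
</tt><tt>&nbsp;&nbsp;&nbsp; }</tt><tt><br>
</tt><tt>&nbsp; }</tt><tt><br>
</tt><tt>}</tt><br>
</p>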
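<p>To make the failure concrete, here is a minimal sketch. It assumes
that the Shift-JIS encoding of KATAKANA LETTER A is the byte pair
0x83 0x41 and that the default "C" locale is in effect. The names
<tt>count_alnum_units</tt> and <tt>sjis_katakana_a</tt> are
hypothetical and exist only for illustration.</p>

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical helper: applies the recommended unsigned char cast to
// every code unit, in the same way as f() above, and counts the hits.
std::size_t count_alnum_units(const char *s) {
    std::size_t n = 0;
    while (*s) {
        if (std::isalnum(static_cast<unsigned char>(*s++))) {
            ++n;
        }
    }
    return n;
}

// Assumed Shift-JIS bytes for one katakana character. The trailing
// code unit 0x41 has the same value as ASCII 'A'.
const char sjis_katakana_a[] = "\x83\x41";
```

<p>Under those assumptions, <tt>count_alnum_units(sjis_katakana_a)</tt>
reports one alphanumeric code unit even though the string holds a
single non-ASCII character, because the trailing code unit is
classified as if it were the letter 'A'.</p>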
<p>Use of a distinct type for code points that is not implicitly
convertible from a code unit type prevents these kinds of
problems.</p>
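<p>As a rough sketch of what such a type could look like (the names
<tt>code_point</tt> and <tt>is_ascii_alnum</tt> are hypothetical and
not taken from any proposal; this is one possible shape, not a
definitive design):</p>

```cpp
#include <type_traits>

// Hypothetical strong type for code points; not an existing standard
// library type.
class code_point {
public:
    // Construction is explicit and only accepts char32_t.
    constexpr explicit code_point(char32_t v) noexcept : value_(v) {}

    // Any other argument type (char, unsigned char, int, ...) selects
    // this deleted overload instead of being converted to char32_t.
    template <typename T>
    code_point(T) = delete;

    constexpr char32_t value() const noexcept { return value_; }

private:
    char32_t value_;
};

// Hypothetical classification function whose parameter is a code point
// rather than an int, so code unit values cannot be passed by accident.
constexpr bool is_ascii_alnum(code_point cp) noexcept {
    return (cp.value() >= U'0' && cp.value() <= U'9') ||
           (cp.value() >= U'A' && cp.value() <= U'Z') ||
           (cp.value() >= U'a' && cp.value() <= U'z');
}

// Code unit types are rejected at compile time.
static_assert(!std::is_constructible<code_point, char>::value, "");
static_assert(!std::is_constructible<code_point, unsigned char>::value, "");
// char32_t is accepted, but only through an explicit conversion.
static_assert(std::is_constructible<code_point, char32_t>::value, "");
static_assert(!std::is_convertible<char32_t, code_point>::value, "");
```

<p>With a type like this, a call such as <tt>is_ascii_alnum(*s++)</tt>
on a <tt>const char *</tt> would fail to compile, which forces the
caller to decode code units into a code point first.</p>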
<p>Tom.<br>
</p>
<blockquote type="cite"
cite="mid:CAJEGDKrpmxtKE2HwZ4T4k48j11RqX78JD5Sn6Nw9dCXo4Hc0RQ@mail.gmail.com"><br>
<div class="gmail_quote">
<div dir="ltr">On Wed, Dec 5, 2018, 01:40 Tom Honermann &lt;<a
href="mailto:tom@honermann.net" moz-do-not-send="true">tom@honermann.net</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">On 12/4/18
11:17 PM, Lyberta wrote:<br>
> This is something that hit me recently. Why are we using
fundamental<br>
> types for code units? CppCon 2018 is full of people
saying that we<br>
> should migrate to strong types, that std::size_t should
have been a<br>
> struct, etc.<br>
The primary reason for using fundamental types for code units
is that <br>
those are the types used for character and string literals.<br>
><br>
> I propose we add strong types for code units:<br>
><br>
> * utf8_code_unit<br>
> * utf16_code_unit<br>
> * utf32_code_unit<br>
><br>
> These will hold char8,16,32_t inside of them respectively
but will not<br>
> allow the invalid values such as >245 for UTF-8,
surrogates and > 0x10FFFF<br>
> for UTF-32, etc.<br>
> This will guarantee that all code units are valid and
will allow us to<br>
> write much faster code because we will never need to
check for invalid<br>
> values.<br>
<br>
The downside of such validating types is the validation
overhead.<br>
<br>
I am in favor of introducing strong types for code points.<br>
<br>
Tom.<br>
<br>
_______________________________________________<br>
SG16 Unicode mailing list<br>
<a href="mailto:Unicode@isocpp.open-std.org" target="_blank"
rel="noreferrer" moz-do-not-send="true">Unicode@isocpp.open-std.org</a><br>
<a href="http://www.open-std.org/mailman/listinfo/unicode"
rel="noreferrer noreferrer" target="_blank"
moz-do-not-send="true">http://www.open-std.org/mailman/listinfo/unicode</a><br>
</blockquote>
</div>
</blockquote>
</body>
</html>