<div dir="ltr"><div><div><div><div>Dear SG16,<br><br></div> I think a codepoint type would be very helpful, even if it is just a strong typedef over char32_t that we define manually in the library. I am not sure it would be a great idea to ask for another primitive type in C++'s Core Language, since this one can be done fairly well in the library with an appropriately operator-strapped strong typedef.<br><br></div> With explicit constructors from `char32_t` we can probably realize this dream fairly well, even if it might make code very verbose. (Making it a regular, non-explicit constructor would probably aid ease of use for those who already use Unicode and work with char32_t or uint32_t and friends.)<br><br></div> Distinct codeunit types are probably not worth the effort. Validation is not done on a per-code-unit basis to begin with: these are multibyte sequences. Fundamentally, validation should work at the multi-code-unit level: presenting anything else proliferates the confusion that a single code unit by itself is meaningful. It is not meaningful.<br><br> Furthermore, there are more encodings than the three we would have these validated code units for. While first-class support for Unicode at such a level would be good, individual code units are hardly worth validating: sequences are what matter. 
This also leaves room for CESU-8, WTF-8, and similar transformations, which may encode values outside the typical range of an individual code unit but still make sense under their own sequencing rules.<br><br></div><div> Let's focus on sequences.<br><br></div><div>All the Best,<br></div><div>JeanHeyd<br></div><div><div><div><div><div><div><div><br></div><div>(P.S.: code_unit and code_point or codeunit and codepoint?)<br></div><div><br><div class="gmail_quote"><div dir="ltr">On Wed, Dec 5, 2018 at 9:15 AM Tom Honermann <<a href="mailto:tom@honermann.net" target="_blank">tom@honermann.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div class="m_3311438504078840135m_-6964030514711219093moz-cite-prefix">On 12/5/18 8:05 AM, Steve Downey wrote:<br>
</div>
<blockquote type="cite">
<div dir="auto">`codepoint` also, which is probably "just" a
char32_t? <br>
</div>
</blockquote>
<p>No, I think a type that isn't convertible from code unit types is
desirable. (I'm interpreting your response as implying that
'codepoint' would just be a type alias of 'char32_t' as opposed to
a distinct strong type)<br>
</p>
<p>Thinking about the <tt>std::isalnum</tt> example we discussed
this week. The problem was that it was being called with code
unit values, but its parameter type means something more like a
code point. Code like the following is well-formed and follows
current recommendations for correct use of <tt>std::isalnum</tt>,
but is nevertheless incorrect for multibyte encodings that reuse
valid leading code unit values as trailing code unit values (e.g.,
Shift-JIS).<br>
</p>
<p><tt>void f(const char *s) {</tt><tt><br>
</tt><tt> while (*s) {</tt><tt><br>
</tt><tt> if (std::isalnum(static_cast<unsigned char>(*s++))) {</tt><tt><br>
</tt><tt> ...</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt>}</tt><br>
</p>
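<p>To make the failure concrete, here is a minimal sketch, assuming the default "C" locale and a Shift-JIS encoded input. The helper name <tt>count_alnum_code_units</tt> and the constant <tt>sjis_ideographic_comma</tt> are made up for illustration. It runs the same per-code-unit loop over the two-byte Shift-JIS encoding of U+3001 IDEOGRAPHIC COMMA (0x81 0x41), whose trailing byte has the same value as ASCII 'A'.</p>

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical helper: counts the code units that std::isalnum accepts,
// inspecting one code unit at a time exactly like the loop in f().
std::size_t count_alnum_code_units(const char* s) {
    std::size_t n = 0;
    while (*s) {
        if (std::isalnum(static_cast<unsigned char>(*s++))) {
            ++n;
        }
    }
    return n;
}

// Shift-JIS encoding of U+3001 IDEOGRAPHIC COMMA: lead byte 0x81, trail byte 0x41.
// The trail byte is indistinguishable from ASCII 'A' when examined on its own.
const char sjis_ideographic_comma[] = "\x81\x41";
```

<p>In the "C" locale, <tt>count_alnum_code_units(sjis_ideographic_comma)</tt> reports one alphanumeric code unit even though the text consists of a single punctuation character, because the trailing byte is classified as if it were a character in its own right.</p>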
<p>Use of a distinct type for code points that is not implicitly
convertible from a code unit type prevents these kinds of
problems.</p>
<p>Tom.<br>
</p></div></blockquote></div></div></div></div></div></div></div></div></div>