[SG16-Unicode] Convert between std::u8string and std::string

Tom Honermann tom at honermann.net
Mon May 6 05:32:56 CEST 2019


On 5/3/19 7:44 PM, JeanHeyd Meneide wrote:
> Note that c8rtomb is actually under-specified in the current C and C++ 
> standards: that is what DR 488, fixed by Philipp K. Krause's N2040 
> (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm) and applied 
> to C2x, was for, although I forget whether it was also applied to the 
> c32rtomb functions.

Well, c8rtomb is definitely under-specified in current C standards since 
it isn't defined there at all :)

When drafting the wording for c8rtomb for C++, I did incorporate updates 
from N2040.  P0482R6 contains the following note:

> /Drafting note: The wording for mbrtoc8 and c8rtomb is derived from 
> wording for mbrtoc16 and c16rtomb in C18 (WG14 N2176 
> <http://www.open-std.org/jtc1/sc22/wg14/www/abq/c17_updated_proposed_fdis.pdf>), 
> augmented by changes suggested in WG14 N2040 
> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm> for WG14 
> DR488 
> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488> to 
> properly account for UTF-8 being a variable length encoding, and 
> lightly edited for formatting style. The author was reluctant to stray 
> from the existing C wording for related functions despite a belief 
> that considerable improvements to the wording would be possible. /

With regard to:

>
> In the case that nothing is stored, use the return value of 0 as a 
> marker that the current character is valid but the mbstate has been 
> modified and that you may be working with a multi-byte sequence, and 
> that you need to feed more input into c8rtomb with the same mbstate_t.

I think this is consistent with the current wording, though the wording 
is not explicit about this case.

>
> With a return value of 0, you can sanity-check the implementation by 
> doing mbsinit(&my_mb_state) and checking if it does NOT return the "I 
> am still in the initial stateless sequence" value after claiming a 
> return value of 0 (the mbstate_t object should be modified since it 
> should be storing part of the accumulated multi-byte sequence).
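
As a rough illustration of the sanity check described above, here is a 
minimal sketch in C. It uses c16rtomb rather than c8rtomb, on the 
assumption that c8rtomb is not yet widely available; the high surrogate of 
a UTF-16 pair stands in for the leading bytes of a UTF-8 sequence. 
feed_unit and struct feed_result are hypothetical names introduced only 
for this example, and the non-initial state after a return of 0 is what an 
implementation following the DR 488 wording would be expected to show, not 
something every library is guaranteed to do.

```c
#include <stddef.h>
#include <uchar.h>   /* c16rtomb, char16_t */
#include <wchar.h>   /* mbstate_t, mbsinit */

/* Hypothetical helper type: what one call to c16rtomb told us. */
struct feed_result {
    size_t written;       /* return value of c16rtomb */
    int    state_initial; /* nonzero if mbsinit reports the initial state */
};

/* Feed one UTF-16 code unit through c16rtomb and report both the
   return value and whether the conversion state is still initial. */
static struct feed_result feed_unit(char16_t unit, mbstate_t *state, char *out)
{
    struct feed_result r;
    r.written = c16rtomb(out, unit, state);
    r.state_initial = mbsinit(state) != 0;
    return r;
}
```

Feeding a lone high surrogate such as 0xD83D with a zero-initialized 
mbstate_t should then give written == 0 and state_initial == 0: nothing 
was stored, but the state object is holding the partial sequence.
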
>
> To be honest with you, the whole situation is a bit awful and -- 
> what's worse -- is that there are no string versions of any of these 
> functions for fast, efficient processing (c8srtombs/mbsrtoc8s, 
> c16srtombs/mbsrtoc16s, c32srtombs/mbsrtoc32s): they are just straight 
> up missing. The latter 2 in that list are being fixed by Philipp K. 
> Krause's N2282 
> (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm) -- you 
> should write to your C and/or C++ representatives in your country (or, 
> really, anyone who's listening) and tell them that we need these for 
> fast, competitive implementations that hope to hold a candle to proper 
> Unicode conversion utilities employed around the world. (One of the 
> kickbacks surrounding that paper is "waiting for implementation 
> experience and feedback", I think?) I don't know how Tom feels about 
> jumping the gun and writing c8srtombs/mbsrtoc8s for the C++ standard 
> before its friends ( c16srtombs/mbsrtoc16s, c32srtombs/mbsrtoc32s) are 
> accepted into the C standard, but I would highly encourage that to be 
> a thing we do because one-by-one code point processing is a mistake 
> for efficient processing. In days gone by, the C Committee added 
> mbsrtowcs and other multiple-code point functions to the C standard 
> for a reason (this reason); why the C standard is about to wait on it 
> to make the same mistake is something I do not quite understand.
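
To make the gap concrete, here is a hedged sketch of what callers have to 
write today in place of a string-at-a-time function: a loop that pushes one 
code unit at a time through c16rtomb. utf16_to_mb is a hypothetical helper 
name, not a proposed or standard interface, and the error handling (a 
single (size_t)-1 for every failure) is an assumption made to keep the 
example short.

```c
#include <limits.h>  /* MB_LEN_MAX */
#include <stddef.h>
#include <string.h>  /* memset, memcpy */
#include <uchar.h>   /* c16rtomb, char16_t */
#include <wchar.h>   /* mbstate_t, mbsinit */

/* Hypothetical helper: convert n UTF-16 code units from src to the
   multibyte encoding of the current locale, writing at most cap bytes
   to dst. Returns the number of bytes written, or (size_t)-1 on an
   encoding error, a too-small buffer, or a dangling high surrogate. */
static size_t utf16_to_mb(char *dst, size_t cap,
                          const char16_t *src, size_t n)
{
    mbstate_t state;
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    memset(&state, 0, sizeof state);      /* initial conversion state */
    for (size_t i = 0; i < n; ++i) {
        size_t k = c16rtomb(tmp, src[i], &state);
        if (k == (size_t)-1)
            return (size_t)-1;            /* encoding error */
        if (k > cap - used)
            return (size_t)-1;            /* output buffer too small */
        memcpy(dst + used, tmp, k);       /* k may be 0 mid-sequence */
        used += k;
    }
    if (!mbsinit(&state))
        return (size_t)-1;                /* input ended mid-sequence */
    return used;
}
```

Every code unit costs a function call, a state check, and a copy through a 
temporary buffer, which is the per-code-point overhead a dedicated string 
function could avoid.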

Philipp, do you perhaps know the history of how C came to have the UTF 
code-unit-at-a-time conversion functions (e.g., c16rtomb(), mbrtoc16()), 
but not the UTF string-at-a-time analogs of mbsrtowcs() and wcsrtombs()?

Tom.

>
> Maybe it's just a matter of being loud and vocal enough to the 
> Committee and its representatives to have it put in?
>
