<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 5/3/19 7:44 PM, JeanHeyd Meneide
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CANHA4OhcBYXZOvzB2s-DjG9uLyUkC4rgBUHCtT9v0F4PWQTT1w@mail.gmail.com">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>Note that c8rtomb is actually under-specified in the
current C and C++ standards. That is what DR 488 was for; it
was fixed by Philipp K. Krause's N2040 (<a
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm"
moz-do-not-send="true">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm</a>),
which was applied to C2x, although I forget whether it was
also applied to the c32rtomb functions.<br>
</div>
</div>
</div>
</div>
</blockquote>
<p>Well, c8rtomb is definitely under-specified in current C
standards since it isn't defined there at all :)</p>
<p>When drafting the wording for c8rtomb for C++, I did incorporate
updates from N2040. P0482R6 contains the following note:</p>
<p>
<blockquote type="cite"><em>Drafting note: The wording for <tt>mbrtoc8</tt>
and <tt>c8rtomb</tt> is
derived from wording for <tt>mbrtoc16</tt> and <tt>c16rtomb</tt>
in C18
(<a
href="http://www.open-std.org/jtc1/sc22/wg14/www/abq/c17_updated_proposed_fdis.pdf">WG14
N2176</a>),
augmented by changes suggested in
<a
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm">WG14
N2040</a>
for
<a
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488">WG14
DR488</a>
to properly account for UTF-8 being a variable length
encoding, and lightly
edited for formatting style. The author was reluctant to stray
from the
existing C wording for related functions despite a belief that
considerable
improvements to the wording would be possible.
</em></blockquote>
With regard to:</p>
<blockquote type="cite"
cite="mid:CANHA4OhcBYXZOvzB2s-DjG9uLyUkC4rgBUHCtT9v0F4PWQTT1w@mail.gmail.com">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div><br>
</div>
<div>In the case that nothing is stored, use the return
value of 0 as a marker that the current character is valid
but the mbstate has been modified and that you may be
working with a multi-byte sequence, and that you need to
feed more input into c8rtomb with the same mbstate_t.</div>
</div>
</div>
</div>
</blockquote>
I think this is consistent with the current wording, though the
wording is not explicit about this case.<br>
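<p>For illustration, a minimal sketch of that calling pattern follows. It
rests on assumptions: since c8rtomb is not widely implemented yet, it uses
C11's c16rtomb, where a high surrogate plays the role of a leading UTF-8
code unit, and it assumes an implementation (glibc, as far as I know) that
returns 0 for that case. The helper name feed_units and its parameters are
hypothetical.</p>

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not part of any standard): converts n UTF-16 code
   units to the current locale's multibyte encoding, one unit per call.
   Returns the number of bytes written to out, or (size_t)-1 on error.
   *zero_returns counts the calls that stored nothing and returned 0. */
size_t feed_units(const char16_t *in, size_t n,
                  char *out, size_t outsz, size_t *zero_returns)
{
    mbstate_t st;
    char buf[MB_LEN_MAX];          /* large enough for any single result */
    size_t used = 0;

    assert(in && out && zero_returns);
    memset(&st, 0, sizeof st);     /* start in the initial conversion state */
    *zero_returns = 0;

    for (size_t i = 0; i < n; ++i) {
        /* The same mbstate_t is passed on every call. */
        size_t r = c16rtomb(buf, in[i], &st);
        if (r == (size_t)-1)
            return (size_t)-1;     /* invalid input or not representable */
        if (r == 0) {
            ++*zero_returns;       /* accepted, nothing stored: keep feeding */
            continue;
        }
        if (r > outsz - used)
            return (size_t)-1;     /* caller's buffer is too small */
        memcpy(out + used, buf, r);
        used += r;
    }
    return used;
}
```

<p>With a UTF-8 locale selected via setlocale(), I would expect feeding the
surrogate pair 0xD83D, 0xDE00 to produce one zero return followed by the
four UTF-8 bytes of U+1F600, but that depends on the implementation and
the locale.</p>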
<blockquote type="cite"
cite="mid:CANHA4OhcBYXZOvzB2s-DjG9uLyUkC4rgBUHCtT9v0F4PWQTT1w@mail.gmail.com">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div><br>
</div>
<div>With a return value of 0, you can sanity-check the
implementation by doing mbsinit(&amp;my_mb_state) and
checking if it does NOT return the "I am still in the
initial stateless sequence" value after claiming a return
value of 0 (the mbstate_t object should be modified since
it should be storing part of the accumulated multi-byte
sequence).<br>
<br>
To be honest with you, the whole situation is a bit awful
and, what's worse, there are no string versions of any of
these functions for fast, efficient processing
(c8srtombs/mbsrtoc8s, c16srtombs/mbsrtoc16s,
c32srtombs/mbsrtoc32s): they are just straight up missing.
The latter two pairs in that list are being addressed by
Philipp K. Krause's N2282 (<a
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm"
moz-do-not-send="true">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm</a>).
You should write to the C and/or C++ representatives
in your country (or, really, anyone who's listening) and
tell them that we need these for fast, competitive
implementations that hope to hold a candle to the proper
Unicode conversion utilities employed around the world.
(Part of the pushback on that paper is "waiting
for implementation experience and feedback", I think?) I
don't know how Tom feels about jumping the gun and writing
c8srtombs/mbsrtoc8s for the C++ standard before their
friends (c16srtombs/mbsrtoc16s, c32srtombs/mbsrtoc32s) are
accepted into the C standard, but I would highly encourage
us to do that, because processing one code point at a time
is a mistake if the goal is efficiency. In days gone by,
the C Committee added mbsrtowcs and other multiple-code-point
functions to the C standard for exactly this reason. Why the
C standard is about to wait and make the same mistake again
is something I do not quite understand.<br>
</div>
</div>
</div>
</div>
</blockquote>
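<p>The mbsinit() check suggested at the start of the quoted text could be
sketched as below, again using c16rtomb as a stand-in for c8rtomb. The
function name and its three-way result are hypothetical, and whether an
implementation must leave the state non-initial after returning 0 is the
kind of detail the wording does not spell out.</p>

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical sanity check for a single code unit fed from the initial
   conversion state. Result:
     1  the call returned 0 and the state is no longer initial (expected
        when part of a sequence has been stashed in the mbstate_t)
     0  the call returned 0 but the state still looks initial (suspicious)
    -1  the call did not return 0, so there is nothing to check */
int check_zero_return(char16_t unit)
{
    mbstate_t st;
    char buf[MB_LEN_MAX];

    memset(&st, 0, sizeof st);               /* initial conversion state */
    size_t r = c16rtomb(buf, unit, &st);
    if (r != 0)
        return -1;                           /* bytes were stored, or error */
    return mbsinit(&st) == 0 ? 1 : 0;        /* nonzero mbsinit means initial */
}
```

<p>On an implementation that behaves as assumed, a high surrogate such as
0xD83D should yield 1, while an ordinary character such as u'A' yields -1
because it is converted immediately.</p>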
<p>Philipp, do you perhaps know the history of how C came to have
the UTF code-unit-at-a-time conversion functions (e.g.,
c16rtomb(), mbrtoc16()), but not the UTF string-at-a-time analogs
of mbsrtowcs() and wcsrtombs()?</p>
<p>Tom.<br>
</p>
<blockquote type="cite"
cite="mid:CANHA4OhcBYXZOvzB2s-DjG9uLyUkC4rgBUHCtT9v0F4PWQTT1w@mail.gmail.com">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div><br>
Maybe it's just a matter of being loud and vocal enough to
the Committee and its representatives to have it put in?<br>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</body>
</html>