From owner-sc22wg14+sc22wg14-domo2=www.open-std.org@open-std.org  Sun Mar 29 06:53:18 2020
Return-Path: <owner-sc22wg14+sc22wg14-domo2=www.open-std.org@open-std.org>
X-Original-To: sc22wg14-domo2
Delivered-To: sc22wg14-domo2@www.open-std.org
Received: by www.open-std.org (Postfix, from userid 521)
	id ACF189DB16D; Sun, 29 Mar 2020 06:53:18 +0200 (CEST)
Delivered-To: sc22wg14@open-std.org
Received: from smtp101.iad3a.emailsrvr.com (smtp101.iad3a.emailsrvr.com [173.203.187.101])
	(using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by www.open-std.org (Postfix) with ESMTP id 3D7443566A9
	for <sc22wg14@open-std.org>; Sun, 29 Mar 2020 06:53:17 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=honermann.net;
	s=20180930-2j89z3ji; t=1585457596;
	bh=pYO84H0DeA+WszLJzntcGQyvZjBhEmCwudIcQQxZBpQ=;
	h=Subject:From:To:Date:From;
	b=U4Id4HohahcTV/JcDs8WQzt7rPdPrmbIEyLdcgp7Ir2/nVFpBQWKxOvb6pJVKldW7
	 TkPY7e6H9PBHVlnEjLyK8BN53knkeZ8j/14wLIygnklTmk92wb0cGZ6A2QKIeOf6U6
	 3/nmhM7DhuBBx37ppAe7xIX3hMYFWASeOr9X3HkE=
X-Auth-ID: tom@honermann.net
Received: by smtp5.relay.iad3a.emailsrvr.com (Authenticated sender: tom-AT-honermann.net) with ESMTPSA id B54092120E;
	Sun, 29 Mar 2020 00:53:15 -0400 (EDT)
X-Sender-Id: tom@honermann.net
Received: from [192.168.1.13] (pool-74-110-208-227.rcmdva.fios.verizon.net [74.110.208.227])
	(using TLSv1.2 with cipher DHE-RSA-AES128-SHA)
	by 0.0.0.0:587 (trex/5.7.12);
	Sun, 29 Mar 2020 00:53:16 -0400
Subject: Re: (SC22WG14.17674) mbrtowc() wording ambiguities and surprising
 implementation behavior
From: Tom Honermann <tom@honermann.net>
To: wg14 <sc22wg14@open-std.org>, SG16 <sg16@lists.isocpp.org>
References: <20200328044149.75FAD3589AA@www.open-std.org>
Message-ID: <0436b6ac-f61c-9ec5-a9d5-4e10d3012ae4@honermann.net>
Date: Sun, 29 Mar 2020 00:53:14 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.4.1
MIME-Version: 1.0
In-Reply-To: <20200328044149.75FAD3589AA@www.open-std.org>
Content-Type: multipart/alternative;
 boundary="------------0FD4ADF3684A83857609204B"
Content-Language: en-US
X-Classification-ID: 58940f92-8499-4edf-9eec-9f4b639f08ab-1-1
Sender: owner-sc22wg14@open-std.org
Precedence: bulk

This is a multi-part message in MIME format.
--------------0FD4ADF3684A83857609204B
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

On 3/28/20 12:34 AM, Tom Honermann wrote:
>
> I came across the following issues while testing an implementation of 
> mbrtoc8() [1] I'm working on.  The implementation uses mbrtowc() 
> internally.
>
> The issues concern the return value of mbrtowc() in two related scenarios.
>
> Quoting from the C18 standard for convenience: 7.29.6.3.2p4 states:
>
>> The *mbrtowc* function returns the first of the following that 
>> applies (given the current conversion
>> state):
>>
>> 0 if the next n or fewer bytes complete the multibyte character that 
>> corresponds to
>> the null wide character (which is the value stored).
>>
>> /between 1 and n inclusive/ if the next n or fewer bytes complete a 
>> valid multibyte character (which
>> is the value stored); the value returned is the number of bytes that 
>> complete the
>> multibyte character.
>>
>> (*size_t*)(-2) if the next n bytes contribute to an incomplete (but 
>> potentially valid) multibyte
>> character, and all n bytes have been processed (no value is 
>> stored).
>>
>> (*size_t*)(-1) if an encoding error occurs, in which case the next n 
>> or fewer bytes do not contribute
>> to a complete and valid multibyte character (no value is stored); the 
>> value of the
>> macro *EILSEQ* is stored in *errno*, and the conversion state is 
>> unspecified.
>
> The issues are demonstrated using an example of converting, one byte 
> at a time, a Big5-HKSCS double byte sequence that maps to two Unicode 
> code points (assume the wide execution character set is UTF-16 (or 
> UCS-2) or UTF-32):
>
>   * 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH
>     CIRCUMFLEX} {COMBINING MACRON}
>
> There are two distinct issues:
>
>  1. What should the return value be when consuming the remainder of a
>     previously incomplete multibyte character?  My interpretation of
>     the wording above is that it should be the number of bytes
>     consumed by the call that completed the multibyte character, but
>     at least two implementations return the total number of bytes that
>     contributed to the complete character.  In this case, these
>     implementations return a value larger than the value of 'n' thus
>     contradicting the wording above.
>  2. What should the return value be when no input bytes are consumed,
>     but an output character is written (e.g., for the call that writes
>     the second Unicode code point above)? mbrtowc() doesn't specify a
>     return value of (size_t)-3 like mbrtoc16() does.
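
To make the practical impact of these two issues concrete, here is a
minimal sketch of how a byte-at-a-time caller could advance its input
pointer without trusting the exact positive return value.  The helper
name advance_bytewise() is hypothetical (it is not part of any library),
and it assumes every call to mbrtowc() was made with n == 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for callers that always pass n == 1 to mbrtowc().
   Advances *p by the number of input bytes the call can have consumed. */
static void advance_bytewise(const char **p, size_t result) {
  assert(p != NULL && *p != NULL);
  if (result == (size_t)-1)
    return;            /* encoding error: do not advance */
  if (result == (size_t)-2) {
    *p += 1;           /* incomplete character: the one byte was consumed */
    return;
  }
  if (result == 0)
    return;            /* null character, or issue 2: no input consumed */
  *p += 1;             /* any positive value, even one greater than n
                          (issue 1), means exactly one byte was consumed */
}
```

With n fixed at 1, a positive return value can only mean that the single
input byte was consumed, so the sketch behaves the same whether an
implementation returns 1 or 2 for the completing call.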
>
> The following test case validates the behavior exhibited by recent 
> glibc releases on Linux.  Note that this example depends on support 
> for the zh_HK.BIG5-HKSCS locale being present.
>
>> #include <assert.h>
>> #include <locale.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <wchar.h>
>>
>> int main() {
>>   /* This test case demonstrates glibc's current (2.31) behavior when
>>      attempting to translate, one byte at a time, Big5-HKSCS input
>>      containing a double byte sequence (0x88 0x62) that maps to two
>>      Unicode code points (U+00CA U+0304).
>>
>>      There are two interesting behaviors.
>>      1) The first call to mbrtowc() consumes the first byte and returns
>>         a value of -2 indicating an incomplete multibyte character as
>>         expected.  However, the second call, with an input length of 1,
>>         consumes the second byte, recognizes completion of the previously
>>         incomplete character, writes the first of the mapped Unicode
>>         code points, and then returns 2.  The return value of 2 is
>>         surprising since only 1 byte was read.  This seems to violate
>>         the C standard as well since the return value is greater than
>>         the input length.
>>      2) The third call to mbrtowc() writes the second of the mapped
>>         Unicode code points from the previously translated multibyte
>>         character, consumes no further input, and returns 0 (because
>>         no input was consumed).  The C standard does not specify a
>>         return value of -3 for mbrtowc() as it does for mbrtoc16() for
>>         the analogous situation involving UTF-16 surrogate code points.
>>         The return of 0 seems rational, but also contradicts the C
>>         standard because a return value of 0 is reserved for when a
>>         null character is written.  Distinguishing these two cases of
>>         returning 0 requires inspecting the code point that was written
>>         to see if it was a null character or not. */
>>
>>   if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
>>     perror("setlocale");
>>     return 1;
>>   }
>>
>>   const char *mbs;
>>   wchar_t wc;
>>   mbstate_t s;
>>   size_t result;
>>
>>   mbs = "\x88\x62";
>>   memset(&s, 0, sizeof(s));
>>   /* Translate the first byte.  This call to mbrtowc() consumes the first
>>      byte and returns -2 indicating that a potentially valid but incomplete
>>      character was read.  This is expected behavior. */
>>   result = mbrtowc(&wc, mbs, 1, &s);
>>   printf("1st mbrtowc call:\n");
>>   printf("  result: %zd (-2 expected)\n", result);
>>   assert(result == (size_t) -2);
>>   mbs += 1;
>>   /* Translate the second byte.  This completes the first multibyte character
>>      and writes the first Unicode code point.  The C standard appears to
>>      state that the return value should be 1, but glibc returns 2. */
>>   result = mbrtowc(&wc, mbs, 1, &s);
>>   printf("2nd mbrtowc call:\n");
>>   printf("  result: %zd (1 expected, glibc returns 2)\n", result);
>>   printf("  wc: 0x%04X (0x00CA expected)\n", (unsigned)wc);
>>   mbs += 1;
>>   assert(result == (size_t) 2);
>>   assert(wc == 0x00CA);
>>   /* This next call to mbrtowc() writes the second Unicode code point without
>>      consuming any input.  Since output was written, but no input was consumed,
>>      0 is returned.  This is a case where mbrtoc16() would return (size_t)-3,
>>      but mbrtowc() isn't specified to do so.  This behavior is a bit confusing
>>      because the return of 0 is specified to indicate that a null character
>>      was written; but that isn't the case here. */
>>   result = mbrtowc(&wc, mbs, 1, &s);
>>   printf("3rd mbrtowc call:\n");
>>   printf("  result: %zd (0 expected)\n", result);
>>   printf("  wc: 0x%04X (0x0304 expected)\n", (unsigned)wc);
>>   assert(result == (size_t) 0);
>>   assert(wc == 0x0304);
>> }
> When compiled by gcc with glibc and run on Linux, the following output 
> is produced:
>
>> 1st mbrtowc call:
>>   result: -2 (-2 expected)
>> 2nd mbrtowc call:
>>   result: 2 (1 expected, glibc returns 2)
>>   wc: 0x00CA (0x00CA expected)
>> 3rd mbrtowc call:
>>   result: 0 (0 expected)
>>   wc: 0x0304 (0x0304 expected)
>
> For the first issue, my suspicion is that the return value of 2 when 
> consuming the 2nd byte of the multibyte character is an implementation 
> defect.  However, if that is the case, it seems to be a common defect 
> since Microsoft's implementation also exhibits it.  Microsoft doesn't 
> support Big5-HKSCS, but this behavior can be exhibited with any double 
> byte character. Example behavior can be observed at 
> https://rextester.com/SXAEW48593.
>
I'm feeling more confident that this behavior is a defect in these 
implementations.  I tested UTF-8 and Big5 in glibc and found this 
behavior isn't exhibited for those encodings.
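
For anyone who wants to repeat the UTF-8 check, a sketch along the
following lines should do.  The function feed_bytewise() is a
hypothetical helper, the locale name is whatever the caller passes (for
example "C.UTF-8", which may not exist on every system), and the sample
input U+20AC encoded as E2 82 AC is my own choice for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: feeds len bytes to mbrtowc() one at a time and
   records each return value in results[].  The last wide character
   stored is written to *last.  Returns 0 on success, or -1 if the
   requested locale is not available. */
static int feed_bytewise(const char *locale, const char *bytes, size_t len,
                         size_t *results, wchar_t *last) {
  assert(bytes != NULL && results != NULL && last != NULL);
  if (!setlocale(LC_CTYPE, locale))
    return -1;
  mbstate_t s;
  memset(&s, 0, sizeof(s));
  *last = L'\0';
  for (size_t i = 0; i < len; ++i)
    results[i] = 0;
  for (size_t i = 0; i < len; ++i) {
    results[i] = mbrtowc(last, bytes + i, 1, &s);
    if (results[i] == (size_t)-1)
      break;  /* conversion state is unspecified after an error */
  }
  return 0;
}
```

Calling feed_bytewise("C.UTF-8", "\xE2\x82\xAC", 3, results, &wc) is
expected to record -2, -2 and then 1, assuming the locale is installed.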

glibc bug filed at https://sourceware.org/bugzilla/show_bug.cgi?id=25744.

Tom.
>
> For the second issue, implementation behavior seems reasonable to me, 
> but the C standard doesn't seem to acknowledge the possibility of this 
> scenario.  There are at least two things that can be done about it:
>
>  1. Modify the standard to add specification of a -3 return value to
>     match the specification for mbrtoc16().  This would require
>     implementations to change behavior.
>  2. Modify the standard to add wording for the return value of 0
>     acknowledging the scenario where no bytes are consumed, but
>     converted characters are written.  This would standardize existing
>     practice (at least as exhibited by glibc).
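
To illustrate how the two options relate, here is a minimal sketch of a
caller-side shim that presents the existing glibc behavior (option 2) as
if the -3 return value of option 1 were specified.  The function name
c16_style_result() is hypothetical, and the sketch assumes the caller
passed a non-null pwc to mbrtowc() so that the stored character can be
inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical shim: result is the value mbrtowc() returned and stored
   points at the wchar_t object passed as pwc.  A return of 0 with a
   non-null character stored is reported as (size_t)-3, mirroring the
   mbrtoc16() specification; every other value is passed through. */
static size_t c16_style_result(size_t result, const wchar_t *stored) {
  assert(stored != NULL);
  /* *stored is only examined when result == 0, since nothing is stored
     for (size_t)-1 or (size_t)-2. */
  if (result == 0 && *stored != L'\0')
    return (size_t)-3;
  return result;
}
```

This is the inspection of the written code point described in the test
case comments, just packaged so that callers can switch on a single
return value.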
>
> Tom.
>
> [1] See WG21 proposal P0482R6 <https://wg21.link/p0482r6> and WG14 
> proposal N2231 
> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm>.
>


--------------0FD4ADF3684A83857609204B--
