From owner-sc22wg14+sc22wg14-domo2=www.open-std.org@open-std.org  Sat Mar 28 05:41:49 2020
To: wg14 <sc22wg14@open-std.org>, SG16 <sg16@lists.isocpp.org>
From: Tom Honermann <tom@honermann.net>
Subject: mbrtowc() wording ambiguities and surprising implementation behavior
Message-ID: <2c49d002-3ff1-0540-02be-6034f1aa7d50@honermann.net>
Date: Sat, 28 Mar 2020 00:34:38 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.4.1
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="------------7D4BB34827F3AA266AF029D9"
Content-Language: en-US
X-Classification-ID: bc69f994-961b-4929-8200-ef0550382923-1-1
Sender: owner-sc22wg14@open-std.org
Precedence: bulk

This is a multi-part message in MIME format.
--------------7D4BB34827F3AA266AF029D9
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

I came across the following issues while testing an implementation of 
mbrtoc8() [1] I'm working on.  The implementation uses mbrtowc() internally.

The issues concern the return value of mbrtowc() in two related scenarios.

For convenience, here is the relevant wording from the C18 standard, 7.29.6.3.2p4:

> The *mbrtowc* function returns the first of the following that applies 
> (given the current conversion
> state):
>
> 0 if the next n or fewer bytes complete the multibyte character that 
> corresponds to
> the null wide character (which is the value stored).
>
> /between 1 and n inclusive/ if the next n or fewer bytes complete a 
> valid multibyte character (which
> is the value stored); the value returned is the number of bytes that 
> complete the
> multibyte character.
>
> (*size_t*) (−2) if the next n bytes contribute to an incomplete (but 
> potentially valid) multibyte
> character, and all n bytes have been processed (no value is stored).^355)
>
> (*size_t*)(-1) if an encoding error occurs, in which case the next n 
> or fewer bytes do not contribute
> to a complete and valid multibyte character (no value is stored); the 
> value of the
> macro *EILSEQ* is stored in *errno*, and the conversion state is 
> unspecified.

The issues are demonstrated by converting, one byte at a time, a 
Big5-HKSCS double byte sequence that maps to two Unicode code points 
(assume the wide execution character set is UTF-16 (or UCS-2) or 
UTF-32):

  * 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH CIRCUMFLEX}
    {COMBINING MACRON}

There are two distinct issues:

 1. What should the return value be when consuming the remainder of a
    previously incomplete multibyte character?  My interpretation of the
    wording above is that it should be the number of bytes consumed by
    the call that completed the multibyte character, but at least two
    implementations return the total number of bytes that contributed to
    the complete character.  In this case, these implementations return
    a value larger than the value of 'n', thus contradicting the wording
    above.
 2. What should the return value be when no input bytes are consumed,
    but an output character is written (e.g., for the call that writes
    the second Unicode code point above)?  mbrtowc() doesn't specify a
    return value of (size_t)-3 like mbrtoc16() does.
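
As a rough illustration of both questions, the sketch below checks an observed result against the four cases in the quoted wording. The names matches_quoted_wording and check_reported_results are hypothetical helpers, not part of any library, and the three values checked are the glibc 2.31 results reported in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if an observed mbrtowc() result fits one
   of the four cases in the quoted wording, 0 otherwise.
   r  - value returned by mbrtowc()
   n  - the n argument passed to that call
   wc - the wide character stored (ignored for the -1 and -2 cases) */
int matches_quoted_wording(size_t r, size_t n, wchar_t wc)
{
  if (r == (size_t)-1 || r == (size_t)-2)
    return 1;             /* encoding error or incomplete character */
  if (r == 0)
    return wc == L'\0';   /* 0 is reserved for the null wide character */
  return r <= n;          /* otherwise "between 1 and n inclusive" */
}

/* The three results reported for glibc 2.31 in this message (n is 1). */
void check_reported_results(void)
{
  assert(matches_quoted_wording((size_t)-2, 1, 0));  /* 1st call fits */
  assert(!matches_quoted_wording(2, 1, 0x00CA));     /* 2nd call: 2 > n */
  assert(!matches_quoted_wording(0, 1, 0x0304));     /* 3rd call: 0, not null */
}
```

Under this reading, the second and third results fall outside the specified cases, which is what the two issues above describe.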

The following test case validates the behavior exhibited by recent glibc 
releases on Linux.  Note that this example depends on support for the 
zh_HK.BIG5-HKSCS locale being present.

> #include <assert.h>
> #include <locale.h>
> #include <stdio.h>
> #include <string.h>
> #include <wchar.h>
>
> int main() {
>   /* This test case demonstrates glibc's current (2.31) behavior when
>      attempting to translate, one byte at a time, Big5-HKSCS input
>      containing a double byte sequence (0x88 0x62) that maps to two
>      Unicode code points (U+00CA U+0304).
>
>      There are two interesting behaviors.
>      1) The first call to mbrtowc() consumes the first byte and returns
>         a value of -2 indicating an incomplete multibyte character as
>         expected.  However, the second call, with an input length of 1,
>         consumes the second byte, recognizes completion of the previously
>         incomplete character, writes the first of the mapped Unicode
>         code points, and then returns 2.  The return value of 2 is
>         surprising since only 1 byte was read.  This seems to violate
>         the C standard as well since the return value is greater than
>         the input length.
>      2) The third call to mbrtowc() writes the second of the mapped
>         Unicode code points from the previously translated multibyte
>         character, consumes no further input, and returns 0 (because
>         no input was consumed).  The C standard does not specify a
>         return value of -3 for mbrtowc() as it does for mbrtoc16() for
>         the analogous situation involving UTF-16 surrogate code points.
>         The return of 0 seems reasonable, but also contradicts the C
>         standard because a return value of 0 is reserved for when a
>         null character is written.  Distinguishing these two cases of
>         returning 0 requires inspecting the code point that was written
>         to see if it was a null character or not. */
>
>   if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
>     perror("setlocale");
>     return 1;
>   }
>
>   const char *mbs;
>   wchar_t wc;
>   mbstate_t s;
>   size_t result;
>
>   mbs = "\x88\x62";
>   memset(&s, 0, sizeof(s));
>   /* Translate the first byte.  This call to mbrtowc() consumes the first
>      byte and returns -2 indicating that a potentially valid but incomplete
>      character was read.  This is expected behavior. */
>   result = mbrtowc(&wc, mbs, 1, &s);
>   printf("1st mbrtowc call:\n");
>   printf("  result: %zd (-2 expected)\n", result);
>   assert(result == (size_t) -2);
>   mbs += 1;
>   /* Translate the second byte.  This completes the first multibyte character
>      and writes the first Unicode code point.  The C standard appears to
>      state that the return value should be 1, but glibc returns 2. */
>   result = mbrtowc(&wc, mbs, 1, &s);
>   printf("2nd mbrtowc call:\n");
>   printf("  result: %zd (1 expected, glibc returns 2)\n", result);
>   printf("  wc: 0x%04X (0x00CA expected)\n", (unsigned)wc);
>   mbs += 1;
>   assert(result == (size_t) 2);
>   assert(wc == 0x00CA);
>   /* This next call to mbrtowc() writes the second Unicode code point without
>      consuming any input.  Since output was written, but no input was consumed,
>      0 is returned.  This is a case where mbrtoc16() would return (size_t)-3,
>      but mbrtowc() isn't specified to do so.  This behavior is a bit confusing
>      because the return of 0 is specified to indicate that a null character
>      was written; but that isn't the case here. */
>   result = mbrtowc(&wc, mbs, 1, &s);
>   printf("3rd mbrtowc call:\n");
>   printf("  result: %zd (0 expected)\n", result);
>   printf("  wc: 0x%04X (0x0304 expected)\n", (unsigned)wc);
>   assert(result == (size_t) 0);
>   assert(wc == 0x0304);
> }

When compiled by gcc with glibc and run on Linux, the following output 
is produced:

> 1st mbrtowc call:
>   result: -2 (-2 expected)
> 2nd mbrtowc call:
>   result: 2 (1 expected, glibc returns 2)
>   wc: 0x00CA (0x00CA expected)
> 3rd mbrtowc call:
>   result: 0 (0 expected)
>   wc: 0x0304 (0x0304 expected)

For the first issue, my suspicion is that the return value of 2 when 
consuming the 2nd byte of the multibyte character is an implementation 
defect.  However, if that is the case, it seems to be a common defect 
since Microsoft's implementation also exhibits it.  Microsoft doesn't 
support Big5-HKSCS, but this behavior can be exhibited with any double 
byte character.  Example behavior can be observed at 
https://rextester.com/SXAEW48593.
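
If an implementation does report the total length of the completed character as described above, a caller that feeds input in pieces could recover the per-call count by tracking how many bytes earlier (size_t)-2 calls consumed. The sketch below assumes exactly that behavior; consumed_by_this_call and pending are hypothetical names, and the helper would give wrong answers on an implementation that already returns the per-call count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper.  ASSUMPTION: the implementation returns the total
   number of bytes in the completed multibyte character, as observed above.
   r       - positive count returned by mbrtowc(); the caller handles 0,
             (size_t)-1 and (size_t)-2 separately
   pending - bytes of this character already consumed by earlier calls
             that returned (size_t)-2
   Returns the number of bytes consumed by the call that returned r. */
size_t consumed_by_this_call(size_t r, size_t pending)
{
  assert(r != 0 && r != (size_t)-1 && r != (size_t)-2);
  return r > pending ? r - pending : 0;
}
```

For the example above, the second call returned 2 after one earlier byte was consumed, so consumed_by_this_call(2, 1) yields 1, matching the single byte actually read.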

For the second issue, implementation behavior seems reasonable to me, 
but the C standard doesn't seem to acknowledge the possibility of this 
scenario.  There are at least two ways to address it:

 1. Modify the standard to add specification of a -3 return value to
    match the specification for mbrtoc16().  This would require
    implementations to change behavior.
 2. Modify the standard to add wording for the return value of 0
    acknowledging the scenario where no bytes are consumed, but
    converted characters are written.  This would standardize existing
    practice (at least as exhibited by glibc).
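
To make the difference between the two options concrete, the sketch below shows how a caller could map today's glibc-style result onto the mbrtoc16()-style convention of the first option by inspecting the stored wide character. The name result_with_minus_3 is hypothetical, and the sketch assumes mbrtowc() was called with a non-null pwc so that the stored value is available.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical adapter: rewrites a return of 0 that stored a non-null wide
   character into (size_t)-3, the value mbrtoc16() uses when it stores
   output without consuming input.  All other results pass through.
   r  - value returned by mbrtowc()
   wc - the wide character stored by that call (only inspected when r is 0) */
size_t result_with_minus_3(size_t r, wchar_t wc)
{
  if (r == 0 && wc != L'\0')
    return (size_t)-3;
  return r;
}

/* The third call in the test case above returned 0 and stored U+0304. */
void check_third_call(void)
{
  assert(result_with_minus_3(0, 0x0304) == (size_t)-3);
  assert(result_with_minus_3(0, L'\0') == 0);
}
```

With the second option, no such adapter would be needed, but callers would still have to inspect the stored character to tell a null character from a zero-input call.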

Tom.

[1] See WG21 proposal P0482R6 <https://wg21.link/p0482r6> and WG14 
proposal N2231 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm>.


--------------7D4BB34827F3AA266AF029D9--