From owner-sc22wg14+sc22wg14-domo2=www.open-std.org@open-std.org  Sat Mar 28 21:46:38 2020
Return-Path: <owner-sc22wg14+sc22wg14-domo2=www.open-std.org@open-std.org>
X-Original-To: sc22wg14-domo2
Delivered-To: sc22wg14-domo2@www.open-std.org
Received: by www.open-std.org (Postfix, from userid 521)
	id 6E842358D34; Sat, 28 Mar 2020 21:46:38 +0100 (CET)
Delivered-To: sc22wg14@open-std.org
Received: from mail-oi1-f173.google.com (mail-oi1-f173.google.com [209.85.167.173])
	(using TLSv1 with cipher AES128-SHA (128/128 bits))
	(No client certificate requested)
	by www.open-std.org (Postfix) with ESMTP id F3AE33566A9
	for <sc22wg14@open-std.org>; Sat, 28 Mar 2020 21:46:37 +0100 (CET)
Received: by mail-oi1-f173.google.com with SMTP id m14so12209181oic.0
        for <sc22wg14@open-std.org>; Sat, 28 Mar 2020 13:46:37 -0700 (PDT)
MIME-Version: 1.0
References: <20200328044149.75FAD3589AA@www.open-std.org> <20200328140620.1F731356571@www.open-std.org>
 <4d5c82d2-3269-035c-b42f-41fb42abbd36@honermann.net>
In-Reply-To: <4d5c82d2-3269-035c-b42f-41fb42abbd36@honermann.net>
From: Hubert Tong <hubert.reinterpretcast@gmail.com>
Date: Sat, 28 Mar 2020 16:46:20 -0400
Message-ID: <CACvkUqbXzpcrxZYQ5Krdq+apZd-rOcD-m7p4DHzu-ZaAxc1D0w@mail.gmail.com>
Subject: Re: (SC22WG14.17682) mbrtowc() wording ambiguities and surprising
 implementation behavior
To: Tom Honermann <tom@honermann.net>
Cc: wg14 <sc22wg14@open-std.org>, SG16 <sg16@lists.isocpp.org>
Content-Type: multipart/alternative; boundary="00000000000063097c05a1f05012"
Sender: owner-sc22wg14@open-std.org
Precedence: bulk

--00000000000063097c05a1f05012
Content-Type: text/plain; charset="UTF-8"

[Note: Cross-posted between the WG 14 and WG 21/SG 16 reflectors]

On Sat, Mar 28, 2020 at 3:34 PM Tom Honermann <tom@honermann.net> wrote:

> On 3/28/20 10:06 AM, Hubert Tong wrote:
>
> > On Sat, Mar 28, 2020 at 2:40 AM Tom Honermann <tom@honermann.net> wrote:
> >
> > > I came across the following issues while testing an implementation of
> > > mbrtoc8() [1] I'm working on.  The implementation uses mbrtowc()
> > > internally.
> >
> > [ ... ]
> >
> > > The issues are demonstrated using an example of converting one byte at a
> > > time, a Big5-HKSCS double byte sequence that maps to two Unicode code
> > > points (assume the wide execution character set is UTF-16 (or UCS2) or
> > > UTF32):
> > >
> > >    - 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH CIRCUMFLEX}
> > >      {COMBINING MACRON}
> >
> > The scenario presented violates the definition of "wide character", which
> > indicates the relationship between values of wchar_t and the C standard
> > concept of a "character":
> > value representable by an object of type wchar_t, capable of representing
> > any character in the current locale
>
> Indeed, but that definition of wide character in the standard contradicts
> long standing existing practice (e.g., use of UTF-16 on Windows).
>
What I mean is that asking about the behaviour of a function in a scenario
that contradicts its underlying model is unlikely to lead to helpful action
in terms of interpreting the wording.

This situation is similar in some respects to the __STDC_MB_MIGHT_NEQ_WC__
one. "Long-standing existing practice" indicates that something about the
standard does not serve a community of users. In the case of
__STDC_MB_MIGHT_NEQ_WC__, the standard acknowledges that there are
environments where certain assumptions don't hold (there, that a member of
the basic character set has the same value in the wchar_t encoding as it
has in an integer character constant). Users who have to operate in such an
environment can detect it and take it into account. Users who don't have to
operate in such an environment can safely ignore it and be assured their
program is portable within their needs.
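
To illustrate the "detect and take it into account" part, here is a minimal
sketch in C. The helper name widen_basic is made up for this example; only
the macro, btowc, wint_t and WEOF come from the standard. It assumes the
argument is a member of the basic character set.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not a standard function): widen one character
   that is known to be a member of the basic character set. */
static wint_t widen_basic(char c)
{
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    /* In this environment 'x' == L'x' is not guaranteed even for basic
       characters, so ask the library to convert. btowc returns WEOF if
       the byte is not a valid single-byte character in the initial
       shift state. */
    return btowc((unsigned char)c);
#else
    /* Basic characters have the same value narrow and wide, and are
       guaranteed to be nonnegative, so a plain conversion suffices. */
    return (wint_t)c;
#endif
}
```

A program that never runs where the macro is defined can skip the btowc
branch entirely, which is the sense in which such users can ignore the
issue.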

So, we probably need to accommodate "odd" operating environments, but would
need to look for some balance so as to not complicate the situation too
much.


>
> > I doubt that wide characters should be considered the preferred solution
> > for dealing with UCS encodings or notions that characters are formed by
> > more than one minimal well-formed code unit sequence.
>
> I certainly agree with the first part of that statement, but not the
> second considering existing practice.
>
Just to ensure we understand each other: I did not say anything in the
second part of that statement that contradicts the existence of surrogate
pairs. I am pointing out that there is a technical issue with using UTF-8
as the multibyte string encoding if a character is considered to require
more than a single UCS scalar value. An implementation of mblen should not
return different non-negative values for successive calls with the same
non-null pointer simply because the `n` parameter is changed.
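
The mblen property above can be checked with something like the following
rough sketch. The function name mblen_is_stable is made up for this
example, and the caller is assumed to guarantee that at least max_n bytes
are readable at s. It calls mblen with each n from 1 to max_n, skips
negative results (invalid or incomplete input), and reports whether all
non-negative results agree.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical checker (not a standard function): returns 1 if every
   non-negative result of mblen(s, n), for n = 1..max_n, is the same
   value, and 0 if two such results differ. */
static int mblen_is_stable(const char *s, size_t max_n)
{
    int first = -1; /* first non-negative result seen so far */

    for (size_t n = 1; n <= max_n; ++n) {
        /* Reset mblen's internal shift state so each call starts from
           the initial state (matters for state-dependent encodings). */
        (void)mblen(NULL, 0);

        int r = mblen(s, n);
        if (r < 0)
            continue;   /* invalid or incomplete within n bytes */
        if (first < 0)
            first = r;  /* remember the first non-negative answer */
        else if (r != first)
            return 0;   /* the answer changed only because n changed */
    }
    return 1;
}
```

The result depends on the current locale. Under a model where one
"character" may span several scalar values, a UTF-8 locale could plausibly
yield one length for a small n and a larger one once the combining sequence
fits, which is the inconsistency described above.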

> Tom.
>
>
>

--00000000000063097c05a1f05012--
