From owner-sc22wg14+sc22wg14-domo2=www.open-std.org@open-std.org  Sat Mar 28 15:06:20 2020
MIME-Version: 1.0
References: <20200328044149.75FAD3589AA@www.open-std.org>
In-Reply-To: <20200328044149.75FAD3589AA@www.open-std.org>
From: Hubert Tong <hubert.reinterpretcast@gmail.com>
Date: Sat, 28 Mar 2020 10:06:02 -0400
Message-ID: <CACvkUqZr-qqJkSnHY+dNQACaTM4J-LnDbUQ95omCzDy0A0UtOg@mail.gmail.com>
Subject: Re: (SC22WG14.17674) mbrtowc() wording ambiguities and surprising
 implementation behavior
To: Tom Honermann <tom@honermann.net>
Cc: wg14 <sc22wg14@open-std.org>, SG16 <sg16@lists.isocpp.org>
Content-Type: multipart/alternative; boundary="000000000000c8c19b05a1eab826"
Sender: owner-sc22wg14@open-std.org
Precedence: bulk

--000000000000c8c19b05a1eab826
Content-Type: text/plain; charset="UTF-8"

On Sat, Mar 28, 2020 at 2:40 AM Tom Honermann <tom@honermann.net> wrote:

> I came across the following issues while testing an implementation of
> mbrtoc8() [1] I'm working on.  The implementation uses mbrtowc() internally.
>
[ ... ]

>
> The issues are demonstrated using an example of converting one byte at a
> time, a Big5-HKSCS double byte sequence that maps to two Unicode code
> points (assume the wide execution character set is UTF-16 (or UCS2) or
> UTF32):
>
>    - 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH CIRCUMFLEX}
>    {COMBINING MACRON}
>
The scenario presented violates the definition of "wide character", which
indicates the relationship between values of wchar_t and the C standard
concept of a "character":

    value representable by an object of type wchar_t, capable of
    representing any character in the current locale

In the example, the single character 0x88 0x62 would need two wchar_t
values, so no single wchar_t value represents it.

I doubt that wide characters should be considered the preferred solution
for dealing with UCS encodings or notions that characters are formed by
more than one minimal well-formed code unit sequence.
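
For readers following the thread, here is a minimal sketch of the
byte-at-a-time conversion being discussed. The helper name
count_wide_chars is hypothetical and does not come from the original
messages; it only uses the standard mbrtowc() interface, and it assumes the
caller has already selected the locale of interest with setlocale().

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from the thread): feed `len` bytes to mbrtowc()
   one byte per call and return how many wchar_t values were stored,
   or -1 if mbrtowc() reports an encoding error. */
static int count_wide_chars(const char *bytes, size_t len)
{
    mbstate_t state;
    int produced = 0;
    size_t i;

    memset(&state, 0, sizeof state);      /* initial conversion state */

    for (i = 0; i < len; ++i) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, bytes + i, 1, &state);

        if (r == (size_t)-1)
            return -1;                    /* invalid multibyte sequence */
        if (r == (size_t)-2)
            continue;                     /* incomplete so far; supply the next byte */
        ++produced;                       /* r is 0 or 1: one wchar_t was stored */
    }
    return produced;
}
```

Each successful call stores at most one wchar_t, so calling this helper on
the two bytes 0x88 0x62 in a Big5-HKSCS locale (the locale name is
platform-specific) shows how a given implementation reports a character
that maps to two code points.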

--000000000000c8c19b05a1eab826--
