From owner-sc22wg14+sc22wg14-domo2=www.open-std.org@open-std.org  Sat Mar 28 20:34:20 2020
Subject: Re: (SC22WG14.17682) mbrtowc() wording ambiguities and surprising
 implementation behavior
To: Hubert Tong <hubert.reinterpretcast@gmail.com>
Cc: wg14 <sc22wg14@open-std.org>, SG16 <sg16@lists.isocpp.org>
References: <20200328044149.75FAD3589AA@www.open-std.org>
 <20200328140620.1F731356571@www.open-std.org>
From: Tom Honermann <tom@honermann.net>
Message-ID: <4d5c82d2-3269-035c-b42f-41fb42abbd36@honermann.net>
Date: Sat, 28 Mar 2020 15:34:17 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.4.1
MIME-Version: 1.0
In-Reply-To: <20200328140620.1F731356571@www.open-std.org>
Content-Type: multipart/alternative;
 boundary="------------DF6FE3B866A361580B79BADD"
Content-Language: en-US
X-Classification-ID: f0cc0e5d-9045-4066-8ee8-ce238361a84f-1-1
Sender: owner-sc22wg14@open-std.org
Precedence: bulk

This is a multi-part message in MIME format.
--------------DF6FE3B866A361580B79BADD
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

On 3/28/20 10:06 AM, Hubert Tong wrote:
> On Sat, Mar 28, 2020 at 2:40 AM Tom Honermann <tom@honermann.net 
> <mailto:tom@honermann.net>> wrote:
>
>     I came across the following issues while testing an implementation
>     of mbrtoc8() [1] I'm working on.  The implementation uses
>     mbrtowc() internally.
>
> [ ... ]
>
>
>     The issues are demonstrated using an example of converting one
>     byte at a time, a Big5-HKSCS double byte sequence that maps to two
>     Unicode code points (assume the wide execution character set is
>     UTF-16 (or UCS2) or UTF32):
>
>       * 0x88 0x62 => U+00CA U+0304 {LATIN CAPITAL LETTER E WITH
>         CIRCUMFLEX} {COMBINING MACRON}
>
> The scenario presented violates the definition of "wide character", 
> which indicates the relationship between values of wchar_t and the C 
> standard concept of a "character":
> value representable by an object of type wchar_t, capable of 
> representing any character in the current locale
Indeed, but that definition of wide character in the standard 
contradicts long-standing existing practice (e.g., use of UTF-16 on 
Windows).
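
As a rough illustration of that conflict (a minimal sketch only, not 
taken from any implementation discussed in this thread): with a 16-bit 
UTF-16 code unit, any code point above U+FFFF needs two code units, so 
a single wchar_t-sized value cannot represent every character. The 
helper name encode_utf16 is hypothetical, and uint16_t stands in for a 
16-bit wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode scalar value as UTF-16.
   Returns the number of 16-bit code units written to out (1 or 2),
   or 0 if cp is not a valid scalar value. */
static int encode_utf16(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* Reject values beyond U+10FFFF and the surrogate range. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* BMP code points fit in a single code unit. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Everything else is split into a high and a low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For U+00CA this yields one code unit, while U+1F600 yields the pair 
0xD83D 0xDE00; the second case is the one that a strict reading of the 
"wide character" definition does not allow for.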
>
> I doubt that wide characters should be considered the preferred 
> solution for dealing with UCS encodings or notions that characters are 
> formed by more than one minimal well-formed code unit sequence.
>
I certainly agree with the first part of that statement, but not the 
second, considering existing practice.

Tom.


--------------DF6FE3B866A361580B79BADD--