Doc. no.: P0417R1
Date: 2016-11-25
Reply to: Beman Dawes <bdawes at acm dot org>
Audience: Core, Library
ISO standards are only supposed to have normative references to the latest version of other ISO standards, yet the C++17 CD still refers to ISO/IEC 10646-1:1993, Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane.
This paper proposes updating the C++ standard to refer to ISO/IEC 10646:2014 and replacing the terms UCS2 and UCS4 with UTF-16 and UTF-32. National Body comment GB 4 requests updating the reference. NB comments US 64 and CA 9 implicitly support updating the reference, but explicitly request that UCS2 be retained.
There have been three revisions and numerous amendments to ISO/IEC 10646 since 1994. The changes that impact the C++17 CD include:
See http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html for a copy of ISO/IEC 10646:2014.
The term 'UCS2' is only used twice, in the specification of the C++11 header <codecvt> facets in [locale.stdcvt].
Rationale for the change to UTF-16:
UCS-2 stands for “Universal Character Set coded in 2 octets” and is also known as “the two-octet BMP form.” It was documented in earlier editions of 10646 as the two-octet (16-bit) encoding consisting only of code positions for plane zero, the Basic Multilingual Plane. This documentation has been removed from ISO/IEC 10646:2011 and subsequent editions, and the term UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard.
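To make the difference concrete, the following is a minimal sketch, not part of the proposed wording, of how a single code point maps to 16-bit code units. The helper names fits_in_bmp and encode_utf16 are hypothetical and chosen only for illustration. A code point in plane zero needs one unit, which is all the two-octet BMP form could express; UTF-16 additionally represents code points above U+FFFF as a high surrogate followed by a low surrogate.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if the code point lies in plane zero (the BMP),
// i.e. it is expressible as a single 16-bit code position.
constexpr bool fits_in_bmp(char32_t cp) { return cp <= 0xFFFF; }

// Hypothetical helper: encode one code point as UTF-16 code units.
// Precondition: cp is a Unicode scalar value (not a surrogate, <= U+10FFFF).
inline std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (fits_in_bmp(cp)) {
        // BMP code points use one code unit, the same value as in the
        // two-octet BMP form.
        return std::u16string(1, static_cast<char16_t>(cp));
    }
    // Supplementary code points are split into a surrogate pair.
    const char32_t v = cp - 0x10000;
    std::u16string out;
    out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
    out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    return out;
}
```

For example, encode_utf16(U'\U0001F600') yields the two code units 0xD83D 0xDE00, which is the same sequence a conforming compiler stores for the char16_t string literal u"\U0001F600".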
When Elem is char16_t, it is surprising and error-prone if the encoding is actually UCS2, since the value of char16_t character literals "is equal to its ISO 10646 code point value" and the encoding for char16_t string literals is explicitly required (2.13.5 [lex.string] paragraph 10) to support surrogate pairs (i.e. is UTF-16).
Because the <codecvt> facets only became part of the standard with C++11, and because the only code breakage from the change to UTF-16 is in downstream code that makes assumptions which fail for surrogate code points, it seems unlikely that UCS2 replacement will break much existing code that isn't already broken. Use of the facets themselves does not break existing code because the ranges for high surrogates, low surrogates, and valid BMP characters are disjoint.
R1 - 2016 Post-Issaquah mailing
R0 - 2016 Post-Oulu mailing
Thanks to Richard Smith for encouraging me to write this paper.
Thanks to Tom Honermann for standardese discussions that led me to realize how out-of-date the ISO/IEC 10646-1:1993 reference was.
Strike the wording highlighted in red, shown here as [-text-], and add the wording highlighted in green, shown here as {+text+}.
— ISO/IEC [-10646-1:1993, Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane-]{+10646:2014, Information technology — Universal Coded Character Set (UCS)+}
For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte sequences and [-UCS2-]{+UTF-16+} or [-UCS4-]{+UTF-32+} (depending on the size of Elem) within the program.
...
For the facet codecvt_utf16:
— The facet shall convert between UTF-16 multibyte sequences and [-UCS2-]{+UTF-16+} or [-UCS4-]{+UTF-32+} (depending on the size of Elem) within the program.
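As an illustration of the facets this wording covers, the sketch below uses codecvt_utf8 with Elem = char32_t, the case where the internal encoding is UTF-32 under either the old or the new terminology. It is a sketch under stated assumptions, not part of the proposed wording: the wrapper names utf8_to_utf32, utf32_to_utf8 and round_trip_check are hypothetical, and the code assumes a toolchain that still ships <codecvt> and std::wstring_convert, which were deprecated in C++17, so newer compilers may emit deprecation warnings.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper: decode a UTF-8 byte sequence into UTF-32 code points
// using the <codecvt> facet with Elem = char32_t.
inline std::u32string utf8_to_utf32(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}

// Hypothetical wrapper: encode UTF-32 code points as a UTF-8 byte sequence.
inline std::string utf32_to_utf8(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}

// Usage sketch: U+1F600 occupies four bytes in UTF-8 and one UTF-32 code unit.
inline void round_trip_check() {
    const std::string utf8 = "\xF0\x9F\x98\x80";
    const std::u32string utf32 = utf8_to_utf32(utf8);
    assert(utf32.size() == 1 && utf32[0] == U'\U0001F600');
    assert(utf32_to_utf8(utf32) == utf8);
}
```

The 16-bit case (Elem = char16_t) is deliberately left out of the sketch, since how that case is specified is exactly what this paper and the NB comments discuss.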