Doc. no.:	P0417R1
Date:	2016-11-25
Reply to:	Beman Dawes <bdawes at acm dot org>
Audience:	Core, Library

C++17 should refer to ISO/IEC 10646 2014 instead of 1994 (R1)

ISO standards are only supposed to have normative references to the latest version of other ISO standards, yet the C++17 CD still refers to ISO/IEC 10646-1:1993, Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane.

This paper proposes updating the C++ standard to refer to ISO/IEC 10646:2014 and replacing the terms UCS2 and UCS4 with UTF-16 and UTF-32. National Body comment GB 4 requests updating the reference. NB comments US 64 and CA 9 implicitly support updating the reference, but explicitly request that UCS2 be retained.

Background

There have been three revisions and numerous amendments to ISO/IEC 10646 since 1994. The changes that impact the C++17 CD include:

The name has changed to Information technology — Universal Coded Character Set (UCS).
UTF-8, UTF-16, and UTF-32 are now defined. They were not even a part of 10646:1994 before amendments, so the C++ standard has been using the terms without a normative definition.
UCS-2 has been deprecated, and has been replaced by UTF-16. This is a normative change for the C++ standard because UCS-2 and UTF-16 are not the same; UCS-2 does not support surrogate pairs and so is limited to the Basic Multilingual Plane (BMP).
The term UCS-4 has been changed to UTF-32. Although 10646 says "The terms UTF-32 and UCS-4 can be used interchangeably...", the C++ standard should use the preferred term UTF-32 throughout.
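
To make the difference between UCS-2 and UTF-16 concrete, here is a minimal sketch of UTF-16 encoding for a single code point. The function name encode_utf16 is a hypothetical helper introduced only for illustration; it is not part of the standard library or of this proposal, and it assumes its argument is a valid Unicode scalar value.

```cpp
#include <string>

// Hypothetical helper, for illustration only.
// Assumes cp <= 0x10FFFF and cp is not itself a surrogate code point.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // BMP code point: one 16-bit code unit.
        // UCS-2 and UTF-16 produce the same result here.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary code point: UTF-16 uses a surrogate pair.
        // UCS-2 has no representation for these code points.
        char32_t v = cp - 0x10000;  // 20-bit offset into the supplementary planes
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+20AC yields a single code unit, while U+1F600 yields the pair 0xD83D 0xDE00.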

See http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html for a copy of ISO/IEC 10646:2014.

Discussion of the UCS2 to UTF-16 change

The term 'UCS2' is only used twice, in the specification of the C++11 header <codecvt> facets in [locale.stdcvt].

Rationale for the change to UTF-16:

The term UCS-2 is now obsolete and deprecated. See http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf section C.2, which says:

UCS-2. UCS-2 stands for “Universal Character Set coded in 2 octets” and is also known as “the two-octet BMP form.” It was documented in earlier editions of 10646 as the two-octet (16-bit) encoding consisting only of code positions for plane zero, the Basic Multilingual Plane. This documentation has been removed from ISO/IEC 10646:2011 and subsequent editions, and the term UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard.

Implementations diverge. libstdc++ already treats two-octet forms as UTF-16.
When a facet's Elem is char16_t, it is surprising and error-prone if the encoding is actually UCS2 since the value of char16_t character literals "is equal to its ISO 10646 code point value" and the encoding for char16_t string literals is explicitly required (2.13.5 [lex.string] paragraph 10) to support surrogate pairs (i.e. is UTF-16).
Because header <codecvt> facets only became part of the standard with C++11, and because the only code breakage from the change to UTF-16 is in downstream code that makes assumptions which fail for surrogate code points, it seems unlikely that UCS2 replacement will break much existing code that isn't already broken. Use of the facets themselves does not break existing code because the ranges for high surrogates, low surrogates, and valid BMP characters are disjoint.
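
The last two points can be sketched in code. The example below assumes a C++11 compiler; the helper names (is_high_surrogate, is_low_surrogate, is_bmp_non_surrogate, count_code_points) are hypothetical and introduced only for illustration. It shows that a char16_t string literal holding a non-BMP character is stored as a surrogate pair, and that a code unit can be classified without looking at its neighbours because the three ranges do not overlap.

```cpp
#include <cstddef>

// Hypothetical helpers, for illustration only. The three ranges are disjoint.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }
constexpr bool is_bmp_non_surrogate(char16_t u) { return u < 0xD800 || u > 0xDFFF; }

// U+1F600 lies outside the BMP, so the literal holds two code units
// (a surrogate pair) followed by the terminating null.
constexpr char16_t smiley[] = u"\U0001F600";
static_assert(sizeof(smiley) / sizeof(smiley[0]) == 3,
              "two code units plus the terminator");
static_assert(is_high_surrogate(smiley[0]) && is_low_surrogate(smiley[1]),
              "stored as a surrogate pair");

// Counts code points in a UTF-16 sequence, assuming well-formed input.
// Code that instead assumes one code unit per character is the kind of
// downstream assumption that fails for surrogates.
std::size_t count_code_points(const char16_t* s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!is_low_surrogate(s[i])) ++count;  // a low surrogate continues a pair
    }
    return count;
}
```

Code that treats every char16_t as a complete character would report two characters for smiley; count_code_points reports one.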

Revision history

R1 - 2016 Post-Issaquah mailing

Add mention of National Body comments.
Add Acknowledgements.
Add Discussion of the UCS2 to UTF-16 change.
Remove proposed changes to Annex E. Clark Nelson points out that the omission of F0000-FFFFD and 100000-10FFFD is deliberate because they are reserved for private use.

R0 - 2016 Post-Oulu mailing

Initial proposal

Acknowledgements

Thanks to Richard Smith for encouraging me to write this paper.

Thanks to Tom Honermann for standardese discussions that led me to realize how out-of-date the ISO/IEC 10646-1:1993 reference was.

Proposed changes

Strike the wording ~~highlighted in red~~ and add the wording highlighted in green.

1.2 Normative references [intro.refs]

— ISO/IEC 10646~~-1:1993, Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane~~ :2014, Information technology — Universal Coded Character Set (UCS)

22.5 Standard code conversion facets [locale.stdcvt]

For the facet codecvt_utf8:

— The facet shall convert between UTF-8 multibyte sequences and ~~UCS2~~ UTF-16 or ~~UCS4~~ UTF-32 (depending on the size of Elem) within the program.

...

For the facet codecvt_utf16:

— The facet shall convert between UTF-16 multibyte sequences and ~~UCS2~~ UTF-16 or ~~UCS4~~ UTF-32 (depending on the size of Elem) within the program.
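
As a usage illustration only, the following sketch converts a UTF-8 byte sequence to UTF-32 with codecvt_utf8. It deliberately uses char32_t as Elem, since that is the case where the wording change is purely a rename; the 16-bit Elem case is where implementations diverge, so it is not shown. The function name utf8_to_utf32 is a hypothetical wrapper, and the sketch assumes a standard library that still ships header <codecvt> (the header was deprecated in C++17, so some compilers may warn).

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper, for illustration only.
// Converts UTF-8 bytes to UTF-32 code points using codecvt_utf8<char32_t>.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u32string utf8_to_utf32(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}
```

Given the four bytes F0 9F 98 80, the result is a single element with the value U+1F600.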