Update The Reference To The Unicode Standard

Document number: P1025R0
Date: 2018-04-23
Author: Steve Downey <sdowney2@bloomberg.net>
Audience: Core, LWG, SG16

Abstract

The reference to the Unicode Standard in the C++ Standard should be updated to the stable base standard or any successor standard.

References

P0417R1 : C++17 should refer to ISO/IEC 10646 2014 instead of 1994 (R1)

Preferred New Reference

The Unicode Consortium, the entity responsible for the Unicode Standard, documents the preferred citations for the Unicode Standard. The current standard is version 10.0. The existing reference should be changed to:

The Unicode Standard, Version 10.0 or later

The Unicode Consortium. The Unicode Standard, Version 10.0.0, (Mountain View, CA: The Unicode Consortium, 2017. ISBN 978-1-936213-16-0) http://www.unicode.org/versions/Unicode10.0.0/

The Unicode Consortium. The Unicode Standard. http://www.unicode.org/versions/latest/

The reason for not referring to the equivalent ISO Standard, 10646, is that the ISO standard is incomplete with respect to the Unicode Standard. From the Unicode and ISO 10646 FAQ:

Although the character codes and encoding forms are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646.

For existing purposes, the C++ Standard is only concerned with character codes and encoding forms. However, to standardise any Unicode text processing, the algorithms and character data will need to be referenced. Therefore, we might as well update the reference now.

Referring to 10.0 or later sets a baseline, but allows implementors to move to later standards, including new emoji, at their discretion.

The equivalent to the 10.0 standard is ISO/IEC 10646:2017 with some additions from the first amendment to 10646. If there are strong reasons not to refer to the Unicode Standard itself, the reference for character sets and encoding should be changed to:

ISO/IEC 10646:2017 Information technology – Universal Coded Character Set (UCS) plus 10646:2017/DAmd 1, or successor

The 'or successor' wording is borrowed from the current ECMAScript standard, ECMAScript® 2017 Language Specification (ECMA-262, 8th edition, June 2017). The 'or successor' language has been in place since at least the 2015 standard.

The Unicode Consortium has made a number of stability guarantees based on the referenced standard, promising that any currently conforming Unicode text will continue to be interpreted the same way in the future for purposes of encoding, collation, registration, and locales. These guarantees are documented as part of the Consortium's policies.

This means that it is safe to allow implementations to adopt newer Unicode standards without affecting the interpretation of existing conforming text. In practice, due to customer demand, everyone ships the latest Unicode data and algorithms available, so this wording matches existing practice. That will matter particularly as new, advanced Unicode libraries are incorporated into the standard.

Immediate Effects

The Unicode standard that the C++ Standard refers to predates UTF-16 and UTF-32, instead defining UCS2 and UCS4. Moving to a newer standard would make the terms UTF-16 and UTF-32 well defined in the C++ Standard. It has been argued that the ECMAScript standard referred to by the C++ Standard uses a newer Unicode standard in which those terms are defined, so they are already defined for the C++ Standard by transitive reference. If that argument is accepted, then moving to the newer version makes the intent explicit.
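As a small illustration of what the two encoding forms mean at the language level, the sketch below counts the code units in `char16_t` and `char32_t` string literals holding one code point outside the Basic Multilingual Plane. The code point U+1F600 and the constant names are chosen here for illustration and do not come from this paper. The counts assume that the implementation encodes these literals as UTF-16 and UTF-32.

```cpp
#include <cstddef>

// Illustrative constants (hypothetical names). Each literal holds the single
// code point U+1F600 followed by the terminating null code unit, which the
// "- 1" removes from the count.
constexpr std::size_t utf16_code_units =
    sizeof(u"\U0001F600") / sizeof(char16_t) - 1;
constexpr std::size_t utf32_code_units =
    sizeof(U"\U0001F600") / sizeof(char32_t) - 1;

// Assuming UTF-16 and UTF-32 literal encodings: UTF-16 needs a surrogate
// pair, while UTF-32 stores the code point in a single code unit.
static_assert(utf16_code_units == 2, "U+1F600 needs a surrogate pair");
static_assert(utf32_code_units == 1, "U+1F600 fits in one UTF-32 code unit");
```

Both checks are evaluated at compile time, so the fragment verifies itself whenever it is compiled under the stated assumption.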

In addition, in 1996, as part of Amendments 5, 6, and 7, the original set of Hangul characters was removed and added again at a new location, and Tibetan characters were added again. This places the standard's current citation of "ISO/IEC 10646-1:1993" in conflict with the version imported by way of the ECMAScript standard. In practice, all implementors adopt the later version for conversion operations.

The Wikipedia article on Unicode has a summary of the changes over the years.

UCS2 and UCS4 in `codecvt` facets

The last proposal to update the Unicode Standard reference, P0417R1, was entangled with the deprecation of UCS2 and UCS4. The remaining references are in the now-deprecated codecvt facets [depr.locale.stdcvt.req]. There is resistance to changing those to UTF-16 and UTF-32 because, particularly for UCS2, doing so would cause real changes in behavior. UTF-32 can be viewed as UCS4, but UTF-16 cannot similarly be viewed as UCS2. Since some users of the facility may depend on its behavior as originally standardized, this paper does not propose changing the facets. Instead, it proposes leaving them in place as deprecated features with no formal definition, as there is no longer one to refer to. This should not be interpreted as placing any onus on implementors to change the existing, deprecated facets.
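The asymmetry between the two pairs can be made concrete with a minimal sketch of UTF-16 encoding. The names `Utf16Units`, `encode_utf16`, and `fits_in_ucs2` are hypothetical helpers written for this illustration; they are not part of the deprecated facets or of any library. The input is assumed to be a valid Unicode scalar value, so surrogate code points are not rejected here.

```cpp
#include <cstddef>

// Hypothetical result type: the UTF-16 code units for one code point.
struct Utf16Units {
    char16_t units[2];
    std::size_t count;  // 1 for BMP code points, 2 for supplementary ones
};

// Encode one code point as UTF-16. Assumes cp is a valid scalar value
// (at most 0x10FFFF and not in the surrogate range).
constexpr Utf16Units encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit, same value as UCS2.
        return {{static_cast<char16_t>(cp), 0}, 1};
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    const char32_t v = cp - 0x10000;
    return {{static_cast<char16_t>(0xD800 + (v >> 10)),
             static_cast<char16_t>(0xDC00 + (v & 0x3FF))},
            2};
}

// A fixed-width 16-bit form such as UCS2 can only hold BMP code points.
constexpr bool fits_in_ucs2(char32_t cp) { return cp <= 0xFFFF; }
```

For a code point such as U+1F600, `encode_utf16` yields the pair 0xD83D, 0xDE00, while `fits_in_ucs2` returns false. A facet that switched from UCS2 to UTF-16 would therefore start producing or accepting two-unit sequences where it previously could not, which is the kind of behavior change described above.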

`__STDC_ISO_10646__` macro

The macro __STDC_ISO_10646__ in [cpp.predefined] can be left unchanged. The ISO/IEC 10646 version will be the version that corresponds to the Unicode Standard in effect.
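A program can inspect the macro along the following lines. This is a sketch only: the helper names are hypothetical, the macro is conditionally defined and so may be absent, and its value is assumed to have the yyyymmL form. The sample value 199712L in the comments is used purely to show that form.

```cpp
// Hypothetical helper: the value of __STDC_ISO_10646__, or 0 when the
// implementation does not define the macro.
constexpr long iso_10646_version() {
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

// Split a yyyymmL value into its parts, e.g. 199712L -> year 1997, month 12.
constexpr long version_year(long yyyymm) { return yyyymm / 100; }
constexpr long version_month(long yyyymm) { return yyyymm % 100; }
```

Code that depends on a particular repertoire could compare `iso_10646_version()` against the date of the ISO/IEC 10646 edition it needs, treating 0 as "unknown".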

Fall-back Reference

The current Unicode standard, 10.0, is equivalent to

10646:2017, fifth edition, plus the following additions from Amendment 1 to the fifth edition:

56 emoji characters

285 hentaigana

3 additional Zanabazar Square characters

according to the Unicode 10.0 Standard

The 2017 standard is ISO/IEC 10646:2017, so as a fall-back position the C++ Standard should be updated to

ISO/IEC 10646:2017 Information technology – Universal Coded Character Set (UCS) plus 10646:2017/DAmd 1

This fall-back omits the reference to the latest standard.

Proposed Changes

Strike the wording ~~high-lighted in red~~ and add the wording high-lighted in green.

1.2 Normative references [intro.refs]

~~— ISO/IEC 10646-1:1993, Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane~~

— The Unicode Consortium. The Unicode Standard, Version 10.0.0, (Mountain View, CA: The Unicode Consortium, 2017. ISBN 978-1-936213-16-0) http://www.unicode.org/versions/Unicode10.0.0/

— The Unicode Consortium. The Unicode Standard. http://www.unicode.org/versions/latest/

— ISO/IEC 10646, Information technology — Universal Multiple-Octet Coded Character Set (UCS)

Add:

5 The ISO/IEC 10646 version is the corresponding version to the Unicode Standard, as documented by the Unicode Standard. For version 10.0 this is ISO/IEC 10646:2017 plus 10646:2017/DAmd 1.