Doc. no.: N2207=07-0067

Date: 2007-03-09

Project: Programming Language C++

Subgroup: Library

Reply to: Matthew Austern <austern@google.com>

Minimal Unicode support for the standard library (revision 2)

Background

Unicode is an industry standard developed by the Unicode Consortium, with the goal of encoding every character in every writing system. It is synchronized with ISO 10646, which contains the same characters and the same character codes, and for the purposes of this paper we may treat Unicode and ISO 10646 as synonymous. Many programming languages and platforms already support Unicode, and many standards, such as XML, are defined in terms of Unicode. There has already been some work to add Unicode support to ISO C++.

C++ has two character types, char and wchar_t. The standard does not specify which character set either type uses, except that each is a superset of the 95-character basic execution character set. In practice char is almost always an 8-bit type, typically used either for the ASCII character set or for some 256-character superset of ASCII (e.g. ISO-8859-1). Some programs use wchar_t for Unicode characters, but wchar_t varies enough from one platform to another that it is unsuitable for portable Unicode programming.

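Unicode assigns “a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.” [6] These numbers are known as code points. A character encoding form specifies the way in which a sequence of code points is represented in an actual program. Code points range from 0x000000 through 0x10ffff, so 21 bits suffice to represent all Unicode characters. No popular architecture has a 21-bit word size, so instead most programs that work with Unicode use one of the following character encoding forms for internal processing: UTF-32, which represents each code point as a single 32-bit code unit; UTF-16, which uses one 16-bit code unit for code points in the Basic Multilingual Plane and a pair of 16-bit code units (a surrogate pair) for all others; or UTF-8, which uses one to four 8-bit code units per code point.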

Other character encoding forms are sometimes used for serialization or external storage.
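To make the relationship between code points and code units concrete, here is a minimal sketch that encodes one code point in UTF-16 and in UTF-8. The helper names encode_utf16 and encode_utf8 are hypothetical and are not part of this proposal; the sketch assumes a compiler that provides char16_t and char32_t as described in N2149.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one code point as one or two UTF-16 code units.
std::basic_string<char16_t> encode_utf16(char32_t cp) {
    // Precondition: a valid code point that is not itself a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::basic_string<char16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));            // BMP: one unit
    } else {
        char32_t v = cp - 0x10000;                           // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // leading
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // trailing
    }
    return out;
}

// Hypothetical helper: encode one code point as one to four UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the BMP, encode_utf16 yields the two code units 0xD834 0xDD1E and encode_utf8 yields the four bytes F0 9D 84 9E.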


Unicode support for ISO C is described in TR 19769:2004, a Type 2 technical report. TR 19769 proposes two new character types, char16_t and char32_t, together with new syntax for character and string literals of those types, and a few additions to the C library to manipulate strings of those types. Lawrence Crowl’s paper N2149, “New Character Types in C++,” proposes that WG21 adopt TR 19769 almost unchanged; essentially the only change from TR 19769 is that char16_t and char32_t are required to be distinct from other integer types, so that it’s possible to overload on them.

This paper describes changes to the standard library that will be needed if WG21 chooses to adopt N2149. It is a proposal for C++0x, because it proposes changes in existing standard library components.

Goals and design decisions

The main goal of this paper is simple: make it possible to use library facilities in combination with the two new character types char16_t and char32_t. This paper does not attempt to define new library facilities or to fix defects in existing ones, but only to make it possible to use char16_t and char32_t with existing library facilities.


This goal is important despite the existence of wchar_t. Even if wchar_t is the same size as one of those two types, it is distinct from both from the point of view of the C++ type system. It would be a very poor user experience if we told users that they had to cast their Unicode strings to some other type in order to use library facilities, especially since that type would vary from one system to another. (Internally, of course, I imagine most library implementers will choose to share code between char32_t and wchar_t or between char16_t and wchar_t.) It is indeed irritating to have three distinct types when two of them will almost always be identical, but, as with char, signed char, and unsigned char, history leaves us little choice.
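The distinctness of the three types can be illustrated with a short sketch. It assumes the N2149 rule that char16_t and char32_t are distinct types and uses the N2149 literal prefixes; the overload set named kind is hypothetical, and the std::is_same check assumes an implementation that provides a <type_traits> header.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical overload set: legal only because the three types are distinct,
// even on a platform where wchar_t has the same size as one of the others.
const char* kind(wchar_t)  { return "wchar_t"; }
const char* kind(char16_t) { return "char16_t"; }
const char* kind(char32_t) { return "char32_t"; }

// The type system never identifies wchar_t with either new type.
static_assert(!std::is_same<wchar_t, char16_t>::value,
              "wchar_t and char16_t are distinct types");
static_assert(!std::is_same<wchar_t, char32_t>::value,
              "wchar_t and char32_t are distinct types");
```

A call such as kind(u'a') selects the char16_t overload and kind(L'a') selects the wchar_t overload, which is why a library specialized only for wchar_t cannot be used directly with char16_t or char32_t strings.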

Minimal support for char32_t

Minimal support for char32_t is simple: UTF-32 is a fixed width encoding, so we just need to require specializations of library facilities for char32_t in the same way that we do for char and wchar_t. Arguably a basic_string of 32-bit characters isn't all that useful, but I think just enough people would use it to make it worth having.
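As an illustration of why a fixed width encoding needs no special treatment, the following sketch reverses a string of char32_t with an ordinary generic algorithm. The function name reverse_code_points is hypothetical, and the sketch assumes that basic_string<char32_t> is available as proposed here.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: in UTF-32 every element is a whole code point, so a
// plain element-wise reversal never splits a character's encoding.
std::basic_string<char32_t> reverse_code_points(std::basic_string<char32_t> s) {
    std::reverse(s.begin(), s.end());
    return s;
}
```

The same element-wise reversal applied to UTF-16 code units would swap the two halves of any surrogate pair, so the result is only meaningful there for text confined to the BMP.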

Minimal support for char16_t

Minimal support for char16_t is more complicated in theory, but equally simple in practice: again, just add specializations of all library facilities for char16_t. UTF-16 is not a fixed width encoding, but, for two reasons, it can almost be treated as one. First, most text is composed only of the common characters that lie in the Basic Multilingual Plane (BMP), and for such text UTF-16 is in fact a fixed width encoding. Second, since it’s always possible to tell whether a code unit is a complete character, a leading surrogate, or a trailing surrogate, there is little danger from treating a UTF-16 string as a sequence of code units instead of a sequence of code points. Corrupting a UTF-16 string by inserting an incorrect code unit is no more likely than corrupting a UTF-32 string, and the corruption, if any, will be confined to a single character.
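The claim that every UTF-16 code unit is self-identifying can be sketched as follows. The names unit_kind, classify, and count_code_points are hypothetical and not part of this proposal; count_code_points assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum unit_kind { complete_character, leading_surrogate, trailing_surrogate };

// Hypothetical helper: the three ranges are disjoint, so a single code unit
// can be classified without looking at its neighbours.
unit_kind classify(char16_t u) {
    if (u >= 0xD800 && u <= 0xDBFF) return leading_surrogate;
    if (u >= 0xDC00 && u <= 0xDFFF) return trailing_surrogate;
    return complete_character;
}

// Hypothetical helper: in well-formed UTF-16 every code unit that is not a
// trailing surrogate starts exactly one code point.
std::size_t count_code_points(const std::basic_string<char16_t>& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (classify(s[i]) != trailing_surrogate)
            ++n;
    return n;
}
```

For the string u"a\U0001D11Eb", size() reports four code units while count_code_points reports three code points.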


We don’t need to say very much about how the library handles char16_t strings. There is already language in the standard to allow facets to give errors at run time for invalid strings, and we need that for UTF-32 as well as UTF-16.


In practice, we need library support for UTF-16 because that’s the real world; if the standard library ignores UTF-16 then the standard library will be irrelevant to processing non-ASCII text. The small amount of extra simplicity that you get from using UTF-32 instead of UTF-16 just doesn’t outweigh the cost of using 4 bytes per character instead of 2+ε. Microsoft, Apple, and Java all use UTF-16 as their primary string representation, and in practice it works fine. Microsoft’s decision to use UTF-16 for wchar_t shows that there is no insuperable obstacle to using UTF-16 with the standard C++ library.

Names of template specializations

The C++ standard assigns names for many of the specializations of class templates on character types. For example, string is shorthand for basic_string<char> and wstreambuf is shorthand for basic_streambuf<wchar_t>. The general pattern is no prefix for char specializations and the prefix ‘w’ for wchar_t specializations. What should the pattern be for specializations on char16_t and char32_t?

In principle we could use a prefix based on the “u” and “U” prefixes that N2149 proposes for Unicode string literals, or we could use a prefix or suffix based on the “16” and “32” in the type names themselves. I propose a combination of the two: a “u” prefix for the char16_t specializations and a “u32” prefix for the char32_t specializations. Rationale:

Prior art

There is no prior art for a C++ library implementation containing four types named char, wchar_t, char16_t, and char32_t. However, there is also no doubt that the proposal is implementable. There is extensive prior art for C++ standard library implementations that use UTF-16 wide characters (wchar_t in Microsoft’s C++ implementation uses UTF-16), and there is extensive prior art for C++ standard library implementations that use UTF-32 wide characters (most Unix implementations).

Dependence on N2149

Since N2149 has not been voted into the WP, building library facilities on top of it is slightly dicey. In principle a number of questions are still open: the names of the two new types (I have chosen to use char16_t and char32_t in accordance with TR 19769 and N2018), whether they are new built-in types or new user-defined types (the latter would only make sense if there are core language changes to permit string literals for user-defined types), whether char16_t and char32_t are the names of types or just the names of typedefs for underlying types with uglier names, and, if they are typedef names, which namespace the typedefs live in. Except for the names char16_t and char32_t, nothing in this paper depends on those decisions.


Changes since revision 1

This document has been revised as a result of the LWG's straw polls in Portland. The LWG voted on whether to support char16_t and char32_t for various library facilities:

Library facility              For  Against
char_traits                    12        0
iostream                        1        9
fstream                         3        4
sstream                         2        1
facets (other than codecvt)     3        4
codecvt                        11        0
regex                           2        7

There were no straw polls on numeric_limits and hash, and it was assumed that basic_string would work if char_traits was supported. We need language in the standard to support hash since it lists all types for which it is defined, but no such language is needed for numeric_limits since the standard already says that it is specialized for all fundamental types.

Revision 1 included support for all library facilities, i.e. adding char16_t and char32_t specializations for every facility that is currently specialized for char and wchar_t. Revision 2 removes support for all facilities that the majority of the LWG opposed in the Portland straw poll. Even though sstream attracted majority support, I have also removed it from revision 2 of this proposal because (a) it was a very weak majority, and (b) sstream depends so heavily on the general iostream machinery that it would be difficult to support sstream without also supporting other parts of the iostream machinery.

Possible future directions

Two items are conspicuously missing from this paper: UTF-8 support, and explicit support for Unicode features like normalization, case conversion, and collation. I intend to address those issues in future papers.

UTF-8 support

One way to provide UTF-8 support would be a new string class whose interface is very different from basic_string, designed to preserve string validity and to encourage users to view the string as code points rather than individual bytes. Alternative approaches include UTF-8 iterator adaptors, or just user education to encourage users to store UTF-8 data in the existing string class.


Some form of UTF-8 support is important because there's an awful lot of real-world code that uses UTF-8 even internally, and programmers certainly need UTF-8 to interface with third-party libraries like libxml2.

Unicode text manipulation

Unicode is more than a character set and a handful of encoding schemes. It also specifies a great deal of information about each character, including script identification, character classification, and text direction, and various operations on strings, including normalization, case conversion, and collation.


Normalization is particularly important because there are cases where two different sequences of code points can represent what is conceptually the same string. For example, a string that is printed as “á” can be either the single character U+00E1 (LATIN SMALL LETTER A WITH ACUTE) or the two-character sequence U+0061 (LATIN SMALL LETTER A) U+0301 (COMBINING ACUTE ACCENT). Unicode defines several different canonical forms, and algorithms for converting to canonical form and for testing string equivalence.
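The “á” example can be written down directly. The sketch below only shows that the two representations differ when compared code point by code point; it does not implement normalization. The alias code_points and the function names are hypothetical.

```cpp
#include <cassert>
#include <string>

typedef std::basic_string<char32_t> code_points;  // local alias, illustration only

// U+00E1 LATIN SMALL LETTER A WITH ACUTE as a single code point.
code_points a_acute_precomposed() {
    return code_points(1, char32_t(0x00E1));
}

// U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT.
code_points a_acute_decomposed() {
    code_points s;
    s.push_back(char32_t(0x0061));
    s.push_back(char32_t(0x0301));
    return s;
}

// Plain basic_string comparison: element by element, with no normalization.
bool same_code_points(const code_points& a, const code_points& b) {
    return a == b;
}
```

same_code_points reports these two strings as different even though they are canonically equivalent, which is the gap that a normalization facility would have to close.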


In principle some of these facilities are already part of C++’s facet interface, and it might be argued that we do not need a separate mechanism just for Unicode. There is, however, an important way in which Unicode is special: since it uses a single code point space for all scripts, many operations that are necessarily locale-dependent in other encodings are locale-independent in Unicode. Since the C++ locale interface is so awkward, it would be useful to provide a locale-independent interface for common operations that do not require locales.


ICU (International Components for Unicode) is a useful source of prior art.


Proposed working paper changes

In clause 20.5 [lib.function.objects], in the header <functional> synopsis, add the following specializations of class template hash<>:

template<> struct hash<char16_t>;

template<> struct hash<char32_t>;

template<> struct hash<std::ustring>;

template<> struct hash<std::u32string>;


In clause 20.5.15 [lib.unord.hash], in paragraph 1, change "and std::string and std::wstring" to read "and std::string, std::wstring, std::ustring, and std::u32string".
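As a non-normative illustration of what the hash specializations enable, the following sketch stores UTF-16 strings in an unordered container. The typedef ustring is declared locally under the name proposed in this paper, since an implementation may not provide it; the function count_distinct is hypothetical, and the sketch assumes an implementation that supplies hash for basic_string<char16_t> and the unordered containers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Local declaration of the name proposed in this paper (illustration only).
typedef std::basic_string<char16_t> ustring;

// Hypothetical helper: count the distinct strings in [first, last).
// unordered_set<ustring> compiles only if hash<ustring> is defined.
std::size_t count_distinct(const ustring* first, const ustring* last) {
    std::unordered_set<ustring> seen(first, last);
    return seen.size();
}
```

Without the added specializations, a user would have to supply a hash function object by hand for every container keyed on a char16_t or char32_t string.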


Add two new sections after 21.1.3.2 [lib.char.traits.specializations.wchar.t]:


[lib.char.traits.specializations.char16.t]

namespace std {

template<>

struct char_traits<char16_t> {

typedef char16_t char_type;

typedef uint_least16_t int_type;

typedef streamoff off_type;

typedef ustreampos pos_type;

typedef mbstate_t state_type;

static void assign(char_type& c1, const char_type& c2);

static bool eq(const char_type& c1, const char_type& c2);

static bool lt(const char_type& c1, const char_type& c2);

static int compare(const char_type* s1, const char_type* s2, size_t n);

static size_t length(const char_type* s);

static const char_type* find(const char_type* s, size_t n,

const char_type& a);

static char_type* move(char_type* s1, const char_type* s2, size_t n);

static char_type* copy(char_type* s1, const char_type* s2, size_t n);

static char_type* assign(char_type* s, size_t n, char_type a);

static int_type not_eof(const int_type& c);

static char_type to_char_type(const int_type& c);

static int_type to_int_type(const char_type& c);

static bool eq_int_type(const int_type& c1, const int_type& c2);

static int_type eof();

};

}


The header <string> (21.2) declares a specialization of the class template char_traits for char16_t.


The two-argument member assign is defined identically to the built-in operator =. The two-argument members eq and lt are defined identically to the built-in operators == and <.


The member eof() returns an implementation-defined constant that cannot appear as a valid UTF-16 code unit.


[lib.char.traits.specializations.char32.t]

namespace std {

template<>

struct char_traits<char32_t> {

typedef char32_t char_type;

typedef uint_least32_t int_type;

typedef streamoff off_type;

typedef u32streampos pos_type;

typedef mbstate_t state_type;

static void assign(char_type& c1, const char_type& c2);

static bool eq(const char_type& c1, const char_type& c2);

static bool lt(const char_type& c1, const char_type& c2);

static int compare(const char_type* s1, const char_type* s2, size_t n);

static size_t length(const char_type* s);

static const char_type* find(const char_type* s, size_t n,

const char_type& a);

static char_type* move(char_type* s1, const char_type* s2, size_t n);

static char_type* copy(char_type* s1, const char_type* s2, size_t n);

static char_type* assign(char_type* s, size_t n, char_type a);

static int_type not_eof(const int_type& c);

static char_type to_char_type(const int_type& c);

static int_type to_int_type(const char_type& c);

static bool eq_int_type(const int_type& c1, const int_type& c2);

static int_type eof();

};

}


The header <string> (21.2) declares a specialization of the class template char_traits for char32_t.


The two-argument member assign is defined identically to the built-in operator =. The two-argument members eq and lt are defined identically to the built-in operators == and <.


The member eof() returns an implementation-defined constant that does not represent a Unicode code point.
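As a non-normative illustration, the following sketch uses the char_traits<char16_t> members declared above on null-terminated sequences of code units. The helper starts_with and the alias traits16 are hypothetical, and the sketch assumes an implementation that provides the specialization.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

typedef std::char_traits<char16_t> traits16;  // local alias, illustration only

// Hypothetical helper: does the null-terminated sequence s begin with prefix?
// The comparison is made code unit by code unit.
bool starts_with(const char16_t* s, const char16_t* prefix) {
    assert(s != 0 && prefix != 0);
    std::size_t n = traits16::length(prefix);
    if (traits16::length(s) < n)
        return false;
    return traits16::compare(s, prefix, n) == 0;
}
```

The members behave exactly as they do for char and wchar_t; nothing in them is aware of surrogate pairs.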


In clause 21.2 [lib.string.classes], add the following to the beginning of the header <string> synopsis:

template<> struct char_traits<char16_t>;

template<> struct char_traits<char32_t>;

and the following to the end:

typedef basic_string<char16_t> ustring;

typedef basic_string<char32_t> u32string;


In Table 65 (Locale category facets) in clause 22.1 [lib.locale.category], add the following specializations:


codecvt<char16_t, char, mbstate_t>

codecvt<char32_t, char, mbstate_t>



In Table 66 (Required Specializations) in clause 22.1 [lib.locale.category], add the following specializations:


codecvt_byname<char16_t, char, mbstate_t>

codecvt_byname<char32_t, char, mbstate_t>


In clause 22.2.1.4 [lib.locale.codecvt] paragraph 3, remove the phrase “namely codecvt<wchar_t, char, mbstate_t> and codecvt<char, char, mbstate_t>.” Add the following sentence, after the one describing the wchar_t specialization: “The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.”
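As a non-normative illustration of the proposed codecvt<char16_t, char, mbstate_t> specialization, the following sketch converts a UTF-16 string to UTF-8 through the facet's out member. The function name utf16_to_utf8 is hypothetical, and the sketch assumes an implementation whose classic locale already contains the facet described above.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-16 code units to UTF-8 bytes using the
// codecvt specialization proposed in this paper.
std::string utf16_to_utf8(const std::basic_string<char16_t>& in) {
    typedef std::codecvt<char16_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    // A BMP code unit needs at most 3 bytes; a surrogate pair (2 units) needs 4.
    std::string out(in.size() * 3 + 1, '\0');
    const char16_t* from_next = 0;
    char* to_next = 0;

    std::codecvt_base::result r =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("invalid or incomplete UTF-16 input");

    assert(from_next == in.data() + in.size());  // all input was consumed
    out.resize(static_cast<std::string::size_type>(to_next - &out[0]));
    return out;
}
```

Malformed input, such as an unpaired surrogate, is reported through the facet's result code, which is the existing mechanism for run-time errors mentioned earlier in this paper.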



References

[1] Lawrence Crowl, Extensions for the Programming Language C++ to Support New Character Data Types. WG21 N2149, 2007.

[2] ISO. Information technology -- Universal Multiple-Octet Coded Character Set (UCS), ISO/IEC 10646.

[3] ISO. Information technology -- Programming languages, their environments and system software interfaces -- Extensions for the programming language C to support new character data types, ISO/IEC TR 19769:2004.

[4] The Unicode Consortium. The Unicode Standard, Version 4.1.0, defined by: The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1) and by Unicode 4.1.0 (http://www.unicode.org/versions/Unicode4.1.0).

[5] The Unicode Consortium, Frequently Asked Questions, http://www.unicode.org/unicode/faq/. See in particular http://www.unicode.org/faq/utf_bom.html for a discussion of encodings.

[6] The Unicode Consortium, What is Unicode?, http://www.unicode.org/standard/WhatIsUnicode.html