Document number:	N3336 = 12-0026
Date:	2012-01-13
Project:	Programming Language C++, Library Working Group
Reply-to:	Beman Dawes <bdawes at acm dot org>

Adapting Standard Library Strings and I/O to a Unicode World

Introduction
String conversion safety rationale
Design paths not taken
Acknowledgements
String interoperability problems and proposed solutions
    Problem 1: Strings don't interoperate if encoding differs
    Problem 2: Strings don't interoperate with I/O streams if encoding differs
    Problem 3: String conversion iterators are not provided

Introduction

This paper proposes additions to the C++ standard library/TR2 to ease use of Unicode and other string encodings. The motivation is a series of problems with the C++11 standard library.

The full statement of the problems with proposed solutions is given below in String interoperability problems and proposed solutions.

The C++03 versions of these problems were first encountered while providing Unicode support for the internationalization of commercial GIS software. The problems appeared again while working on the Boost Filesystem Library. These problems have become more apparent as compiler support for C++11's additional Unicode support has made it easier to write programs that run up against current limitations.

The proposed solutions are pure additions to the C++11 standard library. No C++03 or C++11 compliant code is broken or otherwise affected by the additions.

This paper does not provide working paper wording. WP wording will be provided if this proposal is accepted in principle.

A "proof-of-concept" implementation of the proposals (and more) is available at github.com/Beman/string-interoperability.

String conversion safety rationale

The proposed solutions below make the assumption that it is safe to convert a string of any type and encoding to another type and encoding. The rationale for that assumption follows.

Conversion in either direction between UTF-8 encoded std::string and UTF-32 encoded std::u32string is safe because it is defined by the Unicode Consortium and ISO/IEC 10646 as unambiguous and lossless.

Conversion in either direction between UTF-16 encoded std::u16string and UTF-32 encoded std::u32string is safe because it is defined by the Unicode Consortium and ISO/IEC 10646 as unambiguous and lossless.

Conversion in either direction between UTF-8 encoded std::string and UTF-16 encoded std::u16string is safe because it can be composed from the two previous known safe conversions via an intermediate conversion to and from UTF-32 encoded char32_t characters.
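
The composition argument can be illustrated with a minimal sketch. The function names utf16_to_utf32, utf32_to_utf8, and utf16_to_utf8 are illustrative assumptions and not part of this proposal, and the decoder assumes well-formed input with no error handling.

```cpp
#include <cstddef>
#include <string>

// Decode well-formed UTF-16 into UTF-32 (sketch: no error handling).
std::u32string utf16_to_utf32(const std::u16string& in)
{
  std::u32string out;
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    char32_t c = in[i];
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size())
    {
      // Combine a high/low surrogate pair into one code point.
      char32_t low = in[++i];
      c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
    }
    out.push_back(c);
  }
  return out;
}

// Encode UTF-32 code points as UTF-8 code units.
std::string utf32_to_utf8(const std::u32string& in)
{
  std::string out;
  for (char32_t c : in)
  {
    if (c < 0x80)
    {
      out.push_back(static_cast<char>(c));
    }
    else if (c < 0x800)
    {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
    else if (c < 0x10000)
    {
      out.push_back(static_cast<char>(0xE0 | (c >> 12)));
      out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
    else
    {
      out.push_back(static_cast<char>(0xF0 | (c >> 18)));
      out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}

// UTF-16 to UTF-8 is composed from the two conversions above via UTF-32.
std::string utf16_to_utf8(const std::u16string& in)
{
  return utf32_to_utf8(utf16_to_utf32(in));
}
```

Only two codecs per encoding are written here; the UTF-16 to UTF-8 path needs no code of its own.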

The cases of std::string and std::wstring are more complex in that the encoding is not implied by the char and wchar_t value types. It is not necessary, however, to know the encoding of these string types in advance as long as it is known how to convert them to one of the known encoding string types. The C++11 standard library requires codecvt<char32_t,char,mbstate_t> and codecvt<wchar_t,char,mbstate_t> facets, so such conversions are always possible using the standard library. In practice, library implementations have additional knowledge that allows such conversions to be more efficient than just calling codecvt facets. To ensure safety, however, error handling does need to be provided, as conversions involving some char and wchar_t encodings can encounter errors. See Problem 3 below for some requested error handling approaches.

Implicit conversion between single characters of different types, as opposed to strings, may require multi-character sequences. No such single character implicit conversions are proposed here.

Design paths not taken

This proposal deals with C++11 std::basic_string and character types, and with their encodings. The deeper attributes of Unicode characters are not addressed. See Mathias Gaunard's Unicode project for an example of deeper Unicode support.

This proposal does not suggest providing a string type guaranteed to provide UTF-8 encoding. Although experiments with typedef basic_string<unsigned char> u8string; worked well, benefits would be speculative and not based on existing practice.

Another approach would be to provide a utf8_char_traits class and then typedef basic_string<char, utf8_char_traits> u8string;. This approach has not been investigated.

Acknowledgements

Peter Dimov inspired the idea of string interoperability by arguing that the Boost Filesystem library should treat a path as a single type (i.e. not a template) regardless of character size and encoding.

John Maddock's Unicode conversion iterators demonstrated an easier-to-use, more efficient, and STL friendlier way to perform character type and encoding conversions as an alternative to standard library codecvt facets.

The C++11 standard deserves acknowledgement as it provides the underlying language and library features that allow Unicode string interoperability:

char16_t and char32_t provide Unicode character types and null-terminated character strings with guaranteed encodings.
std::u16string and std::u32string provide library support for Unicode character types and encodings.
u8, u, and U character and string literals ease programming with Unicode character types and encodings.

String interoperability problems and proposed solutions

Problem 1: Strings don't interoperate if encoding differs

Discussion

Standard library strings with different character encodings have different types that do not interoperate.

Example

u16string s16 = u"您好世界";
u32string s32;
s32 = s16;           // error!
s32 = "foo";         // error!
s32 = s16.c_str();   // error!
s32.assign(s16.cbegin(), s16.cend()); // error!

void f(const string&);
f(s32);              // error!

The encoding of basic_string instantiations can be determined for the types under discussion. It is either implicit in the string's value_type or can be determined via the locale.

Existing practice

Class path, in Boost Filesystem Version 3 and in the filesystem proposal before the C++ committee, solves some of the string interoperability problems, albeit in a limited context. A function that is declared like this:

void f(const path&);

can be called like this:

f("Meow");
f(L"Meow");
f(u8"Meow");
f(u"Meow");
f(U"Meow");
// ... many additional variations such as basic_strings and iterators

This string interoperability support has been a success. It does, however, raise the question of why std::basic_string isn't providing the interoperability support. Users are misusing paths as general string containers because they provide interoperability. The string interoperability cat is out of the bag. The toothpaste is out of the tube.

See Boost.Filesystem V3 class path for an example of how such interoperability might be achieved.

Experience with Boost.Filesystem V3 class path has demonstrated that string interoperability brings a considerable simplification and improvement to internationalized user code, but that having to provide interoperability without the resolution of the issues presented here is a band-aid.

Relationship with interoperability iterators

String interoperability will be easier to specify, implement, and use if the string interoperability iterators proposed below are accepted.

Proposed Solution

The approach is to add std::basic_string overloads to the functions most likely to benefit from interoperability. The overloads are in the form of function templates with sufficient restrictions on overload resolution participation (i.e. enable_if) that the existing C++11 functions are always selected if the value type of the argument is the same as or convertible to the std::basic_string type's value_type. The semantics of the added signatures are the same as the original signatures except that arguments of the template parameter type have their value converted to the type and encoding of basic_string::value_type.
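
A rough sketch of how such a restricted overload might look follows. The class u32text is a hypothetical stand-in for basic_string<char32_t>, the helper to_utf32 is an assumption made for illustration, and only a UTF-16 source is handled; none of these names come from the proposal or the proof-of-concept implementation.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Helper for this sketch only: decode well-formed UTF-16 to UTF-32.
inline std::u32string to_utf32(const std::u16string& in)
{
  std::u32string out;
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    char32_t c = in[i];
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size())
    {
      char32_t low = in[++i];
      c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
    }
    out.push_back(c);
  }
  return out;
}

// Hypothetical stand-in for basic_string<char32_t> with one added overload.
class u32text
{
public:
  // Existing-style signature: same value_type, so a plain copy.
  u32text& assign(const std::u32string& s)
  {
    data_ = s;
    return *this;
  }

  // Added signature: participates in overload resolution only when the
  // source character type differs, and converts the encoding on the way in.
  template <class CharT,
            class = typename std::enable_if<
              !std::is_same<CharT, char32_t>::value>::type>
  u32text& assign(const std::basic_string<CharT>& s)
  {
    data_ = to_utf32(s);
    return *this;
  }

  const std::u32string& str() const { return data_; }

private:
  std::u32string data_;
};
```

Assigning a std::u32string selects the non-template signature unchanged, while assigning a std::u16string selects the added template and decodes surrogate pairs rather than copying code units.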

The std::basic_string functions given additional overloads are:

Each constructor, operator=, operator+=, append, and assign signature.
template <class T> unspecified_iterator c_str(), returning an unspecified iterator with value_type of T.
begin() and end(). Similar to c_str().

To keep the number and complexity of overloads manageable, the proof-of-concept implementation does not provide any way to specify error handling policies, or string and wstring encoding. Not every added signature needs to be able to control error handling and encoding. The need is particularly rare in environments where UTF-8 is the narrow character encoding and UTF-16 is the wide character encoding. A subset, possibly just c_str(), begin(), and end(), with error handling and encoding parameters or arguments, suitably defaulted, may well be sufficient.

Because full implicit interoperability requires many additional signatures to be added to basic_string, it will certainly be appropriate to discuss limiting changes to the key areas of need. For example, constructors and operator= are much more likely to need interoperability than operator+=, append, or assign signatures.

Problem 2: Strings don't interoperate with I/O streams if encoding differs

Discussion

I/O streams do not accept strings of different character types.

A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:

#include <iostream>
int main()
{
  std::cout << U"您好世界";   // error in C++11!
}

This code should "just work", even though U"您好世界" is an array of const char32_t that decays to const char32_t*, not const char*, as long as the encoding of char supports 您好世界. Even if those characters are not supported by default encodings, alternatives like UTF-8 are available.

The code does "just work" with the proof-of-concept implementation of this proposal. On Linux, with default char encoding of UTF-8, execution produces the expected 您好世界 output. On Windows, the console doesn't support full UTF-8, so the output can be piped to a file or to a program which does handle UTF-8 correctly. And, yes, that does work correctly with the proof-of-concept implementation of this proposal.

Proposed Solution

Add function templates to those in 27.7.3.6.4 [ostream.inserters.character], Character inserter function templates, to cover the case where the argument character type differs from charT and is not char, signed char, unsigned char, const char*, const signed char*, or const unsigned char*. (The specified types are excluded because they are covered by existing signatures.) The semantics of the added signatures are the same as the original signatures except that arguments shall be converted to the type and encoding of the stream.

Do the same for the character extractors in 27.7.2.2.3 [istream::extractors], basic_istream::operator>>.

Do the same for the two std::basic_string inserters and extractors in 21.4.8.9 [string.io], Inserters and extractors.
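
The effect of a converting inserter can be approximated in user code today. The sketch below is not the proposed signature: it uses a hypothetical wrapper type, as_utf8, so that it does not collide with any standard operator<< overload, and it assumes the narrow stream's encoding is UTF-8. The render function exists only to make the result easy to inspect.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical wrapper marking a UTF-32 string for output as UTF-8.
struct as_utf8
{
  const std::u32string& text;
};

// Inserter that converts each UTF-32 code point to UTF-8 code units.
std::ostream& operator<<(std::ostream& os, const as_utf8& v)
{
  for (char32_t c : v.text)
  {
    if (c < 0x80)
    {
      os.put(static_cast<char>(c));
    }
    else if (c < 0x800)
    {
      os.put(static_cast<char>(0xC0 | (c >> 6)));
      os.put(static_cast<char>(0x80 | (c & 0x3F)));
    }
    else if (c < 0x10000)
    {
      os.put(static_cast<char>(0xE0 | (c >> 12)));
      os.put(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
      os.put(static_cast<char>(0x80 | (c & 0x3F)));
    }
    else
    {
      os.put(static_cast<char>(0xF0 | (c >> 18)));
      os.put(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
      os.put(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
      os.put(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return os;
}

// Demonstration helper: format a UTF-32 string through a narrow stream.
std::string render(const std::u32string& s)
{
  std::ostringstream os;
  os << as_utf8{s};
  return os.str();
}
```

With the proposed inserters the wrapper would be unnecessary, and the stream itself would determine the target encoding.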

Problem 3: String conversion iterators are not provided

Discussion

Conversion between character types and their encodings using current standard library facilities such as std::codecvt, std::locale, and std::wstring_convert has multiple problems:

Interfaces are overly complex, difficult to learn, difficult to use, and error prone.
Given n encodings, it is necessary to provide n² rather than 2n codecs. In other words, two codecvt facets don't easily compose into a complete conversion from one encoding to another. Such composition is existing practice in C libraries like ICU. UTF-32 is the obvious choice for the common encoding to pass between codecs.
Interfaces don't work well with generic programming techniques, particularly iterators.
Interfaces work at the level of entire strings rather than characters, resulting in unnecessary creation of temporary strings, with attendant memory allocations/deallocations.
Interfaces entangle std::locale and code conversion, even when these are implementation details that should be hidden from the application.
Error actions are difficult to control. Choices requested by users and provided by other interfaces include:
- Throw exception.
- Replace offending character with default character.
- Replace offending character with specified character. Motivating example: Filesystem needs to use a replacement character that is acceptable to the Windows codepage. See Boost issue #5769.
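
The three error actions listed above might be exposed along the following lines. This is only a sketch: the on_error enumeration, the decode_utf16 function, and its parameters are assumptions made for illustration and do not appear in the proposal or the proof-of-concept implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical error action selector.
enum class on_error { throw_exception, replace };

// Decode UTF-16 to UTF-32. An unpaired surrogate either throws or is
// replaced, by default with U+FFFD or with a caller-specified character.
std::u32string decode_utf16(const std::u16string& in,
                            on_error action = on_error::throw_exception,
                            char32_t replacement = 0xFFFD)
{
  std::u32string out;
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    char32_t c = in[i];
    const bool high = c >= 0xD800 && c <= 0xDBFF;
    const bool low = c >= 0xDC00 && c <= 0xDFFF;
    if (high && i + 1 < in.size()
        && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
    {
      // Valid surrogate pair: combine into one code point.
      c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
      ++i;
    }
    else if (high || low)
    {
      // Unpaired surrogate: apply the requested error action.
      if (action == on_error::throw_exception)
        throw std::invalid_argument("unpaired UTF-16 surrogate");
      c = replacement;
    }
    out.push_back(c);
  }
  return out;
}
```

A caller such as Filesystem could pass a replacement character known to be acceptable to the target codepage, while other callers keep the throwing default.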

Example

The generalization of the std::basic_string function c_str is:

template <class T> unspecified_iterator c_str() const;

Given a std::string named s8, this allows a user to write s8.c_str<char16_t>() to obtain an iterator with a value type of char16_t. To implement this function generically using the current standard library would be difficult, and would involve the creation of a temporary string. The full implementation with the proposed solution is simply:

template <class T>
converting_iterator<const_iterator, value_type, by_range, T> c_str() const
{
  return converting_iterator<const_iterator,
    value_type, by_range, T>(cbegin(), cend());
}

No temporary string is created, and none of the other problems listed above are present either. The solution is generally useful for user defined types, and not just for implementations of the standard library.

Other problems become easier to solve with converting_iterator. For example, the Filesystem library's class path in N3239 has many functions with an argument in the form const codecvt_type& cvt=codecvt() that could be eliminated by either direct or indirect use of converting_iterator.

Existing practice

Boost Regex for many years has included a set of Unicode conversion iterators as an implementation detail. Although these do not provide composition, they do demonstrate the technique of using encoding conversion iterators to avoid creation of temporary strings.

Proposed Solution

This solution is based on the proof-of-concept implementation. Input iterator requirements can probably be loosened to bidirectional, but that hasn't been tested yet.

The preliminaries begin with end-detection policy classes, since strings use null termination, size, or half-open ranges to determine the end of a sequence.

template <class InputIterator> class by_null;
template <class InputIterator> class by_size;
template <class InputIterator> class by_range;
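
The paper gives only the declarations of these policies. One way they might be defined is sketched below; the member functions is_end and advance, and the count_elements function that exercises them, are assumptions made for illustration and may differ from the proof-of-concept implementation.

```cpp
#include <cstddef>

// End is reached when the current element is the null character.
template <class InputIterator>
class by_null
{
public:
  bool is_end(InputIterator it) const { return *it == 0; }
  void advance() {}
};

// End is reached after a fixed number of elements.
template <class InputIterator>
class by_size
{
public:
  explicit by_size(std::size_t n) : remaining_(n) {}
  bool is_end(InputIterator) const { return remaining_ == 0; }
  void advance() { --remaining_; }
private:
  std::size_t remaining_;
};

// End is reached when the iterator equals the end of a half-open range.
template <class InputIterator>
class by_range
{
public:
  explicit by_range(InputIterator last) : last_(last) {}
  bool is_end(InputIterator it) const { return it == last_; }
  void advance() {}
private:
  InputIterator last_;
};

// Walk a sequence using whichever end-detection policy is supplied.
template <class InputIterator, class EndPolicy>
std::size_t count_elements(InputIterator first, EndPolicy policy)
{
  std::size_t n = 0;
  for (; !policy.is_end(first); ++first, policy.advance())
    ++n;
  return n;
}
```

The same traversal code then works for null-terminated arrays, counted buffers, and iterator ranges without knowing which one it was given.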

Codec templates handle actual conversion to and from UTF-32. The primary templates are:

template <class InputIterator, class FromCharT, template<class> class EndPolicy> 
  class to32_iterator;
template <class InputIterator, class ToCharT> 
  class from32_iterator;

The standard library would provide specializations for char, wchar_t, char16_t, and char32_t. Presumably users could provide specializations for UDTs, but that hasn't been tested yet. The char and wchar_t specializations provide mechanisms to select the encoding. Since this is a new component, the char default encoding could be UTF-8 rather than locale based, and no existing code would be broken.

The actual converting_iterator primary template is simply:

template <class InputIterator, class FromCharT, template<class> class EndPolicy,
          class ToCharT> 
class converting_iterator
  : public from32_iterator<to32_iterator<InputIterator, FromCharT, EndPolicy>,
      ToCharT>
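{
  typedef from32_iterator<to32_iterator<InputIterator, FromCharT, EndPolicy>,
    ToCharT> base_type;  // the dependent base class must be named explicitly
public:
  using base_type::base_type;  // inherit the base class constructors
};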

Specializations may be provided, but aren't required. The proof-of-concept implementation doesn't use inherited constructors because of lack of compiler support.
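
The layering of one codec iterator over another can be shown with a much reduced sketch. It is not the proposed interface: the class names, the at_end member used in place of an end iterator or end-detection policy, and the collect function are all assumptions made for illustration, only UTF-16 input and UTF-8 output are handled, and input is assumed to be well formed.

```cpp
#include <cstddef>
#include <string>

// Stage 1: reads UTF-16 code units from [first, last), yields UTF-32.
template <class InputIterator>
class utf16_to32_iterator
{
public:
  utf16_to32_iterator(InputIterator first, InputIterator last)
    : it_(first), last_(last) { read(); }
  bool at_end() const { return done_; }
  char32_t operator*() const { return value_; }
  utf16_to32_iterator& operator++() { read(); return *this; }
private:
  void read()
  {
    if (it_ == last_) { done_ = true; return; }
    char32_t c = *it_++;
    if (c >= 0xD800 && c <= 0xDBFF && it_ != last_)
    {
      char32_t low = *it_++;  // low surrogate of the pair
      c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
    }
    value_ = c;
  }
  InputIterator it_, last_;
  char32_t value_ = 0;
  bool done_ = false;
};

// Stage 2: wraps any UTF-32 source, yields UTF-8 code units one at a time.
template <class Source32>
class utf32_to8_iterator
{
public:
  explicit utf32_to8_iterator(Source32 src) : src_(src) { fill(); }
  bool at_end() const { return pos_ == len_; }
  char operator*() const { return buf_[pos_]; }
  utf32_to8_iterator& operator++()
  {
    if (++pos_ == len_) { ++src_; fill(); }  // buffer used up: next code point
    return *this;
  }
private:
  void put(char32_t b) { buf_[len_++] = static_cast<char>(b); }
  void fill()
  {
    pos_ = len_ = 0;
    if (src_.at_end()) return;  // leaves pos_ == len_, i.e. at_end()
    const char32_t c = *src_;
    if (c < 0x80) { put(c); }
    else if (c < 0x800) { put(0xC0 | (c >> 6)); put(0x80 | (c & 0x3F)); }
    else if (c < 0x10000)
    {
      put(0xE0 | (c >> 12)); put(0x80 | ((c >> 6) & 0x3F));
      put(0x80 | (c & 0x3F));
    }
    else
    {
      put(0xF0 | (c >> 18)); put(0x80 | ((c >> 12) & 0x3F));
      put(0x80 | ((c >> 6) & 0x3F)); put(0x80 | (c & 0x3F));
    }
  }
  Source32 src_;
  char buf_[4] = {};
  std::size_t pos_ = 0, len_ = 0;
};

// The composed iterator simply derives from stage 2 layered over stage 1.
template <class InputIterator>
class utf16_to8_iterator
  : public utf32_to8_iterator<utf16_to32_iterator<InputIterator>>
{
public:
  utf16_to8_iterator(InputIterator first, InputIterator last)
    : utf32_to8_iterator<utf16_to32_iterator<InputIterator>>(
        utf16_to32_iterator<InputIterator>(first, last)) {}
};

// Demonstration only: gather the converted code units for inspection.
std::string collect(const std::u16string& s)
{
  std::string out;
  for (utf16_to8_iterator<std::u16string::const_iterator>
         it(s.begin(), s.end()); !it.at_end(); ++it)
    out.push_back(*it);
  return out;
}
```

The conversion itself holds at most one code point and four code units of state, so no intermediate UTF-32 string is created; collect builds a string only so the result can be checked.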
