Document number: | N3336 = 12-0026 |
Date: | 2012-01-13 |
Project: | Programming Language C++, Library Working Group |
Reply-to: | Beman Dawes <bdawes at acm dot org> |
Introduction
String conversion safety rationale
Design paths not taken
Acknowledgements
String interoperability problems and proposed solutions
Problem 1: Strings don't interoperate if encoding differs
Problem 2: Strings don't interoperate with I/O streams if encoding differs
Problem 3: String conversion iterators are not provided
This paper proposes additions to the C++ standard library/TR2 to ease use of Unicode and other string encodings. The motivation is a series of problems with the C++11 standard library.
The full statement of the problems with proposed solutions is given below in String interoperability problems and proposed solutions.
The C++03 versions of these problems were first encountered while providing Unicode support for the internationalization of commercial GIS software. The problems appeared again while working on the Boost Filesystem Library. These problems have become more apparent as compiler support for C++11's additional Unicode support has made it easier to write programs that run up against current limitations.
The proposed solutions are pure additions to the C++11 standard library. No C++03 or C++11 compliant code is broken or otherwise affected by the additions.
This paper does not provide working paper wording. WP wording will be provided if this proposal is accepted in principle.
A "proof-of-concept" implementation of the proposals (and more) is available at github.com/Beman/string-interoperability.
The proposed solutions below make the assumption that it is safe to convert a string of any type and encoding to another type and encoding. The rationale for that assumption follows.
Conversion in either direction between UTF-8 encoded std::string and UTF-32 encoded std::u32string is safe because it is defined by the Unicode Consortium and ISO/IEC 10646 as unambiguous and lossless.
Conversion in either direction between UTF-16 encoded std::u16string and UTF-32 encoded std::u32string is safe because it is defined by the Unicode Consortium and ISO/IEC 10646 as unambiguous and lossless.
Conversion in either direction between UTF-8 encoded std::string and UTF-16 encoded std::u16string is safe because it can be composed from the two previous known safe conversions via an intermediate conversion to and from UTF-32 encoded char32_t characters.
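To make the losslessness claim concrete, the following is a minimal sketch of a UTF-32 to UTF-8 to UTF-32 round trip. The function names to_utf8 and to_utf32 are illustrative assumptions and are not taken from this proposal or from the proof-of-concept implementation. The sketch assumes well-formed input and performs no validation on decoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode UTF-32 code points as UTF-8 (hypothetical helper).
// Precondition: every element is a Unicode scalar value.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t c : in) {
        assert(c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF));
        if (c < 0x80) {                       // 1 byte
            out += static_cast<char>(c);
        } else if (c < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {                              // 4 bytes
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Decode UTF-8 into UTF-32 (hypothetical helper).
// Assumes well-formed input; malformed sequences are not detected.
std::u32string to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t c;
        int extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { c = lead;        extra = 0; }
        else if (lead < 0xE0) { c = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { c = lead & 0x0F; extra = 2; }
        else                  { c = lead & 0x07; extra = 3; }
        ++i;
        for (int k = 0; k < extra && i < in.size(); ++k, ++i)
            c = (c << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out += c;
    }
    return out;
}
```

Because each code point maps to exactly one byte sequence and back, to_utf32(to_utf8(s)) == s holds for any string of valid scalar values.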
The cases of std::string and std::wstring are more complex in that the encoding is not implied by the char and wchar_t value types. It is not necessary, however, to know the encoding of these string types in advance as long as it is known how to convert them to one of the string types with a known encoding. The C++11 standard library requires codecvt<char32_t,char,mbstate_t> and codecvt<wchar_t,char,mbstate_t> facets, so such conversions are always possible using the standard library. In practice, library implementations have additional knowledge that allows such conversions to be more efficient than just calling codecvt facets. To ensure safety, error handling does need to be provided, however, as conversions involving some char and wchar_t encodings can encounter errors. See Problem 3 below for some requested error handling approaches.
Implicit conversion between single characters of different types, as opposed to strings, may require multi-character sequences. No such single character implicit conversions are proposed here.
This proposal deals with C++11 std::basic_string and character types, and with their encodings. The deeper attributes of Unicode characters are not addressed. See Mathias Gaunard's Unicode project for an example of deeper Unicode support.
This proposal does not suggest providing a string type guaranteed to provide UTF-8 encoding. Although experiments with typedef basic_string<unsigned char> u8string; worked well, benefits would be speculative and not based on existing practice.
Another approach would be to provide a utf8_char_traits class and then typedef basic_string<char, utf8_char_traits> u8string;. This approach has not been investigated.
Peter Dimov inspired the idea of string interoperability by arguing that the Boost Filesystem library should treat a path as a single type (i.e. not a template) regardless of character size and encoding.
John Maddock's Unicode conversion iterators demonstrated an easier-to-use, more efficient, and STL friendlier way to perform character type and encoding conversions as an alternative to standard library codecvt facets.
The C++11 standard deserves acknowledgement as it provides the underlying language and library features that allow Unicode string interoperability:
char16_t and char32_t provide Unicode character types and null-terminated character strings with guaranteed encodings.
std::u16string and std::u32string provide library support for Unicode character types and encodings.
u8, u, and U character and string literals ease programming with Unicode character types and encodings.
Standard library strings with different character encodings have different types that do not interoperate.
u16string s16 = u"您好世界";
u32string s32;
s32 = s16;                               // error!
s32 = "foo";                             // error!
s32 = s16.c_str();                       // error!
s32.assign(s16.cbegin(), s16.cend());    // error!
void f(const string&);
f(s32);                                  // error!
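For comparison, this is roughly the kind of helper a user has to write by hand today to get the effect of s32 = s16. It is a sketch only: the name utf16_to_utf32 is hypothetical, and replacing an unpaired surrogate with U+FFFD is an assumed error policy, not one specified by this proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-16 into UTF-32, combining surrogate pairs (hypothetical helper).
// Assumed policy: an unpaired surrogate becomes U+FFFD REPLACEMENT CHARACTER.
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        if (c >= 0xD800 && c <= 0xDBFF) {                 // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;                                      // consume the low surrogate
            } else {
                c = 0xFFFD;                               // high surrogate with no partner
            }
        } else if (c >= 0xDC00 && c <= 0xDFFF) {          // stray low surrogate
            c = 0xFFFD;
        }
        out += c;
    }
    // UTF-32 never needs more code units than the UTF-16 it was decoded from.
    assert(out.size() <= in.size());
    return out;
}
```

With such a helper the assignment becomes s32 = utf16_to_utf32(s16);, which is exactly the per-call-site boilerplate that the interoperability overloads are meant to remove.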
The encoding of basic_string instantiations can be determined for the types under discussion. It is either implicit in the string's value_type or can be determined via the locale.
In Boost Filesystem Version 3, and in the filesystem proposal before the C++ committee, class path solves some of the string interoperability problems, albeit in a limited context. A function that is declared like this:
void f(const path&);
can be called like this:
f("Meow");
f(L"Meow");
f(u8"Meow");
f(u"Meow");
f(U"Meow");
// ... many additional variations such as basic_strings and iterators
This string interoperability support has been a success. It does, however, raise the question of why std::basic_string isn't providing the interoperability support. Users are misusing paths as general string containers because they provide interoperability. The string interoperability cat is out of the bag. The toothpaste is out of the tube.
See Boost.Filesystem V3 class path for an example of how such interoperability might be achieved.
Experience with Boost.Filesystem V3 class path has demonstrated that string interoperability brings a considerable simplification and improvement to internationalized user code, but that having to provide interoperability without the resolution of the issues presented here is a band-aid.
String interoperability will be easier to specify, implement, and use if the string interoperability iterators proposed below are accepted.
The approach is to add additional std::basic_string overloads to the functions most likely to benefit from interoperability. The overloads are in the form of function templates with sufficient restrictions on overload resolution participation (i.e. enable_if) that the existing C++11 functions are always selected if the value type of the argument is the same as or convertible to the std::basic_string type's value_type.
The semantics of the added signatures are the same as those of the original signatures, except that arguments of the template parameter type have their value converted to the type and encoding of basic_string::value_type.
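The overload restriction described above can be sketched with a toy wrapper class. Everything here is hypothetical: interop_string and is_foreign_string are illustrative names, not part of the proposal, and the "conversion" is a placeholder that only handles ASCII so that the example stays short. The point is solely how enable_if keeps the added template out of overload resolution when the existing signature applies.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// True only if Source has a nested value_type that differs from CharT.
// Types without value_type (pointers, arrays) select the primary template.
template <class Source, class CharT, class = void>
struct is_foreign_string : std::false_type {};

template <class Source, class CharT>
struct is_foreign_string<Source, CharT,
    typename std::enable_if<
        !std::is_same<typename Source::value_type, CharT>::value>::type>
    : std::true_type {};

// Hypothetical stand-in for basic_string with one added overload.
template <class CharT>
class interop_string {
public:
    // Existing-style signature: same value type, no conversion.
    interop_string& operator=(const std::basic_string<CharT>& s) {
        data_ = s;
        converted_ = false;
        return *this;
    }

    // Added signature: participates only for strings of another value type.
    template <class Source>
    typename std::enable_if<is_foreign_string<Source, CharT>::value,
                            interop_string&>::type
    operator=(const Source& s) {
        data_.clear();
        for (auto unit : s) {
            // Placeholder conversion: correct for ASCII only. A real
            // implementation would transcode between encodings here.
            assert(static_cast<unsigned long>(unit) < 0x80);
            data_ += static_cast<CharT>(unit);
        }
        converted_ = true;
        return *this;
    }

    const std::basic_string<CharT>& str() const { return data_; }
    bool converted() const { return converted_; }  // which overload ran last

private:
    std::basic_string<CharT> data_;
    bool converted_ = false;
};
```

Assigning a std::u16string to an interop_string<char32_t> selects the template, while assigning a std::u32string or, for interop_string<char>, a narrow string literal still selects the original signature, because the template is removed from the candidate set in those cases.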
The std::basic_string functions given additional overloads are:
operator=, operator+=, append, and assign signatures.
template <class T> unspecified_iterator c_str(), returning an unspecified iterator with value_type of T.
begin() and end(). Similar to c_str().
To keep the number and complexity of overloads manageable, the proof-of-concept implementation does not provide any way to specify error handling policies, or string and wstring encoding.
Not every one of the added signatures needs to be able to control error handling and encoding. The need is particularly rare in environments where UTF-8 is the narrow character encoding and UTF-16 is the wide character encoding. A subset, possibly just c_str(), begin(), and end(), with error handling and encoding parameters or arguments, suitably defaulted, may well be sufficient.
Because full implicit interoperability requires many additional signatures to be added to basic_string, it will certainly be appropriate to discuss limiting changes to the key areas of need. For example, constructors and operator= are much more likely to need interoperability than operator+=, append, or assign signatures.
I/O streams do not accept strings of different character types.
A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:
#include <iostream>
int main()
{
  std::cout << U"您好世界";   // error in C++11!
}
This code should "just work", even though the type of U"您好世界" is const char32_t*, not const char*, as long as the encoding of char supports 您好世界. Even if those characters are not supported by default encodings, alternatives like UTF-8 are available.
The code does "just work" with the proof-of-concept implementation of this proposal. On Linux, with default char encoding of UTF-8, execution produces the expected 您好世界 output. On Windows, the console doesn't support full UTF-8, so the output can be piped to a file or to a program which does handle UTF-8 correctly. And, yes, that does work correctly with the proof-of-concept implementation of this proposal.
Add additional function templates to those in 27.7.3.6.4 [ostream.inserters.character], Character inserter function templates, to cover the case where the argument character type differs from charT and is not char, signed char, unsigned char, const char*, const signed char*, or const unsigned char*. (The specified types are excluded because they are covered by existing signatures.)
The semantics of the added signatures are the same as those of the original signatures, except that arguments shall be converted to the type and encoding of the stream.
Do the same for the character extractors in 27.7.2.2.3 [istream::extractors], basic_istream::operator>>.
Do the same for the two std::basic_string inserters and extractors in 21.4.8.9 [string.io], Inserters and extractors.
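A rough idea of what one such added inserter does can be given with a user-level sketch. This is not the proposed wording or the proof-of-concept code: it is a single non-template overload for const char32_t* on narrow streams, it assumes the stream's char encoding is UTF-8, and it assumes every code point is a valid Unicode scalar value. The helper render is likewise hypothetical and exists only to show a call.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical inserter: writes a null-terminated UTF-32 string to a narrow
// stream, converting each code point to UTF-8 (assumed stream encoding).
std::ostream& operator<<(std::ostream& os, const char32_t* s) {
    for (; *s != 0; ++s) {
        const char32_t c = *s;
        char buf[4];
        int n = 0;
        if (c < 0x80) {
            buf[n++] = static_cast<char>(c);
        } else if (c < 0x800) {
            buf[n++] = static_cast<char>(0xC0 | (c >> 6));
            buf[n++] = static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            buf[n++] = static_cast<char>(0xE0 | (c >> 12));
            buf[n++] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            buf[n++] = static_cast<char>(0x80 | (c & 0x3F));
        } else {
            buf[n++] = static_cast<char>(0xF0 | (c >> 18));
            buf[n++] = static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            buf[n++] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            buf[n++] = static_cast<char>(0x80 | (c & 0x3F));
        }
        os.write(buf, n);  // emit the 1 to 4 bytes for this code point
    }
    return os;
}

// Demonstration helper: renders a UTF-32 literal through the inserter.
std::string render(const char32_t* s) {
    std::ostringstream os;
    os << s;
    return os.str();
}
```

A standard library implementation would instead constrain a function template on the character type and consult the stream's encoding, but the observable effect for a UTF-8 narrow stream is the same: the characters are written, not a pointer value.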
Conversion between character types and their encodings using current standard library facilities such as std::codecvt, std::locale, and std::wstring_convert has multiple problems:
codecvt facets don't easily compose into a complete conversion from one encoding to another. Such composition is existing practice in C libraries like ICU. UTF-32 is the obvious choice for the common encoding to pass between codecs.
Users are exposed to std::locale and code conversion, even when these are implementation details that should be hidden from the application.
The generalization of the std::basic_string function c_str is:
template <class T> unspecified_iterator c_str() const;
Given a std::string named s8, this allows a user to write s8.c_str<char16_t>() to obtain an iterator with a value type of char16_t. Implementing this function generically using the current standard library would be difficult, and would involve the creation of a temporary string. The full implementation with the proposed solution is simply:
template <class T>
converting_iterator<const_iterator, value_type, by_range, T> c_str() const
{
  return converting_iterator<const_iterator, value_type, by_range, T>(cbegin(), cend());
}
No temporary string is created, and none of the other problems listed above are present either. The solution is generally useful for user-defined types, and not just for implementations of the standard library.
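The "no temporary string" property can be illustrated with a single-stage conversion iterator. The class below is a hypothetical sketch, not the converting_iterator of this proposal: it only decodes UTF-16 to UTF-32, it requires the underlying iterator to be at least a forward iterator because it peeks one element ahead, and it assumes well-formed UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical input iterator that decodes UTF-16 to UTF-32 on the fly.
// Assumes well-formed input: every high surrogate is followed by a low one.
template <class Iter>
class utf16_to_utf32_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t                value_type;
    typedef std::ptrdiff_t          difference_type;
    typedef const char32_t*         pointer;
    typedef char32_t                reference;   // values are computed, not stored

    explicit utf16_to_utf32_iterator(Iter pos) : pos_(pos) {}

    char32_t operator*() const {
        char32_t c = *pos_;
        if (c >= 0xD800 && c <= 0xDBFF) {         // high surrogate: peek at its partner
            Iter next = pos_;
            ++next;
            const char32_t low = *next;
            assert(low >= 0xDC00 && low <= 0xDFFF);
            c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
        }
        return c;
    }

    utf16_to_utf32_iterator& operator++() {
        const char32_t c = *pos_;
        ++pos_;
        if (c >= 0xD800 && c <= 0xDBFF) ++pos_;   // skip the low surrogate too
        return *this;
    }

    utf16_to_utf32_iterator operator++(int) {
        utf16_to_utf32_iterator tmp = *this;
        ++*this;
        return tmp;
    }

    bool operator==(const utf16_to_utf32_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const utf16_to_utf32_iterator& o) const { return pos_ != o.pos_; }

private:
    Iter pos_;
};
```

A consumer such as the std::u32string iterator-range constructor or std::distance reads code points directly from the UTF-16 source through a pair of these iterators, so no intermediate buffer is allocated.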
Other problems become easier to solve with converting_iterator. For example, the Filesystem library's class path in N3239 has many functions with an argument in the form const codecvt_type& cvt=codecvt() that could be eliminated by either direct or indirect use of converting_iterator.
Boost Regex for many years has included a set of Unicode conversion iterators as an implementation detail. Although these do not provide composition, they do demonstrate the technique of using encoding conversion iterators to avoid creation of temporary strings.
This solution is based on the proof-of-concept implementation. Input iterator requirements can probably be loosened to bidirectional, but that hasn't been tested yet.
The preliminaries begin with end-detection policy classes, since strings use null termination, size, or half-open ranges to determine the end of a sequence.
template <class InputIterator> class by_null;
template <class InputIterator> class by_size;
template <class InputIterator> class by_range;
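One possible shape for these policies is sketched below. Only the three class template names come from the declarations above; the member functions is_end() and advance(), the constructors, and the count_units helper are assumptions made for illustration and may differ from the proof-of-concept implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// End is reached when the current element is the null character.
template <class InputIterator>
class by_null {
public:
    bool is_end(InputIterator it) const { return *it == 0; }
    void advance() {}
};

// End is reached after a fixed number of elements.
template <class InputIterator>
class by_size {
public:
    explicit by_size(std::size_t n) : remaining_(n) {}
    bool is_end(InputIterator) const { return remaining_ == 0; }
    void advance() { assert(remaining_ > 0); --remaining_; }
private:
    std::size_t remaining_;
};

// End is reached when the iterator equals a stored end iterator.
template <class InputIterator>
class by_range {
public:
    explicit by_range(InputIterator end) : end_(end) {}
    bool is_end(InputIterator it) const { return it == end_; }
    void advance() {}
private:
    InputIterator end_;
};

// Hypothetical consumer: counts elements up to the end, however the
// supplied policy defines "end".
template <class InputIterator, class EndPolicy>
std::size_t count_units(InputIterator it, EndPolicy policy) {
    std::size_t n = 0;
    for (; !policy.is_end(it); ++it, policy.advance()) ++n;
    return n;
}
```

Separating end detection from decoding in this way lets one codec implementation serve null-terminated character arrays, pointer-plus-length pairs, and iterator ranges alike.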
Codec templates handle actual conversion to and from UTF-32. The primary templates are:
template <class InputIterator, class FromCharT, template<class> class EndPolicy>
class to32_iterator;
template <class InputIterator, class ToCharT>
class from32_iterator;
The standard library would provide specializations for char, wchar_t, char16_t, and char32_t. Presumably users could provide specializations for UDTs, but that hasn't been tested yet. The char and wchar_t specializations provide mechanisms to select the encoding. Since this is a new component, the char default encoding could be UTF-8 rather than locale based and no existing code would be broken.
The actual converting_iterator primary template is simply:
template <class InputIterator, class FromCharT, template<class> class EndPolicy, class ToCharT>
class converting_iterator
  : public from32_iterator<to32_iterator<InputIterator, FromCharT, EndPolicy>, ToCharT>
{
public:
  using from32_iterator::from32_iterator;
};
Specializations may be provided, but aren't required. The proof-of-concept implementation doesn't use inherited constructors because of lack of compiler support.