Document number: | N3398=12-0088 |
Date: | 2012-09-19 |
Project: | Programming Language C++, Library Working Group |
Reply-to: | Beman Dawes <bdawes at acm dot org> |
This paper proposes library components to ease string interoperability problems for Unicode and other string encodings. These problems occur with the current C++11 standard library. Read the Components... section for a full description of problems or look at some simple examples here.
I first encountered the C++03 version of string interoperability problems while providing Unicode support for the internationalization of commercial GIS software. These problems appeared again while working on the Boost Filesystem Library. They have become more apparent as compiler support for C++11's additional Unicode support has made it easier to write programs that run up against current limitations.
Work began on the proposal when the Library Working Group requested that string encoding conversion arguments be removed from class path in the initial C++11 proposal for a Filesystem library. That sparked this proposal as a far more general solution to string encoding conversion problems than a Filesystem specific proposal.
The proposed components are separable. Any of the components except codecs and codec helpers could be removed, although ease-of-use would suffer as a result.
The proposed components are suitable for a C++ standard library Technical Specification (TS), either standalone or as part of a larger TS.
The proposed components are pure additions. No C++03 or C++11 headers are changed and no current user or standard library code is broken, subject only to the usual namespace discipline caveats.
Proposed wording is provided. The proposed wording relies only on C++11 features. Should a basic_string-reference library TS be accepted, it might be used to reduce the number of signatures in this proposal.
A "proof-of-concept" implementation of the proposals (and more) is available at github.com/beman/string-interoperability.
Introduction
Revision history
Proposed components and their motivation
Codecs and their helpers
conversion_iterator class template
copy_string algorithm
make_string function template
to_string conversion functions, converting stream inserters and extractors
Explicit UTF-8 encoded types char8_t and u8string
Design
Design paths not taken
Existing practice with string interoperability
Existing practice with conversion iterators
Acknowledgements
TODO List
Proposed Wording
String interoperation
Header <string_interop.hpp> synopsis
Codecs
Class default_codec
Requirements on codec classes
end-of-sequence iterator requirements
from_iterator requirements
to_iterator requirements
Constructor requirements
select_codec
UTF-8 typedefs (Informative)
Class template conversion_iterator
Synopsis
Constructors
Algorithm copy_string
make_string function templates
to_string function templates
UTF-8 string support
Stream inserters
Stream extractors
This paper is a complete rewrite of N3336, Adapting Standard Library Strings and I/O to a Unicode World. It reflects C++ committee feedback from the LWG's review of N3336 and further analysis and experimentation.
Provide an iterator-based composable solution to string encoding and type conversion that works well in generic code and does not require heap allocated temporary strings or buffers.
These low-level components provide the foundation for most of the higher level components. They provide an abstraction of string encoding and type conversion that frees higher-level components from details.
- The library provides codecs for the native char and wchar_t encodings, plus UTF-8, UTF-16, and UTF-32.
- Implementers and users may supply additional codecs.
Specific motivations include:
- Iterator-based because iterators have been far more productive than the painful and problematic string based codecvt facet interface.
- Iterator based codecs allocate no heap memory.
- Codecs are composable, so that for all possible conversions between n encodings, the number of codecs required is n rather than n^2, yet there are no temporary strings even for composed conversions.
- The native narrow codec, which can be a problem because its encoding is runtime dependent, is built on top of <cuchar>, easing implementation and ensuring consistency.
- The codec interface uses no virtual functions and is simpler than the codecvt facet interface, a constant source of irritations and mistakes.
conversion_iterator class template
Provides an iterator adapter that performs character type and encoding conversion on-the-fly.
While codecs (see below) offer worthwhile benefits, they essentially provide low-level, encoding specific, iterators. The conversion_iterator class template provides a simple iterator adaptor that composes two codecs regardless of encoding into a single, easy-to-use iterator.
With conversion_iterator, implementation of many mid and high level character type and encoding conversions becomes trivial. It is useful to standard, user, and third-party library implementers, as it provides a vocabulary iterator type that is far easier to use than roll-your-own conversions based on codecvt facets.
copy_string algorithm
Provides an algorithm like std::copy, except performing type and encoding conversion as it copies.
Solves many end user problems.
Provides a simple way to both specify and implement other high-level convenience functions.
make_string function template
Provides a generic string type and encoding conversion factory function.
to_string conversion functions, converting stream inserters and extractors
Provide easy-to-use (automatic, in the case of inserters and extractors) solutions to irritating string interoperability problems, in the style of similar standard library functionality.
With the C++11 standard library:
    int i = 50;                        // OK
    long j = i;                        // OK
    cout << j;                         // OK
    string s = to_string(i);           // OK, C++11 provides this overload
    wstring t = to_wstring(s);         // error!
    u8string u = to_u8string(t);       // error!
    u16string v = to_u16string(s);     // error!
    u32string w = to_u32string(v);     // error!
    string x = to_string(v.c_str());   // error!
    string y = to_string(U"50");       // error!
    std::cout << t;                    // error!
With the proposal (and the unmodified C++11 standard library):
    int i = 50;                        // OK
    long j = i;                        // OK
    cout << j;                         // OK
    string s = to_string(i);           // OK
    wstring t = to_wstring(s);         // OK
    u8string u = to_u8string(t);       // OK
    u16string v = to_u16string(s);     // OK
    u32string w = to_u32string(v);     // OK
    string x = to_string(v.c_str());   // OK
    string y = to_string(U"50");       // OK
    std::cout << t;                    // OK
char8_t and u8string
Specifies a character type and a string type that are unambiguously UTF-8 encoded.
UTF-8 is the most important, and often the only, byte-sized character encoding required by many internationalized applications. Yet it is the only one of the critical Unicode encodings (UTF-8, UTF-16, UTF-32) that does not have its own C++ character type. This causes endless technical problems, such as the inability to overload on a UTF-8 character type, for those who want to write portable code. It causes developers who otherwise think highly of C++ to believe the standards committee is stuck in the distant past when dinosaurs roamed the earth.
The proposed string interoperability facilities run afoul of the lack of a UTF-8 character type because they use generic programming techniques that depend on a one-to-one relationship between character value types and their encodings.
This feature is far more speculative than the rest of the proposal. It has been implemented and has been used in an experimental branch of the Filesystem library. But there is no user experience whatsoever. It leaves u8 string literals twisting in the wind, and that's a serious problem. It needs much further study and discussion before moving forward.
The copy_string algorithm was a starting point for the design. The algorithm was arrived at by analyzing numerous real-world string conversion problems encountered by Boost Filesystem and while internationalizing various industrial applications. During that analysis, it was observed that the std::copy algorithm would be a common solution to those problems if it could be given generic versions of John Maddock's Unicode conversion iterator adaptors used in his Boost Regex implementation. The conversion_iterator and codec designs evolved as the underlying conversion abstractions needed to support copy_string.
The key design for composition of codecs is the use of UTF-32 as a common intermediate encoding that works without an intermediate temporary string when applied at the iterator level. This is the same approach, albeit at compile time rather than run time, taken by the International Components for Unicode (ICU) library.
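The UTF-32 pivot can be illustrated with a minimal sketch. The class names utf8_from_iterator and utf16_to_iterator and the helper utf8_to_utf16 are hypothetical and are not the proposed interface; the sketch assumes well-formed UTF-8 input and performs no validation. One adaptor decodes UTF-8 to code points, the other encodes code points as UTF-16, and the two are chained so that no intermediate UTF-32 string is ever materialized.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical decoder: yields one UTF-32 code point per increment.
// Assumes well-formed UTF-8; no validation is attempted.
template <class InputIterator>
class utf8_from_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char32_t* pointer;
    typedef const char32_t& reference;

    utf8_from_iterator() : cur_(), end_(), at_end_(true), cp_(0) {}  // end-of-sequence value
    utf8_from_iterator(InputIterator first, InputIterator last)
        : cur_(first), end_(last), at_end_(false), cp_(0) { read(); }

    reference operator*() const { return cp_; }
    utf8_from_iterator& operator++() { read(); return *this; }
    utf8_from_iterator operator++(int) { utf8_from_iterator old(*this); read(); return old; }
    bool operator==(const utf8_from_iterator& r) const {
        return at_end_ == r.at_end_ && (at_end_ || cur_ == r.cur_);
    }
    bool operator!=(const utf8_from_iterator& r) const { return !(*this == r); }

private:
    void read() {
        if (cur_ == end_) { at_end_ = true; return; }
        const unsigned char lead = static_cast<unsigned char>(*cur_++);
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        cp_ = extra == 0 ? lead : lead & (0x3F >> extra);   // payload bits of the lead byte
        while (extra-- > 0 && cur_ != end_)                  // six payload bits per trailing byte
            cp_ = (cp_ << 6) | (static_cast<unsigned char>(*cur_++) & 0x3F);
    }
    InputIterator cur_, end_;
    bool at_end_;
    char32_t cp_;
};

// Hypothetical encoder: reads code points from an end-of-sequence style
// source iterator and yields UTF-16 code units.
template <class U32Iterator>
class utf16_to_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char16_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char16_t* pointer;
    typedef const char16_t& reference;

    utf16_to_iterator() : src_(), at_end_(true), unit_(0), pending_(0) {}
    explicit utf16_to_iterator(U32Iterator src)
        : src_(src), at_end_(false), unit_(0), pending_(0) { read(); }

    reference operator*() const { return unit_; }
    utf16_to_iterator& operator++() { read(); return *this; }
    utf16_to_iterator operator++(int) { utf16_to_iterator old(*this); read(); return old; }
    bool operator==(const utf16_to_iterator& r) const {
        return at_end_ == r.at_end_ &&
               (at_end_ || (src_ == r.src_ && pending_ == r.pending_));
    }
    bool operator!=(const utf16_to_iterator& r) const { return !(*this == r); }

private:
    void read() {
        if (pending_) { unit_ = pending_; pending_ = 0; return; }  // low surrogate still owed
        if (src_ == U32Iterator()) { at_end_ = true; return; }
        char32_t cp = *src_;
        ++src_;
        if (cp < 0x10000) { unit_ = static_cast<char16_t>(cp); return; }
        cp -= 0x10000;                                             // split into a surrogate pair
        unit_ = static_cast<char16_t>(0xD800 + (cp >> 10));
        pending_ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    U32Iterator src_;
    bool at_end_;
    char16_t unit_, pending_;
};

// Hypothetical helper: chains the two adaptors; char32_t values exist only one at a time.
inline std::u16string utf8_to_utf16(const std::string& s) {
    typedef utf8_from_iterator<std::string::const_iterator> from_type;
    typedef utf16_to_iterator<from_type> to_type;
    return std::u16string(to_type(from_type(s.begin(), s.end())), to_type());
}
```

Adding a third encoding to such a scheme would mean writing one more decoder and one more encoder, not a converter to every existing encoding.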
This proposal deals with C++11 std::basic_string, standard character types, and their encodings. The deeper attributes of Unicode characters are not addressed. See Mathias Gaunard's Unicode project for an example of deeper Unicode support.
This proposal provides compile-time solutions. It does not provide runtime solutions such as provided by the ICU library.
This proposal provides work-arounds for C++11's lack of UTF-8 strings. Several users have argued that instead of work-arounds, the C++ standard should require UTF-8 encoding for both C-style char strings and std::string. This proposal assumes that is too great a leap forward at this time.
Boost Filesystem Version 3's class path solves some of the string interoperability problems, albeit in limited context. A function that is declared like this:
    void f(const path&);
can be called like this:
    f("Meow");
    f(L"Meow");
    f(u8"Meow");
    f(u"Meow");
    f(U"Meow");
    // ... many additional variations such as basic_strings and iterators
This string interoperability support has been a success. It does, however, raise the question of why std::basic_string isn't providing the interoperability support. Users are misusing paths as general string containers because they provide interoperability. The string interoperability cat is out of the bag. The toothpaste is out of the tube.
See Boost.Filesystem V3 class path for an example of how such interoperability might be achieved.
Experience with Boost.Filesystem V3 class path has demonstrated that string interoperability brings a considerable simplification and improvement to internationalized user code, but that having to provide interoperability without the resolution of the issues presented here is a band-aid.
Boost Regex for many years has included a set of Unicode conversion iterators as an implementation detail. Although these do not provide composition, they do demonstrate the technique of using encoding conversion iterators to avoid creation of temporary strings.
Peter Dimov inspired the idea of string interoperability by arguing that the Boost Filesystem library should treat a path as a single type (i.e. not a template) regardless of character size and encoding. The experience gained with that approach led to a much clearer understanding of where to draw the line between functionality provided by a library such as Filesystem, and the standard library (or a TS) itself.
John Maddock's Unicode conversion iterators demonstrated an easy-to-use, more efficient, and STL friendly way to perform character type and encoding conversions.
Yakov Galka suggested attacking string interoperability with free functions to reduce or eliminate changes to basic_string.
The C++11 standard deserves acknowledgement as it provides the underlying language and library features that allow Unicode string interoperability:
- char16_t and char32_t provide Unicode character types and null-terminated character strings with guaranteed encodings.
- std::u16string and std::u32string provide library support for Unicode character types and encodings.
- u8, u, and U character and string literals ease programming with Unicode character types and encodings.
To Do
Italic text highlighted in yellow is commentary and not part of the proposal.
The wording assumes the whole of the ISO C++ Standard Library introduction [lib.library] is included by reference.
This library provides facilities that allow interoperation between strings of differing types and encodings, and ease the use of strings with UTF-8 encoding. The following encodings are supported:
    namespace std {

      template <> struct char_traits<unsigned char>;

      namespace tbd {  // tbd is to be decided

        // UTF-8 typedefs [str-x.utf8-typedefs]
        typedef unsigned char char8_t;
        typedef basic_string<char8_t> u8string;

        // codecs [str-x.codec]
        class narrow;
        class wide;
        class utf8;
        class utf16;
        class utf32;
        class default_codec;  // See [str-x.codec.default]

        // select_codec [str-x.codec.select]
        template <class charT> struct select_codec;
        template <> struct select_codec<char>     { typedef narrow type; };
        template <> struct select_codec<wchar_t>  { typedef wide type; };
        template <> struct select_codec<char8_t>  { typedef utf8 type; };
        template <> struct select_codec<char16_t> { typedef utf16 type; };
        template <> struct select_codec<char32_t> { typedef utf32 type; };

        // conversion_iterator [str-x.cvt-iter]
        template <class ToCodec, class FromCodec, class InputIterator>
          class conversion_iterator;

        // copy_string algorithm [str-x.copy_string]
        template<class InputIterator, class FromCodec, class OutputIterator, class ToCodec>
          OutputIterator copy_string(InputIterator first, InputIterator last, OutputIterator result);

        // make_string function templates [str-x.make_string]
        template <class ToCodec, class FromCodec = default_codec,
                  class ToString = std::basic_string<typename ToCodec::value_type>,
                  class FromString>
          ToString make_string(const FromString& ctr);
        template <class ToCodec, class FromCodec = default_codec,
                  class ToString = std::basic_string<typename ToCodec::value_type>,
                  class InputIterator>
          ToString make_string(InputIterator begin);
        template <class ToCodec, class FromCodec = default_codec,
                  class ToString = std::basic_string<typename ToCodec::value_type>,
                  class InputIterator>
          ToString make_string(InputIterator begin, std::size_t sz);
        template <class ToCodec, class FromCodec = default_codec,
                  class ToString = std::basic_string<typename ToCodec::value_type>,
                  class InputIterator, class InputIterator2>
          ToString make_string(InputIterator begin, InputIterator2 end);

        // to_string function templates [str-x.to_string]
        template <class FromCodec = default_codec,
                  class ToString = std::basic_string<char>, class FromString>
          ToString to_string(const FromString& s);
        template <class FromCodec = default_codec,
                  class ToString = std::basic_string<char>, class InputIterator>
          ToString to_string(InputIterator begin);
        template <class FromCodec = default_codec,
                  class ToString = std::basic_string<char>, class InputIterator>
          ToString to_string(InputIterator begin, std::size_t sz);
        template <class FromCodec = default_codec,
                  class ToString = std::basic_string<char>, class InputIterator>
          ToString to_string(InputIterator begin, InputIterator end);

        // Repeat pattern for to_wstring, to_u8string, to_u16string, to_u32string

        // UTF-8 string support [str-x.utf8]
        inline const char8_t* u8(const char* s) noexcept;
        inline const char8_t* u8(const string& s) noexcept;
        inline const char* u8(const char8_t* s) noexcept;
        inline const char* u8(const u8string& s) noexcept;

      }  // namespace tbd

      // stream inserters [str-x.cvt.ins]
      template <class Ostream, class charT, class Traits, class Allocator>
        Ostream& operator<<(Ostream& os, const basic_string<charT, Traits, Allocator>& str);
      basic_ostream<char>& operator<<(basic_ostream<char>& os, const wchar_t* p);
      basic_ostream<char>& operator<<(basic_ostream<char>& os, const char16_t* p);
      basic_ostream<char>& operator<<(basic_ostream<char>& os, const char32_t* p);

    }  // namespace std
Codecs are classes that package one typedef and three class templates. They contain no data or function members and never need to be instantiated. Codec classes may be predefined or user defined. All codec classes except default_codec shall meet the codec requirements [str-x.codec.req].
Table: Predefined codec classes
Class | value_type | Encoding |
narrow | char | Default locale's char encoding. |
wide | wchar_t | Implementation specific wchar_t encoding. |
utf8 | char8_t | UTF-8 |
utf16 | char16_t | UTF-16 |
utf32 | char32_t | UTF-32 |
default_codec | N/A | N/A |
Class default_codec is a pseudo-codec that provides lazy select_codec selection. It is for use as a default for codec template parameters that appear before the template parameter that determines charT. Class default_codec is not required to meet the codec class requirements.
    class default_codec {
    public:
      template <class charT>
      struct codec {
        typedef typename select_codec<charT>::type type;
      };
    };
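The lazy selection can be seen in a small sketch. The codec classes here are empty stand-ins (a real codec would also supply value_type, from_iterator, and to_iterator), and resolved_codec is a hypothetical helper introduced only for illustration. An explicitly named codec resolves to itself whatever the character type, while default_codec defers to select_codec once charT is known.

```cpp
#include <type_traits>

// Empty stand-ins for the predefined codec classes. Each names itself
// through the nested codec template, ignoring charT.
struct narrow { template <class charT> struct codec { typedef narrow type; }; };
struct wide   { template <class charT> struct codec { typedef wide type; }; };
struct utf16  { template <class charT> struct codec { typedef utf16 type; }; };
struct utf32  { template <class charT> struct codec { typedef utf32 type; }; };

template <class charT> struct select_codec;
template <> struct select_codec<char>     { typedef narrow type; };
template <> struct select_codec<wchar_t>  { typedef wide type; };
template <> struct select_codec<char16_t> { typedef utf16 type; };
template <> struct select_codec<char32_t> { typedef utf32 type; };

// The pseudo-codec: its nested codec template looks up select_codec
// only when a character type is finally supplied.
class default_codec {
public:
    template <class charT>
    struct codec { typedef typename select_codec<charT>::type type; };
};

// Hypothetical helper: the codec actually used for a source of charT.
template <class FromCodec, class charT>
struct resolved_codec {
    typedef typename FromCodec::template codec<charT>::type type;
};

// default_codec follows the character type ...
static_assert(std::is_same<resolved_codec<default_codec, wchar_t>::type, wide>::value,
              "default_codec selects wide for wchar_t");
// ... while an explicit codec overrides it.
static_assert(std::is_same<resolved_codec<utf16, char>::type, utf16>::value,
              "an explicit codec resolves to itself");
```

This is why FromCodec can precede the deduced FromString or InputIterator parameter in the make_string and to_string signatures and still default sensibly.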
Codecs are required to contain the following:
    typedef implementation-defined value_type;

    template <class charT>
    struct codec {
      typedef codec-class-name type;
    };

    template <class InputIterator>
    class from_iterator {
    public:
      from_iterator();
      from_iterator(InputIterator begin);
      from_iterator(InputIterator begin, size_t sz);
      template <class InputIterator2>
        from_iterator(InputIterator begin, InputIterator2 end);
    };

    template <class InputIterator>
    class to_iterator {
    public:
      to_iterator();
      to_iterator(InputIterator begin);
    };
An end-of-sequence iterator becomes equal to the end-of-sequence value upon reaching the end of the sequence being iterated over. An end-of-sequence iterator constructor with no arguments constructs the end-of-sequence value, which is the only legitimate iterator value to be used for the end condition. The behavior of operator* on an iterator with the end-of-sequence value is undefined. For any other iterator value a const T& is returned. The behavior of operator-> for an iterator with the end-of-sequence value is undefined. For any other iterator value a const T* is returned. The behavior of operator++() for an iterator with the end-of-sequence value is undefined.
Two iterators with the end-of-sequence value are equal. An iterator with the end-of-sequence value is not equal to an iterator that does not have the end-of-sequence value. Two iterators that do not have the end-of-sequence value are equal iff they point to the same element of the sequence.
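These requirements follow the same pattern as std::istream_iterator. As a minimal sketch, the hypothetical class eos_iterator below (not part of the proposal) walks a null-terminated array and turns into the end-of-sequence value when it reaches the terminator, so a default-constructed iterator serves as the end of any range.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical illustration of the end-of-sequence iterator requirements.
template <class T>
class eos_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef T value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const T* pointer;
    typedef const T& reference;

    eos_iterator() : p_(nullptr) {}                          // the end-of-sequence value
    explicit eos_iterator(const T* p) : p_(p) { settle(); }  // range ends at the first T()

    reference operator*() const { return *p_; }              // undefined at end-of-sequence
    pointer operator->() const { return p_; }                // undefined at end-of-sequence
    eos_iterator& operator++() { ++p_; settle(); return *this; }
    eos_iterator operator++(int) { eos_iterator old(*this); ++*this; return old; }

    // Two end-of-sequence values are equal; otherwise compare positions.
    friend bool operator==(const eos_iterator& a, const eos_iterator& b) { return a.p_ == b.p_; }
    friend bool operator!=(const eos_iterator& a, const eos_iterator& b) { return !(a == b); }

private:
    // Become the end-of-sequence value once the terminator is reached.
    void settle() { if (p_ && *p_ == T()) p_ = nullptr; }
    const T* p_;
};
```

Because the end is discovered during iteration, an adaptor layered on top of such an iterator needs only a begin argument, which is what the to_iterator constructor requirements rely on.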
from_iterator requirements [str-x.codec.req.from]
The class template from_iterator is an input iterator that is an adaptation of an InputIterator template parameter whose value_type is the same as the parent codec class value_type. It has a value_type of char32_t and meets the input iterator requirements of the C++ standard and the end-of-sequence iterator requirements ([str-x.codec.req.eos]).
to_iterator requirements [str-x.codec.req.to]
The class template to_iterator is an input iterator that is an adaptation of an InputIterator template parameter whose value_type is char32_t. It has a value_type that is the same as the parent codec class value_type. It meets the input iterator requirements of the C++ standard and the end-of-sequence iterator requirements ([str-x.codec.req.eos]).
    from_iterator();
Effects: Constructs an iterator with the end-of-sequence iterator value ([str-x.codec.req.eos]).
    from_iterator(InputIterator begin);
Effects: Constructs an iterator for the half-open range that begins at begin and ends at the first element with a value of value_type().
    from_iterator(InputIterator begin, size_t sz);
Effects: Constructs an iterator for the half-open range that begins at begin and ends at begin + sz.
    template <class InputIterator2>
      from_iterator(InputIterator begin, InputIterator2 end);
Effects: Constructs an iterator for the half-open range that begins at begin and ends at end.
Remarks: Shall not participate in overload resolution unless InputIterator and InputIterator2 are the same type.
    to_iterator();
Effects: Constructs an object with the end-of-sequence iterator value ([str-x.codec.req.eos]).
    to_iterator(InputIterator begin);
InputIterator is required to meet the end-of-sequence iterator requirements ([str-x.codec.req.eos]).
Effects: Constructs an iterator for the half-open range that begins at begin and ends when the end-of-sequence iterator value is reached.
To be supplied.
In portable internationalized applications, use of UTF-8 encoded C-style array of char strings and std::string is problematic for passing arguments to functions which assume the encoding is the native narrow character encoding. Examples include arguments representing filenames for I/O functions and arguments representing content for web sites. Disciplined conversion of all narrow character strings to UTF-8 encoding within an application is a partial solution, but is not enforceable via the C++ language type system and does not help with third-party or standard library functions that assume char strings use native narrow encoding.
The char8_t and u8string typedefs allow the C++ type system to distinguish between native encoded and UTF-8 encoded character strings. The actual type used for char8_t is unsigned char because the C++ language rules require that the representation of the underlying bytes for char and unsigned char are the same (C++ standard: [basic.types]). This allows conversion by compile-time casts with no runtime cost.
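The effect on overload resolution can be shown with a short sketch. All names below are hypothetical: utf8_char stands in for the proposed char8_t typedef (C++20 later made char8_t a keyword, so that spelling would not compile there), and as_utf8, as_narrow, and encoding_of exist only for this illustration.

```cpp
#include <cassert>
#include <cstring>

namespace sketch {

typedef unsigned char utf8_char;   // hypothetical stand-in for the proposed char8_t

// Relabel the bytes without copying them; the casts have no runtime cost.
inline const utf8_char* as_utf8(const char* s) noexcept {
    return static_cast<const utf8_char*>(static_cast<const void*>(s));
}
inline const char* as_narrow(const utf8_char* s) noexcept {
    return static_cast<const char*>(static_cast<const void*>(s));
}

// Two overloads that a plain char-based design could not tell apart.
inline const char* encoding_of(const char*)      { return "native narrow"; }
inline const char* encoding_of(const utf8_char*) { return "UTF-8"; }

}  // namespace sketch
```

The same bytes select a different overload depending only on the pointer type, which is the distinction the generic components need in order to map a character type to exactly one codec.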
conversion_iterator [str-x.cvt-iter]
Class template conversion_iterator composes an input iterator from a codec to_iterator, a codec from_iterator, and an input iterator. It adapts the input iterator to behave as an iterator to ToCodec::value_type. The type iterator_traits<InputIterator>::value_type is required to be the same as FromCodec::value_type.
conversion_iterator meets the standard library input iterator requirements and the end-of-sequence iterator requirements ([str-x.codec.req.eos]).
    template <class ToCodec, class FromCodec, class InputIterator>
    class conversion_iterator
      : public ToCodec::template to_iterator<
          typename FromCodec::template from_iterator<InputIterator>>
    {
    public:
      typedef typename FromCodec::template from_iterator<InputIterator> from_iterator_type;
      typedef typename ToCodec::template to_iterator<from_iterator_type> to_iterator_type;

      conversion_iterator();
      conversion_iterator(InputIterator begin);
      conversion_iterator(InputIterator begin, std::size_t sz);
      template <class U> conversion_iterator(InputIterator begin, U end);

      // other functions as needed to meet standard library requirements
      // for input iterators [input.iterators]
      ...
    };
    conversion_iterator();
Effects: Constructs an iterator with the end-of-sequence iterator value ([str-x.codec.req.eos]).
    conversion_iterator(InputIterator begin);
Effects: Constructs an iterator for the half-open range that begins at begin and ends at the first element with a value of value_type().
    conversion_iterator(InputIterator begin, size_t sz);
Effects: Constructs an iterator for the half-open range that begins at begin and ends at begin + sz.
    template <class InputIterator2>
      conversion_iterator(InputIterator begin, InputIterator2 end);
Effects: Constructs an iterator for the half-open range that begins at begin and ends at end.
Remarks: Shall not participate in overload resolution unless InputIterator and InputIterator2 are the same type.
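The inheritance structure in the synopsis can be exercised with two deliberately trivial codecs. This is a sketch under stated assumptions: latin1 is a hypothetical codec that is not among the predefined ones, the utf32 stand-in implements only to_iterator, and only the begin/end constructor is provided. Latin-1 is chosen because each byte maps directly to the code point of the same value, which keeps the attention on how the pieces nest.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical codec: each Latin-1 byte is the code point of the same value.
struct latin1 {
    typedef char value_type;

    template <class InputIterator>
    class from_iterator {
    public:
        typedef std::input_iterator_tag iterator_category;
        typedef char32_t value_type;
        typedef std::ptrdiff_t difference_type;
        typedef const char32_t* pointer;
        typedef const char32_t& reference;

        from_iterator() : cur_(), end_(), at_end_(true), cp_(0) {}   // end-of-sequence
        from_iterator(InputIterator first, InputIterator last)
            : cur_(first), end_(last), at_end_(first == last), cp_(0) { load(); }

        reference operator*() const { return cp_; }
        from_iterator& operator++() { ++cur_; at_end_ = (cur_ == end_); load(); return *this; }
        from_iterator operator++(int) { from_iterator old(*this); ++*this; return old; }
        bool operator==(const from_iterator& r) const {
            return at_end_ == r.at_end_ && (at_end_ || cur_ == r.cur_);
        }
        bool operator!=(const from_iterator& r) const { return !(*this == r); }

    private:
        void load() { if (!at_end_) cp_ = static_cast<unsigned char>(*cur_); }
        InputIterator cur_, end_;
        bool at_end_;
        char32_t cp_;
    };
};

// Stand-in for the utf32 codec: the target already is UTF-32, so
// to_iterator simply forwards the code points it is given.
struct utf32 {
    typedef char32_t value_type;

    template <class U32Iterator>
    class to_iterator {
    public:
        typedef std::input_iterator_tag iterator_category;
        typedef char32_t value_type;
        typedef std::ptrdiff_t difference_type;
        typedef const char32_t* pointer;
        typedef const char32_t& reference;

        to_iterator() : src_() {}                                    // end-of-sequence
        explicit to_iterator(U32Iterator src) : src_(src) {}

        reference operator*() const { return *src_; }
        to_iterator& operator++() { ++src_; return *this; }
        to_iterator operator++(int) { to_iterator old(*this); ++*this; return old; }
        bool operator==(const to_iterator& r) const { return src_ == r.src_; }
        bool operator!=(const to_iterator& r) const { return !(*this == r); }

    private:
        U32Iterator src_;
    };
};

// The adaptor is the target codec's to_iterator wrapped around the
// source codec's from_iterator.
template <class ToCodec, class FromCodec, class InputIterator>
class conversion_iterator
    : public ToCodec::template to_iterator<
          typename FromCodec::template from_iterator<InputIterator> > {
public:
    typedef typename FromCodec::template from_iterator<InputIterator> from_iterator_type;
    typedef typename ToCodec::template to_iterator<from_iterator_type> to_iterator_type;

    conversion_iterator() : to_iterator_type() {}
    conversion_iterator(InputIterator first, InputIterator last)
        : to_iterator_type(from_iterator_type(first, last)) {}
};

// Hypothetical helper showing the adaptor used as an ordinary iterator pair.
inline std::u32string latin1_to_utf32(const std::string& s) {
    typedef conversion_iterator<utf32, latin1, std::string::const_iterator> iter_type;
    return std::u32string(iter_type(s.begin(), s.end()), iter_type());
}
```

All iterator operations are inherited from the to_iterator base, so conversion_iterator itself contributes only the constructors and the two typedefs.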
copy_string [str-x.copy_string]
    template<class InputIterator, class FromCodec, class OutputIterator, class ToCodec>
      OutputIterator copy_string(InputIterator first, InputIterator last, OutputIterator result);
Requires: result shall not be in the range [first, last).
Effects:
    typedef conversion_iterator<ToCodec,
      typename FromCodec::template codec<
        typename std::iterator_traits<InputIterator>::value_type>::type,
      InputIterator> iter_type;
Returns: std::copy(iter_type(first, last), iter_type(), result).
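The observable behavior, a copy that converts as it goes, can be sketched for one fixed pair of encodings. The function copy_utf16_as_utf32 is hypothetical and is written as a direct loop instead of the conversion_iterator composition specified above; it assumes well-formed UTF-16 in which every high surrogate is followed by a low surrogate.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical, single-purpose analogue of copy_string: copies UTF-16
// code units from [first, last) to result as UTF-32 code points.
template <class InputIterator, class OutputIterator>
OutputIterator copy_utf16_as_utf32(InputIterator first, InputIterator last,
                                   OutputIterator result) {
    while (first != last) {
        char32_t cp = static_cast<char16_t>(*first++);
        if (cp >= 0xD800 && cp <= 0xDBFF && first != last) {
            // High surrogate: combine it with the low surrogate that follows.
            const char32_t low = static_cast<char16_t>(*first++);
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        *result++ = cp;   // exactly one code point per output element
    }
    return result;
}
```

As with std::copy, the caller chooses the destination, so the output may go to a back_insert_iterator, a preallocated buffer, or any other output iterator.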
make_string function templates [str-x.make_string]
The make_string functions create a string from a source sequence of characters. The conversion of the type and encoding of the characters in the source sequence of characters to the type and encoding of characters in the created string is performed by conversion_iterator<ToCodec, typename FromCodec::template codec<typename FromString::value_type>::type, typename FromString::const_iterator>, where ToCodec, FromCodec, and FromString are template parameters, as is ToString, the type of the resulting string.
    template <class ToCodec, class FromCodec = default_codec,
              class ToString = std::basic_string<typename ToCodec::value_type>,
              class FromString>
      ToString make_string(const FromString& s);
Returns: A string containing the characters of the sequence [s.cbegin(), s.cend()).
[Example: A conforming implementation would be:
    typedef conversion_iterator<ToCodec,
      typename FromCodec::template codec<typename FromString::value_type>::type,
      typename FromString::const_iterator> iter_type;
    ToString tmp;
    std::copy(iter_type(s.cbegin(), s.cend()), iter_type(),
              std::back_insert_iterator<ToString>(tmp));
    return tmp;
--end example]
    template <class ToCodec, class FromCodec = default_codec,
              class ToString = std::basic_string<typename ToCodec::value_type>,
              class InputIterator>
      ToString make_string(InputIterator begin);
Returns: A string containing the characters of the sequence [begin, begin+dist) where dist is the distance from begin to the first instance of character iterator_traits<InputIterator>::value_type().
Complexity: O(dist)
    template <class ToCodec, class FromCodec = default_codec,
              class ToString = std::basic_string<typename ToCodec::value_type>,
              class InputIterator>
      ToString make_string(InputIterator begin, std::size_t sz);
Returns: A string containing the characters of the sequence [begin, begin+sz).
    template <class ToCodec, class FromCodec = default_codec,
              class ToString = std::basic_string<typename ToCodec::value_type>,
              class InputIterator, class InputIterator2>
      ToString make_string(InputIterator begin, InputIterator2 end);
Returns: A string containing the characters of the sequence [begin, end).
to_string function templates [str-x.to_string]
    template <class FromCodec = default_codec,
              class ToString = std::basic_string<char>, class FromString>
      ToString to_string(const FromString& s);
    template <class FromCodec = default_codec,
              class ToString = std::basic_string<char>, class InputIterator>
      ToString to_string(InputIterator begin);
    template <class FromCodec = default_codec,
              class ToString = std::basic_string<char>, class InputIterator>
      ToString to_string(InputIterator begin, std::size_t sz);
    template <class FromCodec = default_codec,
              class ToString = std::basic_string<char>, class InputIterator>
      ToString to_string(InputIterator begin, InputIterator end);
    // Repeat pattern for to_wstring, to_u8string, to_u16string, to_u32string
Returns: make_string<codec, FromCodec, ToString>(arguments), where codec is narrow, wide, utf8, utf16, or utf32 for to_string, to_wstring, to_u8string, to_u16string, and to_u32string respectively, and arguments is s; begin; begin, sz; or begin, end according to the overload called.
These functions provide copy-less type conversion for use with narrow character strings when no encoding conversion is required. Their semantics take advantage of C++ language rules that ensure the representation of the underlying bytes for char and unsigned char are the same (C++ standard: [basic.types]).
    inline const char8_t* u8(const char* s) noexcept;
Returns: static_cast<const char8_t*>(static_cast<const void*>(s)).
    inline const char8_t* u8(const string& s) noexcept;
Returns: static_cast<const char8_t*>(static_cast<const void*>(s.c_str())).
    inline const char* u8(const char8_t* s) noexcept;
Returns: static_cast<const char*>(static_cast<const void*>(s)).
    inline const char* u8(const u8string& s) noexcept;
Returns: static_cast<const char*>(static_cast<const void*>(s.c_str())).
The stream inserter functions perform stream insertion of an insertion character sequence converted from a source character sequence. The conversion of the type and encoding of the source sequence to the type and encoding of the insertion sequence is performed by a conversion_iterator.
    template <class Ostream, class charT, class traits, class Allocator>
      Ostream& operator<<(Ostream& os, const basic_string<charT, traits, Allocator>& str);
Effects: For each value of an iterator of type conversion_iterator<typename select_codec<typename Ostream::char_type>::type, typename select_codec<charT>::type, typename string_type::const_iterator>, where string_type is basic_string<charT, traits, Allocator>, initialized with the source sequence [str.cbegin(), str.cend()), iterate until the end-of-sequence value ([str-x.codec.req.eos]) is reached, inserting the dereferenced value of the iterator into os.
Returns: os.
Remarks: Does not participate in overload resolution if charT and Ostream::char_type are the same type.
    basic_ostream<char>& operator<<(basic_ostream<char>& os, const wchar_t* p);
    basic_ostream<char>& operator<<(basic_ostream<char>& os, const char16_t* p);
    basic_ostream<char>& operator<<(basic_ostream<char>& os, const char32_t* p);
Effects: For each value of an iterator of type conversion_iterator<typename select_codec<char>::type, typename select_codec<p's value_type>::type, p's type> initialized with p, iterate until the end-of-sequence value ([str-x.codec.req.eos]) is reached, inserting the dereferenced value of the iterator into os.
Returns: os.
[Note: The existing basic_ostream<charT,traits>& operator<<(const void* p) prevents use of a template to abstract away the differences between the pointer types covered by above signatures. --end note]
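The overload-resolution remark for the basic_string inserter can be sketched with std::enable_if. Everything below lives in a hypothetical namespace sketch instead of namespace std, handles only UTF-32 strings written to a narrow stream, and assumes the narrow encoding is UTF-8 so that the conversion can be written out directly; put_utf8 and to_utf8_via_stream are illustrative helpers, not proposed names.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

namespace sketch {

// Writes one code point to a narrow stream as UTF-8 (assumes cp <= 0x10FFFF).
inline void put_utf8(std::ostream& os, char32_t cp) {
    if (cp < 0x80) {
        os.put(static_cast<char>(cp));
    } else if (cp < 0x800) {
        os.put(static_cast<char>(0xC0 | (cp >> 6)));
        os.put(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        os.put(static_cast<char>(0xE0 | (cp >> 12)));
        os.put(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        os.put(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        os.put(static_cast<char>(0xF0 | (cp >> 18)));
        os.put(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        os.put(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        os.put(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Removed from overload resolution when the string and the stream already
// share a character type, so the ordinary inserter handles that case.
template <class Ostream, class charT, class Traits, class Allocator>
typename std::enable_if<
    !std::is_same<charT, typename Ostream::char_type>::value, Ostream&>::type
operator<<(Ostream& os, const std::basic_string<charT, Traits, Allocator>& str) {
    static_assert(std::is_same<charT, char32_t>::value &&
                  std::is_same<typename Ostream::char_type, char>::value,
                  "this sketch handles only UTF-32 strings on narrow streams");
    for (char32_t cp : str) put_utf8(os, cp);
    return os;
}

// Hypothetical helper: inside namespace sketch the converting inserter is
// found by ordinary unqualified lookup.
inline std::string to_utf8_via_stream(const std::u32string& s) {
    std::ostringstream os;
    os << s;
    return os.str();
}

}  // namespace sketch
```

Code outside the namespace would need a using-declaration for sketch::operator<<, since argument-dependent lookup considers only namespace std for these argument types; placing the inserter in namespace std, as the proposed wording does, avoids that step.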
To be supplied.