Doc. no.: | P0353R1 |
Date: | 2016-10-14 |
Reply to: | Beman Dawes <bdawes at acm dot org> |
Audience: | Library Evolution |
Proposes character encoding conversion and related functions to ease interoperability between strings and other sequences of character types char, char16_t, char32_t, and wchar_t. Support for Unicode Transformation Format (UTF) and wide character encodings is built-in, while narrow character encodings are supported via traditional codecvt facets. Pure addition to the standard library. No changes to the core language or existing standard library components. Breaks no existing code or ABI. Proposed wording provided. Specified in accordance with ISO/IEC 10646 and the Unicode Standard. Has been implemented. Suitable for either a library TS or the standard itself.
Major interface revision: R1 adds important functionality yet markedly reduces interface size.
C++ types char, char16_t, and char32_t support character and string literals encoded in Unicode Transformation Formats UTF-8, UTF-16, and UTF-32 respectively. Additional narrow character encodings are supported by the standard library's codecvt facets via conversion to a wide character encoding.
Users may need to use multiple encodings in the same application, or even the same function. Yet neither the language nor the standard library provides a convenient C++ way to convert between these encodings, let alone a way to do that securely. There is no equivalent to the ease with which the std::to_string family of functions can convert an arithmetic value to a string.
This proposal markedly eases the problems encountered by users due to the lack of convenient encoding conversion in the standard library.
Knowledge of UTF-8, UTF-16, UTF-32, and the implementation defined wchar_t wide encoding is built-in to make the interface Unicode friendly. The interface meets the error handling requirements of the Unicode and ISO/IEC 10646 standards, and meets the error handling needs requested by Unicode experts.
Problem: Convert a string s to UTF-16 in a way that converts the string types and encodings, and handles errors according to the best practices documented in the ISO/IEC 10646 and Unicode standards.
Using the proposal:
to_string<utf16>(s);
Where s can be anything convertible to std::basic_string_view<char/char16_t/char32_t/wchar_t> encoded in the associated UTF-8, UTF-16, UTF-32, or wide character encoding. If s has a value type of char, but is not UTF-8 encoded, a second argument supplies a std::codecvt derived facet that converts to an internal type such as wchar_t or char32_t and its associated encoding:
to_string<utf16>(s, ccvt_facet);
Without the proposal, using only the standard library: This might not be too difficult using a third-party library, but is surprisingly difficult using only the standard library. Unless the developer has enough Unicode experience to focus on error detection and to test against an existing test data set, a roll-your-own solution would probably be very time-consuming and error-prone.
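To give a feel for the roll-your-own effort described above, here is a minimal sketch of a hand-written UTF-8 to UTF-16 conversion that substitutes U+FFFD for ill-formed input. Everything in it (the namespace sketch and the names decode_utf8, utf8_to_utf16, and demo_utf8_to_utf16) is hypothetical and not part of the proposal. It is deliberately incomplete: it makes no attempt to follow the Unicode recommendations on how many replacement characters to emit per ill-formed subsequence, and it has not been checked against a conformance test data set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

namespace sketch {  // hypothetical namespace, not part of the proposal

// Decode one code point starting at s[i] (requires i < s.size()) and advance
// i past the bytes consumed. Any ill-formed sequence yields U+FFFD.
inline char32_t decode_utf8(std::string_view s, std::size_t& i) {
    const auto byte = [&s](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = byte(i++);
    if (b0 < 0x80) return b0;  // ASCII

    int extra = 0;     // number of continuation bytes expected
    char32_t cp = 0;   // code point being assembled
    char32_t min = 0;  // smallest value allowed for this length (rejects overlongs)
    if      ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min = 0x10000; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte

    for (int n = 0; n < extra; ++n) {
        if (i >= s.size() || (byte(i) & 0xC0) != 0x80) return 0xFFFD;  // truncated
        cp = (cp << 6) | (byte(i++) & 0x3F);
    }
    const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
    return (cp < min || cp > 0x10FFFF || surrogate) ? char32_t{0xFFFD} : cp;
}

// Convert UTF-8 to UTF-16, emitting a surrogate pair above the BMP.
inline std::u16string utf8_to_utf16(std::string_view s) {
    std::u16string out;
    for (std::size_t i = 0; i < s.size();) {
        char32_t cp = decode_utf8(s, i);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

inline void demo_utf8_to_utf16() {
    assert(utf8_to_utf16("\xE2\x82\xAC") == u"\u20AC");  // EURO SIGN
    assert(utf8_to_utf16("\xFF") == u"\uFFFD");          // invalid byte replaced
}

}  // namespace sketch
```

Even this short version has to deal with overlong forms, surrogates, truncation, and out-of-range values, which is the kind of detail the proposal aims to take off the user's hands.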
N3398, String Interoperation Library, proposed a complete overhaul of the standard library's mechanisms for character encoding conversion. The proposal was discussed at the Portland meeting in 2012. Some aspects of the proposal drew strong support, such as improving Unicode string interoperability. Other aspects drew strong opposition, such as new low level functionality to replace std::codecvt. Clearly participants did not want N3398 - they wanted a different proposal, less overreaching and more focused on Unicode encoding conversions. Bill Plauger summed it up when he said something like "Don't reinvent codecvt. That said, we should pick a winner — Unicode."
This P0353 proposal is completely new and not a revision of N3398.
R1 - Pre-Issaquah mailing
Major interface revision: Discussion in Oulu and experimental use of the R0 interface exposed serious issues:
The R1 interface provides both UTF-based and codecvt-based encoding conversion, yet consists of only two levels of functionality and two conversion functions: a recode algorithm with a single signature and a to_string convenience function with four overloads.
A Boost licensed preliminary implementation is available at github.com/beman/unicode.
Jeffrey Yasskin, Titus Winters, Michael Spenser, and Fabio participated in a small group discussion of P0353R0 in Oulu. Much of the direction and many of the specific comments that came out of the discussion are reflected in P0353R1, for example the removal of the requirement that wchar_t be UTF encoded and the need to be more explicit about narrow encoding.
Tom Honermann provided insights about environments not based on POSIX or Windows, such as IBM's z/OS.
Alisdair Meredith, Eric Niebler, Howard Hinnant, Jeffrey Yasskin, Marshall Clow, PJ Plauger, and Stephan T. Lavavej participated in the Portland discussion of N3398. Many of the design decisions that have gone into the current proposal flow directly from the Portland discussion.
After the Portland meeting, Matt Austern sat down with Google's "Unicode people" to "clarify things". His summary of that discussion was very helpful. Its guidance on error handling is reflected in the current proposal.
Limit this proposal to encoding conversion and other encoding-related functionality
Build in support for UTF-8, UTF-16, UTF-32 and wide (i.e. wchar_t) encodings
Build in support for existing narrow to/from wide codecvt facets
Provide two levels of functionality
The recode function provides a recoding conversion algorithm. It operates on an input sequence and produces an output sequence, so provides STL-like functionality to meet generic needs.
The to_string function templates provide convenient encoding conversion for string_view, u16string_view, u32string_view, and wstring_view arguments, and are intended to complement the existing standard library to_string family of functions.
Provide a coherent error detection and handling policy
Provide encoding conversions as explicitly called non-member functions
to_string functions already in the standard library.
Place the proposed components in namespace unicode
Emphasizes that these functions assume char strings use a Unicode encoding.
Keep interfaces neutral as to which character type or UTF encoding is "best"
Each of these encodings has uses where it is preferred or required, and all of these needs may appear in the same application. For example:
Provide types narrow, utf8, utf16, utf32, and wide to identify encodings
String value_type alone is insufficient because in the case of char it is ambiguous.
The resulting user code is more explicit and so easier to read.
Base conformance and definitions on ISO/IEC 10646:2014
Use variadic templates to minimize interface surface area
recode conversion algorithm and one to_string convenience function. These closely match the abstraction and mental model of their behavior; the differing arguments are just a detail. Static asserts can ensure comprehensible error messages when contradictory argument types are passed.
This wording assumes P0417, C++17 should refer to ISO/IEC 10646 2014 instead of 1994, has been accepted into the C++ working paper.
This sub-clause describes components that C++ programs may use to perform operations on characters, strings, and other sequences of characters encoded in various encoding forms. Encoding forms UTF-8, UTF-16, and UTF-32 are supported, as are narrow character encodings having a codecvt facet meeting requirements described below.
[Note: The C++ standard does not require that char, char16_t, char32_t, and wchar_t strings be UTF encoded, although u8, u, and U string literals are UTF encoded. The components in this sub-clause use the provided types narrow, utf8, utf16, utf32, and wide to identify specific encodings [uni.encoding]. — end note]
Within this sub-clause a reference written in the form "(UCS number)" refers to the section of that number in ISO/IEC 10646:2014 (C++ [intro.refs]).
[Note: ISO/IEC 10646 Universal Coded Character Set (UCS) is the ISO/IEC standard for Unicode. It is synchronized with The Unicode Standard maintained by the Unicode Consortium. —end note]
[Footnote] Unicode® is a registered trademark of Unicode, Inc. This information is given for the convenience of users of this document and does not constitute an endorsement by ISO or IEC of this product.
The definitions from (UCS 4.) apply throughout. [Examples: code point (UCS 4.10), code unit (UCS 4.11), encoding form (UCS 4.23), ill-formed code unit sequence (UCS 4.33), minimal well-formed code unit sequence (UCS 4.41), well-formed code unit sequence (UCS 4.61). —end examples]
The types char, char16_t, char32_t, and wchar_t.
Determined by the encoding type [uni.encoding]:
utf8, utf16, and utf32, a well-formed code unit sequence (UCS 4.61) that maps to a single UCS scalar value.
narrow, implementation defined according to a facet argument of a type derived from std::codecvt.
wide, implementation defined.
Template parameters named InputIterator shall satisfy the requirements of an input iterator (C++ [input.iterators]).
Template parameters named ForwardIterator shall satisfy the requirements of a forward iterator (C++ [forward.iterators]).
Template parameters named OutputIterator shall satisfy the requirements of an output iterator (C++ [output.iterators]).
namespace std {
namespace experimental {
inline namespace fundamentals_v2 {
namespace unicode {

  // [uni.encoding] encoding types
  struct narrow {using value_type = char;};      // codecvt determined encoding
  struct utf8   {using value_type = char;};      // UTF-8 encoding
  struct utf16  {using value_type = char16_t;};  // UTF-16 encoding
  struct utf32  {using value_type = char32_t;};  // UTF-32 encoding
  struct wide   {using value_type = wchar_t;};   // wide-character literal
                                                 // encoding [lex.ccon]

  // [uni.is_encoding] is_encoding type-trait
  template <class T> struct is_encoding : public false_type {};
  template<> struct is_encoding<narrow> : true_type {};
  template<> struct is_encoding<utf8> : true_type {};
  template<> struct is_encoding<utf16> : true_type {};
  template<> struct is_encoding<utf32> : true_type {};
  template<> struct is_encoding<wide> : true_type {};
  template <class T> constexpr bool is_encoding_v = is_encoding<T>::value;

  // [uni.is_encoded_character] is_encoded_character type-trait
  template <class T> struct is_encoded_character : public false_type {};
  template<> struct is_encoded_character<char> : true_type {};
  template<> struct is_encoded_character<char16_t> : true_type {};
  template<> struct is_encoded_character<char32_t> : true_type {};
  template<> struct is_encoded_character<wchar_t> : true_type {};
  template <class T> constexpr bool is_encoded_character_v
    = is_encoded_character<T>::value;

  // [uni.err] default error handler
  template <class CharT> struct ufffd;
  template <> struct ufffd<char>;
  template <> struct ufffd<char16_t>;
  template <> struct ufffd<char32_t>;
  template <> struct ufffd<wchar_t>;

  // [uni.recode] encoding conversion algorithm
  template <class FromEncoding, class ToEncoding,
            class InputIterator, class OutputIterator, class ... T>
    OutputIterator recode(InputIterator first, InputIterator last,
                          OutputIterator result, const T& ... args);

  // [uni.to_string] string encoding conversion
  template <class ToEncoding = utf8, class ...Pack>
    basic_string<typename ToEncoding::value_type>
      to_string(string_view v, const Pack& ... args);
  template <class ToEncoding = utf8, class ...Pack>
    basic_string<typename ToEncoding::value_type>
      to_string(u16string_view v, const Pack& ... args);
  template <class ToEncoding = utf8, class ...Pack>
    basic_string<typename ToEncoding::value_type>
      to_string(u32string_view v, const Pack& ... args);
  template <class ToEncoding = utf8, class ...Pack>
    basic_string<typename ToEncoding::value_type>
      to_string(wstring_view v, const Pack& ... args);

  // [uni.utf-query] Encoding queries
  template <class ForwardIterator>
    std::pair<ForwardIterator, ForwardIterator>
      first_ill_formed(ForwardIterator first, ForwardIterator last) noexcept;
  bool is_well_formed(string_view v) noexcept;
  bool is_well_formed(u16string_view v) noexcept;
  bool is_well_formed(u32string_view v) noexcept;
  bool is_well_formed(wstring_view v) noexcept;

}  // namespace unicode
}  // namespace fundamentals_v2
}  // namespace experimental
}  // namespace std
The types narrow, utf8, utf16, utf32, and wide provided by header <unicode> identify the encoding of strings and sequences of the encoded character types ([ucs.defs.enc-char-type]).
[Note: Users must supply arguments of std::codecvt derived types for operations on narrow encoded strings and sequences. For the other encoding types, such facets are not necessary. — end note]
The relationship between encoded character types, encoding types, and encodings is specified by the following table:
Table of Relationships
Character type | Encoding type | Encoding
char | narrow | The encoding of characters of type
char | utf8 | UTF-8 (UCS 9.2).
char16_t | utf16 | UTF-16 (UCS 9.3).
char32_t | utf32 | UTF-32 (UCS 9.4).
wchar_t | wide | The implementation defined encoding of wide-character literals (C++ [lex.ccon]).
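The relationships in the table can be mirrored at compile time. The following is a minimal sketch, not the proposed header: it re-declares the encoding tags and the is_encoding trait from the synopsis inside a hypothetical namespace sketch, and uses static_assert to show that narrow and utf8 share the code unit type char while remaining distinct types, which is why the character type alone cannot identify the encoding.

```cpp
#include <type_traits>

namespace sketch {  // hypothetical namespace; the tags mirror the synopsis

struct narrow { using value_type = char; };
struct utf8   { using value_type = char; };
struct utf16  { using value_type = char16_t; };
struct utf32  { using value_type = char32_t; };
struct wide   { using value_type = wchar_t; };

template <class T> struct is_encoding : std::false_type {};
template <> struct is_encoding<narrow> : std::true_type {};
template <> struct is_encoding<utf8>   : std::true_type {};
template <> struct is_encoding<utf16>  : std::true_type {};
template <> struct is_encoding<utf32>  : std::true_type {};
template <> struct is_encoding<wide>   : std::true_type {};
template <class T> constexpr bool is_encoding_v = is_encoding<T>::value;

// Same code unit type, different encodings: only the tag tells them apart.
static_assert(std::is_same_v<narrow::value_type, utf8::value_type>);
static_assert(!std::is_same_v<narrow, utf8>);

// The trait accepts encoding tags, not character types.
static_assert(is_encoding_v<utf16> && !is_encoding_v<char16_t>);

}  // namespace sketch
```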
When an ill-formed code unit subsequence is detected during execution of a conversion function, an error handler function object shall be invoked. Unless the error handler throws an exception, the string returned by the error handler shall be added to the output sequence and the ill-formed input code unit subsequence shall not be converted and added to the output sequence. Detection and error handling for ill-formed code unit subsequences is required even when the input and output encodings are the same. [Note: If the error handler function object always returns a pointer to a well-formed code point sequence, the conversion function's entire output sequence will be a well-formed code point sequence. — end note]
template <class CharT> struct ufffd;
template <> struct ufffd<char>;
template <> struct ufffd<char16_t>;
template <> struct ufffd<char32_t>;
template <> struct ufffd<wchar_t>;
struct ufffd is the default error handler function object for conversion functions. The default error handling function object returns U+FFFD REPLACEMENT CHARACTER as a single code point error marker. Each specialization shall provide a member function with the signature:
constexpr const CharT* operator()() const noexcept;
that returns a pointer to the value indicated in the Specializations table:
Specializations
CharT | Returns
char | u8"\uFFFD"
char16_t | u"\uFFFD"
char32_t | U"\uFFFD"
wchar_t | L"\uFFFD"
[Note: U+FFFD REPLACEMENT CHARACTER is returned as the default single code point error marker in accordance with the recommendations of the Unicode Standard. The rationale given by the Unicode standard is essentially that other commonly used approaches, including throwing exceptions, can be and have been used as security attack vectors. —end note]
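As an illustration, the specializations might look like the following sketch, placed in a hypothetical namespace sketch rather than the proposed one. One deviation from the table is an assumption made for portability: the char specialization spells out the three UTF-8 code units of U+FFFD as byte escapes, because since C++20 a u8 string literal has type const char8_t[] and no longer converts to const char*. The question_mark handler is likewise hypothetical and only shows the shape of a user-supplied alternative.

```cpp
namespace sketch {  // hypothetical namespace, not the proposed one

template <class CharT> struct ufffd;

template <> struct ufffd<char> {
    // UTF-8 code units EF BF BD encode U+FFFD.
    constexpr const char* operator()() const noexcept { return "\xEF\xBF\xBD"; }
};
template <> struct ufffd<char16_t> {
    constexpr const char16_t* operator()() const noexcept { return u"\uFFFD"; }
};
template <> struct ufffd<char32_t> {
    constexpr const char32_t* operator()() const noexcept { return U"\uFFFD"; }
};
template <> struct ufffd<wchar_t> {
    constexpr const wchar_t* operator()() const noexcept { return L"\uFFFD"; }
};

// A user-supplied handler with the same call signature (hypothetical).
struct question_mark {
    constexpr const char* operator()() const noexcept { return "?"; }
};

}  // namespace sketch
```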
template <class FromEncoding, class ToEncoding,
          class InputIterator, class OutputIterator, class ... T>
  OutputIterator recode(InputIterator first, InputIterator last,
                        OutputIterator result, const T& ... args);
Effects: For each minimal code unit subsequence in the range [first, last):
If the code unit subsequence is well-formed, convert the code point it represents from the input sequence encoding to the output sequence encoding and then copy the code units of that code point as if by *result++ = *u++ where u is an iterator over the code units making up the code point.
Otherwise, copy the null-terminated string returned by the eh function object to the output result as if by *result++ = *p++ where p iterates over the string returned by eh.
Returns: result.
Remarks: An implementation is permitted to first convert from the input encoding to an intermediate encoding, and then convert the intermediate encoding to the output encoding. [Note: This allows implementations to perform conversions to or from narrow via an intermediate string of a Codecvt argument's intern_type and encoding. —end note]
The requirements for the args parameter pack arguments are shown in the following table.
Parameter pack argument requirements
FromEncoding | ToEncoding | First args argument | Second args argument | Third args argument
utf8, utf16, utf32, or wide | utf8, utf16, utf32, or wide | Optional error handler function object [uni.err] | Not allowed; diagnostic required | Not allowed; diagnostic required
utf8, utf16, utf32, or wide | narrow | const Codecvt& | Optional error handler function object [uni.err] | Not allowed; diagnostic required
narrow | utf8, utf16, utf32, or wide | const Codecvt& | Optional error handler function object [uni.err] | Not allowed; diagnostic required
narrow | narrow | const Codecvt& | const Codecvt& | Optional error handler function object [uni.err]
Type Codecvt is the std::codecvt derived type described in the [uni.encoding] table. Used to perform the conversion to char from Elem.
Postcondition: If the string returned by each call to the eh function object during the execution of the algorithm is a well-formed code point sequence, then the output sequence is a well-formed code point sequence.
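The Effects wording can be made concrete for a single encoding pair. The sketch below is an assumption-laden stand-in, not the proposed recode: it fixes the direction to UTF-32 in and UTF-8 out, takes the error handler as an ordinary defaulted parameter instead of a parameter pack, and uses the hypothetical names sketch, ufffd_utf8, recode_utf32_to_utf8, and demo_recode. What it does show is the pattern of copying either the converted code units or the handler's null-terminated string through the output iterator.

```cpp
#include <cassert>
#include <iterator>
#include <string>

namespace sketch {  // hypothetical namespace, not part of the proposal

// Default handler: U+FFFD as UTF-8 code units.
struct ufffd_utf8 {
    const char* operator()() const noexcept { return "\xEF\xBF\xBD"; }
};

// Fixed-direction stand-in for recode<utf32, utf8>(first, last, result, eh).
template <class InputIterator, class OutputIterator, class ErrorHandler = ufffd_utf8>
OutputIterator recode_utf32_to_utf8(InputIterator first, InputIterator last,
                                    OutputIterator result, ErrorHandler eh = {}) {
    for (; first != last; ++first) {
        const char32_t cp = *first;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            // Ill-formed: copy the handler's string instead of converting.
            for (const char* p = eh(); *p != '\0'; ++p) *result++ = *p;
        } else if (cp < 0x80) {
            *result++ = static_cast<char>(cp);
        } else if (cp < 0x800) {
            *result++ = static_cast<char>(0xC0 | (cp >> 6));
            *result++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *result++ = static_cast<char>(0xE0 | (cp >> 12));
            *result++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *result++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            *result++ = static_cast<char>(0xF0 | (cp >> 18));
            *result++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *result++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *result++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return result;
}

inline void demo_recode() {
    std::u32string in = U"a\u20AC";
    in.push_back(static_cast<char32_t>(0xD800));  // surrogate: not a scalar value
    std::string out;
    recode_utf32_to_utf8(in.begin(), in.end(), std::back_inserter(out));
    assert(out == "a\xE2\x82\xAC\xEF\xBF\xBD");
}

}  // namespace sketch
```

Passing a lambda such as [] { return "?"; } as the fourth argument replaces the default marker, and a handler that throws turns ill-formed input into an exception, matching the choices the error handling policy leaves to the caller.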
template <class ToEncoding, class ...Pack>
  basic_string<typename ToEncoding::value_type>
    to_string(string_view v, const Pack& ... args);
template <class ToEncoding, class ...Pack>
  basic_string<typename ToEncoding::value_type>
    to_string(u16string_view v, const Pack& ... args);
template <class ToEncoding, class ...Pack>
  basic_string<typename ToEncoding::value_type>
    to_string(u32string_view v, const Pack& ... args);
template <class ToEncoding, class ...Pack>
  basic_string<typename ToEncoding::value_type>
    to_string(wstring_view v, const Pack& ... args);
Effects: Equivalent to:
basic_string<typename ToEncoding::value_type> tmp;
recode<FromEncoding, ToEncoding>(v.cbegin(), v.cend(),
  back_inserter(tmp), args ...);
return tmp;
For the first overload, FromEncoding is narrow if there are two function arguments convertible to ccvt_type, or if there is one argument convertible to ccvt_type and ToEncoding is not narrow. Otherwise FromEncoding is utf8.
For the second, third, and fourth overloads, FromEncoding is utf16, utf32, and wide, respectively.
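The "Equivalent to" pattern above, a string_view overload that forwards to an iterator algorithm through back_inserter, can be sketched for one fixed pair of encodings. The names sketch, recode_utf16_to_utf32, to_utf32_string, and demo_to_utf32_string are hypothetical, the direction is hard-wired to UTF-16 in and UTF-32 out, and U+FFFD is substituted directly rather than obtained from an error handler argument, so this is an illustration of the layering only.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <string_view>

namespace sketch {  // hypothetical namespace, not part of the proposal

// Fixed-direction stand-in for recode<utf16, utf32>(first, last, result).
template <class InputIterator, class OutputIterator>
OutputIterator recode_utf16_to_utf32(InputIterator first, InputIterator last,
                                     OutputIterator result) {
    while (first != last) {
        const char32_t u = *first++;
        if (u < 0xD800 || u > 0xDFFF) {
            *result++ = u;  // code point in the Basic Multilingual Plane
        } else if (u <= 0xDBFF && first != last &&
                   *first >= 0xDC00 && *first <= 0xDFFF) {
            const char32_t lo = *first++;  // low surrogate completes the pair
            *result++ = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        } else {
            *result++ = U'\uFFFD';  // lone surrogate
        }
    }
    return result;
}

// Convenience layer: same shape as the Effects clause of to_string.
inline std::u32string to_utf32_string(std::u16string_view v) {
    std::u32string tmp;
    recode_utf16_to_utf32(v.cbegin(), v.cend(), std::back_inserter(tmp));
    return tmp;
}

inline void demo_to_utf32_string() {
    assert(to_utf32_string(u"a\U00010437") == U"a\U00010437");
}

}  // namespace sketch
```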
[Example:
#include <string_encoding>
#include <string>
#include <locale>
#include <cvt/big5>  // vendor supplied
#include <cvt/sjis>  // vendor supplied

using namespace std::unicode;
using namespace std;

string sjisstr() { string s; /*load s*/ return s; }
string big5str() { string s; /*load s*/ return s; }

int main()
{
  string     locstr("abc123...");           // narrow encoding known to std::locale()
  string     u8str(u8"abc123$€𐐷𤭢...");   // UTF-8 encoded
  u16string  u16str(u"abc123$€𐐷𤭢...");   // UTF-16 encoded
  u32string  u32str(U"abc123$€𐐷𤭢...");   // UTF-32 encoding
  wstring    wstr(L"abc123$€𐐷𤭢...");     // implementation defined wide encoding

  stdext::cvt::codecvt_big5<wchar_t> big5;  // vendor supplied Big-5 facet
  stdext::cvt::codecvt_sjis<wchar_t> sjis;  // vendor supplied Shift-JIS facet

  auto loc = std::locale();
  auto& loc_ccvt(std::use_facet<ccvt_type>(loc));

  u16string  s1 = to_string<utf16>(u8str);                  // UTF-16 from UTF-8
  wstring    s2 = to_string<wide>(locstr, loc_ccvt);        // wide from narrow
  u32string  s3 = to_string<utf32>(sjisstr(), sjis);        // UTF-32 from Shift-JIS
  string     s4 = to_string<narrow>(u32str, big5);          // Big-5 from UTF-32
  string     s5 = to_string<narrow>(big5str(), big5, sjis); // Shift-JIS from Big-5

  string     s6 = to_string<utf8>(u8str);                   // replace errors with u8"\uFFFD"
  string     s7 = to_string(u16str, []() {return "?";});    // replace errors with '?'
  string     s8 = to_string(wstr, []() {throw "barf"; return "";});  // throw on error

  string     s9  = to_string<narrow>(u16str, big5);         // OK
  string     s10 = to_string<utf8>(u16str, big5);           // error: ccvt_type arg not allowed
  string     s11 = to_string<narrow>(u16str);               // error: ccvt_type arg required
  string     s12 = to_string<narrow>(u16str, big5, big5);   // error: >1 ccvt_type arg
  wstring    s13 = to_string<wide>(locstr, big5, big5);     // error: >1 ccvt_type arg
  string     s14 = to_string<narrow>(locstr);               // error: ccvt_type arg required
}
— end example]
These functions determine whether or not character sequences or string views consist of well-formed code unit sequences (UCS 4.61).
template <class ForwardIterator>
  std::pair<ForwardIterator, ForwardIterator>
    first_ill_formed(ForwardIterator first, ForwardIterator last) noexcept;
Effects: Searches for the first minimal ill-formed code unit subsequence ([ucs.def.min-ill-cus]) in the half-open range [first, last).
If such a minimal ill-formed code unit subsequence is found, returns std::make_pair(begin, end) where begin is an iterator to the first element of the minimal ill-formed code unit subsequence and end is the past-the-end iterator of the minimal ill-formed code unit subsequence.
Otherwise returns std::make_pair(last, last).
Returns: See Effects.
Remarks: The specific encoding form is determined by the ForwardIterator value type ([uni.encoding]).
bool is_well_formed(string_view v) noexcept;
bool is_well_formed(u16string_view v) noexcept;
bool is_well_formed(u32string_view v) noexcept;
bool is_well_formed(wstring_view v) noexcept;
Returns: Equivalent to first_ill_formed(v.cbegin(), v.cend()).first == v.cend().
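For UTF-16 the query functions are short enough to sketch in full. The version below is an illustration under stated assumptions: it handles only char16_t sequences, treats every lone surrogate as a one-unit ill-formed subsequence, and uses the hypothetical names sketch, first_ill_formed_utf16, and demo_first_ill_formed. The is_well_formed overload is written exactly as the Returns clause describes.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>

namespace sketch {  // hypothetical namespace, not part of the proposal

// UTF-16 only: returns the range of the first lone surrogate,
// or {last, last} when the whole sequence is well-formed.
template <class ForwardIterator>
std::pair<ForwardIterator, ForwardIterator>
first_ill_formed_utf16(ForwardIterator first, ForwardIterator last) noexcept {
    while (first != last) {
        const char16_t u = *first;
        ForwardIterator next = first;
        ++next;
        if (u >= 0xD800 && u <= 0xDBFF && next != last &&
            *next >= 0xDC00 && *next <= 0xDFFF) {
            first = ++next;        // well-formed surrogate pair
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            return {first, next};  // lone surrogate
        } else {
            first = next;          // single code unit in the BMP
        }
    }
    return {last, last};
}

inline bool is_well_formed(std::u16string_view v) noexcept {
    return first_ill_formed_utf16(v.cbegin(), v.cend()).first == v.cend();
}

inline void demo_first_ill_formed() {
    std::u16string bad = u"ab";
    bad.push_back(static_cast<char16_t>(0xD800));  // high surrogate with no partner
    bad += u"c";
    const auto r = first_ill_formed_utf16(bad.begin(), bad.end());
    assert(r.first == bad.begin() + 2 && r.second == bad.begin() + 3);
    assert(!is_well_formed(bad));
}

}  // namespace sketch
```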