ISO/IEC JTC1/SC22/WG21 N2401 = J16/07-0261
Code Conversion Facets for the Standard C++ Library
P.J. Plauger
Dinkumware, Ltd.
pjp@dinkumware.com
2007-09-03
With the acceptance of N2007 (Proposed Library Additions for Code Conversion)
we now have template classes wbuffer_convert and wstring_convert, as well
as basic_filebuf, that accept code-conversion facets as template parameters.
Unfortunately, the current draft C++ Standard defines only the default codecvt
facet, with weakly specified properties. This paper proposes the addition of
several facets that cover the most common Unicode conversions.
Add the header <codecvt> with the following definitions:
namespace std {
    enum codecvt_mode {
        consume_header = 4,
        generate_header = 2,
        little_endian = 1};

    template<class Elem,
        unsigned long Maxcode = 0x10ffff,
        codecvt_mode Mode = (codecvt_mode)0>
    class codecvt_utf8
        : public std::codecvt<Elem, char, mbstate_t>
    { // facet for converting between Elem and UTF-8 byte sequences
        .....
    };

    template<class Elem,
        unsigned long Maxcode = 0x10ffff,
        codecvt_mode Mode = (codecvt_mode)0>
    class codecvt_utf16
        : public std::codecvt<Elem, char, mbstate_t>
    { // facet for converting between Elem and UTF-16 multibyte sequences
        .....
    };

    template<class Elem,
        unsigned long Maxcode = 0x10ffff,
        codecvt_mode Mode = (codecvt_mode)0>
    class codecvt_utf8_utf16
        : public std::codecvt<Elem, char, mbstate_t>
    { // facet for converting between UTF-16 Elem and UTF-8 byte sequences
        .....
    };
} // namespace std
For each of the three code conversion facets codecvt_utf8, codecvt_utf16,
and codecvt_utf8_utf16:
-- Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.
-- Maxcode is the largest wide-character code that the facet will read
or write without reporting a conversion error.
-- If (Mode & consume_header), the facet consumes an optional initial
header sequence when reading a multibyte sequence to determine the
endianness of the subsequent multibyte sequence to be read.
-- If (Mode & generate_header), the facet generates an initial header
sequence when writing a multibyte sequence to advertise the endianness
of the subsequent multibyte sequence to be written.
-- If (Mode & little_endian), the facet generates a multibyte sequence in
little-endian order, as opposed to the default big-endian order.
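
The Maxcode and Mode parameters described above can be illustrated with a
short sketch. It assumes a toolchain that ships <codecvt> and wstring_convert
in the form proposed here (C++11 implementations do, although both were later
deprecated in C++17). The helper name fits_maxcode is hypothetical and is not
part of the proposal.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Mode is a bitmask: flags are combined with | and cast back to codecvt_mode.
constexpr std::codecvt_mode le_with_header =
    static_cast<std::codecvt_mode>(std::generate_header | std::little_endian);
static_assert(le_with_header == 3, "generate_header (2) | little_endian (1)");

// Hypothetical helper: reports whether code point c converts without error
// through a UTF-8 facet whose Maxcode template argument is Max.
template <unsigned long Max>
bool fits_maxcode(char32_t c) {
    std::wstring_convert<std::codecvt_utf8<char32_t, Max>, char32_t> conv;
    try {
        conv.to_bytes(c);  // throws std::range_error on a conversion error
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

With Maxcode set to 0x7f, for example, fits_maxcode<0x7f>(U'A') is expected to
succeed, while a code point above 0x7f is expected to be reported as a
conversion error.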
For the facet codecvt_utf8:
-- The facet converts between UTF-8 multibyte sequences and UCS2 or UCS4
(depending on the size of Elem) within the program.
-- Endianness does not affect how multibyte sequences are read or written.
-- The multibyte sequence can be written as either a text or a binary file.
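
As a minimal sketch of how codecvt_utf8 might be used through wstring_convert,
assuming Elem is char32_t (so the in-program form is UCS4) and that the
implementation provides <codecvt> as proposed. The function names to_utf8 and
from_utf8 are hypothetical.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Converts UCS4 text held in the program to a UTF-8 byte sequence.
std::string to_utf8(const std::u32string& wide) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(wide);
}

// Converts a UTF-8 byte sequence back to UCS4.
std::u32string from_utf8(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}
```

For instance, the single code point U+20AC should become the three bytes
E2 82 AC, and converting those bytes back should yield the original string.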
For the facet codecvt_utf16:
-- The facet converts between UTF-16 multibyte sequences and UCS2 or UCS4
(depending on the size of Elem) within the program.
-- Endianness affects how multibyte sequences are read or written.
-- The multibyte sequence must be written as a binary file.
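
The effect of endianness on codecvt_utf16 can be sketched as follows. This
assumes an implementation of <codecvt> as proposed, and it assumes that the
generated header is the usual UTF-16 byte order mark; the paper itself speaks
only of a "header sequence". The function names are hypothetical.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Default Mode: big-endian UTF-16 bytes, no header.
std::string to_utf16_be(const std::u32string& wide) {
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.to_bytes(wide);
}

// Little-endian byte order, preceded by a generated header.
std::string to_utf16_le_with_header(const std::u32string& wide) {
    constexpr std::codecvt_mode mode = static_cast<std::codecvt_mode>(
        std::generate_header | std::little_endian);
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, mode>,
                         char32_t> conv;
    return conv.to_bytes(wide);
}
```

Because the resulting byte sequences routinely contain zero bytes (U"A" in
big-endian order is 00 41), they should be handled with an explicit length
and written in binary mode, as the last bullet above requires.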
For the facet codecvt_utf8_utf16:
-- The facet converts between UTF-8 multibyte sequences and UTF-16 (one or
two 16-bit codes) within the program.
-- Endianness does not affect how multibyte sequences are read or written.
-- The multibyte sequence can be written as either a text or a binary file.
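
A short sketch of codecvt_utf8_utf16 in use, assuming Elem is char16_t and an
implementation of <codecvt> as proposed. It shows that a code point outside
the Basic Multilingual Plane occupies two 16-bit codes within the program.
The function names are hypothetical.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Converts a UTF-8 byte sequence to UTF-16 held in char16_t elements.
std::u16string utf8_to_utf16(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(bytes);
}

// Converts UTF-16 held in char16_t elements to a UTF-8 byte sequence.
std::string utf16_to_utf8(const std::u16string& wide) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(wide);
}
```

The four-byte UTF-8 sequence F0 9F 98 80 (U+1F600), for example, should
convert to the surrogate pair D83D DE00 and back again unchanged.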