ISO/IEC JTC1/SC22/WG21 N2401 = J16/07-0261
Code Conversion Facets for the Standard C++ Library

P.J. Plauger
Dinkumware, Ltd.
pjp@dinkumware.com

2007-09-03

With the acceptance of N2007 (Proposed Library Additions for Code Conversion)
we now have template classes wbuffer_convert and wstring_convert, as well
as basic_filebuf, that accept code-conversion facets as template parameters.
Unfortunately, the current draft C++ Standard defines only the default codecvt
facet, with weakly specified properties. This paper proposes the addition of
several facets that provide the most common forms of Unicode support.

Add the header <codecvt> with the following definitions:

namespace std {
enum codecvt_mode {
	consume_header = 4,
	generate_header = 2,
	little_endian = 1};

template<class Elem,
	unsigned long Maxcode = 0x10ffff,
	codecvt_mode Mode = (codecvt_mode)0>
	class codecvt_utf8
	: public std::codecvt<Elem, char, mbstate_t>
	{	// facet for converting between Elem and UTF-8 byte sequences
	.....
	};

template<class Elem,
	unsigned long Maxcode = 0x10ffff,
	codecvt_mode Mode = (codecvt_mode)0>
	class codecvt_utf16
	: public std::codecvt<Elem, char, mbstate_t>
	{	// facet for converting between Elem and UTF-16 multibyte sequences
	.....
	};

template<class Elem,
	unsigned long Maxcode = 0x10ffff,
	codecvt_mode Mode = (codecvt_mode)0>
	class codecvt_utf8_utf16
	: public std::codecvt<Elem, char, mbstate_t>
	{	// facet for converting between UTF-16 Elem and UTF-8 byte sequences
	.....
	};
}	// namespace std

For each of the three code conversion facets codecvt_utf8, codecvt_utf16,
and codecvt_utf8_utf16:

-- Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.

-- Maxcode is the largest wide-character code that the facet will read
or write without reporting a conversion error.

-- If (Mode & consume_header), the facet consumes an optional initial
header sequence when reading a multibyte sequence to determine the
endianness of the subsequent multibyte sequence to be read.

-- If (Mode & generate_header), the facet generates an initial header
sequence when writing a multibyte sequence to advertise the endianness
of the subsequent multibyte sequence to be written.

-- If (Mode & little_endian), the facet generates a multibyte sequence in
little-endian order, as opposed to the default big-endian order.
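
The effect of Maxcode and Mode can be illustrated with a minimal sketch. It
assumes an implementation that ships <codecvt> as described here, uses
char32_t as Elem, and calls the facet's out() member directly. The names
encoded, encode_utf16, no_flags, le_with_header, and check_encode_utf16 are
hypothetical and are not part of this proposal.

```cpp
#include <cassert>
#include <codecvt>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical result type for this sketch: the status reported by the
// facet together with the bytes it produced.
struct encoded {
    std::codecvt_base::result result;
    std::string bytes;
};

// Hypothetical helper: encode UCS-4 text as UTF-16 bytes by calling the
// facet's out() member directly with the given Maxcode and Mode.
template<unsigned long Maxcode, std::codecvt_mode Mode>
encoded encode_utf16(const std::u32string& in)
{
    char buf[64];
    // Each code point needs at most 4 bytes, plus 2 for an optional header.
    assert(in.size() * 4 + 2 <= sizeof buf);

    std::codecvt_utf16<char32_t, Maxcode, Mode> cvt;
    std::mbstate_t state = std::mbstate_t();
    const char32_t* from_next = in.data();
    char* to_next = buf;
    const std::codecvt_base::result r = cvt.out(
        state, in.data(), in.data() + in.size(), from_next,
        buf, buf + sizeof buf, to_next);
    return encoded{r, std::string(buf, to_next)};
}

// Mode values are formed by OR-ing the enumerators and casting back.
constexpr std::codecvt_mode no_flags = static_cast<std::codecvt_mode>(0);
constexpr std::codecvt_mode le_with_header =
    static_cast<std::codecvt_mode>(std::generate_header | std::little_endian);

// Usage example for the helper above.
void check_encode_utf16()
{
    const std::u32string a(1, char32_t(0x41));  // U+0041

    // Default Mode: big-endian order, no header.
    const encoded be = encode_utf16<0x10ffff, no_flags>(a);
    assert(be.result == std::codecvt_base::ok);
    assert(be.bytes == std::string("\x00\x41", 2));

    // generate_header | little_endian: header FF FE, then little-endian data.
    const encoded le = encode_utf16<0x10ffff, le_with_header>(a);
    assert(le.result == std::codecvt_base::ok);
    assert(le.bytes == std::string("\xFF\xFE\x41\x00", 4));

    // Maxcode = 0x7f: U+00E9 exceeds the limit, so a conversion error
    // is reported.
    const encoded limited =
        encode_utf16<0x7f, no_flags>(std::u32string(1, char32_t(0xE9)));
    assert(limited.result == std::codecvt_base::error);
}
```

Calling out() directly with a buffer known to be large enough keeps the
sketch independent of how wstring_convert sizes its internal buffer.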

For the facet codecvt_utf8:

-- The facet converts between UTF-8 multibyte sequences and UCS2 or UCS4
(depending on the size of Elem) within the program.

-- Endianness does not affect how multibyte sequences are read or written.

-- The multibyte sequence can be written as either a text or a binary file.
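
As a sketch of how codecvt_utf8 might be used, the following combines it
with wstring_convert from N2007, assuming char32_t as Elem so that the
in-program form is UCS4. The names utf8_converter, to_utf8, from_utf8, and
check_utf8_round_trip are hypothetical.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical alias for this sketch: a converter between UCS-4 held in
// char32_t and UTF-8 byte sequences.
typedef std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>
    utf8_converter;

// Hypothetical helper: UCS-4 to UTF-8. Throws std::range_error on a
// conversion error, as wstring_convert does when no error string is given.
std::string to_utf8(const std::u32string& wide)
{
    utf8_converter conv;
    return conv.to_bytes(wide);
}

// Hypothetical helper: UTF-8 to UCS-4.
std::u32string from_utf8(const std::string& bytes)
{
    utf8_converter conv;
    return conv.from_bytes(bytes);
}

// Usage example: round-trip three code points of different UTF-8 lengths.
void check_utf8_round_trip()
{
    // U+0041 (one byte), U+20AC (three bytes), U+1F600 (four bytes).
    const char32_t raw[] = { 0x41, 0x20AC, 0x1F600 };
    const std::u32string wide(raw, 3);

    const std::string bytes = to_utf8(wide);
    assert(bytes == "A\xE2\x82\xAC\xF0\x9F\x98\x80");
    assert(from_utf8(bytes) == wide);
}
```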

For the facet codecvt_utf16:

-- The facet converts between UTF-16 multibyte sequences and UCS2 or UCS4
(depending on the size of Elem) within the program.

-- Endianness affects how multibyte sequences are read or written.

-- The multibyte sequence must be written as a binary file.
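
A sketch of reading UTF-16 with codecvt_utf16 follows, assuming char32_t
as Elem and Mode set to consume_header so that an optional header selects
the byte order. The names decode_utf16 and check_decode_utf16 are
hypothetical. The byte strings contain embedded null bytes, which is why
such data belongs in a binary rather than a text file.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-16 bytes into UCS-4, honoring an optional
// initial header. Without a header the default big-endian order applies.
std::u32string decode_utf16(const std::string& bytes)
{
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10ffff, std::consume_header>,
        char32_t> conv;
    return conv.from_bytes(bytes);
}

// Usage example: the same code point in both byte orders.
void check_decode_utf16()
{
    const std::u32string expected(1, char32_t(0x1F600));

    // U+1F600 as the big-endian surrogate pair D83D DE00, no header.
    const std::string be("\xD8\x3D\xDE\x00", 4);
    assert(decode_utf16(be) == expected);

    // The same pair in little-endian order, preceded by the header FF FE.
    const std::string le("\xFF\xFE\x3D\xD8\x00\xDE", 6);
    assert(decode_utf16(le) == expected);
}
```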

For the facet codecvt_utf8_utf16:

-- The facet converts between UTF-8 multibyte sequences and UTF-16 (one or
two 16-bit codes) within the program.

-- Endianness does not affect how multibyte sequences are read or written.

-- The multibyte sequence can be written as either a text or a binary file.
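
A sketch of codecvt_utf8_utf16 in use follows, assuming char16_t as Elem
and wstring_convert from N2007 as the driver. It shows a four-byte UTF-8
sequence becoming two 16-bit codes within the program. The names
utf8_utf16_converter, utf8_to_utf16, utf16_to_utf8, and
check_utf8_utf16 are hypothetical.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical alias for this sketch: a converter between UTF-16 held in
// char16_t and UTF-8 byte sequences.
typedef std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>
    utf8_utf16_converter;

// Hypothetical helper: UTF-8 bytes to UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& bytes)
{
    utf8_utf16_converter conv;
    return conv.from_bytes(bytes);
}

// Hypothetical helper: UTF-16 code units to UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& units)
{
    utf8_utf16_converter conv;
    return conv.to_bytes(units);
}

// Usage example: U+1F600 occupies four UTF-8 bytes and two 16-bit codes.
void check_utf8_utf16()
{
    const std::string bytes("\xF0\x9F\x98\x80");
    const std::u16string units = utf8_to_utf16(bytes);

    assert(units.size() == 2);
    assert(units[0] == 0xD83D);  // high surrogate
    assert(units[1] == 0xDE00);  // low surrogate
    assert(utf16_to_utf8(units) == bytes);
}
```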