JT/02-0114, UTF Datatypes

SC22 N914

US position regarding the German NB proposal on UTF-16 datatype in SC22 N3356

L2 has considered the proposal from the German NB for a new work item to add a UTF-16 datatype to the C standard (SC 22 N 3356). The proposal was discussed with C language committee members during the L2 meeting on 2002-02-12. On the basis of that discussion, L2 recommends that the US JTC 1 TAG adopt the following as the US position:

The U.S. NB supports this new work item. Adding a UTF-16 datatype and string literal support to the C standard would greatly benefit implementers of Unicode / 10646 in making use of the C standard.
In particular, the following additions would be technically advantageous:
1. UTF-16 datatype. Exactly 16 bits, to explicitly hold a Unicode / 10646 UTF-16 code unit.
2. UTF-16 string type. Linked explicitly with the UTF-16 datatype, so that static string initialization with UTF-16 data would be easy and explicit.
3. UTF-32 datatype. Exactly 32 bits, to explicitly hold a Unicode / 10646 code point (without the cross-platform size ambiguity of wchar_t).
4. UTF-32 string type (optional). Linked explicitly with the UTF-32 datatype. This might be useful, but for most implementations is less important than having the UTF-16 string type.
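As an illustration of the four items above, the following is a minimal sketch of how such datatypes can be approximated with the C99 exact-width integer types. The names utf16_t and utf32_t, the sample arrays, and the helper utf16_unit_count are hypothetical and are not defined by any C standard. The sketch also shows why a linked string type matters: without a UTF-16 string literal, static data has to be written out one code unit at a time.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, assumed for illustration only.
   uint16_t and uint32_t come from <stdint.h> (C99) and are exactly
   16 and 32 bits wide on platforms that provide them. */
typedef uint16_t utf16_t;  /* holds one UTF-16 code unit */
typedef uint32_t utf32_t;  /* holds one code point       */

/* Static initialization without string literal support:
   U+0041, U+00F1, then U+10400 as the surrogate pair D801 DC00,
   followed by a terminating zero. */
const utf16_t sample_utf16[] = {
    0x0041, 0x00F1, 0xD801, 0xDC00, 0x0000
};

/* The same text as UTF-32: one array element per code point. */
const utf32_t sample_utf32[] = {
    0x00000041, 0x000000F1, 0x00010400, 0x00000000
};

/* Count the code units that precede the terminating zero. */
size_t utf16_unit_count(const utf16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

Here the three code points occupy four UTF-16 code units but only three UTF-32 elements. By contrast, sizeof(wchar_t) differs between platforms, which is the ambiguity item 3 is meant to avoid.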
Regarding the terminology to be associated with any such new datatypes for C, usage of "UTF-16" and "UTF-32" is preferred. The exact form of the names for new datatypes would, of course, be up to the C committee to determine, but names along the lines of "utf16_t", "utf32_t" or the like would be satisfactory.
- It is advisable to avoid any terminology involving "UCS-2" and "UCS-4". The term "UCS-2" would be misleading, since it is the fixed-width 16-bit form of 10646, limited to the BMP, whereas all significant implementations are now moving to the variable-width UTF-16 to get all-plane support for 10646. Use of "UCS-4" is not parallel, and just induces a cognitive matching problem of converting from 4 octets to 32 bits -- which is the more normal concept for a 32-bit datatype. Furthermore, the "16" and "32" are more normal concepts for C programmers dealing with datatype sizes.
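The difference between fixed-width UCS-2 and variable-width UTF-16 can be sketched as follows. This is an illustrative encoder only, under the assumption of a hypothetical function name utf16_encode; it is not a proposed API. Code points in the BMP take one 16-bit code unit, while code points in the other planes take a surrogate pair, which UCS-2 cannot represent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
   Returns the number of code units written to out (1 or 2),
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* Surrogate code points and values above U+10FFFF are not encodable. */
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        return 0;
    }

    /* BMP: a single code unit, identical to the UCS-2 value. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Supplementary planes: split the 20-bit offset into two 10-bit halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```

For example, U+0041 encodes as the single unit 0x0041, while U+1D11E encodes as the pair 0xD834 0xDD1E.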
The U.S. does not suggest adding any corresponding APIs to the standard libraries to match the existing APIs for the char and wchar_t string types. Simply making the datatype additions listed above would meet the essential requirements that vendors have on the language to make their Unicode porting and other tasks simpler and more uniform. API support for Unicode semantics is, at this point at least, more appropriately provided by various third-party add-on libraries.
The U.S. considers it important that other language standards, and in particular, C++, take these issues into account, so that if a new datatype or datatypes are added to C, interoperability with other languages can be maintained as well.