SC22 N914
US position regarding the German NB proposal on UTF-16 datatype
in SC22 N3356
L2 has considered the proposal from the German NB for a new work item to add
a UTF-16 data type to the C standard (SC22 N3356). This was discussed in a
meeting with C language
committee members during the L2 meeting on 2002-02-12. On the basis of that
discussion, L2 recommends that the US JTC 1 TAG adopt the following as the US
position:
- The U.S. NB supports this new work item. Adding a UTF-16 datatype and
string literal support to the C standard would greatly benefit implementers
of Unicode / 10646 in making use of the C standard.
- In particular, the following additions would be technically advantageous:
  - UTF-16 datatype. Exactly 16 bits, to explicitly hold a
    Unicode/10646 UTF-16 code unit.
  - UTF-16 string type. Linked explicitly with the UTF-16 datatype,
    so that static string initialization with UTF-16 data would be easy and
    explicit.
  - UTF-32 datatype. Exactly 32 bits, to explicitly hold a
    Unicode/10646 code point (without the cross-platform size ambiguity of
    wchar_t).
  - UTF-32 string type (optional). Linked explicitly with
    the UTF-32 datatype. This might be useful, but for most implementations
    is less important than having the UTF-16 string type.
- Regarding the terminology to be associated with any such new datatypes for
C, usage of "UTF-16" and "UTF-32" is preferred. The
exact form of the names for new datatypes would, of course, be up to the C
committee to determine, but names along the lines of "utf16_t",
"utf32_t" or the like would be satisfactory.
- It is advisable to avoid any terminological usage involving
"UCS-2" and "UCS-4". The term "UCS-2"
would be misleading, since it is the fixed-width 16-bit form of
10646, limited only to the BMP, whereas all significant implementations
are now moving to the variable-width UTF-16, to get all-plane
support for 10646. Use of "UCS-4" is not parallel to "UTF-16", and just
induces a cognitive matching problem of converting from 4 octets to 32
bits, when 32 bits is the more normal concept for a 32-bit datatype.
Furthermore, bit counts such as "16" and "32" are more normal
concepts for C programmers dealing with datatype sizes.
- The U.S. does not suggest adding any corresponding APIs to the standard
libraries to match the already existing APIs for the char and wchar_t
string types. Simply making the datatype additions listed in the second item
above would meet the essential requirements that vendors have on the
language to make their Unicode porting and other tasks simpler and more
uniform. API support for Unicode semantics is, at this point at least, more
appropriately provided by various third-party add-on libraries.
- The U.S. considers it important that other language standards, and in
particular, C++, take these issues into account, so that if a new datatype
or datatypes are added to C, interoperability with other languages can be
maintained as well.
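
As an illustration only, the datatypes discussed above might be sketched in
C roughly as follows. The names utf16_t and utf32_t follow the naming
suggestion in this position and are hypothetical; they are not part of any
C standard. The helper utf16_encode is likewise an assumption introduced for
this sketch, and the typedefs assume an implementation that provides the
exact-width types uint16_t and uint32_t. The sketch shows why a fixed-width
"UCS-2" view is insufficient: a code point outside the BMP needs two 16-bit
code units. It also shows that, without string literal support, static
UTF-16 data has to be spelled out one code unit at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type names (assumption): exactly 16 and exactly 32 bits. */
typedef uint16_t utf16_t;   /* holds one UTF-16 code unit  */
typedef uint32_t utf32_t;   /* holds one Unicode code point */

/*
 * Encode one code point as UTF-16 (hypothetical helper, for illustration).
 * Returns the number of code units written: 1 for BMP code points,
 * 2 (a surrogate pair) for supplementary code points, 0 if cp is not
 * a valid Unicode scalar value.
 */
static size_t utf16_encode(utf32_t cp, utf16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range or a surrogate */
    if (cp < 0x10000) {
        out[0] = (utf16_t)cp;           /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (utf16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (utf16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate  */
    return 2;
}

/*
 * Static initialization without literal support: "Hi" followed by
 * U+1D11E (outside the BMP, hence two code units) and a terminator.
 */
static const utf16_t greeting[] = {
    0x0048, 0x0069, 0xD834, 0xDD1E, 0x0000
};
```

A string type linked to the 16-bit datatype would allow data such as the
greeting array to be written as an ordinary literal rather than as a list
of numeric code units.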