A Proposal for International Locale Support in the Standard C++ Library

ANSI: X3J16/93-0167
ISO: WG21/N0374

by Nathan Myers
myersn@roguewave.com
Rogue Wave Software, Inc.
P.O.Box 2328, Corvallis, OR 97339 USA
voice: (800) 487-3217  FAX: (503) 757-6650

Copyright 1993 by Rogue Wave Software, Inc.

1. Introduction

The need for standard language libraries to support international character sets, data formats, and messages was acknowledged by the original ANSI C committee. They invented and standardized the new library component described by , and decreed that calls to this library would change the behavior of many of the traditional C library functions. They also invented a variety of other functions that were added to other header files.

Given the lack of field experience with these features before they were standardized, it is easy to see why they have been little used in industrial applications.

The "normative addendum" is an attempt to fix many of the original problems, and goes a long way in that direction. However, it still leaves lots of room for improvement, at least as a foundation for the C++ Standard Library. With the benefits of hindsight and a more powerful language, we can build a C++ locale library that will be used.

2. Improvements Possible over Standard C Library Locales

The worst problem in the original standard was its thorough lack of re-entrancy. The interface implied a nest of hidden global variables that could not be saved or reset, much less encapsulated behind a re-entrant interface. These hidden global variables affect many standard global functions in ways confusing to many.

In addition, a number of important features were omitted, such as help in parsing the new numeric, monetary, and date/time output formats it defined, and the ability to construct a wide character incrementally from a multibyte sequence other than a FILE.

The "normative addendum" fixes some of these problems, but the omissions remain.
Encapsulating locale and character set semantics is still impossible. Such restrictions imperil portability in two important (and rapidly growing) application domains: multithreaded programs, and network servers.

Therefore, in this proposal the C++ Library facilities are not constructed on top of C Library facilities, but bypass them entirely. The C Library locale features may be built *on top of* the more complete features described here.

3. C++ Library Locale Facilities

A Standard C Library locale has five parts, or "categories": collation order, character classes (ctype), and monetary, numeric and date/time formats. POSIX adds a sixth: messages. The categories are all brought together under a locale because they are interdependent. Collation order, data formats, and messages depend on the language in use, which also determines the underlying character set. One cannot, in general, choose one category independently of others, although variations are possible in each category.

A C++ Locale is, of course, an object. In implementation, it can be quite heavy-weight. To ease the memory management burden on users, it is value-oriented, so that a locale may be treated as an atomic object which may be cheaply assigned, passed as an argument, and stored as a class member. To avoid ambiguous semantics, locales are immutable; but for convenience, a new locale can be constructed as a variation on another, existing, locale.

In addition to its constructors, class Locale provides static member functions to obtain three standard locales: Locale::classic(), the classic "C" locale; Locale::global(), a snapshot of the current global (Standard C Library) locale; and Locale::transparent(), which tracks the state of the global locale.

The Locale instance physically contains only a pointer to a separate representation object, which provides protected virtual functions to allow extensible semantics.
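
The handle-and-representation arrangement described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the proposed Locale class: the names LocaleRep, LocaleHandle, live_count() and shares_with() are hypothetical, and the representation carries nothing but a name. It shows why copies are cheap (only a pointer is copied and a count is adjusted) and how the last handle to go away destroys the shared representation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the representation object (Locale::Virtuals).
class LocaleRep {
public:
    explicit LocaleRep(const std::string& name) : name_(name), refs_(0) { ++live_count(); }
    virtual ~LocaleRep() { --live_count(); }
    virtual std::string name() const { return name_; }

    void add_reference() { ++refs_; }
    // Returns true when the last reference has just been dropped.
    bool remove_reference() { return --refs_ == 0; }

    // Number of representation objects currently alive (demonstration aid only).
    static int& live_count() { static int n = 0; return n; }

private:
    std::string name_;
    std::size_t refs_;
    LocaleRep(const LocaleRep&);            // not defined: representations are shared, never copied
    LocaleRep& operator=(const LocaleRep&); // not defined
};

// Hypothetical handle: copying shares the representation instead of cloning it.
class LocaleHandle {
public:
    explicit LocaleHandle(LocaleRep* rep) : rep_(rep) { rep_->add_reference(); }
    LocaleHandle(const LocaleHandle& other) : rep_(other.rep_) { rep_->add_reference(); }
    ~LocaleHandle() { release(); }

    LocaleHandle& operator=(const LocaleHandle& other) {
        if (rep_ != other.rep_) {           // covers self-assignment and already-shared handles
            other.rep_->add_reference();
            release();
            rep_ = other.rep_;
        }
        return *this;
    }

    std::string name() const { return rep_->name(); }   // forwarding function
    bool shares_with(const LocaleHandle& other) const { return rep_ == other.rep_; }

private:
    void release() { if (rep_->remove_reference()) delete rep_; }
    LocaleRep* rep_;
};
```

Because the handle never exposes a way to modify the representation, sharing it between copies is safe, which is the point of making locales immutable.
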
The Locale class provides forwarding functions to give access to the virtuals, and to allow a degree of pre- and post-processing. The representation is reference-counted, and may in turn reference-count portions it shares with other representation instances.

Besides its re-entrancy and convenient encapsulation, a notable difference between class Locale and the Standard C facilities is that (as in Schwarz's iostreams) little provision is made for in-memory multibyte strings. If needed, they may be generated with the help of an mbstreambuf and a strstream; but since for most uses the wide form is more practical, the design is optimized in its favor.

4. Relationship between Classes Locale and ios

Iostream function semantics depend on locale parameters. Because the functions (and the operators) involved take no locale argument, and because different streams may require (at least) different multibyte/wide conversions, each iostream needs a locale member to use for reference during locale-dependent operations.

This proposal simplifies Jerry Schwarz's proposal (X3J16/93-0125, WG21/N0332) by eliminating the members ios::btowc(), ios::wcisb(), and ios::wctob() described in his section 7.2. Instead, ios is expected to delegate to the locale any such conversions.

In place of these functions, ios gains functions to set and retrieve its locale member. I call setting an iostream's locale member "imbuing the iostream". First, to set it:

    Locale ios::locale(Locale const& loc)
    {
        Locale old = this->xlocale;
        rdbuf()->sync();            // clean out buffers
        this->xlocale = loc;
        rdbuf()->locale(loc);       // notify streambuf of possible new mapping
        return old;
    }

To retrieve the imbued locale:

    Locale ios::locale() const { return this->xlocale; }

If none has been imbued, it returns the classic "C" locale.

To use an imbued locale, operators >> and << simply call its member functions via the public interface ios::locale().
For example:

    ostream& ostream::operator<<(long i)
    {
        locale().insert(*this, i);
        return *this;
    }

Extraction operators must use "loc.is(Locale::SPACE, c)" in place of "isspace((unsigned char)c)" to delimit fields, if they are to support large character sets.

Implicit in this proposal is Schwarz's new variant of class streambuf: mbstreambuf. If the ultimate source/sink of data prefers multibyte characters, an mbstreambuf may be "pushed" into the iostream, in between the iostream and the existing streambuf. The mbstreambuf buffers the wide characters on one side, and the multibyte characters on the other, and delegates the conversion to the imbued locale by calling Locale::overflow() or underflow() to convert a bufferful of characters at a time. This avoids any per-character function call overhead on such conversions.

Because the Standard C Library locale's magical effect on other global functions was its most confusing quality, I have proposed that iostreams default to the "C" locale behavior regardless of the current global locale; this guarantees classical behavior until something else is asked for. In effect, each stream is initially imbued with Locale::classic().

5. Sample Definition:

Here is the proposed header file. Detailed explanations follow in Section 6. I have omitted exception-handling declarations in this (early) draft. I have also omitted support for message handling, as explained later.
    //
    #ifndef __locales_h__
    #define __locales_h__ 1

    #include /* for wchar_t */
    #include /* for UCHAR_MAX */

    struct tm;
    class ios;
    class istream;
    class ostream;
    class streambuf;
    class mbstreambuf;

    class Locale {
     public:
      class Virtuals;   // defined below; declared here so the constructor can name it

      enum category_t {
        COLLATE = 1<<0, CTYPE = 1<<1, MONETARY = 1<<2,
        NUMERIC = 1<<3, TIME = 1<<4, MESSAGES = 1<<5,
        ALL = (1<<6)-1
      };

      ~Locale() { imp_->remove_reference(); }
      Locale(Locale const& l) : imp_(l.imp_) { imp_->add_reference(); }
      Locale(Locale::Virtuals* imp) : imp_(imp) { imp_->add_reference(); }
      Locale(char const*);
      Locale(Locale const&, char const*, category_t);
      Locale const& operator=(Locale const& other);

      int ok() const { return imp_ != 0; }  // construction succeeded?

      int operator==(Locale const& other) const
        { return imp_->equal(other.imp_); }
      int operator!=(Locale const& other) const
        { return !imp_->equal(other.imp_); }

      // iostream support:
      void insert(ostream& s, long v) const { imp_->insert(s,v); }
      void insert(ostream& s, unsigned long v) const { imp_->insert(s,v); }
      void insert(ostream& s, double v) const { imp_->insert(s,v); }
      void extract(istream& s, long& v) const { imp_->extract(s,v); }
      void extract(istream& s, unsigned long& v) const { imp_->extract(s,v); }
      void extract(istream& s, double& v) const { imp_->extract(s,v); }

      int narrow(wchar_t w, char& c) const { return imp_->narrow(w,c); }
      int widen(char c, wchar_t& w) const { return imp_->widen(c,w); }
      int overflow (mbstreambuf* from) const { return imp_->overflow(from); }
      int underflow(mbstreambuf* to) const { return imp_->underflow(to); }

      // ctype functions
      enum ctype {
        SPACE=1<<0, PRINT=1<<1, CNTRL=1<<2, UPPER=1<<3, LOWER=1<<4,
        ALPHA=1<<5, DIGIT=1<<6, PUNCT=1<<7, XDIGIT=1<<8,
        ALNUM=(1<<5)|(1<<6), GRAPH=(1<<7)|(1<<6)|(1<<5)
      };

      int is(ctype mask, unsigned char c) const
        { return ((int)imp_->ctypetable[c] & (int)mask) != 0; }
      int is(ctype mask, char c) const
        { return is(mask,(unsigned char)c); }
      int is(ctype mask, signed char c) const
        { return is(mask,(unsigned char)c); }
      int is(ctype mask, int c) const
        { return ((c&~UCHAR_MAX) ? 0 : is(mask, (unsigned char)c)); }
      // notice that the above functions are wholly inline.
      int is(ctype mask, wchar_t w) const { return imp_->is(mask, w); }

      char toupper(char c) const { return imp_->toupper(c); }
      char tolower(char c) const { return imp_->tolower(c); }
      signed char toupper(signed char c) const
        { return imp_->toupper(char(c)); }
      signed char tolower(signed char c) const
        { return imp_->tolower(char(c)); }
      unsigned char toupper(unsigned char c) const
        { return imp_->toupper(char(c)); }
      unsigned char tolower(unsigned char c) const
        { return imp_->tolower(char(c)); }
      int toupper(int c) const
        { return (c&~UCHAR_MAX) ? c : imp_->toupper(char(c)); }
      int tolower(int c) const
        { return (c&~UCHAR_MAX) ? c : imp_->tolower(char(c)); }
      wchar_t toupper(wchar_t w) const { return imp_->toupper(w); }
      wchar_t tolower(wchar_t w) const { return imp_->tolower(w); }

      // string functions
      int collate(char const* sa, size_t la, char const* sb, size_t lb) const
        { return imp_->collate(sa, la, sb, lb); }
      int collate(wchar_t const* sa, size_t la,
                  wchar_t const* sb, size_t lb) const
        { return imp_->collate(sa, la, sb, lb); }

      // time functions
      void inserttime(ostream& s, struct tm const* tmb,
                      char const* pattern) const;
      void inserttime(ostream& s, struct tm const* tmb, char format) const
        { imp_->inserttime(s,tmb,format); }
      void extracttime(istream& s, struct tm* t) const
        { imp_->extracttime(s,t); }
      void extractdate(istream& s, struct tm* t) const
        { imp_->extractdate(s,t); }
      void extractweekday(istream& s, struct tm* t) const
        { imp_->extractweekday(s,t); }
      void extractmonthname(istream& s, struct tm* t) const
        { imp_->extractmonthname(s,t); }

      enum dateorder_t { DMY, MDY, YMD, YDM, DYM, MYD };
      dateorder_t dateorder() const { return imp_->dateorder(); }

      // money functions
      enum moneysymbol_t { NONE, LOCAL, INTL };
      void insertmoney(ostream& s, double units, moneysymbol_t sym) const
        { imp_->insertmoney(s, units, sym); }
      void extractmoney(istream& s, double& units, moneysymbol_t sym) const
        { imp_->extractmoney(s, units, sym); }

      // static members:
      static Locale global();               // the current global locale
      static Locale global(Locale const&);  // replaces ::setlocale(...)
      static Locale transparent();          // the transparent global locale
      static Locale classic();              // the "C" locale

      class Virtuals {
       protected:
        // miscellaneous
        virtual void name(ostream&) const = 0;
        virtual int equal(Virtuals const*) const = 0;

        // iostream support
        virtual void insert(ostream& s, long v) const = 0;
        virtual void insert(ostream& s, unsigned long v) const = 0;
        virtual void insert(ostream& s, double v) const = 0;
        virtual void extract(istream& s, long& v) const = 0;
        virtual void extract(istream& s, unsigned long& v) const = 0;
        virtual void extract(istream& s, double& v) const = 0;

        virtual int narrow(wchar_t, char&) const = 0;
        virtual int widen(char, wchar_t&) const = 0;
        virtual int overflow (mbstreambuf* from) const = 0;
        virtual int underflow(mbstreambuf* to) const = 0;

        // ctype functions
        Locale::ctype const* ctypetable;  // data member, for is(ctype, char);
        virtual int is(Locale::ctype mask, wchar_t) const = 0;
        virtual char toupper(char) const = 0;
        virtual wchar_t toupper(wchar_t) const = 0;
        virtual char tolower(char) const = 0;
        virtual wchar_t tolower(wchar_t) const = 0;

        // stdlib functions:
        virtual int collate(const char*, size_t len1,
                            const char*, size_t len2) const = 0;
        virtual int collate(const wchar_t*, size_t len1,
                            const wchar_t*, size_t len2) const = 0;

        // time functions
        virtual void inserttime(ostream& s, struct tm const* tmb,
                                char format) const = 0;
        virtual void extracttime(istream& s, struct tm* t) const = 0;
        virtual void extractdate(istream& s, struct tm* t) const = 0;
        virtual void extractweekday(istream& s, struct tm* t) const = 0;
        virtual void extractmonthname(istream& s, struct tm* t) const = 0;
        virtual Locale::dateorder_t dateorder() const = 0;

        // money functions
        virtual void insertmoney(ostream& s, double units,
                                 Locale::moneysymbol_t sym) const = 0;
        virtual void extractmoney(istream& s, double& units,
                                  Locale::moneysymbol_t sym) const = 0;

        virtual Virtuals* copybut(char const*, Locale::category_t) const = 0;

        Virtuals(size_t refs) : refcount_(size_t(refs-1)) {}
        virtual ~Virtuals();

       private:
        size_t refcount_;
        void add_reference() { if (this) ++refcount_; }
        void remove_reference() { if (this && refcount_-- == 0) delete this; }
        Virtuals(Virtuals const&);                   // not defined
        Virtuals const& operator=(Virtuals const&);  // not defined
        friend class Locale;
      };

     private:
      Virtuals* imp_;
      void name(ostream& s) const { imp_->name(s); }  // used by operator<<

      // these insert and extract the unique ASCII name of a locale
      friend ostream& operator<<(ostream& s, Locale const& l)
        { l.name(s); return s; }
      friend istream& operator>>(istream& s, Locale& l);
    };

    // Locale::category_t bitwise operators:
    Locale::category_t operator~(Locale::category_t a);
    Locale::category_t operator&(Locale::category_t a, Locale::category_t b);
    Locale::category_t operator|(Locale::category_t a, Locale::category_t b);
    Locale::category_t operator^(Locale::category_t a, Locale::category_t b);
    Locale::category_t const& operator&=(Locale::category_t& a,
                                         Locale::category_t b);
    Locale::category_t const& operator|=(Locale::category_t& a,
                                         Locale::category_t b);
    Locale::category_t const& operator^=(Locale::category_t& a,
                                         Locale::category_t b);

    #endif /* defined(__locales_h__) */

6. Explanation of functions:

Members of class Locale
-----------------------

    Locale(char const*);

This is the generic constructor. It takes the same string argument values as the C library function ::setlocale(...).
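
The header in Section 5 declares bitwise operators for Locale::category_t but does not define them. One plausible set of definitions is sketched below, using a free-standing copy of the enumeration so that the fragment compiles on its own. The masking with ALL in operator~ is my assumption rather than something the proposal states: it keeps the result within the values the enumeration can represent.

```cpp
#include <cassert>

// Stand-in for Locale::category_t, with the enumerator values from the header.
enum category_t {
    COLLATE = 1<<0, CTYPE = 1<<1, MONETARY = 1<<2,
    NUMERIC = 1<<3, TIME = 1<<4, MESSAGES = 1<<5,
    ALL = (1<<6)-1
};

// Complement within the set of defined categories (assumption: mask with ALL).
inline category_t operator~(category_t a)
    { return category_t(~int(a) & int(ALL)); }

inline category_t operator&(category_t a, category_t b)
    { return category_t(int(a) & int(b)); }
inline category_t operator|(category_t a, category_t b)
    { return category_t(int(a) | int(b)); }
inline category_t operator^(category_t a, category_t b)
    { return category_t(int(a) ^ int(b)); }

// Compound forms return a const reference, matching the declared signatures.
inline category_t const& operator&=(category_t& a, category_t b)
    { return a = a & b; }
inline category_t const& operator|=(category_t& a, category_t b)
    { return a = a | b; }
inline category_t const& operator^=(category_t& a, category_t b)
    { return a = a ^ b; }
```

With such definitions, an argument like `MONETARY | NUMERIC` keeps the type category_t instead of decaying to int, so it can be passed to the variation constructor without a cast.
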
    enum category_t {
      COLLATE = 1<<0, CTYPE = 1<<1, MONETARY = 1<<2,
      NUMERIC = 1<<3, TIME = 1<<4, MESSAGES = 1<<5,
      ALL = (1<<6)-1
    };
    Locale(Locale const&, char const*, category_t cat);

This constructor generates a variation from an existing locale. The *cat* argument may be any bitwise combination of the categories listed.

    Locale(Locale const& loc) : imp_(loc.imp_) { imp_->add_reference(); }
    Locale const& operator=(Locale const& other)
    {
      if (imp_ != other.imp_) {
        imp_->remove_reference();
        imp_ = other.imp_;
        imp_->add_reference();
      }
      return *this;
    }

These are the generic copy operators. As usual, the assignment operator needs to check for identity.

    Locale(Locale::Virtuals* imp) : imp_(imp) { imp_->add_reference(); }

This constructor allows for user-defined derivations.

    ~Locale() { imp_->remove_reference(); }

The destructor. Note that it is not virtual.

    int ok() const { return imp_ != 0; }  // construction succeeded?

ok() must be used to determine if a locale was constructed successfully [if exceptions are disabled?].

    enum ctype {
      SPACE=1<<0, PRINT=1<<1, CNTRL=1<<2, UPPER=1<<3, LOWER=1<<4,
      ALPHA=1<<5, DIGIT=1<<6, PUNCT=1<<7, XDIGIT=1<<8,
      ALNUM=(1<<5)|(1<<6), GRAPH=(1<<5)|(1<<6)|(1<<7)
    };
    int is(ctype mask, unsigned char c) const
      { return ((int)imp_->ctypetable[c] & (int)mask) != 0; }
    int is(ctype mask, char c) const
      { return is(mask,(unsigned char)c); }
    int is(ctype mask, signed char c) const
      { return is(mask,(unsigned char)c); }
    int is(ctype mask, int c) const
      { return ((c&~UCHAR_MAX) ? 0 : is(mask, (unsigned char)c)); }
    int is(ctype mask, wchar_t wc) const { return imp_->is(mask, wc); }

is() implements semantics for the char types efficiently enough to be used per-character in stream operations, while remaining configurable. The wchar_t version is implemented virtually for greater flexibility.
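
The table lookup behind the char overloads of is() can be sketched as follows. The struct CtypeTable and the enum name ctype_mask are hypothetical, and filling the table from the <cctype> functions is only a convenient stand-in for whatever locale data a real representation would load. The mask values are those of the proposal's ctype enumeration.

```cpp
#include <cassert>
#include <cctype>
#include <climits>

// Mask bits copied from the proposal's Locale::ctype enumeration.
enum ctype_mask {
    SPACE = 1<<0, PRINT = 1<<1, CNTRL = 1<<2, UPPER = 1<<3, LOWER = 1<<4,
    ALPHA = 1<<5, DIGIT = 1<<6, PUNCT = 1<<7, XDIGIT = 1<<8,
    ALNUM = ALPHA | DIGIT, GRAPH = ALPHA | DIGIT | PUNCT
};

// Hypothetical table holder: one word of class bits per unsigned char value.
struct CtypeTable {
    unsigned short bits[UCHAR_MAX + 1];

    // Assumption: populate from the C library's current-locale classification.
    CtypeTable() {
        for (int c = 0; c <= UCHAR_MAX; ++c) {
            unsigned short m = 0;
            if (std::isspace(c))  m |= SPACE;
            if (std::isprint(c))  m |= PRINT;
            if (std::iscntrl(c))  m |= CNTRL;
            if (std::isupper(c))  m |= UPPER;
            if (std::islower(c))  m |= LOWER;
            if (std::isalpha(c))  m |= ALPHA;
            if (std::isdigit(c))  m |= DIGIT;
            if (std::ispunct(c))  m |= PUNCT;
            if (std::isxdigit(c)) m |= XDIGIT;
            bits[c] = m;
        }
    }

    // One array index and one bitwise AND: cheap enough to call per character.
    bool is(int mask, unsigned char c) const { return (bits[c] & mask) != 0; }
    bool is(int mask, char c) const { return is(mask, (unsigned char)c); }
    // Values outside the unsigned char range belong to no class.
    bool is(int mask, int c) const
        { return (c & ~UCHAR_MAX) ? false : is(mask, (unsigned char)c); }
};
```

A composite mask such as ALNUM matches when any of its bits is set, so a single lookup answers "letter or digit" without two separate tests.
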
    T toupper(T c) const { return imp_->toupper(c); }
    T tolower(T c) const { return imp_->tolower(c); }

toupper() and tolower() implement the corresponding semantics for all the character varieties.

    void inserttime(ostream& s, struct tm const* tmb,
                    char const* pattern) const;

This interprets its *pattern* argument exactly as the corresponding argument to ::strftime().

    static Locale global();       // the current global locale
    static Locale transparent();  // the transparent global locale
    static Locale classic();      // the "C" locale

global() and transparent() differ in that the locale returned by global() is stable against calls to ::setlocale(), whereas transparent() returns a locale that tracks changes resulting from such calls. classic() returns the locale that implements the standard traditional "C" locale.

    static Locale global(Locale const&);

global(Locale const&) sets the global locale, like ::setlocale().

    friend ostream& operator<<(ostream& s, Locale const& l)
      { l.imp_->name(s); return s; }
    friend istream& operator>>(istream&, Locale&);

These functions insert and extract an ASCII string that uniquely identifies a locale. The extractor recreates the locale, if it was "native" (not user-defined). These functions may safely be used regardless of any locale currently imbued in the stream.

All other functions have semantics identical to their corresponding virtual implementation, described below.

Members of class Locale::Virtuals
---------------------------------

    virtual void name(ostream&) const = 0;

name() generates an ASCII string uniquely identifying the locale. This string may be passed as an argument to the locale constructor to create a copy of the locale.

    virtual int equal(Virtuals const*) const = 0;

Returns 1 iff the two locales are identical. Equivalent to comparing the locale names. The expression (Locale("C") == Locale::classic()) is guaranteed to be true.
    virtual void insert(ostream& s, long v) const = 0;
    virtual void insert(ostream& s, unsigned long v) const = 0;
    virtual void insert(ostream& s, double v) const = 0;

These functions are used by ostream to insert numbers for output. Smaller numbers (short, float) may be promoted as appropriate. ios::flags are set accordingly.

    virtual void extract(istream& s, long& v) const = 0;
    virtual void extract(istream& s, unsigned long& v) const = 0;
    virtual void extract(istream& s, double& v) const = 0;

These functions are used by istream to parse numbers off the input stream. Values out of range for smaller types must be identified by the caller. ios::flags are set accordingly.

    virtual int narrow(wchar_t, char&) const = 0;
    virtual int widen(char, wchar_t&) const = 0;

These functions are used by istream and ostream when converting between "skinny" and wide characters, such as when extracting into a char* from a wide stream, or vice versa.

    virtual int overflow (mbstreambuf* from) const = 0;
    virtual int underflow(mbstreambuf* to) const = 0;

These functions are used by the mbstreambuf to request conversion of a buffer full of wide characters to or from multibyte characters, according to the locale's mapping.

    // ctype functions
    const Locale::ctype* ctypetable;  // used by is(ctype, char)
    virtual int is(Locale::ctype, wchar_t) const = 0;

The ctypetable is used by the Locale::is() functions for efficient inline char classification. The virtual function is used for the corresponding wide character operation, where overhead is less important and flexibility is essential.

    virtual char toupper(char) const = 0;
    virtual wchar_t toupper(wchar_t) const = 0;
    virtual char tolower(char) const = 0;
    virtual wchar_t tolower(wchar_t) const = 0;

toupper() and tolower() functions work just like the corresponding functions.
    // stdlib functions:
    virtual int collate(const char*, size_t len1,
                        const char*, size_t len2) const = 0;
    virtual int collate(const wchar_t*, size_t len1,
                        const wchar_t*, size_t len2) const = 0;

The collate() members work the same as the global function of the same name, except they treat nulls like other control characters.

    // time functions
    virtual void inserttime(ostream& s, struct tm const* tmb,
                            char format) const = 0;

inserttime() generates a time string according to the *format* argument, with semantics identical to the argument to strftime(). ios::flags are set accordingly.

    virtual void extracttime(istream& s, struct tm* t) const = 0;

extracttime() sets the tm_hour, tm_min, and tm_sec fields of the *t* argument. ios::flags are set accordingly.

    virtual void extractdate(istream& s, struct tm* t) const = 0;

extractdate() sets the tm_year, tm_mon, tm_mday, tm_yday, and tm_wday fields of the *t* argument. Dates may be entirely numeric, or may contain the month name. If the latter, case is ignored and "standard" abbreviations are permitted; otherwise, the order of components expected is that returned by dateorder(). Certain locales may parse dates assuming a different era and calendar; e.g. Chinese, Muslim, Jewish. ios::flags are set accordingly.

    virtual void extractweekday(  istream& s, struct tm* t) const = 0;
    virtual void extractmonthname(istream& s, struct tm* t) const = 0;

extractweekday() sets the tm_wday field of the *t* argument. extractmonthname() sets the tm_mon field of the *t* argument. Case is ignored, and "standard" abbreviations are permitted. ios::flags are set accordingly.

    enum Locale::dateorder_t { DMY, MDY, YMD, YDM, DYM, MYD };
    virtual Locale::dateorder_t dateorder() const = 0;

dateorder() returns an enumeration indicating the conventional order of components in a numeric date; in the U.S., it would return MDY, in Europe, DMY, and in Japan, YMD.
MYD and DYM are unlikely, but are included for completeness.

    // money functions
    enum Locale::moneysymbol_t { NONE, LOCAL, INTL };
    virtual void insertmoney(ostream& s, double units,
                             Locale::moneysymbol_t sym) const = 0;
    virtual void extractmoney(istream& s, double& units,
                              Locale::moneysymbol_t sym) const = 0;

These functions extract and insert monetary values. The *units* argument is interpreted as an integer multiple of the smallest unit of currency. For example, in the U.S. a value of 1000.0 would indicate $10.00. The *sym* argument indicates whether to use the local (e.g. "$"), international (e.g. "USD "), or no symbol. extractmoney() takes a *sym* argument because the units sometimes differ; e.g. LOCAL currency might come in units of cents, while INTL is dollars. ios::flags are set accordingly.

    virtual Locale::Virtuals* copybut(char const*,
                                      Locale::category_t) const = 0;

copybut() is used by the Locale(Locale const&, char const*, category_t) locale constructor.

    Locale::Virtuals(size_t refs) : refcount_(size_t(refs-1)) {}

The regular constructor. Normally the *refs* argument is 1, for the locale object being constructed.

    size_t refcount_;
    void add_reference() { if (this) ++refcount_; }
    void remove_reference() { if (this && refcount_-- == 0) delete this; }

These are the ordinary reference-counting functions.

7. Conclusion

I believe that this proposal, in concert with Jerry Schwarz's iostream proposal, resolves all existing problems in the domain covered by the Standard C Library locale mechanism. An appropriate interface for a message facility is still needed, and would be very welcome. (Something similar to Sun's gettext() but with arbitrary-order parameter substitution seems ideal, but the details are not obvious to me.) Also not defined here is explicit support for multiple wide character mappings -- there is no category specifically for it, although CTYPE might be stretched to cover it.
Note that this proposal covers little new ground -- mostly it just encapsulates the familiar Standard C Library features behind a safe, re-entrant, and extensible interface, and integrates them with iostreams. Only the date/time/money parsing is really new.

Of locale-related problems not addressed here, the hardest I know is collation of strings controlled by different locales. Any suggestions for this, or other unaddressed areas, would also be most welcome.

A C binding to the facilities described above, provided by system vendors, would allow different C++ library vendors to co-operate without exposing internal configuration file formats, and without requiring a standard for such file formats. The old Standard C Library locale facilities could also be built on top of such a binding. The X/Open industry group is working on such an interface; the details will be of interest primarily to compiler vendors, and need not concern us.

Not to standardize an encapsulated, re-entrant interface to locale facilities would be to condemn to non-portability the entire class of applications that require such support. The importance of this class of applications is growing rapidly and will continue to accelerate as worldwide network connectivity improves.