Doc No: WG21 N2842 = 09-0032
Date: 2009-04-01
Reply to:  Bill Seymour <stdbill.h@pobox.com>

Another numeric facet

Bill Seymour

the first of April, two thousand nine


Abstract

Wouldn’t it be nice if there were a standard way to write natural numbers as locale-specific text? This paper proposes a facet to do that.


Yes, it has been implemented.

So far, numtext<char> has been implemented in the “C,” Danish, German and French locales; and numtext<wchar_t> has been implemented in the Hindi and Russian locales. There’s a demo at http://www.stdbill.com/cgi-bin/try_numtext. (The Hindi ordinals don’t work as of this writing; but it’s hoped that they will by the time this paper is published.)

Because many current C++ implementations don’t have <cstdint> yet, instead of uintmax_t, the demo uses unsigned long long on implementations that have that type, or just unsigned long on ones that don’t.

To show that generating text for very large numbers is also possible, the demo has a page that will generate “C”-locale group names for values up to 1000^UINT_MAX. The author uses “How high can you count?” by Landon Curt Noll as his lexical authority for the “C” locale.

The author would like to thank (in lexicographical_compare<> order) Soumitra Chatterjee, Ilya Kofman, Jens Maurer, Dhaivat Parikh, Bjarne Stroustrup, and Willem Wakker for their help with the current implementation.


The basic design

    class numtext_base {
    public:
        enum inflection {
            none     = 0x0000,
            // ...
            cardinal = 0x0000,
            ordinal  = 0x8000,
        };
    };
We’ll have a non-template base class with bitmasks for the various inflections that we might need. At a minimum, even in the “C” locale, we’d want to produce both cardinal and ordinal numbers; and other locales will have additional requirements. In Danish, for example, both the cardinal 1 and the ordinal 2 have gender (en-et, anden-andet). In German, the ordinals are considered adjectives with number, gender, and case; there are three complete sets of adjective declensions; and although there is a word for the cardinal 1 (eins), it is often replaced by the indefinite article (ein-eines-einem-…).
    template<class charT>
    class numtext : public locale::facet, public numtext_base {
    public:
        // ...
        void convert(basic_string<charT>&, uintmax_t, inflection = none) const;
    };
Like codecvt<> and similar facets, numtext<> doesn’t actually do I/O. Instead, it just returns the correct text in a basic_string<> passed by non-const reference. Alternatively, we could return a basic_string<> by value; but the strings can be fairly long. For example, in the “C” locale, a 64-bit ULLONG_MAX would be “eighteen quintillion four hundred forty-six quadrillion seven hundred forty-four trillion seventy-three billion seven hundred nine million five hundred fifty-one thousand six hundred fifteen”; and we’re only up to about 10^19.
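
As a rough illustration of how such a facet might be derived from and used, here is a minimal sketch. It is not the demo’s source code: the namespace sketch, the leaf class english_cardinals, and the helper spell() are hypothetical names invented for this example, and the leaf class handles only uninflected English cardinals (the inflection argument is ignored). It also assumes that uintmax_t is no wider than 64 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <locale>
#include <string>

namespace sketch {  // hypothetical namespace; the paper puts these names in std

class numtext_base {
public:
    enum inflection { none = 0x0000, cardinal = 0x0000, ordinal = 0x8000 };
};

template<class charT>
class numtext : public std::locale::facet, public numtext_base {
public:
    typedef std::basic_string<charT> string_type;
    explicit numtext(std::size_t refs = 0) : std::locale::facet(refs) { }
    void convert(string_type& dest, std::uintmax_t val, inflection i = none) const
    {
        do_convert(dest, val, i);
    }
    static std::locale::id id;
protected:
    ~numtext() { }
    virtual void do_convert(string_type&, std::uintmax_t, inflection) const = 0;
};

template<class charT> std::locale::id numtext<charT>::id;

// Hypothetical leaf class: English cardinals only; the inflection is ignored.
class english_cardinals : public numtext<char> {
protected:
    // Appends the words for a value in the range 1..999 to dest.
    static void append_below_1000(std::string& dest, unsigned n)
    {
        static const char* const ones[] = {
            "", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen",
            "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
            "nineteen" };
        static const char* const tens[] = {
            "", "", "twenty", "thirty", "forty", "fifty", "sixty",
            "seventy", "eighty", "ninety" };
        assert(n >= 1 && n < 1000);
        if (n >= 100) {
            dest += ones[n / 100];
            dest += " hundred";
            n %= 100;
            if (n != 0) dest += ' ';
        }
        if (n >= 20) {
            dest += tens[n / 10];
            if (n % 10 != 0) { dest += '-'; dest += ones[n % 10]; }
        } else {
            dest += ones[n];  // ones[0] is empty, so "n hundred" stays as is
        }
    }

    void do_convert(std::string& dest, std::uintmax_t val, inflection) const override
    {
        static_assert(std::numeric_limits<std::uintmax_t>::digits <= 64,
                      "the group-name table below stops at quintillion");
        static const char* const groups[] = {
            "", " thousand", " million", " billion", " trillion",
            " quadrillion", " quintillion" };
        dest.clear();
        if (val == 0) { dest = "zero"; return; }
        unsigned parts[7] = { };  // three-digit groups, least significant first
        int count = 0;
        while (val != 0) {
            parts[count++] = static_cast<unsigned>(val % 1000);
            val /= 1000;
        }
        for (int g = count - 1; g >= 0; --g) {
            if (parts[g] == 0) continue;  // e.g. no "zero thousand"
            if (!dest.empty()) dest += ' ';
            append_below_1000(dest, parts[g]);
            dest += groups[g];
        }
    }
};

// Installs the facet in a copy of the classic locale, then looks it up
// through the base facet type, as a user of the facet would.
inline std::string spell(std::uintmax_t val)
{
    const std::locale loc(std::locale::classic(), new english_cardinals);
    std::string text;
    std::use_facet<numtext<char> >(loc).convert(text, val);
    return text;
}

} // namespace sketch
```

Because english_cardinals inherits the static id member from numtext<char>, the locale constructor files the facet under that id, which is why use_facet<numtext<char> >() finds it.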

The second argument, the number to convert, can be any unsigned integer that the C++ implementation can handle. A single uintmax_t parameter suffices: overloading on every integer type wouldn’t noticeably improve efficiency, since the time it takes to generate the text would certainly swamp whatever it takes to promote the value to uintmax_t.

The optional third argument specifies the inflection. It defaults to producing uninflected cardinal numbers.

One or both of two additional overloads on the second argument would serve to extend the range: one taking some user-defined type with integer semantics (presumably a bignum of some sort), and one taking a basic_string<> of digits. These overloads are not proposed by this paper, but could be considered for a TR.
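
To give a feel for what a digit-string overload would involve, the sketch below shows one plausible first step: cutting a string of decimal digits into three-digit groups, most significant first, so that each group can be named separately. The function split_groups() and the namespace sketch are hypothetical; nothing like this is proposed by this paper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {  // hypothetical; not part of the proposal

// Splits a string of decimal digits into three-digit groups, most
// significant first.  The first group takes the one to three digits
// left over when the length is not a multiple of three.
inline std::vector<unsigned> split_groups(const std::string& digits)
{
    std::vector<unsigned> groups;
    std::size_t pos = 0;
    std::size_t len = digits.size() % 3;
    if (len == 0) len = 3;
    while (pos < digits.size()) {
        unsigned value = 0;
        for (std::size_t k = 0; k < len; ++k) {
            const char c = digits[pos + k];
            assert(c >= '0' && c <= '9');  // precondition: digits only
            value = value * 10 + static_cast<unsigned>(c - '0');
        }
        groups.push_back(value);
        pos += len;
        len = 3;  // every later group is a full three digits
    }
    return groups;
}

} // namespace sketch
```

The number of groups, rather than the width of any integer type, then bounds the largest value that can be named, which is how such an overload would extend the range.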


More detail on the current design

namespace std {

class numtext_base {
public:
    enum inflection {
        none          = 0x0000,

        number        = 0x0003,
        singular      = 0x0000,
        dual          = 0x0001,
        plural        = 0x0002,
     //               = 0x0003,

        gender        = 0x000C,
        common        = 0x0000,
        masculine     = 0x0004,
        feminine      = 0x0008,
        neuter        = 0x000C,

        lexcase       = 0x00F0,
        nominative    = 0x0000,
        genitive      = 0x0010,
        dative        = 0x0020,
        accusative    = 0x0030,
        oblative      = 0x0040,
        vocative      = 0x0050,
        locative      = 0x0060,
        ergative      = 0x0070,
        absolutive    = 0x0080,
        direct        = 0x0090,
        instrumental  = 0x00A0,
        prepositional = 0x00B0,
     //               = 0x00C0,
     //               = 0x00D0,
     //               = 0x00E0,
     //               = 0x00F0,

        strength      = 0x0300,  // Sometimes, adjectives can be declined
        strong        = 0x0000,  // more weakly if there’s another word in
        mixed         = 0x0100,  // the phrase, like a definite article,
        weak          = 0x0200,  // that provides the information.
     //               = 0x0300,

        scale         = 0x0C00,  // In English, is 10**9
        amer          = 0x0000,  // a billion,
        euro          = 0x0400,  // a milliard,
        olduk         = 0x0800,  // or a thousand million?
     //               = 0x0C00,

     // bits 0x7000 unused so far

        cardinal      = 0x0000,
        ordinal       = 0x8000,
    };
};

//
// The facet itself is unsurprising:
//
template<class charT>
class numtext : public locale::facet, public numtext_base {
public:
    typedef charT char_type;
    typedef basic_string<charT> string_type;

    explicit numtext(size_t refs = 0) : locale::facet(refs) { }

    void convert(string_type& dest, uintmax_t val, inflection i = none) const
    {
        do_convert(dest, val, i);
    }

    static locale::id id;

protected:
    ~numtext() { }
    virtual void do_convert(string_type&, uintmax_t, inflection) const = 0;
};

} // namespace std
(You can download the current implementation’s source code as .tar.gz or .zip. For the purposes of the demo, the leaf classes have public, static do_do_convert() functions that the CGI can call directly instead of having to first set a locale and then use the facet.)
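
The inflection values are laid out as bit fields, each with its own mask (number, gender, lexcase, and so on). The sketch below shows one way a caller might combine values from different fields and a facet implementation might pull them apart again. It uses an abridged copy of the enumeration in a hypothetical namespace sketch; the operator| overload and the helpers field() and is_ordinal() are assumptions made for this example and are not part of the proposed interface.

```cpp
#include <cassert>

namespace sketch {  // hypothetical; abridged copy of the paper's enumeration

struct numtext_base {
    enum inflection {
        none       = 0x0000,

        gender     = 0x000C,  // mask for the gender field
        masculine  = 0x0004,
        feminine   = 0x0008,
        neuter     = 0x000C,

        lexcase    = 0x00F0,  // mask for the case field
        nominative = 0x0000,
        genitive   = 0x0010,
        dative     = 0x0020,

        cardinal   = 0x0000,
        ordinal    = 0x8000
    };
};

typedef numtext_base::inflection inflection;

// The built-in | on an unscoped enum yields an int, so this overload
// casts the result back to the enumeration type.
inline inflection operator|(inflection a, inflection b)
{
    return static_cast<inflection>(static_cast<unsigned>(a) |
                                   static_cast<unsigned>(b));
}

// Extracts one field, for example field(i, numtext_base::gender).
inline inflection field(inflection value, inflection mask)
{
    return static_cast<inflection>(static_cast<unsigned>(value) &
                                   static_cast<unsigned>(mask));
}

inline bool is_ordinal(inflection value)
{
    return (static_cast<unsigned>(value) & numtext_base::ordinal) != 0;
}

// Round trip: compose "ordinal, feminine, dative" and read the fields back.
inline void self_check()
{
    const inflection i = numtext_base::ordinal
                       | numtext_base::feminine
                       | numtext_base::dative;
    assert(field(i, numtext_base::gender) == numtext_base::feminine);
    assert(field(i, numtext_base::lexcase) == numtext_base::dative);
    assert(is_ordinal(i));
}

} // namespace sketch
```

Since every field’s default value is zero, none compares equal to the uninflected nominative cardinal, which is consistent with the default described earlier.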


All suggestions and corrections will be welcome; all flames will be amusing.
Mail to stdbill.h@pobox.com