N3342=12-0032
Jens Maurer
2012-01-09

Digit Separators coming back

Introduction

This paper proposes syntax extensions to C++ that allow large numeric literals to be written with separators between the digits, making them more readable.

This paper is largely based on N2281 = 07-0141 "Digit Separators" by Lawrence Crowl. The proposed wording changes have been updated for C++11 (more specifically, the latest working draft N3290).

This paper does not propose to add binary literals or hexadecimal floating-point literals; those are considered largely independent of this paper and thus can be addressed separately.

Motivation

For most people, reading large numbers without additional (redundant) visual cues is hard. Examples:

pronounce 7237498123
compare 237498123 with 237499123 for equality
decide whether 237499123 or 20249472 is larger

Adding visual cues helps, for example spaces:

pronounce 7 237 498 123
compare 237 498 123 with 237 499 123 for equality
decide whether 237 499 123 or 20 249 472 is larger

An alternative visual cue might be to use underscores, elsewhere often employed to form identifiers with a space-lookalike character (but without violating identifier syntax):

pronounce 7_237_498_123
compare 237_498_123 with 237_499_123 for equality
decide whether 237_499_123 or 20_249_472 is larger

Discussion

Using a space character could split a literal into two or more preprocessing-tokens, with substantial impact not only on the lexing phase, but also on the parsing phase of C++. Therefore, this paper proposes the underscore variant.

Using underscores conflicts with user-defined literals. Appropriate disambiguation is already provided for in the current wording (see 2.14.8 lex.ext paragraph 1), but the example can be improved for the new situation. In effect, the suffix of a user-defined literal may not start with an underscore followed by a digit. Given that user-defined literals are already severely constrained (see 2.14.8 lex.ext and 17.6.4.3.5 userlit.suffix), this seems to be a mild inconvenience for the next revision of the standard.
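The existing disambiguation rule can be illustrated with C++11 as it stands today. The following is a minimal sketch; the type Kilometers and the suffix _km are hypothetical names chosen for illustration and do not appear in the standard.

```cpp
#include <cassert>

// Hypothetical distance type, used only for illustration.
struct Kilometers {
    unsigned long long value;
};

// Literal operator for the hypothetical suffix _km (cooked integer form).
constexpr Kilometers operator""_km(unsigned long long v) {
    return Kilometers{v};
}

// 123_km matches only user-defined-literal, so the literal operator is called.
constexpr Kilometers distance = 123_km;

// 12LL matches both user-defined-literal and integer-literal; it is treated
// as an integer-literal with the built-in suffix LL.
constexpr long long plain = 12LL;
```

Under the proposed wording, a token such as 123_456 would fall into the same category as 12LL: it would be an integer-literal, and a literal operator for a suffix beginning with underscore-digit could never be selected by it.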

Wording Changes

The grammar production pp-number in 2.10 lex.ppnumber already permits underscores inside (via identifier-nondigit and nondigit). No changes are necessary.
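To make this concrete, the pp-number production can be sketched as a small recognizer. This is an illustrative approximation, not normative text: the function name is_pp_number is hypothetical, and identifier-nondigit is limited to ASCII letters and the underscore (universal-character-names are not handled).

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Returns true if the whole string matches the pp-number grammar.
// Simplification: identifier-nondigit covers only ASCII letters and '_'.
bool is_pp_number(const std::string& s) {
    std::size_t i = 0;
    // A pp-number starts with a digit, or with '.' followed by a digit.
    if (i < s.size() && s[i] == '.') ++i;
    if (i >= s.size() || !std::isdigit(static_cast<unsigned char>(s[i]))) {
        return false;
    }
    ++i;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c == '+' || c == '-') {
            // A sign may only appear directly after 'e' or 'E'.
            const char prev = s[i - 1];
            if (prev != 'e' && prev != 'E') return false;
        } else if (!(std::isalnum(c) || c == '_' || c == '.')) {
            return false;
        }
        ++i;
    }
    return true;
}
```

With this sketch, "1_000_000" and ".5_0" are accepted as single preprocessing numbers, whereas "1 000" is rejected because a space cannot be part of a pp-number.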

Change in 2.14.2 lex.icon:

decimal-literal:
       nonzero-digit
       decimal-literal underscore_opt digit

octal-literal:
       0
       octal-literal underscore_opt octal-digit

hexadecimal-literal:
       0x hexadecimal-digit
       0X hexadecimal-digit
       hexadecimal-literal underscore_opt hexadecimal-digit

underscore: _

Change in 2.14.2 lex.icon paragraph 1:

An integer literal is a sequence of digits that has no period or exponent part, with optional separating underscores that are ignored when determining its value. ... [ Example: the number twelve can be written 12, 1_2, 014, 01_4, or 0XC. -- end example ]
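The proposed productions for decimal-literal, octal-literal and hexadecimal-literal, together with the rule that underscores are ignored when determining the value, can be sketched as follows. The names ParsedInteger, digit_value and parse_integer_literal are hypothetical; integer suffixes and overflow are deliberately not handled in this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Result of parsing; ok is false if the text does not match the grammar.
struct ParsedInteger {
    bool ok;
    std::uint64_t value;
};

// Digit value of c in the given base, or -1 if c is not a digit in that base.
int digit_value(char c, int base) {
    int v = -1;
    if (c >= '0' && c <= '9') v = c - '0';
    else if (c >= 'a' && c <= 'f') v = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
    return (v >= 0 && v < base) ? v : -1;
}

ParsedInteger parse_integer_literal(const std::string& s) {
    const ParsedInteger fail = {false, 0};
    int base = 10;
    std::size_t i = 0;
    if (s.size() >= 2 && s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        base = 16;
        i = 2;  // a hexadecimal digit must follow the prefix directly
    } else if (!s.empty() && s[0] == '0') {
        base = 8;  // the leading 0 is itself the first octal digit
    }
    if (i >= s.size() || digit_value(s[i], base) < 0) return fail;
    std::uint64_t value = static_cast<std::uint64_t>(digit_value(s[i], base));
    ++i;
    while (i < s.size()) {
        if (s[i] == '_') ++i;  // at most one underscore before each digit
        if (i >= s.size()) return fail;  // a trailing underscore is invalid
        const int d = digit_value(s[i], base);
        if (d < 0) return fail;
        value = value * static_cast<std::uint64_t>(base)
                + static_cast<std::uint64_t>(d);
        ++i;
    }
    return {true, value};
}
```

In this sketch, each spelling from the example (12, 1_2, 014, 01_4 and 0XC) yields the value twelve, while spellings such as 1__2, 12_ and 0x_C are rejected because an underscore may only appear singly between two digits.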

Change in 2.14.4 lex.fcon:

digit-sequence:
       digit
       digit-sequence underscore_opt digit

Change in 2.14.4 lex.fcon paragraph 1:

... The integer and fraction parts both consist of a sequence of decimal (base ten) digits, with optional separating underscores that are ignored when determining the value. ...
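The same idea applies to digit-sequence. The following sketch validates a digit-sequence under the proposed production and removes the separators before the value is computed. The names strip_digit_sequence and decimal_value are hypothetical, and only the "integer part . fraction part" form is covered; exponents and suffixes are omitted.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Validates s against the proposed digit-sequence production and returns its
// digits with the separators removed. ok is set to false (and an empty string
// is returned) if s does not match.
std::string strip_digit_sequence(const std::string& s, bool& ok) {
    ok = false;
    std::string digits;
    std::size_t i = 0;
    // The sequence must begin with a digit.
    if (s.empty() || !std::isdigit(static_cast<unsigned char>(s[0]))) {
        return std::string();
    }
    digits += s[i++];
    while (i < s.size()) {
        if (s[i] == '_') ++i;  // at most one underscore before each digit
        if (i >= s.size() || !std::isdigit(static_cast<unsigned char>(s[i]))) {
            return std::string();
        }
        digits += s[i++];
    }
    ok = true;
    return digits;
}

// Value of "<integer part>.<fraction part>" with separators ignored.
// Returns 0.0 and sets ok to false if either part is malformed.
double decimal_value(const std::string& integer_part,
                     const std::string& fraction_part, bool& ok) {
    bool ok_int = false;
    bool ok_frac = false;
    const std::string i = strip_digit_sequence(integer_part, ok_int);
    const std::string f = strip_digit_sequence(fraction_part, ok_frac);
    ok = ok_int && ok_frac;
    return ok ? std::stod(i + "." + f) : 0.0;
}
```

For instance, the integer part 1_000 and the fraction part 2_5 would denote the same value as 1000.25 in this sketch.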

Change in 2.14.8 lex.ext paragraph 1:

If a token matches both user-defined-literal and another literal kind, it is treated as the latter. [ Example: 123_km is a user-defined-literal, but 123_456 and 12LL are integer-literals. -- end example ] ...