Proposal for C2y
WG14 3193
Title: Obsolete implicitly octal literals
Author, affiliation: Alex Celeste, Perforce
Date: 2023-11-30
Proposal category: Clarification/enhancement
Target audience: Compiler implementers, users
Abstract
The use of base-8 instead of base-10 by integer literals that begin with a zero digit is the source of frequent confusion. We propose marking the use of such literals as obsolete in order to encourage a warning that will prompt rewrites, and the introduction of a new prefix to explicitly mark literals that are genuinely intended to be in base-8.
Obsolete implicitly octal literals
Reply-to: Alex Celeste (aceleste@perforce.com)
Document No: N3193
Revises: (n/a)
Summary of Changes
N3193
- original proposal
Introduction
In C23, binary and hexadecimal integer constants are prefixed to indicate that they describe a value using a base other than the default base-10. An unprefixed number is therefore usually assumed to be a decimal number, and is naturally read by most human readers in that base.
However, a leading zero digit implicitly serves as a prefix that tells the lexer to interpret the literal as a value described by base-8, rather than by base-10. This is "obvious" enough to a human reader who is completely familiar with the rules, but in practice is unexpected by most users and is also easy for even an advanced user to miss.
Many users do not expect leading zeroes to be significant and like to use them as visual padding. This can lead to unexpected value results, or unclear error messages if they try to "pad" a literal that contains the digits 8 or 9 (though an error is the better result here). The error can go unnoticed for some time if by coincidence the user only had a sparse set of values, such as only values smaller than 8, or all the other "true" decimal literals in use are sufficiently large as to begin with a non-zero digit. The fact that other languages may allow these literals to be base-10 adds to the confusion for non-expert users.
MISRA C 2023 (and prior versions) prohibits the use of octal literals entirely (Rule 7.1, Required) on the grounds that this is so unclear that it is more likely to be misunderstood than not. There is an exception for a literal zero spelled with a single digit, which is technically an octal rather than decimal literal but the distinction is not meaningful in practice.
Example
This example is lifted directly from C++ document p0085r0:
// The following literals all specify the same number.
int literal_octal_prefered = 0o52;
int literal_octal_to_be_deprecated = 052;
int literal_decimal = 42;
int literal_hex = 0x2A;
int literal_binary = 0b00101010;
This is intended to highlight that the distinction between decimal and the prefixed octal literal syntax is clearer than the distinction between the decimal and traditional octal syntaxes.
Proposal
We propose that a new syntax is added for explicit octal constants, with a new prefix 0o or an alternative spelling to mark the beginning of a base-8 literal. The old syntax should be retained and marked as obsolescent to avoid breaking the meaning of existing code.
We do not propose that leading-zero ever change meaning to be accepted as a base-10 literal. This syntax should remain obsolescent or be fully deprecated and removed, but cannot be recycled safely.
We separately propose changing escape sequences within literals at this time. Escape sequences are visually prefixed with a \ and are therefore much less subject to this issue. As with the existing hexadecimal escape syntax, there is no leading zero on the prefix, as this would interfere with a string that intentionally contained the nul character.
We do separately propose allowing the strto_l function family to recognize prefixed octal digit sequences, whereas before they would have returned the value zero.
We do not propose any changes to the formatted input or output functions here. Printing a prefix is already a matter of user choice as the prefix is not part of the functionality. The formatted input functions are defined in terms of the strto_l function family and are therefore covered by the change above.
Choice of prefix
The character o is the most obvious choice for the prefix and is in common use in other languages.
However, it has a major flaw for readability: depending on the typeface, the uppercase form 0O123 may not be visually distinct enough to achieve the primary goal of the change, which is enhanced readability. This may even be a problem with the lowercase form, depending on the user's setup.
Alternative proposals might be to use c, which is not in use, and visually tends towards being read "oc-"; or t, which is also not yet in use and has a similar tendency. Overall, this proposal should be read as allowing any of these letters to be used as the new prefix, or any other letter the Committee prefers.
Status of zero
Zero remains a traditional octal constant because the rule defining decimal constants requires them to begin with a non-zero digit.
This can be changed, but as long as traditional octal constants remain in the language, the definition of decimal constants has to be complicated in one way or another. Therefore, for the time being, we leave this as-is.
However, any tool that depends on this distinction is hiding a silent logical error.
Prior Art
A very similar change was proposed for C++ as document p0085r0.
This proposal also added the prefixed form without removing the traditional literal syntax, and a new syntax for octal escape sequences in literals.
There does not appear to be a record of this proposal being discussed by WG21 and the change was not adopted.
Impact
There is no impact to existing code, other than new deprecation warnings if the user has this functionality enabled in their tool (obsolescence is not a constraint violation and these warnings are not mandatory).
Causing tools to emit these warnings if they were not already doing so (any tool aiming to check for MISRA C or similar Guidelines compliance is already warning on any use of octal) is considered a goal of the proposal and is not a compatibility failure.
The proposed spelling is not currently valid in C and therefore use of the new octal literal format would not break existing code. However, whichever letter the Committee chooses as a prefix is effectively ruled out from also serving as a suffix. Therefore, care is warranted to not rule out a different good design for an unrelated feature in the future.
Future directions
If the Standard evolves to incorporate a distinction between deprecation and obsolescence, we would prefer implicit octal syntax to be marked as fully deprecated in that version of the Standard. This would allow for its eventual removal, and presumably require a stronger class of warning message (such as a mandatory warning against uses, rather than the current opt-in for uses of obsolescent features).
Octal escape sequences within character or string literals have an outstanding issue that the end of the sequence is not clear:
"\1234" // two characters, \123 followed by 4
"\1289" // three characters, \12 followed by 8 and 9
A future change should try to normalize this. For the time being we simply allow the prefix to appear here for visual clarity, but this could be improved by forcing prefixed octal escape sequences to have a fixed width. (The C++ proposal did the opposite here.)
Apart from their variable length, octal escape sequences seem well-understood compared to integer literals and their use does not seem to be confusing in practice.
Proposed wording
The proposed changes are based on the latest public draft of C23, which is N3096. Bolded text is new text when inlined into an existing sentence. These changes are not compatible with the wording from p0085r0, which describes a different Standard (C++).
Integer constants
Within 6.4.4.1 "Integer constants", Syntax, paragraph 1 (the grammar):
Replace the existing octal-constant rule with a new rule:
octal-constant:
prefixed-octal-constant
unprefixed-octal-constant
Rename the original octal-constant rule to unprefixed-octal-constant:
unprefixed-octal-constant:
0
unprefixed-octal-constant ' opt octal-digit
Add a new rule prefixed-octal-constant immediately below unprefixed-octal-constant:
prefixed-octal-constant:
octal-prefix octal-digit
prefixed-octal-constant ' opt octal-digit
Add a new rule octal-prefix immediately below binary-constant:
octal-prefix: one of
0o
0O
Modify paragraph 4:
A decimal constant begins with a nonzero digit and consists of a sequence of decimal digits. An octal constant consists of the prefix 0o or 0O followed by a sequence of the digits 0 through 7 only. A hexadecimal constant consists of the prefix 0x or 0X followed by a sequence of the decimal digits and the letters a (or A) through f (or F) with values 10 through 15 respectively. A binary constant consists of the prefix 0b or 0B followed by a sequence of the digits 0 or 1.
Add a new paragraph immediately after paragraph 4:
An unprefixed octal constant begins with the digit 0 optionally followed by a sequence of the digits 0 through 7 only. Use of an unprefixed octal constant with more than one digit is an obsolescent feature.
Escape sequences
Within 6.4.4.4 "Character constants", Syntax, paragraph 1 (the grammar):
Modify the existing octal-escape-sequence rule:
octal-escape-sequence:
\ o opt octal-digit
\ o opt octal-digit octal-digit
\ o opt octal-digit octal-digit octal-digit
Modify the list in paragraph 3:
octal character    \ octal digits or \o octal digits
Modify the beginning of paragraph 5:
The octal digits that follow the escape in an octal escape sequence are taken to be part of ...
Optionally, modify the second sentence of example 3:
To specify an integer character constant containing the two characters whose values are '\x12' and '3', the construction '\o0223' can be used, since an octal escape sequence is terminated after three octal digits.
6.4.5 does not need to be modified because it refers back to 6.4.4.4 for the meaning of escape sequences.
Future language directions
Add a new entry between 6.11.3 "External names" and 6.11.4 "Character escape sequences":
6.11.x Octal integer constants The use of octal integer constants without the prefix 0o or 0O is an obsolescent feature, except for the constant 0.
(no entry is intended for octal escape sequences here)
The strto_l functions
Add a new sentence near the end of 7.24.1.7 paragraph 3:
If the value of base is 2, the characters 0b or 0B may optionally precede the sequence of letters and digits, following the sign if present. If the value of base is 8, the characters 0o or 0O may optionally precede the sequence of letters and digits, following the sign if present. If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.
Questions for WG14
Does WG14 want to add the new spelling for base-8 integer literals with an explicit prefix?
Does WG14 want to mark the use of unprefixed base-8 integer literals, apart from zero itself, as obsolete?
Does WG14 prefer the character o, c, or t for the prefix character?
Does WG14 want to add the new spelling for octal escape sequences in character and string literals?
Would WG14 prefer prefixed octal escape sequences to have a fixed width of three digits?
Does WG14 want to change the behaviour of the strto_l function family to allow them to interpret the new octal prefix, rather than returning zero?