Proposal for C2y
WG14 3353
Title:               Obsolete implicitly octal literals and add delimited escape sequences
Author, affiliation: Alex Celeste, Perforce
                     Corentin Jabot
                     Aaron Ballman
Date:                2024-09-18
Proposal category:   Clarification/enhancement
Target audience:     Compiler implementers, users
The use of base-8 instead of base-10 by integer literals that begin with a zero digit is the source of frequent confusion. We propose marking the use of such literals as obsolete in order to encourage a warning that will prompt rewrites, and the introduction of a new prefix to explicitly mark literals that are genuinely intended to be in base-8.
This proposal also revives N2785 to add a new escape sequence syntax for string and character literals.
Reply-to:     Alex Celeste (aceleste@perforce.com)
              Corentin Jabot (corentin.jabot@gmail.com)
              Aaron Ballman (aaron@aaronballman.com)
Document No:  N3353
Revises:      N3319
Date:         2024-09-18
- rebase against N3301 (incorporates "constant" -> "literal")
- don't touch printfat this time
- retitle to avoid confusion w.r.t inclusion of escape sequences
- fixes based on feedback
- remove original proposal for escape sequences, replace with wording from N2785
- incorporate suggested changes to printffamily
- fixed error in grammar
- rebase against N3220
- original proposal
In C23, binary and hexadecimal integer literals are prefixed to indicate that they describe a value using a base other than the default base-10. An unprefixed number is usually therefore implicitly a decimal number, and is naturally read by most human readers in that base.
However, a leading zero digit implicitly serves as a prefix that tells the lexer to interpret the literal as a value described by base-8, rather than by base-10. This is "obvious" enough to a human reader who is completely familiar with the rules, but in practice is unexpected by most users and is also easy for even an advanced user to miss.
Many users do not expect leading zeroes to be significant and like to use them as visual padding. This can lead to unexpected value results, or unclear error messages if they try to "pad" a literal that contains the digits 8 or 9 (though an error is the better result here). The error can go unnoticed for some time if by coincidence the user only had a sparse set of values, such as only values smaller than 8, or all the other "true" decimal literals in use are sufficiently large as to begin with a non-zero digit. The fact that other languages may allow these literals to be base-10 adds to the confusion for non-expert users.
MISRA C 2023 (and prior versions) prohibits the use of octal literals entirely (Rule 7.1, Required) on the grounds that this is so unclear that it is more likely to be misunderstood than not. There is an exception for a literal zero spelled with a single digit, which is technically an octal rather than decimal literal but the distinction is not meaningful in practice.
This example is lifted directly from C++ document p0085r0:
// The following literals all specify the same number.
int literal_octal_prefered          = 0o52;
int literal_octal_to_be_deprecated  = 052;
int literal_decimal                 = 42;
int literal_hex                     = 0x2A;
int literal_binary                  = 0b00101010;
This is intended to highlight that the distinction between decimal and the prefixed octal literal syntax is clearer than the distinction between the decimal and traditional octal syntaxes.
At the January 2024 meeting of WG14, this proposal was discussed. The proposal
to add the new literal syntax prefixed by 0o or 0O was well-received with
strong consensus to add to C2y (18 / 1 / 1). The second question, to immediately
make old-style octal literals (of the form 0123) obsolete without a grace
period in order to force warning messages as early as possible, also received
consensus to continue (11 / 4 / 4).
Alternative spellings for octal literals were discarded, as the o syntax was
universally preferred.
The original syntax proposal for prefixed octal escape sequences was rejected. The Committee instead preferred to see a change along the lines of N2785 for consistency with C++, which has adopted the equivalent change into C++23.
N2785 was originally discussed by WG14 during the August 2021 meeting. At this time WG21 was looking for WG14's opinion before deciding on final adoption, so the feature had not yet been added to C++. The paper received strong consensus to adopt directly into C23 (16 / 2 / 3), which did not happen because of scheduling, but the strong direction and the context of adoption by C++ suggest this change has strong support for inclusion.
It was observed that there is an existing asymmetry in the language between the
printf family of functions and the strto_l functions. No decision was reached
about modifying strto_l, but the precedent of adding 0b to the functionality
on C23 suggests that these functions should also be modified for 0o.
The Committee affirmed that there are no plans to fully remove the original
octal 0123 syntax in the foreseeable future.
We propose that a new syntax is added for explicit octal literals, with a new
prefix 0o or an alternative spelling to mark the beginning of a base-8
literal. The old syntax should be retained and marked as obsolescent to avoid
breaking the meaning of existing code.
We do not propose that leading-zero ever change meaning to be accepted as a base-10 literal. This syntax should remain obsolescent, or be fully deprecated and removed, but cannot be recycled safely.
We separately propose changing escape sequences within literals at this time.
Escape sequences are visually prefixed with a \ and are therefore much less
subject to this issue. As with the existing hexademical escape syntax, there is
no leading zero on the prefix as this would interfere with a string that
intentionally contained the nul character. Both of these existing features can
be significantly improved by adopting the "delimited escape sequences" feature
that was added to C++23.
We separately propose allowing the strto_l function family to recognize
prefixed octal digit sequences, whereas before they would have returned the
value zero. This is consistent with the change to add support for the binary
literal 0b and 0B prefixes in these functions.
We propose adding an equivalent alternative implementation to the formatted
output functions (printf family) for prefixed octal numbers, that matches the
behaviour for prefixed hexadecimal and binary numbers. The formatted input
functions are defined in terms of the strto_l function family and are
therefore covered by the previous change.
The character o is the most obvious choice for the prefix and is in common use
in other languages.
There was discussion of using c or t or another prefix instead of o at the
January 2024 meeting of WG14. There was universal agreement not to bother pursuing
these alternative spellings due to lack of precedent and unnecessary divergence
from community consensus across languages (the "surprise factor" outweighs any
potential readability improvement).
Zero "0" remains a traditional octal literal, because the rule defining decimal
literals requires them to begin with a non-zero.
This can potentially be changed, but as long as traditional octal literals remain in the language, the definition of decimal literals has to be complicated in one way or another. Therefore, for the time being, we leave this as-is to keep the grammar as simple as possible (there is no obvious gain to over-complicating the syntax just to move zero).
Any tool that depends on this distinction is hiding a silent logical error.
A very similar change was proposed for C++ as document p0085r0. This proposal also added the prefixed form without removing the traditional literal syntax, and a new syntax for octal escape sequences in literals.
There does not appear to be a record of this proposal being discussed by WG21 and the change was not adopted.
There is no impact to existing code, other than new deprecation warnings if the user has this functionality enabled in their tool (obsolescence is not a constraint violation and these warnings are not mandatory).
Causing tools to emit these warnings if they were not already doing so (any tool aiming to check for MISRA C or similar Guidelines compliance is already warning on any use of octal) is considered a goal of the proposal and is not a compatibility failure.
The proposed spelling is not currently valid in C and therefore use of the new
octal literal format would not break existing code. Adoption of the prefix does
rule out possible use of o or O as a suffix in future, but there have been
no proposals to this effect.
If the Standard evolves to incorporate a distinction between deprecation and obsolescence, we would prefer implicit octal syntax to be marked as fully deprecated in that version of the Standard. This would allow for its eventual removal, and presumably require a stronger class of warning message (such as a mandatory warning against uses, rather than the current opt-in for uses of obsolescent features).
Octal escape sequences within character or string literals have an outstanding issue that the end of the sequence is not clear:
"\1234"  // two characters, \123 followed by 4
"\1289"  // three characters, \12 followed by 8 and 9
Apart from their variable length, octal escape sequences seem well-understood compared to integer literals and their use does not seem to be confusing in practice.
A future proposal could attempt to deprecate this syntax in favour of the new delimited syntax, or to deprecate octal escapes with fewer than three digits. We do not attempt to do so here.
The proposed changes are based on the latest public draft of C2y, which is N3301. Bolded text is new text when inlined into an existing sentence. These changes are not compatible with the words from p0085r0, which describe a different Standard (C++).
Within 6.4.5.2 "Integer literals", Syntax, paragraph 1 (the grammar):
Replace the existing octal-literal rule with a new rule:
octal-literal:
prefixed-octal-literal
unprefixed-octal-literal
Rename the original octal-literal rule to unprefixed-octal-literal and modify it:
unprefixed-octal-literal:
0
0'opt octal-digit-sequence
Add a new rule prefixed-octal-literal immediately below unprefixed-octal-literal:
prefixed-octal-literal:
octal-prefix octal-digit-sequence
Add a new rule octal-prefix immediately below binary-literal:
octal-prefix: one of
0o0O
Add a new rule octal-digit-sequence immediately below nonzero-digit:
octal-digit-sequence:
octal-digit
octal-digit-sequence'opt octal-digit
Modify paragraph 4:
A decimal literal begins with a nonzero digit and consists of a sequence of decimal digits. An octal literal consists of the prefix
0oor0Ofollowed by a sequence of the digits0through7only. A hexadecimal literal consists of the prefix0xor0Xfollowed by a sequence of the decimal digits and the lettersa(orA) throughf(orF) with values 10 through 15 respectively. A binary literal consists of the prefix0bor0Bfollowed by a sequence of the digits0or1.
Add a new paragraph immediately after paragraph 4:
An unprefixed octal literal begins with the digit
0optionally followed by a sequence of the digits0through7only. Use of an unprefixed octal literal with digits other than0is an obsolescent feature.
Add a new entry between 6.11.3 "External names" and 6.11.4 "Character escape sequences":
6.11.x Octal integer literals The use of non-zero octal integer literals without the prefix
0oor0Ois an obsolescent feature.
Changes to the printf family of functions are removed from this proposal. All existing features
will still work with the current o specifier, since the 0123 octal syntax is not being removed.
A change to printf needs detailed and separate consideration of the compatibility impact.
Add a new sentence near the end of 7.24.2.8 paragraph 3:
If the value of
baseis 2, the characters0bor0Bmay optionally precede the sequence of letters and digits, following the sign if present. If the value ofbaseis 8, the characters0oor0Omay optionally precede the sequence of letters and digits, following the sign if present. If the value ofbaseis 16, the characters0xor0Xmay optionally precede the sequence of letters and digits, following the sign if present.
Add a new sentence near the end of 7.31.4.2.4 paragraph 3:
If the value of
baseis 2, the characters0bor0Bmay optionally precede the sequence of letters and digits, following the sign if present. If the value ofbaseis 8, the [wide] characters0oor0Omay optionally precede the sequence of letters and digits, following the sign if present. If the value ofbaseis 16, the wide characters0xor0Xmay optionally precede the sequence of letters and digits, following the sign if present.
Note: the inconsistency between "the characters" (for base 2) and "the wide characters" (for base 16) is present in the existing text. The inserted sentence should use whichever form is correct, and the other existing sentence should probably change to match.
These changes are taken directly from N2785, rebased against N3301:
Within 6.4.4 "Universal character names", Syntax, paragraph 1 (the grammar):
Modify the universal-character-name rule:
universal-character-name:
\uhex-quad
\Uhex-quad hex-quad
\u{simple-hexadecimal-digit-sequence}
Add a new rule immediately below hex-quad:
simple-hexadecimal-digit-sequence:
hexadecimal-digit
simple-hexadecimal-digit-sequence hexadecimal-digit
(NOTE the intent of the changes to the Constraints and Semantics specified by N2785 appears to have already been incorporated into C23 by N3124.)
Within 6.4.5.5 "Character literals", Syntax, paragraph 1 (the grammar):
Add a new rule simple-octal-digit-sequence immediately below simple-escape_sequence:
simple-octal-digit-sequence:
octal-digit
simple-octal-digit-sequence octal-digit
Modify the octal-escape-sequence rule:
octal-escape-sequence:
\octal-digit
\octal-digit octal-digit
\octal-digit octal-digit octal-digit
\o{simple-octal-digit-sequence}
Modify the hexadecimal-escape-sequence rule:
hexadecimal-escape-sequence:
\xsimple-hexadecimal-digit-sequence
\x{simple-hexadecimal-digit-sequence}
Note that this change also adds a new syntax for hexadecimal escapes, with a clear ending delimiter.
Does WG14 want to add the new spelling for base-8 integer literals with an
explicit prefix?
(WG14 previously voted yes to this question)
Does WG14 want to mark the use of unprefixed base-8 integer literals, apart
from zero itself, as obsolete, without a grace period?
(WG14 previously voted yes to this question)
Would WG14 like to see a future paper proposing changes to support the 0o
and 0O prefixes in the printf family of functions?
Does WG14 want to change the behaviour of the strto_l function family to
allow them to interpret the new octal prefix, consistently with how they
interpret the 0x and 0b prefixes?
Does WG14 want to adopt the "delimited escape sequences" change previously
proposed as N2785 and subsequently adopted into C++23?
(WG14 previously voted yes to a variant of this question)
(WG14 previously encouraged this feature to be added alongside the octal change)
Huge thanks to Joseph Myers for detailed and thorough review of previous versions.
C2y latest public draft
Wikipedia on use of octal
C++ proposal P0085R0
MISRA C 2023
N2785 Delimited escape sequences
C++ proposal P2290R3
N3124 Aligning Universal Character Names Constraints with C++