Improve readability of the C++ grammar by adding a syntax for groups and repetitions
- Document number:
- P3891R0
- Date:
- 2025-11-22
- Audience:
- CWG, LWG
- Project:
- ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Reply-to:
- Jan Schultke <janschultke@gmail.com>
- GitHub Issue:
- wg21.link/P3891/github
- Source:
- github.com/eisenwave/cpp-proposals/blob/master/src/concise-grammar.cow
Contents
Introduction
Motivating example
Design
Design constraints
New syntax
Alternatives considered
Alternative to seq and seq opt
seq opt vs. opt seq
Alternative brackets for groups
Core wording
[syntax]
Bulk operations
[lex.name]
[lex.icon]
[lex.fcon]
[lex.string]
[dcl.spec.general]
[dcl.type.general]
[dcl.decl.general]
[dcl.fct]
[dcl.init.general]
[dcl.enum]
[namespace.udecl]
[dcl.attr.grammar]
[stmt.select.general]
[stmt.if]
[dcl.pre]
Library wording
[locale.numpunct.general]
[locale.moneypunct.general]
[format.string.general]
[time.format]
[fs.path.generic]
References
1. Introduction
The current C++ syntax notation as specified in [syntax] and summarized in [gram] has only a handful of features:
- concatenation, such as pp-number identifier-continue,
- alternatives or unions, sometimes specified using "one of", and
- optional expansions, such as long-suffix opt.
Two notably absent features are grouping and repetition.
This leads to many cases of low expressiveness and grammatical bloat,
like our many
declaration-seq:
    declaration declaration-seq opt
template-parameter-list:
    template-parameter
    template-parameter-list , template-parameter
In plain English, we say:
A declaration-seq is a declaration followed by an optional declaration-seq.
A template-parameter-list is either a single template-parameter or a template-parameter-list, followed by a comma token, followed by a template-parameter.
No reasonable person should teach the language syntax in those words, but that is what the grammar says. This means that any reader (or author of grammar changes) has to mentally deobfuscate the grammar into something intuitive, like:
A declaration-seq is one or more declarations.
A template-parameter-list is one or more template-parameters, separated by a comma token.
This proposal adds grouping and repetition,
which obsoletes all
2. Motivating example
With these new features, more concise grammar is possible:
| Before | After |
|---|---|
|
|
|
|
|
|
Notably, the boilerplate
3. Design
3.1. Design constraints
- The new grammar syntax needs to be easily writable both in proposals that contain grammatical changes and in the standard itself. Note that authors frequently use strikethrough or underlined text to indicate insertions and deletions, in addition to color. The new notation should therefore avoid relying on those styles.
- The C++ grammar already contains all sorts of tokens without delimiters such as quotes, and we probably want to keep it that way for the sake of familiarity. This means that any brackets and punctuation may visually conflict with C++ tokens.
- Some users may rely on assistive technology such as screen readers. This means that if font choice is the only distinction between e.g. C++ tokens and the new grammar features, the standard would be inaccessible to those users.
3.2. New syntax
- X seq
  One or more repetitions of X. This replaces X-seq. The new syntax is similar to opt, so it is obviously feasible in the standard draft. Proposal authors can use subscript text. Screen readers would pronounce "seq" uninterrupted (ignoring subscript), which is fine as long as "seq" is only used in this operator form.
- X seq opt
  Zero or more repetitions of X. This replaces X-seq opt.
- ⟪ X yyy ⟫
  Groups X and yyy, which allows applying opt, seq, and seq opt to multiple elements. The characters used here are U+27EA MATHEMATICAL LEFT DOUBLE ANGLE BRACKET and U+27EB MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET. These are pronounced distinctly from other brackets by screen readers, are handled well by many fonts, and are sufficiently visually distinct from (, [, {, and <. LaTeX packages like MnSymbol provide these characters. Paper authors can copy and paste these Unicode characters, use HTML character references, or use text editor extensions like Insert Unicode for typing these.
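As an informal illustration of how the three notations compose (not part of the proposal itself), the following C++17 sketch models each one as a small matcher combinator. The names lit, group, opt, seq_opt, seq, and matches_a_list are hypothetical, and the matchers are greedy without backtracking, which is a simplification of what the grammar notation means.

```cpp
#include <cstddef>
#include <string_view>

// A matcher takes the input and a start position. It returns the position
// just past the match, or npos if nothing matches at that position.
constexpr std::size_t npos = std::string_view::npos;

// Terminal symbol: matches exactly one given character.
constexpr auto lit(char c) {
    return [c](std::string_view in, std::size_t pos) -> std::size_t {
        return pos < in.size() && in[pos] == c ? pos + 1 : npos;
    };
}

// ⟪ a b ⟫: a followed by b, treated as a single syntactic element.
template <class A, class B>
constexpr auto group(A a, B b) {
    return [a, b](std::string_view in, std::size_t pos) -> std::size_t {
        const std::size_t mid = a(in, pos);
        return mid == npos ? npos : b(in, mid);
    };
}

// X opt: zero or one X.
template <class X>
constexpr auto opt(X x) {
    return [x](std::string_view in, std::size_t pos) -> std::size_t {
        const std::size_t end = x(in, pos);
        return end == npos ? pos : end;
    };
}

// X seq opt: zero or more X (greedy; stops when X fails or makes no progress).
template <class X>
constexpr auto seq_opt(X x) {
    return [x](std::string_view in, std::size_t pos) -> std::size_t {
        for (;;) {
            const std::size_t end = x(in, pos);
            if (end == npos || end == pos) return pos;
            pos = end;
        }
    };
}

// X seq: one or more X, i.e. X followed by X seq opt.
template <class X>
constexpr auto seq(X x) {
    return group(x, seq_opt(x));
}

// a ⟪ , a ⟫ seq opt: one or more 'a', separated by commas.
constexpr auto a_list = group(lit('a'), seq_opt(group(lit(','), lit('a'))));

constexpr bool matches_a_list(std::string_view in) {
    return a_list(in, 0) == in.size();
}

static_assert(matches_a_list("a"));
static_assert(matches_a_list("a,a,a"));
static_assert(!matches_a_list(""));
static_assert(!matches_a_list("a,"));
```

The a_list matcher mirrors the shape that the proposal uses for comma-separated lists: one element, followed by zero or more repetitions of a group consisting of a comma and another element.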
3.3. Alternatives considered
3.3.1. Alternative to seq and seq opt
An alternative syntax to seq and seq opt that was briefly considered
was superscript +
and superscript 🞰
for one-or-more and zero-or-more repetitions, respectively.
These characters are widely used in regular expressions with this meaning,
and superscript asterisks denote the
Kleene star in computer science papers.
However, these characters can be hard to distinguish, depending on font weight and font family.
Replacing
3.3.2. seq opt vs. opt seq
Both notations are equivalent in the sense that the same inputs would be matched.
opt seq
could be argued to be "more natural"
because it translates to "optional sequence".
However, if we consider seq
and opt
to be postfix unary operators,
then
would be a sequence where every a
either way,
but we should prefer the syntax that doesn't raise such questions in the first place.
3.3.3. Alternative brackets for groups
There are many possible Unicode bracket characters that could have been used instead. However,
- some are too exotic, like ⟅ S-SHAPED BAG DELIMITERS ⟆,
- some are just font variations of existing brackets, like FULLWIDTH or WHITE (which means that users with visual impairment could mistake these too easily, especially with a poor choice of font),
- some are used too commonly for mathematical operations, like ⌈ CEILING ⌉,
- some rely too much on "good fonts" and become too visually similar to parentheses for bold font weight, like ⟬ MATHEMATICAL WHITE TORTOISE SHELL BRACKETS ⟭,
- etc.
The chosen ⟪ MATHEMATICAL DOUBLE ANGLE BRACKETS ⟫ do not suffer from any of these issues, although they are visually similar to 《 DOUBLE ANGLE BRACKETS 》 used in Chinese punctuation for proper nouns.
4. Core wording
The changes are relative to [N5014].
[syntax]
Change [syntax] as follows:
1 In the syntax notation used in this document,
syntactic categories non-terminal symbols
are indicated by literal words and characters terminal symbols
in ⟪
and ⟫
.
Consecutive syntactic elements are listed from left to right.
Alternatives are listed on separate lines
except in a few cases where a long set of alternatives is marked by the phrase one of
.
If the text of an alternative is too long to fit on a line,
the text is continued on subsequent lines indented from the first one.
2 An optional terminal or non-terminal symbol
syntactic element is indicated
by the postfix subscript opt
, so .
{ expressionopt}
indicates an optional expression enclosed in braces.
One or more repetitions of a syntactic element
are indicated by the postfix subscript seq
.
[Example:
initializer-list :- initializer-clause
... opt ⟪, initializer-clause... opt ⟫seq opt
This notation means that the non-terminal symbol
,
followed by zero or more repetitions of
,
.
— end example]
2 3 [Note:
Names for syntactic categories have generally been chosen according to the following rules:
- X-name is a use of an identifier in a context that determines its meaning (e.g., class-name, typedef-name).
- X-id is an identifier with no context-dependent meaning (e.g., qualified-id).
- X-seq is one or more Xs without intervening delimiters (e.g., declaration-seq is a sequence of declarations).
- X-list is one or more Xs separated by intervening commas (e.g., identifier-list is a sequence of identifiers separated by commas).
— end note]
Bulk operations
Replace the following non-terminals
in the document with
Remove all definitions of the replaced non-terminals. These are all of the form:
X-seq:
    X X-seq opt
A
In my opinion, this is acceptable.
Referring to
From an English perspective, both
[…] from any
in [module.global.frag] paragraph 4,
rather than considering the
is eliminated.
This would allow using (direct) of
instead of (indirect) in
,
if we wanted to.
In my opinion, simply applying the bulk edit is fine; pointing out the indirection through seq would be an optional and redundant wording choice, but it wouldn't be incorrect.
Perhaps, CWG could decide on a consistent policy so that we always call out the seq indirection, or never do so.
[lex.name]
Change [lex.name]
identifier:
    identifier-start identifier-continue seq opt
    ~~identifier identifier-continue~~
[lex.icon]
Change [lex.icon]
binary-literal:
    ⟪ one of 0b 0B ⟫ binary-digit ⟪ ' opt binary-digit ⟫ seq opt
    ~~0B binary-digit~~
    ~~binary-literal ' opt binary-digit~~
Change [lex.icon]
octal-literal:
    0 ⟪ ' opt octal-digit ⟫ seq opt
    ~~octal-literal ' opt octal-digit~~
Change [lex.icon]
decimal-literal:
    nonzero-digit ⟪ ' opt digit ⟫ seq opt
    ~~decimal-literal ' opt digit~~
Do not change [lex.icon]
hexadecimal-literal:
    hexadecimal-prefix hexadecimal-digit-sequence
Change [lex.icon]
hexadecimal-digit-sequence:
    hexadecimal-digit ⟪ ' opt hexadecimal-digit ⟫ seq opt
    ~~hexadecimal-digit-sequence ' opt hexadecimal-digit~~
[lex.fcon]
Change [lex.fcon]
exponent-part:
    ⟪ one of e E ⟫ sign opt digit-sequence
    ~~E sign opt digit-sequence~~
Change [lex.fcon]
binary-exponent-part:
    ⟪ one of p P ⟫ sign opt digit-sequence
    ~~P sign opt digit-sequence~~
Change [lex.fcon]
digit-sequence:
    digit ⟪ ' opt digit ⟫ seq opt
    ~~digit-sequence ' opt digit~~
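As an informal illustration (not part of the proposed wording), the new form of digit-sequence reads almost directly as a loop. The following C++17 sketch checks whether an entire string is a digit-sequence; the function names is_digit and is_digit_sequence are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }

// digit-sequence:
//     digit ⟪ ' opt digit ⟫ seq opt
constexpr bool is_digit_sequence(std::string_view s) {
    std::size_t i = 0;
    // Leading digit.
    if (i >= s.size() || !is_digit(s[i])) return false;
    ++i;
    // Each loop iteration consumes one repetition of the group ⟪ ' opt digit ⟫.
    while (i < s.size()) {
        if (s[i] == '\'') ++i;                                // ' opt
        if (i >= s.size() || !is_digit(s[i])) return false;   // digit
        ++i;
    }
    return true;
}

static_assert(is_digit_sequence("0"));
static_assert(is_digit_sequence("1'000'000"));
static_assert(!is_digit_sequence("'1"));
static_assert(!is_digit_sequence("1''0"));
static_assert(!is_digit_sequence("1'"));
```

The left-recursive original describes the same set of strings, but the reader has to unfold the recursion mentally to see that a separator can only appear between two digits.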
[lex.string]
Change the grammar in [lex.string] as follows:
- […]
raw-string:
    " d-char~~-sequence~~ seq opt ( r-char~~-sequence~~ seq opt ) d-char~~-sequence~~ seq opt "
~~r-char-sequence:~~
    ~~r-char r-char-sequence opt~~
r-char:
    any member of the translation character set, except a U+0029 RIGHT PARENTHESIS followed by the initial ~~d-char-sequence (which may be empty)~~ d-char seq opt followed by U+0022 QUOTATION MARK
- […]
The current "(which may be empty)" seems incorrect anyway;
a
The optional
attribute-specifier-seq […]
For consistency, we could say
[…] by the initial optional
d-seq seq followed by U+0022 QUOTATION MARK
However, "initial optional" feels a bit clunky.
[dcl.spec.general]
Change [dcl.spec.general]
~~decl-specifier-seq~~ decl-specifiers-and-attributes:
    decl-specifier seq attribute-specifier~~-seq~~ seq opt
    ~~decl-specifier decl-specifier-seq~~
Replace all occurrences of decl-specifier-seq with decl-specifiers-and-attributes.
[dcl.type.general]
Change [dcl.type.general]
~~type-specifier-seq~~ type-specifiers-and-attributes:
    type-specifier seq attribute-specifier~~-seq~~ seq opt
    ~~type-specifier type-specifier-seq~~
Replace all occurrences of type-specifier-seq with type-specifiers-and-attributes.
Change [dcl.type.general]
~~defining-type-specifier-seq~~ defining-type-specifiers-and-attributes:
    defining-type-specifier seq attribute-specifier~~-seq~~ seq opt
    ~~defining-type-specifier type-specifier-seq~~
Replace all occurrences of defining-type-specifier-seq with defining-type-specifiers-and-attributes.
[dcl.decl.general]
Change [dcl.decl.general]
ptr-declarator:
    ptr-operator seq opt noptr-declarator
    ~~ptr-operator ptr-declarator~~
[dcl.fct]
Change [dcl.fct]
parameter-declaration-list:
    parameter-declaration ⟪ , parameter-declaration ⟫ seq opt
    ~~parameter-declaration-list , parameter-declaration~~
[dcl.init.general]
Change [dcl.init.general]
initializer-list:
    initializer-clause ... opt ⟪ , initializer-clause ... opt ⟫ seq opt
    ~~initializer-list , initializer-clause ... opt~~
Change [dcl.init.general]
designated-initializer-list:
    designated-initializer-clause ⟪ , designated-initializer-clause ⟫ seq opt
    ~~designated-initializer-list , designated-initializer-clause~~
[dcl.enum]
Change [dcl.enum]
enumerator-list:
    enumerator-definition ⟪ , enumerator-definition ⟫ seq opt
    ~~enumerator-list , enumerator-definition~~
Change [dcl.enum]
enumerator-definition:
    ~~enumerator~~
    enumerator ⟪ = constant-expression ⟫ opt
[namespace.udecl]
Change [namespace.udecl]
using-declarator-list:
    using-declarator ... opt ⟪ , using-declarator ... opt ⟫ seq opt
    ~~using-declarator-list , using-declarator ... opt~~
[dcl.attr.grammar]
Do not change [dcl.attr.grammar]
attribute-list:
    attribute opt
    attribute-list , attribute opt
    attribute ...
    attribute-list , attribute ...
I did not want to factor out a new
Change [dcl.attr.grammar]
annotation-list:
    annotation ... opt ⟪ , annotation ... opt ⟫ seq opt
    ~~annotation-list , annotation ... opt~~
[stmt.select.general]
Change [stmt.select.general] [selection], [statement] as follows:
selection-statement:
    if constexpr opt ( init-statement opt condition ) statement ⟪ else statement ⟫ opt
    ~~if constexpr opt ( init-statement opt condition ) statement else statement~~
    if ! opt consteval compound-statement ⟪ else statement ⟫ opt
    ~~if ! opt consteval compound-statement else statement~~
    switch ( init-statement opt condition ) statement
[stmt.if]
Change [stmt.if] paragraph 1 as follows:
If the condition ([stmt.pre]) yields ,
the first substatement is executed.
If the part selection statement ,
the second substatement is executed.
If the first substatement is reached via a label,
the condition is not evaluated and the second substatement is not executed.
In the second form of
In an statement
(the one including ), statement where the statement ,
then that inner statement
shall contain an part
is not relevant to this change
because it can only contain a
"Part" is removed because we never define what an " part" is,
so this wording smells a bit fishy.
[dcl.pre]
Change [dcl.pre]
sb-identifier-list:
    sb-identifier ⟪ , sb-identifier ⟫ seq opt
    ~~sb-identifier-list , sb-identifier~~
Change [dcl.pre]
static_assert-declaration:
    static_assert ( constant-expression ⟪ , static_assert-message ⟫ opt )
    ~~static_assert ( constant-expression , static_assert-message )~~
5. Library wording
The changes are relative to [N5014].
[locale.numpunct.general]
Change [locale.numpunct.general] paragraph 2 as follows:
[…] Integer values have the format:
units:
    ~~digits~~
    ~~digits thousands-sep units~~
    digit seq ⟪ thousands-sep digit seq ⟫ seq opt
~~digits:~~
    ~~digit digits opt~~
and floating-point values have:
floatval:
    sign opt units fractional opt exponent opt
    sign opt decimal-point ~~digits~~ digit seq exponent opt
fractional:
    decimal-point ~~digits~~ digit seq opt
exponent:
    ~~e~~ ⟪ one of e E ⟫ sign opt ~~digits~~ digit seq
~~e:~~
    ~~e E~~
where the number of digits between
[locale.moneypunct.general]
Change [locale.moneypunct.general] paragraph 3 as follows:
The format of the numeric monetary value is a decimal number:
value:
    units fractional opt
    decimal-point ~~digits~~ adigit seq
fractional:
    decimal-point ~~digits~~ adigit seq opt
if
value:
    units
otherwise.
The symbol
units:
    ~~digits~~
    ~~digits thousands-sep units~~
    adigit seq ⟪ thousands-sep adigit seq ⟫ seq opt
In the syntax specification,
the symbol through (inclusive)
and is a reference of type […]
[format.string.general]
Change the grammar in [format.string.general] as follows:
positive-integer:
    nonzero-digit digit seq opt
    ~~positive-integer digit~~
nonnegative-integer:
    digit seq
    ~~nonnegative-integer digit~~
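As an informal illustration (not part of the proposed wording), the two rewritten rules map onto very short whole-string checks. This is a C++17 sketch; the function names are hypothetical.

```cpp
#include <string_view>

constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }

// nonnegative-integer:
//     digit seq
constexpr bool is_nonnegative_integer(std::string_view s) {
    if (s.empty()) return false;   // seq requires at least one digit
    for (char c : s)
        if (!is_digit(c)) return false;
    return true;
}

// positive-integer:
//     nonzero-digit digit seq opt
constexpr bool is_positive_integer(std::string_view s) {
    // The first character must not be '0'; together with the all-digits
    // check below, this makes it a nonzero-digit followed by digit seq opt.
    if (s.empty() || s[0] == '0') return false;
    return is_nonnegative_integer(s);
}

static_assert(is_nonnegative_integer("007"));
static_assert(!is_positive_integer("007"));
static_assert(is_positive_integer("10"));
static_assert(!is_nonnegative_integer(""));
```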
[time.format]
Do not change [time.format]
chrono-specs:
    conversion-spec
    chrono-specs conversion-spec
    chrono-specs literal-char
[fs.path.generic]
Change the grammar in [fs.path.generic] as follows:
- […]
relative-path:
    ~~filename~~
    ~~filename directory-separator relative-path~~
    ~~an empty path~~
    ⟪ filename directory-separator ⟫ seq opt filename opt
filename:
    non-empty sequence of characters other than directory-separator characters
directory-separator:
    preferred-separator directory-separator opt
    fallback-separator directory-separator opt
- […]
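As an informal illustration (not part of the proposed wording), the rewritten relative-path rule can be checked with two nested loops. This C++17 sketch assumes that '/' is the only directory-separator character, which is not true on every platform; the names is_separator and is_relative_path are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Assumption for this sketch: '/' is the only separator character.
constexpr bool is_separator(char c) { return c == '/'; }

// relative-path:
//     ⟪ filename directory-separator ⟫ seq opt filename opt
constexpr bool is_relative_path(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        // filename: non-empty run of non-separator characters.
        const std::size_t start = i;
        while (i < s.size() && !is_separator(s[i])) ++i;
        if (i == start) return false;   // a filename cannot be empty
        // directory-separator: one or more separator characters.
        // If there are none, the input ends here and the filename just read
        // is the trailing "filename opt".
        while (i < s.size() && is_separator(s[i])) ++i;
    }
    return true;   // includes the empty path
}

static_assert(is_relative_path(""));
static_assert(is_relative_path("a"));
static_assert(is_relative_path("a/b"));
static_assert(is_relative_path("a//b/"));
static_assert(!is_relative_path("/a"));
```

Under this assumption, the rule accepts exactly the strings that do not begin with a separator, which matches the three alternatives that it replaces.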