1. Change Log
- R3
  - Minor wording fixes.
  - Added string-literal as a follower for `import`.
- R2
  - Add wording that rejects `#include`s from turning into imports in the purview of a module.
  - Reword how directive-introducing token is defined.
  - Make it ill-formed to have an object-like macro with the names `export`, `module`, or `import` defined when they are used as a directive.
  - Restrict the global module fragment to preprocessing directives that start with `#`.
- R1
  - Added an example of using `#if` to change the module declaration.
  - Clarified benchmark.
  - Adding token backtracking feedback from implementors.
2. Effects of This Paper
At the start of phase 4 an `import` or `module` token is treated as starting a directive and is converted to its respective keyword iff:
- After skipping horizontal whitespace it is
  - at the start of a logical line, or
  - preceded by an `export` at the start of the logical line.
- It is followed by an identifier pp token (before macro expansion), or
  - `<`, `"`, or `:` (but not `::`) pp tokens for `import`, or
  - `;` for `module`.
Otherwise the token is treated as an identifier.
Additionally:
- The entire `import` or `module` directive (including the closing `;`) must be on a single logical line, and for `module` it must not come from an `#include`.
- The expansion of macros must not result in an `import` or `module` directive introducer that was not there prior to macro expansion.
- A `module` directive may only appear as the first preprocessing tokens in a file (excluding the global module fragment).
- Preprocessor conditionals shall not span a module declaration.
Failure to meet these additional requirements makes the program ill-formed (rather than not being interpreted as a directive).
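The follower rules above can be sketched as a small line classifier. This is only an illustration under simplifying assumptions: it looks at one already-spliced logical line, uses single-character lookahead instead of real pp-token lexing, and follows the summary in this section as written (identifier or `;` after `module`). The names `Introducer`, `skip_hspace`, `read_identifier`, and `classify_line` are hypothetical and do not come from any implementation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum class Introducer { None, Import, Module };

// Hypothetical helper: index of the first character at or after `pos`
// that is not a space or tab.
static std::size_t skip_hspace(const std::string& s, std::size_t pos) {
    while (pos < s.size() && (s[pos] == ' ' || s[pos] == '\t')) ++pos;
    return pos;
}

static bool is_ident_start(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
}

// Hypothetical helper: consumes an identifier at `pos`; returns "" if none.
static std::string read_identifier(const std::string& s, std::size_t& pos) {
    if (pos >= s.size() || !is_ident_start(s[pos])) return "";
    const std::size_t start = pos;
    while (pos < s.size() &&
           (std::isalnum(static_cast<unsigned char>(s[pos])) || s[pos] == '_')) ++pos;
    return s.substr(start, pos - start);
}

// Classifies one logical line using only the follower rules listed above.
Introducer classify_line(const std::string& line) {
    std::size_t pos = skip_hspace(line, 0);
    std::string word = read_identifier(line, pos);
    if (word == "export") {  // an optional leading `export`
        pos = skip_hspace(line, pos);
        word = read_identifier(line, pos);
    }
    if (word != "import" && word != "module") return Introducer::None;
    pos = skip_hspace(line, pos);
    if (pos >= line.size()) return Introducer::None;  // no follower on this line
    const char c = line[pos];
    if (is_ident_start(c)) return word == "import" ? Introducer::Import : Introducer::Module;
    if (word == "module") return c == ';' ? Introducer::Module : Introducer::None;
    const bool scope = c == ':' && pos + 1 < line.size() && line[pos + 1] == ':';
    if (c == '<' || c == '"' || (c == ':' && !scope)) return Introducer::Import;
    return Introducer::None;
}

void example_usage() {
    assert(classify_line("import <string>;") == Introducer::Import);
    assert(classify_line("export module leftpad;") == Introducer::Module);
    assert(classify_line("import = 42;") == Introducer::None);  // plain identifier
}
```

The additional requirements (single logical line, no introducer produced by macro expansion, placement of `module`) are not modeled here; they make a program ill-formed rather than changing the classification.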
Status Quo | This Paper |
---|---|
|
|
|
|
Legend:
- ✅ The declaration was accepted, but may not be a modules declaration.
- ⛔ A token was identified as a modules keyword, but the program is later found to be ill-formed.
- ❌ Ill-formed.
- ☠️ Undefined behavior.
3. Motivation
3.1. Fast Dependency Scanning
Fast dependency scanning relies on partial preprocessing. Several people are working on fast dependency scanning for modules, and they all share the common trait of skipping over non-directives. clang-scan-deps is one such tool and is a representative motivating use case for these changes.
clang-scan-deps works by minimizing source code via partial preprocessing.
Partial preprocessing works by running phases 1-3 of translation. Minimization
proceeds by throwing away every line which is not a preprocessing directive,
then removing all preprocessing directives that can’t impact the set of dependencies for that file.
Up to C++17 this means it only needs to keep
,
,
,
,
,
, and
. And only the bodies of
s that contain a
preprocessing directive that could impact dependencies. This reduces most files down to their header guards and list of
s, and is correct for standard C++ except for abuse of
such
as
. The only other things that break it are compiler
extensions which do not follow the parsing rules of
directives such as
, or
in inline assembly. Additionally,
abuse of the compiler extension
can break this.
This minimization is context free, and is done once for each file in the entire build.
Then the minimized source file is fully preprocessed by running phases 1-4, with
`#include`s resolving to their minimized equivalent, macros expanded, and all
preprocessing directives executed.
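The first minimization step described above (discard every line that is not a preprocessing directive) can be sketched for pre-C++20 code as follows. This is not clang-scan-deps code; `is_hash_directive` and `minimize` are hypothetical names, and the sketch assumes phases 1-3 have already run, so line splices and comments are gone. Digraph spellings of `#` and the second step (dropping directives that cannot affect dependencies) are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// True if the first non-whitespace character of the line is '#'.
bool is_hash_directive(const std::string& line) {
    const std::size_t pos = line.find_first_not_of(" \t");
    return pos != std::string::npos && line[pos] == '#';
}

// Keeps only '#' directive lines, one decision per line and no other state,
// which is what makes the pass context free.
std::string minimize(const std::string& source) {
    std::istringstream in(source);
    std::string line, out;
    while (std::getline(in, line)) {
        if (is_hash_directive(line)) out += line + '\n';
    }
    return out;
}

void example_usage() {
    const std::string source =
        "#ifndef WIDGET_H\n"
        "#define WIDGET_H\n"
        "#include <vector>\n"
        "struct Widget { std::vector<int> parts; };\n"
        "#endif\n";
    assert(minimize(source) ==
           "#ifndef WIDGET_H\n#define WIDGET_H\n#include <vector>\n#endif\n");
}
```

Because each line is judged on its own, the result can be cached per file and reused for every translation unit that includes it.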
Minimization followed by using clang’s full preprocessor currently provides
about a 9x speedup (~3 seconds vs ~28 seconds, out of a 15m build on an 18-core
iMac Pro) over no minimization when scanning all of llvm and clang (~7k files and ~3.8m LoC). We do not
have access to a large C++20 modules-only (no header units) codebase to test
on, but that case still requires scanning the entire file due to the potential
for `#include`-based x-macros, which will not be going away with modules, as they are intentionally non-modular
includes. There will still be a difference
in that we no longer have normal includes, and thus don’t need to preprocess a
header for each TU it’s used in.
Of the 3s only 33ms are spent minimizing, while the majority of the time is spent in the full lexer, preprocessor, and header search. We believe we can reduce this overhead. This overhead is important as it’s on the critical path. Any speedup here turns into a direct reduction in build latency, which is important for extremely parallel builds.
C++20 currently breaks this approach. [P1703r1] resolved the issue for `import`, but dependency scanning also needs to find `module` declarations, and
the current rules are not enough.
3.2. `import` is Too Relaxed
[P1703r1] went too far in fixing the dependency scanning issue with `import`.
Now any line starting with `import` is treated as an import directive, even if
it obviously couldn’t be. This breaks a decent amount of real code with
function arguments or local variables named `import`, as any use of them without
a prefix is treated as an import directive. Prepending `::` doesn’t work;
you must wrap some part of the expression in `()`.
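To make the difference concrete, the sketch below compares the rule as this paper describes [P1703r1] (the introducer alone decides) with the follower check proposed here, applied to single lines of text. The function names are hypothetical, a leading `export` is ignored for brevity, and the lookahead is a single character rather than a real pp-token.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `line` starts with the word `import` (after
// optional horizontal whitespace); `rest` receives what follows it.
bool starts_with_import(const std::string& line, std::string& rest) {
    const std::size_t pos = line.find_first_not_of(" \t");
    if (pos == std::string::npos || line.compare(pos, 6, "import") != 0) return false;
    const std::size_t end = pos + 6;
    if (end < line.size() &&
        (std::isalnum(static_cast<unsigned char>(line[end])) || line[end] == '_'))
        return false;  // e.g. `imported`, a different identifier
    const std::size_t next = line.find_first_not_of(" \t", end);
    rest = next == std::string::npos ? "" : line.substr(next);
    return true;
}

// Rule as this paper describes [P1703r1]: the introducer alone decides.
bool is_import_status_quo(const std::string& line) {
    std::string rest;
    return starts_with_import(line, rest);
}

// Rule proposed here: the next token must also look like part of an import.
bool is_import_proposed(const std::string& line) {
    std::string rest;
    if (!starts_with_import(line, rest) || rest.empty()) return false;
    const unsigned char c = static_cast<unsigned char>(rest[0]);
    if (std::isalpha(c) || c == '_' || c == '<' || c == '"') return true;
    return c == ':' && !(rest.size() > 1 && rest[1] == ':');
}

void example_usage() {
    // A statement assigning to a local variable named `import`.
    assert(is_import_status_quo("import = load(path);"));
    assert(!is_import_proposed("import = load(path);"));
    // The parenthesized workaround is never an import under either rule.
    assert(!is_import_status_quo("(import) = load(path);"));
}
```

Real import directives such as `import <string>;` are accepted by both predicates, so the proposed check only changes the outcome for lines that could not have been imports.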
4. Discussion
4.1. Implementation
There were some concerns expressed during the discussion of [P1703r1] that there may be performance or complexity issues. I have implemented this fix in Clang. It was rather simple to implement, with the most complex part being token lookahead while lexing (which isn’t that complicated). Measurement did not show any performance impact, which I expect is due to the codepath only occurring with the `import` and `module` tokens.
4.2. One Line Restriction
To be fully resilient, a modules directive must be entirely on a single logical line. You could have a rule that the `;` must not come from a macro and
say that the directive extends to the next `;`, but that is incorrect:

    #define f(x) "blah" #x "blah"
    import f ( ; );
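The reason the "next `;`" rule fails can be checked with ordinary macro expansion, outside of any import. In the sketch below the same macro is used to initialize a string (the variable name `expanded` is hypothetical): the first `;` is the macro argument and is stringized, so it does not terminate anything.

```cpp
#include <cassert>
#include <cstring>

#define f(x) "blah" #x "blah"

// The first `;` is the argument of f and becomes part of the string;
// only the second `;` ends the declaration.
const char* expanded = f ( ; );

#undef f

void example_usage() {
    assert(std::strcmp(expanded, "blah;blah") == 0);
}
```

A scanner that stops at the first `;` it sees would therefore cut the directive in the middle of a macro invocation, which is why the whole directive is required to sit on one logical line instead.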
4.3. Extra Dependencies
    #define eat(x)
    eat (
    import < a > ;
    );
In the WD this has undefined behavior due to [cpp]/2
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the following constraints: The first token in the sequence, referred to as a directive-introducing token, is a # preprocessing token, an import preprocessing token, or an export preprocessing token immediately followed by an import preprocessing token, that (at the start of translation phase 4) either begins with the first character in the source file (optionally after white space containing no new-line characters) or follows white space containing at least one new-line character.
and [cpp.replace]/11
If there are sequences of preprocessing tokens within the list of [macro] arguments that would otherwise act as preprocessing directives, the behavior is undefined.
A minimizer will transform this into just the `import < a > ;` line, which when preprocessed
will form a valid module import. There are two outcomes of this:
- The dependency scanner emits an error because `< a >` is not a header unit or it is able to detect the usage in a macro.
- The dependency scanner succeeds and then the compiler sees the code and i̴̝̍ĝ̸̠̳͚̻ņ̸̱̗̅́̾̐͠o̶̲̳̫͒͊͊ȑ̵̺̱͚̩͎͌͑͒̕e̸̗̾s̸̮̰̻̥͑̑͊̂̆ ̴̤͍̄ȉ̸̬͝m̵̞͈̿́p̷̢̛̣̹̒̑̽̀ǫ̴̜̖̱̈̏̈͝r̸̨̻̖̪̔̍̾ͅt̵̙̑s̸̩͎̜̼͑́̚ ̵̤̒͑ȩ̶̹̫̥͗̂r̵͍͈̦͗r̵̩̲̊̿͝ǫ̸̞̙́͊͐͗͗r̷̝̥̬̲͇̀͋̚s̷̰̠͙̤̉ͅ due to undefined behavior.
This is not the best outcome, as an implementation is not guaranteed to emit a
diagnostic. In my tests of clang, msvc, gcc, and icc, only clang rejects
in this case, and no compiler rejects an import directive (I plan
to fix this in clang as part of implementing P1703 and this paper). This should
be fixed by changing [cpp.replace]/11 to be ill-formed instead of UB, but that deserves a separate paper. With that
change, though, the possible outcomes would be:
- The dependency scanner emits an error because `< a >` is not a header unit or it is able to detect the usage in a macro.
- The dependency scanner succeeds and then the compiler sees the code and rejects it because directives cannot appear as a macro argument.
Both of these outcomes are fine as their result is the same. If a partial preprocessing would ever get the dependencies wrong, the compiler will reject the program as ill-formed.
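The scanner side of this scenario can be sketched as a purely line-based search that does not track macro invocations. It assumes the `import < a > ;` tokens sit on their own line inside the `eat ( ... )` argument list; `scan_imports` is a hypothetical name, not a function of any existing tool.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical line-based scan: collects the operand of every line that
// starts with `import`, without tracking macro invocations.
std::vector<std::string> scan_imports(const std::string& source) {
    std::vector<std::string> found;
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t pos = line.find_first_not_of(" \t");
        if (pos == std::string::npos || line.compare(pos, 7, "import ") != 0) continue;
        const std::size_t end = line.find(';', pos);
        if (end == std::string::npos) continue;  // directive must end on this line
        const std::string operand = line.substr(pos + 7, end - (pos + 7));
        const std::size_t first = operand.find_first_not_of(" \t");
        if (first == std::string::npos) continue;  // nothing between `import` and `;`
        const std::size_t last = operand.find_last_not_of(" \t");
        found.push_back(operand.substr(first, last - first + 1));
    }
    return found;
}

void example_usage() {
    const std::string source =
        "#define eat(x)\n"
        "eat (\n"
        "import < a > ;\n"
        ");\n";
    const std::vector<std::string> deps = scan_imports(source);
    // The scan reports a dependency even though the line is a macro argument.
    assert(deps.size() == 1 && deps[0] == "< a >");
}
```

The scan reports `< a >` as a dependency regardless of the enclosing macro call, which is acceptable only if the compiler is later required to reject the program.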
4.4. Ship Vehicle
This needs to ship with modules as this is a backwards incompatible change. Thus it needs to be in C++20.
5. Wording
5.1. [lex.pptoken]
    preprocessing-token:
        header-name
        import-keyword
        module-keyword
        export-keyword
        ...
4 The import-keyword is produced by processing an import directive ([cpp.import]), the module-keyword is produced by preprocessing a module directive ([cpp.module]), and the export-keyword is produced by preprocessing either of the previous two directives. ~~and has no associated grammar productions~~ [Note: None have any observable spelling. —end note]
5.2. [lex.key]
Add import-keyword, module-keyword, and export-keyword to Table 5: Keywords [tab:lex.key].
5.3. [basic.link]
    translation-unit:
        top-level-declaration-seq opt
        global-module-fragment opt module-declaration top-level-declaration-seq opt private-module-fragment opt
    private-module-fragment:
        ~~module~~ module-keyword : private ; top-level-declaration-seq opt
    ...
3 A token sequence beginning with `export` opt `module` and not immediately followed by `::` is never interpreted as the declaration of a top-level-declaration.
5.4. [module.unit]
    module-declaration:
        ~~export~~ export-keyword opt ~~module~~ module-keyword module-name module-partition opt attribute-specifier-seq opt ;
5.5. [module.import]
    module-import-declaration:
        ~~export~~ export-keyword opt import-keyword module-name attribute-specifier-seq opt ;
        ~~export~~ export-keyword opt import-keyword module-partition attribute-specifier-seq opt ;
        ~~export~~ export-keyword opt import-keyword header-name attribute-specifier-seq opt ;
5.6. [module.global]
    global-module-fragment:
        ~~module~~ module-keyword ; top-level-declaration-seq opt
5.7. [cpp.pre]
1 A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the following constraints: At the start of translation phase 4, the ~~The~~ first token in the sequence, referred to as a directive-introducing token, begins with the first character in the source file (optionally after white space containing no new-line characters) or follows white space containing at least one new-line character, and is
- a `#` preprocessing token, or
- an `import` preprocessing token ~~, or an `export` preprocessing token immediately followed by an `import` preprocessing token,~~ immediately followed on the same logical line by a header-name, `<`, identifier, string-literal, or `:` preprocessing token, or
- a `module` preprocessing token immediately followed on the same logical line by an identifier, `:`, or `;` preprocessing token, or
- an `export` preprocessing token immediately followed on the same logical line by one of the two preceding forms.
~~that (at the start of translation phase 4) either begins with the first character in the source file (optionally after white space containing no new-line characters) or follows white space containing at least one new-line character.~~ The last token in the sequence is the first token in the sequence that is immediately followed by whitespace containing a new-line character ~~that follows the first token in the sequence~~. [Note: A new-line character ends the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro. —end note] [Example:

    #                        // preprocessing directive
    module ;                 // preprocessing directive
    export module leftpad ;  // preprocessing directive
    import < string > ;      // preprocessing directive
    export import "squee" ;  // preprocessing directive
    import rightpad ;        // preprocessing directive
    import : part ;          // preprocessing directive
    module                   // not a preprocessing directive
    ;                        // not a preprocessing directive
    export                   // not a preprocessing directive
    import                   // not a preprocessing directive
    foo ;                    // not a preprocessing directive
    export                   // not a preprocessing directive
    import foo ;             // preprocessing directive (ill-formed at phase 7)
    import ::                // not a preprocessing directive
    import ->                // not a preprocessing directive

—end example]

    preprocessing-file:
        group opt
        module-file
    ...
    module-file:
        pp-global-module-fragment opt pp-module group opt pp-private-module-fragment opt
    pp-global-module-fragment:
        module ; new-line group opt
    pp-private-module-fragment:
        module : private ; new-line group opt
    control-line:
        # include pp-tokens new-line
        export opt import pp-tokens new-line
        pp-import
        ...

2 A ~~text line shall not begin with a # preprocessing token~~ sequence of preprocessing tokens is only a text-line if it does not begin with a directive-introducing token. A conditionally-supported-directive shall not begin with any of the directive names appearing after a `#` in the syntax. A conditionally-supported-directive is conditionally-supported with implementation-defined semantics.
3 At the start of phase 4 of translation the group of a pp-global-module-fragment shall neither contain a control-line not starting with a `#` preprocessing token nor a text-line.
5.8. [cpp.module]
Add a new section before [cpp.import]:
Module directive [cpp.module]

    pp-module:
        export opt module pp-tokens opt ; new-line

1 A pp-module shall neither appear in a context where `module` is an identifier defined as an object-like macro nor where `export` is an identifier defined as an object-like macro if the first token of the pp-module is `export`.
2 Any preprocessing tokens after the `module` preprocessing token in the `module` directive are processed just as in normal text. [Note: Each identifier currently defined as a macro name is replaced by its replacement list of preprocessing tokens. —end note]
3 The `module` and `export` (if it exists) preprocessing tokens are replaced by the module-keyword and export-keyword preprocessing tokens respectively. [Note: This makes the line no longer a directive so it is not removed at the end of phase 4. —end note]
5.9. [cpp.import]
Insert a new paragraph immediately after the grammar.
A pp-import shall neither appear in a context where `import` is an identifier defined as an object-like macro nor where `export` is an identifier defined as an object-like macro if the first token of the pp-import is `export`.
Insert a new paragraph after paragraph 1
If an import-directive is produced by source file inclusion (including by the rewrite produced when a `#include` directive names an importable header) while processing the group of a module-file, the program is ill-formed.
Update paragraph 2
In all three forms of pp-import, the `import` and `export` (if it exists) ~~token~~ preprocessing tokens ~~is~~ are replaced by the import-keyword and export-keyword ~~token~~ preprocessing tokens respectively. [Note: This makes the line no longer a directive so it is not removed at the end of phase 4. —end note] Additionally, in the second form of pp-import, a header-name token is formed as if the header-name-tokens were the pp-tokens of a `#include` directive. The header-name-tokens are replaced by the header-name token. [Note: This ensures that imports are treated consistently by the preprocessor and later phases of translation. —end note]
5.10. [cpp.glob.frag]
Remove this section.

    pp-global-module-fragment:
        module ; pp-balanced-token-seq module
    pp-balanced-token-seq:
        pp-balanced-token
        pp-balanced-token-seq pp-balanced-token
    pp-balanced-token:
        pp-ldelim pp-balanced-token-seq opt pp-rdelim
        any preprocessing-token other than a pp-ldelim or pp-rdelim
    pp-ldelim: one of
        ( [ { <: <%
    pp-rdelim: one of
        ) ] } :> %>

1 If the first two preprocessing tokens at the start of phase 4 of translation are module ;, the result of preprocessing shall begin with a pp-global-module-fragment for which all preprocessing-tokens in the pp-balanced-token-seq were produced directly or indirectly by source file inclusion ([cpp.include]), and for which the second module preprocessing-token was not produced by source file inclusion or macro replacement ([cpp.replace]). Otherwise, the first two preprocessing tokens at the end of phase 4 of translation shall not be module ;.
5.11. [diff.cpp17.basic]
The editor should consider a different section of annex C to move this to.
Affected subclauses: [basic.link], [module.unit], and [module.import]
Change: New identifiers with special meaning.
Rationale: Required for new features.
Effect on original feature: ~~Top-level declarations~~ Logical lines beginning with `module` or `import` may be ~~either ill-formed or~~ interpreted differently in this International Standard. [ Example:

    class module {} ;
    module * m1 ;     // ill-formed; previously well-formed
    :: module * m2 ;  // OK
    module m1 ;       // was variable declaration; now module-declaration
    module * m1 ;     // OK
    class import {};
    import j1 ;       // was variable declaration; now import-declaration
    :: import j2 ;    // variable declaration

— end example ]
6. Acknowledgments
Thanks to Boris Kolpackov for writing P1703, Mathias Stearn and Walter Brown for feedback on this paper and to Richard Smith, Corentin, Gabriel Dos Reis, Ben Craig, Matthew Woehlke, and Nathan Sidwell for discussion regarding this issue.