ISO/IEC JTC1 SC22 WG21 N3426 = 12-0116 - 2012-09-23
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
Introduction
Problem
Prior Approaches
The Pre-Parsed Header Approach
Approach Constraints
Implementation of the Approach
Implementation Status
GCC Implementation Problems
Google Build Problems
Inherent Implementation Problems
Changes to the Problem
Alternate Approaches
Reduce Project Scope to a Chainable PCH
Stream Parser Actions
Make Definitions Exclusive to a Package
More Aggressively Compile Inline Functions
Implement Export Control
Summary
Google's build system is highly parallel. As a consequence, the dominant factors in interactive programming latency are the time to compile the longest translation unit and the time to link the executable. In this paper, we briefly describe some of our efforts to reduce interactive compile time. We go into further detail on our latest effort, “Pre-Parsed Headers”, which can be considered a precursor to C++ modules. In particular, the lessons arising from our effort have implications for the pending C++ standards effort on modules.
The compilation model of C/C++ is a large number of translation units, each textually including headers that are often used by many other translation units. This model causes two problems.
Because header files are textually included, the meaning of a header file can be affected, intentionally or otherwise, by the headers that happen to be included before it. For example, the system header stddef.h accepts “parameters” in the form of #defines specified before the #include. These #defines, and their consequences within stddef.h, then spread to all the other header files included downstream. As a consequence, the parametric nature of some headers makes the meaning of all headers potentially unreliable.
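For illustration, a sketch of the pattern, using the macro convention historically honored by the stddef.h shipped with GCC and glibc (the exact macro and file names here are illustrative, not a guaranteed interface):

    // client.cc
    #define __need_size_t     // ask stddef.h to provide only size_t
    #include <stddef.h>       // defines size_t, but possibly not ptrdiff_t,
                              // wchar_t, or NULL

    // Every header included after this point sees whatever subset stddef.h
    // chose to provide, so its meaning depends on the inclusion order.
    #include "my_header.h"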
It is clear that large systems are being built with the current approach, and therefore that the problem is manageable. However, programmer productivity would be better if the problem did not exist.
In C++, headers include a large number of inline function definitions and template definitions, which dramatically increases the size of headers. Indeed, in the limiting case, C++ compilation time is proportional to the product of the number of header files and the number of translation units. In practice, the complexity is not that bad, but compilation is still one of the largest components of build latency. Generally, though, much of the build work is redundant.
In addition to the redundant work between translation units, there is redundant work within a single translation unit as it is recompiled during the edit-compile-debug cycle. Most headers only very rarely change meaning within that cycle, yet their textual recompilation still contributes to the latency of the cycle.
Google's codebase is somewhat different from many in that we follow an “include what you use” approach. As a consequence, Google compilation units tend to use a high fraction of the symbols they include. We have measured symbol use rates in the range of 70%.
The primary prior approach to reducing compile time is pre-compiled headers (PCH). The compiler can load a PCH file instead of parsing the first header file inclusion. When any file included from that first header file changes, then the entire PCH file must be rebuilt. PCH is effective when translation units use a common, large, and stable header that represents a substantial portion of parse time. In Google's environment, no such header exists. In general, we found that any PCH header that represented a substantial portion of parse time was too unstable to provide any net benefit.
As an experiment, we tried Pre-tokenized Headers (PTH). In this approach, the compiler saves the tokens from a header into a state file. The next time a header is included, instead of tokenizing the text, the compiler reads the previously saved stream of tokens. This approach was relatively straightforward to implement, but the savings were negligible. In the meantime, normal optimizations to the preprocessor and increased effort in the parser and code generator have further reduced the benefit.
We next explored an Incremental Compiler Server. In this approach, some roles of the build system and the compiler are reversed. The compiler lives as a server in the nodes of the build cluster. The compiler uses an in-memory database of previously seen header files to determine whether a header needs to be re-parsed or recovered from the database. Long before implementation was complete, we determined that the approach would be too difficult to maintain in our build system, primarily due to issues in the repeatability of failures.
The fundamental change in the pre-parsed header approach over earlier approaches is that it uses a factored representation, so that a typical compilation unit will read multiple pre-parsed header (PPH) files. With this approach, a change in a header will only invalidate the PPH files that either directly include that header or depend on a PPH file that was invalidated. Thus, in the DAG of header inclusions, PPH only requires recompiling the PPH files that are on a path of includes from the changed header to the compilation unit. The majority of PPH files would be unchanged by most source changes.
The approach had significant constraints as well, the primary one being to minimize changes to existing sources. The PPH approach reinterprets a #include directive to read a PPH file instead when a PPH file exists for that header. Otherwise, the header is read and parsed as normal.
While the semantics of PPH are not identical to regular compilation, they are nearly identical for "modular" headers. In PPH, we call a header H modular if it meets the following requirements:
H is guarded against multiple inclusion. That is, the file contains:
    #ifndef SOME_LABEL
    #define SOME_LABEL
    ...
    #endif
H can be pre-processed and parsed in isolation. This means that the file must not depend on symbols/types defined by the environment or another file included before it.
H is always included from the global namespace.
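A minimal sketch of a header meeting these requirements (names purely illustrative):

    // point.h -- guarded, self-contained, and included only from the
    // global namespace.
    #ifndef POINT_H
    #define POINT_H

    #include <cmath>   // everything the header depends on is included here

    struct Point {
      double x;
      double y;
    };

    inline double Length(const Point& p) {
      return std::sqrt(p.x * p.x + p.y * p.y);
    }

    #endif  // POINT_H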
These requirements match common practice for header design, and most headers meet these requirements without change. Indeed, PPH makes the meaning of header files more stable, as the author of the header can be assured that client inclusion environments will not change header semantics.
Significantly, some headers are not modular. In particular, system headers are often very far from modular. These headers must be included textually. As a consequence, two different PPH files may have the same textual header included within them, requiring the parser to merge, or unify, the multiply defined symbols and types contained in the two PPH files. This merge operation is never needed in regular compilations.
Since Pre-Parsed Headers are headers, they must provide the preprocessor semantics of headers. In particular, PPH files export the same set of preprocessor macros that the headers export. Those macro definitions must be replayed when PPH files are read.
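A hypothetical illustration (file and macro names invented):

    // config.h -- defines a macro as part of its interface.
    #ifndef CONFIG_H
    #define CONFIG_H
    #define MAX_WIDGETS 32
    #endif

    // client.cc -- uses that macro after the include, so reading config.h
    // as a PPH file must replay the #define exactly as textual inclusion
    // would have.
    #include "config.h"
    int widget_table[MAX_WIDGETS];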
Because compiling a header binds any preprocessor conditionals or substitutions, PPH files must validate the macro environment in which they were compiled against the macro environment in which they are included. In particular, the PPH file must record the set of identifiers used within its header so that we can ensure that none of those identifiers has a macro definition when the PPH file is included.
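A hypothetical illustration of the validation problem (names invented):

    // widget.h -- pre-parsed in an environment where buffer_size was an
    // ordinary identifier.
    #ifndef WIDGET_H
    #define WIDGET_H
    inline int buffer_size() { return 64; }
    #endif

    // client.cc -- defines a macro over an identifier that widget.h uses,
    // so textual inclusion would have produced different code; the PPH
    // file for widget.h must therefore be rejected here.
    #define buffer_size 128
    #include "widget.h"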
The implementation streams GCC internal data structures to the PPH file. Because of the need to handle two PPH files including the same textual header, symbols in the two PPH files that represent the same C++ symbol must be merged into a single C++ symbol.
The compiler walks the C++ data structures and identifies those symbols that may possibly be merged. These are streamed first with identifying information, so that upon read, the compiler can search for an earlier load of that same symbol. After these potentially mergeable symbols are streamed, the compiler streams the dependent internal data structures.
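The following is a much-simplified, hypothetical sketch of that write order; none of these names correspond to actual GCC internals:

    #include <iostream>
    #include <string>
    #include <vector>

    struct Symbol {
      std::string name;
      std::string scope;
      bool mergeable;    // e.g., came from a textually included system header
      std::string body;  // stands in for the dependent data structures
    };

    void WritePph(std::ostream& out, const std::vector<Symbol>& symbols) {
      // 1. Identifying information for potentially mergeable symbols goes
      //    first, so the reader can search for an earlier load of the same
      //    symbol.
      for (const Symbol& s : symbols)
        if (s.mergeable)
          out << "merge-key " << s.scope << "::" << s.name << "\n";

      // 2. The dependent internal data structures follow.
      for (const Symbol& s : symbols)
        out << "body " << s.scope << "::" << s.name << " " << s.body << "\n";
    }

    int main() {
      std::vector<Symbol> symbols = {{"size_t", "", true, "typedef"},
                                     {"Widget", "app", false, "class"}};
      WritePph(std::cout, symbols);
    }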
PPH implementation status is measured against an internal Google application using a fast build mode. All the metrics referenced in this section correspond to the compilation of the main translation unit for that application.
The initial performance measurements are a qualified success. We have halved effective parse time even with an implementation that has not been tuned. Of the remaining time, nearly half is attributed to hash table lookup for template specializations. That expense is not incurred in regular compilations, and is therefore almost certainly a bug.
Furthermore, in PPH files, close to half of all pointers emitted are null, which means that we could save significant further time by simply not saving or restoring them. Overall, we suspect that with minimal effort, PPH parse time will be less than a quarter of regular parse time.
We find that application headers are predominantly compatible with PPH, thus meeting our goal of requiring few source changes. However, system headers are both often incompatible and very heavily used.
Current functionality measurements are less successful. We are currently able to reuse less than a fifth of the PPH files we create. Of the error messages emitted, three quarters are instances of a single error. That error likely has few root causes, and a particularly likely root cause is the template specialization bug mentioned above. As a further example, all instances of the fourth most common error message come from a single built-in function in a widely used system header.
The primary development concern is that we do not know how many root causes are shadowed by the current root causes.
A significant number of implementation problems arise from the structure of GCC.
GCC has a test on preprocessor state to avoid unnecessarily reading a header file. Unfortunately, this test is not isomorphic to “not multiply included”. In particular, if the guarding symbol is conditionally set, the header may be included multiple times. Such headers require a manual exception to work with our system.
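One shape such a header can take (a sketch only; names invented):

    // tricky.h -- appears to be include-guarded, but the guard is defined
    // only conditionally, so the header may legitimately be processed more
    // than once.
    #ifndef TRICKY_H
    #if defined(TRICKY_ENABLE_GUARD)
    #define TRICKY_H
    #endif

    int tricky_helper();

    #endif  // TRICKY_H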
The parser updates symbols as it parses. At the end of the compilation, when it comes time to write out a PPH file, the data structures are circular. While it is possible to stream circular structures, breaking the circularity at different points can have different effects. In contrast, the C++ source files have already broken that circularity simply because, for the most part, the language requires the programmer to do so.
The state necessary for parsing and some code generation is mixed into a single data structure. Furthermore, the parser often provides code generation information in the data structures as it is parsing. The code generation information is not essential to PPH files, but neither is it easily removed.
The GCC compiler has received significant micro-optimizations towards the goal of making traditional compilation fast. Many of these optimizations interfere with the process of repurposing the compiler. In particular, tokens point to the symbols that they may reference. When the scope changes, the tokens are updated to point to different symbols. As a consequence, tokens are not immutable and streaming the tokens requires extra processing.
The compiler has instances where its representation loses fidelity with C++. For example, on parsing a tagless struct, GCC creates a counter-based unique tag name. It then does a lookup to see if the tag has been defined before. Of course, it has not. Later, GCC pattern matches on the tag name to see if it is one of the names assigned to tagless structs, so that it can infer the struct is tagless. While all this divergence from fidelity works in a normal parse, it does not work with PPH. PPH merges the results of multiple files, all using the same initial counter sequence. Thus, lookups do find a prior definition. The problem is solvable with a different strategy of creating unique tags. However, the problem appeared, required debugging, required a workaround, and occupies space in the PPH file.
Other problems arise from the structure of Google's build system, which is highly parallel. With multi-core systems available as commodity desktop systems, we believe that future build systems will be increasingly parallel. So, while Google's build system is unusual, it is relevant.
The build system creates a graph of actions necessary to build a target. The introduction of PPH will create a deeper action graph, and in the limit, have twice as many nodes. These larger graphs will take longer to compute. Most of those nodes will already be complete, so we have every expectation that using the PPH files will more than pay for the increased graph computation. However, an edit that invalidates a PPH file near the base of the graph will create a deeper serial dependence in compilations, which extends compilation time when concurrency is essentially free.
PPH, as well as many other approaches, requires that we save some form of state and that state needs to be exposed to the build system. This additional state generates additional work in the build system, which contributes to latency.
The essential tradeoff in the build is increased effort in building a PPH file versus decreased effort in using a PPH file. If this additional build overhead is more expensive than the savings provided by the generated compilation state, the approach is a net loss. As yet we do not know the effect of PPH, or other approaches, on the build system. We likely will not know until we have a working system and can experiment.
The current implementation approach requires manually changing the streaming code whenever the data structures change. Thus, PPH, in its present state, represents a burden to continued compiler development in GCC.
The current approach has some fundamental implementation problems. These problems are independent of the compiler chosen for PPH implementation.
The structure of system headers makes many of them unsuitable for PPH. While many of these can be wrapped or multiply included, doing so has the effect of replicating symbols in multiple PPH files. That replication costs both space and time for the very headers that are most often used.
The replication requires that PPH be able to merge identical symbols. Being able to merge symbols costs roughly 1/4 to 1/3 of the PPH load time. The information necessary to merge symbols is slow to create and occupies a significant amount of PPH file space.
More importantly, merging is the primary source of bugs in the implementation. Restructuring of PPH, due to the discovery of yet more information needed for merging, is common.
The primary compile-time benefit of PPH is the ability of the compiler to avoid symbol lookup and type analysis required during header parsing. Unfortunately, PPH does not avoid creating the same number of symbols, along with all of the memory allocation and content recreation implied. In essence, PPH files contain as much information as the original headers, and that information takes time to load into the compiler data structures.
The detailed performance characteristics that started the PPH project have changed in the meantime. While the changes are specific to our benchmark, we believe the trends are general to the industry.
Combining the first four characteristics, the parser is now 60% of compilation time rather than 80%. As a consequence, maximum possible PPH speedup is now 2.5x rather than 5x.
The compiler is now only one tool in a build. Static analysis tools may run in parallel with the compiler. Build times may be limited by other tools. Specifically, our builds now use Clang for C++ diagnostics, and this parallel tool has two effects.
Our builds now do syntax checking with Clang in parallel with doing the regular compile with GCC. The build latency is the maximum of those two components, which means that making GCC faster than Clang will not improve latency over making it the same speed. Our maximum speedup without modifying Clang is now roughly 3x-4x.
A major benefit of PPH is that the build system need only ship the PPH files, not header files from which they were built. However, with Clang syntax checking, we must ship headers through the build system, thus losing that potential performance gain.
Finally, at the inception of PPH, large-scale source changes were unlikely ever to be carried out. However, we now have tools that help us automate large-scale changes, so the constraint of minimal source change seems less necessary.
For this section, we use the term “package” to indicate something like a compiled header. A package could be a PPH file, a full module, or something else.
There are several alternate approaches that might improve the effectiveness of a package feature. However, one must be careful when judging effectiveness to consider several criteria.
Much of the problem with the current PPH implementation is merging. If we were to not support merging, the result would be something like PCH, but with the ability to chain one package off of the end of another. Such an approach would avoid the problem that one change invalidates the entire PCH file. So, packages would be more stable than PCH files.
In our benchmark, uncompressed PPH files are 41% of the size of PCH files and compressed by 7:1 to 51% of compressed PCH files. PPH files would be even smaller in a tuned implementation and without the need to support merging. However, we would need more of them because each package would represent a portion of a full linear context, not just a local context. Aggregate file sizes are unknown.
The generation of PPH files is 24% faster than PCH for our benchmark, and would be even faster without the need to support merging. However, reuse is slower by 31% to 57%. As a mitigating factor, chained package reuse would also require less time than PPH reuse. (E.g., macro validation is not needed.)
The primary problem with this approach is that it would require source changes to our codebase to be effective. We would need a canonical order for commonly used headers. Some headers would need to be aggregated into a single header to avoid explosion of package files.
This approach is most likely to achieve feature status in GCC in the short term, but deployment within Google will likely take longer than deployment of PPH. The reason is that the changes to the source base to exploit chained PCH are not trivial.
The current implementation streams the data structures after the parser has completed their construction. Instead, we could stream the parser actions that create the data structures. Unlike in regular parsing, we would need extra code to accept but ignore duplicate definitions.
With careful restructuring of GCC, these actions could map closely onto an alternate, pluggable implementation of the regular parser's internal API. Doing so would at least reduce the lag between functional modification of the compiler and working streaming, and at best enable automatic streamer generation. The primary problem here would be in identifying those actions and building an API that encapsulates them.
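A hypothetical sketch of such a pluggable action interface (this is not GCC's actual internal API; all names are invented):

    #include <iostream>
    #include <string>

    class ParserActions {
     public:
      virtual ~ParserActions() = default;
      virtual void DefineMacro(const std::string& name,
                               const std::string& value) = 0;
      virtual void DeclareFunction(const std::string& signature) = 0;
    };

    // A PPH writer would install an implementation that records each action
    // to a stream; a PPH reader would replay the recorded actions, accepting
    // but ignoring duplicate definitions.
    class RecordingActions : public ParserActions {
     public:
      void DefineMacro(const std::string& name,
                       const std::string& value) override {
        std::cout << "action define-macro " << name << "=" << value << "\n";
      }
      void DeclareFunction(const std::string& signature) override {
        std::cout << "action declare-function " << signature << "\n";
      }
    };

    int main() {
      RecordingActions actions;
      actions.DefineMacro("POINT_H", "1");
      actions.DeclareFunction("double Length(const Point&)");
    }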
The primary problem with the current PPH implementation is that it needs to merge symbols. We can avoid that problem if we require each symbol to be defined in at most one package. Given the existing structure of system headers, such an approach would require defining and implementing a new set of system packages. Users would then be required to use these new packages in place of existing #includes.
PPH essentially captures parser state. Any inline functions in a PPH file are not re-parsed on use, but they are compiled on each use. We can shift work into package compilation by representing inline functions in the optimizer's internal form.
This shift is reasonable with GCC because its front end does not do inlining. Other compilers may require more care.
One of the problems with the current approach is that including a PPH file entails including all of the PPH files that were used in its construction. We must do so because all types in the chain of PPH files are implicitly exported.
If instead the package exported only its interface, then each package would need to contain only enough information to compile and reuse that interface. Within the package, all non-exported types and variables would be lowered to the code generator level, with enough information to enable correct and optimizable code, but without the extra C++-level information. (Effectively, packages would meet ABI constraints in their representation of non-exported symbols, but not API constraints.) Such an approach would avoid nested module reads and substantially reduce the bulk of information used. It would also enforce include-what-you-use.
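As a sketch only, later module syntax makes the idea concrete (this syntax post-dates the paper, and all names are invented):

    // widget.cppm -- a module interface unit illustrating export control.
    export module widget;

    // Not exported: clients never see this type at the C++ level, so the
    // package need only carry enough ABI-level information about it to
    // generate correct, optimizable code.
    struct WidgetImpl {
      int refcount = 0;
    };

    // Exported interface: the only declarations a client package must load.
    export class Widget {
     public:
      Widget();
      int size() const;

     private:
      WidgetImpl* impl_ = nullptr;
    };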
The problem needs a solution. The C++ header model is becoming a serious drag on productivity. Our free ride on sequential processor improvements is over, and we need a more scalable solution. All alternate approaches should be carefully considered, even if they require non-trivial changes to the source base.
Automated source conversion changes compatibility. In the past, compatibility meant compiling the same code in the same way. With automated source code modification tools, compatibility means identifying an automatable transformation from old code to new code.
Modules should not ignore the build system. The detailed semantics of modules can have a significant impact on the performance and complexity of building large systems. Indeed, we are struggling now with the consequences of textual inclusion.
PPH can deliver significant, but not revolutionary, compile-time improvements. We anticipate an overall doubling of compile-time performance. However, we are unlikely to get order-of-magnitude improvements.
Export control will yield better performance. The primary limit to the compile-time performance of PPH is that it must load the transitive closure of symbols in a PPH file. By limiting the set of symbols exported from a module, the amount of information imported can be substantially reduced.
Merging symbols is technically difficult. The problems are not theoretically difficult, but have proven to be so in practice. This difficulty will be reflected in slower implementations. Furthermore, our implementation problems have been largely the result of merging. At any one time, we have had few known bugs. However, fixing one bug has moved the frontier to yet another bug. It is difficult to anticipate when we will cross the threshold to general functionality. If we were not merging symbols, we believe we would have already deployed a successful PPH.
One module per symbol will likely yield better systems. Given the merge performance issues and bugs discussed above, we believe that a module system will be more performant and more reliable if it prohibits multiple definitions.
System headers must be redefined as modules. Application headers are generally well-structured, and their conversion to a module system will likely not be too painful. Unfortunately, current system headers are often highly tied to the textual inclusion model. Any attempt to reuse them unmodified in a module system will constitute a significant burden to both implementers and users.