[Tooling] Dependency information for module-aware build tools

Ben Boeckel ben.boeckel at kitware.com
Mon Mar 4 23:57:53 CET 2019


Hi,

For CMake support for C++ modules, I've patched GCC so it outputs
dependency information in a JSON format. Before going too far down this
road, I'd like to get feedback on the format. The goal is to implement
D1483R1[1] without requiring build tools to contain a C++ parser;
instead, the compiler performs the "scan" step described there.

    {                               //
    "outputs": [                    // Files to be output for this
    "source.o"                      // compilation[2].
    ],                              //
    "provides": [                   // BMI files provided by this
    "I.gcm"                         // compilation.
    ],                              //
    "logical-provides": {           // Mapping of module names provided
    "I": "I.gcm"                    // to provided BMI files.
    },                              //
    "requires": [                   // Module names required by this
    "M"                             // compilation.
    ],                              //
    "depends": [                    // Preprocessor dependency files
    "../path/to/source.cpp",        // which affect this scan (so it can
    "/usr/include/stdc-predef.h"    // be rerun if necessary).
    ],                              //
    "version": 0,                   // The file format version.
    "revision": 1                   // The file format revision.
    }                               //

This example output is for a file with the contents:

    export module I;
    import M;

    export int i() {
        return m();
    }
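
To illustrate how a build tool might consume this output, here is a
minimal Python sketch that orders compilations from a set of scan
results. The second entry mirrors the example above; the scan result for
`m.cpp` (the provider of module M), the `SCANS` table, and the
`build_order` helper are assumptions made up for illustration, not part
of any patch. It relies on `graphlib`, which needs Python 3.9 or newer.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical scan results keyed by source file. Only "source.cpp"
# comes from the example above; "m.cpp" is an assumed provider of M.
SCANS = {
    "m.cpp": json.loads("""
        {"outputs": ["m.o"], "provides": ["M.gcm"],
         "logical-provides": {"M": "M.gcm"}, "requires": [],
         "depends": ["m.cpp"], "version": 0, "revision": 1}
    """),
    "source.cpp": json.loads("""
        {"outputs": ["source.o"], "provides": ["I.gcm"],
         "logical-provides": {"I": "I.gcm"}, "requires": ["M"],
         "depends": ["../path/to/source.cpp", "/usr/include/stdc-predef.h"],
         "version": 0, "revision": 1}
    """),
}


def build_order(scans):
    """Return source files ordered so that providers come first."""
    # Map each module name to the source whose compilation provides it.
    provider = {}
    for source, info in scans.items():
        for module in info["logical-provides"]:
            provider[module] = source

    # Each source depends on the providers of the modules it requires.
    graph = {}
    for source, info in scans.items():
        missing = [m for m in info["requires"] if m not in provider]
        if missing:
            raise LookupError(f"{source}: no provider for {missing}")
        graph[source] = {provider[m] for m in info["requires"]}

    # TopologicalSorter takes a node -> predecessors mapping.
    return list(TopologicalSorter(graph).static_order())


print(build_order(SCANS))
```

Here `m.cpp` is ordered before `source.cpp`, because `source.cpp`
requires M and `m.cpp` is the compilation that provides it.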

My existing patch to GCC is currently missing `revision` and uses
`version` == 1 (but my CMake patches also don't check the field right
now). I'd like to get a Clang patch written up in the next few weeks.

Points to note:

  - The top-level types are fixed as listed here, and the key names are
    never localized:
    * `outputs`: array
    * `provides`: array
    * `logical-provides`: object
    * `requires`: array
    * `depends`: array
    * `version`: int
    * `revision`: int
  - Values are strings if the name or path is valid UTF-8. The keys of
    `logical-provides` must be strings, so the entries of `requires`
    must also be strings: they are the lookup keys used to find the
    on-disk file for the matching `logical-provides` entry (this
    shouldn't be an issue since these are module names).
  - In the case of invalid UTF-8, an object is used with the following
    layout (all data here is literal and not localized):

    {                               //
    "format": "...",                // The format of the data.
    "data": [...]                   // Array of integers interpreted as
    }                               // the appropriate integer size.

  - Relative paths are relative to the working directory of the
    compiler. Build tools may need to rewrite these paths into a form
    they can actually use.[3]
  - `version` is bumped if there is any semantic data added (e.g., more
    information which is required to get a correct build), types change,
    etc.
  - `revision` is bumped if an additional field is added to the format
    that is helpful but not semantically important.
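
The points above could be checked on the consumer side roughly as
follows. This is only a sketch: `SUPPORTED_VERSION`, `check_dep_info`,
and `resolve` are hypothetical names, treating a missing key as an error
is an assumption, and `resolve` assumes POSIX-style paths that are plain
strings (not the object form used for invalid UTF-8).

```python
import posixpath

# Expected Python type for each top-level key, following the list above.
TOP_LEVEL_TYPES = {
    "outputs": list,
    "provides": list,
    "logical-provides": dict,
    "requires": list,
    "depends": list,
    "version": int,
    "revision": int,
}

# Assumption: this tool only understands version 0 of the format.
SUPPORTED_VERSION = 0


def has_type(value, expected):
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, expected) and not isinstance(value, bool)


def check_dep_info(info):
    """Return a list of problems; an empty list means the data is usable."""
    problems = []
    for key, expected in TOP_LEVEL_TYPES.items():
        if key not in info:
            problems.append(f"missing key: {key}")
        elif not has_type(info[key], expected):
            problems.append(f"wrong type for {key}")

    # Module names are lookup keys, so they must be plain strings.
    requires = info.get("requires")
    if isinstance(requires, list) and not all(
        isinstance(m, str) for m in requires
    ):
        problems.append("requires entries must be strings")

    # A version change is semantic, so refuse anything unknown. Any
    # revision is accepted, since revisions only add optional fields.
    version = info.get("version")
    if has_type(version, int) and version != SUPPORTED_VERSION:
        problems.append(f"unsupported version: {version}")
    return problems


def resolve(path, compiler_cwd):
    """Rewrite a path relative to the compiler's working directory."""
    # join() keeps `path` unchanged when it is already absolute.
    return posixpath.normpath(posixpath.join(compiler_cwd, path))
```

Unknown extra keys are deliberately ignored here, which matches the idea
that a `revision` bump should not break existing consumers.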

Defined formats (I'm fine with bikeshedding these names once the overall
format has been hammered out):

  - "raw8": interpret `data` as an array of uint8_t values to be passed
    to platform-specific filesystem APIs as an 8-bit encoding
  - "raw16": interpret `data` as an array of uint16_t values to be passed
    to platform-specific filesystem APIs as a 16-bit encoding
  - "raw32": interpret `data` as an array of uint32_t values to be passed
    to platform-specific filesystem APIs as a 32-bit encoding

This basically means "check if it is UTF-8, if it is, escape `\` and `"`
and output that, otherwise indicate the byte size of the data and write
it as an integer array".
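
A rough Python sketch of that rule, together with the matching decode
step, might look like the following. The names `encode_name` and
`decode_name` are hypothetical, the writer side only ever emits `raw8`,
and packing `raw16`/`raw32` in native byte order is an assumption since
the format description does not specify one.

```python
import json
import struct

# struct type codes for the format names defined above.
FORMATS = {"raw8": "B", "raw16": "H", "raw32": "I"}


def encode_name(raw: bytes):
    """Return a str for valid UTF-8, otherwise a raw8 object."""
    try:
        # The JSON serializer takes care of escaping `\` and `"`.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return {"format": "raw8", "data": list(raw)}


def decode_name(value) -> bytes:
    """Turn a string or a {"format", "data"} object back into bytes."""
    if isinstance(value, str):
        return value.encode("utf-8")
    code = FORMATS[value["format"]]  # KeyError for an unknown format
    data = value["data"]
    # "=" selects native byte order with standard sizes (assumption).
    return struct.pack(f"={len(data)}{code}", *data)


# Round trip through JSON text for a name that is not valid UTF-8.
text = json.dumps(encode_name(b"caf\xe9.cpp"))
assert decode_name(json.loads(text)) == b"caf\xe9.cpp"
```

A lookup table keyed by format name keeps the decoder easy to extend if
more formats are added later.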

In the future, we can add additional formats as a revision bump.

So,

  - Is anything missing from this format?
  - Is there any issue with getting this information from compilers?
  - Are any of the constraints too onerous on compilers?
  - Are there any constraints which should be added to make it even
    easier for build tools to parse/interpret this format?
  - For non-UTF-8 data, do we want to default to the `raw8` format when
    none is specified? Or should `format` always be required?
  - Are non-UTF-8 module names valid? Does anyone know what SG16 is
    saying about Unicode identifiers (which I presume would affect
    module names as well)?

Thanks,

--Ben

[1]https://mathstuf.fedorapeople.org/fortran-modules/fortran-modules.html
[2]Note that some flags to GCC can cause it to output multiple files for
a compilation step (such as -gsplit-dwarf). It is my hope that such
flags can be wired up to this facility in the future as I don't think it
is done right now.
[3]Additionally, CMake takes the `.gcm` files and places them elsewhere in
the tree via GCC's module map files. Object files are also placed via
`-o` which is otherwise occupied during the scan step at the moment, so
their paths may also need to be reinterpreted at the build tool level.

