[Tooling] Dependency information for module-aware build tools

Olga Arkhipova olgaark at microsoft.com
Tue Mar 5 01:11:39 CET 2019


Hi Ben,
Why did you choose to have
    "logical-provides": {           // Mapping of module names provided
    "I": "I.gcm"                    // to provided BMI files.

instead of just the name of the module that this source exports?

Thanks,
Olga

-----Original Message-----
From: tooling-bounces at open-std.org <tooling-bounces at open-std.org> On Behalf Of Ben Boeckel
Sent: Monday, March 4, 2019 2:58 PM
To: modules at lists.isocpp.org; tooling at open-std.org
Cc: brad.king at kitware.com
Subject: [Tooling] Dependency information for module-aware build tools

Hi,

For CMake support for C++ modules, I've patched GCC so it outputs dependency information in a JSON format. Before going too far down this road, I'd like to get feedback on the format. The goal is to implement D1483R1[1] without requiring build tools to implement a C++ parser, by having the compiler perform the "scan" step described there.

    {                               //
    "outputs": [                    // Files to be output for this
    "source.o"                      // compilation[2].
    ],                              //
    "provides": [                   // BMI files provided by this
    "I.gcm"                         // compilation.
    ],                              //
    "logical-provides": {           // Mapping of module names provided
    "I": "I.gcm"                    // to provided BMI files.
    },                              //
    "requires": [                   // Module names required by this
    "M"                             // compilation.
    ],                              //
    "depends": [                    // Preprocessor dependency files
    "../path/to/source.cpp",        // which affect this scan (so it can
    "/usr/include/stdc-predef.h"    // be rerun if necessary).
    ],                              //
    "version": 0,                   // The file format version.
    "revision": 1                   // The file format revision.
    }                               //

This example output is for a file with the contents:

    export module I;
    import M;

    export int i() {
        return m();
    }

My existing patch to GCC is currently missing `revision` and uses `version` == 1 (but my CMake patches also don't check the field right now). I'd like to get a Clang patch written up in the next few weeks.
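
As an illustration of how a build tool might consume output like the example above, here is a minimal Python sketch that collates several scan results, maps each required module name to its BMI through `logical-provides`, and derives a compile order. The function name `order_compilations`, the in-memory dict layout (the JSON with comments stripped), and the second scan result for module `M` are assumptions made for the example, not part of the proposal. It also assumes every value is a plain string and that the first `outputs` entry identifies the compilation.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def order_compilations(scans):
    """Return (order, needed_bmis): object files in a valid build order and,
    for each object file, the BMI files its imports resolve to."""
    # Module name -> (object file whose compilation provides it, BMI path).
    providers = {}
    for scan in scans:
        obj = scan["outputs"][0]
        for module, bmi in scan.get("logical-provides", {}).items():
            providers[module] = (obj, bmi)

    graph = {}        # object file -> object files that must be built first
    needed_bmis = {}  # object file -> BMI files it reads
    for scan in scans:
        obj = scan["outputs"][0]
        deps, bmis = set(), []
        for module in scan.get("requires", []):
            if module not in providers:
                raise KeyError(f"no scanned source provides module {module!r}")
            producer, bmi = providers[module]
            deps.add(producer)
            bmis.append(bmi)
        graph[obj] = deps
        needed_bmis[obj] = bmis

    # static_order() yields each node after all of its predecessors.
    return list(TopologicalSorter(graph).static_order()), needed_bmis


scans = [
    {"outputs": ["source.o"], "provides": ["I.gcm"],
     "logical-provides": {"I": "I.gcm"}, "requires": ["M"]},
    # Hypothetical scan result for the source that exports module M.
    {"outputs": ["m.o"], "provides": ["M.gcm"],
     "logical-provides": {"M": "M.gcm"}, "requires": []},
]
order, needed_bmis = order_compilations(scans)
```

Here `order` places `m.o` before `source.o`, and `needed_bmis["source.o"]` lists `M.gcm`, which is the lookup that `logical-provides` makes possible without parsing any C++.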

Points to note:

  - The top-level types are fixed as listed here, and the key names are
    never localized:
    * `outputs`: array
    * `provides`: array
    * `logical-provides`: object
    * `requires`: array
    * `depends`: array
    * `version`: int
    * `revision`: int
  - Values are strings if the name or path is valid UTF-8. The keys of
    `logical-provides` are JSON object keys and so must be strings;
    entries in `requires` must therefore also be strings, since they are
    used as lookup keys to find the on-disk file for the provided module
    (this shouldn't be an issue since these are module names).
  - In the case of invalid UTF-8, an object is used with the following
    layout (all data here is literal and not localized):

    {                               //
    "format": "...",                // The format of the data.
    "data": [...]                   // Array of integers interpreted as
    }                               // the appropriate integer size.

  - Relative paths are relative to the working directory of the
    compiler. Build tools may need to rewrite these paths into a form
    they can actually use.[3]
  - `version` is bumped if there is any semantic data added (e.g., more
    information which is required to get a correct build), types change,
    etc.
  - `revision` is bumped if an additional field that is helpful, but not
    semantically important, is added to the format.
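
To make the type and versioning rules concrete, here is a minimal Python sketch of how a consumer might check a scan result. The names `load_scan` and `SUPPORTED_VERSION` are hypothetical, and two behaviours are assumptions rather than anything stated in the proposal: absent keys are tolerated, and a newer `revision` or an unknown extra key is accepted silently while a different `version` is rejected.

```python
import json

SUPPORTED_VERSION = 0  # Assumption: the only version this tool understands.

# Top-level keys and their JSON types, as listed in the points above.
TOP_LEVEL_TYPES = {
    "outputs": (list, "array"),
    "provides": (list, "array"),
    "logical-provides": (dict, "object"),
    "requires": (list, "array"),
    "depends": (list, "array"),
    "version": (int, "int"),
    "revision": (int, "int"),
}


def load_scan(text):
    """Parse scan output, check top-level types, and check the version."""
    scan = json.loads(text)
    if not isinstance(scan, dict):
        raise TypeError("top level must be a JSON object")
    for key, (py_type, json_name) in TOP_LEVEL_TYPES.items():
        if key not in scan:
            continue  # Assumption: absent keys are tolerated.
        value = scan[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, py_type):
            raise TypeError(f"{key!r} must be of type {json_name}")
    # A version change means semantics changed, so refuse to guess.
    if scan.get("version") != SUPPORTED_VERSION:
        raise ValueError(f"unsupported version: {scan.get('version')!r}")
    # A newer revision only adds non-essential fields, so it is accepted.
    return scan
```

Note that the example output in this message contains `//` comments, which `json.loads` would reject; the sketch assumes real output is plain JSON.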

Defined formats (I'm fine with bikeshedding these names once the overall format has been hammered out):

  - "raw8": interpret `data` as an array of uint8_t bytes to be passed
    to platform-specific filesystem APIs as an 8-bit encoding
  - "raw16": interpret `data` as an array of uint16_t values to be passed
    to platform-specific filesystem APIs as a 16-bit encoding
  - "raw32": interpret `data` as an array of uint32_t values to be passed
    to platform-specific filesystem APIs as a 32-bit encoding
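
A reader of the format could turn these values back into something usable roughly as follows. This is a sketch under assumptions: the helper name `decode_value` is hypothetical, and packing the integers in native byte order is a guess at what "passed to platform-specific filesystem APIs" would mean in practice.

```python
import struct

# struct type codes for the element width of each defined format.
STRUCT_CODE = {"raw8": "B", "raw16": "H", "raw32": "I"}


def decode_value(value):
    """Return a str for UTF-8 values, or bytes for raw-format objects."""
    if isinstance(value, str):
        return value
    code = STRUCT_CODE[value["format"]]  # KeyError for unknown formats.
    data = value["data"]
    # "=" selects native byte order with standard sizes (1, 2, 4 bytes).
    # struct.error is raised if an integer does not fit the element width.
    return struct.pack(f"={len(data)}{code}", *data)
```

For instance, `decode_value({"format": "raw8", "data": [102, 111, 111, 255]})` yields the four bytes `foo\xff`, which is not valid UTF-8 and so could not have been written as a JSON string.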

This basically means "check if it is UTF-8, if it is, escape `\` and `"` and output that, otherwise indicate the byte size of the data and write it as an integer array".
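
The writer side of that rule might look like the following Python sketch. It assumes the name or path arrives as an 8-bit byte string, so only `raw8` is ever produced; the helper name `encode_value` is hypothetical. Escaping is left to `json.dumps`, which handles `\` and `"` as well as the control characters that JSON also requires to be escaped.

```python
import json


def encode_value(raw: bytes):
    """Return a str if `raw` is valid UTF-8, else a raw8 object."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: record the element width and the integer values.
        return {"format": "raw8", "data": list(raw)}


# A valid UTF-8 path containing characters that need escaping.
escaped = json.dumps(encode_value(b'dir\\my "src".cpp'))
# A Latin-1 encoded name, which is not valid UTF-8.
fallback = json.dumps(encode_value(b"caf\xe9.cpp"))
```

A 16-bit or 32-bit platform encoding would need an analogous check and a `raw16` or `raw32` object, which this sketch does not attempt.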

In the future, we can add additional formats as a revision bump.

So,

  - Is anything missing from this format?
  - Is there any issue with getting this information from compilers?
  - Are any of the constraints too onerous on compilers?
  - Are there any constraints which should be added to make it even
    easier for build tools to parse/interpret this format?
  - For non-UTF-8 data, do we want to default to the `raw8` format when
    none is specified? Or should a format always be required?
  - Are non-UTF-8 module names valid? Does anyone know what SG16 is
    saying about Unicode identifiers (which I presume would affect
    module names as well)?

Thanks,

--Ben

[1]https://mathstuf.fedorapeople.org/fortran-modules/fortran-modules.html
[2]Note that some flags to GCC can cause it to output multiple files for a compilation step (such as -gsplit-dwarf). It is my hope that such flags can be wired up to this facility in the future, as I don't think that is done right now.
[3]Additionally, CMake takes the `.gcm` files and places them elsewhere in the tree via GCC's module map files. Object files are also placed via `-o`, which is otherwise occupied during the scan step at the moment, so their paths may also need to be reinterpreted at the build tool level.