Document number

ISO/IEC/JTC1/SC22/WG21/P1689R2

Date

2020-01-13

Reply-to

Ben Boeckel, Brad King, ben.boeckel@kitware.com, brad.king@kitware.com

Audience

EWG (Evolution), SG15 (Tooling)

1. Abstract

When building C++ source code, build tools need to discover dependencies of source files based on their contents. This occurs during the build so that file contents can change without build tools having to reinitialize and so that the dependencies of source files generated during the build are correct. With the advent of modules in [P1103R3], there are now ordering requirements among compilation rules. These also need to be discovered during the build. This paper specifies a format for communicating this information to build tools.

2. Changes

2.1. R2 (pre-Prague)

Added:

  • more background information (motivation and assumptions)

  • validity of source entries depends on uniqueness in the outputs, not the inputs.

  • input is now an array, inputs. This is to support representation of unity builds where multiple input sources are compiled at once into a single set of outputs.

  • sources is renamed to rules. There is not necessarily a 1-to-1 correspondence between source files and compilation rules.

  • full example output for a C++ source

  • uniformly use "property" instead of "key" for JSON fields.

2.2. R1 (post-Cologne)

The following changes have been made in response to feedback in SG15:

  • rename keys to be more "noun-like" or for clarity including:

    • readable → readable-name

    • logical → logical-name

    • filepath → compiled-module-path

  • remove future-link (no known use case)

  • remove %-encoding for filepaths

  • remove top-level extensions key (still possible, just use _ keys)

  • require vendor prefixes for extensions

  • add an optional source-path key to depinfo objects

The following changes have been made in response to feedback in SG16:

  • change the name of the "data" key to "code-units"

  • mention normalization for JSON encoders and decoders

2.3. R0 (Initial)

Description of the format and its semantics.

3. Introduction

This paper describes a format designed primarily to communicate dependencies of source files during the building of C++ source code. While other uses may exist, the primary goal is correct compilation of C++ sources. The tool which generates this format is referred to as a "dependency scanning tool" in this paper.

The contents of this format include:

  • the dependencies of running the dependency scanning tool itself;

  • the resources that will be required to exist when the scanned translation unit is compiled; and

  • the resources that will be provided when the scanned translation unit is compiled.

This information is sufficient to allow a build tool to order compilation rules to get a valid build in the presence of C++ modules.

4. Motivation

Before C++ modules, the only file dependencies that a build system needed to care about could be determined during the execution of the compilation rule itself. This is because each compilation rule was independent of every other compilation rule. However, with modules, compilation rules can now depend on each other and must be executed in order. Because of this new ordering requirement, build tools need to be able to extract this information from source files before compiling them.

Incidentally, this is exactly analogous to the problem that Fortran build systems have with Fortran modules. To that end, this format is explicitly not specific to C++ and is intended for use within the Fortran ecosystem in the future. Terminology specific to C++ is avoided in this format so that nothing in it suggests that it is C++-specific.

4.1. Why Makefile snippets don’t work

Historically, dependency information of a build rule has been handled by Makefile snippets. An example of this is:

output: input_a input_b input_c

This states that the artifact of the build rule is output and files input_a, input_b, and input_c were read during its creation. This allows the build system to know that if any of the listed input_* files changes, the rule for output needs to be brought up-to-date as well.

This works decently well for the kinds of dependencies that have occurred in C++ to date, namely header includes. This is because these dependencies can be discovered while executing the rule associated with output.

The issue that arises with the Makefile design is that modules are a new kind of dependency that cannot be represented in declarative Makefile syntax. For example, GCC outputs variable modifications (CXX_MODULES+=…) into these snippets which is commonly not supported by the consuming tools. In addition, because these dependencies must be discovered before the compilation rule is executed, there would need to be one rule that writes dependency information for another.

As an example of the restrictions placed on these Makefile snippets, the ninja [ninja] build tool requires that output be the same for the rule which wrote out the dependency snippet and that no other outputs are mentioned. No other Makefile syntax is supported (variables, adding rules, special variables, macro expansions, etc.). This is because ninja is reading these for just the dependency information.
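To make the restricted subset concrete, the following is a minimal Python sketch of a reader that accepts only a single `output: dependencies` rule and rejects everything else. It is an illustration of the kind of restriction described above, not ninja's actual parser; the function name parse_depfile_rule is hypothetical, and the sketch assumes paths contain neither spaces nor colons.

```python
def parse_depfile_rule(text):
    """Parse one `output: dep dep ...` rule and reject any other syntax.

    Assumption for this sketch: paths contain no spaces and no colons.
    """
    # Join backslash-newline continuations into one logical line.
    joined = text.replace("\\\n", " ")
    lines = [line for line in joined.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError("expected exactly one rule")
    rule = lines[0]
    # Variables, assignments, and macro expansions are not dependency data.
    if "$" in rule or "=" in rule:
        raise ValueError("variables and assignments are not supported")
    target, separator, dependencies = rule.partition(":")
    if not separator:
        raise ValueError("missing ':' separator")
    targets = target.split()
    if len(targets) != 1:
        raise ValueError("expected exactly one output")
    return targets[0], dependencies.split()
```

A snippet containing a variable modification such as CXX_MODULES+=… is rejected by this reader, which is why such output cannot carry module information to tools of this kind.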

5. Assumptions

This format assumes the following about the environment in which it is used:

  • the environment is uniform between creation and usage;

  • a file of this format is only used within one build of a project; and

  • a file of this format does not apply to different configurations of a build (since dependencies may vary with the target platform or build settings such as whether it is a debug build or not).

It is generally assumed that the environment in which a file of this format is created is the same as the environments in which it will be read and ultimately used during the actual compilation. However, build systems may have different strategies for executing rules and when this is the case, it is assumed that the build system itself knows how to translate between the environments it sets up for each rule. For example, a build system which distributes the builds across multiple machines (whether over a network or using containerization) should know how to translate between the environment set up for one execution and another execution.

Environments can have many knobs which change fundamental behaviors of the system. A non-exhaustive list includes:

  • mount layout (particularly of the input and output absolute paths)

  • encoding (active code page, locale)

  • effective permissions (process user and group, security modules, anti-virus)

The first two can be translated between different rules in a straightforward way. For example, if one rule is executed in a /chroot/exec1 prefix while another is under /chroot/exec2, it is assumed that the build system constructed those environments and knows that paths underneath those prefixes should be rerooted for another execution rule to get its paths correct. Encoding differences can be bridged using either system APIs or libraries which handle encodings. If there are permission differences between the scanner and the compiler, it is hard to imagine how a build tool would be able to translate the file effectively.
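The rerooting step for mount layouts might look like the following Python sketch. The function name reroot is hypothetical, and the sketch assumes POSIX-style paths that are representable as strings; a real build tool would apply its own knowledge of the environments it set up.

```python
from pathlib import PurePosixPath


def reroot(path, old_prefix, new_prefix):
    """Move `path` from under `old_prefix` to under `new_prefix`.

    Paths outside of `old_prefix` are returned unchanged.
    """
    candidate = PurePosixPath(path)
    try:
        # relative_to compares whole path components, so /chroot/exec10
        # is not treated as being underneath /chroot/exec1.
        relative = candidate.relative_to(old_prefix)
    except ValueError:
        return str(candidate)
    return str(PurePosixPath(new_prefix) / relative)
```

For example, reroot("/chroot/exec1/src/a.cpp", "/chroot/exec1", "/chroot/exec2") gives the path as seen from the second environment.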

6. Format

The format uses JSON [ECMA-404] as a base for encoding its information. This is suitable because it is structured (versus a plain-text format), parsers for JSON are readily available (versus candidates with a custom structural format), and the format is simple to implement (versus candidates such as YAML or TOML) which will allow for easy adoption.

JSON specifies that documents are Unicode. However, due to the way filepaths are represented in this format, it is further constrained to be a valid UTF-8 sequence.

6.1. Schema

For the information provided by the format, the following JSON Schema [JSON-Schema] may be used.

JSON Schema for the format

{
  "$schema": "",
  "$id": "http://example.com/root.json",
  "type": "object",
  "title": "SG15 TR depformat",
  "definitions": {
    "datablock": {
      "$id": "#datablock",
      "type": [
        "object",
        "string"
      ],
      "description": "A binary sequence. See associated prose for interpretation",
      "minLength": 1,
      "required": [
        "format",
        "code-units"
      ],
      "properties": {
        "format": {
          "$id": "#format",
          "enum": ["raw8", "raw16"],
          "description": "Storage size of code-units' integers"
        },
        "code-units": {
          "$id": "#code-units",
          "type": "array",
          "description": "Integer representation of binary values",
          "minItems": 1,
          "items": {
            "type": "integer",
            "minimum": 1
          }
        },
        "readable-name": {
          "$id": "#readable-name",
          "type": "string",
          "description": "Readable version of the sequence (purely for human consumption; no semantic meaning)",
          "minLength": 1
        }
      }
    },
    "depinfo": {
      "$id": "#depinfo",
      "type": "object",
      "description": "Dependency information for a compilation rule",
      "required": [
        "inputs"
      ],
      "properties": {
        "inputs": {
          "$id": "#inputs",
          "type": "array",
          "description": "Files that were read by this execution",
          "uniqueItems": true,
          "minItems": 1,
          "items": {
            "$ref": "#/definitions/datablock"
          }
        },
        "outputs": {
          "$id": "#outputs",
          "type": "array",
          "description": "Files that will be output by this execution",
          "uniqueItems": true,
          "items": {
            "$ref": "#/definitions/datablock"
          }
        },
        "depends": {
          "$id": "#depends",
          "type": "array",
          "description": "Paths read during this execution",
          "uniqueItems": true,
          "items": {
            "$ref": "#/definitions/datablock"
          }
        },
        "future-compile": {
          "$ref": "#/definitions/future-depinfo"
        }
      }
    },
    "future-depinfo": {
      "$id": "#future-depinfo",
      "type": "object",
      "properties": {
        "outputs": {
          "$id": "#outputs",
          "type": "array",
          "description": "Files output by a future rule for this source using the same flags",
          "uniqueItems": true,
          "items": {
            "$ref": "#/definitions/datablock"
          }
        },
        "provides": {
          "$id": "#provides",
          "type": "array",
          "description": "Modules provided by a future compile rule for this source using the same flags",
          "uniqueItems": true,
          "items": {
            "$ref": "#/definitions/module-desc"
          }
        },
        "requires": {
          "$id": "#requires",
          "type": "array",
          "description": "Modules required by a future compile rule for this source using the same flags",
          "uniqueItems": true,
          "items": {
            "$ref": "#/definitions/module-desc"
          }
        }
      }
    },
    "module-desc": {
      "$id": "#module-desc",
      "type": "object",
      "required": [
        "logical-name"
      ],
      "properties": {
        "source-path": {
          "$ref": "#/definitions/datablock"
        },
        "compiled-module-path": {
          "$ref": "#/definitions/datablock"
        },
        "logical-name": {
          "$ref": "#/definitions/datablock"
        }
      }
    }
  },
  "required": [
    "version",
    "work-directory",
    "rules"
  ],
  "properties": {
    "version": {
      "$id": "#version",
      "type": "integer",
      "description": "The version of the output specification"
    },
    "revision": {
      "$id": "#revision",
      "type": "integer",
      "description": "The revision of the output specification",
      "default": 0
    },
    "work-directory": {
      "$ref": "#/definitions/datablock"
    },
    "rules": {
      "$id": "#rules",
      "type": "array",
      "title": "rules",
      "minItems": 1,
      "items": {
        "$ref": "#/definitions/depinfo"
      }
    }
  }
}

6.2. Storing binary data

This format uses UTF-8 as a communication channel between a dependency scanning tool and a build tool, but filepath encodings are specific to the platform which means considerations for paths containing non-UTF-8 sequences must be made. However, the most common uses of paths and filenames are either valid UTF-8 sequences or may be unambiguously represented using UTF-8 (e.g., a platform using UTF-16 for its path APIs has a valid UTF-8 encoding), so requiring excessive obfuscation in all cases is unnecessary.

In order to store a non-UTF-8 sequence losslessly, there must be a way to encode the non-UTF-8 sequence into this format. Multiple approaches have been used in the past to store binary data in JSON, including Base64 (as well as other related encodings such as Base85 or Base91), integer arrays, and going so far as to convert the entire file format over to binary (e.g., [BSON], [UBJSON], etc.). These encodings do not handle sequences of 16-bit data well since endianness information is not stored in them. They are also overly pessimistic about the common case of valid UTF-8 paths, so this encoding scheme uses UTF-8 wherever possible and drops down to a less efficient encoding only when necessary.

Note that some JSON encoders and decoders will normalize Unicode sequences. Due to the presence of platforms where non-normalized sequences are valid paths, any such normalization logic should be disabled when interacting with this format.

The most general format for storing data is to use an array of integers tagged with the size of the values in memory. This is done by using an object with two required properties: code-units, storing the unsigned integers representing the raw data, and format, describing the bit size of the integers in memory. Supported formats are raw8 and raw16. Other formats are ill-formed. There is an optional readable-name property which contains a string for communicating the contents in a human-readable format using UTF-8. The value of the readable-name property is purely informational and has no normative effect on the interpretation of the format.

  • raw8 indicates that the integers of the code-units array are 8-bit unsigned integers. All values of the code-units array are required to be integers in the range of 1 to 255, inclusive.

Example raw8-encoded filepath
{
  "format": "raw8",
  "code-units": [112, 97, 197, 163, 104, 45, 116, 111, 45, 102, 105, 108, 195, 171],
  "readable-name": "paţh-to-filë"
}
  • raw16 indicates that the integers of the code-units array are 16-bit unsigned integers. All values of the code-units array are required to be integers in the range of 1 to 65535, inclusive.

Example raw16-encoded filepath
{
  "format": "raw16",
  "code-units": [112, 97, 355, 104, 45, 116, 111, 45, 102, 105, 108, 235],
  "readable-name": "paţh-to-filë"
}

Requirements for passing data to the platform’s APIs such as a terminating ASCII NUL byte or endianness are not included in the format. Using integer values outside of the range specified for the format is ill-formed.

Example filepaths represented as UTF-8 strings
[
  "paţh-to-filë",
  "path-to-file-ascii"
]

When a path can be communicated as a valid UTF-8 string, it should be, but this is not required. That is, all fields which may contain binary data in the format are allowed to be unconditionally encoded using the most general format.
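The encoding rules in this section can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names LIMITS, datablock_code_units, and encode_bytes are hypothetical, and the writer shown only falls back to raw8 (a platform with 16-bit path units would fall back to raw16 instead).

```python
# Largest code unit allowed for each supported format.
LIMITS = {"raw8": 255, "raw16": 65535}


def datablock_code_units(block):
    """Return (bits_per_unit, code_units) for a string or object datablock."""
    if isinstance(block, str):
        if not block:
            raise ValueError("datablock strings must not be empty")
        # A plain string stands for its UTF-8 code units.
        return 8, list(block.encode("utf-8"))
    fmt = block["format"]
    if fmt not in LIMITS:
        raise ValueError("unsupported format: %r" % (fmt,))
    units = block["code-units"]
    if not units:
        raise ValueError("code-units must not be empty")
    for unit in units:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(unit, bool) or not isinstance(unit, int):
            raise ValueError("code units must be integers")
        if not 1 <= unit <= LIMITS[fmt]:
            raise ValueError("code unit out of range for " + fmt)
    # readable-name is ignored on purpose: it carries no semantic meaning.
    return (8 if fmt == "raw8" else 16), list(units)


def encode_bytes(raw):
    """Encode a byte path as a datablock, preferring a plain UTF-8 string."""
    if not raw or 0 in raw:
        raise ValueError("empty paths and NUL bytes are not representable")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return {"format": "raw8", "code-units": list(raw)}
```

With these helpers, the string "paţh-to-filë" and the raw8 example above yield the same sequence of 8-bit code units, which is what allows a writer to choose the string form whenever the path is valid UTF-8.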

6.3. Filepaths

Filepaths may either be relative or absolute. To this end, the dependency scanning tool must output its working directory in the work-directory property at the root of the document. The build tool may then construct the absolute paths as necessary.

For concrete examples where absolute paths may not be suitable:

  • A distributed build may perform the compilation in a different directory on another machine than the host machine is using.

  • A build tool uses a chroot for each command it invokes.
    [Concretely, the Tup build tool can execute compile rules inside of individual FUSE chroots where absolute paths are meaningless outside of that context.]
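A build tool could construct absolute paths from the work-directory property along the lines of this Python sketch. The function name absolute_path is hypothetical, and the sketch assumes POSIX-style paths that were communicated as plain UTF-8 strings.

```python
from pathlib import PurePosixPath


def absolute_path(work_directory, path):
    """Resolve `path` against the scanner's working directory.

    Absolute paths are returned as they are; relative paths are joined
    onto `work_directory`.
    """
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        return str(candidate)
    return str(PurePosixPath(work_directory) / candidate)
```

A build tool that executes rules in a different directory or chroot would apply its own prefix translation after this step.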

6.4. Rule items

The rules array allows for the dependency information of multiple rules to be specified in a single file.

The only restriction on the contents of the collective set of rules objects is that every output listed in a rule object or in its future-compile object must be unique across the whole set. This is because if they are not unique, there are outputs which have multiple rules that write to them, which is undefined behavior.
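The uniqueness restriction can be checked with a short pass over the rules array, sketched here in Python. The function names are hypothetical. The sketch compares outputs by their code units, so a plain string and a raw8 object holding the same bytes count as the same output; it does not try to equate relative and absolute spellings of one file.

```python
def _output_key(block):
    """Build a hashable key for a datablock from its code units."""
    if isinstance(block, str):
        return ("raw8", tuple(block.encode("utf-8")))
    return (block["format"], tuple(block["code-units"]))


def duplicate_outputs(rules):
    """Return the outputs that are written by more than one place."""
    seen = set()
    duplicates = []
    for rule in rules:
        outputs = list(rule.get("outputs", []))
        outputs += rule.get("future-compile", {}).get("outputs", [])
        for output in outputs:
            key = _output_key(output)
            if key in seen:
                duplicates.append(output)
            seen.add(key)
    return duplicates
```

An empty result means the set of rules satisfies the restriction as far as this comparison can tell.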

6.5. Dependency information

Each rule represented in the rules array is a JSON object which has a single required property, inputs. The value of this property is an array of datablock entries representing the set of filepaths that are read directly by the execution of this rule. A rule must have at least one input because otherwise the rule is idempotent and never needs dependency information to be discovered. Two optional properties exist to indicate the dependencies of the execution of the dependency scanning tool itself: the outputs array and the depends array. The outputs value is an array of filepaths that will be written by this execution. The depends value is an array of filepaths of files that affect the results of the run. For C++, these will generally be paths found through #include, but other mechanisms may be in effect.

6.6. Future dependency information

The core of this specification is the future-compile property on a rules object. future-compile objects have three optional properties, outputs, provides, and requires.

The outputs array contains filepaths which will be written to when the source is compiled. The provides and requires arrays contain descriptions of modules. The provides array is for modules that the inputs will produce and the requires array is for modules that the inputs require. Each item of these arrays is a JSON object with one required property, logical-name, and two optional properties: compiled-module-path and source-path. The values of all of these properties are encoded in the same way as filepaths. The logical-name value is what build tools must use to discover the ordering among translation unit compilations. In C++, this will generally be the name of the module (including its partition, if any) as included in the source. The compiled-module-path should be provided only if the location of the module’s artifact is known when the dependency information is discovered. The source-path is the path to the main source of the module. This is intended to be used to communicate the location of a header for a header-unit import or a module interface unit for a C++20 module when it is known.

Example rule entry with future-compile information
{
  "inputs": [
    "path/to/input.cxx"
  ],
  "future-compile": {
    "outputs": [
      "path/to/output.o"
    ],
    "provides": [
      {
        "compiled-module-path": "exported.bmi",
        "logical-name": "exported"
      }
    ],
    "requires": [
      {
        "logical-name": "imported"
      }
    ]
  }
}
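As an illustration of how a build tool might use logical-name to order compilations, the following Python sketch matches each requires entry with the rule that provides the same name and sorts the rules topologically. The function name compile_order is hypothetical. The sketch assumes that every logical-name is a plain string and that modules not provided by any scanned rule are supplied from elsewhere; it relies on graphlib from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def compile_order(rules):
    """Return rule indices ordered so that providers come before users."""
    # Map each provided logical-name to the index of the providing rule.
    provider = {}
    for index, rule in enumerate(rules):
        for module in rule.get("future-compile", {}).get("provides", []):
            provider[module["logical-name"]] = index

    sorter = TopologicalSorter()
    for index, rule in enumerate(rules):
        sorter.add(index)
        for module in rule.get("future-compile", {}).get("requires", []):
            name = module["logical-name"]
            if name in provider:
                # The rule at `index` must run after its provider.
                sorter.add(index, provider[name])
    # static_order raises graphlib.CycleError for circular requirements.
    return list(sorter.static_order())
```

A real build tool would typically feed these edges into its own build graph rather than computing a single linear order, so that independent compilations can still run in parallel.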

6.7. Extensions

Vendor extensions may be added to the format using properties prefixed with an underscore (_) followed by a vendor-specific string followed by another underscore. None of these may be used to store semantically relevant information required to execute a correct build. Consumers must be able to ignore all _-prefixed properties and not suffer any loss of essential functionality.

Example rule entry with extended information
{
  "inputs": [
    "path/to/input"
  ],
  "_VENDOR_extension": true
}
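A consumer that does not understand any extensions can simply drop them, as in this Python sketch. The function name strip_extensions is hypothetical; the sketch treats every property whose name starts with an underscore as an extension, at any depth of the document.

```python
def strip_extensions(value):
    """Return a copy of a decoded JSON value without _-prefixed properties."""
    if isinstance(value, dict):
        return {
            key: strip_extensions(item)
            for key, item in value.items()
            if not key.startswith("_")
        }
    if isinstance(value, list):
        return [strip_extensions(item) for item in value]
    # Strings, numbers, booleans, and null are returned unchanged.
    return value
```

Because extensions may not carry information required for a correct build, the stripped document remains sufficient for ordering compilation rules.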

7. Versioning

There are two properties with integer values in the top-level JSON object of the format: version and revision. The version property is required and if revision is not provided, it can be assumed to be 0. These indicate the version of the information available in the format itself and what features may be used. Tools creating this format should have a way to create older versions of the format to support consumers that do not support newer format versions.

The version integer is incremented when semantic information required for a correct build is different than the previous version. When the version is incremented, the revision integer is reset to 0.

The revision integer is incremented when the semantic information of the format is the same as the previous revision of the same version, but it may include additionally specified information or use an additionally specified format for the same information. For example, adding a format type would cause an increment of the revision value.

The version specified in this document is:

Version fields for this specification
{
  "version": 1,
  "revision": 0
}
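One possible reading of these rules from the consumer side is sketched below in Python. The names SUPPORTED_VERSION, SUPPORTED_REVISION, and can_consume are hypothetical. The sketch assumes that a consumer rejects any other version outright, and also rejects a newer revision because it may use a format type the consumer does not know.

```python
# The version and revision this hypothetical consumer was written against.
SUPPORTED_VERSION = 1
SUPPORTED_REVISION = 0


def can_consume(document):
    """Report whether a decoded document is within the supported range."""
    version = document["version"]
    # revision is optional and is assumed to be 0 when absent.
    revision = document.get("revision", 0)
    if version != SUPPORTED_VERSION:
        return False
    return revision <= SUPPORTED_REVISION
```

When this check fails, a build tool could ask the dependency scanning tool to produce an older version of the format, if the scanner offers a way to do so.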

8. Full example

Given this reduced source file module.cpp:

Reduced example C++ module source
export module my.module;

import other.module;
import <header>;

#include "config.h"

a full output for scanning this source could be:

Example dependency output
{
  "version": 1,
  "revision": 0,
  "work-directory": "/scanner/working/dir",
  "rules": [
    {
      "inputs": [
        "module.cpp"
      ],
      "outputs": [
        "depinfo.json"
      ],
      "depends": [
        "/system/include/path/header",
        "include/path/config.h"
      ],
      "future-compile": {
        "outputs": [
          "module.cpp.o",
          "my_module.bmi"
        ],
        "provides": [
          {
            "logical-name": "my.module",
            "source-path": "module.cpp",
            "compiled-module-path": "my_module.bmi"
          }
        ],
        "requires": [
          {
            "logical-name": "other.module"
          }
        ]
      }
    }
  ]
}

9. References