2023-12-13
document number | date | comment |
---|---|---|
n3190 | 202312 | this paper, original proposal |
CC BY, see https://creativecommons.org/licenses/by/4.0
The C and C++ preprocessors have recently attracted some attention because they provide a means of textual replacement whose expressiveness is balanced: it allows relatively sophisticated compile-time features to be expressed while guaranteeing termination within quite reasonable time frames. The preprocessor is not Turing complete, which has the advantage that processing in general stays bounded. Compiling with relatively complex macro packages (such as Boost or P99) is in general several orders of magnitude faster than compiling equivalent code written with templates or constexpr
, and has the advantage of producing intermediate equivalent source code that can be inspected.
Several projects are currently under way to extend the preprocessing phases and gain expressiveness. Our goal is to collect the different proposed features here. Most of them are relatively simple to implement, whereas the gain for everyday programming in C may be significant. It will be important to watch that all of this does not slow down compilation too much.
There are several angles to preprocessing. In fact, several phases of the C and C++ translation model are commonly subsumed under the name, in particular lexing, evaluation of directives and macro replacement. The extensions we will discuss range from simple (but useful) predefined macros such as __COUNTER__
, over new forms of string and character literals such as R"(bäh)"
, to extensions of existing directives (prefix for #include), new directives (#bind), and new rules for macro replacement (bounded recursion).
A lot of predefined macros have appeared as compiler or library specific extensions that would better be promoted to the C standard. This concerns object-like macros and function-like macros.
__COUNTER__
Returns an incremented value each time it is expanded. Already common in many implementations. Very useful to generate unique local identifiers that are valid within one macro expansion and that do not collide with those from another invocation of the same macro. Needs minimal compiler magic.
__BASE_FILE__
The name of the top-level source file. Allows a translation unit to be identified. Needs minimal magic.
__ISO_DATE__
same as __DATE__
but in the form "YYYY-MM-DD"
. Needs minimal magic.
__INTEGER_DATE__
same as __DATE__
but in the form YYYYMMDD
, that is without "
such that it is interpreted as an integer literal and may be used in arithmetic. Needs minimal magic.
__ERROR__(message)
The same as an error directive but may appear anywhere, in particular in macro expansions. GCC and similar compilers already get there in a roundabout way by introducing a #pragma GCC error
and then invoking that through _Pragma
. Needs some magic.
__WARNING__(message)
Same, but for #warning
. Needs some magic.
__MANGLE__(...)
An implementation-specific result that eases the pain when programming with vendor attributes, for example. Needs magic.
__COMMAS__(...)
Count the number of top-level commas in the argument list. Basically the number of arguments minus one, but with no distinction between a completely empty list and a single argument. Can be done without magic, but is nasty to program. Better done by the implementation.
__EMPTY__(...)
Detect an empty argument list. Needs no magic since C23 when using __VA_OPT__
.
__NARGS__(...)
Number of arguments, where an empty list counts as 0. Same complexity as for __COMMAS__
.
__UGLIFY__(...)
add leading and trailing __
to an identifier. No magic needed.
__STRINGIFY__(...)
No magic needed.
__EXPAND__(...)
Expand the argument list once. No magic needed.
__EXPAND_STRINGIFY__(...)
Expand the argument list once and then stringify. No magic needed.
__EXPAND_DEC__(...)
Expand and then evaluate as a preprocessor expression and convert to a decimal literal plus optional leading -
token. Needs magic for the evaluation part.
__EXPAND_HEX__(...)
Expand and then evaluate as a preprocessor expression and convert to a hexadecimal literal including leading 0x
. Needs magic for the evaluation part.
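To make the "no magic" remarks concrete, here is a minimal sketch of how some of these macros can be approximated in user code today. The MY_ names are hypothetical stand-ins and not part of the proposal; the sketch assumes a compiler that provides __VA_OPT__ (C23) and the common __COUNTER__ extension, and the argument counter is arbitrarily limited to 8 arguments.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical user-level stand-ins; the MY_ names are not part of the proposal. */
#define MY_STRINGIFY(...)        #__VA_ARGS__
#define MY_EXPAND_STRINGIFY(...) MY_STRINGIFY(__VA_ARGS__)

#define MY_CAT_(A, B) A##B
#define MY_CAT(A, B)  MY_CAT_(A, B)          /* expands A and B, then pastes */
#define MY_UGLIFY(ID) MY_CAT(MY_CAT(__, ID), __)

/* Pick the second argument; __VA_OPT__ shifts a 0 into that position
 * only when the argument list is not empty. */
#define MY_SECOND(A, B, ...) B
#define MY_EMPTY(...) MY_SECOND(__VA_OPT__(~, 0,) ~, 1, ~)

/* Counts up to 8 arguments by sliding a descending list of numbers. */
#define MY_NARGS(...) \
    MY_NARGS_(__VA_ARGS__ __VA_OPT__(,) 8, 7, 6, 5, 4, 3, 2, 1, 0, ~)
#define MY_NARGS_(A1, A2, A3, A4, A5, A6, A7, A8, N, ...) N

/* __COUNTER__ is expanded once, when the TMP argument is pre-expanded,
 * so all three uses of TMP name the same fresh identifier. */
#define MY_SWAP_INT(A, B) MY_SWAP_INT_(A, B, MY_CAT(my_swap_tmp_, __COUNTER__))
#define MY_SWAP_INT_(A, B, TMP) \
    do { int TMP = (A); (A) = (B); (B) = TMP; } while (0)

static_assert(MY_EMPTY() == 1 && MY_EMPTY(a, b) == 0, "emptiness test");
static_assert(MY_NARGS() == 0 && MY_NARGS(a, b, c) == 3, "argument count");

static int swap_demo(void)
{
    int x = 1, y = 2;
    MY_SWAP_INT(x, y);
    MY_SWAP_INT(x, y);   /* each expansion gets its own temporary name */
    MY_SWAP_INT(x, y);
    return 10 * x + y;   /* three swaps leave x == 2 and y == 1 */
}
```

The counting and emptiness macros work, but they show why the text calls this "nasty to program": the limit of 8 is baked in, and every project ends up with its own variant.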
In combination with the #expand
prefix and #include
the latter two allow to emulate finite recursion and iteration.
C and C++ have prefixes for string and character literals u8
, L
, u
and U
for specific execution encoding, namely for multi-byte encoding (without prefix), UTF-8, wchar_t
, char16_t
and char32_t
characters, respectively.
Note that these prefixes only concern the execution encoding. The source encoding is unchanged: characters in the input are interpreted as anywhere else, including the interpretation of escape sequences. The prefix only changes the interpretation/realization as an array. So for example a given source character such as ö
would lead to several array elements in multi-byte encodings (such as UTF-8), to one or two elements in UTF-16 and to one element in UTF-32. The character literals 'x'
and u8'x'
refer to the same concept (the character x
in the source encoding) but result in different semantics, one is a char
with the value of the character x
in the execution environment, the other is unsigned char
with a portable value, namely 120.
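The effect of the prefixes on the array realization can be checked with a few static assertions. This is only a sketch: ELEMS and first_unit are hypothetical helper names, and the assertions assume an implementation where u and U literals are UTF-16 and UTF-32 encoded (mandatory since C23, common practice before).

```c
#include <assert.h>

/* Number of array elements of a literal, including the terminator. */
#define ELEMS(LIT) (sizeof(LIT) / sizeof((LIT)[0]))

/* U+00F6 (ö), written as a universal character name so that the result
 * does not depend on the encoding of this source file. */
static_assert(ELEMS(u8"\u00f6") == 3, "UTF-8: two code units + terminator");
static_assert(ELEMS(u"\u00f6") == 2, "UTF-16: one code unit + terminator");
static_assert(ELEMS(U"\u00f6") == 2, "UTF-32: one code unit + terminator");

/* U+1F600 lies outside the basic multilingual plane. */
static_assert(ELEMS(u"\U0001F600") == 3, "UTF-16: surrogate pair + terminator");
static_assert(ELEMS(U"\U0001F600") == 2, "UTF-32: still one code unit");

/* First code unit of a narrow or UTF-8 literal, as an unsigned value. */
static unsigned first_unit(const void *s)
{
    return *(const unsigned char *)s;
}
```

With these helpers, first_unit(u8"x") is 120 on every implementation, whereas first_unit("x") depends on the execution character set.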
In contrast to that, C++ adds a specific syntax for a modification of source encoding of a string. Namely they add an R
at the end of one of the prefixes (including the empty one) to indicate that the source encoding is “raw” without escape sequences. Examples
The introduction of UTF-8 character literals has the strange effect of introducing literals with base type unsigned char
but by imposing a specific interpretation. It was already unfortunate that for historic reasons characters and bare bytes have the same type; now we are repeating the same mistake. The type that should be reserved for bytes also represents a particular textual concept, namely UTF-8 characters and strings. While the decision to do so is consistent within C’s restricted framework, it has the inherent danger that UTF-8 strings in particular will be used to encode arbitrary literals of type unsigned char
by sprinkling \x
escape sequences all over the place. We think that UTF-8 characters and strings should be reserved to properly encoded strings and that there should be other features that encode arbitrary binary data.
We propose the following prefixes as new extensions for C2y.
Form | type | encoding |
---|---|---|
x"cont\x00E4\0nts" | unsigned char[] | restricted UTF-8, with escape sequences |
x'\xFFFF' | unsigned char | restricted UTF-8, with escape sequences |
B"gAf4yu==" | unsigned char[] | base64 |
Here the x
prefix (mnemonic “hex encoded string”) is meant for arrays of base type unsigned char
that hold arbitrary data in the range of 0
to UCHAR_MAX
, inclusive, and that are encoded in the usual way with escape sequences. For portability, the source encoding of these strings should be restricted to one-byte characters that are representable in UTF-8 with codes between 32 and 126, inclusive, that is, printable ASCII. Byte values outside that range should be specified with octal or hexadecimal escape sequences or with \n
or similar. With these specifications, the array then holds portable binary data, as long as the byte values that are presented fit in 8 bits.
The B
prefix (mnemonic “Base64 Binary” encoding) represents binary data that is packed consecutively in an array of base type unsigned char
and where the encoding is base64. It uses the 62 alpha-numerical characters of the source character set plus the characters +
, /
and =
that are present in all currently used encodings to encode packs of 24 bits of data with 4 characters. It is thereby relatively efficient, because it uses only about 1.33 times the space strictly needed, but is still portable to all modern architectures. This encoding is widely used for the transfer of binary data and it is almost trivial to program.
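To illustrate how little code the encoding needs, here is a minimal sketch of a base64 encoder. The function name base64_encode and its interface are our own assumptions for this illustration, not something taken from an existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encodes n bytes from src as base64 into dst and adds a terminating '\0'.
 * dst must have room for 4 * ((n + 2) / 3) + 1 characters.
 * Returns the number of characters written, not counting the terminator. */
static size_t base64_encode(const unsigned char *src, size_t n, char *dst)
{
    static const char digits[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    static_assert(sizeof digits == 65, "64 digits plus terminator");
    size_t out = 0;
    for (size_t i = 0; i < n; i += 3) {
        size_t left = n - i;                        /* at least 1 */
        /* Pack up to three bytes into 24 bits, missing bytes count as 0. */
        unsigned long pack = (unsigned long)src[i] << 16;
        if (left > 1) pack |= (unsigned long)src[i + 1] << 8;
        if (left > 2) pack |= (unsigned long)src[i + 2];
        /* Emit four 6-bit digits, padding incomplete packs with '='. */
        dst[out++] = digits[(pack >> 18) & 0x3F];
        dst[out++] = digits[(pack >> 12) & 0x3F];
        dst[out++] = left > 1 ? digits[(pack >> 6) & 0x3F] : '=';
        dst[out++] = left > 2 ? digits[pack & 0x3F] : '=';
    }
    dst[out] = '\0';
    return out;
}
```

Each loop iteration turns one pack of 24 bits into 4 characters, which is where the ratio of about 1.33 comes from.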
Both concepts (without using a different prefix) are already widely used in practice. In particular, hex, octal or base64 encoded strings are used by some implementations as an intermediate source format for #embed
. Such an intermediate format could be forced through an if_empty
parameter, see below.
Usually in C identifiers are not directly followed by strings. But when U
prefixed literals were introduced in C, there still were some rare clashes with existing code. This happened where a macro U
that expanded to a string was used to add some sort of leading character sequence to a string. Previously, this usage was not sensitive to whether or not there was a space between the two. By introducing the prefix, the two usages (with and without space) became distinct and code changed its meaning or became invalid. So in this situation space is in fact significant.
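The clash can be reproduced with a few lines. The macro definition below is a hypothetical piece of legacy code invented for this illustration; ELEMS is a helper of ours.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical legacy code: a macro named U that prepends a marker. */
#define U "[unit] "

/* Number of array elements, including the terminator. */
#define ELEMS(A) (sizeof(A) / sizeof((A)[0]))

/* With a space, U is an identifier token: the macro expands and the two
 * narrow string literals are concatenated. */
static const char with_space[] = U "ab";
static_assert(sizeof with_space == 10, "7 marker characters + ab + terminator");

/* Without a space, U"ab" is lexed as one UTF-32 string literal and the
 * macro is never looked at. */
static_assert(ELEMS(U"ab") == 3, "a, b and the terminator");
```

Before the introduction of the U prefix, both spellings would have produced the 10-element narrow string.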
Generally, it is often assumed that in C spaces don’t contribute much to the interpretation of programming text, but we think that for C23 this is a simplification that does not really reflect the current situation. Additionally, there is the problem of interfacing with C++, where some of the rules are different.
syntax | meaning, C | different meaning, C++ |
---|---|---|
# define X(A) | function like macro, empty | |
# define X (A) | object macro, expands to (A) | |
0x4'7'a | hex number with digit separators | |
0x4 '7'a | number, character literal, and identifier | number, character literal with suffix |
0x4 '7' a | number, character literal, and identifier | |
"%" PRIx64 | valid format string for printf | |
"%"PRIx64 | valid format string for printf | string literal with suffix |
R "(hör)" | identifier followed by multi-byte string | |
R"(hör)" | identifier followed by multi-byte string | raw multi-byte string, contains just hör |
R "hör" | identifier followed by multi-byte string | |
R"hör" | identifier followed by multi-byte string | invalid raw string |
U "hör" | identifier, followed by multi-byte string | |
U"hör" | UTF-32 string | |
We think that it would be in order to coordinate here between C and C++ and in general to discourage any use of identifiers that are adjacent to character and string literals. If we want this to be diagnosed it should be before phase 4, in particular before macro expansion. Best would be if this is diagnosed in phase 3, lexing. We propose:
Change the definitions of character and string literals to include leading and trailing identifiers, and then add constraints for the accepted prefixes (and for C++ suffixes) to phase 5, decoding.
Implementations could start to diagnose such possible collision immediately.
#embed
resource representation
For #embed
we went with a compromise: the output of that directive is as if a comma-separated list of integer values representing the byte values were inserted in the program text. This is not well suited for implementations that have the option of keeping preprocessed program text for intermediate stages of compilation. Such an intermediate file, with all bytes spelled out as integer literals, loses all the advantages of #embed
.
Thus, even today, implementations already use intermediate formats such as string literals with base64 encoding and wrap them inside some magic builtin. It would be good to generalize that idea, such that programmers would have the possibility to specify what intermediate representation to expect. This could be achieved quite simply:
If the
if_empty
embed parameter specifies a narrow string literal, the encoded resource shall be represented as-if by a string literal of the same kind.
Example:
is as if given as
where AQIDLS4+U0A=
is the base64 encoding of the contents of the resource, here the 8 byte values \001\002\003\055\056\076\123\100
or 0x01 0x02 0x03 0x2d 0x2e 0x3e 0x53 0x40
. So without the proposed convention the equivalent code as of C23 would be as if given as
which uses about 5 encoding characters per encoded byte, about 3.8 times as much as with a B
encoding.
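The correspondence between the two representations of this example can be checked mechanically. The following is a minimal sketch of a base64 decoder, applied to the 8 byte values given above; the names as_integer_list, base64_value, base64_decode and matches_resource are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The resource as C23 #embed presents it: a comma-separated integer list. */
static const unsigned char as_integer_list[] = {
    0x01, 0x02, 0x03, 0x2d, 0x2e, 0x3e, 0x53, 0x40
};
static_assert(sizeof as_integer_list == 8, "eight bytes");

/* Value of one base64 digit, or -1 if c is not a digit. */
static int base64_value(char c)
{
    static const char digits[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const char *p = (c != '\0') ? strchr(digits, c) : NULL;
    return p ? (int)(p - digits) : -1;
}

/* Decodes padded base64 text. dst needs room for 3 * strlen(src) / 4 bytes.
 * Returns the number of bytes, or (size_t)-1 for malformed input. */
static size_t base64_decode(const char *src, unsigned char *dst)
{
    size_t len = strlen(src), out = 0;
    if (len % 4 != 0) return (size_t)-1;
    for (size_t i = 0; i < len; i += 4) {
        unsigned long pack = 0;
        int pad = 0;
        for (int k = 0; k < 4; k++) {
            char c = src[i + k];
            int v = base64_value(c);
            if (c == '=' && k >= 2 && i + 4 == len) {
                v = 0;          /* padding, only at the very end */
                pad++;
            } else if (v < 0 || pad > 0) {
                return (size_t)-1;
            }
            pack = (pack << 6) | (unsigned long)v;
        }
        dst[out++] = (unsigned char)((pack >> 16) & 0xFF);
        if (pad < 2) dst[out++] = (unsigned char)((pack >> 8) & 0xFF);
        if (pad < 1) dst[out++] = (unsigned char)(pack & 0xFF);
    }
    return out;
}

/* Returns 1 if text decodes to exactly the bytes of as_integer_list. */
static int matches_resource(const char *text)
{
    unsigned char buf[16];
    if (strlen(text) > 16) return 0;   /* at most 12 decoded bytes fit buf */
    size_t n = base64_decode(text, buf);
    return n == sizeof as_integer_list && memcmp(buf, as_integer_list, n) == 0;
}
```

For this small resource the base64 text has 12 characters, where the integer list needs about 5 characters per byte.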
offset
parameter to #embed
This is more or less obvious to do and should account for the position in bytes from the start of the resource.
#include
The same form of parameters as for #embed
could be added here, except that the semantics should be adapted to the case. Namely, an #include
resource should be counted in lines instead of bytes. That is, an offset
or limit
would skip and count the number of lines to be included.
The prefix
and suffix
parameters would always add directives before and after the file contents that are executed in the context of the include file.
#bind
directive, see below, but similar effects as with bind could be a combination of #define
in a prefix and #undef
in a suffix. #include "my-main-xcode.c" \
__prefix__(expand bind TOTO WHATDOWEHAVE(35)) \
__suffix__(include "my-secondary-xcode.c")
similar, but without #bind
#include "my-main-xcode.c" \
__prefix__(expand define TOTO WHATDOWEHAVE(35)) \
__suffix__(include "my-secondary-xcode.c") \
__suffix__(undef TOTO)
A slash at the end of the input file name adds that path to the corresponding list of places instead of including a file. Example
This allows to distinguish additions to all the four lists that an implementation has to maintain, namely #include
with "/pa/th/"
and </pa/th/>
and #embed
with "/pa/th/"
and </pa/th/>
.
This feature is perhaps not the most needed by normal code, but eases the tuning of system header files a lot.
#bind
Semantically this is really nothing else than an improved version of
or
only that the #undef
part is inserted automatically
#if/#elif/#else
block, if any
This is simple to implement, because it only uses recursive preprocessor program structure that is already there. It really helps in programming because it avoids polluting the macro name space with local macros that everybody forgets to #undef
, in particular for programming with xcode inclusion. It is king when combined with the prefix
parameter extension for #include
and #include_source
.
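For comparison, this is roughly the boilerplate that is needed today when the same piece of code is instantiated with different local macros. The macro names T and NAME and the two functions are hypothetical; in real xcode usage the function definition would sit in an included file. With a directive such as #bind the #undef lines would become implicit.

```c
#include <assert.h>

/* Hypothetical helper macros T and NAME, meant to be local to each instance. */
#define T int
#define NAME max_int
static T NAME(T a, T b) { return a > b ? a : b; }
#undef NAME   /* easy to forget when written by hand */
#undef T

#define T double
#define NAME max_double
static T NAME(T a, T b) { return a > b ? a : b; }
#undef NAME
#undef T

/* Guard against exactly the pollution described above. */
#if defined(T) || defined(NAME)
# error "local macros leaked into the rest of the translation unit"
#endif
```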
#include_source
, #embed_resource
and #linenumber
directives
I found the sometimes-evaluate-and-sometimes-not definitions of #include
, #embed
and #line
in combination with the weird filename strings à la <stdsomething.h>
quite annoying to implement. It adds a lot of complexity for a feature that not many people use (expansion on #include
lines).
So these three directives don’t expand their line and have to receive proper file names, line numbers or limit parameters directly.
#expand
prefix
The idea of that prefix is that it allows user-controlled expansion of the line, instead of making expansion (or not) depend, as #include
currently does, on some weird syntactic property of the rest of the line. I find
much clearer than hiding the evaluation and concatenation inside a macro as you would do currently, something like
Also, this allows expansion for directives that currently don’t have it.
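The current workaround, hiding evaluation and concatenation inside a macro, can look roughly like the following sketch. The macro names STRINGIFY, HEADER and WANTED are hypothetical; the standard header <string.h> merely serves as a file that is certain to exist.

```c
#include <assert.h>

#define STRINGIFY_(X) #X
#define STRINGIFY(X)  STRINGIFY_(X)
/* Builds "NAME.h"; the dot and the suffix are glued on inside the macro. */
#define HEADER(NAME)  STRINGIFY(NAME.h)

#define WANTED string          /* hypothetical configuration macro */
#include HEADER(WANTED)        /* ends up as: #include "string.h" */

/* Only compiles if the computed include really pulled in <string.h>. */
static size_t header_demo(void)
{
    return strlen("bound");
}
```

Whether the line is expanded at all depends on it not starting with a quote or an angle bracket, which is the syntactic property criticized above.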
Derived from that prefix are #xdefine
, #xbind
, … that are just shortcuts for adding a #expand
prefix.
#do
and #foreach
Some of the macro preprocessing libraries allow to loop over argument lists or lists of tokens. This can be interesting when defining features for a list of types or when building interfaces that work with enumeration types.
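Without such directives, loops over a list are usually emulated with a list macro that applies its argument to every element. The following is a minimal sketch for an enumeration type; all names (COLORS, AS_ENUMERATOR, AS_STRING, color_name) are invented for this illustration.

```c
#include <assert.h>
#include <string.h>

/* The list is written once; X is applied to every element. */
#define COLORS(X) X(red) X(green) X(blue)

/* First pass: one enumerator per element. */
#define AS_ENUMERATOR(NAME) color_##NAME,
enum color { COLORS(AS_ENUMERATOR) color_count };
#undef AS_ENUMERATOR

/* Second pass: one name string per element, indexed by the enumerator. */
#define AS_STRING(NAME) [color_##NAME] = #NAME,
static const char *const color_names[color_count] = { COLORS(AS_STRING) };
#undef AS_STRING

static_assert(color_count == 3, "one enumerator per list element");

/* Name of an enumerator, or "?" for values outside the list. */
static const char *color_name(enum color c)
{
    return (unsigned)c < color_count ? color_names[c] : "?";
}
```

A #foreach directive could express the two passes directly, without the auxiliary callback macros and their #undef lines.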
As mentioned above, a combination of the #expand
prefix and #include
allows to define macros that emulate finite recursion / iteration.
Macro recursion is a dangerous feature, because it easily leads to unbounded depth and introduces the halting problem. Additionally it has the direct problem that it is not backwards compatible. C23 expects that a macro that is called recursively does not expand. Much C code out there relies on this, for example by defining a macro with the same name as a function. Once the macro level is expanded, a second level remains, which is then carried on into later compiler phases.
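A typical instance of code that relies on this rule is a macro that wraps a function of the same name. The function add and the counter add_calls below are hypothetical examples.

```c
#include <assert.h>

static int add_calls = 0;
static int add(int a, int b) { return a + b; }

/* Inside its own replacement list, add is not expanded again, so the inner
 * add((a), (b)) reaches the compiler as a call to the function above. */
#define add(a, b) (++add_calls, add((a), (b)))

/* A nested use in an argument is expanded, because argument expansion
 * happens before the outer macro is replaced. */
static int add_demo(void)
{
    return add(add(1, 2), 3);
}
```

If macros were suddenly expanded recursively, the inner add would be replaced again and again, and this code would no longer translate.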
So recursion has two imperatives:
There are several possible designs for this, some of which have already been discussed in the context of LLVM. In particular:
#define
but that allows to expand the defined macros recursively
__VA_OPT__
that refers to the current macro.
As mentioned above, a combination of the #expand
prefix and #include
allows to define macros that emulate finite recursion / iteration.