Doc. No.: | WG21/N1566, J16/04-0006 |
---|---|
Date: | 2004-02-05 |
Reply to: | Clark Nelson |
Phone: | +1-503-712-8433 |
Email: | clark.nelson@intel.com |
This document is intended as a basis for discussion of the details of adopting text from C99 to describe the C++ preprocessor. This was proposed at the Kona meeting, and was supported almost unanimously by the Evolution working group.
This paper summarizes every change that was made to the preprocessor section of either the C++ standard (as of 2003) or the C standard (as of 2001), taking the 1989 C standard as the base. The descriptions of the phases of translation and of trigraphs are also covered; they were explicitly mentioned in the original straw vote from the first Nashua meeting (1991-03) on the subject of incorporating text from the C standard.
Differences in the area of universal character names are also mentioned, as they affect the phases of translation. UCNs were introduced concurrently in both standards; nevertheless (and unfortunately) there is considerable variance in the way they are described. UCNs were not specifically mentioned at Kona, so it is not yet clear that there is consensus to synchronize with C, nor which direction may be favored for resolving discrepancies.
References are to the C++ standard, with the corresponding C standard section number in parentheses. A complete C reference is used when a paragraph or section has been added to the C standard.
Changes fall into four major categories:
I expect that everyone would agree that C and C++ should be synchronized with respect to universal character names. I am less certain that everyone will agree where changes should be made to effect synchronization. Personally, I believe that the model described in the C standard is at least as good as that in C++. Therefore, I recommend that the C++ standard be changed to match C: the changes from C99 should be adopted, and the changes from C++ should be abandoned.
It should be noted that this list of changes is not complete. Universal character names are also mentioned elsewhere in both standards: where they are defined, in the descriptions of identifiers and string and character literals, and in annexes specifying which characters are permitted in identifiers. More work in this area will be needed, especially if the committee prefers close synchronization.
§16.3.2¶2 (§6.10.3.2): Added a statement that stringizing a string literal containing a UCN is implementation-defined.
§2.1 phase 1 (§5.1.1.2): Processing of characters not in the basic source character set is described in terms of universal character names.
§2.1 phase 2 (§5.1.1.2): A universal character name may not be split by an escaped new-line.
§2.1 phase 5 (§5.1.1.2): Universal character names are mapped onto the execution character set. [In C, no change is needed here because of a terminology difference: a universal character name is described as an escape sequence, which is already mentioned.]
The technical changes are presented roughly in order of decreasing controversy (in my best guess).
This represents quite a lot of technical work, in both specification and implementation. I am not prepared to make a recommendation at this time.
§16.6¶1 (§6.10.6): Added statements distinguishing non-standard pragmas from standard pragmas.
§16.6 -- new paragraph (§6.10.6¶2): Added introductions of the standard pragmas:
#pragma STDC FP_CONTRACT
#pragma STDC FENV_ACCESS
#pragma STDC CX_LIMITED_RANGE
Their semantics are specified elsewhere.
Although the conditionally-defined macros added to C99 represent a fair amount of work in specification and/or coordination, my recommendation would be to adopt them into C++. __STDC_HOSTED__ is comparatively easy, and makes as much sense for C++ as it does for C.
Clearly the description of __cplusplus in the C++ standard should not be synchronized with C. __STDC_VERSION__ might be trivially (and usefully?) defined to have the same value as __cplusplus. With respect to __STDC__, perhaps existing practice should be surveyed.
§16.8¶1 (§6.10.8): New macros were added:
__STDC_HOSTED__
__STDC_VERSION__
Several editorial clarifications are also applied.
§16.8 -- new paragraph (§6.10.8¶2): New conditionally-defined macros were added:
__STDC_IEC_559__
__STDC_IEC_559_COMPLEX__
__STDC_ISO_10646__
wchar_t.
§16.8 -- new paragraph (§6.10.8¶5): Added a prohibition against predefining or defining __cplusplus. [This was added more or less as a courtesy, to ensure that __cplusplus could be used to distinguish reliably between C and C++.]
§16.8¶1 (§6.10.8):
__STDC__
__cplusplus
In addition, a restriction on the spellings of any other predefined macros (i.e. that they must begin either with two underscores or an underscore and a capital letter) was deleted. [I believe this was removed due to a general reluctance to state restrictions on implementations using the word "shall". Other such instances were rephrased, not deleted. It is not clear to me that this particular change is worth preserving.]
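As an illustration of how the predefined macros discussed above might be consulted, here is a minimal sketch. It assumes only that __cplusplus is defined (as C++ requires) and treats __STDC_HOSTED__ as optional, since C++ implementations of this era need not provide it. The names hosted_flag, cplusplus_value, and check_predefined_macros are hypothetical and chosen for this example only.

```cpp
#include <cassert>

// __cplusplus is the reliable way to tell C++ from C, which is why C99
// prohibits a C implementation from predefining or defining it.
#ifndef __cplusplus
#error "this sketch must be compiled as C++"
#endif

// __STDC_HOSTED__ originates in C99; treat its presence as optional here.
#ifdef __STDC_HOSTED__
const int hosted_flag = __STDC_HOSTED__;  // 1 = hosted, 0 = freestanding
#else
const int hosted_flag = -1;               // hypothetical "not provided" marker
#endif

// The value of __cplusplus identifies the revision of the C++ standard.
const long cplusplus_value = __cplusplus;

void check_predefined_macros() {
    assert(cplusplus_value >= 199711L);
    assert(hosted_flag == 1 || hosted_flag == 0 || hosted_flag == -1);
}
```

Whether __STDC__ or __STDC_VERSION__ is defined in C++ mode varies between implementations, which is why the sketch does not depend on either.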
Probably every hosted C++ implementation already supports 64-bit integers, most by the name long long. So adopting it, along with the other <stdint.h> changes, would amount to codification of existing practice. I recommend it.
§16.1¶4 (§6.10.1): long and unsigned long were replaced by intmax_t and uintmax_t, respectively. Also, integer literals can have other widths than int or long.
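The practical effect of evaluating condition directives in intmax_t can be sketched as follows. The example assumes an implementation that follows the C99 rule (so the arithmetic is at least 64 bits wide); under the older rule with a 32-bit long, the addition would overflow. The macro WIDE_IF_ARITHMETIC and the function names are hypothetical.

```cpp
#include <cassert>

// 0x7FFFFFFF + 1 does not fit in a 32-bit signed long, but it is an
// ordinary positive value when #if arithmetic is done in intmax_t.
#if 0x7FFFFFFF + 1 > 0
#define WIDE_IF_ARITHMETIC 1
#else
#define WIDE_IF_ARITHMETIC 0
#endif

bool if_arithmetic_is_wide() { return WIDE_IF_ARITHMETIC != 0; }

void check_if_arithmetic() { assert(if_arithmetic_is_wide()); }
```

On platforms where long is already 64 bits wide the result is the same under either rule, so this only distinguishes the two models where long is 32 bits.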
This is a very simple change; there is no interaction with the rest of the language. It should be adopted.
§16.3.4¶3 (§6.10.3.4): Added a statement that pragma operators are processed after macro expansion.
§16.9 -- new section (§6.10.9): Added description of pragma operator:
_Pragma ( string-literal )
§2.1 phase 4 (§5.1.1.2): Added a statement that pragma operators are interpreted.
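The main attraction of the pragma operator is that, unlike a #pragma directive, it can be produced by macro expansion. The sketch below assumes a compiler that implements _Pragma; the specific pragmas shown are GCC-style diagnostic pragmas, used here only as an example of implementation-specific pragmas, and are compiled out elsewhere. The DIAG_* macro names and sum_to are hypothetical.

```cpp
#include <cassert>

// A #pragma directive cannot appear in a macro replacement list,
// but a _Pragma operator can.
#if defined(__GNUC__)
#define DIAG_PUSH _Pragma("GCC diagnostic push")
#define DIAG_IGNORE_UNUSED _Pragma("GCC diagnostic ignored \"-Wunused-variable\"")
#define DIAG_POP _Pragma("GCC diagnostic pop")
#else
// On other implementations the macros expand to nothing.
#define DIAG_PUSH
#define DIAG_IGNORE_UNUSED
#define DIAG_POP
#endif

DIAG_PUSH
DIAG_IGNORE_UNUSED
int sum_to(int n) {
    int scratch = 0;  // deliberately unused
    int total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}
DIAG_POP

void check_pragma_operator() { assert(sum_to(4) == 10); }
```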
Paul Mensonides made this proposal in isolation at the Kona meeting. I trumped it by suggesting this grander unification before many people had a chance to comment on this aspect specifically. This is unquestionably the largest change under consideration. Along with Paul, I recommend it.
§16 control-line grammar rule (§6.10): Alternatives were added with an ellipsis before the close parenthesis.
§16.3¶4 (§6.10.3): A variadic macro may be invoked with more arguments than the definition has parameters.
§16.3 -- new paragraph (§6.10.3¶5): __VA_ARGS__ may be used only in the definition of a variadic macro.
§16.3¶9 (§6.10.3): Alternatives were added with an ellipsis before the close parenthesis.
§16.3¶10 (§6.10.3): Removed statement that empty macro arguments yield undefined behavior.
§16.3 -- new paragraph (§6.10.3¶12): Added description of argument collection for variadic macros.
§16.3.1 -- new paragraph (§6.10.3.1¶2): __VA_ARGS__ is an implicit parameter of a variadic macro.
§16.3.2¶2 (§6.10.3.2): Added definition of the result of stringizing an empty macro argument.
§16.3.3¶2-3 (§6.10.3.3): Added definition of token-pasting with an empty macro argument.
§16.3.3 -- new paragraph (§6.10.3.3¶4): A token-pasting example was added.
§16.3.5¶5 (§6.10.3.5): Examples of token-pasting and stringizing with empty macro arguments were added.
§16.3.5 -- new paragraph (§6.10.3.5¶7): More examples of token-pasting with empty macro arguments.
§16.3.5 -- new paragraph (§6.10.3.5¶9): Examples using variadic macros.
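The changes listed above can be illustrated with a short sketch covering a variadic macro, stringizing an empty argument, and token-pasting with an empty argument. It assumes a preprocessor that implements the C99 rules; all macro and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Variadic macro: everything matched by the ellipsis replaces __VA_ARGS__.
#define FORMAT_INTO(buf, ...) std::snprintf((buf), sizeof(buf), __VA_ARGS__)

// Stringizing an empty argument yields the empty string literal "".
#define STRINGIZE(x) #x

// Pasting with an empty argument yields the other operand unchanged.
#define PASTE(a, b) a ## b

int format_demo(char (&buf)[32]) { return FORMAT_INTO(buf, "%d-%s", 7, "x"); }
const char* empty_string() { return STRINGIZE(); }
int pasted_value() { return PASTE(, 12) + PASTE(3, ); }

void check_variadic_macros() {
    char buf[32];
    assert(format_demo(buf) == 3);  // snprintf wrote "7-x"
    assert(std::strcmp(buf, "7-x") == 0);
    assert(std::strcmp(empty_string(), "") == 0);
    assert(pasted_value() == 15);   // 12 + 3
}
```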
This change should be adopted. Note that, since the Technical Report on extensions for new character data types (WG14/N1040) has new kinds of string literals, its rules are slightly different, although analogous.
§2.1 phase 6 (§5.1.1.2): If adjacent string literals are of different types, the result of concatenation is a wide string literal.
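A minimal sketch of the phase 6 rule, assuming an implementation that follows C99 here: a wide string literal adjacent to a narrow one concatenates into a single wide string literal. The function names are hypothetical.

```cpp
#include <cassert>
#include <cwchar>

// L"abc" followed by "def" is treated as the wide literal L"abcdef".
const wchar_t* mixed_literal() { return L"abc" "def"; }

void check_mixed_concatenation() {
    assert(std::wcslen(mixed_literal()) == 6);
    assert(std::wcscmp(mixed_literal(), L"abcdef") == 0);
}
```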
It is interesting to note that C89 explicitly allowed only letters in header and include file names. C++ added underscores, and C99 added digits. Probably both standards should allow both.
I have no idea why C99 dropped the requirement that the implementation document the mapping to external file names. But there is probably no practical impact, so by default C++ should probably drop it as well.
§16.2¶5 (§6.10.2): The mapping from header or source file name syntax to external source file names is no longer implementation-defined.
§16.2¶5 (§6.10.2): Non-initial digits are now allowed in include syntax.
§16.2¶5 (§6.10.2): Underscores are allowed.
There is probably no support for adopting the lower limit on the significance of a header or include file name from C, even though it has now been increased.
On the other hand, I imagine it was only by oversight that the limitation to 15-bit numbers in a #line directive survived into C++. There is certainly no need to preserve it.
§16.2¶5 (§6.10.2): The lower limit on the significant characters of an include file or header name was raised to eight.
§16.4¶2 (§6.10.4): The limit on the number that can be specified in a #line directive was raised from 32767 to 2147483647.
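The relaxed #line limit can be sketched as follows, assuming an implementation that accepts line numbers above the old 15-bit maximum of 32767. The value 100000 and the names used are chosen only for illustration.

```cpp
#include <cassert>

// The line following a #line directive takes the number given in it.
#line 100000
const long line_after_directive = __LINE__;

void check_line_directive() { assert(line_after_directive == 100000L); }
```

Note that the directive also renumbers every subsequent line of the file, so diagnostics issued after it will report the adjusted numbers.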
§16.2¶5 (§6.10.2): The standard does not explicitly grant license to limit the number of significant characters in the name of an included file or header.
This is a considered difference from C, in which these identifier-like alternative token spellings are explicitly implemented as macros. It should be preserved.
§16.1¶4 (§6.10.1): Added a footnote clarifying that an identifier-like spelling of an alternative token is not replaced by zero in a condition directive.
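The footnote's point can be sketched as follows: in C++ the identifier-like alternative tokens act as operators inside a condition directive, rather than as identifiers that are replaced by zero. The sketch assumes a conforming C++ preprocessor (some compilers need a conformance option for this); HYPOTHETICAL_FEATURE_FLAG is a deliberately undefined macro invented for the example.

```cpp
#include <cassert>

// "not", "and" and "or" are operators here, so the condition is
// !defined(HYPOTHETICAL_FEATURE_FLAG) && (1 || 0), which is true.
#if not defined(HYPOTHETICAL_FEATURE_FLAG) and (1 or 0)
#define ALT_TOKENS_ARE_OPERATORS 1
#else
#define ALT_TOKENS_ARE_OPERATORS 0
#endif

bool alt_tokens_are_operators() { return ALT_TOKENS_ARE_OPERATORS != 0; }

void check_alternative_tokens() { assert(alt_tokens_are_operators()); }
```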
bool data type
Although C now has a Boolean type, Boolean-valued operators are still specified as having int results, unlike in C++. Also, in C++ true and false are not defined as macros. So this difference is still justified.
§16.1¶4 (§6.10.1): In a condition directive, true and false are not replaced by zero, and bool-typed subexpressions are immediately integral-promoted.
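A small sketch of this rule, assuming a conforming C++ preprocessor: true and false keep their Boolean meaning in a condition directive and are promoted to the integers 1 and 0. The macro and function names are hypothetical.

```cpp
#include <cassert>

// In C++, "true" is not an ordinary identifier replaced by zero.
#if true
#define TRUE_IN_CONDITION 1
#else
#define TRUE_IN_CONDITION 0
#endif

// "false" evaluates to zero, so the #else branch is taken.
#if false
#define FALSE_IN_CONDITION 1
#else
#define FALSE_IN_CONDITION 0
#endif

// Promotion to an integer type means true + true compares equal to 2.
#if true + true == 2
#define BOOL_IS_PROMOTED 1
#else
#define BOOL_IS_PROMOTED 0
#endif

void check_bool_in_condition() {
    assert(TRUE_IN_CONDITION == 1);
    assert(FALSE_IN_CONDITION == 0);
    assert(BOOL_IS_PROMOTED == 1);
}
```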
This change is obviously still justified.
§2.1 phase 8 -- new phase (§5.1.1.2): Template instantiation was inserted between parsing/translation and linking.
Unless someone would like to convince either committee to adopt terms from the other, these are simply areas where the committees have agreed to disagree. I recommend no changes.
"integral constant expression" was changed to "integer constant expression".
"comprise" was changed to "compose".
"preprocessing translation unit" was added, referring to a translation unit before macro expansion.
"character constant" was changed to "character literal".
The implication of "shall" in a Semantics paragraph of the C standard is spelled out as "undefined behavior".
When "shall" was used to express a requirement on an implementation, the requirement was rewritten.
§16.3¶2-3 (§6.10.3): Constraints on macro redefinition were made explicit using "ill-formed".
Although I frankly do not see the point of a few of the changes made to C99, for simplicity I recommend that they all be adopted, including the small edits.
The changes made to C++ should be forwarded to the C committee for their consideration.
§16¶1 (§6.10): Clarifications were added with respect to translation phases (specifically, processing of comments and expansion of macros). An accompanying example was added as a new paragraph immediately before §16.1.
§16 grammar rules (§6.10): New rules were added for text-line and non-directive, and group-part was changed to use them, to clarify (for example) that any line beginning with # is interpreted as a directive (even though it also matches the grammar of a non-directive line). Two new accompanying text paragraphs were also added before §16¶2.
§16.3 -- new paragraph (§6.10.3¶3): Added a requirement for white-space after the macro name in an object-like macro definition.
§16.3.4¶1 (§6.10.3.4): Added clarification that token-pasting and stringizing precede rescanning. Also minor editorial changes.
§16.3.5¶1 (§6.10.3.5): Added clarification that macros are not used after translation phase 4.
§16.6¶1 (§6.10.6): Added clarification that (non-standard) pragmas may cause translation failure or non-conforming behavior.
§2.1 phase 1 (§5.1.1.2): Clarified that source may contain multibyte characters.
§2.1 phase 2 (§5.1.1.2): Clarified that a line that ends with two backslashes cannot result in two line-splices.
§2.1 phase 4 (§5.1.1.2): Clarified that preprocessing directives do not survive past phase 4.
§2.1 phase 5 (§5.1.1.2): The mapping to the execution character set was clarified: a character not in the execution set must not be mapped to a null character, but different missing characters may be mapped to different execution characters.
§2.1 phase 7 (§5.1.1.2): Added clarification that the results of preprocessing are translated "as a translation unit".
Several examples were changed to include "C++-style" comments.
§16 grammar rules (§6.10): The definition of lparen was tweaked.
§16.3¶2-3 (§6.10.3): Definitions of object-like and function-like macro were moved down, and forward-referenced from here. Constraints were made explicit using "shall". Paragraphs were joined into one.
§16.3.1¶1 (§6.10.3.1): "translation unit" changed to "preprocessing file".
§16.3.3¶2 (§6.10.3.3): Clarify that special case for parameters in token-pasting applies only in function-like macros.
§16.3.4¶2 (§6.10.3.4): Change "Further" to "Furthermore".
§16.3.5¶6 (§6.10.3.5): A comment referring (misleadingly) to a previous example was deleted.
§2.1 phase 2 (§5.1.1.2): The description of an escaped new-line was rearranged.
§2.1 phase 3 (§5.1.1.2): Added "in a" in "or in a partial comment."
§16.1¶2 (§6.10.1): Added a restriction that only valid tokens may appear in a condition directive.
§16.1¶4 (§6.10.1): Added clarification that (most) keywords are replaced by zero in a condition directive.
§16.3¶8 (§6.10.3): Added clarification that object-like macros are rescanned.
§16.5¶1 (§6.10.5): Added a statement that #error causes a program to be ill-formed.
§2.1 phase 3 (§5.1.1.2): The footnote pointing out the context-dependent nature of tokenization (specifically with respect to header names) was made normative.
§2.1 phase 7 (§5.1.1.2): A note was added clarifying that there need not be a one-to-one correspondence between (for example) source files and external file system files.
§2.3¶1 (§5.2.1.1): Added clarification that trigraphs are recognized before preprocessing.
§2.3¶1 (§5.2.1.1): Added an example using several trigraphs. Deleted the example demonstrating a boundary condition (???/).
§16¶1 (§6.10): Modified to break up a very long sentence.
§16.1¶1 (§6.10.1): Spelled out "0" as "zero".
§16.3¶9 (§6.10.3): "arguments" was replaced with "parameters".
§2.3¶1 (§5.2.1.1): Description of trigraph processing changed from plural (collective) to singular (distributive). Also, trigraph sequences were formatted into a table.