______________________________________________________________________

  2   Lexical conventions                                        [lex]

______________________________________________________________________

1 A C++ program need not all be translated at the same time. The text
  of the program is kept in units called source files in this standard.
  A source file together with all the headers (_lib.headers_) and
  source files included (_cpp.include_) via the preprocessing directive
  #include, less any source lines skipped by any of the conditional
  inclusion (_cpp.cond_) preprocessing directives, is called a
  translation unit. Previously translated translation units may be
  preserved individually or in libraries. The separate translation
  units of a program communicate (_basic.link_) by (for example) calls
  to functions whose identifiers have external linkage, manipulation of
  objects whose identifiers have external linkage, or manipulation of
  data files. Translation units may be separately translated and then
  later linked to produce an executable program (_basic.link_).

  2.1  Phases of translation                              [lex.phases]

1 The precedence among the syntax rules of translation is specified by
  the following phases.1)

    1 Physical source file characters are mapped to the source
      character set (introducing new-line characters for end-of-line
      indicators) if necessary. Trigraph sequences (_lex.trigraph_)
      are replaced by corresponding single-character internal
      representations.

    2 Each instance of a new-line character and an immediately
      preceding backslash character is deleted, splicing physical
      source lines to form logical source lines. A source file that
      is not empty shall end in a new-line character, which shall not
      be immediately preceded by a backslash character.

    3 The source file is decomposed into preprocessing tokens
      (_lex.pptoken_) and sequences of white-space characters
      (including comments). A source file shall not end in a partial
      preprocessing token or partial comment. Each comment is
      replaced by one space character.
      New-line characters are retained. Whether each nonempty
      sequence of white-space characters other than new-line is
      retained or replaced by one space character is
      implementation-defined. The process of dividing a source file's
      characters into preprocessing tokens is context-dependent. For
      example, see the handling of < within a #include preprocessing
      directive.

    4 Preprocessing directives are executed and macro invocations are
      expanded. A #include preprocessing directive causes the named
      header or source file to be processed from phase 1 through
      phase 4, recursively.

    5 Each source character set member and escape sequence in
      character constants and string literals is converted to a
      member of the execution character set.

    6 Adjacent character string literal tokens are concatenated and
      adjacent wide string literal tokens are concatenated.

    7 White-space characters separating tokens are no longer
      significant. Each preprocessing token is converted into a
      token. (See _lex.digraph_). The resulting tokens are
      syntactically and semantically analyzed and translated. The
      result of this process starting from a single source file is
      called a translation unit.

    8 The translation units that will form a program are combined.
      All external object and function references are resolved.

      +-------                BEGIN BOX 1                -------+
      What about shared libraries?
      +-------                 END BOX 1                 -------+

      Library components are linked to satisfy external references to
      functions and objects not defined in the current translation.
      All such translator output is collected into a program image
      which contains information needed for execution in its
      execution environment.

  _________________________

  1) Implementations must behave as if these separate phases occur,
     although in practice different phases may be folded together.
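
  As a rough illustration of how phases 2, 3, 4 and 6 cooperate, the
  sketch below splices a macro definition across two physical lines and
  relies on concatenation of adjacent string literals. The names
  GREETING, greeting and answer are hypothetical, and constexpr and
  static_assert assume a C++11 or later compiler, which postdates this
  draft.

```cpp
// Phase 2: the backslash-newline pair below is deleted, so the
// directive occupies a single logical source line.
#define GREETING "hello, " \
                 "world"

// Phase 4: GREETING expands to two adjacent string literal tokens.
// Phase 6: the adjacent string literals are concatenated.
constexpr char greeting[] = GREETING;

// Twelve characters plus the terminating '\0'.
static_assert(sizeof(greeting) == 13, "spliced and concatenated");

// Phase 3: the comment is replaced by one space character, so it
// separates the two tokens exactly as white space would.
constexpr int/* comment */answer = 42;
```

  If the comment were simply deleted instead of being replaced by a
  space, the last declaration would read "intanswer" and fail to
  compile.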

  2.2  Trigraph sequences                               [lex.trigraph]

1 Before any other processing takes place, each occurrence of one of
  the following sequences of three characters (trigraph sequences) is
  replaced by the single character indicated in Table 1.

                     Table 1--trigraph sequences

  +----------------------+----------------------+----------------------+
  | trigraph replacement | trigraph replacement | trigraph replacement |
  +----------------------+----------------------+----------------------+
  |   ??=        #       |   ??(        [       |   ??<        {       |
  +----------------------+----------------------+----------------------+
  |   ??/        \       |   ??)        ]       |   ??>        }       |
  +----------------------+----------------------+----------------------+
  |   ??'        ^       |   ??!        |       |   ??-        ~       |
  +----------------------+----------------------+----------------------+

2 For example,
          ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
  becomes
          #define arraycheck(a,b) a[b] || b[a]

  2.3  Preprocessing tokens                              [lex.pptoken]

          preprocessing-token:
                  header-name
                  identifier
                  pp-number
                  character-constant
                  string-literal
                  operator
                  digraph
                  punctuator
                  each non-white-space character that cannot be one
                          of the above

1 Each preprocessing token that is converted to a token (_lex.token_)
  shall have the lexical form of a keyword, an identifier, a constant,
  a string literal, an operator, a digraph, or a punctuator.

2 A preprocessing token is the minimal lexical element of the language
  in translation phases 3 through 6. The categories of preprocessing
  token are: header names, identifiers, preprocessing numbers,
  character constants, string literals, operators, punctuators,
  digraphs, and single non-white-space characters that do not
  lexically match the other preprocessing token categories. If a ' or
  a " character matches the last category, the behavior is undefined.
  Preprocessing tokens can be separated by white space; this consists
  of comments (_lex.comment_), or white-space characters (space,
  horizontal tab, new-line, vertical tab, and form-feed), or both.
  As described in Clause _cpp_, in certain circumstances during
  translation phase 4, white space (or the absence thereof) serves as
  more than preprocessing token separation. White space may appear
  within a preprocessing token only as part of a header name or
  between the quotation characters in a character constant or string
  literal.

3 If the input stream has been parsed into preprocessing tokens up to
  a given character, the next preprocessing token is the longest
  sequence of characters that could constitute a preprocessing token.

4 The program fragment 1Ex is parsed as a preprocessing number token
  (one that is not a valid floating or integer constant token), even
  though a parse as the pair of preprocessing tokens 1 and Ex might
  produce a valid expression (for example, if Ex were a macro defined
  as +1). Similarly, the program fragment 1E1 is parsed as a
  preprocessing number (one that is a valid floating constant token),
  whether or not E is a macro name.

5 The program fragment x+++++y is parsed as x ++ ++ + y, which, if x
  and y are of built-in types, violates a constraint on increment
  operators, even though the parse x ++ + ++ y might yield a correct
  expression.

  2.4  Digraph sequences                                 [lex.digraph]

1 Alternate representations are provided for the operators and
  punctuators whose primary representations use the national
  characters. These include digraphs and additional reserved words.

          digraph:
                  <%
                  %>
                  <:
                  :>
                  %:

2 In translation phase 3 (_lex.phases_) the digraphs are recognized as
  preprocessing tokens. Then in translation phase 7 the digraphs and
  the additional identifiers listed below are converted into tokens
  identical to those from the corresponding primary representations,
  as shown in Table 2.

          Table 2--identifiers that are treated as operators

  +--------------------+--------------------+--------------------+
  | alternate  primary | alternate  primary | alternate  primary |
  +--------------------+--------------------+--------------------+
  | <%         {       | and        &&      | and_eq     &=      |
  +--------------------+--------------------+--------------------+
  | %>         }       | bitor      |       | or_eq      |=      |
  +--------------------+--------------------+--------------------+
  | <:         [       | or         ||      | xor_eq     ^=      |
  +--------------------+--------------------+--------------------+
  | :>         ]       | xor        ^       | not        !       |
  +--------------------+--------------------+--------------------+
  | %:         #       | compl      ~       | not_eq     !=      |
  +--------------------+--------------------+--------------------+
  | bitand     &       |                    |                    |
  +--------------------+--------------------+--------------------+

  2.5  Tokens                                              [lex.token]

          token:
                  identifier
                  keyword
                  literal
                  operator
                  punctuator

1 There are five kinds of tokens: identifiers, keywords, literals
  (which include strings and character and numeric constants),
  operators, and other separators. Blanks, horizontal and vertical
  tabs, newlines, formfeeds, and comments (collectively, white space),
  as described below, are ignored except as they serve to separate
  tokens. Some white space is required to separate otherwise adjacent
  identifiers, keywords, and literals.

2 If the input stream has been parsed into tokens up to a given
  preprocessing token, the next token is taken to be the longest
  string of preprocessing tokens that could possibly constitute a
  token.

  2.6  Comments                                          [lex.comment]

1 The characters /* start a comment, which terminates with the
  characters */. These comments do not nest. The characters // start
  a comment, which terminates with the next new-line character. If
  there is a form-feed or a vertical-tab character in such a comment,
  only white-space characters may appear between it and the new-line
  that terminates the comment; no diagnostic is required.
  The comment characters //, /*, and */ have no special meaning within
  a // comment and are treated just like other characters. Similarly,
  the comment characters // and /* have no special meaning within a /*
  comment.

  2.7  Identifiers                                          [lex.name]

          identifier:
                  nondigit
                  identifier nondigit
                  identifier digit

          nondigit: one of
                  _ a b c d e f g h i j k l m
                    n o p q r s t u v w x y z
                    A B C D E F G H I J K L M
                    N O P Q R S T U V W X Y Z

          digit: one of
                  0 1 2 3 4 5 6 7 8 9

1 An identifier is an arbitrarily long sequence of letters and digits.
  The first character must be a letter; the underscore _ counts as a
  letter. Upper- and lower-case letters are different. All
  characters are significant.

  2.8  Keywords                                              [lex.key]

1 The identifiers shown in Table 3 are reserved for use as keywords,
  and may not be used otherwise in phases 7 and 8:

                          Table 3--keywords

  +---------------------------------------------------------------+
  |asm         delete        if         reinterpret_cast  true    |
  |auto        do            inline     return            try     |
  |bool        double        int        short             typedef |
  |break       dynamic_cast  long       signed            typeid  |
  |case        else          mutable    sizeof            union   |
  |catch       enum          namespace  static            unsigned|
  |char        extern        new        static_cast       using   |
  |class       false         operator   struct            virtual |
  |const       float         private    switch            void    |
  |const_cast  for           protected  template          volatile|
  |continue    friend        public     this              wchar_t |
  |default     goto          register   throw             while   |
  +---------------------------------------------------------------+

2 Furthermore, the alternate representations shown in Table 4 for
  certain operators and punctuators (_lex.digraph_) are reserved and
  may not be used otherwise:

                  Table 4--alternate representations

  +------------------------------------------------------+
  |bitand   and      bitor    or       xor      compl    |
  |and_eq   or_eq    xor_eq   not      not_eq            |
  +------------------------------------------------------+

3 In addition, identifiers containing a double underscore (__) or
  beginning with an underscore and an upper-case letter are reserved
  for use by C++ implementations and standard libraries and should
  be avoided by users; no diagnostic is required.

4 The ASCII representation of C++ programs uses as operators or for
  punctuation the characters shown in Table 5.

             Table 5--operators and punctuation characters

  +------------------------------------------------------+
  |!   %   ^   &   *   (   )   -   +   =   {   }   |   ~ |
  |[   ]   \   ;   '   :   "   <   >   ?   ,   .   /     |
  +------------------------------------------------------+

  Table 6 shows the character combinations that are used as operators.

           Table 6--character combinations used as operators

  +------------------------------------------------------------+
  |->   ++   --   .*   ->*  <<   >>   <=   >=   ==   !=   &&   |
  |||   *=   /=   %=   +=   -=   <<=  >>=  &=   ^=   |=   ::   |
  +------------------------------------------------------------+

  Each is converted to a single token in translation phase 7
  (_lex.phases_).

5 Table 7 shows character combinations that are used as alternative
  representations for certain operators and punctuators
  (_lex.digraph_).

                          Table 7--digraphs

  +-------------------------+
  |<%   %>   <:   :>   %:   |
  +-------------------------+

  Each of these is also recognized as a single token in translation
  phases 3 and 7.

6 Table 8 shows additional tokens that are used by the preprocessor.

                    Table 8--preprocessing tokens

  +-------------------------+
  |#    ##   %:   %:%:      |
  +-------------------------+

7 Certain implementation-dependent properties, such as the type of a
  sizeof (_expr.sizeof_) and the ranges of fundamental types
  (_basic.fundamental_), are defined in the standard header files
  (_cpp.include_)
          <float.h>  <limits.h>  <stddef.h>
  These headers are part of the ISO C standard. In addition the
  headers
          <new.h>  <stdarg.h>  <stdlib.h>
  define the types of the most basic library functions. The last two
  headers are part of the ISO C standard; <new.h> is C++ specific.

  2.9  Literals                                          [lex.literal]

1 There are several kinds of literals (often referred to as constants).

          literal:
                  integer-literal
                  character-literal
                  floating-literal
                  string-literal
                  boolean-literal

  2.9.1  Integer literals                                   [lex.icon]

          integer-literal:
                  decimal-literal integer-suffixopt
                  octal-literal integer-suffixopt
                  hexadecimal-literal integer-suffixopt

          decimal-literal:
                  nonzero-digit
                  decimal-literal digit

          octal-literal:
                  0
                  octal-literal octal-digit

          hexadecimal-literal:
                  0x hexadecimal-digit
                  0X hexadecimal-digit
                  hexadecimal-literal hexadecimal-digit

          nonzero-digit: one of
                  1  2  3  4  5  6  7  8  9

          octal-digit: one of
                  0  1  2  3  4  5  6  7

          hexadecimal-digit: one of
                  0  1  2  3  4  5  6  7  8  9
                  a  b  c  d  e  f
                  A  B  C  D  E  F

          integer-suffix:
                  unsigned-suffix long-suffixopt
                  long-suffix unsigned-suffixopt

          unsigned-suffix: one of
                  u  U

          long-suffix: one of
                  l  L

1 An integer literal consisting of a sequence of digits is taken to be
  decimal (base ten) unless it begins with 0 (digit zero). A sequence
  of digits starting with 0 is taken to be an octal integer (base
  eight). The digits 8 and 9 are not octal digits. A sequence of
  digits preceded by 0x or 0X is taken to be a hexadecimal integer
  (base sixteen). The hexadecimal digits include a or A through f or
  F with decimal values ten through fifteen. For example, the number
  twelve can be written 12, 014, or 0XC.

2 The type of an integer literal depends on its form, value, and
  suffix. If it is decimal and has no suffix, it has the first of
  these types in which its value can be represented: int, long int,
  unsigned long int. If it is octal or hexadecimal and has no suffix,
  it has the first of these types in which its value can be
  represented: int, unsigned int, long int, unsigned long int. If it
  is suffixed by u or U, its type is the first of these types in which
  its value can be represented: unsigned int, unsigned long int. If
  it is suffixed by l or L, its type is the first of these types in
  which its value can be represented: long int, unsigned long int. If
  it is suffixed by ul, lu, uL, Lu, Ul, lU, UL, or LU, its type is
  unsigned long int.
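
  The base and suffix rules above can be checked mechanically. The
  following is a minimal sketch, assuming a C++11 or later compiler
  (for static_assert, decltype and <type_traits>, none of which exist
  in this draft); the variable names are hypothetical. Only small
  values are used, since the type lists for values that do not fit in
  int differ between this draft and later revisions of the language.

```cpp
#include <type_traits>

// The number twelve in decimal, octal, and hexadecimal notation.
constexpr int twelve_dec = 12;
constexpr int twelve_oct = 014;
constexpr int twelve_hex = 0XC;
static_assert(twelve_dec == twelve_oct && twelve_oct == twelve_hex,
              "all three spellings denote twelve");

// A small unsuffixed literal has type int in every base.
static_assert(std::is_same<decltype(12), int>::value, "decimal");
static_assert(std::is_same<decltype(014), int>::value, "octal");
static_assert(std::is_same<decltype(0XC), int>::value, "hexadecimal");

// A suffix moves the starting point of the type list.
static_assert(std::is_same<decltype(12u), unsigned int>::value, "u");
static_assert(std::is_same<decltype(12L), long int>::value, "L");
static_assert(std::is_same<decltype(12uL), unsigned long int>::value, "uL");
static_assert(std::is_same<decltype(12Lu), unsigned long int>::value, "Lu");
```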

3 A program is ill-formed if it contains an integer literal that
  cannot be represented by any of the allowed types.

  2.9.2  Character literals                                 [lex.ccon]

          character-literal:
                  'c-char-sequence'
                  L'c-char-sequence'

          c-char-sequence:
                  c-char
                  c-char-sequence c-char

          c-char:
                  any member of the source character set except
                          the single-quote ', backslash \, or
                          new-line character
                  escape-sequence

          escape-sequence:
                  simple-escape-sequence
                  octal-escape-sequence
                  hexadecimal-escape-sequence

          simple-escape-sequence: one of
                  \'  \"  \?  \\
                  \a  \b  \f  \n  \r  \t  \v

          octal-escape-sequence:
                  \ octal-digit
                  octal-escape-sequence octal-digit

          hexadecimal-escape-sequence:
                  \x hexadecimal-digit
                  hexadecimal-escape-sequence hexadecimal-digit

1 A character literal is one or more characters enclosed in single
  quotes, as in 'x', optionally preceded by the letter L, as in L'x'.
  Single character literals that do not begin with L have type char,
  with value equal to the numerical value of the character in the
  machine's character set. Multicharacter literals that do not begin
  with L have type int and implementation-defined value.

2 A character literal that begins with the letter L, such as L'ab', is
  a wide-character literal. Wide-character literals have type
  wchar_t. They are intended for character sets where a character
  does not fit into a single byte. Wide-character literals have
  implementation-defined values, regardless of the number of
  characters in the literal.

3 Certain nongraphic characters, the single quote ', the double quote
  ", ?, and the backslash \, may be represented according to Table 9.

                      Table 9--escape sequences

  +----------------------------------+
  |new-line         NL (LF)   \n     |
  |horizontal tab   HT        \t     |
  |vertical tab     VT        \v     |
  |backspace        BS        \b     |
  |carriage return  CR        \r     |
  |form feed        FF        \f     |
  |alert            BEL       \a     |
  |backslash        \         \\     |
  |question mark    ?         \?     |
  |single quote     '         \'     |
  |double quote     "         \"     |
  |octal number     ooo       \ooo   |
  |hex number       hhh       \xhhh  |
  +----------------------------------+

  If the character following a backslash is not one of those
  specified, the behavior is undefined. An escape sequence specifies
  a single character.

4 The escape \ooo consists of the backslash followed by one or more
  octal digits that are taken to specify the value of the desired
  character. The escape \xhhh consists of the backslash followed by x
  followed by one or more hexadecimal digits that are taken to specify
  the value of the desired character. There is no limit to the number
  of digits in either sequence. A sequence of octal or hexadecimal
  digits is terminated by the first character that is not an octal
  digit or a hexadecimal digit, respectively. The value of a
  character literal is implementation dependent if it exceeds that of
  the largest char (for ordinary literals) or wchar_t (for wide
  literals).

  2.9.3  Floating literals                                  [lex.fcon]

          floating-constant:
                  fractional-constant exponent-partopt floating-suffixopt
                  digit-sequence exponent-part floating-suffixopt

          fractional-constant:
                  digit-sequenceopt . digit-sequence
                  digit-sequence .

          exponent-part:
                  e signopt digit-sequence
                  E signopt digit-sequence

          sign: one of
                  +  -

          digit-sequence:
                  digit
                  digit-sequence digit

          floating-suffix: one of
                  f  l  F  L

1 A floating literal consists of an integer part, a decimal point, a
  fraction part, an e or E, an optionally signed integer exponent, and
  an optional type suffix. The integer and fraction parts both
  consist of a sequence of decimal (base ten) digits. Either the
  integer part or the fraction part (not both) may be missing; either
  the decimal point or the letter e (or E) and the exponent (not both)
  may be missing. The type of a floating literal is double unless
  explicitly specified by a suffix. The suffixes f and F specify
  float, the suffixes l and L specify long double.
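
  The optional parts and suffixes described above can be illustrated
  as follows. This is only a sketch, assuming a C++11 or later
  compiler for static_assert, decltype and <type_traits>; the variable
  names are hypothetical.

```cpp
#include <type_traits>

// The suffix, or its absence, selects the type.
static_assert(std::is_same<decltype(1.5), double>::value, "no suffix");
static_assert(std::is_same<decltype(1.5f), float>::value, "f");
static_assert(std::is_same<decltype(1.5F), float>::value, "F");
static_assert(std::is_same<decltype(1.5l), long double>::value, "l");
static_assert(std::is_same<decltype(1.5L), long double>::value, "L");

// Either the integer part or the fraction part may be missing.
constexpr double no_fraction = 1.;  // fraction part missing
constexpr double no_integer = .5;   // integer part missing

// The decimal point may be missing when an exponent is present.
constexpr double no_point = 1e3;

// The exponent may carry an explicit sign.
constexpr double signed_exponent = 25e-2;
```

  A literal such as 1 with neither a decimal point nor an exponent is
  not a floating literal at all; it is an integer literal
  (_lex.icon_).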

  2.9.4  String literals                                  [lex.string]

          string-literal:
                  "s-char-sequenceopt"
                  L"s-char-sequenceopt"

          s-char-sequence:
                  s-char
                  s-char-sequence s-char

          s-char:
                  any member of the source character set except
                          the double-quote ", backslash \, or
                          new-line character
                  escape-sequence

1 A string literal is a sequence of characters (as defined in
  _lex.ccon_) surrounded by double quotes, optionally beginning with
  the letter L, as in "..." or L"...". A string literal that does not
  begin with L has type array of n char and static storage duration
  (_basic.stc_), where n is the size of the string as defined below,
  and is initialized with the given characters. Whether all string
  literals are distinct (that is, are stored in nonoverlapping
  objects) is implementation dependent. The effect of attempting to
  modify a string literal is undefined.

2 A string literal that begins with L, such as L"asdf", is a
  wide-character string. A wide-character string is of type array of
  n wchar_t, where n is the size of the string as defined below.
  Concatenation of ordinary and wide-character string literals is
  undefined.

  +-------                BEGIN BOX 2                -------+
  Should this render the program ill-formed? Or is it deliberately
  undefined to encourage extensions?
  +-------                 END BOX 2                 -------+

3 Adjacent string literals are concatenated. Characters in
  concatenated strings are kept distinct. For example,
          "\xA" "B"
  contains the two characters '\xA' and 'B' after concatenation (and
  not the single hexadecimal character '\xAB').

4 After any necessary concatenation '\0' is appended so that programs
  that scan a string can find its end. The size of a string is the
  number of its characters including this terminator. Within a
  string, the double quote character " must be preceded by a \.

5 Escape sequences in string literals have the same meaning as in
  character literals (_lex.ccon_).

  2.9.5  Boolean literals                                   [lex.bool]

          boolean-literal:
                  false
                  true

1 The Boolean literals are the keywords false and true.
Such literals have type bool and the given values. They are not lvalues.
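
  The size and concatenation rules of _lex.string_ and the typing rule
  for Boolean literals lend themselves to a compile-time check. The
  sketch below assumes a C++11 or later compiler (static_assert,
  constexpr, decltype and <type_traits> are not part of this draft),
  and the name joined is hypothetical.

```cpp
#include <type_traits>

// Escape sequences are interpreted per literal before concatenation,
// so "\xA" "B" holds '\xA', 'B', and the appended '\0', not '\xAB'.
constexpr char joined[] = "\xA" "B";
static_assert(sizeof(joined) == 3, "two characters plus terminator");
static_assert(joined[0] == '\xA', "first character");
static_assert(joined[1] == 'B', "second character");
static_assert(joined[2] == '\0', "terminator");

// The size of a string counts the terminating '\0'.
static_assert(sizeof("asdf") == 5, "array of 5 char");
static_assert(sizeof(L"asdf") == 5 * sizeof(wchar_t), "array of 5 wchar_t");

// The Boolean literals have type bool.
static_assert(std::is_same<decltype(true), bool>::value, "true");
static_assert(std::is_same<decltype(false), bool>::value, "false");
```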