Ascii file: lex.lat


  ______________________________________________________________________

  2   Lexical conventions                                [lex]

  ______________________________________________________________________

1 A C++ program need not all be translated at the same time.  The text
  of the program is kept in units called source files in this standard.
  A source file together with all the headers (_lib.headers_) and source
  files included (_cpp.include_) via the preprocessing directive
  #include, less any source lines skipped by any of the conditional
  inclusion (_cpp.cond_) preprocessing directives, is called a
  translation unit.  Previously translated translation units can be
  preserved individually or in libraries.  The separate translation
  units of a program communicate (_basic.link_) by (for example) calls
  to functions whose identifiers have external linkage, manipulation of
  objects whose identifiers have external linkage, or manipulation of
  data files.  Translation units can be separately translated and then
  later linked to produce an executable program (_basic.link_).

  2.1  Phases of translation                                [lex.phases]

1 The precedence among the syntax rules of translation is  specified  by
  the following phases.1)

    1 Physical source file characters are mapped to the source character
      set  (introducing  new-line characters for end-of-line indicators)
      if necessary.  Trigraph sequences (_lex.trigraph_) are replaced by
      corresponding single-character internal representations.

    2 Each instance of a new-line character and an immediately preceding
      backslash character is deleted, splicing physical source lines  to
      form  logical source lines.  A source file that is not empty shall
      end in a new-line character, which shall not be immediately
      preceded by a backslash character.

    3 The source file is decomposed into preprocessing tokens
      (_lex.pptoken_) and sequences of white-space characters (including
      comments).  A source file shall not end in a partial preprocessing
      token or partial comment2).  Each comment is replaced by one space
      character.  New-line characters are retained.  Whether each
      nonempty sequence of white-space characters other than new-line is
      retained or replaced by one space character is
      implementation-defined.  The process of dividing a source file's
      characters into preprocessing tokens is context-dependent.  For
      example, see the handling of < within a #include preprocessing
      directive.
  _________________________
  1) Implementations shall behave as if these separate phases occur,
  although in practice different phases might be folded together.
  2) A partial preprocessing token would arise from a source file ending
  in one or more characters of a multi-character token followed by a
  line-splicing backslash.  A partial comment would arise from a source
  file ending with an unclosed /* comment, or a // comment line that
  ends with a line-splicing backslash.

    4 Preprocessing directives are executed and  macro  invocations  are
      expanded.   A  #include  preprocessing  directive causes the named
      header or source file to be processed from phase 1  through  phase
      4, recursively.

    5 Each source character set member and escape sequence in character
      constants and string literals is converted to a member of the
      execution character set.

    6 Adjacent  character  string  literal  tokens  are concatenated and
      adjacent wide string literal tokens are concatenated.

    7 White-space characters separating tokens are no longer
      significant.  Each preprocessing token is converted into a token
      (see _lex.token_).  The resulting tokens are syntactically and
      semantically analyzed and translated.  The result of this process
      starting from a single source file is called a translation unit.

    8 The translation units that will form a program are combined.   All
      external object and function references are resolved.

  +-------                 BEGIN BOX 1                -------+
    What about shared libraries?
  +-------                  END BOX 1                 -------+

    Library components are linked to satisfy external references to
    functions and objects not defined in the current translation.  All
    such translator output is collected into a program image which
    contains information needed for execution in its execution
    environment.
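
  As an illustration of phases 2 and 6 only, the following sketch shows
  a macro definition spliced from three physical lines and two adjacent
  string literals that become one.  The names SUM_OF_THREE, greeting,
  spliced_sum and check_phases are hypothetical and are not part of this
  standard.

```cpp
#include <cassert>
#include <cstring>

// Phase 2: each backslash followed by a new-line is deleted, so the three
// physical lines below form one logical #define line.
#define SUM_OF_THREE(a, b, c) \
    ((a) +                    \
     (b) + (c))

// Phase 6: adjacent string literal tokens are concatenated into one literal.
inline const char* greeting() { return "hello, " "world"; }

inline int spliced_sum() { return SUM_OF_THREE(1, 2, 3); }

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_phases() {
    assert(spliced_sum() == 6);
    assert(std::strcmp(greeting(), "hello, world") == 0);
}
```

  Calling check_phases() from a main function is enough to exercise
  both phases.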

  2.2  Trigraph sequences                                 [lex.trigraph]

1 Before any other processing takes place, each occurrence of one of the
  following  sequences  of  three  characters  (trigraph  sequences)  is
  replaced by the single character indicated in Table 1.

                       Table 1--trigraph sequences

  +-----------------------+------------------------+------------------------+
  |trigraph   replacement | trigraph   replacement | trigraph   replacement |
  +-----------------------+------------------------+------------------------+
  |  ??=           #      |   ??(           [      |   ??<           {      |
  +-----------------------+------------------------+------------------------+
  |  ??/           \      |   ??)           ]      |   ??>           }      |
  +-----------------------+------------------------+------------------------+
  |  ??'           ^      |   ??!           |      |   ??-           ~      |
  +-----------------------+------------------------+------------------------+

2 For example,
          ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
  becomes
          #define arraycheck(a,b) a[b] || b[a]
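
  The replacement in Table 1 can be modelled on ordinary strings.  The
  sketch below is not how any particular implementation works; the
  helper names trigraph_replacement, replace_trigraphs and
  check_trigraphs are hypothetical.  The test input is written with the
  \? escape so that the example source itself contains no trigraphs.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps the third character of a candidate trigraph
// to its replacement (Table 1), or returns '\0' if there is none.
inline char trigraph_replacement(char c) {
    switch (c) {
        case '=':  return '#';
        case '(':  return '[';
        case '<':  return '{';
        case '/':  return '\\';
        case ')':  return ']';
        case '>':  return '}';
        case '\'': return '^';
        case '!':  return '|';
        case '-':  return '~';
        default:   return '\0';
    }
}

// Hypothetical helper: scans left to right and replaces each trigraph.
inline std::string replace_trigraphs(const std::string& in) {
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (i + 2 < in.size() && in[i] == '?' && in[i + 1] == '?') {
            const char r = trigraph_replacement(in[i + 2]);
            if (r != '\0') {
                out += r;
                i += 2;  // skip the remaining two characters of the trigraph
                continue;
            }
        }
        out += in[i];
    }
    return out;
}

// Hypothetical self-check reproducing the arraycheck example.
inline void check_trigraphs() {
    assert(replace_trigraphs(
               "?\?=define arraycheck(a,b) a?\?(b?\?) ?\?!?\?! b?\?(a?\?)")
           == "#define arraycheck(a,b) a[b] || b[a]");
}
```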

  2.3  Preprocessing tokens                                [lex.pptoken]

  +-------                 BEGIN BOX 2                -------+
  We have deleted the non-terminal for 'digraph', because the alternate
  representations are just alternative ways of expressing a
  "first-class" preprocessing token.  In C, # and ## are grouped with
  operators, but that would involve more work in clause 13, and wouldn't
  fit the "spirit of C++".  Instead, we simply list under 'preprocessing
  token' all the valid preprocessing tokens.  They are not further
  categorized until phase 7, in which they are actual tokens.
  +-------                  END BOX 2                 -------+

          preprocessing-token:
                  header-name
                  identifier
                  pp-number
                  character-constant
                  string-literal
                  preprocessing-op-or-punc
                  each non-white-space character that cannot be one of the above

1 Each  preprocessing  token  that is converted to a token (_lex.token_)
  shall have the lexical form of a keyword, an identifier, a constant, a
  string literal, an operator, or a punctuator.

2 A preprocessing token is the minimal lexical element of the language
  in translation phases 3 through 6.  The categories of preprocessing
  token are: header names, identifiers, preprocessing numbers, character
  constants, string literals, preprocessing-op-or-punc, and single
  non-white-space characters that do not lexically match the other
  preprocessing token categories.  If a ' or a " character matches the
  last category, the behavior is undefined.  Preprocessing tokens can be
  separated by white space; this consists of comments (_lex.comment_),
  or white-space characters (space, horizontal tab, new-line, vertical
  tab, and form-feed), or both.  As described in Clause _cpp_, in
  certain circumstances during translation phase 4, white space (or the
  absence thereof) serves as more than preprocessing token separation.
  White space can appear within a preprocessing token only as part of a
  header name or between the quotation characters in a character
  constant or string literal.

3 If  the input stream has been parsed into preprocessing tokens up to a
  given character, the next preprocessing token is the longest  sequence
  of characters that could constitute a preprocessing token.

4 The program fragment 1Ex is parsed as a preprocessing number token
  (one that is not a valid floating or integer constant token), even
  though a parse as the pair of preprocessing tokens 1 and Ex might
  produce a valid expression (for example, if Ex were a macro defined as
  +1).  Similarly, the program fragment 1E1 is parsed as a preprocessing
  number (one that is a valid floating constant token), whether or not E
  is a macro name.

5 The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and
  y are of built-in types, violates a constraint on increment operators,
  even though the parse x ++ + ++ y might yield a correct expression.
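
  The longest-match rule can be observed with a shorter fragment that is
  well-formed.  In the sketch below, x+++y is divided into x ++ + y; the
  function names munch_sum and check_maximal_munch are hypothetical.

```cpp
#include <cassert>

// x+++y is tokenized as x ++ + y (longest token first), not as x + ++y.
inline int munch_sum(int& x, int y) { return x+++y; }

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_maximal_munch() {
    int x = 1;
    assert(munch_sum(x, 2) == 3);  // old value of x (1) plus 2
    assert(x == 2);                // x was post-incremented
    assert(1E1 == 10.0);           // 1E1 is one preprocessing number
}
```

  Had the fragment been divided as x + ++y, the result would have been 4
  rather than 3.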

  2.4  Alternate tokens                                    [lex.digraph]

1 Alternate  token  representations  are provided for some operators and
  punctuators3).

2 In  all  respects  of  the  language, each alternate token behaves the
  same,  respectively,  as its primary token, except for its spelling4).
  The set of alternate tokens is defined in Table 2.

  _________________________
  3) These include digraphs and additional reserved words.  The term
  digraph (token consisting of two characters) is not perfectly
  descriptive, since one of the alternate preprocessing-tokens is %:%:
  and of course several primary tokens contain two characters.
  Nonetheless, those alternate tokens that aren't lexical keywords are
  colloquially known as digraphs.
  4) Thus [ and <: behave differently when stringized
  (_cpp.stringize_), but can otherwise be freely interchanged.

                        Table 2--alternate tokens

    +--------------------+---------------------+---------------------+
    |alternate   primary | alternate   primary | alternate   primary |
    +--------------------+---------------------+---------------------+
    |   <%          {    |    and        &&    |  and_eq       &=    |
    +--------------------+---------------------+---------------------+
    |   %>          }    |   bitor        |    |   or_eq       |=    |
    +--------------------+---------------------+---------------------+
    |   <:          [    |    or         ||    |  xor_eq       ^=    |
    +--------------------+---------------------+---------------------+
    |   :>          ]    |    xor         ^    |    not         !    |
    +--------------------+---------------------+---------------------+
    |   %:          #    |   compl        ~    |  not_eq       !=    |
    +--------------------+---------------------+---------------------+
    |  %:%:        ##    |  bitand        &    |                     |
    +--------------------+---------------------+---------------------+
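
  A brief sketch of Table 2 in use follows.  It assumes an
  implementation that accepts the alternate tokens without any extra
  header or option; STRINGIZE, sum3, in_range and check_alternate_tokens
  are hypothetical names.  The last two assertions show the one visible
  difference, the spelling, by stringizing.

```cpp
#include <cassert>
#include <cstring>

#define STRINGIZE(x) #x  // hypothetical helper macro for this sketch

// The digraphs <% %> <: :> stand in for { } [ ].
inline int sum3() <%
    int a<:3:> = <% 1, 2, 3 %>;
    return a<:0:> + a<:1:> + a<:2:>;
%>

// and / not behave exactly as && / !.
inline bool in_range(int v, int lo, int hi) {
    return v >= lo and v <= hi and not (v == 0);
}

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_alternate_tokens() {
    assert(sum3() == 6);
    assert(in_range(5, 1, 10));
    assert((6 bitand 3) == 2);
    assert((6 bitor 3) == 7);
    assert((6 xor 3) == 5);
    // Only the spelling differs, which stringizing makes visible.
    assert(std::strcmp(STRINGIZE(<:), "<:") == 0);
    assert(std::strcmp(STRINGIZE([), "[") == 0);
}
```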

  2.5  Tokens                                                [lex.token]
          token:
                  identifier
                  keyword
                  literal
                  operator
                  punctuator

1 There are five kinds of tokens: identifiers, keywords, literals (which
  include  strings  and character and numeric constants), operators, and
  other separators.  Blanks, horizontal  and  vertical  tabs,  newlines,
  formfeeds,  and  comments  (collectively,  white  space), as described
  below, are ignored except as they  serve  to  separate  tokens.   Some
  white  space  is  required to separate otherwise adjacent identifiers,
  keywords, and literals.

  2.6  Comments                                            [lex.comment]

1 The characters /* start a comment, which terminates with the
  characters */.  These comments do not nest.  The characters // start a
  comment, which terminates with the next new-line character.  If there
  is a form-feed or a vertical-tab character in such a comment, only
  white-space characters can appear between it and the new-line that
  terminates the comment; no diagnostic is required.  The comment
  characters //, /*, and */ have no special meaning within a // comment
  and are treated just like other characters.  Similarly, the comment
  characters // and /* have no special meaning within a /* comment.
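
  The interaction of the two comment forms can be sketched as follows;
  comment_demo and check_comments are hypothetical names.

```cpp
#include <cassert>

inline int comment_demo() {
    // The // inside the block comment is ordinary text, so the comment
    // still ends at */ and the trailing "+ 1" is compiled.
    int a = 1 /* a // here does not start a line comment */ + 1;
    // The /* inside the line comment is ordinary text, so the comment
    // ends at the new-line and the next line is compiled.
    int b = 4 // a /* here does not open a block comment
            + 2;
    return a * 10 + b;
}

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_comments() {
    assert(comment_demo() == 26);  // a == 2, b == 6
}
```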

  2.7  Identifiers                                            [lex.name]
          identifier:
                  nondigit
                  identifier nondigit
                  identifier digit

          nondigit: one of
                  _ a b c d e f g h i j k l m
                    n o p q r s t u v w x y z
                    A B C D E F G H I J K L M
                    N O P Q R S T U V W X Y Z
          digit: one of
                  0 1 2 3 4 5 6 7 8 9

1 An identifier is an arbitrarily long sequence of letters  and  digits.
  The  first character is a letter; the underscore _ counts as a letter.
  Upper- and lower-case letters are different.  All characters are
  significant.
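
  The grammar above can be checked mechanically.  The following sketch
  tests only the lexical form of an identifier and does not exclude
  keywords; is_nondigit, is_digit, is_identifier and check_identifiers
  are hypothetical names.

```cpp
#include <cassert>
#include <string>

// nondigit: the underscore and the 52 letters listed in the grammar.
inline bool is_nondigit(char c) {
    static const std::string nondigits =
        "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    return nondigits.find(c) != std::string::npos;
}

// digit: one of 0 through 9.
inline bool is_digit(char c) {
    static const std::string digits = "0123456789";
    return digits.find(c) != std::string::npos;
}

// identifier: a nondigit followed by any number of nondigits and digits.
inline bool is_identifier(const std::string& s) {
    if (s.empty() || !is_nondigit(s[0])) return false;
    for (std::string::size_type i = 1; i < s.size(); ++i) {
        if (!is_nondigit(s[i]) && !is_digit(s[i])) return false;
    }
    return true;
}

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_identifiers() {
    assert(is_identifier("_x1"));
    assert(!is_identifier("1x"));  // the first character must be a nondigit
    int count = 1, Count = 2;      // case matters: two distinct identifiers
    assert(count != Count);
}
```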

  2.8  Keywords                                                [lex.key]

1 The identifiers shown in Table 3 are reserved for use as keywords, and
  shall not be used otherwise in phases 7 and 8:

                            Table 3--keywords

  +--------------------------------------------------------------------------+
  |asm          do             inline             short         typeid       |
  |auto         double         int                signed        union        |
  |bool         dynamic_cast   long               sizeof        unsigned     |
  |break        else           mutable            static        using        |
  |case         enum           namespace          static_cast   virtual      |
  |catch        explicit       new                struct        void         |
  |char         extern         operator           switch        volatile     |
  |class        false          private            template      wchar_t      |
  |const        float          protected          this          while        |
  |const_cast   for            public             throw                      |
  |continue     friend         register           true                       |
  |default      goto           reinterpret_cast   try                        |
  |delete       if             return             typedef                    |
  +--------------------------------------------------------------------------+

2 Furthermore, the alternate representations shown in Table 4 for
  certain operators and punctuators (_lex.digraph_) are reserved and shall
  not be used otherwise:

                    Table 4--alternate representations

             +-----------------------------------------------+
             |bitand   and     bitor    or    xor      compl |
             |and_eq   or_eq   xor_eq   not   not_eq         |
             +-----------------------------------------------+

3 In addition, identifiers containing a double underscore (__) or
  beginning with an underscore and an upper-case letter are reserved for
  use by C++ implementations and standard libraries and should be
  avoided by users; no diagnostic is required.

4 The lexical representation of C++ programs includes a number of
  preprocessing tokens which are used in the syntax of the preprocessor
  or are converted into tokens for operators and punctuators:
          preprocessing-op-or-punc: one of
          {       }       [       ]       #       ##      =       (       )       ,
          <:      :>      <%      %>      %:      %:%:    ;       :       ...
          new     delete  new[]   delete[]        ?
          +       -       *       /       %       ^       &       |       ~
          !       =       <       >       +=      -=      *=      /=      %=
          ^=      &=      |=      <<      >>      >>=     <<=     ==      !=
          <=      >=      &&      ||      ++      --      ,       ->*     ->
          and     bitand  bitor   compl   new<%%> delete<%%>
          not     or      xor     and_eq  not_eq  or_eq   xor_eq

  After  preprocessing,  each preprocessing-op-or-punc is converted to a
  single token in translation phase 7 (_lex.phases_).

5 Certain implementation-dependent properties, such as the type of a
  sizeof (_expr.sizeof_) expression, the ranges of fundamental types
  (_basic.fundamental_), and the types of the most basic library
  functions are defined in the standard header files
  (_lib.language.support_)
          <float.h>   <limits.h>   <stddef.h>
  These headers are part of the ISO C standard.  In addition the headers
          <new.h>   <stdarg.h>   <stdlib.h>
  define  the  types  of the most basic library functions.  The last two
  headers are part of the ISO C standard; <new.h> is C++ specific.

  2.9  Literals                                            [lex.literal]

1 There are several kinds of literals (often referred to as  constants).
          literal:
                  integer-literal
                  character-literal
                  floating-literal
                  string-literal
                  boolean-literal

  2.9.1  Integer literals                                     [lex.icon]
          integer-literal:
                  decimal-literal integer-suffixopt
                  octal-literal integer-suffixopt
                  hexadecimal-literal integer-suffixopt
          decimal-literal:
                  nonzero-digit
                  decimal-literal digit
          octal-literal:
                  0
                  octal-literal octal-digit

          hexadecimal-literal:
                  0x hexadecimal-digit
                  0X hexadecimal-digit
                  hexadecimal-literal hexadecimal-digit
          nonzero-digit: one of
                  1  2  3  4  5  6  7  8  9
          octal-digit: one of
                  0  1  2  3  4  5  6  7
          hexadecimal-digit: one of
                  0  1  2  3  4  5  6  7  8  9
                  a  b  c  d  e  f
                  A  B  C  D  E  F
          integer-suffix:
                  unsigned-suffix long-suffixopt
                  long-suffix unsigned-suffixopt
          unsigned-suffix: one of
                  u  U
          long-suffix: one of
                  l  L

1 An  integer  literal consisting of a sequence of digits is taken to be
  decimal (base ten) unless it begins with 0 (digit zero).   A  sequence
  of  digits  starting  with  0  is  taken  to be an octal integer (base
  eight).  The digits 8 and 9 are not octal digits.  A sequence of
  digits preceded by 0x or 0X is taken to be a hexadecimal integer (base
  sixteen).  The hexadecimal digits include a or A through f or  F  with
  decimal  values  ten  through fifteen.  For example, the number twelve
  can be written 12, 014, or 0XC.
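
  The three spellings of twelve can be compared directly; the function
  name check_twelve is hypothetical.

```cpp
#include <cassert>

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_twelve() {
    assert(12 == 014);   // octal: 1*8 + 4
    assert(12 == 0XC);   // hexadecimal: C is twelve
    assert(0xc == 0XC);  // the case of the prefix and digits does not matter
    assert(010 == 8);    // a leading 0 makes the literal octal
}
```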

2 The type of an integer literal depends on its form, value, and suffix.
  If it is decimal and has no suffix, it has the first of these types in
  which its value can be represented: int, long int, unsigned long  int.
  If  it  is octal or hexadecimal and has no suffix, it has the first of
  these types in which its value can be represented: int, unsigned  int,
  long int, unsigned long int.  If it is suffixed by u or U, its type is
  the first of these types  in  which  its  value  can  be  represented:
  unsigned  int,  unsigned  long  int.  If it is suffixed by l or L, its
  type is the first of these types in which its value can be
  represented: long int, unsigned long int.  If it is suffixed by ul,
  lu, uL, Lu, Ul, lU, UL, or LU, its type is unsigned long int.

3 A program is ill-formed if it contains an integer literal that  cannot
  be represented by any of the allowed types.
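
  The effect of the suffixes on small values can be sketched as follows.
  The sketch relies on static_assert, decltype and std::is_same, which
  are facilities of later revisions of the language and are an
  assumption here, not part of this draft; literal_types_ok is a
  hypothetical name.  Only cases whose type does not depend on the
  ranges of the implementation's types are shown.

```cpp
#include <cassert>
#include <type_traits>

// Each assertion is checked at translation time.
static_assert(std::is_same<decltype(12), int>::value, "no suffix, fits int");
static_assert(std::is_same<decltype(12u), unsigned int>::value, "u");
static_assert(std::is_same<decltype(12U), unsigned int>::value, "U");
static_assert(std::is_same<decltype(12l), long>::value, "l");
static_assert(std::is_same<decltype(12L), long>::value, "L");
static_assert(std::is_same<decltype(12uL), unsigned long>::value, "uL");
static_assert(std::is_same<decltype(12LU), unsigned long>::value, "LU");
static_assert(std::is_same<decltype(0xCu), unsigned int>::value, "hex, u");

// Hypothetical run-time view of the same facts.
inline bool literal_types_ok() {
    return std::is_same<decltype(12), int>::value
        && std::is_same<decltype(12u), unsigned int>::value
        && std::is_same<decltype(12L), long>::value
        && std::is_same<decltype(12ul), unsigned long>::value;
}
```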

  2.9.2  Character literals                                   [lex.ccon]
          character-literal:
                  'c-char-sequence'
                  L'c-char-sequence'
          c-char-sequence:
                  c-char
                  c-char-sequence c-char

          c-char:
                  any member of the source character set except
                          the single-quote ', backslash \, or new-line character
                  escape-sequence
          escape-sequence:
                  simple-escape-sequence
                  octal-escape-sequence
                  hexadecimal-escape-sequence
          simple-escape-sequence: one of
                  \'  \"  \?  \\
                  \a  \b  \f  \n  \r  \t  \v
          octal-escape-sequence:
                  \ octal-digit
                  octal-escape-sequence octal-digit
          hexadecimal-escape-sequence:
                  \x hexadecimal-digit
                  hexadecimal-escape-sequence hexadecimal-digit

1 A  character  literal  is  one  or  more characters enclosed in single
  quotes, as in 'x', optionally preceded by the letter L,  as  in  L'x'.
  Single  character  literals  that  do not begin with L have type char,
  with value equal to the  numerical  value  of  the  character  in  the
  machine's  character  set.   Multicharacter literals that do not begin
  with L have type int and implementation-defined value.

2 A character literal that begins with the letter L, such as L'ab', is a
  wide-character  literal.   Wide-character  literals have type wchar_t.
  They are intended for character sets where a character  does  not  fit
  into  a  single  byte.   Wide-character  literals have implementation-
  defined values, regardless of the number of characters in the literal.

3 Certain nongraphic characters, the single quote ', the double quote ",
  ?, and the backslash \, can be represented according to Table 5.

                        Table 5--escape sequences

                   +----------------------------------+
                   |new-line          NL (LF)   \n    |
                   |horizontal tab    HT        \t    |
                   |vertical tab      VT        \v    |
                   |backspace         BS        \b    |
                   |carriage return   CR        \r    |
                   |form feed         FF        \f    |
                   |alert             BEL       \a    |
                   |backslash         \         \\    |
                   |question mark     ?         \?    |
                   |single quote      '         \'    |
                   |double quote      "         \"    |
                   |octal number      ooo       \ooo  |
                   |hex number        hhh       \xhhh |
                   +----------------------------------+

  If the character following a backslash is not one of those  specified,
  the  behavior  is  undefined.   An  escape sequence specifies a single
  character.

4 The escape \ooo consists of the backslash followed by one or more
  octal digits that are taken to specify the value of the desired
  character.  The escape \xhhh consists of the backslash followed by x
  followed by one or more hexadecimal digits that are taken to specify
  the value of the desired character.  There is no limit to the number
  of digits in either sequence.  A sequence of octal or hexadecimal
  digits is terminated by the first character that is not an octal digit
  or a hexadecimal digit, respectively.  The value of a character
  literal is implementation-dependent if it exceeds that of the largest
  char (for ordinary literals) or wchar_t (for wide literals).
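
  A short sketch of the types of character literals and of the numeric
  escapes follows.  It uses static_assert, decltype and std::is_same
  from later revisions of the language, which is an assumption and not
  part of this draft; check_char_literals is a hypothetical name.  No
  particular execution character set is assumed.

```cpp
#include <cassert>
#include <type_traits>

static_assert(std::is_same<decltype('x'), char>::value, "ordinary: char");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value, "wide: wchar_t");

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_char_literals() {
    assert('\101' == '\x41');  // octal 101 and hexadecimal 41 are both 65
    assert('\x41' == 65);
    assert('\\' != '\'');      // backslash and single quote, both escaped
    assert(sizeof('x') == 1);  // a single-character literal has type char
}
```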

  2.9.3  Floating literals                                    [lex.fcon]
          floating-literal:
                  fractional-constant exponent-partopt floating-suffixopt
                  digit-sequence exponent-part floating-suffixopt
          fractional-constant:
                  digit-sequenceopt . digit-sequence
                  digit-sequence .
          exponent-part:
                  e signopt digit-sequence
                  E signopt digit-sequence
          sign: one of
                  +  -
          digit-sequence:
                  digit
                  digit-sequence digit
          floating-suffix: one of
                  f  l  F  L

1 A floating literal consists of an integer part, a decimal point, a
  fraction part, an e or E, an optionally signed integer exponent, and
  an optional type suffix.  The integer and fraction parts both consist
  of a sequence of decimal (base ten) digits.  Either the integer part
  or the fraction part (not both) can be missing; either the decimal
  point or the letter e (or E) and the exponent (not both) can be
  missing.  The type of a floating literal is double unless explicitly
  specified by a suffix.  The suffixes f and F specify float; the
  suffixes l and L specify long double.
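
  The optional parts and the suffixes can be sketched as follows.  As
  before, static_assert, decltype and std::is_same come from later
  revisions of the language and are an assumption here;
  check_floating_literals is a hypothetical name.  The values compared
  are all exactly representable in binary floating point.

```cpp
#include <cassert>
#include <type_traits>

static_assert(std::is_same<decltype(1.5), double>::value, "no suffix");
static_assert(std::is_same<decltype(1.5f), float>::value, "f");
static_assert(std::is_same<decltype(1.5F), float>::value, "F");
static_assert(std::is_same<decltype(1.5l), long double>::value, "l");
static_assert(std::is_same<decltype(1.5L), long double>::value, "L");

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_floating_literals() {
    assert(1e3 == 1000.0);  // decimal point missing, exponent present
    assert(.5 == 0.5);      // integer part missing
    assert(5. == 5.0);      // fraction part missing
    assert(15e-1 == 1.5);   // signed exponent
    assert(1.5E1 == 15.0);  // upper-case E
}
```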

  2.9.4  String literals                                    [lex.string]
          string-literal:
                  "s-char-sequenceopt"
                  L"s-char-sequenceopt"
          s-char-sequence:
                  s-char
                  s-char-sequence s-char
          s-char:
                  any member of the source character set except
                          the double-quote ", backslash \, or new-line character
                  escape-sequence

1 A   string  literal  is  a  sequence  of  characters  (as  defined  in
  _lex.ccon_) surrounded by double quotes, optionally beginning with the
  letter L, as in "..." or L"...".  A string literal that does not begin
  with  L  has  type  array  of  n  char  and  static  storage  duration
  (_basic.stc_), where n is the size of the string as defined below, and
  is initialized with the given characters.  Whether all string literals
  are distinct (that is, are stored in nonoverlapping objects) is
  implementation-dependent.  The effect of attempting to modify a string
  literal is undefined.

2 A  string  literal  that  begins  with  L, such as L"asdf", is a wide-
  character string.  A wide-character string  is  of  type  array  of  n
  wchar_t, where n is the size of the string as defined below.
  Concatenation of ordinary and wide-character string literals is
  undefined.

  +-------                 BEGIN BOX 3                -------+
  Should this render the program  ill-formed?   Or  is  it  deliberately
  undefined to encourage extensions?
  +-------                  END BOX 3                 -------+

3 Adjacent string literals are concatenated.  Characters in concatenated
  strings are kept distinct.  For example,
          "\xA" "B"
  contains the two characters '\xA' and 'B' after concatenation (and not
  the single hexadecimal character '\xAB').

4 After any necessary concatenation '\0' is appended so that programs
  that scan a string can find its end.  The size of a string is the
  number of its characters including this terminator.  Within a string,
  the double quote character " shall be preceded by a \.
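
  The size rule and the concatenation rule can be sketched together;
  check_string_literals is a hypothetical name.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_string_literals() {
    assert(sizeof("abc") == 4);  // three characters plus the '\0'
    assert(sizeof("") == 1);     // only the terminator

    // The escape is fixed before concatenation, so the characters stay
    // distinct: '\xA', then 'B', then the terminator.
    const char s[] = "\xA" "B";
    assert(sizeof(s) == 3);
    assert(s[0] == '\xA' && s[1] == 'B' && s[2] == '\0');

    // A double quote inside a string is written with a preceding backslash.
    assert(std::strlen("say \"hi\"") == 8);

    // A wide string of two characters has three wchar_t elements.
    assert(sizeof(L"ab") == 3 * sizeof(wchar_t));
}
```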

5 Escape sequences in string literals have the same meaning as in
  character literals (_lex.ccon_).

  2.9.5  Boolean literals                                     [lex.bool]
          boolean-literal:
                  false
                  true

1 The  Boolean  literals are the keywords false and true.  Such literals
  have type bool and the given values.  They are not lvalues.
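
  A minimal sketch of these properties follows.  It uses static_assert,
  decltype and the <type_traits> header from later revisions of the
  language, which is an assumption and not part of this draft;
  check_bool_literals is a hypothetical name.  The doubly parenthesized
  decltype yields a reference type only for lvalues, so the third
  assertion reflects that true is not an lvalue.

```cpp
#include <cassert>
#include <type_traits>

static_assert(std::is_same<decltype(true), bool>::value, "true is a bool");
static_assert(std::is_same<decltype(false), bool>::value, "false is a bool");
static_assert(!std::is_lvalue_reference<decltype((true))>::value,
              "true is not an lvalue");

// Hypothetical self-check; call it from main() to exercise the sketch.
inline void check_bool_literals() {
    assert(true != false);
    assert(static_cast<int>(true) == 1);   // true converts to one
    assert(static_cast<int>(false) == 0);  // false converts to zero
}
```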