  ______________________________________________________________________

  2   Lexical conventions                                [lex]

  ______________________________________________________________________

1 A C++ program need not all be translated at the same time.   The  text
  of  the program is kept in units called source files in this standard.
  A source file together with all the headers (_lib.headers_) and source
  files   included   (_cpp.include_)  via  the  preprocessing  directive
  #include, less any source lines skipped  by  any  of  the  conditional
  inclusion  (_cpp.cond_) preprocessing directives, is called a translation
  unit.  Previously translated translation units may  be  preserved
  individually  or  in  libraries.   The separate translation units of a
  program communicate (_basic.link_) by (for example) calls to functions
  whose identifiers have external linkage, manipulation of objects whose
  identifiers have external linkage,  or  manipulation  of  data  files.
  Translation  units  may be separately translated and then later linked
  to produce an executable program (_basic.link_).

  2.1  Phases of translation                                [lex.phases]

1 The precedence among the syntax rules of translation is  specified  by
  the following phases.1)

    1 Physical source file characters are mapped to the source character
      set  (introducing  new-line characters for end-of-line indicators)
      if necessary.  Trigraph sequences (_lex.trigraph_) are replaced by
      corresponding single-character internal representations.

    2 Each instance of a new-line character and an immediately preceding
      backslash character is deleted, splicing physical source lines  to
      form  logical source lines.  A source file that is not empty shall
      end in a new-line character, which shall not be immediately preceded
      by a backslash character.

    3 The   source   file   is   decomposed  into  preprocessing  tokens
      (_lex.pptoken_) and sequences of white-space characters (including
      comments).  A source file shall not end in a partial preprocessing
      token or partial comment.  Each comment is replaced by  one  space
      character.    New-line  characters  are  retained.   Whether  each
      nonempty sequence of white-space characters other than new-line is
      retained  or  replaced  by  one space character is implementation-
      defined.  The process of dividing a source file's characters  into
      preprocessing  tokens  is context-dependent.  For example, see the
      handling of < within a #include preprocessing directive.
  _________________________
  1)  Implementations must behave as if these separate phases occur,
  although in practice different phases may be folded together.

    4 Preprocessing directives are executed and  macro  invocations  are
      expanded.   A  #include  preprocessing  directive causes the named
      header or source file to be processed from phase 1  through  phase
      4, recursively.

    5 Each  source character set member and escape sequence in character
      constants and string literals is converted to a member of the execution
      character set.

    6 Adjacent  character  string  literal  tokens  are concatenated and
      adjacent wide string literal tokens are concatenated.

    7 White-space characters separating tokens are no longer significant.
      Each preprocessing token is converted into a token (see
      _lex.digraph_).  The resulting tokens are syntactically and
      semantically analyzed and translated.  The result of this process
      starting from a single source file is called a translation unit.

    8 The translation units that will form a program are combined.   All
      external object and function references are resolved.

  +-------                 BEGIN BOX 1                -------+
    What about shared libraries?
  +-------                  END BOX 1                 -------+

    Library  components  are  linked  to  satisfy external references to
    functions and objects not defined in the current  translation.   All
    such  translator output is collected into a program image which contains
    information needed for execution in its execution environment.
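
  As a rough illustration of phase 2 alone, the following sketch deletes
  each backslash that is immediately followed by a new-line.  The name
  splice_lines is hypothetical and belongs to no implementation or
  library; a translator only has to behave as if the phases occur and
  need not build such an intermediate string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: a sketch of translation phase 2 only.
// Deletes every backslash that is immediately followed by a new-line,
// together with that new-line, joining physical lines into logical lines.
std::string splice_lines(const std::string& src)
{
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;            // skip the new-line as well
            continue;
        }
        out += src[i];
    }
    return out;
}
```

  A backslash that is not directly followed by a new-line is left
  untouched, so only physical line breaks are affected.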

  2.2  Trigraph sequences                                 [lex.trigraph]

1 Before any other processing takes place, each occurrence of one of the
  following  sequences  of  three  characters  (trigraph  sequences)  is
  replaced by the single character indicated in Table 1.

                       Table 1--trigraph sequences

  +-----------------------+------------------------+------------------------+
  |trigraph   replacement | trigraph   replacement | trigraph   replacement |
  +-----------------------+------------------------+------------------------+
  |  ??=           #      |   ??(           [      |   ??<           {      |
  +-----------------------+------------------------+------------------------+
  |  ??/           \      |   ??)           ]      |   ??>           }      |
  +-----------------------+------------------------+------------------------+
  |  ??'           ^      |   ??!           |      |   ??-           ~      |
  +-----------------------+------------------------+------------------------+

2 For example,
          ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
  becomes
          #define arraycheck(a,b) a[b] || b[a]
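
  The replacement in Table 1 can be sketched as a single left-to-right
  scan.  The names trigraph_replacement and replace_trigraphs are
  hypothetical and are used for illustration only.  The sketch avoids
  writing two adjacent question marks in its own source text, so it
  reads the same whether or not the translator that processes it
  performs trigraph replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: maps the third character of a candidate sequence
// to its replacement per Table 1, or returns '\0' if there is none.
char trigraph_replacement(char third)
{
    switch (third) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return '\0';
    }
}

// Replaces each trigraph sequence in src, scanning left to right.
std::string replace_trigraphs(const std::string& src)
{
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '?' && i + 2 < src.size() && src[i + 1] == '?') {
            char r = trigraph_replacement(src[i + 2]);
            if (r != '\0') {
                out += r;
                i += 2;     // consume the remaining two characters
                continue;
            }
        }
        out += src[i];
    }
    return out;
}
```

  Two question marks followed by any character not listed in Table 1 are
  copied through unchanged.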

  2.3  Preprocessing tokens                                [lex.pptoken]
          preprocessing-token:
                  header-name
                  identifier
                  pp-number
                  character-constant
                  string-literal
                  operator
                  digraph
                  punctuator
                  each non-white-space character that cannot be one of the above

1 Each  preprocessing  token  that is converted to a token (_lex.token_)
  shall have the lexical form of a keyword, an identifier, a constant, a
  string literal, an operator, a digraph, or a punctuator.

2 A  preprocessing  token is the minimal lexical element of the language
  in translation phases 3 through 6.  The  categories  of  preprocessing
  token are: header names, identifiers, preprocessing numbers, character
  constants, string literals, operators, punctuators, digraphs, and single
  non-white-space  characters that do not lexically match the other
  preprocessing token categories.  If a ' or a " character  matches  the
  last category, the behavior is undefined.  Preprocessing tokens can be
  separated by white space; this consists of  comments  (_lex.comment_),
  or  white-space  characters (space, horizontal tab, new-line, vertical
  tab, and form-feed), or both.  As described in Clause _cpp_,  in  certain
  circumstances  during  translation  phase 4, white space (or the
  absence thereof) serves as more than preprocessing  token  separation.
  White  space may appear within a preprocessing token only as part of a
  header name or between the quotation characters in  a  character  constant
  or string literal.

3 If  the input stream has been parsed into preprocessing tokens up to a
  given character, the next preprocessing token is the longest  sequence
  of characters that could constitute a preprocessing token.

4 The  program  fragment  1Ex  is parsed as a preprocessing number token
  (one that is not a valid floating or  integer  constant  token),  even
  though a parse as the pair of preprocessing tokens 1 and Ex might produce
  a valid expression (for example, if Ex were a  macro  defined  as
  +1).  Similarly, the program fragment 1E1 is parsed as a preprocessing
  number (one that is a valid floating constant token), whether or not E
  is a macro name.

5 The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and
  y are of built-in types, violates a constraint on increment operators,
  even though the parse x ++ + ++ y might yield a correct expression.
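
  The longest-match rule can be sketched with a small scanner.  The
  function munch and its operator list are hypothetical, and the sketch
  is restricted to identifiers and a few operators; it does not model
  preprocessing numbers, literals, or comments.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of the longest-match rule, restricted to
// identifiers and a small subset of the operators.
std::vector<std::string> munch(const std::string& src)
{
    // Candidate operator spellings, longest first, so that the first
    // match found at a position is also the longest one.
    static const char* const ops[] = { "++", "+=", "+" };

    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalpha(c) || c == '_') {
            std::size_t j = i + 1;
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) ||
                    src[j] == '_'))
                ++j;
            tokens.push_back(src.substr(i, j - i));
            i = j;
            continue;
        }
        bool matched = false;
        for (const char* op : ops) {
            std::string s(op);
            if (src.compare(i, s.size(), s) == 0) {
                tokens.push_back(s);
                i += s.size();
                matched = true;
                break;
            }
        }
        if (!matched) {
            // Any other character becomes a one-character token.
            tokens.push_back(std::string(1, src[i]));
            ++i;
        }
    }
    return tokens;
}
```

  Applied to x+++++y, the scanner never looks ahead to see whether a
  shorter token would make the rest of the input parse, which is why the
  result is x ++ ++ + y.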

  2.4  Digraph sequences                                   [lex.digraph]

1 Alternate  representations are provided for the operators and punctuators
  whose primary representations use the national characters.  These
  include digraphs and additional reserved words.
          digraph:
                  <%
                  %>
                  <:
                  :>
                  %:

2 In  translation  phase 3 (_lex.phases_) the digraphs are recognized as
  preprocessing tokens.  Then in translation phase 7  the  digraphs  and
  the  additional  identifiers  listed  below  are converted into tokens
  identical to those from the corresponding primary representations,  as
  shown in Table 2.

            Table 2--identifiers that are treated as operators

    +--------------------+---------------------+---------------------+
    |alternate   primary | alternate   primary | alternate   primary |
    +--------------------+---------------------+---------------------+
    |   <%          {    |    and        &&    |  and_eq       &=    |
    +--------------------+---------------------+---------------------+
    |   %>          }    |   bitor        |    |   or_eq       |=    |
    +--------------------+---------------------+---------------------+
    |   <:          [    |    or         ||    |  xor_eq       ^=    |
    +--------------------+---------------------+---------------------+
    |   :>          ]    |    xor         ^    |    not         !    |
    +--------------------+---------------------+---------------------+
    |   %:          #    |   compl        ~    |  not_eq       !=    |
    +--------------------+---------------------+---------------------+
    | bitand        &    |                     |                     |
    +--------------------+---------------------+---------------------+
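
  A short sketch of how the alternate representations in Table 2 read in
  practice.  The function names are hypothetical.  The sketch assumes a
  translator that accepts the alternate spellings directly; some
  compilers need a conformance option or the header <iso646.h> before
  they accept the word-like forms.

```cpp
#include <cassert>

// Each function is written with digraphs and alternate operator
// spellings; the trailing comments show the primary spelling.

int sum3(int x, int y, int z)
<%                                      // {
    int a<:3:> = <% x, y, z %>;         // int a[3] = { x, y, z };
    return a<:0:> + a<:1:> + a<:2:>;    // return a[0] + a[1] + a[2];
%>                                      // }

bool in_range(int v, int lo, int hi)
<%
    return v >= lo and v <= hi;         // v >= lo && v <= hi
%>

unsigned mix(unsigned x)
<%
    // (x & 0xF0u) | (~x & 0x0Fu)
    return (x bitand 0xF0u) bitor (compl x bitand 0x0Fu);
%>
```

  Because the conversion happens in phase 7, the alternate forms and
  their primary forms yield identical tokens and can be mixed freely.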

  2.5  Tokens                                                [lex.token]
          token:
                  identifier
                  keyword
                  literal
                  operator
                  punctuator

1 There are five kinds of tokens: identifiers, keywords, literals (which
  include strings and character and numeric constants),  operators,  and
  other  separators.   Blanks,  horizontal  and vertical tabs, newlines,
  formfeeds, and comments  (collectively,  white  space),  as  described
  below,  are  ignored  except  as  they serve to separate tokens.  Some
  white space is required to separate  otherwise  adjacent  identifiers,
  keywords, and literals.

2 If  the input stream has been parsed into tokens up to a given preprocessing
  token, the next token is taken to be  the  longest  string  of
  preprocessing tokens that could possibly constitute a token.

  2.6  Comments                                            [lex.comment]

1 The characters /* start a comment, which terminates with the characters
  */.  These comments do not nest.  The characters // start a comment,
  which terminates with the next new-line character.  If there is a
  form-feed or a vertical-tab character in such a comment, only white-
  space characters may appear between it and the new-line that terminates
  the comment; no diagnostic is required.  The comment characters
  //,  /*,  and  */  have no special meaning within a // comment and are
  treated just like other characters.  Similarly, the comment characters
  // and /* have no special meaning within a /* comment.
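
  The following sketch shows the two rules about comment characters
  inside comments.  The variable names are hypothetical and chosen for
  illustration.

```cpp
#include <cassert>

// The // inside the block comment is ordinary text, so the comment ends
// at its closing delimiter and "+ 1" is still part of the expression.
const int after_block = 1 /* // not the start of a line comment */ + 1;

// The block-comment opener inside the line comment is ordinary text, so
// the comment ends at the new-line and the next line is still code.
const int after_line = 2 // /* not the start of a block comment
                       + 2;
```

  If either comment were allowed to start inside the other, the trailing
  additions would be swallowed and the initializers would be 1 and 2.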

  2.7  Identifiers                                            [lex.name]
          identifier:
                  nondigit
                  identifier nondigit
                  identifier digit
          nondigit: one of
                  _ a b c d e f g h i j k l m
                    n o p q r s t u v w x y z
                    A B C D E F G H I J K L M
                    N O P Q R S T U V W X Y Z
          digit: one of
                  0 1 2 3 4 5 6 7 8 9

1 An  identifier  is an arbitrarily long sequence of letters and digits.
  The first character must be a letter; the underscore  _  counts  as  a
  letter.   Upper- and lower-case letters are different.  All characters
  are significant.
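
  The grammar above can be checked mechanically.  The helpers below are
  hypothetical and test only the lexical form of an identifier; they do
  not reject keywords or reserved names.

```cpp
#include <cassert>
#include <string>

// A nondigit is the underscore or one of the 52 letters listed above.
bool is_nondigit(char c)
{
    static const std::string nondigits =
        "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    return nondigits.find(c) != std::string::npos;
}

// A digit is one of 0 through 9.
bool is_digit(char c)
{
    return c >= '0' && c <= '9';
}

// identifier: a nondigit followed by any number of nondigits and digits.
bool is_identifier(const std::string& s)
{
    if (s.empty() || !is_nondigit(s[0]))
        return false;
    for (char c : s)
        if (!is_nondigit(c) && !is_digit(c))
            return false;
    return true;
}
```

  Since upper- and lower-case letters are different, count and Count
  both pass the check and name different entities.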

  2.8  Keywords                                                [lex.key]

1 The identifiers shown in Table 3 are reserved for use as keywords, and
  may not be used otherwise in phases 7 and 8:

                            Table 3--keywords

  +------------------------------------------------------------------------+
  |asm          delete         if          reinterpret_cast   true         |
  |auto         do             inline      return             try          |
  |bool         double         int         short              typedef      |
  |break        dynamic_cast   long        signed             typeid       |
  |case         else           mutable     sizeof             union        |
  |catch        enum           namespace   static             unsigned     |
  |char         extern         new         static_cast        using        |
  |class        false          operator    struct             virtual      |
  |const        float          private     switch             void         |
  |const_cast   for            protected   template           volatile     |
  |continue     friend         public      this               wchar_t      |
  |default      goto           register    throw              while        |
  +------------------------------------------------------------------------+

2 Furthermore,  the  alternate representations shown in Table 4 for certain
  operators and punctuators (_lex.digraph_) are  reserved  and  may
  not be used otherwise:

                    Table 4--alternate representations

             +-----------------------------------------------+
             |bitand   and     bitor    or    xor      compl |
             |and_eq   or_eq   xor_eq   not   not_eq         |
             +-----------------------------------------------+

3 In addition, identifiers containing a double underscore (__) or beginning
  with an underscore and an upper-case letter are reserved for  use
  by C++ implementations and standard libraries and should be avoided by
  users; no diagnostic is required.
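
  Because no diagnostic is required, a project may want its own check
  for the rule in paragraph 3.  The following is a minimal sketch; the
  function name is hypothetical and the input is assumed to already
  have the lexical form of an identifier.

```cpp
#include <cassert>
#include <string>

// Returns true if name contains a double underscore, or begins with an
// underscore followed by an upper-case letter.
bool is_reserved_for_implementation(const std::string& name)
{
    static const std::string upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    if (name.find("__") != std::string::npos)
        return true;
    return name.size() >= 2 && name[0] == '_' &&
           upper.find(name[1]) != std::string::npos;
}
```

  Names such as _count or count_ are not covered by this paragraph and
  are reported as not reserved by the sketch.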

4 The ASCII representation of C++ programs  uses  as  operators  or  for
  punctuation the characters shown in Table 5.

              Table 5--operators and punctuation characters

         +------------------------------------------------------+
         |!   %   ^   &   *   (   )   -   +   =   {   }   |   ~ |
         |[   ]   \   ;   '   :   "   <   >   ?   ,   .   /     |
         +------------------------------------------------------+
  Table 6 shows the character combinations that are used as operators.

            Table 6--character combinations used as operators

      +-------------------------------------------------------------+
      |->   ++   --   .*   ->*   <<   >>    <=    >=   ==   !=   && |
      |||   *=   /=   %=   +=    -=   <<=   >>=   &=   ^=   |=   :: |
      +-------------------------------------------------------------+
  Each  is  converted  to  a  single  token  in  translation   phase   7
  (_lex.phases_).

5 Table 7 shows character combinations that are used as alternative
  representations for certain operators and punctuators (_lex.digraph_).

                            Table 7--digraphs

                         +-----------------------+
                         |<%   %>   <:   :>   %: |
                         +-----------------------+
  Each of these is also recognized as  a  single  token  in  translation
  phases 3 and 7.

6 Table 8 shows additional tokens that are used by the preprocessor.

                      Table 8--preprocessing tokens

                       +---------------------------+
                       |#   ##   %:   %:%:         |
                       +---------------------------+

7 Certain  implementation-dependent  properties,  such  as the type of a
  sizeof  (_expr.sizeof_)  and   the   ranges   of   fundamental   types
  (_basic.fundamental_),  are  defined  in  the  standard  header  files
  (_cpp.include_)
          <float.h>   <limits.h>   <stddef.h>
  These headers are part of the ISO C standard.  In addition the headers
          <new.h>   <stdarg.h>   <stdlib.h>
  define  the  types  of the most basic library functions.  The last two
  headers are part of the ISO C standard; <new.h> is C++ specific.

  2.9  Literals                                            [lex.literal]

1 There are several kinds of literals (often referred to as  constants).
          literal:
                  integer-literal
                  character-literal
                  floating-literal
                  string-literal
                  boolean-literal

  2.9.1  Integer literals                                     [lex.icon]
          integer-literal:
                  decimal-literal integer-suffixopt
                  octal-literal integer-suffixopt
                  hexadecimal-literal integer-suffixopt
          decimal-literal:
                  nonzero-digit
                  decimal-literal digit
          octal-literal:
                  0
                  octal-literal octal-digit
          hexadecimal-literal:
                  0x hexadecimal-digit
                  0X hexadecimal-digit
                  hexadecimal-literal hexadecimal-digit
          nonzero-digit: one of
                  1  2  3  4  5  6  7  8  9
          octal-digit: one of
                  0  1  2  3  4  5  6  7
          hexadecimal-digit: one of
                  0  1  2  3  4  5  6  7  8  9
                  a  b  c  d  e  f
                  A  B  C  D  E  F
          integer-suffix:
                  unsigned-suffix long-suffixopt
                  long-suffix unsigned-suffixopt
          unsigned-suffix: one of
                  u  U
          long-suffix: one of
                  l  L

1 An  integer  literal consisting of a sequence of digits is taken to be
  decimal (base ten) unless it begins with 0 (digit zero).   A  sequence
  of  digits  starting  with  0  is  taken  to be an octal integer (base
  eight).  The digits 8 and 9 are not octal digits.  A sequence of  digits
  preceded  by  0x or 0X is taken to be a hexadecimal integer (base
  sixteen).  The hexadecimal digits include a or A through f or  F  with
  decimal  values  ten  through fifteen.  For example, the number twelve
  can be written 12, 014, or 0XC.

2 The type of an integer literal depends on its form, value, and suffix.
  If it is decimal and has no suffix, it has the first of these types in
  which its value can be represented: int, long int, unsigned long  int.
  If  it  is octal or hexadecimal and has no suffix, it has the first of
  these types in which its value can be represented: int, unsigned  int,
  long int, unsigned long int.  If it is suffixed by u or U, its type is
  the first of these types  in  which  its  value  can  be  represented:
  unsigned  int,  unsigned  long  int.  If it is suffixed by l or L, its
  type is the first of these types in which  its  value  can  be  represented:
  long int, unsigned long int.  If it is suffixed by ul, lu, uL,
  Lu, Ul, lU, UL, or LU, its type is unsigned long int.

3 A program is ill-formed if it contains an integer literal that  cannot
  be represented by any of the allowed types.
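
  The rules in paragraphs 1 and 2 can be observed for a small value that
  fits in every candidate type.  The sketch below relies on static_assert,
  decltype and <type_traits>, which come from later revisions of the
  language and are not described in this clause; the function name is
  hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// The same value written in three bases.
static_assert(12 == 014 && 12 == 0XC, "decimal, octal and hexadecimal twelve");

// For a value that fits everywhere, the suffix alone selects the type.
static_assert(std::is_same<decltype(12),   int>::value,           "no suffix");
static_assert(std::is_same<decltype(12u),  unsigned int>::value,  "u suffix");
static_assert(std::is_same<decltype(12L),  long>::value,          "L suffix");
static_assert(std::is_same<decltype(12UL), unsigned long>::value, "UL suffix");
static_assert(std::is_same<decltype(12lu), unsigned long>::value, "lu suffix");

// Run-time view of the first assertion, convenient for testing.
bool twelve_spellings_agree()
{
    return 12 == 014 && 014 == 0XC;
}
```

  Literals too large for int are deliberately left out, since the type
  chosen for them depends on the sizes of the integer types on the
  implementation at hand.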

  2.9.2  Character literals                                   [lex.ccon]
          character-literal:
                  'c-char-sequence'
                  L'c-char-sequence'
          c-char-sequence:
                  c-char
                  c-char-sequence c-char
          c-char:
                  any member of the source character set except
                          the single-quote ', backslash \, or new-line character
                  escape-sequence
          escape-sequence:
                  simple-escape-sequence
                  octal-escape-sequence
                  hexadecimal-escape-sequence
          simple-escape-sequence: one of
                  \'  \"  \?  \\
                  \a  \b  \f  \n  \r  \t  \v
          octal-escape-sequence:
                  \ octal-digit
                  octal-escape-sequence octal-digit
          hexadecimal-escape-sequence:
                  \x hexadecimal-digit
                  hexadecimal-escape-sequence hexadecimal-digit

1 A  character  literal  is  one  or  more characters enclosed in single
  quotes, as in 'x', optionally preceded by the letter L,  as  in  L'x'.
  Single  character  literals  that  do not begin with L have type char,
  with value equal to the  numerical  value  of  the  character  in  the
  machine's  character  set.   Multicharacter literals that do not begin
  with L have type int and implementation-defined value.

2 A character literal that begins with the letter L, such as L'ab', is a
  wide-character  literal.   Wide-character  literals have type wchar_t.
  They are intended for character sets where a character  does  not  fit
  into  a  single  byte.   Wide-character  literals have implementation-
  defined values, regardless of the number of characters in the literal.

3 Certain nongraphic characters, the single quote ', the double quote ",
  ?, and the backslash \, may be represented according to Table 9.

                        Table 9--escape sequences

                   +----------------------------------+
                   |new-line          NL (LF)   \n    |
                   |horizontal tab    HT        \t    |
                   |vertical tab      VT        \v    |
                   |backspace         BS        \b    |
                   |carriage return   CR        \r    |
                   |form feed         FF        \f    |
                   |alert             BEL       \a    |
                   |backslash         \         \\    |
                   |question mark     ?         \?    |
                   |single quote      '         \'    |
                   |double quote      "         \"    |
                   |octal number      ooo       \ooo  |
                   |hex number        hhh       \xhhh |
                   +----------------------------------+
  If the character following a backslash is not one of those  specified,
  the  behavior  is  undefined.   An  escape sequence specifies a single
  character.

4 The escape \ooo consists of the backslash  followed  by  one  or  more
  octal digits that are taken to specify the value of the desired character.
  The escape \xhhh consists of the backslash followed by x followed
  by one or more hexadecimal digits that are taken to specify the
  value of the desired character.  There is no limit to  the  number  of
  digits  in either sequence.  A sequence of octal or hexadecimal digits
  is terminated by the first character that is not an octal digit  or  a
  hexadecimal  digit, respectively.  The value of a character literal is
  implementation dependent if it exceeds that of the largest  char  (for
  ordinary literals) or wchar_t (for wide literals).
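
  A brief sketch of the literal types and of the numeric escapes.  It
  assumes an ASCII-compatible execution character set, which this clause
  does not require, and it uses static_assert, decltype and <type_traits>
  from later revisions of the language.  The function name is
  hypothetical.

```cpp
#include <cassert>
#include <type_traits>

static_assert(std::is_same<decltype('x'),  char>::value,    "ordinary literal");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value, "wide literal");

// Assumption: ASCII-compatible execution character set, in which
// new-line has the value 10 and 'A' has the value 65.
bool escapes_match_ascii()
{
    return '\n' == '\012' && '\n' == '\xA'
        && 'A' == '\101' && 'A' == '\x41';
}
```

  Each escape sequence, whatever its length in the source text, denotes
  a single character, so every literal above has type char.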

  2.9.3  Floating literals                                    [lex.fcon]
          floating-literal:
                  fractional-constant exponent-partopt floating-suffixopt
                  digit-sequence exponent-part floating-suffixopt
          fractional-constant:
                  digit-sequenceopt . digit-sequence
                  digit-sequence .
          exponent-part:
                  e signopt digit-sequence
                  E signopt digit-sequence
          sign: one of
                  +  -
          digit-sequence:
                  digit
                  digit-sequence digit
          floating-suffix: one of
                  f  l  F  L

1 A  floating  literal  consists  of an integer part, a decimal point, a
  fraction part, an e or E, an optionally signed integer  exponent,  and
  an  optional type suffix.  The integer and fraction parts both consist
  of a sequence of decimal (base ten) digits.  Either the  integer  part
  or  the  fraction  part  (not both) may be missing; either the decimal
  point or the letter e (or E) and the exponent (not both) may be missing.
  The type of a floating literal is double unless explicitly specified
  by a suffix.  The suffixes f and F specify float, the suffixes l
  and L specify long double.
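
  The permitted forms and the effect of the suffixes can be sketched as
  follows.  The values are chosen to be exactly representable so that
  the comparisons do not depend on rounding; static_assert, decltype and
  <type_traits> come from later revisions of the language, and the
  function name is hypothetical.

```cpp
#include <cassert>
#include <type_traits>

static_assert(std::is_same<decltype(1.5),  double>::value,      "no suffix");
static_assert(std::is_same<decltype(1.5f), float>::value,       "f suffix");
static_assert(std::is_same<decltype(1.5L), long double>::value, "L suffix");

// Different spellings of the same values.
bool forms_agree()
{
    return 1. == 1.0        // fraction part missing
        && .5 == 0.5        // integer part missing
        && 1e3 == 1000.0    // decimal point missing, exponent present
        && 1E+3 == 1e3;     // upper-case E with an explicit sign
}
```

  A spelling such as 1 (no decimal point and no exponent) is an integer
  literal, and a lone . is not a literal at all.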

  2.9.4  String literals                                    [lex.string]
          string-literal:
                  "s-char-sequenceopt"
                  L"s-char-sequenceopt"
          s-char-sequence:
                  s-char
                  s-char-sequence s-char
          s-char:
                  any member of the source character set except
                          the double-quote ", backslash \, or new-line character
                  escape-sequence

1 A   string  literal  is  a  sequence  of  characters  (as  defined  in
  _lex.ccon_) surrounded by double quotes, optionally beginning with the
  letter L, as in "..." or L"...".  A string literal that does not begin
  with  L  has  type  array  of  n  char  and  static  storage  duration
  (_basic.stc_), where n is the size of the string as defined below, and
  is initialized with the given characters.  Whether all string literals
  are distinct (that is, are stored in nonoverlapping objects) is
  implementation dependent.  The effect of attempting to modify a string
  literal is undefined.

2 A  string  literal  that  begins  with  L, such as L"asdf", is a wide-
  character string.  A wide-character string  is  of  type  array  of  n
  wchar_t, where n is the size of the string as defined below.
  Concatenation of ordinary and wide-character string literals is undefined.

  +-------                 BEGIN BOX 2                -------+
  Should this render the program  ill-formed?   Or  is  it  deliberately
  undefined to encourage extensions?
  +-------                  END BOX 2                 -------+

3 Adjacent string literals are concatenated.  Characters in concatenated
  strings are kept distinct.  For example,
          "\xA" "B"
  contains the two characters '\xA' and 'B' after concatenation (and not
  the single hexadecimal character '\xAB').

4 After  any  necessary  concatenation '\0' is appended so that programs
  that scan a string can find its end.  The size of a string is the number
  of its characters including this terminator.  Within a string, the
  double quote character " must be preceded by a \.

5 Escape sequences in string literals have the same meaning as in character
  literals (_lex.ccon_).
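
  The size rule and the concatenation example from paragraphs 3 and 4
  can be checked directly.  The names joined and
  joined_is_two_characters are hypothetical, and static_assert comes
  from a later revision of the language.

```cpp
#include <cassert>
#include <cstring>

// The size counts the appended terminator: three characters plus '\0'.
static_assert(sizeof("abc") == 4, "terminator is included in the size");

// Concatenation keeps the characters distinct: '\xA', 'B', then '\0'.
static_assert(sizeof("\xA" "B") == 3, "two characters plus terminator");

const char* const joined = "\xA" "B";

bool joined_is_two_characters()
{
    return std::strlen(joined) == 2
        && joined[0] == '\xA'
        && joined[1] == 'B';
}
```

  The escape sequence is converted before the literals are joined, which
  is why B is not absorbed into the hexadecimal escape.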

  2.9.5  Boolean literals                                     [lex.bool]
          boolean-literal:
                  false
                  true

1 The  Boolean  literals are the keywords false and true.  Such literals
  have type bool and the given values.  They are not lvalues.