WG21/N0886
X3J16/96-0068
1996-03-13

Extended Identifiers and Extended Literals

Thomas Plum, John Benito, Clark Nelson

Move that we revise the Working Paper as follows:

Item 1) In 2.1, Phases of Translation, add a new paragraph 1 to
precede the existing first paragraph.  Then add onto "phase 1" a new
sentence, "Any source file character not in the basic source
character set is replaced by the _universal-character-name_ that
designates that character." so that the first two paragraphs would
read as follows:

2.1  Phases of translation                              [lex.phases]

1 The _basic source character set_ consists of 96 characters: the
space character, the control characters representing horizontal tab,
vertical tab, form feed, and new-line, plus the following 91
graphical characters:

    a b c d e f g h i j k l m n o p q r s t u v w x y z          (26)
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z          (26)
    0 1 2 3 4 5 6 7 8 9                                          (10)
    _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '    (29)

The _universal-character-name_ construct provides a way to name
other characters.  The character designated by the
_universal-character-name_ \UNNNNNNNN is that character whose
encoding in ISO/IEC 10646 is the hexadecimal value NNNNNNNN; the
character designated by the _universal-character-name_ \uNNNN is
that character whose encoding in ISO/IEC 10646 is the hexadecimal
value 0000NNNN.

    hex-quad:
        hexadecimal-digit hexadecimal-digit
            hexadecimal-digit hexadecimal-digit

    universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad

2 The precedence among the syntax rules of translation is specified
by the following phases.

    1 Physical source file characters are mapped to the source
      character set (introducing new-line characters for end-of-line
      indicators) if necessary.  Trigraph sequences (2.2) are
      replaced by corresponding single-character internal
      representations.  Any source file character not in the basic
      source character set is replaced by the
      _universal-character-name_ that designates that character.
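
As an illustration only (not part of the proposed Working Paper text), the phase 1 replacement can be sketched in C++11 as follows. The names kBasic, to_ucn, and phase1_spelling are hypothetical, the input is assumed to be already decoded into ISO/IEC 10646 code points, and the basic characters are assumed to have their ASCII values. A real translator is free to use any internal encoding that gives equivalent results.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// The 96 members of the basic source character set listed in 2.1.
const std::string kBasic =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
    " \t\v\f\n";

// Hypothetical helper: spell a code point as a universal-character-name,
// using the four-digit form when the value fits in 0000NNNN.
std::string to_ucn(std::uint32_t cp) {
    char buf[11];  // backslash, 'u' or 'U', up to 8 hex digits, '\0'
    if (cp <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04lx", static_cast<unsigned long>(cp));
    else
        std::snprintf(buf, sizeof buf, "\\U%08lx", static_cast<unsigned long>(cp));
    return buf;
}

// Hypothetical helper: replace every character outside the basic source
// character set by the universal-character-name that designates it.
std::string phase1_spelling(const std::u32string& text) {
    assert(kBasic.size() == 96);
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80 && kBasic.find(static_cast<char>(cp)) != std::string::npos)
            out += static_cast<char>(cp);  // basic character: kept as is
        else
            out += to_ucn(cp);             // extended character: named
    }
    return out;
}
```

Under these assumptions, the four characters c, a, f, and U+00E9 come out as the basic characters "caf" followed by the six basic characters of \u00e9.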
[Footnote -- The process of handling extended characters is specified
in terms of mapping to an encoding that uses only the basic source
character set, and, in the case of character literals and strings,
further mapping to the execution character set.  In practical terms,
however, any internal encoding may be used, so long as an actual
extended character encountered in the input, and the same extended
character expressed in the input as a _universal-character-name_
(i.e. using the \uXXXX notation), are handled equivalently.]

[end of quote from revised WP]

Item 2) In 2.1, Phases of Translation, revise "phase 5" as follows:

    5 Each source character set member, escape sequence, or
      _universal-character-name_ in character literals and string
      literals is converted to a member of the execution character
      set.

Item 3) In 2.8, Identifiers, add a new line into the definition of
_nondigit_, and modify paragraph 1, so that the revised text of 2.8
reads as follows:

2.8  Identifiers                                          [lex.name]

    identifier:
        nondigit
        identifier nondigit
        identifier digit

    nondigit: one of
        _universal-character-name_
        _ a b c d e f g h i j k l m n o p q r s t u v w x y z
        A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

    digit: one of
        0 1 2 3 4 5 6 7 8 9

1 An identifier is an arbitrarily long sequence of nondigits and
digits.  Each _universal-character-name_ in an identifier shall
designate a character whose encoding in ISO 10646 falls into one of
the ranges specified in Annex E.  Upper- and lower-case letters are
different.  All characters are significant.*

[*Footnote: On systems in which linkers cannot accept extended
characters, an encoding of the universal-character-name may be used
in forming valid external identifiers.  For example, some otherwise
unused character or sequence of characters may be used to encode the
"\u" in a universal-character-name.
Extended characters may produce a long external identifier, but C++
does not place a translation limit on significant characters for
external identifiers.  In C++, upper- and lower-case letters are
considered different for all identifiers, including external
identifiers.]

Item 4) Augment the definition of _c-char_ in 2.10.2, Character
Literals, as follows:

    c-char:
        any member of the source character set except
            the single-quote ', backslash \, or new-line character
        escape-sequence
        universal-character-name

Then add a new paragraph 5, as follows:

5 A _universal-character-name_ is translated to the encoding, in the
execution character set, of the character named.  If there is no such
encoding, the _universal-character-name_ is translated to an
implementation-defined encoding.  [Note: In translation phase 1 a
_universal-character-name_ is introduced whenever an actual extended
character is encountered in the source text.  Therefore, all extended
characters are described in terms of _universal-character-names_.
However, the actual compiler implementation may use its own native
character set, so long as the same results are obtained.]

Item 5) Augment the definition of _s-char_ in 2.10.4, String
Literals, as follows:

    s-char:
        any member of the source character set except
            the double-quote ", backslash \, or new-line character
        escape-sequence
        universal-character-name

Then, in paragraph 5, change "Escape sequences" to "Escape sequences
and _universal-character-names_".  Change the last sentence to read as
follows:

    In a non-wide string literal, a _universal-character-name_ may
    map to more than one char element.  The size of a wide string
    literal is the total number of escape sequences,
    _universal-character-names_, and other characters, plus one for
    the terminating L'\0'.  The size of a non-wide string literal is
    the total number of escape sequences and other characters, plus
    at least one for the multibyte encoding of each
    _universal-character-name_, plus one for the terminating '\0'.
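
The size rules in Item 5 can be checked with a small sketch such as the one below. It assumes a C++11 translator that accepts universal-character-names in literals and whose wide execution character set holds U+00E9 in a single wchar_t. The names wide_text, narrow_text, element_count, and check_literal_sizes are hypothetical. The exact narrow size depends on the implementation's multibyte encoding, so only the lower bound stated in the revised paragraph is asserted.

```cpp
#include <cassert>
#include <cstddef>

// Three characters each: U+00E9, 't', U+00E9.
const wchar_t wide_text[]   = L"\u00e9t\u00e9";
const char    narrow_text[] =  "\u00e9t\u00e9";

// Number of array elements in a literal, including the terminator.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// Hypothetical self-check of the two size rules.
inline void check_literal_sizes() {
    // Wide: two universal-character-names + one other character + L'\0'.
    assert(element_count(wide_text) == 4);
    // Non-wide: one other character, at least one char for each of the
    // two universal-character-names, plus the terminating '\0'.
    assert(element_count(narrow_text) >= 4);
}
```

With a multibyte encoding that uses two chars for U+00E9, the non-wide array would have six elements, which is consistent with the "at least one" wording.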
Item 6) Add an annex to list the universal-character-names for
identifiers.

______________________________________________________________________

Annex E (normative)
Universal-character-names for Identifiers               [extended-id]
______________________________________________________________________

1 This Clause lists the hexadecimal code values that are valid in
_universal-character-names_ in C++ identifiers.

2 This table is reproduced unchanged from ISO/IEC PDTR 10176,
produced by ISO/IEC JTC1/SC22/WG20, except that the ranges 0041-005a
and 0061-007a designate the upper- and lower-case English alphabets,
which are part of the basic source character set, and are not
repeated in the table below.

[Editorial Note: If PDTR 10176 is changed during its balloting and
adoption as a TR, then this table should be changed to match its
changes.]

Latin: 00c0-00d6, 00d8-00f6, 00f8-01f5, 01fa-0217, 0250-02a8,
    1e00-1e9a, 1ea0-1ef9
Greek: 0384, 0388-038a, 038c, 038e-03a1, 03a3-03ce, 03d0-03d6, 03da,
    03dc, 03de, 03e0, 03e2-03f3, 1f00-1f15, 1f18-1f1d, 1f20-1f45,
    1f48-1f4d, 1f50-1f57, 1f59, 1f5b, 1f5d, 1f5f-1f7d, 1f80-1fb4,
    1fb6-1fbc, 1fc2-1fc4, 1fc6-1fcc, 1fd0-1fd3, 1fd6-1fdb, 1fe0-1fec,
    1ff2-1ff4, 1ff6-1ffc
Cyrillic: 0401-040d, 040f-044f, 0451-045c, 045e-0481, 0490-04c4,
    04c7-04c8, 04cb-04cc, 04d0-04eb, 04ee-04f5, 04f8-04f9
Armenian: 0531-0556, 0561-0587
Hebrew: 05d0-05ea, 05f0-05f4
Arabic: 0621-063a, 0640-0652, 0670-06b7, 06ba-06be, 06c0-06ce,
    06e5-06e7
Devanagari: 0905-0939, 0958-0962
Bengali: 0985-098c, 098f-0990, 0993-09a8, 09aa-09b0, 09b2, 09b6-09b9,
    09dc-09dd, 09df-09e1, 09f0-09f1
Gurmukhi: 0a05-0a0a, 0a0f-0a10, 0a13-0a28, 0a2a-0a30, 0a32-0a33,
    0a35-0a36, 0a38-0a39, 0a59-0a5c, 0a5e
Gujarati: 0a85-0a8b, 0a8d, 0a8f-0a91, 0a93-0aa8, 0aaa-0ab0, 0ab2-0ab3,
    0ab5-0ab9, 0ae0
Oriya: 0b05-0b0c, 0b0f-0b10, 0b13-0b28, 0b2a-0b30, 0b32-0b33,
    0b36-0b39, 0b5c-0b5d, 0b5f-0b61
Tamil: 0b85-0b8a, 0b8e-0b90, 0b92-0b95, 0b99-0b9a, 0b9c, 0b9e-0b9f,
    0ba3-0ba4, 0ba8-0baa, 0bae-0bb5, 0bb7-0bb9
Telugu: 0c05-0c0c, 0c0e-0c10, 0c12-0c28, 0c2a-0c33, 0c35-0c39,
    0c60-0c61
Kannada: 0c85-0c8c, 0c8e-0c90, 0c92-0ca8, 0caa-0cb3, 0cb5-0cb9,
    0ce0-0ce1
Malayalam: 0d05-0d0c, 0d0e-0d10, 0d12-0d28, 0d2a-0d39, 0d60-0d61
Thai: 0e01-0e30, 0e32-0e33, 0e40-0e46, 0e4f-0e5b
Lao: 0e81-0e82, 0e84, 0e87, 0e88, 0e8a, 0e8d, 0e94-0e97, 0e99-0e9f,
    0ea1-0ea3, 0ea5, 0ea7, 0eaa, 0eab, 0ead-0eb0, 0eb2, 0eb3, 0ebd,
    0ec0-0ec4, 0ec6
Georgian: 10a0-10c5, 10d0-10f6
Hiragana: 3041-3094, 309b-309e
Katakana: 30a1-30fe
Bopomofo: 3105-312c
Hangul: 1100-1159, 1161-11a2, 11a8-11f9
CJK Unified Ideographs: f900-fa2d, fb1f-fb36, fb38-fb3c, fb3e,
    fb40-fb41, fb42-fb44, fb46-fbb1, fbd3-fd3f, fd50-fd8f, fd92-fdc7,
    fdf0-fdfb, fe70-fe72, fe74, fe76-fefc, ff21-ff3a, ff41-ff5a,
    ff66-ffbe, ffc2-ffc7, ffca-ffcf, ffd2-ffd7, ffda-ffdc, 4e00-9fa5
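
An implementation might enforce the Item 3 constraint with a range lookup over this table. The following C++11 sketch is illustrative only: Range, kExcerpt, and in_annex_e_excerpt are hypothetical names, and the array holds only the Latin, Hebrew, and Hiragana rows rather than the whole annex.

```cpp
#include <cassert>
#include <cstdint>

// One inclusive range of ISO/IEC 10646 code values.
struct Range { std::uint32_t first, last; };

// Excerpt of the Annex E table: three rows only, for illustration.
const Range kExcerpt[] = {
    {0x00c0, 0x00d6}, {0x00d8, 0x00f6}, {0x00f8, 0x01f5},  // Latin
    {0x01fa, 0x0217}, {0x0250, 0x02a8}, {0x1e00, 0x1e9a},
    {0x1ea0, 0x1ef9},
    {0x05d0, 0x05ea}, {0x05f0, 0x05f4},                    // Hebrew
    {0x3041, 0x3094}, {0x309b, 0x309e},                    // Hiragana
};

// Hypothetical check: may the character with code value cp be named by
// a universal-character-name in an identifier, as far as this excerpt
// of the table is concerned?
bool in_annex_e_excerpt(std::uint32_t cp) {
    for (const Range& r : kExcerpt)
        if (r.first <= cp && cp <= r.last)
            return true;
    return false;
}
```

With this excerpt, 00e9 and 3042 are accepted, while 00d7 (which falls in the gap between 00d6 and 00d8) and 0041 (a basic source character, deliberately not repeated in the table) are not. A linear scan is used for brevity; a full table sorted by code value would allow a binary search instead.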