From: kenw@sybase

SC22/WG20 N609

Proposed Defect Report re Identifiers in TR 10176

October 13, 1998

Title: Defect Report on Annex A of TR 10176

Source: Ken Whistler, Lisa Moore, and Rick McGowan

Status: Defect Report

Action: For the consideration of WG20

Distribution: UTC, L2 and WG20 members

1. Background

There continue to be differences between the repertoire of characters

included in the Unicode Standard's identifier syntax and those

documented in ISO/IEC TR 10176, second edition, Annex A.

The software industry and its users will not be well served

if platforms and applications have different restrictions on

character behavior or usage with respect to identifiers.

Today, the character property information maintained as part of

the Unicode Standard is widely used as the source of character

semantic information in implementations of ISO/IEC 10646/Unicode.

By using these easy-to-access tables, implementors have

ensured consistent behavior across platforms and applications.

To enable consistent character usage in the future, we need to agree

on a set of characters appropriate for identifiers. Our goal is to

include in the Unicode definition all characters referenced by

TR 10176 as extended repertoire for user-defined identifiers in

programming languages. This is the only way to ensure portable,

consistent behavior in implementations of ISO/IEC 10646/Unicode.

However, because of the various mismatches between the recommendations

of Annex A in TR 10176 and the Unicode data tables, this is not

currently feasible.

2. Existing discrepancies

Among the characters included in the list recommended for identifiers

in TR 10176 Annex A, there are twenty-six that are not part of the

Unicode definition of identifiers:

06D4 ARABIC FULL STOP

0950 DEVANAGARI OM

0AD0 GUJARATI OM

0E2F THAI CHARACTER PAIYANNOI

0E4F THAI CHARACTER FONGMAN

0E5A THAI CHARACTER ANGKHANKHU

0E5B THAI CHARACTER KHOMUT

0F00 TIBETAN SYLLABLE OM

0F3E TIBETAN SIGN YAR TSHES

0F3F TIBETAN SIGN MAR TSHES

0F88 TIBETAN SIGN LCE TSA CAN

0F89 TIBETAN SIGN MCHU CAN

0F8A TIBETAN SIGN GRU CAN RGYINGS

0F8B TIBETAN SIGN GRU MED RGYINGS

0F33 TIBETAN DIGIT HALF ZERO

0F2A TIBETAN DIGIT HALF ONE

0F2B TIBETAN DIGIT HALF TWO

0F2C TIBETAN DIGIT HALF THREE

0F2D TIBETAN DIGIT HALF FOUR

0F2E TIBETAN DIGIT HALF FIVE

0F2F TIBETAN DIGIT HALF SIX

0F30 TIBETAN DIGIT HALF SEVEN

0F31 TIBETAN DIGIT HALF EIGHT

0F32 TIBETAN DIGIT HALF NINE

3006 IDEOGRAPHIC CLOSING MARK

30FB KATAKANA MIDDLE DOT

3. Changes to be made in the Unicode definition

After discussion among members of the UTC, we are in agreement that

the Unicode definition of identifiers should be modified to include

the following ten characters from the above list:

0950 DEVANAGARI OM

0AD0 GUJARATI OM

0E2F THAI CHARACTER PAIYANNOI

0F00 TIBETAN SYLLABLE OM

0F88 TIBETAN SIGN LCE TSA CAN

0F89 TIBETAN SIGN MCHU CAN

0F8A TIBETAN SIGN GRU CAN RGYINGS

0F8B TIBETAN SIGN GRU MED RGYINGS

3006 IDEOGRAPHIC CLOSING MARK

30FB KATAKANA MIDDLE DOT
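
The counts used in this report can be cross-checked with a short set computation. The following is a minimal sketch in Python; the names LISTED, ACCEPTED, REMAINING and summarize are illustrative only and do not come from TR 10176 or the Unicode Standard. The code points are transcribed from the lists in sections 2 and 3.

```python
# Code points listed in section 2: present in TR 10176 Annex A,
# absent from the Unicode definition of identifiers.
LISTED = (
    {0x06D4, 0x0950, 0x0AD0, 0x0E2F, 0x0E4F, 0x0E5A, 0x0E5B, 0x0F00}
    | {0x0F3E, 0x0F3F}
    | set(range(0x0F88, 0x0F8C))  # 0F88..0F8B
    | set(range(0x0F2A, 0x0F34))  # 0F2A..0F33, the ten half digits
    | {0x3006, 0x30FB}
)

# Code points from section 3: to be added to the Unicode definition.
ACCEPTED = {
    0x0950, 0x0AD0, 0x0E2F, 0x0F00,
    0x0F88, 0x0F89, 0x0F8A, 0x0F8B,
    0x3006, 0x30FB,
}

# Whatever is listed but not accepted is still in dispute.
REMAINING = LISTED - ACCEPTED


def summarize():
    """Return the sizes of the three sets as (listed, accepted, remaining)."""
    return len(LISTED), len(ACCEPTED), len(REMAINING)


if __name__ == "__main__":
    print(summarize())
    print(" ".join(f"{cp:04X}" for cp in sorted(REMAINING)))
```

Using sets rather than lists means that a code point entered twice by mistake is counted only once.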

4. Remaining discrepancies between the two definitions

The remaining sixteen characters are either punctuation, miscellaneous

symbols, or non-decimal numerics. All of these should be excluded

from use in identifiers for the reasons given below:

- 06D4 ARABIC FULL STOP

This is terminal punctuation used in writing Urdu. It is not a letter.

- 0E4F THAI CHARACTER FONGMAN

This character is the Thai version of the "bullet" symbol, and its usage

is similar to that of the bullet. It is not a letter, and does not

form part of Thai words.

- 0E5A THAI CHARACTER ANGKHANKHU

THAI CHARACTER ANGKHANKHU is a terminal punctuation character which

is used at the end of a verse in a poem. It is not a letter.

- 0E5B THAI CHARACTER KHOMUT

THAI CHARACTER KHOMUT is a terminal punctuation character which is

placed at the end of a verse in a poem. It is not a letter.

- 0F2A..0F33 TIBETAN DIGIT HALF ONE..TIBETAN DIGIT HALF ZERO

These are the only numeric characters in the list recommended in Annex A

of TR 10176 that are neither decimal digits nor letter-like numbers.

They are traditional and non-decimal in nature. We think it is a mistake to

include them alongside the decimal digits, which are clearly used in identifiers.

- 0F3E..0F3F TIBETAN SIGN YAR TSHES..TIBETAN SIGN MAR TSHES

These two characters are traditional symbols which are used to bracket the

numeric items in an enumerated list. A series of numbers with these signs

beneath them is used as bullets. Their sole use is as punctuation under

these circumstances.
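
Implementors who wish to compare the descriptions above with the character property tables can do so mechanically. The following is a minimal sketch, assuming Python and its standard unicodedata module; the helper names classify and report are illustrative. The properties returned are those of the Unicode Character Database version bundled with the interpreter in use, so the category labels may not match the wording of this report character for character. The sketch only asks whether each character is a letter (general category beginning with "L") or a decimal digit (general category "Nd").

```python
import unicodedata

# The sixteen code points discussed in section 4.
REMAINING = [
    0x06D4, 0x0E4F, 0x0E5A, 0x0E5B,
    *range(0x0F2A, 0x0F34),  # 0F2A..0F33, the ten half digits
    0x0F3E, 0x0F3F,
]


def classify(cp):
    """Look up basic properties of one code point in the bundled database."""
    ch = chr(cp)
    category = unicodedata.category(ch)
    return {
        "code": f"{cp:04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": category,
        "is_letter": category.startswith("L"),
        "is_decimal_digit": category == "Nd",
    }


def report():
    """Classify every remaining code point, in the order listed above."""
    return [classify(cp) for cp in REMAINING]


if __name__ == "__main__":
    for row in report():
        print(row["code"], row["category"], row["name"])
```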

5. Recommendation

The Unicode Technical Committee and NCITS/L2 have reviewed the existing

differences between ISO/IEC TR 10176 and the Unicode Standard. We agree

that we should update the Unicode Standard identifier definition to include

ten of the disputed characters from TR 10176, Annex A.

On the other hand, since punctuation characters, non-decimal digits, and

miscellaneous symbols are inappropriate for use in identifiers, we ask

that WG20 review the remaining sixteen characters as specified in section 4

above and remove them from the list of characters to be used as extended

letters for identifiers in programming languages. As it currently stands,

the presence of these sixteen characters in Annex A constitutes a serious

defect in TR 10176.

end of document