SC22/WG20 N609
Proposed Defect Report re Identifiers in TR 10176
October 13, 1998
Title: Defect Report on Annex A of TR 10176
Source: Ken Whistler, Lisa Moore, and Rick McGowan
Status: Defect Report
Action: For the consideration of WG20
Distribution: UTC, L2 and WG20 members
1. Background
There continue to be differences between the repertoire of characters
included in the Unicode Standard's identifier syntax and those
documented in ISO/IEC TR 10176, second edition, Annex A.
The software industry and its users will not be well served
if platforms and applications have different restrictions on
character behavior or usage with respect to identifiers.
Today, the character property information maintained as part of
the Unicode Standard is widely used as the source of character
semantic information in implementations of ISO/IEC 10646/Unicode.
By using these easy-to-access tables, implementors have
ensured consistent behavior across platforms and applications.
To enable consistent character usage in the future, we need to agree
on a set of characters appropriate for identifiers. Our goal is to
include in the Unicode definition all characters referenced by
TR 10176 as extended repertoire for user-defined identifiers in
programming languages. This is the only way to ensure portable,
consistent behavior in implementations of ISO/IEC 10646/Unicode.
However, because of the various mismatches between the recommendations
of Annex A in TR 10176 and the Unicode data tables, this is not
currently feasible.
2. Existing discrepancies
Among the characters included in the list recommended for identifiers
in TR 10176 Annex A, there are twenty-six that are not part of the
Unicode definition of identifiers:
06D4 ARABIC FULL STOP
0950 DEVANAGARI OM
0AD0 GUJARATI OM
0E2F THAI CHARACTER PAIYANNOI
0E4F THAI CHARACTER FONGMAN
0E5A THAI CHARACTER ANGKHANKHU
0E5B THAI CHARACTER KHOMUT
0F00 TIBETAN SYLLABLE OM
0F3E TIBETAN SIGN YAR TSHES
0F3F TIBETAN SIGN MAR TSHES
0F88 TIBETAN SIGN LCE TSA CAN
0F89 TIBETAN SIGN MCHU CAN
0F8A TIBETAN SIGN GRU CAN RGYINGS
0F8B TIBETAN SIGN GRU MED RGYINGS
0F33 TIBETAN DIGIT HALF ZERO
0F2A TIBETAN DIGIT HALF ONE
0F2B TIBETAN DIGIT HALF TWO
0F2C TIBETAN DIGIT HALF THREE
0F2D TIBETAN DIGIT HALF FOUR
0F2E TIBETAN DIGIT HALF FIVE
0F2F TIBETAN DIGIT HALF SIX
0F30 TIBETAN DIGIT HALF SEVEN
0F31 TIBETAN DIGIT HALF EIGHT
0F32 TIBETAN DIGIT HALF NINE
3006 IDEOGRAPHIC CLOSING MARK
30FB KATAKANA MIDDLE DOT
3. Changes to be made in the Unicode definition
After discussion among members of the UTC, we are in agreement that
the Unicode definition of identifiers should be modified to include
the following ten characters from the above list:
0950 DEVANAGARI OM
0AD0 GUJARATI OM
0E2F THAI CHARACTER PAIYANNOI
0F00 TIBETAN SYLLABLE OM
0F88 TIBETAN SIGN LCE TSA CAN
0F89 TIBETAN SIGN MCHU CAN
0F8A TIBETAN SIGN GRU CAN RGYINGS
0F8B TIBETAN SIGN GRU MED RGYINGS
3006 IDEOGRAPHIC CLOSING MARK
30FB KATAKANA MIDDLE DOT
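The counts behind the lists in sections 2 and 3 can be checked mechanically.
The following is a minimal sketch in Python; the names DISPUTED, ACCEPTED,
REMAINING and fmt are illustrative only and appear in neither TR 10176 nor
the Unicode Standard. It takes the twenty-six code points from section 2,
subtracts the ten listed above, and leaves sixteen.

```python
# Code points listed in section 2: recommended for identifiers in
# TR 10176 Annex A but not part of the Unicode identifier definition.
DISPUTED = (
    {0x06D4, 0x0950, 0x0AD0, 0x0E2F, 0x0E4F, 0x0E5A, 0x0E5B, 0x0F00}
    | {0x0F3E, 0x0F3F}
    | set(range(0x0F88, 0x0F8C))  # 0F88..0F8B
    | set(range(0x0F2A, 0x0F34))  # 0F2A..0F33, the ten half digits
    | {0x3006, 0x30FB}
)

# Code points listed in section 3: to be added to the Unicode definition.
ACCEPTED = {
    0x0950, 0x0AD0, 0x0E2F, 0x0F00,
    0x0F88, 0x0F89, 0x0F8A, 0x0F8B,
    0x3006, 0x30FB,
}

# Whatever is left over is still in dispute.
REMAINING = DISPUTED - ACCEPTED


def fmt(code_points):
    """Format a set of code points as sorted four-digit hex values."""
    return " ".join(f"{cp:04X}" for cp in sorted(code_points))


print(len(DISPUTED), "disputed: ", fmt(DISPUTED))
print(len(ACCEPTED), "accepted: ", fmt(ACCEPTED))
print(len(REMAINING), "remaining:", fmt(REMAINING))
```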
4. Remaining discrepancies between the two definitions
The remaining sixteen characters are either punctuation, miscellaneous
symbols, or non-decimal numerics. All of these should be excluded
from use in identifiers for the reasons given below:
- 06D4 ARABIC FULL STOP
This is terminal punctuation used in writing Urdu. It is not a letter.
- 0E4F THAI CHARACTER FONGMAN
This character is the Thai version of the "bullet" symbol, and its usage
is similar to that of the bullet. It is not a letter, and does not
constitute part of words of Thai.
- 0E5A THAI CHARACTER ANGKHANKHU
THAI CHARACTER ANGKHANKHU is a terminal punctuation character which
is used at the end of a verse in a poem. It is not a letter.
- 0E5B THAI CHARACTER KHOMUT
THAI CHARACTER KHOMUT is a terminal punctuation character which is
placed at the end of a verse in a poem. It is not a letter.
- 0F2A..0F33 TIBETAN DIGIT HALF ONE..TIBETAN DIGIT HALF ZERO
These are the only non-decimal, non-letter-like numeric characters included
in the list recommended in Annex A of TR 10176.
They are traditional and non-decimal in nature. We think it is a mistake to
include them alongside the decimal digits, which are clearly used in identifiers.
- 0F3E..0F3F TIBETAN SIGN YAR TSHES..TIBETAN SIGN MAR TSHES
These two characters are traditional symbols which are used to bracket the
numeric items in an enumerated list. A series of numbers with these signs
beneath them serves as a set of bullets. Their sole use is as punctuation
under these circumstances.
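The distinction drawn above between the half digits 0F2A..0F33 and true
decimal digits can be illustrated with the character property data itself.
The following is a sketch only, using Python's standard unicodedata module.
The comparison range 0F20..0F29 (TIBETAN DIGIT ZERO..NINE) is not taken from
Annex A and is added here purely for contrast. The helper name describe is
hypothetical, and the values reported depend on the version of the Unicode
data shipped with the Python build in use.

```python
import unicodedata

# The ten half digits discussed above.
HALF_DIGITS = [chr(cp) for cp in range(0x0F2A, 0x0F34)]     # 0F2A..0F33
# Ordinary Tibetan decimal digits, included only for comparison.
DECIMAL_DIGITS = [chr(cp) for cp in range(0x0F20, 0x0F2A)]  # 0F20..0F29


def describe(ch):
    """Return code point, name, general category, decimal and numeric value."""
    return (
        f"{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),  # None when there is no decimal digit value
        unicodedata.numeric(ch, None),  # None when there is no numeric value
    )


for ch in DECIMAL_DIGITS + HALF_DIGITS:
    print(*describe(ch))
```

A character for which the decimal lookup yields None cannot take part in
positional decimal notation, which is the sense of "non-decimal" used above.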
5. Recommendation
The Unicode Technical Committee and NCITS/L2 have reviewed the existing
differences between ISO/IEC TR 10176 and the Unicode Standard. We agree
that we should update the Unicode Standard identifier definition to include
ten of the disputed characters from TR 10176, Annex A.
On the other hand, since punctuation characters, non-decimal digits, and
miscellaneous symbols are inappropriate for use in identifiers, we ask
that WG20 review the remaining sixteen characters specified in section 4
above and remove them from the list of characters to be used as extended
letters for identifiers in programming languages. As it currently stands,
the presence of these sixteen characters in Annex A constitutes a serious
defect in TR 10176.