SC22/WG20 N795

Known problems with case mapping tables in ISO/IEC TR 14652

From: Kenneth Whistler [kenw@sybase.com]
Sent: Monday, October 23, 2000 11:10 PM

Ann,

> My very strong preference is that TR 14652 be brought in line with
> UnicodeData.txt, except for explicable differences.

I think that WG20 should take this as a very clearly expressed preference.

I have raised this issue in WG20 on more than one occasion, most recently
at the Québec meeting in May. Unfortunately, Keld has bristled and balked
on each occasion, giving me little confidence that this is something that
WG20 can actually accomplish. The claim, in the most recent instance, was
that the LC_CTYPE data had been "checked" already, by which I inferred that
it had been run through a POSIX-compliant parser and had no syntax errors
in it, but not that it had been validated against UnicodeData.txt, so that
any differences could either be corrected or explained.

In fact, Keld was very passionate on this issue, claiming that "for case
mapping the data should be *ours* [i.e. SC22's] and [that] the UTC has no
business defining it." I, on the other hand, feel that ISO technical
committees should be in the business of producing consensus standards, and
if there is preestablished practice in wide use by many implementers and
vendors, there is a pretty good prima facie case for using that
demonstrated consensus as the basis for developing any standard in that
particular technical area.

At any rate, we will argue this once again at the WG20 meeting next week.

> In order for COBOL to use UnicodeData.txt, the database would have to be
> copied into the COBOL standard and probably made reader-friendly, adding
> to the size of the COBOL standard and introducing the opportunity for
> errors. I would like to avoid that. The worst effect is that it would
> subject the content to comment from a review audience having a lot less
> expertise than WG20 and the Unicode consortium; this can only result in
> delay for COBOL even if all the comments are invalid.

I agree that this would be an undesirable outcome. And, if we can solve the
particular problem of the case mapping tables for DTR 14652, this one
reference problem for the COBOL standard can be avoided.

However, I think this is only the tip of the iceberg for SC22 programming
languages dealing with 10646/Unicode. The fact is that significant, widely
implemented consensus standards regarding implementation details of
10646/Unicode (e.g., the bidi algorithm, the normalization algorithm, line
breaking property tables, name preparation for IDN, etc.) are being
developed by non-JTC1 standardization organizations like the UTC, the IETF,
or the W3C. It is just not feasible (or desirable) for WG20 to try to
replicate this work in ISO standards or to try to create competing ISO
standards in the same area. So one of these days the SC22 language
committees and JTC1 are going to have to come to grips with how to
reference and make use of consensus, implemented standards that don't
happen to have ISO labels on them.

> From my limited viewpoint, I cannot imagine inexplicable differences
> between TR 14652 and UnicodeData.txt. I had understood there were errors
> resulting from the difficulty of ensuring consistency between the two.
> Now there are edge cases. I need to understand what these are and why
> there are differences.

Well, I guess I am going to have to catalog the entire list for the WG20
meeting, since it is apparent that the editor of DTR 14652 is not going to
do so.
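(A difference list of this kind can also be generated mechanically from the
Unicode Character Database. The following is only a minimal sketch, assuming
Python 3 and a local copy of UnicodeData.txt; the file name and the
particular code points checked are illustrative assumptions, chosen from the
issues listed below.)

    # Sketch: print the simple case-mapping fields from UnicodeData.txt for
    # a few code points relevant to the issues discussed below, so the
    # TR 14652 LC_CTYPE tables can be compared against them.
    # Fields: 0 = code point, 1 = name, 12 = uppercase, 13 = lowercase,
    # 14 = titlecase (per the UnicodeData.txt file format).
    CODE_POINTS = {"0130", "0131", "019F", "01A6", "0280", "10D0", "20AC"}

    with open("UnicodeData.txt", encoding="ascii") as f:
        for line in f:
            fields = line.rstrip("\n").split(";")
            if fields[0] in CODE_POINTS:
                code, name = fields[0], fields[1]
                upper, lower, title = fields[12], fields[13], fields[14]
                print(f"U+{code} {name}: upper={upper or '-'} "
                      f"lower={lower or '-'} title={title or '-'}")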
Issues that I know of right now:

1. (major) The LC_CTYPE definition is based on the Unicode 2.0 repertoire,
   which is now 4 years old. The implementers are moving on to Unicode 3.0,
   and at this point it makes sense for any LC_CTYPE definition in a
   not-yet-finished TR to be based on the *current* ISO/IEC 10646-1:2000
   publication. One obvious omission that ought to give Europeans pause:
   U+20AC EURO SIGN is not included in the current 14652 table.

2. The table omits an uppercase mapping for 0131 and a lowercase mapping
   for 0130. This was a deliberate choice by the editor, to avoid the
   "Turkish case mapping problem". But of course, the problem is not
   avoided by omitting it.

3. The table erroneously includes an uppercase mapping for Mkhedruli
   Georgian characters, which are caseless. This error has been pointed out
   before, but the editor does not want to change it, since it would result
   in asymmetrical case mappings for the Georgian alphabets.

4. A lowercase mapping for 019F is missing from the table.

5. The case pair for 01A6/0280 is not recognized, and is missing from both
   tables.

> It would be far better for COBOL to reference TR 14652 and override the
> cases that might need to be different if there are such.

I believe that if DTR 14652 were updated to match UnicodeData.txt, the only
case that you might have to override would be the Turkish i's, depending on
how you want the equivalence classes for i's to work out for COBOL
identifiers.

> For Bill Klein: As currently specified, COBOL does not fold from or to
> the small dotless I, the capital dotted I, or the final sigma - using TR
> 14652 as a reference and folding tolower.

> > No, 30 characters would mean a maximum of 120 bytes in the worst case,
> > since all encodable characters are guaranteed to be in the range

> COBOL is counting in "code units". COBOL uses the term "character
> position", which means the same. Each of the code elements of a combining
> sequence is one character position. My IBM source on Unicode has said
> that the industry direction for Unicode data is to treat each "character"
> of a combining sequence as a separate character.

I think there is still a confusion of terminology here.

The last statement is correct: a combining character sequence is a sequence
of characters, and most processes that count "characters" would count each
of the characters of that sequence, rather than trying to do the high-level
analysis to determine a graphemic count.

However, "code unit" doesn't mean what you think it does. In Unicode-speak,
"code point" is equivalent to the COBOL usage "character position". That is
the term for the encoded character, regardless of the number of bytes
required to express that character in a computer register in a particular
encoding form. "Code unit", on the other hand, refers to the integral data
unit used as the basis for expressing the character in a particular
encoding form.
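(A minimal sketch of the distinction, assuming Python 3, where len() of a
str counts code points and the encoded forms expose the code units; the
sample string simply strings together the same three characters that are
broken out individually in the examples that follow.)

    # Code points vs. code units for one sample string:
    # LATIN CAPITAL LETTER A, CJK ideograph 'one', MUSICAL SYMBOL DOUBLE SHARP.
    s = "\u0041\u4E00\U0001D103"

    print(len(s))                           # 3 code points ("character positions")
    print(len(s.encode("utf-8")))           # 8 UTF-8 code units  (1 + 3 + 4 bytes)
    print(len(s.encode("utf-16-be")) // 2)  # 4 UTF-16 code units (1 + 1 + 2 wydes)
    print(len(s.encode("utf-32-be")) // 4)  # 3 UTF-32 code units (1 + 1 + 1 words)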
Here are three examples, to illustrate the distinctions.

U+0041 LATIN CAPITAL LETTER A
  U+0041                 the code point of the character (its encoding)
  0x41                   UTF-8 encoding form:  1 code unit  (byte)
  0x0041                 UTF-16 encoding form: 1 code unit  (wyde)
  0x00000041             UTF-32 encoding form: 1 code unit  (word)

U+4E00 CJK UNIFIED IDEOGRAPH-4E00 (the Chinese character for 'one')
  U+4E00                 the code point of the character (its encoding)
  0xE4 0xB8 0x80         UTF-8 encoding form:  3 code units (bytes)
  0x4E00                 UTF-16 encoding form: 1 code unit  (wyde)
  0x00004E00             UTF-32 encoding form: 1 code unit  (word)

U+1D103 MUSICAL SYMBOL DOUBLE SHARP
  U+1D103                the code point of the character (its encoding)
  0xF0 0x9D 0x84 0x83    UTF-8 encoding form:  4 code units (bytes)
  0xD834 0xDD03          UTF-16 encoding form: 2 code units (wydes)
  0x0001D103             UTF-32 encoding form: 1 code unit  (word)

> I thought being consistent for identifiers would be best for users. The
> COBOL spec doesn't say how many bytes this takes. Each implementor can
> work this out for the codeset(s) supported for source code. Most current
> COBOL implementations count bytes for Kanji characters in identifiers,
> but I don't want that in the standard.

My original assumption about Japanese implementations is that they would be
counting bytes for double-byte characters. And that is why I assumed you
would want to be counting bytes (= code units) for support of UTF-8 in
identifiers in COBOL.

But if the intention is to define the extensions for identifiers in COBOL
in terms of *characters* (character positions) instead of bytes, despite
existing COBOL implementations for Japanese, then the right answer for
UTF-8 as well would be to count characters (code points) -- the conclusion
that I reached after the current intentions for the COBOL standard in this
regard were explained.

So once again, I will state the consequences for implementing Unicode in
COBOL identifiers under that assumption.

For UTF-8, the maximum storage needed for 30 characters is 120 bytes
(4 bytes each).

For UTF-16, the maximum storage needed for 30 characters is 120 bytes
(two 2-byte wydes each).

For UTF-32, the maximum (and minimum) storage needed for 30 characters is
120 bytes (one 4-byte word each).

Easy, no?

--Ken

> Thanks for all the time you're devoting to COBOL.
>
> Ann Bennett