From owner-sc22wg5  Fri Sep  6 15:47:09 2002
Date: Fri, 6 Sep 2002 14:51:49 +0100 (BST)
From: John Reid <jkr@rl.ac.uk>
Message-Id: <200209061351.OAA21337@jkr.cc.rl.ac.uk>
To: SC22WG5@dkuug.dk
Subject: Re: (SC22WG5.2546) SC22 meeting
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"

Dear Richard, Van, Malcolm,
 
> John Reid writes:
>  > 1. Rather than specifying ISO_10646 in our SELECTED_CHAR_KIND 
>  >    intrinsic, we should perhaps consider the three encodings of it:
>  >    UTF-8, UTF-16, UTF-32.  UTF-16 appears to have a lot of merit and is
>  >    catching on. It involves variable-width 16-bit strings....
> 
> I haven't spent much time (at all) thinking about this, but my initial
> reaction is that such a variable-length encoding isn't trivially
> compatible with our current definition of character kinds.  It would
> mess up anything relating to storage association.  I also don't
> see how any of our string declarations are going to be able to
> allocate space unless they allocate 32 bits for each character
> position, which seems to sort of defeat the whole idea.

I believe (/hope) you are wrong. Japanese 8-bit encodings of variable
length were in use at the time the nondefault character stuff in F90
was designed. I believe that an implementor can use a fixed-length
pointer to an allocated target and everything will work.

> In (SC22WG5.2544), John Reid remarked that other languages are beginning
> to allow international characters in identifiers.  I'm not convinced this
> is possible in Fortran, as the lexer and parser sometimes don't know
> where an identifier is until the end of a statement is reached, so it may
> not be possible to know when to switch alphabets.

If UTF-16 were to be used in the names, the whole source file
would be in UTF-16, so I do not think there would be any problem.

> If this could be
> allowed, care must be taken in doing so, so as not to allow more than one
> set to be in use at one time.  The reason is that characters that have
> different encodings in different parts of ISO-10646 have the same
> appearance.  Consider Latin B, Russian B (named "vuh") and Greek B (named
> "veeta" -- digression:  we say "beta", but the Latin/English "b" sound is
> written in Greek as mu pi).  If I write identifiers BBB (Latin, Latin,
> Latin) and BBB (Latin, Greek, Russian), are they the same or different? 
> If they're different, this represents an enormous opportunity to multiply
> maintenance costs by a large factor, and leads one to suspect the
> proposal originally arose in the International Brotherhood of Maintenance
> Programmers.  It's bad enough allowing O and 0 in the same identifier.

These problems were mentioned in Saariselka. We would need to decide
which characters to allow in names. It would help if IMPLICIT NONE were
made a requirement whenever extended characters were in use.
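
To make the look-alike problem concrete, here is a minimal sketch in
Python (not Fortran, and not a proposal for the standard). It shows that
the three B's are distinct ISO 10646 characters, and one possible way a
processor might reject names that mix alphabets. The helper names
scripts_used and is_single_script are hypothetical, and taking the first
word of the Unicode character name as the "script" is only a rough
assumption made for illustration.

```python
import unicodedata

# Three capital letters that look alike but have different code points.
LATIN_B = "\u0042"     # LATIN CAPITAL LETTER B
GREEK_B = "\u0392"     # GREEK CAPITAL LETTER BETA
CYRILLIC_B = "\u0412"  # CYRILLIC CAPITAL LETTER VE


def scripts_used(name):
    """Return the set of scripts seen among the letters of `name`.

    Assumption: the first word of the Unicode character name
    (LATIN, GREEK, CYRILLIC, ...) is a good enough script label.
    """
    return {unicodedata.name(ch).split()[0] for ch in name if ch.isalpha()}


def is_single_script(name):
    """True if every letter in `name` comes from the same script."""
    return len(scripts_used(name)) <= 1


all_latin = LATIN_B * 3                 # Latin, Latin, Latin
mixed = LATIN_B + GREEK_B + CYRILLIC_B  # Latin, Greek, Russian

print(all_latin == mixed)               # the two names compare unequal
print(sorted(scripts_used(mixed)))
print(is_single_script(all_latin), is_single_script(mixed))
```

A rule of this kind would only address mixing within one name; which
characters to allow at all would still have to be decided.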

> In any case, at this time, allowing other than the invariant set of
> ISO-646 goes against the guidelines proposed in the fourth edition of
> ISO/IEC JTC1/SC22 TR 10176 "Guidelines for the preparation of programming
> language standards," which is the subject of a current DTR ballot (paper
> JTC1 -- NOT WG5! -- N6815).  In 4.1.3.1.1, it says "As far  as possible,
> the language should be defined in terms only of the characters included
> within ISO/IEC 646, avoiding use of any that are in national use
> positions." It goes on to say "The guideline relates to the need for
> international interchange of programs, and hence is based on the
> principle of using a minimal set of characters which can be expected to
> be common to all systems likely to use the programs."

You are reading the bit on program text. Try reading  4.1.3.1.3.
 
> I also agree with Richard's reservations about UTF-16.  Many years ago,
> CDC had a "6-12" encoding system.  There were very complicated rules to
> compute the length of a character literal (actually a Hollerith literal
> at that time).  It was a mess that I don't want to duplicate.  It is
> remotely possible that strenuous pondering of the issue may find a way to
> support it, but I, for one, don't want to hurt myself in trying. 
> Besides, I have a long list of stuff I'd rather do.

This is not true of UTF-16. If there are m characters that are in
the main set and n characters that are in the set of extras, the
length is 16*m + 32*n bits.
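
As a rough illustration in Python (a sketch, not a statement about how
any Fortran processor stores strings): counting in 16-bit units, a
string with m main-set characters and n extras occupies m + 2*n units.
The function name utf16_units is hypothetical; the check against
Python's own "utf-16-be" encoder is there only to confirm the count.

```python
def utf16_units(text):
    """Return (m, n, units) for `text` encoded as UTF-16.

    m     = characters in the main set (code points up to U+FFFF)
    n     = characters in the set of extras (code points above U+FFFF)
    units = number of 16-bit values needed, i.e. m + 2*n
    """
    m = sum(1 for ch in text if ord(ch) <= 0xFFFF)
    n = len(text) - m
    return m, n, m + 2 * n


# Two main-set characters plus one extra (U+1D11E, a musical symbol).
sample = "AB\U0001D11E"
m, n, units = utf16_units(sample)

# Cross-check: the encoder emits 2 bytes per 16-bit unit, with no
# byte-order mark for the explicit "utf-16-be" form.
encoded_units = len(sample.encode("utf-16-be")) // 2
print(m, n, units, encoded_units)
```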

> > UTF-16 appears to have a lot of merit and is catching on.
> > It involves variable-width 16-bit strings.

> > There are
> >   2048 special 16-bit values, which allow the frequently-used
> >   characters to be represented directly in 16 bits and the rest
> >   (actually up to 1,048,576) to be represented as a pair of specials.
> >   No 'escape' mechanism is needed since the special characters may be
> >   recognized directly.
> 
> Well, that *is* an "escape" mechanism... just with 2048 escape
> characters. 

No. Each 16-bit value can be directly recognized as non-special,
a left special or a right special. 
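
A minimal sketch of that recognition, in Python rather than Fortran.
The ranges are the UTF-16 surrogate ranges (1024 left specials from
D800 to DBFF and 1024 right specials from DC00 to DFFF, 2048 in all);
the function names classify, code_units and decode_pair are
hypothetical.

```python
def classify(unit):
    """Classify one 16-bit value by inspecting it alone."""
    if 0xD800 <= unit <= 0xDBFF:
        return "left special"
    if 0xDC00 <= unit <= 0xDFFF:
        return "right special"
    return "non-special"


def code_units(text):
    """Split `text` into its UTF-16 16-bit values (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def decode_pair(left, right):
    """Combine a left/right pair of specials into one code point."""
    return 0x10000 + ((left - 0xD800) << 10) + (right - 0xDC00)


units = code_units("A\U0001D11E")
kinds = [classify(u) for u in units]
print([hex(u) for u in units], kinds)
print(hex(decode_pair(units[1], units[2])))
```

Since each pair carries 10 + 10 bits, 1024 * 1024 = 1,048,576 further
characters can be represented, and no value needs any context beyond
itself to be classified.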
 
> IMO the argument in favour of using UTF-16 for character
> variables (namely, that most people will only need half the storage) is
> fighting the battles of yesteryear (in particular, that memory is
> expensive).

Yes, I could support that position. I only suggested that we might
consider the other two encodings. We need to wait and see if any 
country asks for this.

John.