From owner-sc22wg5  Thu Sep  5 13:46:05 2002
From: Malcolm Cohen <malcolm@nag.co.uk>
Message-Id: <200209051148.g85BmNK73453@brackley.nag.co.uk>
Subject: ISO 10646 and Unicode
To: SC22WG5@dkuug.dk
Date: Thu, 5 Sep 2002 12:48:23 +0100 (BST)

John Reid said:
>1. Rather than specifying ISO_10646 in our SELECTED_CHAR_KIND 
>   intrinsic, we should perhaps consider the three encodings of it:
>   UTF-8, UTF-16, UTF-32.

UTF-8, UTF-16 and UTF-32 are not defined by ISO 10646.
They are defined by the Unicode consortium.

There are two major character set encodings in ISO 10646: UCS-2 and UCS-4.

UCS-2 and UCS-4 are both fixed-width encodings (two and four bytes per
character respectively); thus not all UCS-4 characters can be
represented in UCS-2.

UTF-8 and UTF-16 are both variable-width encodings; all ISO 10646
characters may be represented in these two encodings.

UTF-32 is a fixed-width encoding that is almost identical to UCS-4.
(An amendment is in the works for ISO 10646 to make UCS-4 the same as
what UTF-32 is now.)
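
To make the width differences concrete, here is a rough sketch in
Python (chosen only for illustration; the names encoded_sizes and
fits_in_ucs2 are made up for this example).  It reports how many bytes
one character occupies in each Unicode encoding form, and whether its
code point fits in 16 bits.  The UCS-2 test is a simplification that
ignores the reserved surrogate range.

```python
# Illustrative only: byte counts for sample characters in each
# Unicode encoding form (explicit big-endian codecs, so no BOM).
SAMPLES = {
    "U+0041 LATIN CAPITAL LETTER A": "\u0041",
    "U+4E2D CJK ideograph (BMP)": "\u4e2d",
    "U+20000 CJK ideograph (plane 2)": "\U00020000",
}


def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


def fits_in_ucs2(ch):
    """Assumption: UCS-2 holds exactly the code points up to U+FFFF."""
    return ord(ch) <= 0xFFFF


if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        print(name, encoded_sizes(ch), "fits in UCS-2:", fits_in_ucs2(ch))
```

Only the UTF-32 column is constant; the plane 2 character needs four
bytes in every form and does not fit in UCS-2 at all.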

> UTF-16 appears to have a lot of merit and is catching on.
> It involves variable-width 16-bit strings.

As mentioned by others, this is a serious demerit!  Indeed, it is worse
than that: implementing Fortran substring semantics would require
scanning the whole string from the beginning, since each character may
occupy two or four bytes of memory.  Unless of course you take the view
that the user has to cope with the 2/4 question.  This would be most
unhelpful for e.g. Chinese/Japanese/Korean users, but appears to be the
solution favoured by the Unicode consortium.
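
A small sketch of why the scan is needed, again in Python purely for
illustration (the function names are hypothetical, and indices are
0-based here rather than 1-based as in Fortran).  To locate the n-th
character in UTF-16 storage, every earlier character has to be
inspected; with fixed-width 32-bit storage the offset is simply the
index.

```python
import struct


def utf16_units(text):
    """Encode text as a list of 16-bit UTF-16 code units (no BOM)."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


def char_offset_utf16(units, index):
    """Return the code-unit offset of the character at 0-based index.

    Each preceding character occupies one or two code units, so the
    loop must visit all of them: the cost grows with index.
    """
    pos = 0
    for _ in range(index):
        # A high surrogate (0xD800..0xDBFF) starts a two-unit character.
        pos += 2 if 0xD800 <= units[pos] <= 0xDBFF else 1
    return pos


def char_offset_ucs4(index):
    """With fixed-width 32-bit characters the offset is the index."""
    return index


if __name__ == "__main__":
    units = utf16_units("a\U00020000b")
    # The third character starts at code unit 3, not 2.
    print(char_offset_utf16(units, 2), char_offset_ucs4(2))
```

A substring such as STRING(I:J) would need two such scans (or an
auxiliary index) under UTF-16, but only address arithmetic under
UCS-4.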

UTF-8 and UTF-16 are of more use as file formats than as a CHARACTER
data type.  This points to one of the notable omissions in our
internationalisation effort: there is no way of specifying in what
format a file should be written (viz. processor-dependent, UTF8, UTF16,
UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE).
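
To show that these really are distinct on-disk formats, here is a
sketch using Python's standard codecs as stand-ins for the names
above (this is not a proposal for Fortran syntax; FORMATS and
encode_all are names invented for the example).  The same text yields
different byte sequences under each format, and the formats without
an explicit byte order prepend a byte-order mark.

```python
import codecs

# Python codec names standing in for UTF8, UTF16BE, UTF16LE, etc.
FORMATS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def encode_all(text):
    """Return the bytes that each explicit format would write for text."""
    return {name: text.encode(name) for name in FORMATS}


def starts_with_native_bom(text):
    """The plain "utf-16" codec writes a BOM in the machine's byte order."""
    return text.encode("utf-16").startswith(codecs.BOM_UTF16)


if __name__ == "__main__":
    for name, data in encode_all("A\U00020000").items():
        print("%-10s %s" % (name, data.hex()))
    print("utf-16 has native BOM:", starts_with_native_bom("A"))
```

A reader that is not told which of these formats was used has to
guess from the byte-order mark, if there is one.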

> There are
>   2048 special 16-bit values, which allow the frequently-used
>   characters to be represented directly in 16 bits and the rest
>   (actually up to 1,048,576) to be represented as a pair of specials.
>   No 'escape' mechanism is needed since the special characters may be
>   recognized directly.

Well, that *is* an "escape" mechanism... just with 2048 escape
characters.
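
For reference, the arithmetic behind the pairs can be sketched as
follows (Python for illustration; the function names are made up).
The 2048 specials split into 1024 "high" and 1024 "low" values, and
1024 x 1024 gives the 1,048,576 figure quoted above.

```python
HIGH_START = 0xD800  # first of 1024 high (leading) surrogates
LOW_START = 0xDC00   # first of 1024 low (trailing) surrogates


def is_special(unit):
    """True if a 16-bit value is one of the 2048 surrogate values."""
    return 0xD800 <= unit <= 0xDFFF


def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use a pair")
    offset = code_point - 0x10000  # a 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a (high, low) pair into the original code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


if __name__ == "__main__":
    high, low = to_surrogates(0x20000)
    print(hex(high), hex(low), hex(from_surrogates(high, low)))
```

Any code that sees a special value has to treat it as the start (or
end) of a two-unit sequence, which is what an escape mechanism does.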

> The Unicode Consortium wishes programming
> languages to support this data type.

Somehow, that does not surprise me...

It's great for those whose characters all lie in the basic multilingual
plane, and not so good for everyone else.  It gives rise to exactly the
same problems as using single-byte characters - it just affects fewer
characters.  So a broken program will fail less often (or only fail for
the Chinese, Japanese, Koreans, etc.) ...

If we're going to bother to solve the problem, we ought to do it right!

Back to the question:
>1. Rather than specifying ISO_10646 in our SELECTED_CHAR_KIND 

We should continue this specification.  Only UCS-4 provides all ISO 10646
characters together with Fortran substring semantics.

If at some future date some vendor or committee added a type of lesser
capability, e.g. UCS-2 or UTF-16 (or indeed UTF-8), there would appear
to be no problem with specifying the less capable encoding scheme by its
name.  IMO the argument in favour of using UTF-16 for character
variables (namely, that most people will only need half the storage) is
fighting the battles of yesteryear (in particular, that memory is
expensive).

Cheers,
-- 
...........................Malcolm Cohen, NAG Ltd., Oxford, U.K.
                           (malcolm@nag.co.uk)
