From owner-sc22wg5  Wed Sep  4 18:53:19 2002
Received: from mailhub.dfrc.nasa.gov (mailhub.dfrc.nasa.gov [130.134.81.12])
	by dkuug.dk (8.9.2/8.9.2) with ESMTP id SAA84252
	for <SC22WG5@dkuug.dk>; Wed, 4 Sep 2002 18:53:19 +0200 (CEST)
	(envelope-from maine@altair.dfrc.nasa.gov)
Received: from mail.dfrc.nasa.gov by mailhub.dfrc.nasa.gov with ESMTP for SC22WG5@dkuug.dk; Wed, 4 Sep 2002 09:34:12 -0700
Received: from altair.dfrc.nasa.gov ([130.134.164.107])
          by mail.dfrc.nasa.gov (Post.Office MTA v3.5.3 release 223
          ID# 0-71686U2500L200S0V35) with ESMTP id gov
          for <SC22WG5@dkuug.dk>; Wed, 4 Sep 2002 09:38:39 -0700
Received: (from maine@localhost)
	by altair.dfrc.nasa.gov (8.11.6/8.11.6) id g84GcbF02956;
	Wed, 4 Sep 2002 09:38:37 -0700
From: Richard Maine <maine@altair.dfrc.nasa.gov>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <15734.14092.907151.519624@altair.dfrc.nasa.gov>
Date: Wed, 4 Sep 2002 09:38:36 -0700
To: SC22WG5@dkuug.dk
Subject: (SC22WG5.2544) SC22 meeting
In-Reply-To: <200209041609.SAA83868@dkuug.dk>
References: <200209041609.SAA83868@dkuug.dk>
X-Mailer: VM 7.00 under 21.4 (patch 6) "Common Lisp" XEmacs Lucid

John Reid writes:
 > 1. Rather than specifying ISO_10646 in our SELECTED_CHAR_KIND 
 >    intrinsic, we should perhaps consider the three encodings of it:
 >    UTF-8, UTF-16, UTF-32.  UTF-16 appears to have a lot of merit and is
 >    catching on. It involves variable-width 16-bit strings....

I haven't spent much time (really, any) thinking about this, but my
initial reaction is that such a variable-length encoding isn't trivially
compatible with our current definition of character kinds.  It would
mess up anything relating to storage association.  I also don't
see how any of our string declarations could allocate space unless
they allocate 32 bits for each character position, which seems to
defeat the whole idea.
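
As a rough illustration of the allocation problem, here is a minimal
sketch in Python (not Fortran; the helper name code_units is
hypothetical) that counts how many code units one character needs in
each encoding.  Only UTF-32 comes out at a fixed one unit per character;
in UTF-16, a character outside the basic plane takes two 16-bit units,
i.e. 32 bits.

```python
def code_units(ch: str) -> dict:
    """Count the code units needed to store one character in each encoding."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# One sample character from each size class.
for ch in ["A", "\u00e9", "\u20ac", "\U0001d11e"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```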

Perhaps one could implement it by having all storage in memory be
32 bits per character of such a kind, with the variable-length
business applying only to formatted external I/O... maybe.  I'm
not sure whether that would meet the needs or not.  It might get us
file interoperability, but have problems with interoperability with
other languages.
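
A minimal sketch of that split, again in Python and under my own
assumptions (all four function names are hypothetical, and UTF-16 is
picked arbitrarily as the external form): memory holds a fixed 4 bytes
per character, so position i is always at byte offset 4*i, and the
variable-length form exists only at the I/O boundary.

```python
def to_memory(text: str) -> bytes:
    """In-memory form: fixed 32 bits (4 bytes) per character."""
    return text.encode("utf-32-le")


def char_at(mem: bytes, i: int) -> str:
    """Fixed-width storage allows direct indexing by character position."""
    return mem[4 * i : 4 * i + 4].decode("utf-32-le")


def write_external(mem: bytes) -> bytes:
    """Convert to variable-length UTF-16 only when writing out."""
    return mem.decode("utf-32-le").encode("utf-16-le")


def read_external(data: bytes) -> bytes:
    """Convert UTF-16 file data back to the fixed-width in-memory form."""
    return data.decode("utf-16-le").encode("utf-32-le")


mem = to_memory("A\U0001d11eB")   # 3 characters, 12 bytes in memory
ext = write_external(mem)         # 4 UTF-16 code units, 8 bytes on file
assert read_external(ext) == mem  # the round trip is lossless
```

Note that any other language expecting UTF-16 in memory would still
need a conversion at the call boundary, which is the interoperability
problem mentioned above.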

Well, I think my main message is that this would require non-trivial
thought... more than I've put into it.  Something might be doable,
but I doubt you want to rush in without considerable investigation.
Perhaps it is a subject more suitable for f2k+x than for f2k.

For f2k, we might consider something more modest - just renaming
our ISO_10646 stuff to reflect what it really is (we say it is UCS-4,
which I naively assume to be related to the above-referenced UTF-32).
If we might in the future support some other encoding of ISO_10646,
then perhaps we shouldn't co-opt the general name to mean only one
of the encodings.

-- 
Richard Maine                |  Good judgment comes from experience;
maine@altair.dfrc.nasa.gov   |  experience comes from bad judgment.
                             |        -- Mark Twain

