From owner-sc22wg5@open-std.org  Mon Nov  3 13:56:29 2008
Date: 03 Nov 2008 12:40:48 +0000
From: "N.M. Maclaren" <nmm1@cam.ac.uk>
To: sc22wg5@open-std.org
Subject: Re: Preparing for the Tokyo meeting
Message-ID: <Prayer.1.3.1.0811031240480.8191@hermes-1.csi.cam.ac.uk>

Jim Xia wrote to the J3 list:
>
> Would you please share with us on what architectures the coarray will be 
> less than helpful.  I thought one strength of coarrays is they're 
> architecture neutral.  My imagination is limited by whatever machine 
> architectures we're having today but I'm interested in learning its 
> potential limitations in future, so I'd like to hear your opinion where 
> you can foresee the coarray feature will fail.

I don't get the J3 list, so have only just seen this.  I shall be in Tokyo,
with a pure coarray hat on, so please let us have some in-depth discussions.

I attach a copy of a long paper on implementation techniques that the UK
people and Bill have seen, and Bill and I have debated.  Despite its length,
it glosses over the technical details, because those lead straight into
interrupt, memory and device handler designs (hardware, firmware and
operating system).  I should be very happy to discuss these, preferably over
a drink or two!

My personal executive summary is that, if we exclude VOLATILE, the only 
critical technical issue is what Fortran should say about progress. As 
N1744 says, there are several places where things need saying explicitly, 
but the issue there is wording rather than agreement on intent. Since 
writing N1744, I have had discussions with Aleksandar, and have realised 
that I underestimated the importance of the progress issue. Not that it is 
insoluble, more that it needs a hard decision (and then some wording to 
express that decision). 
I append (not attach) a short description of the issue, which may yet 
appear in a paper.

The killer is that, if Fortran requires 'transparent' access to coarrays on
other images (i.e. accesses that proceed irrespective of what the owning
image is doing), it is implementable using DEFINED hardware and software
facilities only if the hardware, operating system and compiler are all
provided by the same organisation (or by ones that collaborate so closely as
to be almost one).  Of course, that is assuming that people want reliable
implementations.

On the other hand, if it is to be implementable using only facilities that
are defined in formal or informal standards, it will be almost unusable. 
That's not nice, at all.  That is precisely why MPI has specified what it
has, and why UPC and POSIX threads do not work as many people claim that
they do.  And, yes, those problems arise in practice :-(

The worst systems are 'commodity clusters'.  I should be very interested
to talk to Toon about this, but the problem is one of very low-probability
failures, because the generated code relies on undefined behaviour, and
that really does mean undefined, not processor-dependent, behaviour.  Memory
race conditions are the main (but not the only) example.

VOLATILE coarrays make these problems a hundred times worse - and, from my
experience with POSIX threads, RDMAs, OpenMP etc., I do mean a hundred times
and possibly more.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1@cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679





Explanation of the Progress Issue
---------------------------------

The question is whether images P and Q can communicate through a coarray
on image R, irrespective of what R is doing at the time.  This is
extremely hard to implement on some systems, at least when R is in a
call to a companion processor, performing I/O or in a long-running
'pure' CPU loop.

For example:

        PROGRAM Progress
            INTEGER :: one[*] = 0
            SELECT CASE (THIS_IMAGE())
            CASE(1)
                one[9] = 123+one[8]      ! reads image 8, writes image 9
                SYNC IMAGES ( (/ 2 /) )
            CASE(2)
                SYNC IMAGES ( (/ 1 /) )
                PRINT *, one[9]
            CASE(8)
                one[2] = 456+one[1]      ! reads image 1, writes image 2
                SYNC IMAGES ( (/ 9 /) )
            CASE(9)
                SYNC IMAGES ( (/ 8 /) )
                PRINT *, one[1]
            END SELECT
        END PROGRAM Progress
 
Consider a processor where an image services requests for coarray data
that it owns only when it reaches an image control statement; this is
common for MPI, and is also done by the reference implementation of UPC.
The above program will deadlock, because image 1 will not reach its SYNC
IMAGES until after images 8 and 9 have responded, and image 8 will not
reach its SYNC IMAGES until after images 1 and 2 have responded.
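
The deadlock argument can be checked mechanically with a toy model.  The
following Python sketch is not a coarray runtime; it only models which image
is waiting for which.  The names (PROGRAM, stuck_images, the 'transparent'
flag) are invented for illustration, and it assumes that an image which is
waiting in SYNC IMAGES, or which has finished, does service requests for its
coarray data.

```python
# Toy model of the Progress program above; hypothetical names throughout.
# Each image runs a list of steps: ("access", targets) or ("sync", partner).
PROGRAM = {
    1: [("access", {8, 9}), ("sync", 2)],  # one[9] = 123+one[8]; SYNC IMAGES 2
    2: [("sync", 1), ("access", {9})],     # SYNC IMAGES 1; PRINT *, one[9]
    8: [("access", {1, 2}), ("sync", 9)],  # one[2] = 456+one[1]; SYNC IMAGES 9
    9: [("sync", 8), ("access", {1})],     # SYNC IMAGES 8; PRINT *, one[1]
}


def stuck_images(program, transparent):
    """Return the set of images that never finish under the given policy."""
    pc = {image: 0 for image in program}  # index of each image's next step

    def current(image):
        steps = program.get(image, [])
        i = pc.get(image, 0)
        return steps[i] if i < len(steps) else None  # None: image has finished

    def services_requests(image):
        step = current(image)
        # Assumption: an image services remote requests at any time if access
        # is 'transparent'; otherwise only while waiting in a sync or finished.
        return transparent or step is None or step[0] == "sync"

    progressed = True
    while progressed:  # repeat until no image can take another step
        progressed = False
        for image in program:
            step = current(image)
            if step is None:
                continue
            kind, arg = step
            if kind == "access" and all(services_requests(t) for t in arg):
                pc[image] += 1
                progressed = True
            elif kind == "sync" and current(arg) == ("sync", image):
                pc[image] += 1  # both partners leave the sync together
                pc[arg] += 1
                progressed = True
    return {image for image in program if current(image) is not None}


print(sorted(stuck_images(PROGRAM, transparent=False)))  # [1, 2, 8, 9]
print(sorted(stuck_images(PROGRAM, transparent=True)))   # []
```

Under the service-only-at-image-control-statements policy no image can take
even its first step, whereas with 'transparent' access all four run to
completion.  The model says nothing about which behaviour a conforming
processor must provide.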

Obviously, that is a poor implementation of coarrays, but that is not
the point at issue.  The question is whether it is a conforming
processor in the sense of 1.4 paragraph 2.



