Date: 14 Mar 2013 14:36:39 +0000
From: "N.M. Maclaren" <nmm1@cam.ac.uk>
To: sc22wg5 <sc22wg5@open-std.org>
Subject: Re: [ukfortran] (SC22WG5.4933) (j3.2006) WG5 ballot on first draft
 TS 18508, Additional Parallel Features in Fortran
Message-ID: <Prayer.1.3.5.1303141436390.24051@hermes-1.csi.cam.ac.uk>
In-Reply-To: <20130314131943.A733A356C23@www.open-std.org>
References: <20130308120458.DEF90356DB5@www.open-std.org>
 <20130313194312.8CB333569FF@www.open-std.org>
 <20130314131943.A733A356C23@www.open-std.org>
X-Mailer: Prayer v1.3.5
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; charset=ISO-8859-1
Sender: owner-sc22wg5@open-std.org
Precedence: bulk

On Mar 14 2013, Bill Long wrote:
>>
>> Image Failure
>> -------------
>>
>>      7.1) This is not a minor addition.  No language has ever managed to
>> standardise recovery of an application from general system-generated
>> errors or infrastructure failure, and even POSIX does not attempt it.
>> There are fundamental reasons why this should not be attempted in a
>> portable language.
>
>I agree that this is not minor. We would not have included the 
>capability if the issue were not so urgent and important. I fear that 
>the explanation of the feature was not clear enough.
>
>What is proposed is very similar to the way we treat I/O errors.  There 
>is a mechanism for notification of a problem (STAT=,  like I/O) and a 
>way to identify where the error occurred (failed images index values; 
>the I/O unit number is already available to the users).  Unlike I/O 
>where we have singled out some failure modes (end-of-file, for example), 
>we did not specify particular modes of failure for images. In current 
>experience, it is almost always a non-recoverable memory error, but I 
>think we should wait for more data before being more specific.   The 
>current spec is intentionally minimal.

The major difference is that I/O errors affect just one file, and the
minor one is that many of them are actually recoverable (though not, at
present, in Fortran).  The killer about node failures is that they are
necessarily NOT so localised.

>The recovery aspect is almost secondary.  ...   For the failed image 
>case, the main option is to reform the current team omitting the failed 
>images and continuing after changing to the new team.  The facility is 
>included as part of the TS because this is the first time we had the 
>capability of changing the execution team.   It is likely that in many 
>cases the choice will be the same as for I/O failure - to write out 
>check-point data and abort. This is still a much better option than to 
>be killed without any option to do something.

Hmm.  That brings in my point that such a failure ALSO corrupts the
default output and error units.

>I strongly disagree with a proposal to have this be a "vendor 
>extension".   That is the enemy of portable code.   For implementations 
>that do not support any detection, we included the usual escape that the 
>set of conditions that constitute an image failure is processor 
>dependent. The set could be empty for some vendors.  But for the rest, 
>for whom this is an important issue, having a standard syntax is the 
>only way to promote code portability.

I don't think that escape clause is anything like enough for something
that is so much more pervasive and serious than I/O errors.  At the
very least, there would have to be a much stronger one.


Regards,
Nick.

