From owner-sc22wg5+sc22wg5-dom8=www.open-std.org@open-std.org  Thu Mar 14 14:19:43 2013
Return-Path: <owner-sc22wg5+sc22wg5-dom8=www.open-std.org@open-std.org>
X-Original-To: sc22wg5-dom8
Delivered-To: sc22wg5-dom8@www.open-std.org
Received: by www.open-std.org (Postfix, from userid 521)
	id 12523356D54; Thu, 14 Mar 2013 14:19:42 +0100 (CET)
Delivered-To: sc22wg5@open-std.org
X-Greylist: delayed 484 seconds by postgrey-1.34 at www5.open-std.org; Thu, 14 Mar 2013 14:19:41 CET
Received: from exprod6og103.obsmtp.com (exprod6og103.obsmtp.com [64.18.1.185])
	by www.open-std.org (Postfix) with ESMTP id 6ED75356666
	for <sc22wg5@open-std.org>; Thu, 14 Mar 2013 14:19:40 +0100 (CET)
Received: from CFWEX01.americas.cray.com ([136.162.34.11]) (using TLSv1) by exprod6ob103.postini.com ([64.18.5.12]) with SMTP
	ID DSNKUUHObHu5Hc+j/HsG5tS6LHlDR0XytGlU@postini.com; Thu, 14 Mar 2013 06:19:41 PDT
Received: from fortran.us.cray.com (172.31.19.200) by
 CFWEX01.americas.cray.com (172.30.88.25) with Microsoft SMTP Server id
 14.2.342.3; Thu, 14 Mar 2013 08:05:15 -0500
Message-ID: <5141CB6A.9030702@cray.com>
Date: Thu, 14 Mar 2013 08:06:50 -0500
From: Bill Long <longb@cray.com>
Reply-To: <longb@cray.com>
Organization: Cray Inc.
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: sc22wg5 <sc22wg5@open-std.org>
Subject: Re: (j3.2006) (SC22WG5.4931) WG5 ballot on first draft TS 18508,
 Additional Parallel Features in Fortran
References: <20130308120458.DEF90356DB5@www.open-std.org> <20130313194312.8CB333569FF@www.open-std.org>
In-Reply-To: <20130313194312.8CB333569FF@www.open-std.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-sc22wg5@open-std.org
Precedence: bulk



On 3/13/13 2:43 PM, N.M. Maclaren wrote:
> Image Failure
> -------------
>
>      7.1) This is not a minor addition.  No language has ever managed to
> standardise recovery of an application from general system-generated
> errors or infrastructure failure, and even POSIX does not attempt it.
> There are fundamental reasons why this should not be attempted in a
> portable language.


I agree that this is not minor. We would not have included the 
capability if the issue were not so urgent and important. I fear that 
the explanation of the feature was not clear enough.

What is proposed is very similar to the way we treat I/O errors.  There 
is a mechanism for notification of a problem (STAT=, as for I/O) and a
way to identify where the error occurred (the image index values of the
failed images; for I/O, the unit number is already available to the
user).  Unlike I/O, where we have singled out some failure modes
(end-of-file, for example), we did not specify particular modes of
failure for images. In current
experience, it is almost always a non-recoverable memory error, but I 
think we should wait for more data before being more specific.   The 
current spec is intentionally minimal.

The recovery aspect is almost secondary.  In the case of I/O, there is 
no general recovery option either.  The user may decide that the file 
that triggered an error was not that important, and the program can 
continue. Or they might intentionally read past the end of the file as a 
mechanism for stopping a loop, and have "recovery" implicit in the code. 
  Or in the case of a write failure, perhaps writing to a different file 
is an option.  In the worst case, writing out some check-point data and 
aborting is still better than an immediate abort.   For the failed image
case, the main option is to form a new team that omits the failed
images and to continue after changing to that team.  The facility is
included as part of the TS because this is the first time we have had the
capability of changing the execution team.   It is likely that in many
cases the choice will be the same as for I/O failure - to write out 
check-point data and abort. This is still a much better option than to
be killed without any option to do something.

I strongly disagree with a proposal to have this be a "vendor 
extension".   That is the enemy of portable code.   For implementations 
that do not support any detection, we included the usual escape: the
set of conditions that constitute an image failure is processor
dependent. The set could be empty for some vendors.  But for the rest,
for whom this is an important issue, having a standard syntax is the 
only way to promote code portability.

Cheers,
Bill




-- 
Bill Long                                           longb@cray.com
Fortran Technical Support    &                 voice: 651-605-9024
Bioinformatics Software Development            fax:   651-605-9142
Cray Inc./Cray Plaza, Suite 210/380 Jackson St./St. Paul, MN 55101


