Message-ID: <51421DDB.4080401@cray.com>
Date: Thu, 14 Mar 2013 13:58:35 -0500
From: Bill Long <longb@cray.com>
Reply-To: <longb@cray.com>
Organization: Cray Inc.
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: sc22wg5 <sc22wg5@open-std.org>
Subject: Re: (j3.2006) (SC22WG5.4934) [ukfortran] WG5 ballot on first draft
 TS 18508, Additional Parallel Features in Fortran
References: <20130308120458.DEF90356DB5@www.open-std.org> <20130313194312.8CB333569FF@www.open-std.org> <20130314131943.A733A356C23@www.open-std.org> <20130314143641.2081B356D4F@www.open-std.org>
In-Reply-To: <20130314143641.2081B356D4F@www.open-std.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-sc22wg5@open-std.org
Precedence: bulk



On 3/14/13 9:36 AM, N.M. Maclaren wrote:
>> >What is proposed is very similar to the way we treat I/O errors.  There
>> >is a mechanism for notification of a problem (STAT=,  like I/O) and a
>> >way to identify where the error occurred (failed images index values;
>> >the I/O unit number is already available to the users).  Unlike I/O
>> >where we have singled out some failure modes (end-of-file, for example),
>> >we did not specify particular modes of failure for images. In current
>> >experience, it is almost always a non-recoverable memory error, but I
>> >think we should wait for more data before being more specific.   The
>> >current spec is intentionally minimal.

> The major difference is that I/O errors affect just one file, and the
> minor one is that many of them are actually recoverable (though not, at
> present, in Fortran).  The killer about node failure is that they are
> necessarily NOT so localised.
>

I/O errors like reading past the end of file will affect just that file. 
Errors related to hardware failure of a disk array might affect all of 
the files used by the program.

Any incomplete data transfer into or out of a failed image is probably
corrupt, and the standard needs to be written assuming that is the case.
How many other images are affected will depend highly on the nature of
the program.

I would note that we already have STAT= specifiers on existing
statements like SYNC ALL.  These already provide a means to report a
failed image by defining the status variable with a processor-dependent
value.  The new feature in the TS draft is to make that particular error
status equal to the value of a standard-defined named constant.  This
change is motivated by the new capability of effectively changing the
number of images in the job, which makes it potentially possible for the
program to actually do something about the problem.
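
As a rough illustration of the STAT= usage described above, here is a
minimal sketch.  It assumes the named constant is spelled
STAT_FAILED_IMAGE and is available from ISO_FORTRAN_ENV, and that the
failed image index values are obtained through a FAILED_IMAGES
intrinsic.  Those names are my assumptions here (they are the spellings
used in Fortran 2018) and may not match the draft under ballot.  The
recovery action is only a placeholder.

```fortran
program sync_with_stat
  ! Assumed: STAT_FAILED_IMAGE is provided by ISO_FORTRAN_ENV.
  use, intrinsic :: iso_fortran_env, only: stat_failed_image
  implicit none
  integer :: status

  ! Without STAT=, an error condition here would terminate execution.
  sync all (stat=status)

  if (status == 0) then
     ! Normal case: every image reached the barrier.
     continue
  else if (status == stat_failed_image) then
     ! Portable test against the standard-defined named constant,
     ! instead of a processor-dependent status value.
     ! Assumed: FAILED_IMAGES() returns the failed image index values.
     print *, 'image', this_image(), ': failed images:', failed_images()
     ! Placeholder: a real program would regroup the surviving
     ! images and redistribute the lost work here.
  else
     ! Any other nonzero value is still processor dependent.
     print *, 'image', this_image(), ': SYNC ALL status =', status
     error stop
  end if
end program sync_with_stat
```

The point of the named constant is the second branch: the program can
tell an image failure apart from other errors without knowing the
processor-dependent status values.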

Cheers,
Bill


-- 
Bill Long                                           longb@cray.com
Fortran Technical Support    &                 voice: 651-605-9024
Bioinformatics Software Development            fax:   651-605-9142
Cray Inc./Cray Plaza, Suite 210/380 Jackson St./St. Paul, MN 55101


