From owner-sc22wg5+sc22wg5-dom8=www.open-std.org@open-std.org  Wed May 15 15:17:42 2013
Return-Path: <owner-sc22wg5+sc22wg5-dom8=www.open-std.org@open-std.org>
X-Original-To: sc22wg5-dom8
Delivered-To: sc22wg5-dom8@www.open-std.org
Received: by www.open-std.org (Postfix, from userid 521)
	id 415FE356E4A; Wed, 15 May 2013 15:17:42 +0200 (CEST)
Delivered-To: sc22wg5@open-std.org
Received: from exprod6og110.obsmtp.com (exprod6og110.obsmtp.com [64.18.1.25])
	by www.open-std.org (Postfix) with ESMTP id CCCE0356995
	for <sc22wg5@open-std.org>; Wed, 15 May 2013 15:16:56 +0200 (CEST)
Received: from CFWEX01.americas.cray.com ([136.162.34.11]) (using TLSv1) by exprod6ob110.postini.com ([64.18.5.12]) with SMTP
	ID DSNKUZOKxg722bdnD5DK71avRv0U5JFF+qRo@postini.com; Wed, 15 May 2013 06:17:41 PDT
Received: from fortran.us.cray.com (172.31.19.200) by
 CFWEX01.americas.cray.com (172.30.88.25) with Microsoft SMTP Server id
 14.2.342.3; Wed, 15 May 2013 08:14:55 -0500
Message-ID: <51938B16.6070903@cray.com>
Date: Wed, 15 May 2013 08:18:14 -0500
From: Bill Long <longb@cray.com>
Reply-To: <longb@cray.com>
Organization: Cray Inc.
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130509 Thunderbird/17.0.6
MIME-Version: 1.0
CC: Tom Clune <Thomas.L.Clune@nasa.gov>, Mark Batty <mbatty@cantab.net>,
	Daniel C Chen <cdchen@ca.ibm.com>, "Lionel, Steve" <steve.lionel@intel.com>,
	Lorri Menard <lorri.menard@intel.com>, "N.M. Maclaren" <nmm1@cam.ac.uk>,
	"sc22wg5@open-std.org" <sc22wg5@open-std.org>
Subject: Re: (j3.2006) (SC22WG5.4995) Existing support for uses of atomics
 in Fortran coarray codes
References: <Prayer.1.3.5.1305141957470.21184@hermes-2.csi.cam.ac.uk> <20130514201252.6BDAE356E8A@www.open-std.org> <20130514212423.4DB6D356E8B@www.open-std.org>
In-Reply-To: <20130514212423.4DB6D356E8B@www.open-std.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-sc22wg5@open-std.org
Precedence: bulk

Hi Tom,

On 5/14/13 4:01 PM, Tom Clune wrote:
> Bill,
>
> As I frequently work with pseudospectral models (alas not with
> co-arrays), your second example intrigues _and_ confuses me.   In the
> all-to-all cases that I have, the section of the remote buffer that the
> local process writes to is invariant.   So I don't see what the atomic
> update is accomplishing.   Your example seems to be a
> generalization/replacement of all-to-all for when buffer sizes and
> ordering are not computable in advance and/or vary frequently.
>
> If the sendcount/recvcount are computable in advance, is there still a
> measurable performance benefit to the use of co-arrays vs MPI?

It is certainly possible. If you ship off the data from one image while
others might still be computing, you can cut down on the overall
execution time and reduce congestion in the network.  It avoids having
all of the images enter a barrier at the beginning of the process, as
happens with MPI_Alltoall.   At the end of the process, you could either
have a SYNC ALL, forcing everyone to wait, or try something more
fine-grained.  For example, have a coarray integer counter associated
with each receiver that is atomically incremented by the senders
following a SYNC MEMORY on the send side.  The SYNC MEMORY and atomic
increment could be moved down the code in the sender if there is
computation available that does not affect the transfer data.  On
the receiving side, you will know all of your data has arrived in the
buffer when the count gets to num_images() - 1 (assuming you are not
sending to yourself), or to 0 if you start the counter with the value
1 - num_images().  This might be useful if an image can go on to
process its buffer before the other images have finished receiving
theirs.
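
This is roughly what that counter scheme looks like in code.  A minimal,
untested sketch: the variable names are invented, the "payload" is a
single element per sender, and ATOMIC_ADD is assumed to be available
(it is part of the TS 18508 atomics, not Fortran 2008, which has only
ATOMIC_DEFINE and ATOMIC_REF).

program counter_completion
   use, intrinsic :: iso_fortran_env, only : atomic_int_kind
   implicit none
   real, allocatable :: buffer(:)[:]         ! receive buffer, one slot per sender
   integer(atomic_int_kind) :: counter[*]    ! completion counter on every image
   integer(atomic_int_kind) :: done
   integer :: t

   allocate(buffer(num_images())[*])         ! allocation implies a synchronization
   counter = 1 - num_images()                ! reaches 0 when all other images check in
   sync all                                  ! counters initialized before any sends

   ! Send side: put my contribution into every other image's buffer, then
   ! signal that image.  The SYNC MEMORY / ATOMIC_ADD pair could be pushed
   ! further down the code if there is unrelated computation to overlap.
   do t = 1, num_images()
      if (t == this_image()) cycle
      buffer(this_image())[t] = real(this_image())  ! the data transfer (one element here)
      sync memory                                   ! complete the put before signalling
      call atomic_add(counter[t], 1)                ! tell image t my data has arrived
   end do

   ! Receive side: spin until every other image has checked in; no global barrier.
   do
      call atomic_ref(done, counter)
      if (done == 0) exit
   end do
   sync memory                               ! now safe to use buffer(:) locally
end program counter_completion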

You do raise an interesting and important point.  Coarrays are *not* 
MPI.  It is often not a good idea to think in terms of MPI and then try 
to translate the MPI calls into similar coarray code. To get the most 
out of coarrays, you often benefit from "think different".  A lot of 
code involves MPI_Alltoall because the code was arranged to meet the 
constraints of that routine, which was all that was available.  The 
capabilities in Fortran now for parallel coding might enable a different
scheme that is both faster and simpler.  Translating MPI call for call
is like someone converting a C code into Fortran who always writes a
loop rather than using a simple array assignment.  As experience grows,
I envision people
getting a lot better at thinking in terms of coarrays initially and 
getting better outcomes.

Cheers,
Bill



>
> - Tom
>
>
> On May 14, 2013, at 4:12 PM, Bill Long <longb@cray.com
> <mailto:longb@cray.com>> wrote:
>
>>
>>
>> On 5/14/13 1:57 PM, N.M. Maclaren wrote:
>>> Mark Batty, Peter Sewell and I had a discussion about atomic
>>> semantics and
>>> Fortran this morning, but there were a couple of things that were rather
>>> important and I didn't have a feel for the answers.  Specifically, what
>>> semantics are provided by existing coarray atomics, and how they are used
>>> in real programs.  We are definitely going to have to decide what to say
>>> about this at Delft, and the problem isn't simple :-(
>>>
>>> In particular:
>>>
>>>    Do implementations guarantee coherence of access to a single atomic
>>> location and, if not, what do they guarantee?
>>
>> As long as all of the accesses are done using atomic operations, there
>> should be no problem. If the network atomics and the local processor
>> atomics are not coherent (this depends on the hardware characteristics)
>> then atomics to local memory locations need to bounce off the NIC if
>> remote atomics are possible.
>>>
>>>    What, if anything, do they guarantee about the consistency of accesses
>>> to two different atomic locations?
>>
>> None without explicit SYNC operations. Similar to the current atomic
>> subroutines in Fortran.
>>
>>>
>>> We know what POWER and x86 hardware guarantee, but have no idea of what
>>> (more? less?) is guaranteed in coarray implementations.  Even with using
>>> MPI as a basis, it could be anything from sequential consistency to
>>> nothing, depending on the details of the implementation.  And the fancy
>>> RDMA networks are another matter entirely!
>>>
>>> It would also be useful if there were some examples of how they are used
>>> for ordering (i.e. in combination with SYNC MEMORY) and running totals
>>> etc.  Specifically, any use where the consistency semantics matter to the
>>> program.
>>
>> Two examples come to mind.
>>
>> 1) The famous (notorious?) Table Toy benchmark code, also known as the
>> "Random Access" benchmark.  It involves a large distributed table of
>> 64-bit integers spread across many images with all of those images
>> asynchronously replacing a randomly located table value with the XOR of
>> the current table value with a local value.  The "standard" version of
>> the code is several pages of MPI calls. Well beyond comprehension by
>> normal humans.   The coarray version is a simple loop of about 10-15
>> lines.  The loop has no explicit synchronizations.
>>
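
For reference, the coarray version's inner loop is roughly the
following.  An untested sketch only, not the benchmark itself: the
table size, update count, and random stream below are stand-ins, and
ATOMIC_XOR is assumed (a TS 18508 intrinsic, not Fortran 2008).

program table_toy_sketch
   use, intrinsic :: iso_fortran_env, only : atomic_int_kind, int64, real64
   implicit none
   integer, parameter :: local_size = 2**16          ! table entries per image (toy size)
   integer, parameter :: nupdates   = 100000         ! updates issued per image (toy count)
   integer(atomic_int_kind) :: table(local_size)[*]  ! the distributed table
   integer(int64) :: val
   real(real64)   :: r(2)
   integer :: img, loc, i

   table = 0
   sync all                                          ! table initialized everywhere

   ! Every image fires XOR updates at random locations on random images,
   ! with no synchronization inside the loop.
   do i = 1, nupdates
      call random_number(r)
      img = min(int(r(1)*num_images()) + 1, num_images())  ! random target image
      loc = min(int(r(2)*local_size)   + 1, local_size)    ! random slot on that image
      val = int(i, int64)*2654435761_int64 + this_image()  ! stand-in update value
      call atomic_xor(table(loc)[img], val)                ! remote atomic XOR, no locks
   end do

   sync all                                          ! all updates complete
end program table_toy_sketch
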
>> 2) By far the most common usage I've seen of atomics in coarray codes is
>> for buffer filling.  Suppose you have a "receiving" buffer that is a
>> coarray of globally known size on each image, and a separate coarray
>> integer equal to  the subscript value of the next free element in the
>> buffer array.   The goal is for other images to write data into buffers
>> on remote images.  The process is simple: If I want to write N elements
>> into the buffer in image T, I do a "fetch and add" atomic  of N on the
>> buffer subscript on image T.  The returned value is the old starting
>> point in the buffer on that image. If that value + N is still within the
>> buffer, do the assignment buffer(old_val:old_val+N-1)[T] = mydata(1:N).
>> Several images can be "attacking" image T asynchronously and each gets a
>> non-overlapping part of the buffer as the target of the assignment.
>> Basically no synchronization involved.   This code sequence is,
>> effectively, the guts of the "all-to-all" memory rearrangement that
>> seems to pop up in multiple codes.  In  practice it is about twice as
>> fast as the standard-distribution MPI_Alltoall routine for the same
>> operation. And it has the significant advantage that images can send the
>> data to remote locations as soon as it is ready, rather than waiting for
>> a global sync (as would be the case with the MPI call). This allows some
>> images to be sending data across the network while others are computing.
>> And if the implementation does the sends as non-blocking operations
>> (until the next image control statement) there is also overlap of local
>> computation and communication. Besides all-to-all, the scheme  can also
>> be used as a way to add items to a remote work queue without having to
>> deal with the overhead of locks.   In my experience, this is the "killer
>> app" for coarray atomics.
>>
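
A minimal sketch of that buffer-filling sequence, assuming the
ATOMIC_FETCH_ADD intrinsic from TS 18508 and invented names (buffer,
next_free, mydata).  Each image ships one chunk to one arbitrarily
chosen target image here, but several senders can target the same
image concurrently and still receive disjoint slices.

program buffer_fill_sketch
   use, intrinsic :: iso_fortran_env, only : atomic_int_kind
   implicit none
   integer, parameter :: bufsize = 10000
   real :: buffer(bufsize)[*]                 ! receive buffer on every image
   real :: mydata(100)                        ! the data this image wants to ship
   integer(atomic_int_kind) :: next_free[*]   ! subscript of the next free slot
   integer(atomic_int_kind) :: old_val
   integer :: t, n

   next_free = 1
   call random_number(mydata)
   sync all                                   ! buffers and counters initialized

   n = size(mydata)
   t = mod(this_image(), num_images()) + 1    ! some target image (the next one, say)

   ! Reserve n slots in image t's buffer.  The fetch-and-add is the only
   ! coordination; concurrent senders are handed non-overlapping slices.
   call atomic_fetch_add(next_free[t], n, old_val)
   if (old_val + n - 1 <= bufsize) then
      buffer(old_val:old_val+n-1)[t] = mydata ! put the data into the reserved slice
   end if

   sync all                                   ! before image t consumes its buffer
end program buffer_fill_sketch
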
>> Cheers,
>> Bill
>>
>>
>>>
>>> Any feedback appreciated.  Thanks.
>>>
>>>
>>> Regards,
>>> Nick Maclaren.
>>>
>>>
>>>
>>
>> --
>> Bill Long longb@cray.com <mailto:longb@cray.com>
>> Fortran Technical Support    &                 voice: 651-605-9024
>> Bioinformatics Software Development            fax:   651-605-9142
>> Cray Inc./Cray Plaza, Suite 210/380 Jackson St./St. Paul, MN 55101
>>
>>
>> _______________________________________________
>> J3 mailing list
>> J3@mailman.j3-fortran.org <mailto:J3@mailman.j3-fortran.org>
>> http://mailman.j3-fortran.org/mailman/listinfo/j3
>
> Thomas Clune, Ph. D. <Thomas.L.Clune@nasa.gov>
> Chief, Software Systems Support Office      Code 610.3
> NASA GSFC                                   301-286-4635
> MS 610.8 B33-C128                           <http://ssso.gsfc.nasa.gov>
> Greenbelt, MD 20771
>
>
>
>
>
>
>
> _______________________________________________
> J3 mailing list
> J3@mailman.j3-fortran.org
> http://mailman.j3-fortran.org/mailman/listinfo/j3
>

-- 
Bill Long                                           longb@cray.com
Fortran Technical Support    &                 voice: 651-605-9024
Bioinformatics Software Development            fax:   651-605-9142
Cray Inc./Cray Plaza, Suite 210/380 Jackson St./St. Paul, MN 55101


