From owner-sc22wg5@open-std.org  Mon Sep  6 11:32:16 2010
Message-ID: <4C84B492.5050706@stfc.ac.uk>
Date: Mon, 06 Sep 2010 10:29:54 +0100
From: John Reid <John.Reid@stfc.ac.uk>
Reply-To: John.Reid@stfc.ac.uk
Organization: Rutherford Appleton Laboratory
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.23) Gecko/20090908 Fedora/1.1.18-1.fc10 SeaMonkey/1.1.18
MIME-Version: 1.0
To: WG5 <sc22wg5@open-std.org>
Subject: Requirements for the CAF TR
References: <4C56ECD3.90501@stfc.ac.uk> <4C56FAFD.60505@cray.com> <4C57F16D.8080505@stfc.ac.uk> <4C80BD1A.1050208@stfc.ac.uk> <4C83DDE1.1020501@lrz.de>
In-Reply-To: <4C83DDE1.1020501@lrz.de>
Content-Type: multipart/mixed;
 boundary="------------080608060500020805070507"

This is a multi-part message in MIME format.
--------------080608060500020805070507
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

WG5,

With the help of Bill Long, Reinhold Bader, Jim Xia, and Bob Numrich, I have 
constructed a draft paper on requirements for the TR of further coarray 
features, which is attached.

Our starting point for this TR is Resolution LV5 of the WG5 meeting in Las 
Vegas, Feb. 2008:
"That WG5 declares that the content of the Technical Report on Enhanced Coarray 
Facilities in Fortran is as shown in document J3/08-131r1. Further, WG5 expects 
the TR to be published during the second quarter of 2011."

At the WG5 meeting in Las Vegas, Feb. 2010, the target date for publication was 
changed to November 2012 (see resolution LV5 and document N1812). There has been 
no formal WG5 discussion since 2008 on the technical content, but an informal 
view has emerged that we should not stick rigidly to what we had in 2008. We 
need to have a full discussion of this at the meeting next June. N1835 is a 
first attempt at a "requirements document" that can inform this discussion.

Would anyone else like to contribute at this time? Responses to any of the 
proposals in the current version are welcome, too. I would like to issue the 
final version of this paper quite soon, but expect it to be superseded by new 
papers in the months ahead before we meet, so there is no immediate hurry.

With best wishes,

John.


--------------080608060500020805070507
Content-Type: text/plain;
 name="N1835-6.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="N1835-6.txt"

                                         ISO/IEC JTC1/SC22/WG5 N1835-6

           Requirements for TR of further coarray features 

                              John Reid

                          6 September 2010

Resolution LV5 of the WG5 meeting in Las Vegas, Feb. 2008 reads:
"LV5.  Content and processing of TR on Enhanced Coarray Facilities
That WG5 declares that the content of the Technical Report on Enhanced
Coarray Facilities in Fortran is as shown in document J3/08-131r1.
Further, WG5 expects the TR to be published during the second quarter
of 2011."  

At the WG5 meeting in Las Vegas, Feb. 2010, the target date for 
publication was changed to November 2012 (see resolution LV5 and document
N1812), but there has been no discussion since 2008 on the technical
content. The aim of this paper is to explore alternatives to the 
features of J3/08-131r1, which are:

1) Collective intrinsic subroutines:
   CO_ALL
   CO_ANY
   CO_COUNT
   CO_MAXLOC
   CO_MAXVAL
   CO_MINLOC
   CO_MINVAL
   CO_PRODUCT
   CO_SUM

2) Teams and features that require teams: 
   Team formation and inquiry; FORM_TEAM, TEAM_IMAGES intrinsics,
         and the IMAGE_TEAM type.
   SYNC TEAM statement
   TEAM specifiers in I/O statements
      
3) The NOTIFY and QUERY statements.

4) Files connected on more than one image, except for the files
   preconnected to the units specified by OUTPUT_UNIT and ERROR_UNIT.

A draft TR, containing exactly these features, is visible as J3/10-166,
and this paper will use J3/10-166 as its base document. 

I hope that the technical content of the TR can be decided at the WG5 
meeting in June 2011. 

This paper is arranged as a set of proposals, each with a summary and
(where appropriate) its technical details. The titles are

Proposal 1. Replace the set of collectives
Proposal 2. Add atomic compare-and-swap and other atomic subroutines
Proposal 3. Proposals from Bob Numrich with comments
Proposal 4. Add coscalars
Proposal 5. Allow asynchronous execution of the collectives
Proposal 6. Reconsider the handling of coarrays in a team
Proposal 7. Add coarray pointers
Proposal 8. Allow asymmetric allocatable and pointer objects
Proposal 9. Suggestions for changes to the NOTIFY and QUERY statements
---------------------------------------------------------------------------

Proposal 1. Replace the set of collectives
Suggested by: Bill Long. 
Summary
Replace the collective intrinsic subroutines by the new set
   CO_BCAST
   CO_MAX
   CO_MIN
   CO_REDUCE
   CO_SUM
which are not image control statements.

The current collective subroutine CO_SUM lacks some
features that would improve its usability and improve performance. A
new version is desired with these enhancements:

1) The RESULT variable should be optional. If it is not present, the
result of the computation is assigned to the SOURCE argument.
Rationale: The current specification requires declaring a second
variable to be used for the RESULT, which is often unnecessary.

2) SOURCE, and RESULT if present, should be allowed to be
non-coarrays. Rationale: This significantly expands the potential
usability of the routine, particularly in the context of integrating
coarrays into existing codes. Internally, the routine could have a
coarray of a derived type with a component that is a pointer to the
supplied SOURCE or RESULT argument, and perform the computation using
that structure. As an optimization, for the case of a scalar argument,
the routine could have an internal coarray into which the source is
copied at entry and from which the result is copied at completion.

3) Add a new optional argument, RESULT_IMAGE. If this is present, the
result is assigned only on the identified image, and not broadcast to
all the images. On all other images, the result variable becomes
undefined.  Rationale: This is a reasonably common usage, and
eliminating the broadcast improves performance.

The designs for CO_MAX and CO_MIN follow that of CO_SUM. Much of the
same infrastructure can be reused. CO_MAX and CO_MIN differ from
CO_SUM in that they allow arguments of type character, and do not
allow arguments of type complex.

The new intrinsic subroutine
    CO_BCAST (SOURCE, SOURCE_IMAGE [, TEAM])
would broadcast a value to all images of a team.

The new intrinsic subroutine
   CO_REDUCE (SOURCE, OPERATION [, RESULT, TEAM, RESULT_IMAGE]) 
would provide a general routine for operations not currently covered. 
This subroutine could also handle arguments of derived type as long as the 
specified operation was defined for the type. The specification follows 
that for CO_SUM, with the addition of an OPERATION argument. 

Technical details

CO_BCAST (SOURCE, SOURCE_IMAGE [, TEAM])

Description. Broadcast of a value to all images of a team.

Class. Collective subroutine.

Arguments.

SOURCE shall be a coarray. It is an INTENT(INOUT) argument. SOURCE
becomes defined on all images of the team with the value of SOURCE on
image SOURCE_IMAGE.

SOURCE_IMAGE shall be of type integer. It is an INTENT(IN) argument. Its
value shall be the image number of one of the images in the team.

TEAM (optional) shall be a scalar of type IMAGE_TEAM. It is an
INTENT(IN) argument that specifies the team for which the broadcast is
performed. If TEAM is not present, the team consists of all images.


Example. If SOURCE is the array [1, 5, 3] on image one, after
execution of CALL CO_BCAST(SOURCE,1) the value of SOURCE on all images
is [1, 5, 3].
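
The broadcast semantics above can be sketched with a small serial model. This is
only an illustration in Python, not Fortran and not an implementation: each
image is modelled as one entry of a list, images are numbered from 1, and the
function name co_bcast and its list-based calling convention are assumptions
made for this sketch.

```python
# Serial model: one list entry per image; images are numbered from 1.
def co_bcast(source_per_image, source_image, team=None):
    """Model of the proposed CO_BCAST (SOURCE, SOURCE_IMAGE [, TEAM])."""
    num_images = len(source_per_image)
    # If TEAM is not present, the team consists of all images.
    members = list(range(1, num_images + 1)) if team is None else list(team)
    if source_image not in members:
        raise ValueError("SOURCE_IMAGE must be an image of the team")
    value = list(source_per_image[source_image - 1])
    for image in members:
        # SOURCE is INTENT(INOUT): it is overwritten on every team member.
        source_per_image[image - 1] = list(value)
    return source_per_image


images = [[1, 5, 3], [0, 0, 0], [9, 9, 9]]
co_bcast(images, 1)
print(images)
```

Images outside the team are left untouched in this model, which is one reading
of "all images of the team".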

........................................................

CO_REDUCE (SOURCE, OPERATION [, RESULT, TEAM, RESULT_IMAGE])

Description. General reduction of elements on a team of images.

Class. Collective subroutine.

Arguments.

SOURCE shall be of a type for which the operation specified by the
OPERATION argument is defined. It is an INTENT(INOUT) argument. It may
be a scalar or an array. If it is a scalar, the computation result is
equal to a processor-dependent and image-dependent approximation to
the application of the operation specified by the OPERATION argument
to the values of SOURCE on all images of the team. If it is an array,
the value of the computation result is equal to a processor-dependent
and image-dependent approximation to the application of the operation
specified by the OPERATION argument to all the corresponding elements
of SOURCE on the images of the team. If RESULT is not present, the value
of the computation result is assigned to SOURCE. If RESULT is present,
SOURCE is not modified.

OPERATION shall be an external procedure that defines the binary,
commutative operation to be performed. The specified procedure shall
have two scalar arguments of the same type and type parameters as
SOURCE, and return a result of the same type and type parameters as
SOURCE. The result of executing the procedure is the value of
performing the intended operation with the two arguments as operands.

RESULT (optional) shall be of the same type, type parameters, and
shape as SOURCE. It is an INTENT(OUT) argument. If RESULT is present,
the value of the computation result is assigned to RESULT.

TEAM (optional) shall be a scalar of type IMAGE_TEAM (4.4.2). It is an
INTENT(IN) argument that specifies the team for which CO_REDUCE is
performed. If TEAM is not present, the team consists of all images.

RESULT_IMAGE (optional) shall be of type integer. It is an INTENT(IN)
argument. Its value shall be the image number of one of the images in
the team. If RESULT_IMAGE is present and RESULT is present, the result
of the computation is assigned to RESULT on image RESULT_IMAGE and
RESULT on all other images becomes undefined. If RESULT_IMAGE is
present and RESULT is not present, the result of the computation is
assigned to SOURCE on image RESULT_IMAGE and SOURCE on all other
images becomes undefined.

Example. If the number of images is two and SOURCE is the array [1, 5,
3] on one image and [4, 1, 6] on the other image, and MyADD is a
function that returns the sum of its two integer arguments, the value
of RESULT after executing the statement CALL CO_REDUCE(SOURCE, MyADD,
RESULT) is [5,6,9].
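
The rules for RESULT and RESULT_IMAGE are easy to misread, so here is a serial
Python model of them. It is a sketch under stated assumptions, not the proposed
Fortran interface: images are list entries numbered from 1, the marker
UNDEFINED stands in for "becomes undefined", and the names co_reduce and my_add
are hypothetical. CO_SUM corresponds to the special case where the operation is
addition.

```python
from functools import reduce

UNDEFINED = None  # stands in for "becomes undefined" in this model


def my_add(a, b):
    """Binary operation applied element by element, as MyADD in the example."""
    return a + b


def co_reduce(source, operation, result=None, team=None, result_image=None):
    """Model of CO_REDUCE (SOURCE, OPERATION [, RESULT, TEAM, RESULT_IMAGE]).

    source and result are lists with one array per image (image i at index i-1).
    """
    num_images = len(source)
    members = list(range(1, num_images + 1)) if team is None else list(team)
    # Reduce the corresponding elements of SOURCE over the images of the team.
    columns = zip(*(source[image - 1] for image in members))
    value = [reduce(operation, column) for column in columns]
    # If RESULT is present it receives the value and SOURCE is not modified.
    target = source if result is None else result
    for image in members:
        if result_image is None or image == result_image:
            target[image - 1] = list(value)
        else:
            # With RESULT_IMAGE present, other images end up undefined.
            target[image - 1] = UNDEFINED
    return value


src = [[1, 5, 3], [4, 1, 6]]
res = [None, None]
co_reduce(src, my_add, result=res)
print(res)  # [[5, 6, 9], [5, 6, 9]]
```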

.................................................

CO_SUM (SOURCE [, RESULT, TEAM, RESULT_IMAGE])

Description. Sum elements on a team of images.

Class. Collective subroutine.

Arguments.

SOURCE shall be of numeric type. It is an INTENT(INOUT) argument. It
may be a scalar or an array. If it is a scalar, the computation result
is equal to a processor-dependent and image-dependent approximation to
the sum of the value of SOURCE on all images of the team. If it is an
array, the value of the computation result is equal to a
processor-dependent and image-dependent approximation to the sum of
all the corresponding elements of SOURCE on the images of the team. If
RESULT is not present, the value of the computation result is assigned to
SOURCE. If RESULT is present, SOURCE is not modified.

RESULT (optional) shall be of the same type, type parameters, and
shape as SOURCE. It is an INTENT(OUT) argument. If RESULT is present,
the value of the computation result is assigned to RESULT.

TEAM (optional) shall be a scalar of type IMAGE_TEAM (4.4.2). It is an
INTENT(IN) argument that specifies the team for which CO_SUM is
performed. If TEAM is not present, the team consists of all images.

RESULT_IMAGE (optional) shall be of type integer. It is an INTENT(IN)
argument. Its value shall be the image number of one of the images in
the team. If RESULT_IMAGE is present and RESULT is present, the result
of the computation is assigned to RESULT on image RESULT_IMAGE and
RESULT on all other images becomes undefined. If RESULT_IMAGE is
present and RESULT is not present, the result of the computation is
assigned to SOURCE on image RESULT_IMAGE and SOURCE on all other
images becomes undefined.

Example. If the number of images is two and SOURCE is the array [1, 5,
3] on one image and [4, 1, 6] on the other image, the value of RESULT
after executing the statement CALL CO_SUM(SOURCE, RESULT) is [5,6,9].

-------------------------------------------------------------------------

Proposal 2. Add atomic compare-and-swap and other atomic subroutines
Suggested by: Bill Long. 
Summary
Several people external to WG5, and several members of the committee,
have proposed that, in addition to ATOMIC_DEFINE and ATOMIC_REF, it
would be very useful to add an atomic read-modify-write intrinsic. The
most basic of those, in both theoretical work and practical
implementations, is the atomic compare-and-swap (CAS) intrinsic.
The basic operation of this intrinsic is:
   atomic_cas (atom, old, compare, new)
which performs atomically:
   old = atom
   if (old == compare) atom  = new

The following further atomic subroutines are suggested:
atomic_add
atomic_fadd
atomic_and
atomic_fand
atomic_or
atomic_for
atomic_xor
atomic_fxor
where the 'f' versions are the "fetch_and_" versions of the ones 
without the 'f'.  These have existed (with different spelling) in the Cray 
coarray implementation from the beginning due to specific customer 
demands.  All take integer arguments.  Having standardized and portable 
names would be good. 

Technical Details

 ATOMIC_CAS (ATOM, OLD, COMPARE, NEW)

 Description. Conditionally swap values atomically.

 Class.  Atomic subroutine.

 Arguments.

 ATOM shall be scalar and of type integer with kind ATOMIC_INT_KIND or
           of type logical with kind ATOMIC_LOGICAL_KIND, where
           ATOMIC_INT_KIND and ATOMIC_LOGICAL_KIND are the named
           constants in the intrinsic module ISO_FORTRAN_ENV. It is an
           INTENT (INOUT) argument. If the value of ATOM is equal to
           the value of COMPARE, ATOM becomes defined with the value
           of INT (NEW, ATOMIC_INT_KIND) if it is of type integer, and
           with the value of NEW if it is of type logical.

 OLD shall be scalar and of the same type as ATOM. It is an INTENT
           (OUT) argument. It becomes defined with the value of INT
           (ATOMC, KIND (OLD)) if ATOM is of type integer, and the
           value of ATOMC if ATOM is of type logical, where ATOMC has
           the same type and KIND as ATOM and has the value of ATOM
           used for the compare operation.

 COMPARE  shall be scalar and of the same type and kind as ATOM.
          It is an INTENT(IN) argument.

 NEW      shall be scalar and of the same type as ATOM. It is an
          INTENT(IN) argument.


 Example. CALL ATOMIC_CAS(I[3], OLD, Z, 1) causes I on image 3 to
          become defined with the value 1 if its value is that of Z,
          and OLD to become defined with the value of I on image 3
          prior to the comparison.
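
To show why compare-and-swap is regarded as the most basic read-modify-write
operation, the following Python sketch models ATOMIC_CAS and then builds a
fetch-and-add (compare the suggested atomic_fadd) from a CAS retry loop. It is
only a model: threads stand in for images, a lock stands in for hardware
atomicity, and the names AtomicInt and fetch_add are hypothetical.

```python
import threading


class AtomicInt:
    """Model of an integer ATOM; the lock makes each operation indivisible."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def ref(self):
        """Model of an atomic reference to ATOM."""
        with self._lock:
            return self._value

    def cas(self, compare, new):
        """Model of ATOMIC_CAS: return OLD; store NEW only if OLD == COMPARE."""
        with self._lock:
            old = self._value
            if old == compare:
                self._value = new
            return old


def fetch_add(atom, increment):
    """Fetch-and-add built from CAS: retry until no other thread intervened."""
    while True:
        expected = atom.ref()
        if atom.cas(expected, expected + increment) == expected:
            return expected


counter = AtomicInt(0)


def worker():
    for _ in range(1000):
        fetch_add(counter, 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.ref())  # 4000: no increment is lost
```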

------------------------------------------------------------------------

Proposal 3. Proposals from Bob Numrich with comments
Suggested by: Bob Numrich 

a. The intrinsic function this_image()

The function this_image should allow a scalar return value for coarray 
arguments with just one codimension:

integer :: me
real    :: x[*]

me = this_image(x)

Internally the function may continue to think it is returning an array of 
length one, but the programmer should not be penalized for that.  Let the 
value on the left side of the assignment statement be a scalar.  At most, 
issue a warning at compile time.  I hit this problem every time I write 
new code.  It is embarrassing trying to explain it to a new coarray 
programmer.
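
For readers unfamiliar with the issue: with a coarray argument, this_image
returns the cosubscripts of the executing image as a rank-one array, one entry
per codimension. The Python sketch below models that mapping (column-major, as
for ordinary Fortran subscripts). The function name and its arguments are
assumptions made for illustration. With a single codimension the result has
length one, which is the case this proposal wants to be assignable to a scalar.

```python
def this_image_cosubscripts(image, coshape, lower_cobounds=None):
    """Model of THIS_IMAGE(COARRAY) for the image with index `image` (1-based).

    coshape holds the extents of every codimension except the last, which
    is '*' in the declaration; x[*] is coshape=[] and x[2,*] is coshape=[2].
    """
    lower = lower_cobounds or [1] * (len(coshape) + 1)
    remaining = image - 1
    subs = []
    for extent, lo in zip(coshape, lower):
        subs.append(lo + remaining % extent)
        remaining //= extent
    # The final codimension absorbs whatever is left.
    subs.append(lower[len(coshape)] + remaining)
    return subs


print(this_image_cosubscripts(5, []))    # x[*] on image 5   -> [5]
print(this_image_cosubscripts(3, [2]))   # x[2,*] on image 3 -> [1, 2]
```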

b. Remove restrictions on derived types with coarray components

Remove most, if not all, the restrictions on derived types with coarray 
components. I can't remember all the restrictions, but I think there are 
lots of them. For those restrictions that absolutely cannot be removed, 
provide a clear explanation of why.

In particular, remove the restriction that a child type can add a coarray 
component only if its parent has a coarray component. It messes up 
inheritance by, for example, forcing every abstract type to contain a dummy 
coarray component just in case somebody wants to extend it by adding a 
coarray component, which will often be the reason it is being extended.

c. Alternative sync statement

There only needs to be one sync() statement with different arguments:

    integer :: list(:)

    sync() or sync   ! sync with all images; behaves just like sync all

    sync(list)       ! sync with images in list(:); 
                     ! behaves like sync images(list)

    sync(memory)     ! local memory sync; behaves like sync memory

Existing sync statements remain valid if the programmer wants to use them.  
See proposal f below for team sync.

d.  Functions with side effects

Programmers may write functions with side effects, such as internal syncs 
or allocation of coarrays, but they have no guarantee that they will be 
executed in the same order as listed on the program statement or executed 
at all.  The ordering of segments assumed by the programmer is therefore 
broken.  Functions with this kind of side effect need a new attribute that 
requires such functions to be executed and executed in the order written 
in the program statement. The attribute IMPURE would be a natural choice 
had it not already been used to mean something else.

Functions should be allowed to return objects with coarray components.
Constructors require this capability.  A workaround using an overloaded 
assignment statement is very awkward and frankly embarrassing.

e.  Collectives

Collectives should be part of a support library, not part of the language.

If we make them part of the language, we need to be very careful how we 
define them.  The UPC people have been arguing about them for years.  
Duplicating MPI collectives should not be the goal.  If the coarray model 
is compatible with MPI, why not just use the MPI collectives?

If we do include them, collectives should be functions (with side effects 
as in proposal d):

  s = co_sum(x)

They should be simple, mimicking the normal functions

  s = sum(x)

No long list of arguments please.  See proposal f for collectives within 
teams.

Every image must invoke the function, and every image gets the result on 
return from the function.  The argument x need not be a coarray.  Since 
they are collective, they imply a segment boundary upon entry.  Each image 
can return from the function independently as soon as it receives the 
value of the result.

I would like to see some evidence that nonblocking collectives really make 
a difference in the overall performance of a real application, not just a 
kernel.

f.  Teams

Remove Teams completely from the proposed extensions.

The ability to couple two coarray codes already exists using MPI 
intercommunicators or one of many frameworks out there.  These frameworks 
just need to allow coarray codes as components.

Teams are a very big addition to the language, and we should hasten slowly. 
A coarray code should be the same whether it is run alone or run as a team 
coupled with another coarray code.  With the current definition of teams 
this is not true.  Both codes will need to be altered to run as teams.  
Dereferencing codimensions relative to a team is a very big problem.  
Leaving it up to the programmer is very, very difficult and very error 
prone.  All coarray references will need to be changed so they are relative 
to the team.  Symmetric memory will be broken if we allow allocation within 
teams. We should not do teams.

If we go ahead with teams, somebody has to figure out how codimensions are 
dereferenced relative to a team.  Otherwise, teams are pretty useless.

If a coarray is declared in code only executed by a team, is the coarray 
visible across all images or just within the team?

If we allow allocation within a team, the allocate should be a method 
associated with the team object:

Type(Team_Object) :: myTeam
real,allocatable  :: x[:,:]

stat = myTeam%allocate(x[p,*])

This allocate implies a sync within the team. A coarray allocated this way 
could be given a state that includes information on how to dereference 
codimensions. Do image indices then start with one or start with the first 
image in the team?  How do we deal with asymmetric heaps?  Are coarrays 
allocated by one team visible to other teams?  How?

Collective functions within a team should be associated with team objects:

    Type(Team_Object) :: myTeam

    s = myTeam%sum(x)

Synchronization within a team should be a procedure associated with the 
team object:

        Type(Team_Object)  :: myTeam
        myTeam%sync()        ! sync with images in myTeam
        myTeam%sync(list)    ! sync with a subset of images in myTeam

Other associated functions:

       myTeam%isMyTeam()
       myTeam%myTeamIndex()
       myTeam%teamList()

But I repeat, we should not add teams.

g.  Notify/Query

We should hasten slowly with these statements.  The current definition is 
probably wrong. There probably needs to be some sort of tag associated 
with these statements making them look more like events.

h. Locks

Add some logical functions associated with lock variables:

type(Lock_Object) :: lck[*]

if(lck%isMyLock()) then
  unlock(lck)
end if

Otherwise, all images must attempt to unlock the variable and deal with 
an error code.  In the same way, the function IsLocked() determines if a 
lock is already locked:

if(.not. lck%isLocked()) then
  lock(lck)
end if

This function also allows spinning on a lock until it is free.

i.  MPI, OpenMP, UPC, CUDA compatibility

Is the coarray model compatible with other programming models?

 
-------------------------------------------------------------------------

Proposal 4. Add coscalars
Suggested by: Reinhold Bader 
Summary
Add "coscalars". A coscalar exists on a single image and is referenced
by appending []. It may be a scalar or an array. It may be a pointer,
allocatable, or neither. When a pointer or allocatable coscalar is allocated, 
the programmer can choose the host image; otherwise, the host image is 
processor dependent and does not change during the lifetime of the coscalar. 
The main application is for program-wide linked lists that are modified 
rarely, but accessed frequently. There are advantages in having coscalar locks.

Technical Details

1. Introduction:
~~~~~~~~~~~~~~~~
 
The intent of this paper is to bring forward arguments in favour of
including unsymmetric shared entities (denoted as "coscalars") as part 
of the Technical Report on Enhanced Parallel Computing Facilities. 
An informal description of the desired features is provided, which
attempts to obey the following constraints:

* coscalar functionality is kept as orthogonal as possible to 
  coarrays. Having few interactions between the two features 
  should minimize the implementation effort.
* coscalar syntax and semantics follow the design principles for 
  coarrays as far as is possible with respect to visual indication 
  of communication and with respect to synchronization semantics.

A key feature is that a subobject of a derived type entity may be
hosted on an image different from that hosting the parent object.
This is achieved by introducing coscalar
pointers to coscalars; see sections 5 and 6.1 for details.

The suggested language elements are analogous to shared scalar 
entities and shared pointers to shared entities in UPC, however 
with additional provisions and restrictions to increase safety 
of use, as well as suggestions for performance tuning.


2. Complex data structures and their scalability limitations:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

The main scenario targeted by the feature described in this paper is 
the case of general, program-wide data structures like lists, deques, 
binary trees, oct-trees etc. which are modified once or only rarely, 
but traversed and referenced often throughout the execution of the
program. For example, MPI codes which need to perform dynamic load 
balancing during program execution typically implement this kind of
concept manually (and with great programming effort). 

While it is possible to implement such concepts using the coarray
facilities defined in the base language (for example, by using
allocatable components of a coarray of suitable derived type), this is 
still significantly more complex to program and maintain than 
e.g., an OpenMP tasking code, and furthermore may require repeated
reallocation of coarrays if the size of the data structure is not 
a priori known, thereby incurring repeated program-wide 
synchronization and hence scalability issues. 

For the scenario indicated above, the arguments against the use
of shared pointers become less relevant since 
* the workload should typically be considerably larger 
  than the latency for accessing a shared pointer.
* the double latency incurred for dereferencing a shared pointer
  as well as accessing its target may be reduced by caching the
  descriptor information on all images requiring this information
  during synchronization phases.


3. Coscalar declaration, definition and reference:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A coscalar entity is declared in one of the following ways: 

real, codimension[] :: cs1    ! real scalar coscalar
real :: cs2[]                 ! alternative declaration
real, target :: ca1(ndim)[]   ! real array coscalar

Syntactically, the difference between coscalar and coarray is that
a coscalar does not specify a coshape; the corresponding semantics
is that the declared entity is shared between all images.
Notwithstanding, there is one image which is considered the "hosting
image" of the coscalar. The hosting image can be identified via the 
image_index() intrinsic, thereby allowing the programmer
to tune code for efficiency of access. For a statically declared
coscalar the hosting image is processor-dependent; it is 
the same image throughout the coscalar's existence. 

Every definition or reference of a coscalar requires specification 
of the square brackets:

x = cs1[]
cs2[] = a[4]

This maintains the visual indication of the occurrence of 
communication already known from coarrays; for the purpose of
notation it is assumed that it is the exception rather than the
rule that the hosting image references or defines a coscalar.
On the other hand, an implementation may generate multiple 
code versions depending on whether the access is or is not
local, thereby assuring improved speed for local accesses.

The sequence of definitions and references of coscalars follows 
the same synchronization rules as coarrays do. For example, in

if (this_image() == 1) then
   cs1[] = ...
   sync images (*)
else
   sync images(1) 
   x = cs1[]
end if

the SYNC IMAGES statements are required to prevent a data race between
the single image which defines cs1 and all the others which reference
it. 
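
The ordering that the SYNC IMAGES statements enforce can be imitated with
threads. This is a Python sketch, not Fortran: each thread plays one image, a
dictionary plays the coscalar, and a barrier stands in for the SYNC IMAGES
pairing. A barrier is stronger than needed, since it also synchronizes the
reading images with each other. The names and the value 42 are illustrative
assumptions.

```python
import threading

NUM_IMAGES = 4
cs1 = {"value": None}                     # plays the role of the coscalar cs1
barrier = threading.Barrier(NUM_IMAGES)   # stands in for the SYNC IMAGES pairing
seen = [None] * NUM_IMAGES                # what each image read into x


def image(me):
    if me == 1:
        cs1["value"] = 42                 # cs1[] = ...
        barrier.wait()                    # sync images (*)
    else:
        barrier.wait()                    # sync images (1)
        seen[me - 1] = cs1["value"]       # x = cs1[]


threads = [threading.Thread(target=image, args=(i,))
           for i in range(1, NUM_IMAGES + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen)  # [None, 42, 42, 42]
```

Without the barrier, a reading thread could run before image 1 has stored the
value, which is the data race described above.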


4. Allocatable coscalars:
~~~~~~~~~~~~~~~~~~~~~~~~~

To enable control of memory locality by the programmer, a coscalar
with the allocatable attribute can be allocated on a specific image:

real, allocatable, codimension[] :: cs3 
:
allocate(cs3, image=4)

where the IMAGE argument to the ALLOCATE statement is obligatory; on 
images other than the one specified the statement will have no effect
(and it is of course a violation of the synchronization rules if
 two images attempt to allocate an unallocated entity in unordered 
 segments). 
A subsequent call to this_image(cs3) or allocated(cs3) on any image
in a segment executed after the one during which the allocation is
performed will return the values 4 and .TRUE., respectively. 
It is required that the hosting image perform the deallocation:

deallocate(cs3)

Executing this statement on images other than that hosting
the coscalar will not have any effect. If applied to coscalars
(and local entities) only, neither the ALLOCATE nor the DEALLOCATE
statements will perform any synchronization. This improves 
scalability especially if only small subsets of images 
(or only teams) need to access the coscalar.
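
The allocation rules of this section can be summarized in a small state model.
The Python class below is a sketch under assumptions, not proposed syntax: the
class name, the explicit executing_image argument (which models "the image that
executes the statement"), and the error raised on double allocation are all
hypothetical.

```python
class AllocatableCoscalar:
    """Model of an allocatable coscalar and its hosting image."""

    def __init__(self):
        self.host = None   # hosting image; None while unallocated
        self.value = None

    def allocate(self, executing_image, image):
        """ALLOCATE(cs, IMAGE=image) as executed on executing_image."""
        if executing_image != image:
            return         # no effect on images other than the one specified
        if self.host is not None:
            raise RuntimeError("coscalar is already allocated")
        self.host = image

    def deallocate(self, executing_image):
        """DEALLOCATE(cs): only the hosting image actually deallocates."""
        if executing_image != self.host:
            return
        self.host = None
        self.value = None

    def allocated(self):
        return self.host is not None


cs3 = AllocatableCoscalar()
for me in range(1, 5):                # every image executes the same ALLOCATE
    cs3.allocate(me, image=4)
print(cs3.allocated(), cs3.host)      # True 4
```

Neither method synchronizes anything, which mirrors the statement that
ALLOCATE and DEALLOCATE applied only to coscalars imply no synchronization.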

The ALLOCATED intrinsic may also be used on images which do not 
host the coscalar; it is atomic in that it may
be executed in a segment unordered with respect to that performing
the allocation or deallocation; however it is the programmer's
responsibility to properly deal with race conditions which
may result from such a use, especially in the case of 
deallocation. 


5. Pointers to coscalars:
~~~~~~~~~~~~~~~~~~~~~~~~~

A coscalar pointer to a shared entity is declared by specifying
the pointer attribute for a coscalar:

real, pointer :: cp(:)[]
type(team_array), pointer, codimension[] :: tp(:)

Such an entity is itself a coscalar (with a processor dependent
hosting image unless it is a type component), and 
it may be pointer associated with a shared entity with the 
target attribute:

if (this_image() == R) cp[] => ca1(:)

Since image R may be distinct from either the image hosting the
coscalar pointer, or the image hosting its target, the above
pointer assignment statement will involve up to three images;
note that a similar situation also can occur when using regular
assignment with differently coindexed objects on both sides of
the assignment. Image control statements to perform synchronization
prior to subsequent references or definitions are required only
with respect to the image R executing the pointer assignment. Also, it
would be allowed to define the target in segments unordered with
respect to the one executing the above pointer assignment, 
since only transfer of a descriptor is required (for this reason
the square brackets are omitted from the right hand side); 
in this case synchronization would need to include the image
performing such a definition. 

Subsequent references to cp[] then go to the target:

x(3) = cp(3)[]  ! same as x(3) = ca1(3)[] 

(One could consider also allowing coindexed objects as targets.)

It is possible to dynamically allocate the target on a single image: 

allocate(tp(num_images()), image=1)

As with allocatable entities, deallocation later must be performed 
on the hosting image.

The NULL() and NULLIFY() intrinsics are also available for coscalar
pointers; these may also be used on images which do not host the
pointer or its target; if they do so they must be called in a
segment ordered with respect to any segment changing the association
or definition status of the pointer. 
The ASSOCIATED() intrinsic, similar to ALLOCATED(), is atomic; in 
its two argument form both arguments must be coscalars if one is. 
The target's hosting image may be determined using the IMAGE_INDEX()
intrinsic on the coscalar pointer associated with it. Finally, just
as for regular pointers, it is possible to specify the CONTIGUOUS 
attribute for a coscalar pointer, in which case its target must be
simply contiguous.


6. Derived types:
~~~~~~~~~~~~~~~~~

The desired properties of shared general data structures rest on the 
possibility to define coscalar subobjects which may be hosted on 
an image different from that hosting the parent data object. 

A number of restrictions are required to ensure that no remote
allocations or deallocations are needed wherever dynamic type
components are involved.

Combinations of coscalars and coarrays are disallowed in the 
derived type context, i.e. a coarray may not have coscalar type
components, and a coscalar may not have coarray type components. 

6.1 Distributed structures
~~~~~~~~~~~~~~~~~~~~~~~~~~

A coscalar may appear as a type component provided it has the POINTER
attribute. This allows for a directory-like programming style:

type :: team_array
  real, pointer, contiguous :: x(:)[]
end type

type(team_array), allocatable :: o(:)[]

allocate(o(num_images()), image=myteam_first_image)
sync team (myteam)
if (member_of(myteam)) allocate(o(this_image())%x(localsize), &
                                image=this_image())
sync team (myteam)   ! synchronize across team only

After the SYNC TEAM, an image in the team may now define 

o(any_other_team_index)[]%x(:)[] = ...

where the subobject is hosted on image any_other_team_index (which may be
an image other than that allocating the entity o). 

The image hosting the coscalar pointer component (not its target!)
is the image hosting the parent object. 

For simplicity of implementation it is suggested that the parent 
object of such a type is required to be a coscalar.


6.2 local (dynamic) type components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Given a type definition with (regular) pointer or allocatable components, 
it is possible to declare coscalars of such a type; any subobject of 
such an entity is also a coscalar, and its allocation or association
status may only be changed on the image hosting its parent object:

type :: ptr_type
  real, pointer :: x(:)
end type

type(ptr_type) :: o[]
real, target :: y(5)

if (image_index(o) == this_image()) then
  y = ...
  o[]%x => y
end if
sync all
y = o[]%x  ! scatter

For pointer components this ensures that pointer association with
a local object is well defined.


type :: alloc_type
  real, allocatable :: x(:)
end type

type(alloc_type) :: a[]

allocate(a%x(5), image=image_index(a)) ! a%x is a coscalar
sync all
if (this_image() == 1) then
  a[]%x(:) = ...
end if

For allocatable components this ensures that no remote deallocation
is required when the object goes out of scope.


6.3 polymorphic entities/subobjects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The case of either polymorphic coscalars, or polymorphic components
of coscalars presumably will require restrictions similar to those
for the coarray case. Details of these will need to be figured out.


7. Tuning considerations
~~~~~~~~~~~~~~~~~~~~~~~~

The envisaged main usage scenario for coscalars is the situation 
where an object is generated once and then referenced multiple times on 
some or all images (one or few writes, many reads). This section
contains some thoughts on how tuning of code using coscalars could be 
performed.

7.1 Caching
~~~~~~~~~~~

Given this usage pattern, a coscalar always has a hosting image, but an
implementation may choose to cache the entity or parts of an entity on
(some) other images.
If so, some garbage collection scheme will be required to dispose of 
the cached copies once the coscalar is deallocated or leaves scope. 

The manner in which caching is performed will be processor dependent, 
but the expectation will be that a high quality implementation will
perform caching
* on coscalar pointers to reduce access latency
* on sufficiently small items

One could consider providing an additional collective intrinsic
to enforce caching. Otherwise, implementation-dependent caching
would be controlled by execution of an image control statement. 
Finally, especially for the case in which locality control is exerted
by the programmer (see below), or if it is known that an entity 
requires a large amount of memory resources (perhaps only available
to a subset of images), an attribute can be specified to 
suppress caching:

type(alloc_type), uncached :: a[]


7.2 Locality control
~~~~~~~~~~~~~~~~~~~~

As a convenience for implementation of load-balancing algorithms, 
a statement for changing the hosting image of an allocatable
coscalar or a coscalar pointer target is provided:

relocate (a, image=4[, team=...][, sync='YES|NO'])

would change the hosting image of a to 4. This statement must be
executed collectively by all images (of a team). If a team argument 
is present, the specified image as well as the image hosting the
entity to be relocated must be a member of the team.  
The statement implies synchronization of all images executing it
unless the SYNC argument is specified with a value of 'NO'.
If this is the case, it is the programmer's responsibility to
insert synchronization statements before subsequent 
references/definitions of the entity. 

Relocation only applies to the parent object in case the object 
has coscalar pointer components; the latter's targets remain on 
their hosting images. 

Regular pointer components of a relocated coscalar become undefined, 
and execution of a relocate statement may in this case induce a 
memory leak. 

An allocatable component of a relocated coscalar is reallocated
on the new hosting image, and its content is transferred to 
the relocated entity.
 

7.3 Note on symmetric heaps
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In many cases, the intent for distributed structures is to achieve
a balanced filling of the memory across images. Hence, an 
implementation might be able to use a symmetric heap even in 
this case, and allocate this in moderately large blocks, with 
an additional level of indirection for accessing the data items
in the structure. For such implementations, a compiler directive
allowing the programmer to indicate a suitable block size might 
be useful. 


8. Coscalars and subprograms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

8.1 subprogram-local coscalars
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similar to coarrays, coscalars declared inside subprograms are
required to have the SAVE attribute, and automatic (array) coscalars
are not allowed, since the need to have a well-defined hosting image
would imply a need for synchronization. 

However, coscalars may be locally declared in a subprogram
without SAVE if they are allocatable or have the pointer attribute; 
it is then the programmer's responsibility to ensure valid accesses by
performing allocation, deallocation and inserting image control statements.
An allocatable coscalar is automatically deallocated once the image 
hosting it completes execution of the subprogram.

8.2 coscalar dummy arguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A subprogram dummy argument may be a coscalar, in which case
the actual argument must be a coscalar (the latter is then provided
without the angular brackets). Similar to the coarray case, restrictions
are in place which ensure that no copy-in/out occurs. The dummy argument's 
hosting image is the same as that of the actual argument.

If a coscalar dummy argument has the POINTER or ALLOCATABLE attribute,
the actual argument must be a coscalar with the same attribute.

8.3 Coscalar actual arguments matching a non-coscalar dummy argument
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the dummy argument is not a coscalar, the actual argument may
be a coscalar anyway, in which case typically copy-in/out will
be required. In this case the same additional synchronization rule
applies for modifiable arguments as for the corresponding case
of a coindexed actual argument. The actual argument must specify
the angular brackets. 

8.4 Generic interfaces
~~~~~~~~~~~~~~~~~~~~~~

Similar to coarrays, no generic disambiguation is possible with 
respect to coscalar arguments.


9. Application to locks, teams and atomic procedures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the present standard, lock variables are required to be coarrays, 
which may lead to misinterpretations (typically, only a subset of
the num_images() lock variables of a scalar coarray are actually required). 
Using coscalars, use of locks as well as the team abstraction could be
handled more elegantly:

type(lock_type) :: my_lock[]
type(image_team) :: my_team[]
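
For comparison, the status quo described above might look like the
following minimal sketch in standard Fortran 2008. The lock has to be
declared as a scalar coarray, so every image owns an instance although
only the one on image 1 is ever used. The program name and the counter
variable are made up for illustration only.

```fortran
program lock_status_quo
  ! Fortran 2008 formulation: the lock must be a coarray, so there are
  ! num_images() lock instances, of which only the one on image 1 is used.
  use, intrinsic :: iso_fortran_env, only: lock_type
  implicit none
  type(lock_type) :: my_lock[*]
  integer :: counter[*]              ! illustrative shared counter

  counter = 0
  sync all                           ! counter[1] is initialized before any update

  lock(my_lock[1])                   ! critical region guarded by image 1's lock
  counter[1] = counter[1] + 1
  unlock(my_lock[1])

  sync all                           ! all increments are complete
  if (this_image() == 1) then
    if (counter /= num_images()) error stop "lost update"
    print *, "counter on image 1:", counter
  end if
end program lock_status_quo
```

With a coscalar lock as proposed, the unused instances on the other
images would not need to exist at all.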

Furthermore, this also facilitates using locks as components of 
data structures:

type :: container
  type(lock_type) :: lk
  type(data) :: protected_stuff
end type

in which case any entity x of type container must be a coscalar, so 
that x[]%lk also is a coscalar. 

Similarly, atomic subroutines should be extended to allow scalar coscalars
as arguments.

Requiring teams to be coscalars instead of regular local entities
should provide advantages to both implementors and programmers:
* for good scalability, teams can internally make use of coscalar 
  pointer components especially in the case of large image counts
* handling teams is much more transparent and intuitive if they're
  coscalars; the usage pattern (write once, use often) fits perfectly, 
  and if cross-team communication should be supported, say

  with team(t1)
    a[i] = b[j]@t2
  end with team

  where t1 and t2 are teams not sharing any image, the shared semantics
  allows one to access team information across team boundaries, something
  not provided by the present draft.


10. An example: binary tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider the following type definition:

type :: tree
  type(lock_type) :: lk
  type(content) :: entry         
! entities of type content have < and possibly assignment overloaded
  logical :: defined = .false.
  type(tree), pointer :: left[] => null()
  type(tree), pointer :: right[] => null()
end type


Concurrent population of such an entity might be performed using
the following subprogram, which must be called with the same
"this" argument on each image:

recursive subroutine insert(this, stuff)
  type(tree), intent(inout) :: this[]  ! must be a coscalar
                                       ! since we hand coscalars in
                                       ! and so that this[]%lk is a coscalar
  type(content), intent(in) :: stuff  
  lock(this%lk)
  if (this[]%defined) then
    unlock(this%lk)
    if (this[]%entry > stuff) then
      call insert(this%left, stuff)
    else
      call insert(this%right, stuff)
    end if
  else
! stuff goes to an entry possibly hosted by another image ...
    this[]%entry = stuff
    this[]%defined = .true.
! ... but I get to host the siblings ...
    allocate(this%left, image=this_image())
    allocate(this%right, image=this_image())
! caching of this[]%left and this[]%right to the new target is 
! probably a good idea
    unlock(this%lk)
  end if
end subroutine insert


After populating the data structure the workload can be processed via

recursive subroutine traverse(this, p)
  type(tree), intent(inout) :: this[]
  type(params), intent(in) :: p
! uses a subroutine operation() with non-coscalar dummy arguments to modify entries.
  
  if (this[]%defined) then
    if (image_index(this) == this_image()) call operation(this%entry, p) 
    call traverse(this%left, p)
    call traverse(this%right, p)
  end if
end subroutine
  
Note that if the calls to traverse() occur in segments ordered 
with respect to the ones calling insert(), no race conditions occur.
Since each image only performs computation on the part of the tree 
hosted by it, traverse() should scale well if operation() is 
sufficiently expensive compared to the coscalar pointer lookup. 
For complete processing, traverse() must be called by all images
which previously called insert(). It is not required that insert() 
be executed by all images.


Acknowledgement: 
~~~~~~~~~~~~~~~~

Apart from the conceptual derivation from UPC, the basic ideas 
presented here are a subset of those in John Mellor-Crummey's papers 
on his "CAF 2.0" vision; some modifications were made to 
improve the integration with the language, as well as to enable
the programmer to perform optimization through locality control.


Comment from Jim Xia
I don't like this name.  It's very confusing as people might think 
this refers to a coarray being scalar.  So how about a new attribute 
SINGLE or SHARED?  I know SHARED is going to be confusing as well 
to people who are familiar with UPC.
Reply from Reinhold
In choosing this, I started from the assumption that co- always refers to
something "shared" or "sharable". Since coarrays have a corank, it seemed
quite natural to call a corank zero entity a coscalar.

-------------------------------------------------------------------------

Proposal 5. Allow asynchronous execution of the collectives
Suggested by: Reinhold Bader
Allow asynchronous execution of the collectives. This would be redundant
if proposal 1 is adopted. 

-------------------------------------------------------------------------

Proposal 6. Reconsider the handling of coarrays in a team
Suggested by: Reinhold Bader
Reconsider the handling of coarrays in a team. Desirable features are 
allocation within a team and a construct that establishes an execution 
context for a team.

-------------------------------------------------------------------------

Proposal 7. Add coarray pointers
Suggested by: Jim Xia 
Add coarray pointers, requiring that the target of the pointer be a local 
coarray.

The primary motivation for this item is to allow coarrays to be used in a 
function result.  One example is to allow a derived type with allocatable 
coarray components to be used as a target to be associated with a pointer.  
It seems allowing the POINTER attribute on coarrays is a reasonable solution.

I consider this proposal to comprise two separate parts.  The first part 
is to allow a derived type with allocatable coarray components to be used 
as a target to be associated with a pointer.  The following is the original 
example when I began to think of allowing pointer coarrays:

From a user point of view, I'd like to allow the following practice

TYPE global_field
    REAL, allocatable :: f(:)[:]
END TYPE

TYPE my_field_type
    type(global_field), pointer :: global => null()
    REAL, allocatable :: local(:)
...
   ! type bound operations
END TYPE

Where my_field_type stores a local copy of global field and can be updated 
frequently (e.g. intermediate computational results etc).  The global field 
(as a coarray) is only updated whenever there is a need.  The type bound 
operations can be functions returning objects of this data type as long as 
there is no update on the global field (i.e. there is no violation of 
segmentation rules).  Note this can also be used as a strategy to re-mesh 
the global field when it is required.  The remeshing is encapsulated by 
my_field_type to hide the information from users (e.g. when to do the 
global field update).  This declaration, however, is currently not allowed.
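
To make the restriction concrete: in Fortran 2008 an object whose type
has a coarray component has to be a non-pointer, non-allocatable scalar
and cannot be a function result, which is what rules out the pointer
component above. The following sketch shows what is permitted today,
keeping the global field in a module variable. The names field_mod,
the_field and setup_field are assumptions chosen for illustration.

```fortran
module field_mod
  implicit none
  type :: global_field
    real, allocatable :: f(:)[:]       ! allocatable coarray component
  end type global_field
  ! Allowed today: a non-pointer, non-allocatable scalar of this type.
  type(global_field), save :: the_field
contains
  subroutine setup_field(m)
    integer, intent(in) :: m           ! must have the same value on every image
    allocate(the_field%f(m)[*])        ! collective; implies synchronization
    the_field%f = real(this_image())
    sync all                           ! definitions complete before remote reads
  end subroutine setup_field
end module field_mod

program field_status_quo
  use field_mod
  implicit none
  integer :: right
  call setup_field(4)
  right = merge(1, this_image() + 1, this_image() == num_images())
  if (the_field%f(1)[right] /= real(right)) error stop "unexpected remote value"
  sync all                             ! all reads done before termination
  if (this_image() == 1) print *, "global field set up on", num_images(), "images"
end program field_status_quo
```

The price of this formulation is that the global field cannot be
encapsulated inside my_field_type, which is the point of the proposal.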


The second part is the coarray pointers: I'd like to suggest the following 
syntax

REAL, POINTER :: X(:)[:]

X can be allocated, or be associated with another coarray target.

Allocating X is the same as allocating an allocatable coarray.

ALLOCATE (X(M)[*])

ALLOCATE and DEALLOCATE of X are considered collective operations and 
the same synchronizations as for allocatable coarrays apply here.

X can also be associated with a coarray target, as in

X => Y

where Y is required to be a coarray target.  In concept, each image has 
its own X associated with a target of its own Y, so there shouldn't be 
any problems. 

-------------------------------------------------------------------------

Proposal 8. Allow asymmetric allocatable and pointer objects
Suggested by: Bill Long 
Allow asymmetric allocatable and pointer objects, declared with deferred 
shape and explicit coshape, e.g.
     REAL, ALLOCATABLE :: A(:)[*]
This provides a mechanism for avoiding the artificial structure workaround
and gives users a way to create coarrays that are restricted to a team.
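
For readers unfamiliar with it, the "artificial structure workaround"
is presumably the Fortran 2008 idiom sketched below: a coarray of
derived type whose allocatable component is allocated locally, with a
size that may differ from image to image. The type name ragged and the
variable names are assumptions for illustration.

```fortran
program ragged_workaround
  ! A coarray of derived type with an allocatable component; each image
  ! allocates its own component, so the sizes need not agree.
  implicit none
  type :: ragged                      ! hypothetical wrapper type
    real, allocatable :: a(:)
  end type ragged
  type(ragged) :: r[*]
  integer :: me, left

  me = this_image()
  allocate(r%a(me))                   ! local, asymmetric allocation; no sync implied
  r%a = real(me)
  sync all                            ! make the components visible to other images

  left = merge(num_images(), me - 1, me == 1)
  ! On image "left" the component has exactly "left" elements.
  if (r[left]%a(left) /= real(left)) error stop "unexpected remote data"
  sync all                            ! keep components alive until all reads finish
  if (me == 1) print *, "ragged components verified on", num_images(), "images"
end program ragged_workaround
```

Under the proposal the wrapper type would disappear and A itself would
be allocated with a different shape on each image.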

The downside is that you cannot call this thing an "allocatable coarray" 
without having significant side effects elsewhere in the standard. 
[This was a major reason the idea was dropped previously.] 
Basically, the object is an orphaned component, but there are no terms 
for that either.

-------------------------------------------------------------------------

Proposal 9. Suggestions for changes to the NOTIFY and QUERY statements
Suggested by: Reinhold Bader 

Introduction
~~~~~~~~~~~~

In their critique of the coarray features in the Fortran 2008 draft (J3/08-126), 
John Mellor-Crummey et al. specifically mention issues with the NOTIFY and
QUERY statements. This paper attempts to introduce changes to the feature
which remove these issues. 


1. Properties of image control statements 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both NOTIFY and QUERY are image control statements, but there are
circumstances under which execution of these statements should not
include the effect of a SYNC MEMORY statement:
* execution of a NOTIFY should not have the effect of a 
  SYNC MEMORY statement. Similar to LOCK and UNLOCK, a one-way
  ordering of the segments with respect to the target image 
  executing the corresponding QUERY should be sufficient.
* execution of a non-blocking QUERY statement with the resulting
  READY value being .FALSE. should have no influence on segment 
  ordering.

2. Notification Events
~~~~~~~~~~~~~~~~~~~~~~

The number of invocations N(M --> T) and Q(M <-- T) is not part
of the global state of the program, but always refers to a
notification event: N(M --> T, E), Q(M <-- T, E). Such an event
is an entity of a derived type EVENT_TYPE defined in the ISO_FORTRAN_ENV
intrinsic module, and such an entity - similar to a lock - must always
be a coarray (or, if coscalars make it into the TR, a coscalar).

The programmer must declare an event and use this as an argument to
both NOTIFY and QUERY, thereby assuring that existing notifications 
do not interfere with notifications and queries in library code, 
which would use distinct events.

Hence, the example from 10-166, NOTE 2.5 could be modified as follows


SUBROUTINE PROCESS(...)
  ... ! declarations
  TYPE(EVENT_TYPE), SAVE :: PROCESS_EVENT[]

  IF (THIS_IMAGE()==1) THEN
    DO I=1,100
       ... ! Primary processing of column I
       NOTIFY(2, EVENT=PROCESS_EVENT) ! Done with column I
    END DO
    SYNC IMAGES(2)
  ELSE IF (THIS_IMAGE()==2) THEN
    DO I=1,100
      QUERY(1, EVENT=PROCESS_EVENT)    
                ! Wait until image 1 is done with column I
       ... ! Secondary processing of column I
    END DO
    SYNC IMAGES(1)
  END IF
END SUBROUTINE PROCESS
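
For reference, the event facility that was later standardized (TS 18508,
then Fortran 2018) allows the same producer/consumer pattern to be
written as in the sketch below. It uses that later syntax (EVENT POST and
EVENT WAIT), not the NOTIFY/QUERY syntax proposed here. The names
column_done, column, producer and consumer are assumptions chosen for
illustration.

```fortran
program event_pipeline
  use, intrinsic :: iso_fortran_env, only: event_type
  implicit none
  integer, parameter :: ncol = 100
  type(event_type) :: column_done[*]     ! event variables must be coarrays
  integer :: column(ncol)[*]
  integer :: i, total, producer, consumer

  producer = 1
  consumer = min(2, num_images())        ! degenerates to image 1 in a one-image run
  total = 0

  if (this_image() == producer) then
    do i = 1, ncol
      column(i)[consumer] = i            ! "primary processing" of column i
      event post (column_done[consumer]) ! done with column i
    end do
  end if

  if (this_image() == consumer) then
    do i = 1, ncol
      event wait (column_done)           ! wait until one more column has arrived
      total = total + column(i)          ! "secondary processing" of column i
    end do
    if (total /= ncol*(ncol + 1)/2) error stop "column data lost"
    print *, "sum over columns:", total
  end if
end program event_pipeline
```

The standardized events count posts, so in this sketch the producer may
run several columns ahead of the consumer.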

3. Excess notifications
~~~~~~~~~~~~~~~~~~~~~~~

The excess of notifications over queries for a given event and a given 
pair of images should be limited to one. That is, while a program may 
complete with an excess of notifications, it would be disallowed to 
invoke a new N(M-->T,E) on an event while the corresponding query is 
still outstanding. Any situation where subsequent NOTIFY statements
(without interleaved queries) are required on the same image pair
can be treated by introducing multiple events, typically responsible
for protecting different coarray entities from unsynchronized access.


4. Using team arguments
~~~~~~~~~~~~~~~~~~~~~~~

For conciseness (and if teams make it into the TR), it should be allowed
to also use arguments of type IMAGE_TEAM instead of the image set in 
NOTIFY and QUERY statements.


5. Some final remarks
~~~~~~~~~~~~~~~~~~~~~

The NOTIFY and QUERY statements provide a more general load balancing
synchronization facility than the corresponding UPC construct. 
In UPC, the upc_notify and upc_wait statements are always collective; to 
avoid deadlocks it is not allowed to start a new notification while
a previous one is still open. In Fortran, apart from the possibility 
to perform NOTIFY and QUERY for arbitrary subsets of images, it is 
also possible to construct a split-phase barrier by executing NOTIFY
and QUERY with the same subset of images and that subset as 
image-set argument. By using different event variables, new 
notifications may be started before previous ones have completed
without incurring deadlocks. In particular, using a split-phase barrier
together with collective functions may provide improved performance if
the collectives do not enforce synchronization at entry.
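
The split-phase barrier described above can be sketched with the event
facility as later standardized in Fortran 2018 (EVENT POST / EVENT WAIT),
which is used here only because it is compilable today; it is not the
NOTIFY/QUERY syntax of this proposal. Each image announces its arrival
to every image, overlaps independent local work, and only then waits.
The names arrived, halo and local_work are assumptions for illustration.

```fortran
program split_phase
  use, intrinsic :: iso_fortran_env, only: event_type
  implicit none
  type(event_type) :: arrived[*]
  integer :: halo[*]
  integer :: j, local_work, right

  halo = this_image()                 ! data other images read after the barrier

  do j = 1, num_images()              ! phase 1: announce arrival on every image
    event post (arrived[j])
  end do

  local_work = 0                      ! independent local work overlaps the barrier
  do j = 1, 1000
    local_work = local_work + j
  end do

  ! phase 2: proceed only once all images (including this one) have arrived
  event wait (arrived, until_count=num_images())

  right = merge(1, this_image() + 1, this_image() == num_images())
  if (halo[right] /= right) error stop "read before neighbour arrived"
  if (this_image() == 1) print *, "split-phase barrier completed"
end program split_phase
```

Posting to every image costs a number of posts quadratic in the image
count; this keeps the sketch short and is not meant as a scalable
implementation.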


--------------080608060500020805070507--