From owner-sc22wg5+sc22wg5-dom8=www.open-std.org@open-std.org  Tue Apr  9 11:27:15 2013
Return-Path: <owner-sc22wg5+sc22wg5-dom8=www.open-std.org@open-std.org>
X-Original-To: sc22wg5-dom8
Delivered-To: sc22wg5-dom8@www.open-std.org
Received: by www.open-std.org (Postfix, from userid 521)
	id 83F1C3569B3; Tue,  9 Apr 2013 11:27:15 +0200 (CEST)
Delivered-To: sc22wg5@open-std.org
Received: from mk-filter-1-a-1.mail.uk.tiscali.com (mk-filter-1-a-1.mail.tiscali.co.uk [212.74.100.52])
	by www.open-std.org (Postfix) with ESMTP id 260DC35689D
	for <sc22wg5@open-std.org>; Tue,  9 Apr 2013 11:27:09 +0200 (CEST)
Received: from host-92-16-213-213.as13285.net (HELO [127.0.0.1]) ([92.16.213.213])
  by smtp.tiscali.co.uk with ESMTP; 09 Apr 2013 10:27:01 +0100
Message-ID: <5163DFB7.6040504@stfc.ac.uk>
Date: Tue, 09 Apr 2013 10:30:31 +0100
From: John Reid <John.Reid@stfc.ac.uk>
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0 SeaMonkey/2.16.2
MIME-Version: 1.0
To: sc22wg5@open-std.org
Subject: Result of WG5 ballot on first draft TS 18508, Additional Parallel
 Features in Fortran
References: <20130407094254.CB1A6356B54@www.open-std.org>
In-Reply-To: <20130407094254.CB1A6356B54@www.open-std.org>
Content-Type: multipart/mixed;
 boundary="------------050203040506080908050508"
Sender: owner-sc22wg5@open-std.org
Precedence: bulk

This is a multi-part message in MIME format.
--------------050203040506080908050508
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

WG5

Here is a draft of the result of our ballot. Please let me know by 
Friday if I have omitted your ballot or made any mistake in transcribing 
it.

I don't think anyone expected that the first draft would be acceptable 
for submission to SC22 and indeed it clearly is not. We need now to 
think about all the comments and hopefully produce a better version 
during the Delft meeting.

By the way, if you post a message to WG5, it is automatically copied to 
J3. Please don't explicitly copy to J3, because that means that those of 
us on J3 get two copies.

Best wishes,

John.

--------------050203040506080908050508
Content-Type: text/plain; charset=windows-1252;
 name="N1971-1.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="N1971-1.txt"

                                             ISO/IEC JTC1/SC22/WG5 N1971-1

             Result of the WG5 letter ballot on N1967

                         John Reid

N1968 asked this question


Please answer the following question "Is N1967 ready for forwarding to 
SC22 as the DTS?" in one of these ways. 

1) Yes.
2) Yes, but I recommend the following changes. 
3) No, for the following reasons.
4) Abstain.

The numbers of answers in each category were:
1 for 1) Yes (Whitlock). 
0 for 2) Yes, but I recommend the following changes 
9 for 3) No, for the following reasons (Bader, Chen, Cohen, Long, Maclaren, 
         Moene, Muxworthy, Reid, Snyder)
1 for 4) Abstain (Corbett)

The ballot has failed. J3 is requested to prepare a revised version that
takes the comments into account. 

Here are the responses in detail

Reinhold Bader

There are a number of design problems that must be fixed. These impact
internal consistency, usability, as well as performance of programs that 
use the new features described in the TS.

(A) Comments on N1967:
~~~~~~~~~~~~~~~~~~~~~~

(A.1) SYNC TEAM: 

Instead of adding this feature, I suggest deleting the words
", involving the synchronization of all members of the teams
 at the beginning and end of the construct." from T4 of N1930.
This would not only make addition of SYNC TEAM superfluous, but
also allow writing much more scalable programs, at the cost of
some redesign work on the teams feature. More detail is provided
in section (B.1), below.

(A.2) LOCAL_EVENT_TYPE:

I consider this idea to be good; in fact I think that this provides the 
foundation for adding certain asynchronous features without 
much additional effort. Some (well, tongue-in-cheek, because likely
exceeding WG5 mandate) suggestions along these lines are made further 
down.


(A.3) Resiliency:

This has lots of ramifications. Here are two gut reactions that pull in 
opposite directions:
* It is to be expected that on large-scale systems or cloud-like
  infrastructures failures will happen, so some facility to deal
  with this without terminating the program as a whole would be nice.
* After forcing parallelism down all programmers' throats, the HPC 
  industry now follows on with an even bigger toad: Using this feature 
  to build resilient programs will in general make them slower and more 
  resource-hungry; and designing resilient algorithms in many cases
  is as much work as restructuring for parallelization.
Quite apart from this there exists the issue of how to make failure a
well-defined concept. On balance I believe this feature should
be deferred to post-TS consideration. Some experience with actual
FT-MPI implementations (still quite a bit in the future) is needed, 
and HPC programmer feedback should be collected and evaluated 
before this stuff is developed, let alone shipped. I also fully admit
to a preference for fixing the scalability (and other) problems before
proceeding to this non-trivial task. A fast program is less likely 
to run into a hardware failure!


(B) N1967 / Teams:
~~~~~~~~~~~~~~~~~~

(B.1) Performance impact of CHANGE TEAM

    The biggest problem I see in the TS is the set of performance issues 
    resulting from the synchronization properties of the CHANGE TEAM 
    construct. Since synchronization is enforced across all images of 
    the (ancestor) team that invoke the construct, the following 
    program - a simple representative for a large number of possible 
    interesting scenarios - does not allow for overlapping of 
    communication and computation:

program data_feeder
  use, intrinsic :: iso_fortran_env
  implicit none
  type(team_type) :: role
  integer :: i, iter, m, id_role
! declarations for array b(:) and coarray a(:)[*]

! create three teams
  if (this_image() == 1) then
    id_role = 3                        ! master
  else
    id_role = 2 - mod(this_image(),2)  ! two slave teams
  end if
  form subteam ( id_role, role )
! iterate using the same team decomposition 
  do i=1, iter             
!   calculation phase on team IDs 1 and 2:
    change team ( role )
       select case (id_role)
       case (1)
         :                  ! do work on b(:) 
       case (2)
         :                  ! do different work on b(:)
         :                  ! (Statement X) - see discussion below
       end select
    end team

    if (this_image() == 1) then
! the statements inside this block with present semantics CANNOT
! be done concurrently with the execution of above CHANGE TEAM,
! potentially destroying the scalability of the program.
      :  ! prepare local data (could be done as case (3) above, 
      :  ! but in the general case one typically can't move this)
      do m=2, num_images()
         a(:)[m] = ...      ! push local data
      end do      
      sync images (*)
    else
      sync images (1)
      b(:) = a(:)
    end if
  end do                    ! next iteration uses updated b(:)
  : 
end program

  Such overlapping is considered essential for scalable implementation of  
  a very large class of parallel algorithms.

  If the CHANGE TEAM only had the effect of SYNC MEMORY, image 1 would 
  fall through the block, and would hence be able to perform data 
  transfers to other images concurrently with the calculation phase. 
  Such data transfers would be fine as long as the accesses to the 
  coarray "a" obey the usual rules (some words may be needed to indicate
  that the rules apply for all images of the initial team even
  across team execution context changes). Of course, the 
  SYNC IMAGES statements near the end of the code are necessary
  whatever the synchronization semantics of CHANGE TEAM are. 

  For the above scenario, it would also be sufficient to require 
  synchronization only within each of the three teams defined by the 
  decomposition stored in "role". However, for algorithms that are 
  strongly load imbalanced within each subteam, while not requiring
  allocation or deallocation of coarrays within the subteam, an
  only slightly reduced synchronization requirement may still lead to 
  significant performance degradation that cannot be worked around 
  by the programmer (e.g., via the use of events). The performance 
  impact will typically correlate with team size.

  Therefore, my strong preference is to retain consistency with the 
  loosely asynchronous image execution model on the level of teams by 
  only imposing the effect of SYNC MEMORY at entry and exit of a 
  CHANGE TEAM block, while still requiring that all parent team 
  images must execute the CHANGE TEAM. 

  The consequences of this change must of course be considered and 
  taken care of. As far as I can see, the following issues arise:

  (1) A subteam-allocated coarray must be implicitly deallocated 
      In this case, synchronisation must occur upon encountering the
      END TEAM statement, but only on any subteam that does
      a deallocation. This is analogous to the established
      behaviour for unsaved local allocatable coarrays in block 
      constructs or subprograms. It may be useful to add  
      a diagnostic for this via an optional argument 
      SYNCED_DEALLOCATION of END TEAM 
      (and perhaps other END <...> statements) that returns the 
      number of implicit coarray deallocations that have occurred 
      on the executing image.

  (2) Definition status of team argument
      This issue is dealt with via changes to the semantics of 
      FORM SUBTEAM described in (B.3.2) below.

  (3) Consistency issues with memory model?
      Assume that, in the above code, the unspecified statement
      in the line commented with (statement X) reads
      ... = a(:)
      With the presently defined synchronization semantics, this
      would be (formally) fine. With the loosened semantics suggested 
      by me, a race condition could manifest. The debate here is whether 
      this must be considered a consistency issue with the memory 
      model because
      * within the construct, the coindices of a are different 
        than those outside, and
      * it should be enforced that no modifications from outside
        the current team should be possible on any data object
        defined inside the current team
      My opinion is that the image index remapping is a purely
      virtual process that does not impact object identity, and that
      the difference with respect to safety against modification from
      "hidden" images between subteam-local and parent-team-inherited
      coarrays should be tolerated for the sake of being able to 
      write efficiently executing code. Even in the present coarray
      model, there are many ways to write code with race conditions, 
      and the problem described above is not in any fundamental way 
      different from the usual ones arising from incorrect or 
      insufficient use of synchronization statements. It must be 
      solved by a combination of "established best practices" and 
      tools that allow race conditions to be identified and isolated.
      "Best practices" will for CHANGE TEAM read: "If you access
      coarrays defined in a parent team (or pointers associated
      with such coarrays or subobjects of them), sandwich your 
      CHANGE TEAM construct between two SYNC ALL statements. 
      Otherwise - hands off them."
   
      Presently, I do not believe that there are any fundamental 
      problems with coarray allocations across CHANGE TEAM boundaries
      (the problem discussed in (B.6) below is of a different nature).
      The same applies for collective synchronization statements.
      However, ALL partial-synchronization constructs should be 
      checked; possibly some additional restrictions need to be added 
      in order to avoid semantic inconsistencies that might arise if 
      such constructs cross CHANGE TEAM boundaries. A discussion of 
      this is in section (F) below.

  Note that if the synchronisation requirements on CHANGE TEAM are 
  loosened as indicated above, the statement

  SYNC TEAM (M) 

  would be equivalent to 

  CHANGE TEAM (M) ; SYNC ALL ; END TEAM


(B.2) Missing support for computational domains

  The FORM SUBTEAM statement presently allows subteams to be established 
  via purely algorithm-driven methods. However, for performance 
  optimization purposes it would be very useful to be able to 
  generate subteams optimized for specific machine architectures. 
  For example, on a cluster of SMPs it will often be more efficient to 
  use teams whose member images exactly match the cores in an SMP, 
  most especially so if the Fortran run time is aware of the 
  difference between communication (as well as synchronization) across 
  and within SMPs.

  The simplest possible abstraction for this might be the use
  of an optional argument DOMAIN to the FORM SUBTEAM statement. 
  The values allowed for the argument would be default integers 
  between 1 and an implementation- and environment-dependent maximum
  DOMAIN_LEVELS (a protected integer accessible via ISO_FORTRAN_ENV). 
  If DOMAIN is specified, the <subteam-id> would need to take a 
  definable entity as argument, which is provided a return value. 
  Increasing values of DOMAIN should correspond to decreasing 
  performance efficiency of data transfers as well as synchronization
  statements (corresponding to decreasing bandwidths and 
  increasing latencies). The teams are given the IDs 1,...,
  NUM_TEAMS(<team object>), where NUM_TEAMS() is a new intrinsic.

  The reason that an environment dependency must be tolerated is
  that it is expected that coarray programs should in practice be 
  able to interoperate with other parallel paradigms, possibly in 
  various manners. Furthermore, additional hardware aspects 
  (like use of hyperthreading cores) or the used batch queueing 
  system may have an impact. None of these details should of 
  course be referred to in normative text.
  

(B.3) Sharpening of TEAM_TYPE object semantics is needed

  The definitions of TEAM_TYPE, FORM SUBTEAM and CHANGE TEAM appear to
  imply that an object of type TEAM_TYPE is, in a sense, an object
  distributed among all images of the ancestor team that describes
  a team decomposition. However, this is not explicitly spelled out, 
  and I suspect that the semantics are at this point too loosely  
  specified, inviting a number of misuses.

  For example, it appears to be permitted to write the following:

  type(team_type) :: t(2)
  integer :: id

  id = 2 - mod(this_image(),2)
  form subteam ( id, t(id) )

  which may just about make sense (because the team variable used
  is consistent with the identifier), but it is easy to generate
  setups where different team variables are associated with 
  different images of the same team. Furthermore, a statement

  change team (t(1)) 

  following the above team formation is not conforming since 
  t(1) is undefined on every second image, and for more 
  colorful setups (using many team objects) it will 
  be even easier to produce incorrect CHANGE TEAM statements.
  Also, from the implementation point of view, I would expect that
  (for scalability reasons) not all information about a team should
  be required to be stored on each image, so the implementation 
  should have the freedom to use something similar to a coarray 
  type component for TEAM_TYPE (maybe resulting in a requirement
  that teams must be scalars, or at most arrays with a statically
  defined size). 
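
The definition status described above can be tabulated with a small
model. This is only a sketch in Python, not Fortran: it assumes four
images and that FORM SUBTEAM defines exactly the element t(id) on the
executing image and nothing else. The function names are hypothetical.

```python
def defined_team_elements(num_images=4):
    """Map each image to the set of elements of t(:) it has defined.

    Models 'id = 2 - mod(this_image(),2); form subteam (id, t(id))':
    every image defines only the element selected by its own id.
    """
    defined = {}
    for image in range(1, num_images + 1):
        team_id = 2 - image % 2          # odd images -> 1, even images -> 2
        defined[image] = {team_id}
    return defined


def images_missing(element, num_images=4):
    """Images on which t(element) is still undefined after FORM SUBTEAM."""
    table = defined_team_elements(num_images)
    return [image for image, elems in table.items() if element not in elems]
```

Under these assumptions, images_missing(1) lists the even-numbered
images, which are exactly the images on which a subsequent
"change team (t(1))" would reference an undefined team variable.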

  Therefore, I suggest adding some additional properties and 
  restrictions to TEAM_TYPE and its usage:

  (B.3.1) each FORM SUBTEAM must reference the same object of 
      type TEAM_TYPE on every image. The same applies 
      for each CHANGE TEAM statement. 
      FORM SUBTEAM defines a decomposition, and CHANGE TEAM 
      activates the executing image's team (locally) as soon 
      as it enters the construct.
      (This feature is also needed to make NUM_TEAMS() - see
      (B.2) above - well-defined). 

  (B.3.2) If the synchronization requirements of CHANGE TEAM are 
      loosened as described in (B.1) above, FORM SUBTEAM must 
      perform synchronization of all executing images at the
      end of its invocation in order to assure the decomposition
      is fully defined when the CHANGE TEAM construct is first 
      encountered by an image; it may be more appropriate to
      convert the statement into an impure elemental collective 
      subroutine, because synchronization is then only incurred
      once even if a larger number of subteam decompositions is
      needed. (It may be sufficient to synchronize on a per-
      subteam basis, but this probably complicates the 
      specification).
      For an analogous reason, synchronization must occur before a
      team object is finalized by going out of scope. There is 
      now no explicit facility in place to do this, but I consider 
      it useful to define a DELETE SUBTEAM in order to, say, be able 
      to recycle a single team object if, e.g., team sizes are
      supposed to vary an indefinite number of times throughout 
      iterated execution of part of the program. Also, the
      necessary synchronization is then explicitly visible.

  Since a team decomposition usually creates more than one subteam, 
  I also suggest changing the FORM SUBTEAM nomenclature to 
  FORM SUBTEAMS.

(B.4) addressing coarrays defined in ancestor team

Assume that the initial team executing the following code contains four images. 

integer :: a[*]
type(team_type) :: t
integer :: id

a = this_image()
sync all

id = 1
if (this_image() == 3) id = 2
form subteam(id, t)

change team(t)
  select case(id)
  case (1)
    if (this_image() == 2) write(*,*) a, a[3]
    if (this_image() == 3) write(*,*) a, a[2]
  end select
end team

Questions:

(B.4.1) Is the coindexed reference to the coarray "a" inherited from the
    ancestor team intended to be conforming?
If yes, 
(B.4.2) What exactly will the write statements print?
(B.4.3) Assume that, from any image executing the case(1) block, 
    one wishes to access the object corresponding to a[2] in the ancestor
    team. How can this be done?
Furthermore, 
(B.4.4) Is it allowed to use the DISTANCE argument in THIS_IMAGE() if a coarray
    argument is also specified?

Then, consider the following code (4 images):

integer, allocatable :: b[:,:]
type(team_type) :: u
integer :: id

allocate(b[2,*], source=this_image())

id = 1
if (this_image() == 3) id = 2
form subteam(id, u)

change team(u)
  select case(id)
  case (1)
    if (this_image() == 3) write(*,*) b[1,2]
  end select
end team

Question: 
(B.4.5) Assuming (as part of the answer to question B.4.2 above) that 
        the image indices are mapped into the team as 1 -> 1, 2 -> 2, 
        4 -> 3, how are the coindices of the corank 2 coarray "b" mapped? 

One could simply compress the coindices into a flattened sequence: 
[1,1] -> [1,1], [2,1] -> [2,1], [2,2] -> [1,2]. However, this does not 
preserve the cartesian communication structure that was intended by this 
feature, and is therefore bound to become pretty confusing with growing 
corank. An alternative would be to retain coindexing of
a coarray as if accesses happen in the team it was created in:
[1,1] -> [1,1], [2,1] -> [2,1], [2,2] -> [2,2]
Would an access to [1,2] (updating an object outside the current team) 
then be valid or invalid (T2 would indicate the latter)? In any case, 
the additional bookkeeping needed to keep track of coarray distance 
and image indices would appear to make coding of communication rather 
complicated.
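
The two candidate mappings can be worked through mechanically. The
following is a sketch in Python, not Fortran, under the assumptions
stated in (B.4.5): four images in the parent team, cobounds [2,*], and
subteam 1 consisting of parent images 1, 2 and 4 in that order. The
helper names are hypothetical.

```python
def coindices(image, first_codim=2):
    """Coindices [i, j] of a 1-based image index for cobounds [first_codim, *]."""
    return ((image - 1) % first_codim + 1, (image - 1) // first_codim + 1)


def image_of(cosubscripts, first_codim=2):
    """Inverse of coindices(): 1-based image index for cosubscripts [i, j]."""
    i, j = cosubscripts
    return (j - 1) * first_codim + i


# Parent image indices of the members of subteam 1, in team order
# (assumed mapping 1 -> 1, 2 -> 2, 4 -> 3).
parent_members = [1, 2, 4]

# Scheme 1: compress the coindices according to the new team image index.
flattened = {coindices(p): coindices(k)
             for k, p in enumerate(parent_members, start=1)}

# Scheme 2: keep the coindices the coarray had in the team that created it.
retained = {coindices(p): coindices(p) for p in parent_members}

# Which parent image does b[1,2] designate under each scheme?
target = (1, 2)
flattened_target = parent_members[image_of(target) - 1]   # team image 3
retained_target = image_of(target)                         # parent image 3
retained_target_in_team = retained_target in parent_members
```

In this model, b[1,2] designates parent image 4 under the flattened
scheme, but parent image 3 under the retained scheme, and parent image
3 is not a member of the current team.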

It may save J3 as well as programmers a lot of grief to simply disallow 
coindexing on coarrays inherited from an ancestor team; due to the 
existence of global variables this would need to be a restriction that in
general requires a run-time check. Therefore, it would be useful to also 
allow a coarray argument for the new TEAM_DEPTH intrinsic, in order to 
guard code such as

if (team_depth() == team_depth(a)) then
  a[i] = ...  ! coarray a has the SAVE attribute.
else
  error stop 'Executing team is not the one that created a.'
end if

If the above restriction is introduced, dummy coarray arguments 
should also not be allowed to be associated with a coarray that is 
inherited from an ancestor team. Reason: This avoids
the need for the above guard code for such dummy arguments.

The highest possible price application programmers need to pay for this 
restriction is allocation (and deallocation) of a team-local coarray and 
a memory copy (or two). The significant benefit is that clearer coding
is enforced.


(B.5) Propagation of normal termination

  If WG5 decides to keep the synchronization requirements for the
  CHANGE TEAM construct, END TEAM must also be able to specify a 
  <sync-stat-list>.

  Conversely, if the synchronization requirement is removed, the 
  <sync-stat-list> should be deleted from the CHANGE TEAM statement.
  (looking at SYNC MEMORY, I see that this may need to be retained after all, 
   but wonder what it does ... checking itself for having been stopped?)

(B.6) Potential problems due to fanciful block structure nesting

Consider the following (I hope, conforming according to TS draft) example:

change team (m)
  block
    real, allocatable :: a(:)[:], b(:)[:]

    allocate(a(5)[*])
    select case (subteam_id())
    case (1)
      allocate(b(3)[*])
      : ! calculate
      deallocate(b)
    end select
    deallocate(a)
  end block
end team

The (de)allocation statements, while syntactically identical, are doing
semantically different things here. Namely, coarray "a" is being allocated
on each subteam stored in the team decomposition "m", while coarray "b"
is only being allocated on the subteam with id 1. Apart from the slight 
tummy-ache that the context-dependent meaning of this induces, there
is potential for easily introducing bugs in more complex code that 
would cause the application to hang or crash. For example, "deallocate(b)" 
might be placed outside the select case block by mistake. Furthermore, 
the synchronization semantics appear to be unclear: does the allocation
of "a" synchronize across the union of all subteam images, or only over
each subteam individually?
Perhaps it is necessary after all to enforce selection semantics on 
"change team" itself, thereby disallowing the possibility to 
interleave a block construct (or, worse, a library call containing 
explicit or implicit synchronizations) in the manner illustrated above:

change team (m) ! must have a subteam or default statement following
subteam (1) 
  block
    real, allocatable :: a(:)[:], b(:)[:]
    allocate(a(5)[*],b(3)[*])
    : ! do stuff   
    deallocate(a, b) ! or rely on automatic deallocation
  end block
default         ! guarantee separate context for each id 
  block    
    real, allocatable :: a(:)[:]
    allocate(a(5)[*])  
    : ! do stuff
    deallocate(a)    ! or rely on automatic deallocation
  end block
end team


(C) N1967 / Events:
~~~~~~~~~~~~~~~~~~~

(C.1) Atomicity of event count

  The description of events appears to imply that it is allowed to do
  multiple posts on a given event. However, given the synchronization 
  rules, the following seems to be disallowed if executed with
  more than one image:

  type(event_type) :: p[*]

  event post (p[1]) ! Updating p[1] in unordered segments

  But since a statement is used for event updates anyway, why not let
  them act atomically (i.e., effectively use ATOMIC_ADD for the updates)? 
  In particular, this would imply that 

  type(event_type) :: q[*]

  select case(this_image())
    case(1)
    event post (q)  ! (1)
    : ! do something that executes for a considerable length of time
    event post (q)  ! (2)
    case(2:3)
    event wait (q[1])
  end select

  is conforming, that exactly one of the images 2 or 3 (which one is 
  undetermined) will continue executing immediately after image 1
  executes statement (1), and the other one will continue executing
  after image 1 executes statement (2).

  See also the split phase barrier below for an application of this
  that does not need an inflation of event variables.
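
  The intended behaviour can be modelled outside Fortran. The sketch
  below uses Python threads to stand in for images 2 and 3 and a
  counter protected by a condition variable to stand in for the event
  variable q[1]; it assumes that a post atomically increments the count
  and that a successful wait atomically decrements it. The class and
  function names are hypothetical and are not part of the TS.

```python
import queue
import threading


class Event:
    """Toy model of an event variable whose count is updated atomically."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def post(self):
        # Atomic increment, comparable to using ATOMIC_ADD for the update.
        with self._cond:
            self._count += 1
            self._cond.notify()

    def wait(self):
        # Block until the count is positive, then consume one post.
        with self._cond:
            while self._count == 0:
                self._cond.wait()
            self._count -= 1


def run_two_posts():
    """Image 1 posts twice; images 2 and 3 each wait once on the same event."""
    q = Event()
    released = queue.Queue()

    def waiter(image):
        q.wait()
        released.put(image)

    threads = [threading.Thread(target=waiter, args=(i,)) for i in (2, 3)]
    for t in threads:
        t.start()

    q.post()                              # statement (1)
    first = released.get(timeout=5)       # exactly one waiter gets through
    only_one_released = released.empty()  # the other is still blocked
    q.post()                              # statement (2)
    second = released.get(timeout=5)
    for t in threads:
        t.join()
    return first, only_one_released, second
```

  Which of the two waiters is released first is not determined, in
  the model as in the proposal; only the number released per post is.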

(C.2) Split phase barrier

  Using LOCAL_EVENT_TYPE objects and assuming that posts and waits act 
  atomically, it is possible to write a split-phase barrier as follows 

  type(local_event_type) :: barrier[*]

  do i=1, num_images()
    event post( barrier[i] )
  end do
  : ! do work that does not violate rules
  do i=1, num_images()
    event wait( barrier )
  end do

  An implementation would be capable of doing the above much more efficiently
  if a collective facility like

  event postall (barrier) 
  : ! do work that does not violate rules
  event waitall (barrier) 

  were available.
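
  The loop-based version of the split-phase barrier can be checked
  with a small model. This is a sketch in Python, not Fortran: threads
  stand in for images, and one counting semaphore per image stands in
  for barrier[i], with release as EVENT POST and acquire as EVENT WAIT.
  It models a single use of the barrier only; the function name and
  the flags used for checking are hypothetical.

```python
import threading


def split_phase_barrier(num_images=4):
    """Run one split-phase barrier; return whether each image saw all arrivals."""
    barrier = [threading.Semaphore(0) for _ in range(num_images)]
    arrived = [False] * num_images
    seen_all = [False] * num_images

    def image(me):
        arrived[me] = True
        # Post phase: notify every image (including this one).
        for sem in barrier:
            sem.release()
        # Work that does not depend on other images can overlap here.
        _ = sum(range(1000))
        # Wait phase: consume one post from each image.
        for _ in range(num_images):
            barrier[me].acquire()
        # All num_images posts were received, so every image has arrived.
        seen_all[me] = all(arrived)

    threads = [threading.Thread(target=image, args=(i,))
               for i in range(num_images)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_all
```

  The model makes the cost visible: each image performs num_images
  posts and num_images waits, which is the overhead a collective
  post/wait facility could avoid.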

(C.3) Events as type components

  Constraint C603 appears to have been mangled, transforming its meaning to
  the opposite of what it presumably should be. 

  Edit [13.22]: In "variable definition context," replace the comma by
  " except".


(D) N1967 / Collectives:
~~~~~~~~~~~~~~~~~~~~~~~~

(D.1) Add CO_MULT for efficiency

Nowadays interconnects have support for offloading certain operations
to the infrastructure (e.g., FCA aka "fabric collective acceleration"), 
thereby considerably improving performance.
However, it appears unlikely that the relatively general CO_REDUCE 
facility would be able to support this facility. Therefore, it may
be desirable to also provide a CO_MULT collective for arguments of 
numeric type that supports multiplicative reductions, in order to 
obtain the same level of performance for all basic numeric operations.

(D.2) Asynchronous execution

By using local_event_type and possibly the ASYNCHRONOUS attribute, the
collective functions could be made to support asynchronous execution. 
This would allow overlap of communication and computation also for
the collective functions. For example, 

subroutine foo(ev, redu, ...)
  type(local_event_type), intent(inout) :: ev[*]
  real, intent(out) :: redu
  :
  call co_sum(source=x, result=redu, posted_event=ev)
! redu and x implicitly have the ASYNCHRONOUS attribute
! because co_sum takes a POSTED_EVENT argument
  :
end subroutine foo

subroutine bar(ev, redu, ...)
  type(local_event_type), intent(inout) :: ev[*]
  real, asynchronous :: redu
  :
  event waitall (ev)  ! Cf (C.2) above
  ... = redu ...      ! may now be able to safely reference redu
end subroutine
  
The program invoking the two above would need to look like this:

type(local_event_type) :: myev
real, asynchronous :: x

call foo(myev, x, ...)
: ! do other computations
call bar(myev, x, ...)

(E) N1967 / Atomic functions:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(E.1) OLD argument in atomic functions

It should be clarified that this argument of an atomic function
is not updated atomically. Perhaps using coindexed entities should be
prohibited here?


(F) Partial synchronization and teams
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section attempts a discussion of the consequences of relaxing
the synchronization properties of CHANGE TEAM with respect to
partial-synchronization image control statements. The following
text, unless explicitly stated otherwise, assumes that the global 
barrier at the beginning and end of a CHANGE TEAM block construct 
is removed, and only SYNC MEMORY is performed on all images.

In order to achieve the aim stated in N1930/T1, "When a block of 
code is executed on images executing as a team, it should execute on
those images as if the program contained no other images", the 
following third sub-item needs to be added to T1: 

- Activities that involve partial synchronization of images inside a 
  team, such as SYNC IMAGES, events and locks, need to be clearly 
  separated from any such activities that are invoked in a parent
  team.

The following subsections suggest usage restrictions on the partial
synchronization constructs that are necessary to guarantee this.

(F.1) Events

For a first example, assume that the following program is executed by
3 images:

integer :: data[*], id
type(local_event_type) :: ev[*]
type(team_type) :: tm

data = this_image()

id = 1 ; if (data == 3) id = 2 
form subteam ( id, tm ) 
! one subteam, assume image numbering 1 --> 1, 2 --> 2 for first team 

if (data == 1) then 
  data[3] = 4
  event post (ev[2])               ! (1)
end if

change team (tm)
  if (data == 2) event wait(ev)    ! (2)
end team

if (data == 2) write(*,*) data[3]  ! (3)

Two questions arise here: 

* Is the event post / wait sequence (1), (2) that crosses team 
  execution contexts valid?
* If yes, is the coindexed access in (3) valid?

I think that the answer to the first question should be "no", because 
otherwise either the answer to the second question would be "no"
(somewhat counterintuitive for the programmer), or statement (2)
would effectively need to perform synchronization that extends outside
the currently executing team. 

Therefore, the following restriction should be added:

* The event variable used in EVENT POST or EVENT WAIT statements must 
  be associated with the team that executes these statements.

As a consequence, an EVENT WAIT statement that can match an 
EVENT POST statement must be executed in the same current team.
Note that "associated with" as used above will need a proper 
definition. Also, the restriction would essentially be implied if the 
coindexing suppression suggested in (B.4) is accepted. 

Note that the following two usage patterns would be permitted

if () event post              if () event wait
change team                   change team
  :                             :
end team                      end team
if () event wait              if () event post

(The first would also be permitted with the big barrier in place, but 
 would have no effect. The second one would always deadlock with the 
 present semantics; it may or may not deadlock under relaxed
 synchronization, depending on what syncs are applied inside the
 CHANGE TEAM block. A deadlock detection tool will be helpful in
 identifying such issues.)

(F.2) Locks and CRITICAL blocks

For locks, the same restriction will be needed as for events:

* A lock variable used in LOCK or UNLOCK statements must 
  be associated with the team that executes these statements.

Since CHANGE TEAM is a (collectively executed) image control statement, 
its appearance inside a CRITICAL block is already prohibited via the 
rules in section 8.1.5 of N1830. The general rules on block structure 
nesting prevent undesirable interleaving of CHANGE TEAM and CRITICAL.


(F.3) SYNC IMAGES

This statement can be understood in terms of pairwise notifying 
events; in Fortran 2008 a single global event variable would be 
available. In order to avoid interference of SYNC IMAGES statements
that appear outside and inside a CHANGE TEAM block, it must be 
explicitly spelled out that on each team, SYNC IMAGES gets its 
own synchronization context (i.e. a team-specific event variable, 
that might be created when FORM SUBTEAM is executed).

As an example, consider the following code, run with 3 images and 
(in order to avoid complications due to image reindexing) a 
subteam decomposition that contains a single team with the same 
3 images:

me = this_image()

if (me == 1) sync images ([2])
change team (...)
  sync images ([1,3])  ! (X)
end team
if (me == 2) sync images ([1])

This code would execute just fine, while it would deadlock if the 
barrier on CHANGE TEAM is in place. It would also be considered
"bad practice", because it is very likely to produce deadlocks, 
for example by replacing [1,3] by [1,2] in statement (X). Note 
that without separated contexts, the latter would not deadlock, 
but very likely not produce the desired results.


(F.4) Team variables and CHANGE TEAM construct

For proper nesting of CHANGE TEAM constructs the statement sequence

form subteam (a)
change team (a)
  form subteam (b)
  change team (b)
  end team
end team

should be enforced by requiring that a team decomposition is created 
(and perhaps even declared) in the same context (current team) that 
uses it. For nested team use this effectively brings back the big
barrier, unless FORM SUBTEAMS only synchronizes by subteam (cf 
(B.3.2)); it may be more appropriate to consider a more powerful 
form of FORM SUBTEAMS in the future that is capable of generating
nested subteam sequences with a single invocation.

Finally, consider the following situation:

real, allocatable, save :: a(:)[*]
type(team_type) :: t
integer :: id

if (this_image() == 1) then
   id = 1
else
   id = 2
end if
form subteams(id, t) 

change team (t)
 if (id == 2) then
   allocate(a(1000)[*]) ! subteam 2 only
   : ! work with a
   deallocate(a)
 end if
end team

allocate(a(2000)[*])    ! all images

Here image 1 proceeds to the second ALLOCATE statement, while 
all other images execute the code inside the CHANGE TEAM
block.
From the application point of view this is fine: since the 
second ALLOCATE statement performs synchronization on exit, 
it will only complete once all images have executed it. 
However, there may (depending on the implementation) exist 
a race condition on the descriptor for a, which can only be 
prevented by also synchronizing upon entry to ALLOCATE.
This however is an implementation issue, and prescribing 
CHANGE TEAM to be executed collectively gives the implementation 
the opportunity to do the necessary (image-local) bookkeeping
that helps to decide whether or not to perform such an 
extra synchronization.

_______________________________________________________________________

Daniel Chen

There are a few technical issues raised by others that need more 
discussion and consideration.

_______________________________________________________________________

Malcolm Cohen

It would be slightly nicer if the text describing the features indicated whether 
certain things were or were not image control statements.  I understand that 
this would complicate the edits, but perhaps there could be an overview at an 
earlier stage.

It seems to be possible to copy event variables by argument association, e.g.
  CALL sub(event,(event))
This should probably be prevented by requiring INTENT(INOUT) on event dummy 
arguments.

6.4 says
  "If the count of a event variable increases through the execution of an EVENT 
POST statement on image M and later decreases through the execution of an EVENT 
WAIT statement on image T, the segments preceding the EVENT POST statement on 
image M precede the segments following the EVENT WAIT statement on image T."
which is all very well, but the very definition of "later" can only be 
interpreted as having segments already ordered, i.e. it is assuming a stronger 
fact than the result that it requires.

Consider
  image 1 segment i does EVENT POST(x)
  image 2 segment j does EVENT POST(x)
  image 3 segment k does EVENT WAIT(x)
for unordered segments i, j, k;
then image 3 segment k+1 follows image 1 segment i or image 2 segment j, but 
which?  Both?  Neither?  One of them but no-one knows which?  The obvious 
semantics would be that it follows both, but that i,i+1,j,j+1,k are all 
unordered.  That is, if the event counter has value N, posted by segments ii(1) 
to ii(N), then segment k+1 follows all of ii(1) to ii(N).

Obviously this needs to be rewritten to avoid assuming linear time ("later" 
forsooth) and to clarify the ordering that results.
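
The "follows both" reading can be illustrated with a small happens-before
model. This is only a sketch, written in Python rather than Fortran, and it
assumes vector clocks as the ordering device; the helper names (merge,
happens_before, tick) are hypothetical and appear nowhere in the draft.
Each EVENT POST is modelled as depositing the poster's clock on the event
variable, and the EVENT WAIT as absorbing every clock deposited so far.

```python
# Hypothetical happens-before model using vector clocks; not TS syntax.

def merge(a, b):
    """Component-wise maximum of two vector clocks."""
    return tuple(max(x, y) for x, y in zip(a, b))

def happens_before(a, b):
    """True if clock a is ordered strictly before clock b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def tick(clock, image):
    """Advance one image's own component, i.e. start a new segment."""
    c = list(clock)
    c[image] += 1
    return tuple(c)

# Three images (indices 0, 1, 2), each in a segment unordered with the others.
seg_i = tick((0, 0, 0), 0)   # image 1, segment i, ends with EVENT POST(x)
seg_j = tick((0, 0, 0), 1)   # image 2, segment j, ends with EVENT POST(x)
seg_k = tick((0, 0, 0), 2)   # image 3, segment k, ends with EVENT WAIT(x)

# Assumption: each post deposits the poster's clock on the event variable.
posted = [seg_i, seg_j]

# "Follows both" reading: segment k+1 absorbs every post counted so far.
seg_k1 = tick(seg_k, 2)
for p in posted:
    seg_k1 = merge(seg_k1, p)

# Segments i+1 and j+1 on the posting images.
seg_i1 = tick(seg_i, 0)
seg_j1 = tick(seg_j, 1)
```

In this model segment k+1 follows both i and j, while i, j and k remain
mutually unordered, and i+1 and j+1 are unordered with respect to k+1.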

I would slightly prefer C605 to be reworded as "An <event-variable> in an 
<event-wait-stmt> that is".  Yes, it has the BNF rule number on it anyway, but 
we have gotten that wrong so often in the past that it is best to spend the ink 
to make it more readable.

7.2 STAT para has "argument" twice and "variable" twice.  Please be consistent.

An "unsuccessful" collective with no STAT= does not cause error termination. 
Why?  If STAT_STOPPED_IMAGE is good enough to terminate SYNC ALL, it should be 
good enough to terminate a CO_BROADCAST.

Why permit STAT to be present on some images and not others?

Are there any error conditions for collectives apart from FAILED/STOPPED image? 
I see nothing about this being processor dependent.  If there are possible error 
conditions, the draft requires the processor to compute the correct result and 
perform the correct action regardless ... surely some mistake.

The stated design goal for performance is that collectives are not required to 
"wait" for completion, except on the image receiving the result.  However, if 
there are error conditions, presence of STAT (and maybe ERRMSG) will surely 
force such a wait to occur.  It seems unsatisfactory to have go-faster features 
that don't work if one uses the reliability features.

CO_BROADCAST does not require the same type parameters for SOURCE on all images.
Also, VARIABLE would be a better name than SOURCE since it is INTENT(INOUT) and 
receives the result.

CO_MAX of a scalar does not require the same type parameters for SOURCE on all 
images.

It is unsatisfactory for CO_MAX et al to require the SOURCE to be a definable 
variable, thus preventing collective max/sum/etc. of INTENT(IN) or PROTECTED 
variables (or indeed, of expressions).  If we can't have unambiguous syntax that 
handles both inplace collectives and result collectives, perhaps we should have 
two names, e.g.
  CO_SUM(SOURCE [ ,RESULT_IMAGE,STAT,ERRMSG ])
  CO_SUM_RESULT(SOURCE,RESULT [ ,RESULT_IMAGE,STAT,ERRMSG ])

As specified, EVENT_QUERY seems completely useless in that one would not be 
permitted to use it from a segment that is unordered with respect to any EVENT 
POST statement that updates it.  Indeed, events seem useless if there are 
multiple images that might want to post an event, since it would modify the 
variable from an unordered segment.  Presumably events are meant to be excluded 
from the unordered modification rules, but I see no text that describes such 
semantics.

_______________________________________________________________________

Bill Long

No, N1967 is not ready for a DTS ballot because, based on the ballots 
submitted so far, it would likely fail such a ballot.  Several people 
have raised issues that require more discussion and consensus before the 
TS is ready for a DTS vote.  I'll not repeat all of the other ballot 
comments here, but would like to point out a few -

1)  Should we add a CO_PRODUCT collective subroutine?  Editorial 
disruption for this is minimal, so the question is between need/value 
and additional clutter in Clause 13.

2) A proposed modification to the TEAM facility needs discussion.  If we 
adopt the idea, there appear to be material side-effects to the base 
memory model (such as SYNC ALL statements not executing on all images 
that could affect local variable values).

3) There are general concerns that the memory model aspects of the new 
features are not adequately specified.

4) Additional examples in the Annex would be helpful. (This was a known 
deficiency going into the ballot.)

_______________________________________________________________________

Nick Maclaren

I have not had time to cross-check on all of the details of N1967
against Fortran 2008, so these are not necessarily all of my objections.

At the end of my reasons, I append some proposals for improvement, but
the largest one is a rough draft.

REASONS FOR VOTING NO
---------------------
---------------------

Generic
-------

    1.1) The wording refers to cases when the execution of a statement
is not successful, but Fortran 2008 refers to error conditions.  This is
confusing, at best, and they should use compatible terminology.  It is
more serious when one considers node failure.

    1.2) That is not the only aspect in which the details differ.
The wording and other details need a systematic check and improvement.

    1.3) I am distinctly unhappy about the number of places where
semantics are defined for error conditions that are caused by
infrastructure failure, which is not in accordance with the Fortran
standard's previous practice.  STAT_FAILED_IMAGE is mentioned later, but
this is also done for events.

    1.4) The current dominating standard for parallel processing is MPI,
and its basic model has proven to be solid over many years.  This TS
provides many comparable facilities, but does not seem to have included
the comparable constraints needed for correctness and implementability.
This applies particularly to teams, but also to collectives.


Teams
-----

I have serious difficulty even understanding the basic model, and it
appears to make little sense.  FORM SUBTEAM is specified to be an
ordinary statement creating a variable, and all synchronisation is in
CHANGE TEAM, using a variable defined by a previous FORM SUBTEAM
statement.  All of the descriptions of which team is being referred to
are in terms of a variable, and not a value.  The following are a few of
the issues this causes.

    2.1) What happens if only some images in the current team have
called FORM SUBTEAM?  How does CHANGE TEAM know which other images to
wait on?

    2.2) In the following, do alf and bert indicate the same subteam?
And is it allowed to create two different teams at the same level, as in
bert and colin?  And how do other images know which of these FORM
SUBTEAM statements matches the FORM SUBTEAM statement on their image?

    TEAM_TYPE alf, bert, colin, dave
    FORM SUBTEAM (13, alf)
    FORM SUBTEAM (13, bert)
    FORM SUBTEAM (666, colin)

    2.3) Fortran defines intrinsic assignment for derived types; even if
that were locked out, several argument passing mechanisms imply implicit
copying.  The nearest that Fortran has to the concept of two variables
being the same is association.  It is not within the remit of this TS to
add major new, fundamental semantic concepts to Fortran, such as
unassociatable objects.  For example, in:

    TEAM_TYPE alf, bert
    FORM SUBTEAM (13, alf)
    bert = alf
    FORM SUBTEAM (42, bert)

or:

    TEAM_TYPE alf
    FORM SUBTEAM (13, alf)
    CALL ugh(alf)

    SUBROUTINE ugh (TEAM_TYPE arg)
        FORM SUBTEAM (42, arg)
    END SUBROUTINE ugh

    2.4) The following is allowed by the specification, but it makes no
sense.  Specifying synchronisation by how often CHANGE TEAM is called
doesn't work if its argument may be variable and there are no further
constraints.

    TEAM_TYPE array(NUM_IMAGES())
    ! Set up somehow
    REAL :: junk
    CALL RANDOM_NUMBER(junk)
    CHANGE TEAM (array(INT(junk*THIS_IMAGE())+1))
        ...
    END TEAM

    2.5) Allowing subteam variables in CHANGE TEAM with no further
constraints allows non-hierarchical team usage, which was not the intent
of N1930 T3.

    TEAM_TYPE alf, bert, colin
    FORM SUBTEAM (13, alf)
    CHANGE TEAM (alf)
        FORM SUBTEAM (42, bert)
    END TEAM
    FORM SUBTEAM (666, colin)
    CHANGE TEAM (colin)
        CHANGE TEAM (bert)
            ...
        END TEAM
    END TEAM
   
    2.6) The issue described in 2.4 also allows SYNC TEAM to synchronise
teams which are not the current team or one of its descendants.  This
is, at best, a recipe for deadlock.  Even allowing it on ancestors
introduces a conflict with N1930 T1 and T2.  Also, I cannot see that the
statement is required by N1930, or actually necessary.  It can be done
by temporarily changing team and calling SYNC ALL.

There are other problems, too, such as:

    2.7) In the specification of CHANGE TEAM, the current team when the
CHANGE TEAM was executed is not necessarily the parent of the team that
is being changed to, so specifying synchronisation of the parent team is
incorrect.

    2.8) I have tried to convince myself that correct programs will not
deadlock, and I have tried to convince myself that correct programs can
deadlock, and have failed with both.  The design is just too complicated
to be sure it is correct.

    2.9) The design very dubiously meets the requirement N1930 T2,
because an image belongs to all of the teams that it has formed and can
use them, which is the cause of the SYNC TEAM problems.

    2.10) There are some very nasty issues to do with resource leakage
if these facilities are used in a library.  FORM SUBTEAMS creates a
handle to something or other, but there is no way to release that
handle.  This would be easily soluble only if its function were subsumed
into CHANGE TEAM.

    2.11) It has omitted the qualification in LOCK and UNLOCK that
semantics are defined only for successful execution of the statements.
This is a variant of reason 1.1.


Conclusion: the constraints on team actions and the semantics of teams
need a complete rethink.



Collectives
-----------

    3.1) CO_REDUCE requires commutativity but not associativity of
OPERATOR, which makes no sense.  MPI requires associativity but not
commutativity, which at least makes sense.  It should require both.
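
As an illustration of why commutativity alone is not enough, the following
sketch (in Python, not Fortran; the function avg and the sample values are
assumptions chosen for the example) uses an operation that is commutative
but not associative. The result of the reduction then depends on how the
processor happens to group the contributions.

```python
from functools import reduce

def avg(a, b):
    """Commutative (avg(a, b) == avg(b, a)) but not associative."""
    return (a + b) / 2

# One contribution per image; the values are arbitrary sample data.
values = [8.0, 4.0, 2.0, 0.0]

# Left-to-right chain: ((8 avg 4) avg 2) avg 0
chain = reduce(avg, values)

# Balanced tree: (8 avg 4) avg (2 avg 0)
tree = avg(avg(values[0], values[1]), avg(values[2], values[3]))
```

Both groupings are legitimate ways to carry out a reduction, yet here they
give different answers (2.0 for the chain, 3.5 for the tree).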

    3.2) Also, it does not specify anything about the consistency of
OPERATOR, which is a recipe for problems.  I have serious difficulty in
understanding the combination of C730, C1218, C1220, C1234 and 12.4.3.6
paragraph 7, but can believe that the requirement for an elemental
function means that it must be the same function.  However, that is not
enough (semantically) because of global or parent scope variables and
THIS_IMAGE().  This should be improved.
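
A minimal sketch of the problem, in Python rather than Fortran: the
parameter named image stands in for THIS_IMAGE(), and make_operator is a
hypothetical helper invented for this example. The function text is the
same everywhere, but the mapping it computes differs from image to image.

```python
def make_operator(image):
    """Return a hypothetical OPERATOR whose result depends on the calling image."""
    def op(a, b):
        # Same function text on every image, but not the same mapping.
        return a + b + image
    return op

# The same pair of values combined on two different images.
on_image_1 = make_operator(1)(10, 20)
on_image_2 = make_operator(2)(10, 20)
```

Here the same pair of arguments yields 31 on one image and 32 on the other,
so the reduced value depends on where each combination step is performed.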

    3.3) The specification of the ordering of collective subroutines
makes sense and is what we agreed, but remains confusing.  A NOTE should
be added to clarify our intent.

    3.4) There is a potentially serious interaction between collectives
and atomics, as far as consistency goes, because both can be used to
pass information between unordered segments.  See Data Consistency
below.



Events
------

    4.1) I am baffled by the reference to INTENT(INOUT) in C602 and
C603.  In particular, both EVENT POST and EVENT WAIT necessarily both
read and write the variable, so it seems bizarre to lock out the only
case that makes semantic sense.  Neither of those statements makes any
reference to whether their event-variable may be INTENT(IN), INTENT(OUT)
or PROTECTED, none of which make semantic sense.  The only thing that I
can assume is that the sense of the condition has got inverted.  This
needs fixing.

    4.2) Page 14 (6.4 EVENT WAIT) lines 7-11 are surely erroneous in the
case where the EVENT POST fails, and probably when the EVENT WAIT does.
This matter is not as simple as it appears to be, because it has a
significant impact on permitted serial optimisations.  See Data
Consistency below.

    4.3) The word 'later' is thoroughly ill-defined in a parallel
context, especially when it is applied to general semaphores.  In
particular, it begs the question of which one of several possible uses
of EVENT POST does the EVENT WAIT synchronise with?  As the
specification stands, this means that they must NOT be image control
statements, because that would introduce a logically recursive
definition into the standard.  I.e. the sequence of their execution
controls the ordering, but the ordering controls the sequence of their
execution!

This needs specifying properly, and would be vastly simplified if events
were changed from being general semaphores to being binary ones, though
that would conflict with N1930.  See Data Consistency below.

    4.4) There is nothing said about global consistency, which is
well-known to be a potential problem with such actions (as with
atomics).  In particular, it might seem obvious to assume sequential
consistency, but that does NOT immediately follow.  Whatever model is
chosen needs specifying.  See Data Consistency below.

Obviously, that choice has a major impact on the EVENT_QUERY intrinsic,
especially as it is defined only when it is ordered with respect to all
EVENT POST and EVENT WAIT statements.



Atomic Intrinsics
-----------------

There are at least two structural problems with these.

    5.1) The first is that their ATOM argument is not required to be a
coarray, unlike ATOMIC_DEFINE and ATOMIC_REF, which is undefined if an
atomic coarray object is an actual argument to a procedure which does
not define it as a coarray but then calls these procedures.  That needs
fixing.

    5.2) The second is that these extensions are truly baffling in the
context of Fortran 2008 13.1 paragraph 3 and Note 13.1.  I am not sure
what to do, but supporting the fetch-and-operate paradigm means that the
global data consistency problem simply has to be addressed.  There are
several options, but all have extremely unobvious and serious
consequences.  See Data Consistency below.


Data Consistency
----------------

    6.1) This is not a simple matter, and WG5 will be making a serious
mistake if it proceeds to add facilities of the nature proposed in the
TS without putting some serious thought into the data consistency model.
In Fortran 2008, we evaded this by selecting a simple and extremely
proscriptive model for SYNC IMAGES and kicking the consistency of
atomics into the long grass.  This is no longer viable, for three
reasons:

    1) Upon doing a bit more research, I realise that I may have been
wrong in believing that all DAGs can be serialised, though I have so far
been unable to work out or track down an example of a DAG that cannot
be.  It turns out that this is a currently active area in computer
science, and is known (at least by some people) as DAG consistency.
Fortran's model is what the following papers define as the WW model;
Fortran is not fixated about determinism, unlike most modern computer
science.  See, for example:

    http://supertech.csail.mit.edu/papers/frigo-ms-thesis.pdf
    http://www.fftw.org/~athena/papers/ipps96.ps.gz
    http://www.fftw.org/~athena/papers/spaa98-memory.ps.gz
    http://www.fftw.org/~athena/papers/spaa96.ps.gz
 
My concern is that events may be the straw that breaks the camel's
back, and we may have stepped over the boundary into an inconsistent
model.  Unfortunately, I am not an expert in this area, though I
know enough to know that apparently obvious truths are often false.

    2) Fortran events are general semaphores.  While these are
well-studied, they are nowadays usually avoided in favour of other
mechanisms.  However, I have so far been unable to find any precise
description of the ordering semantics for general semaphores, or
convince myself that I understand that aspect.  Dijkstra himself
pointed out that they are no more powerful than binary semaphores.
See 4.2 in:

    http://www.cs.utexas.edu/users/EWD/transcriptions/EWD01xx/EWD123.html
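
To make the equivalence concrete, here is a sketch of a general (counting)
semaphore built from two binary semaphores. It is written in Python, it
follows a construction commonly attributed to Barz rather than necessarily
the one in the note cited above, and the class and method names are
hypothetical. Python's threading.Lock is used as the binary semaphore,
since it may be released by a thread other than the one that acquired it.

```python
import threading

class CountingSemaphore:
    """General semaphore built from two binary semaphores (Barz-style sketch)."""

    def __init__(self, initial=0):
        self._count = initial
        self._mutex = threading.Lock()   # binary semaphore guarding _count
        self._gate = threading.Lock()    # binary semaphore: open iff a unit is free
        if initial == 0:
            self._gate.acquire()         # start with the gate closed

    def post(self):
        """V operation: add one unit."""
        with self._mutex:
            self._count += 1
            if self._count == 1:
                self._gate.release()     # the first unit reopens the gate

    def wait(self):
        """P operation: take one unit, blocking while none is available."""
        self._gate.acquire()
        with self._mutex:
            self._count -= 1
            if self._count > 0:
                self._gate.release()     # leave the gate open for the next waiter
```

This shows only that the counting behaviour can be emulated; it says
nothing about the ordering semantics, which is the aspect in question.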

Worse, the current specification defines behaviour if they are
unsuccessful, and I have absolutely no idea what implications that might
have.

In particular, events affect segment ordering and, if we do not specify
anything, that is going to break the properties of the data consistency
model that we agreed (after much discussion!) in Fortran 2008.  Either
we need to simplify them very considerably (probably to binary,
unconditional semaphores), or we need to call in some specialist
expertise, or both.  I cannot convince myself that the current
specification is consistent.

    3) I really can't see any way to make sense of these atomics except
by enforcing consistency.  In particular, it is NOT automatic that
operations on even a single atomic variable are sequentially consistent,
which was the topic of such debate in Fortran 2008.  However, thinking
about what various forms of relaxed consistency might mean with the
fetch-and-operate intrinsics makes my head hurt.  However, recently,
I have found some leads.

There is also the point that the simple, inconsistent atomics that we
defined in Fortran 2008 are extremely useful on systems that have no
hardware or operating system support for global consistency, because
they can often be implemented efficiently, whereas consistent atomics
need to be emulated by using locks or equivalent.  There are also a lot
of uses for atomics that do not require any particular consistency.

    6.2) When it comes to consistency, there is no logical difference
between a PGAS model and shared memory, and one of the few good designs
I have seen is the C++ memory model.  For a clear and fairly simple
description of the issues, see sections 1 to 3 of:

    http://www.hpl.hp.com/techreports/2008/HPL-2008-56.html
 in:
    http://www.hpl.hp.com/personal/Hans_Boehm/pubs.html

Evidence of the model's consistency is in:

    http://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf
 in:
    http://www.cl.cam.ac.uk/~pes20/

Regrettably, I cannot claim to be enough of an expert to guarantee to
validate such a design for Fortran, though I am enough of one to spot
obvious flaws.

There is the question of whether the atomics should also synchronise
other data uses.  I believe that our assumption is that they should NOT
necessarily do so, which is a significant difference from the C++ model.
This is the point at which I get out of my depth.  I am almost certain
that the Fortran design makes sense, but parallel semantics is so
deceptive that I am chary of assuming I am right.

    6.3) While I believe that the time has come to do it properly, I
accept the argument that this would derail the schedule.  However, I
believe that it is essential that we (a) integrate events with the basic
DAG consistent segment model, (b) decide and define something about
atomics and (c) try to avoid taking decisions that will prevent a proper
solution later.

    6.4) Arising from this, I believe that the easiest solution to the
event problem is to simplify them to be binary semaphores and explicitly
require all image control statements to be DAG consistent.  This last is
not a change, but merely a more explicit statement of what the standard
currently says.

I am a bit nervous about even allowing EVENT POST on an already posted
event variable to return an indication of the fact, but suspect that it
will be so often demanded that it is unavoidable, and it does not seem
to add any problems that have not already been introduced by the
ACQUIRED_LOCK= specifier in the LOCK statement.

    6.5) I previously believed that the simplest solution to the atomics
problem is to explicitly define sequential consistency on a single
atomic variable for the new atomics, and to explicitly state that the
sequence for two variables need not be consistent.  However, since then
I have thought of realistic examples where that will not work, and Mark
Batty has pointed out others to me.  He has, however, suggested an
alternative model that might work, which he refers to as the
release/acquire model.  I am continuing my discussions with him.

This leaves ATOMIC_DEFINE and ATOMIC_REF as anomalies, and I believe
that we should provide alternative store and fetch atomics that are
consistent with the new ones, and leave ATOMIC_DEFINE and ATOMIC_REF as
processor-dependent.  Alternatively, we could use those names for the
consistent versions, and add new names for the relaxed ones.



Image Failure
-------------

    7.1) This is not a minor addition.  No language has ever managed to
standardise recovery of an application from general system-generated
errors or infrastructure failure, and even POSIX does not attempt it.
There are fundamental reasons why this should not be attempted in a
portable language.

Fortran has not so far defined any recovery facilities from even the
simplest cases of I/O errors (e.g. erroneous data in formatted input),
which have been supported by many languages for the past 40 years.
Image failure is significantly harder to recover from than even
system-generated I/O errors such as a disk failure.  Furthermore, I/O
error recovery is required by a vastly larger number of programs than
even use coarrays, let alone those that want this facility.

Fortran has not defined any form of the now-conventional exception
handling, because of difficulties in integrating it with the existing
language.  Even Ada has not attempted to define recovery from
system-generated errors, let alone infrastructure failure.  C's signal
facility permits this sort of failure trapping, but is merely a
standardised syntax to entirely undefined (sic) semantics and
implementations that simply crash exist and are conforming.

When an image fails, it will usually be while it is active, which at the
very least means that any data it might have defined in its active
segment (including coarray data on other images) becomes undefined.  Any
other images that may have been interacting with the image when it
failed (whether via coarray data that it owns, collectives, events or
other) also reach an undefined state.  Because Fortran permits the
buffering of file operations over image control statements, the output
and error units, and any shared files also become undefined, even if
they were not being accessed at the time or even used in the image that
failed.

Lastly, there is nothing stopping processors from providing this
facility as an extension, and that would be a far better way to do it,
at least until such time as this is shown to be feasible in at
least most processors.

This feature should be removed.



PROPOSALS
---------
---------

This is mainly in the form of changes to the main body of N1967,
excluding the editorial changes, but the teams proposal is much
looser.


Generic
-------

A.1) Throughout the document, the wording should be changed to be the
same as in Fortran 2008, and any missing constraints, restrictions and
similar wording added.


Image Failure
-------------

B.1) This should be removed, as should all references to it.  If it is
to be preserved, it is critical that it spells out exactly what a
processor must do, and exactly what a programmer may assume, when
STAT_FAILED_IMAGE is returned.  This may well be "nothing" and "nothing", but
it must be in normative text.


Collectives
-----------

C.1) On page 19, lines 28 and 28, "mathematically commutative operation.
If the operation implemented by OPERATOR is not associative, the
computed value of the reduction is processor dependent" should be
replaced by "mathematically associative and commutative operation".

C.2) On page 19, line 36, the following should be added:

    If the same pair of values is passed as arguments to OPERATOR on two
    images, the results shall be equivalent.  If SOURCE is numeric,
    equivalent means mathematically equivalent [Fortran 2008 7.1.5.2.4];
    otherwise, its meaning is processor-dependent.

C.3) On page 15, after line 16, the following should be added:

    NOTE
    Calls to collective subroutines are not image control statements
    and do not perform any synchronisation, to allow maximum scope
    for optimisation.  If synchronisation is needed, programs must
    call SYNC ALL explicitly.

C.4) It is critical that we say something in normative text about the
consistency they require, to prevent optimisation creating causal loops.


Events
------

I am proposing a lesser facility than the one agreed in N1930, to avoid
the synchronisation definition problems I mention above.

D.1) On page 13, lines 10 to 13 should be replaced by:

    A scalar variable of type EVENT TYPE or LOCAL EVENT TYPE is an event
    variable.  An event variable includes a boolean state of whether an
    event has been posted.  The initial value of the state of an event
    variable is that no event has been posted.

D.2) On page 13, lines 20 and 23 (C602 and C603), "where" should be
replaced by "unless".

D.3) On page 13, line 27 should be replaced by:

    R601  event-post-stmt  is  EVENT POST( event-variable
                               [POSTED=event-posted-variable,
                               sync-stat-list] )

After line 28, the following should be added:

    R60x  event-posted-variable   is   scalar-logical-variable

Line 31 should be replaced by:

    The EVENT POST statement is an image control statement.

    If the state of the event variable is not posted, successful
    execution of an EVENT POST statement causes its state to be set to
    posted.  If a POSTED= specifier is present, it also causes the
    scalar logical variable to become defined with the value true.

    If the state of the event variable is posted, successful
    execution of an EVENT POST statement without a POSTED= specifier
    causes the executing image to wait until it is not posted, and then
    causes its state to be set to posted.

[[[ This is not quite the wording used in LOCK, but this situation is
a bit more complicated. ]]]

    If the state of the event variable is posted, successful
    execution of an EVENT POST statement with a POSTED= specifier does
    not change its state and causes the scalar logical variable to
    become defined with the value false.

D.4) On page 14, lines 5 to 6 should be replaced by:

    The EVENT WAIT statement is an image control statement.

    If the state of the event variable is not posted, successful
    execution of an EVENT WAIT statement causes the executing image
    to wait until it is set to posted, and then causes its state to
    be set to not posted.

    If the state of the event variable is posted, successful execution
    of an EVENT WAIT statement causes its state to be set to not
    posted.
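
The binary-event semantics proposed in D.3 and D.4 can be sketched as a
small state machine. The model below is in Python, not Fortran, and is only
an illustration: the class BinaryEvent and the report argument (standing in
for a POSTED= specifier) are hypothetical names. It assumes that an EVENT
POST without POSTED= on an already posted event blocks until the event has
been consumed and then sets the state to posted.

```python
import threading

class BinaryEvent:
    """Hypothetical model of a binary event variable; not TS syntax."""

    def __init__(self):
        self._posted = False                 # initial state: not posted
        self._cond = threading.Condition()

    def post(self, report=False):
        """EVENT POST. With report=True, mimic a POSTED= specifier:
        never block, and return whether this call did the posting."""
        with self._cond:
            if self._posted and report:
                return False                 # already posted; state unchanged
            while self._posted:              # no POSTED=: wait until consumed
                self._cond.wait()
            self._posted = True
            self._cond.notify_all()
            return True

    def wait(self):
        """EVENT WAIT: block until posted, then reset to not posted."""
        with self._cond:
            while not self._posted:
                self._cond.wait()
            self._posted = False
            self._cond.notify_all()
```

A post with report=True on a fresh event returns true; a second such post
returns false and leaves the state alone; a wait then resets the event so
that the next post succeeds again.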

D.5) In Fortran 2008, on page 189, after paragraph 2, the following
should be added:

    During the execution of a program, a processor shall ensure that
    all uses of image control statements are consistent with themselves
    and with the program execution order on each image.  [[[ This is
    not sufficient, unfortunately - see below. ]]]

    NOTE
    Successful execution of a LOCK statement with an ACQUIRED_LOCK=
    specifier or an EVENT POST statement with a POSTED= specifier that
    sets the scalar logical variable to false has the semantics of a
    SYNC MEMORY statement.

[[[ Unfortunately, the explicit requirement for consistency is
necessary.  At best, parallel time has all of the event ordering
problems of special relativity - and when one allows optimisation, as is
critical for the larger systems, it's nearly as bad as general
relativity, including the Cauchy problem.  And, yes, I do mean apparent
time warps.  We swept the problem under the carpet in Fortran 2008,
because the only way that problems could be caused was by direct use of
SYNC MEMORY or by arcane uses of locks, but unfortunately we can do so
no longer.

Also, though it is not spelled out here, we need to specify what we
mean by consistent. ]]]

D.6) On page 14, lines 7 to 11 should be replaced by:

    During the execution of the program, the state of an event variable
    is changed by the execution of EVENT POST and EVENT WAIT statements
    in a sequentially consistent order [Fortran 2008 8.5.2].  If the
    state of an event variable is set to posted through the execution of
    an EVENT POST statement on image M and is subsequently in that order
    set to not posted through the execution of an EVENT WAIT statement
    on image T, the segments preceding the EVENT POST statement on
    image M precede the segments following the EVENT WAIT statement on
    image T.

    NOTE
    When there are more than two images using EVENT POST and EVENT WAIT
    statements on the same event variable, programs should regard the
    precise execution order as being unpredictable.

D.7) On page 21, after line 2, the following should be added:

    NOTE
    A use of EVENT_QUERY is defined only if its segment is ordered
    against all uses of EVENT POST and EVENT WAIT on its event variable
    argument, and should not be used to track the progress of other
    images.
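
The ordering required by D.6 can be illustrated with a small Python
model.  This is only a sketch under stated assumptions: threads stand in
for images M and T, a threading.Event stands in for an event variable
with the two states posted and not posted, and the names
run_producer_consumer, image_m and image_t are hypothetical.  It is not
Fortran and says nothing about how a processor would implement events.

```python
import threading


def run_producer_consumer():
    """Model of D.6: image M writes, then posts; image T waits, then reads."""
    event = threading.Event()   # models an event variable: posted / not posted
    shared = {}                 # models data defined by image M
    seen = []

    def image_m():
        shared["x"] = 42        # segment preceding the EVENT POST
        event.set()             # EVENT POST: state becomes posted

    def image_t():
        event.wait()            # EVENT WAIT: blocks until the state is posted
        event.clear()           # ... and the state is set back to not posted
        seen.append(shared["x"])  # segment following the EVENT WAIT

    t = threading.Thread(target=image_t)
    m = threading.Thread(target=image_m)
    t.start()
    m.start()
    m.join()
    t.join()
    return seen[0], event.is_set()
```

With exactly one poster and one waiter, the read in image_t always
observes the value written by image_m.  The wait/clear pair in this
model is not atomic, so with further threads posting or waiting on the
same event the outcome would depend on scheduling, which is loosely
the situation the NOTE in D.6 warns about.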


Atomics
-------

E.1) The ATOM argument should be required to be a coarray.

E.2) The wording should be adapted from that in Fortran 2008.

E.3) On page 15, lines 3-4, the procedure names ATOMIC_SET and
ATOMIC_VALUE should be added.

E.4) Somewhere on page 15, the following should be added:

    A processor shall ensure that all uses of ATOMIC_ADD, ATOMIC_AND,
    ATOMIC_CAS, ATOMIC_OR, ATOMIC_SET, ATOMIC_VALUE and ATOMIC_XOR on a
    single atomic variable are consistent, and are consistent with the
    program execution order on each image and with segment order
    [8.5.2].  [[[ This is not sufficient, unfortunately - see below. ]]]

    NOTE
    Programmers should not assume that the apparent sequences of
    actions on two different atomic variables are compatible, nor
    that any uses of ATOMIC_DEFINE or ATOMIC_REF are consistent
    with uses of the other atomic procedures, not even on the same
    variable.  Also, using changes in the value of atomic variables
    together with the SYNC MEMORY statement will not create a
    well-defined segment ordering, though it may appear to.

[[[ My remarks under D.5 are equally relevant here.  We need to specify
what we mean by consistent in this context as well. ]]]

E.5) On page 17, after line 13, definitions of ATOMIC_SET and
ATOMIC_VALUE should be added.  These are essentially replications
of ATOMIC_DEFINE and ATOMIC_REF with the semantic difference
described above.


TEAMS
-----

This is very much a rough draft.  I have followed N1967 as closely as
possible, and it is largely in the form of differences from the current
specification.

Much of the syntax is the same, but I propose changing the constraints
and semantics quite drastically in order to solve the problems I and
some other people raised.  It models itself on the proven design of MPI
(and especially communicators and MPI_Split).

Unlike in N1967, the parent team of a subteam is the current team at the
time the CHANGE TEAM statement that created it was executed, and not the
current team at the time FORM_SUBTEAMS was invoked.  The FORM_SUBTEAMS
intrinsic is a collective that returns a handle that describes a team,
and is nothing more.

The only actions that can be performed on any team other than the
current one are to return to the parent by executing an END TEAM
statement, to change to a descendent by executing a CHANGE TEAM
statement, and to call inquiry functions (N1930 T6) on a local object,
which can be guaranteed not to cause synchronisation problems.

I have not provided a collective to release a subteam created by
FORM_SUBTEAMS, as I can't see how to specify it at all concisely.  This
is something that needs discussing.  There are many possibilities, but
all of the ones I can see involve drastic changes to N1967's model,
or would be simply documenting that a program can use Fortran's
scoping rules and DEALLOCATE to release resources.

The TEAM_NONE constant
----------------------

    The value of the TEAM_TYPE constant TEAM_NONE shall indicate a
    conceptual team containing no images.

The CHANGE TEAM construct
-------------------------

The only syntactic change is:

    R502 change-team-stmt is [team-construct-name:] CHANGE TEAM (
                                 subteam [, sync-stat-list] )

This has extra restrictions:

    The subteam shall be of type TEAM_TYPE, shall have been set up by
    the FORM_SUBTEAMS intrinsic, shall not be TEAM_NONE and shall be a
    descendent of the current team.

    In matching CHANGE TEAM statements, all values of subteam shall
    describe the same team.  [[[ This needs phrasing properly. ]]]

    The synchronisation at both the CHANGE TEAM and END TEAM statements
    is between all images of the team represented by the value of the
    subteam variable.  [[[ This needs phrasing properly. ]]]

Note that neither the current team nor any parent team is involved.
This provides a considerable performance enhancement in the case where a
subset of the teams needs to call a collective without interacting with
the rest of the current team.

The FORM_SUBTEAMS intrinsic
---------------------------

This is now a collective subroutine and not a statement, with a
specification like:

    FORM_SUBTEAMS ( IMAGE_ID , RESULT )

    Description. Create values for subteams.

    Class. Collective subroutine.

    Arguments.

    IMAGE_ID shall be scalar and of type integer and shall be
    non-negative.  Two images will be in the same subteam if and only if
    IMAGE_ID is the same on those images, except that IMAGE_ID may be
    zero, which indicates that the invoking image is not involved in any
    subteam.

    RESULT shall be scalar and of type TEAM_TYPE.  It is an INTENT(OUT)
    argument.

    If IMAGE_ID is zero, RESULT becomes defined with the value
    TEAM_NONE.  Otherwise, RESULT becomes defined with a value that
    identifies the subteam that includes the calling image.
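
The grouping rule in this specification can be sketched as a small
Python model.  It is only an illustration under stated assumptions: a
team value is modelled as the set of member image indices, TEAM_NONE as
the empty set, and form_subteams takes the IMAGE_ID of every image at
once rather than being called collectively.  None of these
representation choices come from the proposal.

```python
TEAM_NONE = frozenset()   # model of the conceptual team containing no images


def form_subteams(image_ids):
    """Model of FORM_SUBTEAMS.

    image_ids maps each image index of the current team to the IMAGE_ID
    that image supplies.  Returns a mapping from image index to RESULT,
    modelled as a frozenset of the member image indices.
    """
    if any(image_id < 0 for image_id in image_ids.values()):
        raise ValueError("IMAGE_ID shall be non-negative")
    members = {}
    for image, image_id in image_ids.items():
        if image_id != 0:                       # zero: not in any subteam
            members.setdefault(image_id, set()).add(image)
    return {
        image: TEAM_NONE if image_id == 0 else frozenset(members[image_id])
        for image, image_id in image_ids.items()
    }
```

For six images where image 1 supplies zero and the others supply
1 + ME mod 2, the model gives TEAM_NONE on image 1, the team {2, 4, 6}
on the even images and the team {3, 5} on images 3 and 5.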
 
I have omitted STAT and ERRMSG, as I don't see what use they are.  They
could be restored, as for ALLOCATE.

The SYNC TEAM statement
-----------------------

This is removed, on the grounds that it is almost impossible to
specify constraints on its use that make it implementable and give it
well-defined semantics.  The only use that is definable is that of
using it for a descendent team, where it would simply be syntactic
sugar for:

    CHANGE TEAM (team)
        SYNC ALL
    END TEAM

The SUBTEAM_ID intrinsic
------------------------

This is removed.

The NUM_IMAGES and THIS_IMAGE intrinsics
----------------------------------------

    The descriptions of the intrinsic functions NUM_IMAGES() and
    THIS_IMAGE() in ISO/IEC 1539-1:2010 are changed by adding optional
    arguments DISTANCE and TEAM and a modified result if either is present.

    The DISTANCE argument shall be a scalar integer.  It shall be
    nonnegative.

    The TEAM argument shall be a scalar of type TEAM_TYPE, and shall
    have a value that was returned by the FORM_SUBTEAMS intrinsic.
    If DISTANCE is present, TEAM shall not be present.

    If neither argument is present, the result value is the image
    index of the invoking image in the current team.

    If DISTANCE is present with a value less than or equal to the team
    distance between the current team and the initial team, the result
    has the value of the image index in the team of which the invoking
    image was last a member with a team distance of DISTANCE from the
    current team; otherwise, the result has the value -1.

    If TEAM is present and the invoking image is a member of that team,
    the result has the value of the image index of the invoking image in
    that team; otherwise, the result has the value -1.

[[[ Returning -1 is cleaner than choosing some random team, whether
or not that is the current one. ]]]
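
The three result rules for THIS_IMAGE above can be modelled in a few
lines of Python.  This is a sketch only.  It assumes that a team is an
ordered tuple of initial-team image indices, that image indices within
a team follow that order (the text above does not say how they are
assigned), and that the chain of teams from the initial team to the
current team is available as a list called ancestry.  All names are
hypothetical.

```python
def this_image(me, ancestry, distance=None, team=None):
    """Model of THIS_IMAGE with the optional DISTANCE and TEAM arguments.

    me       -- index of the invoking image in the initial team
    ancestry -- list of teams from the initial team to the current team,
                each a tuple of initial-team image indices
    """
    if distance is not None and team is not None:
        raise ValueError("If DISTANCE is present, TEAM shall not be present")
    if team is not None:
        # Member of TEAM: index in that team; otherwise -1.
        return team.index(me) + 1 if me in team else -1
    if distance is None:
        distance = 0                      # neither argument: current team
    if distance < 0:
        raise ValueError("DISTANCE shall be nonnegative")
    if distance > len(ancestry) - 1:      # further away than the initial team
        return -1
    return ancestry[-1 - distance].index(me) + 1
```

With an initial team of images 1 to 8 and a current team of the odd
images, image 5 gets 3 with no arguments, 5 with DISTANCE=1, -1 with
DISTANCE=2, and -1 when TEAM is the team of even images.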

_______________________________________________________________________

Toon Moene

I have to vote no.  In addition to all the arguments for not passing this
TS, I have reconsidered my example in A2, Clause 6 notes:

"Example 2: Producer consumer program."

As far as I can see, it is correct with

TYPE(LOCAL_EVENT_TYPE) :: EVENT[*]

As I tried very hard to come up with an example to show the need for 
(non_local) EVENT_TYPE, I question whether we need two event types.

Please give TS 18508 another round at the meeting in Delft.

_______________________________________________________________________

David Muxworthy

Technical:
Given the issues already raised by others, the document is clearly
not yet ready for forwarding.

Editorial:
Subclause 2
At [3:5+] add:
ISO/IEC 1539-1:2010/Cor 1:2012
ISO/IEC 1539-1:2010/Cor 2:2013

Subclause 3
The 'ISO_FORTRAN_ENV' module referenced at 3.2 and 3.3 is not the
module from Fortran 2008.  At [5:3] add sentence "The intrinsic
module ISO_FORTRAN_ENV is extended by this document."

At 3.3 'team variable' should either follow 3.4 'team' or be
subsumed under 'team'.

Subclause 4
[7:8] Replace "This" by "Except as identified in 4.1 above, this".

Subclause 8.3
The items should be in numerical order.

______________________________________________________________________

John Reid

1. The introduction does not mention failed images.  

2. There is no explanation of how codes might be written to continue execution
in the presence of failed images.

3. I do not understand why SOURCE is required to be a coarray for CO_BROADCAST
but not for the other collectives.

4. The operation defined by OPERATOR of CO_REDUCE should be required to be
mathematically associative, because the sequence of partial results to which 
it is applied is (purposely) undefined.  

5. There need to be more examples and notes to explain the features. For 
instance, there are no examples of the use of SYNC TEAM. 

6. The complexity of the whole feature is greater than was envisioned in 
Markham. Some reduction would be desirable. 
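
Point 4 can be illustrated with a short Python sketch.  It is not
CO_REDUCE itself: it simply applies the same binary operation to the
same values in two different groupings, which is the freedom the
undefined sequence of partial results gives an implementation.  The
function names left_to_right and pairwise are hypothetical.

```python
from functools import reduce
import operator


def left_to_right(op, values):
    """Combine values strictly left to right: ((a op b) op c) op d."""
    return reduce(op, values)


def pairwise(op, values):
    """Combine values as a tree: (a op b) op (c op d)."""
    values = list(values)
    while len(values) > 1:
        combined = [op(values[i], values[i + 1])
                    for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:               # odd count: carry the last value up
            combined.append(values[-1])
        values = combined
    return values[0]
```

For the values 1, 2, 3, 4, integer addition gives 10 under both
groupings, whereas subtraction, which is not associative, gives -8 left
to right and 0 pairwise.  A reduction with such an operation would have
no well-defined result.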

_______________________________________________________________________


Van Snyder

{The constraint against branching out of a CHANGE TEAM construct (C501
in 13-251/N1967), and parallel constraints on CRITICAL (C811 in 12-007)
and DO CONCURRENT (C824 in 12-007) ought to be gathered into a single
constraint on branching, e.g.,

C845a A branch within a CHANGE TEAM, CRITICAL, or DO CONCURRENT
      construct shall not have a branch target that is outside the
      construct.

but that can be done during integration.}

[13-251/N1967:9:32+ C501+] Insert a constraint

"C501a A RETURN statement shall not appear within a CHANGE TEAM
       construct."

{This, and similar constraints on CRITICAL (C810 in 12-007) and DO
CONCURRENT (C822 in 12-007), ought to be gathered into a single
constraint on the RETURN statement, e.g.

C1269a A <return-stmt> shall not appear within a CHANGE TEAM, CRITICAL,
       or DO CONCURRENT construct.

but that can be done during integration.}

[13-251/N1967:10:21 5.4p1] Replace "greater than zero" by "not be
negative".  But why even that?  Better yet, delete "greater ... is"

[13-251/N1967:10 Note 5.2] After the previous change, replace
"2-MOD(ME,2)" by "MOD(ME,2)".

[13-251/N1967:10 5.4]

{The description of FORM SUBTEAM, and its relationship to CHANGE TEAM,
are inadequate.  Is it necessary for every image of the current team to
execute a FORM SUBTEAM statement, even though it's not an image control
statement?  What happens if only, say, odd-numbered images do it?  And
if so, then what happens when a CHANGE TEAM statement is executed?  The
CHANGE TEAM statement is an image control statement, so one could not
have, e.g.

  IF ( MOD(ME,2) /= 0 ) THEN
    FORM SUBTEAM ( 2-MOD(ME,2), ODD_EVEN )
    CHANGE TEAM ( ODD_EVEN )
    etc.
  END IF

Is the prohibition against, e.g.,

  IF ( MOD(ME,2) /= 0 ) FORM SUBTEAM ( 2-MOD(ME,2), ODD_EVEN )
  CHANGE TEAM ( ODD_EVEN )
  etc.

implied by the prohibition against referencing an undefined variable?  What if
ODD_EVEN isn't undefined, but inconsistent on even-numbered images with
the result of executing FORM SUBTEAM on odd-numbered images?  E.g.,

  FORM SUBTEAM ( 1+MOD(ME,4), ODD_EVEN )
  IF ( MOD(ME,2) /= 0 ) FORM SUBTEAM ( 2-MOD(ME,2), ODD_EVEN )
  CHANGE TEAM ( ODD_EVEN )
  etc.

I have no suggestions how to repair this problem, because I don't know
the answers to the questions.
}
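
To make the last of these cases concrete, the following Python sketch
enumerates what each image would hold in ODD_EVEN after the two
FORM SUBTEAM statements.  It only evaluates the expressions from the
example; it does not answer the question of what the draft intends.
Recording which execution produced the value (the labels "first" and
"second") is an assumption made for illustration, and all names are
hypothetical.

```python
from collections import defaultdict


def handles_after_both_statements(num_images):
    """For each image ME, the (origin, id) pair left in ODD_EVEN."""
    handles = {}
    for me in range(1, num_images + 1):
        # FORM SUBTEAM ( 1+MOD(ME,4), ODD_EVEN ), executed by every image
        handle = ("first", 1 + me % 4)
        # IF ( MOD(ME,2) /= 0 ) FORM SUBTEAM ( 2-MOD(ME,2), ODD_EVEN )
        if me % 2 != 0:
            handle = ("second", 2 - me % 2)
        handles[me] = handle
    return handles


def group(handles, key):
    """Group image indices by key(handle)."""
    groups = defaultdict(list)
    for me, handle in sorted(handles.items()):
        groups[key(handle)].append(me)
    return dict(groups)
```

With 8 images, grouping by the id alone puts images 1, 3, 4, 5, 7 and 8
together (id 1) and images 2 and 6 together (id 3), because the even
images 4 and 8 still hold id 1 from the first statement.  Grouping by
origin and id gives three groups instead: {1, 3, 5, 7}, {2, 6} and
{4, 8}.  Which of these, if either, CHANGE TEAM should see is exactly
what the description leaves open.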

[13-251/N1967:10 5.5]

{Description of SYNC TEAM is confusing.  What does "images of the team
specified by <team-variable>" mean?  What does "each other image of the
specified team" mean?  Does it mean something like "If image M executes
a SYNC TEAM statement, and the value of SUBTEAM_ID() for image M is S,
then execution on image M of the segment following the
SYNC TEAM statement is delayed until every image of the current team for
which the value of SUBTEAM_ID() is S executes a SYNC TEAM statement
specifying the same team...."?  Since a team variable cannot be a
coindexed object, is that what "specifying the same team" means?

Suppose one has:

  FORM SUBTEAM ( 1+MOD(ME,4), ODD_EVEN )
  IF ( MOD(ME,2) /= 0 ) SYNC TEAM ( ODD_EVEN )

What happens?  Who waits?

I have no suggestions how to repair this problem, because I don't know
the answers to the questions.
}

[13-251/N1967:15:14 7.2p1] Replace "calls to" by "invocations of".

[13-251/N1967:15:15-16 7.2p1] Delete "A call ... image control
statement."  The effect will reappear in an edit for page 27 below.

[13-251/N1967:27:3+] Insert edit to 12-007:C821 in 8.1.6.6.3 CYCLE
statement

"{Replace C821 in 8.1.6.6.3 CYCLE statement}

"C821 (R831) A <cycle-stmt> within a CHANGE TEAM, CRITICAL, or DO
      CONCURRENT construct shall not belong to an outer construct."

[13-251/N1967:27:3+] Insert edit to replace 12-007:C845 in 8.1.10 EXIT
statement (see also 201-wvs-002).

"{Replace C845 in 8.1.10 EXIT statement}

"C845 An <exit-stmt> within a DO CONCURRENT construct shall not belong
      to that construct or an outer construct; an <exit-stmt> within a
      CHANGE TEAM or CRITICAL construct shall not belong to an outer
      construct."

{This is a slight extension of 12-007, in that it allows an EXIT
statement to belong to a CRITICAL construct but not to an outer
construct.  12-007 prohibits an EXIT statement from belonging to a
CRITICAL construct or an outer one.  In e-mail discussion of CHANGE
TEAM, it was
argued that it is better to allow an EXIT statement to belong to a
CHANGE TEAM construct than to expect one to put a label on the END TEAM
statement and branch to it, or to enclose the <block> of the CHANGE TEAM
construct in a BLOCK construct and exit the BLOCK construct.  The same
analysis applies to the CRITICAL construct, and for consistency the same
should be allowed for it.
}

[13-251/N1967:27:9+ or thereabouts] Insert a bullet (compare to edits
for F08/0040 concerning MOVE_ALLOC)

"o  a CALL statement that invokes a collective intrinsic subroutine;"

This could be combined with the edit from F08/0040

"o  a CALL statement that invokes the intrinsic subroutine MOVE_ALLOC
    with coarray arguments, or a collective intrinsic subroutine;"

--------------050203040506080908050508--

