From: John Reid
Date: Tue, 09 Apr 2013 10:30:31 +0100
To: sc22wg5@open-std.org
Subject: Result of WG5 ballot on first draft TS 18508, Additional Parallel Features in Fortran

WG5

Here is a draft of the result of our ballot. Please let me know by Friday if I have omitted your ballot or made any mistake in transcribing it.

I don't think anyone expected that the first draft would be acceptable for submission to SC22 and indeed it clearly is not. We need now to think about all the comments and hopefully produce a better version during the Delft meeting.

By the way, if you post a message to WG5, it is automatically copied to J3. Please don't explicitly copy to J3, because that means that those of us on J3 get two copies.

Best wishes,

John.

Attachment: N1971-1.txt

                                        ISO/IEC JTC1/SC22/WG5 N1971-1

          Result of the WG5 letter ballot on N1967

                         John Reid

N1968 asked this question

Please answer the following question "Is N1967 ready for forwarding to SC22 as the DTS?" in one of these ways.

1) Yes.
2) Yes, but I recommend the following changes.
3) No, for the following reasons.
4) Abstain.

The numbers of answers in each category were:

1 for 1) Yes (Whitlock).
0 for 2) Yes, but I recommend the following changes
8 for 3) No, for the following reasons (Bader, Chen, Cohen, Long, Maclaren, Moene, Muxworthy, Reid, Snyder)
1 for 4) Abstain (Corbett)

The ballot has failed. J3 is requested to prepare a revised version that takes the comments into account.

Here are the responses in detail

Reinhold Bader

There are a number of design problems that must be fixed. These impact internal consistency, usability, as well as performance of programs that use the new features described in the TS.

(A) Comments on N1967:
~~~~~~~~~~~~~~~~~~~~~~

(A.1) SYNC TEAM:

Instead of adding this feature, I suggest deleting the words ", involving the synchronization of all members of the teams at the beginning and end of the construct." from T4 of N1930. This would not only make addition of SYNC TEAM superfluous, but also allow writing much more scalable programs, at the cost of some redesign work on the teams feature. More detail is provided in section (B.1), below.

(A.2) LOCAL_EVENT_TYPE:

I consider this idea to be good; in fact I think that this provides the foundation for adding certain asynchronous features without much additional effort. Some (well, tongue-in-cheek, because likely exceeding WG5 mandate) suggestions along these lines are made further down.

(A.3) Resiliency:

This has lots of ramifications. Here are two gut reactions that pull in opposite directions:

* It is to be expected that on large-scale systems or cloud-like infrastructures failures will happen, so some facility to deal with this without terminating the program as a whole would be nice.

* After forcing parallelism down all programmers' throats, the HPC industry now follows on with an even bigger toad: Using this feature to build resilient programs will in general make them slower and more resource-hungry; and designing resilient algorithms in many cases is as much work as restructuring for parallelization.

Quite apart from this there exists the issue of how to make failure a well-defined concept.

On balance I believe this feature should be deferred to post-TS consideration. Some experience with actual FT-MPI implementations (still quite a bit in the future) is needed, and HPC programmer feedback should be collected and evaluated before this stuff is developed, let alone shipped. I also fully admit to a preference for fixing the scalability (and other) problems before proceeding to this non-trivial task. A fast program is less likely to run into a hardware failure!

(B) N1967 / Teams:
~~~~~~~~~~~~~~~~~~

(B.1) Performance impact of CHANGE TEAM

The biggest problem I see in the TS is the performance impact resulting from the synchronization properties of the CHANGE TEAM construct. Since synchronization is enforced across all images of the (ancestor) team that invoke the construct, the following program - a simple representative for a large number of possible interesting scenarios - does not allow for overlapping of communication and computation:

  program data_feeder
    use, intrinsic :: iso_fortran_env
    implicit none
    type(team_type) :: role
    integer :: i, iter, m, id_role
    ! declarations for array b(:) and coarray a(:)[*]

    ! create three teams
    if (this_image() == 1) then
      id_role = 3                          ! master
    else
      id_role = 2 - mod(this_image(),2)    ! two slave teams
    end if
    form subteam ( id_role, role )

    ! iterate using the same team decomposition
    do i=1, iter
      ! calculation phase on team IDs 1 and 2:
      change team ( role )
        select case (id_role)
        case (1)
          :      ! do work on b(:)
        case (2)
          :      ! do different work on b(:)
          :      ! (Statement X) - see discussion below
        end select
      end team
      if (this_image() == 1) then
        ! the statements inside this block with present semantics CANNOT
        ! be done concurrently with the execution of above CHANGE TEAM,
        ! potentially destroying the scalability of the program.
        :        ! prepare local data (could be done as case (3) above,
        :        ! but in the general case one typically can't move this)
        do m=2, num_images()
          a(:)[m] = ...   ! push local data
        end do
        sync images (*)
      else
        sync images (1)
        b(:) = a(:)
      end if
    end do   ! next iteration uses updated b(:)
    :
  end program

Such overlapping is considered essential for scalable implementation of a very large class of parallel algorithms. If the CHANGE TEAM only had the effect of SYNC MEMORY, image 1 would fall through the block, and would hence be able to perform data transfers to other images concurrently with the calculation phase.
Such data transfers would be fine as long as the accesses to the coarray "a" obey the usual rules (some words may be needed to indicate that the rules apply for all images of the initial team even across team execution context changes). Of course, the SYNC IMAGES statements near the end of the code are necessary whatever the synchronization semantics of CHANGE TEAM are.

For the above scenario, it would also be sufficient to require synchronization only within each of the three teams defined by the decomposition stored in "role". However, for algorithms that are strongly load imbalanced within each subteam, while not requiring allocation or deallocation of coarrays within the subteam, an only slightly reduced synchronization requirement may still lead to significant performance degradation that cannot be worked around by the programmer (e.g., via the use of events). The performance impact will typically correlate with team size.

Therefore, my strong preference is to retain consistency with the loosely asynchronous image execution model on the level of teams by only imposing the effect of SYNC MEMORY at entry and exit of a CHANGE TEAM block, while still requiring that all parent team images must execute the CHANGE TEAM.

The consequences of this change must of course be considered and taken care of. As far as I can see, the following issues arise:

(1) A subteam-allocated coarray must be implicitly deallocated

In this case, synchronisation must occur upon encountering the END TEAM statement, but only on any subteam that does a deallocation. This is analogous to the established behaviour for unsaved local allocatable coarrays in block constructs or subprograms. It may be useful to add a diagnostic for this via an optional argument SYNCED_DEALLOCATION of END TEAM (and perhaps other END <...> statements) that returns the number of implicit coarray deallocations that have occurred on the executing image.

(2) Definition status of team argument

This issue is dealt with via changes to the semantics of FORM TEAM described in (B.3.2) below.

(3) Consistency issues with memory model?

Assume that, in the above code, the unspecified statement in the line commented with (statement X) reads

  ... = a(:)

With the presently defined synchronization semantics, this would be (formally) fine. With the loosened semantics suggested by me, a race condition could manifest. The debate here is whether this must be considered a consistency issue with the memory model because

* within the construct, the coindices of a are different than those outside, and
* it should be enforced that no modifications from outside the current team should be possible on any data object defined inside the current team

My opinion is that the image index remapping is a purely virtual process that does not impact object identity, and that the difference with respect to safety against modification from "hidden" images between subteam-local and parent-team-inherited coarrays should be tolerated for the sake of being able to write efficiently executing code.

Even in the present coarray model, there are many ways to write code with race conditions, and the problem described above is not in any fundamental way different from the usual ones arising from incorrect or insufficient use of synchronization statements. It must be solved by a combination of "established best practices" and tools that help to identify and isolate race conditions. "Best practices" will for CHANGE TEAM read: "If you access coarrays defined in a parent team (or pointers associated with such coarrays or subobjects of them), sandwich your CHANGE TEAM construct between two SYNC ALL statements. Otherwise - hands off them."

Presently, I do not believe that there are any fundamental problems with coarray allocations across CHANGE TEAM boundaries (the problem discussed in (B.6) below is of a different nature).
The same applies for collective synchronization statements. However, ALL partial-synchronization constructs should be checked; possibly some additional restrictions need to be added in order to avoid semantic inconsistencies that might arise if such constructs cross CHANGE TEAM boundaries. A discussion of this is in section (F) below.

Note that if the synchronisation requirements on CHANGE TEAM are loosened as indicated above, the statement

  SYNC TEAM (M)

would be equivalent to

  CHANGE TEAM (M) ; SYNC ALL ; END TEAM

(B.2) Missing support for computational domains

The FORM SUBTEAM statement presently allows subteams to be established via purely algorithm-driven methods. However, for performance optimization purposes it would be very useful to be able to generate subteams optimized for specific machine architectures. For example, on a cluster of SMPs it will often be more efficient to use teams whose member images exactly match the cores in an SMP, most especially so if the Fortran run time is aware of the difference between communication (as well as synchronization) across and within SMPs.

The simplest possible abstraction for this might be the use of an optional argument DOMAIN to the FORM SUBTEAM statement. The values allowed for the argument would be default integers between 1 and an implementation- and environment-dependent maximum DOMAIN_LEVELS (a protected integer accessible via ISO_FORTRAN_ENV). If DOMAIN is specified, the would need to take a definable entity as argument, which is provided a return value. Increasing values of DOMAIN should correspond to decreasing performance efficiency of data transfers as well as synchronization statements (corresponding to decreasing bandwidths and increasing latencies). The teams are given the IDs 1,..., NUM_TEAMS(), where NUM_TEAMS() is a new intrinsic.

The reason that an environment dependency must be tolerated is that it is expected that coarray programs should in practice be able to interoperate with other parallel paradigms, possibly in various manners. Furthermore, additional hardware aspects (like use of hyperthreading cores) or the used batch queueing system may have an impact. None of these details should of course be referred to in normative text.

(B.3) Sharpening of TEAM_TYPE object semantics is needed

The definitions of TEAM_TYPE, FORM SUBTEAM and CHANGE TEAM appear to imply that an object of type TEAM_TYPE is, in a sense, an object distributed among all images of the ancestor team that describes a team decomposition. However, this is not explicitly spelled out, and I suspect that the semantics are at this point too loosely specified, inviting a number of misuses. For example, it appears to be permitted to write the following:

  type(team_type) :: t(2)
  integer :: id

  id = 2 - mod(this_image(),2)
  form subteam ( id, t(id) )

which just about may make sense (because the team variable used is consistent with the identifier), but it is easy to generate setups where different team variables are associated with different images of the same team. Furthermore, a statement

  change team (t(1))

following the above team formation is not conforming since t(1) is undefined on every second image, and for more colorful setups (using many team objects) it will be even easier to produce incorrect CHANGE TEAM statements.

Also, from the implementation point of view, I would expect that (for scalability reasons) not all information about a team should be required to be stored on each image, so the implementation should have the freedom to use something similar to a coarray type component for TEAM_TYPE (maybe resulting in a requirement that teams must be scalars, or at most arrays with a statically defined size).

Therefore, I suggest adding some additional properties and restrictions to TEAM_TYPE and its usage:

(B.3.1) Each FORM SUBTEAM must reference the same object of type TEAM_TYPE on every image. The same applies for each CHANGE TEAM statement. FORM SUBTEAM defines a decomposition, and CHANGE TEAM activates the executing image's team (locally) as soon as it enters the construct. (This feature is also needed to make NUM_TEAMS() - see (B.2) above - well-defined).

(B.3.2) If the synchronization requirements of CHANGE TEAM are loosened as described in (B.1) above, FORM SUBTEAM must perform synchronization of all executing images at the end of its invocation in order to assure the decomposition is fully defined when the CHANGE TEAM construct is first encountered by an image; it may be more appropriate to convert the statement into an impure elemental collective subroutine, because synchronization is then only incurred once even if a larger number of subteam decompositions is needed. (It may be sufficient to synchronize on a per-subteam basis, but this probably complicates the specification).

For an analogous reason, synchronization must occur before a team object is finalized by going out of scope. There is now no explicit facility in place to do this, but I consider it useful to define a DELETE SUBTEAM in order to, say, be able to recycle a single team object if, e.g., team sizes are supposed to vary an indefinite number of times throughout iterated execution of part of the program. Also, the necessary synchronization is then explicitly visible.

Since a team decomposition usually creates more than one subteam, I also suggest changing the FORM SUBTEAM nomenclature to FORM SUBTEAMS.

(B.4) Addressing coarrays defined in ancestor team

Assume that the initial team executing the following code contains four images.

  integer :: a[*]
  type(team_type) :: t

  a = this_image()
  sync all
  id = 1
  if (this_image() == 3) id = 2
  form subteam(id, t)
  change team(t)
    select case(id)
    case (1)
      if (this_image() == 2) write(*,*) a, a[3]
      if (this_image() == 3) write(*,*) a, a[2]
    end select
  end team

Questions:

(B.4.1) Is the coindexed reference to the coarray "a" inherited from the ancestor team intended to be conforming? If yes,

(B.4.2) What exactly will the write statements print?

(B.4.3) Assuming that from any image executing the case(1) block, one wishes to access that object corresponding to a[2] in the ancestor team. How can this be done? Furthermore,

(B.4.4) Is it allowed to use the DISTANCE argument in THIS_IMAGE() if a coarray argument is also specified?

Then, consider the following code (4 images):

  integer, allocatable :: b[:,:]
  type(team_type) :: u

  allocate(b[2,*], source=this_image())
  id = 1
  if (this_image() == 3) id = 2
  form subteam(id, u)
  change team(u)
    select case(id)
    case (1)
      if (this_image() == 3) write(*,*) b[1,2]
    end select
  end team

Question:

(B.4.5) Assuming (as part of the answer to question B.4.2 above) that the image indices are mapped into the team as 1 -> 1, 2 -> 2, 4 -> 3, how are the coindices of the corank 2 coarray "b" mapped?

One could simply compress the coindices into a flattened sequence:

  [1,1] -> [1,1], [2,1] -> [2,1], [2,2] -> [1,2].

However, this does not preserve the cartesian communication structure that was intended by this feature, and is therefore bound to become pretty confusing with growing corank. An alternative would be to retain coindexing of a coarray as if accesses happen in the team it was created in:

  [1,1] -> [1,1], [2,1] -> [2,1], [2,2] -> [2,2]

Would an access to [1,2] (updating an object outside the current team) then be valid or invalid (T2 would indicate the latter)? In any case, the additional bookkeeping needed to keep track of coarray distance and image indices would appear to make coding of communication rather complicated.
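
The two candidate coindex mappings in (B.4.5) can be worked through mechanically. The following Python sketch is only an illustrative model, not TS semantics: it assumes column-major cosubscripts for cobounds [2,*] and that team image indices are assigned in increasing order of parent image index, as in the 1 -> 1, 2 -> 2, 4 -> 3 assumption above. All function names are hypothetical.

```python
def coindex_of(image, cosize1):
    """Cosubscripts (i, j) of a 1-based image index for cobounds [cosize1, *]."""
    return ((image - 1) % cosize1 + 1, (image - 1) // cosize1 + 1)


def image_of(coindex, cosize1):
    """Inverse of coindex_of: 1-based image index for cosubscripts (i, j)."""
    i, j = coindex
    return (j - 1) * cosize1 + i


def team_members(ids, team_id):
    """Parent image indices that belong to team_id, in increasing order."""
    return [img for img in sorted(ids) if ids[img] == team_id]


def flattened_mapping(members, cosize1):
    """Option 1: recompute cosubscripts from the team-relative image index."""
    return {coindex_of(parent, cosize1): coindex_of(rank, cosize1)
            for rank, parent in enumerate(members, start=1)}


def retained_mapping(members, cosize1):
    """Option 2: keep the cosubscripts used in the team that created the coarray."""
    return {coindex_of(parent, cosize1): coindex_of(parent, cosize1)
            for parent in members}


if __name__ == "__main__":
    ids = {1: 1, 2: 1, 3: 2, 4: 1}      # image 3 forms its own subteam
    members = team_members(ids, 1)      # parent images 1, 2 and 4
    print(flattened_mapping(members, 2))
    print(retained_mapping(members, 2))
    # Under option 2, [1,2] still names parent image 3, which is not a member.
    print(image_of((1, 2), 2) in members)
```

Under the retained mapping, [1,2] resolves to parent image 3, which is not a member of the current team; this is exactly the access whose validity is questioned above.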

It may save J3 as well as programmers a lot of grief to simply disallow coindexing on coarrays inherited from an ancestor team; due to the existence of global variables this would need to be a restriction that in general requires a run-time check. Therefore, it would be useful to also allow a coarray argument for the new TEAM_DEPTH intrinsic, in order to guard code such as

  if (team_depth() == team_depth(a)) then
    a[i] = ...     ! coarray a has the SAVE attribute.
  else
    error stop 'Executing team is not the one that created a.'
  end if

If the above restriction is introduced, also dummy coarray arguments should not be allowed to be associated with a coarray that is inherited from an ancestor team. Reason: This avoids the need for the above guard code for such dummy arguments.

The highest possible price application programmers need to pay for this restriction is allocation (and deallocation) of a team-local coarray and a memory copy (or two). The significant benefit is that clearer coding is enforced.

(B.5) Propagation of normal termination

If WG5 decides to keep the synchronization requirements for the CHANGE TEAM construct, END TEAM must also be able to specify a . Conversely, if the synchronization requirement is removed, the should be deleted from the CHANGE TEAM statement. (Looking at SYNC MEMORY, I see that this may need to be retained after all, but wonder what it does ... checking itself for having been stopped?)

(B.6) Potential problems due to fanciful block structure nesting

Consider the following (I hope, conforming according to TS draft) example:

  change team (m)
    block
      real, allocatable :: a(:)[:], b(:)[:]
      allocate(a(5)[*])
      select case (subteam_id())
      case (1)
        allocate(b(3)[*])
        :      ! calculate
        deallocate(b)
      end select
      deallocate(a)
    end block
  end team

The (de)allocation statements, while syntactically identical, are doing semantically different things here.
Namely, coarray "a" is being allocated on each subteam stored in the team decomposition "m", while coarray "b" is only being allocated on subteam with id 1. Apart from the slight tummy-ache that the context-dependent meaning of this induces, there is potential for easily introducing bugs in more complex code that would cause the application to hang or crash. For example, "deallocate(b)" might be placed outside the select case block by mistake.

Furthermore, the synchronization semantics appear to be unclear: does the allocation of "a" synchronize across the union of all subteam images, or only over each subteam individually?

Perhaps it is necessary after all to enforce selection semantics on "change team" itself, thereby disallowing the possibility to interleave a block construct (or, worse, a library call containing explicit or implicit synchronizations) in the manner illustrated above:

  change team (m)    ! must have a subteam or default statement following
  subteam (1)
    block
      real, allocatable :: a(:)[:], b(:)[:]
      allocate(a(5)[*],b(3)[*])
      :      ! do stuff
      deallocate(a, b)   ! or rely on automatic deallocation
    end block
  default            ! guarantee separate context for each id
    block
      real, allocatable :: a(:)[:]
      allocate(a(5)[*])
      :      ! do stuff
      deallocate(a)      ! or rely on automatic deallocation
    end block
  end team

(C) N1967 / Events:
~~~~~~~~~~~~~~~~~~~

(C.1) Atomicity of event count

The description of events appears to imply that it is allowed to do multiple posts on a given event. However, given the synchronization rules, the following seems to be disallowed if executed with more than one image:

  type(event_type) :: p[*]

  event post (p[1])   ! Updating p[1] in unordered segments

But since a statement is used for event updates anyway, why not let them act atomically (i.e., effectively use ATOMIC_ADD for the updates)? In particular, this would imply that

  type(event_type) :: q[*]

  select case(this_image())
  case(1)
    event post (q)    ! (1)
    :                 ! do something that executes for a considerable length of time
    event post (q)    ! (2)
  case(2:3)
    event wait (q[1])
  end select

is conforming, that exactly one of the images 2 or 3 (which one is undetermined) will continue executing immediately after image 1 executes statement (1), and the other one will continue executing after image 1 executes statement (2). See also the split phase barrier below for an application of this that does not need an inflation of event variables.

(C.2) Split phase barrier

Using LOCAL_EVENT_TYPE objects and assuming that posts and waits act atomically, it is possible to write a split-phase barrier as follows

  type(local_event_type) :: barrier[*]

  do i=1, num_images()
    event post( barrier[i] )
  end do
  :      ! do work that does not violate rules
  do i=1, num_images()
    event wait( barrier )
  end do

An implementation would be capable of doing the above much more efficiently if a collective facility like

  event postall (barrier)
  :      ! do work that does not violate rules
  event waitall (barrier)

were available.

(C.3) Events as type components

Constraint C603 appears to have been mangled, transforming its meaning to the opposite of what it presumably should be.

Edit [13.22]: In "variable definition context," replace the comma by " except".

(D) N1967 / Collectives:
~~~~~~~~~~~~~~~~~~~~~~~~

(D.1) Add CO_MULT for efficiency

Nowadays interconnects have support for offloading certain operations to the infrastructure (e.g., FCA aka "fabric collective acceleration"), thereby considerably improving performance. However, it appears unlikely that the relatively general CO_REDUCE facility would be able to support this facility. Therefore, it may be desirable to also provide a CO_MULT collective for arguments of numeric type that supports multiplicative reductions, in order to obtain the same level of performance for all basic numeric operations.
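
Returning to the event semantics of (C.1) and (C.2): the split-phase barrier relies only on the event count being updated atomically. The following Python sketch models this with threads standing in for images; it is an assumption-laden illustration, not an implementation of the TS. The class `CountingEvent` and the function `run_split_phase_barrier` are hypothetical names, and a condition variable stands in for whatever atomic mechanism a processor would use.

```python
import threading


class CountingEvent:
    """Hypothetical model of an event variable with an atomically updated count."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def post(self):
        """EVENT POST: atomically increment the count and wake any waiter."""
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def wait(self):
        """EVENT WAIT: block until the count is positive, then decrement it."""
        with self._cond:
            while self._count == 0:
                self._cond.wait()
            self._count -= 1


def run_split_phase_barrier(num_images):
    """Run one split-phase barrier; report per image whether all had arrived."""
    events = [CountingEvent() for _ in range(num_images)]
    arrived = [False] * num_images
    seen_all = [False] * num_images

    def image(me):
        arrived[me] = True                 # work done before the barrier
        for ev in events:                  # phase 1: post to every image
            ev.post()
        # ... independent local work could overlap with the barrier here ...
        for _ in range(num_images):        # phase 2: consume one post per image
            events[me].wait()
        seen_all[me] = all(arrived)        # all images must have posted by now

    threads = [threading.Thread(target=image, args=(i,)) for i in range(num_images)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_all


if __name__ == "__main__":
    print(run_split_phase_barrier(4))
```

Each event receives exactly one post from every image and is waited on the same number of times by its owner, so in this model no image can leave phase 2 before all images have entered phase 1.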

(D.2) Asynchronous execution

By using local_event_type and possibly the ASYNCHRONOUS attribute, the collective functions could be made to support asynchronous execution. This would allow overlap of communication and computation also for the collective functions. For example,

  subroutine foo(ev, redu, ...)
    type(local_event_type), intent(inout) :: ev[*]
    real, intent(out) :: redu
    :
    call co_sum(source=x, result=redu, posted_event=ev)
    ! redu and x implicitly have the ASYNCHRONOUS attribute
    ! because co_sum takes a POSTED_EVENT argument
    :
  end subroutine foo

  subroutine bar(ev, redu, ...)
    type(local_event_type), intent(inout) :: ev[*]
    real, asynchronous :: redu
    :
    event waitall (ev)    ! Cf (C.2) above
    ... = redu ...        ! may now be able to safely reference redu
  end subroutine

The program invoking the two above would need to look like this:

  type(local_event_type) :: myev
  real, asynchronous :: x

  call foo(myev, x, ...)
  :      ! do other computations
  call bar(myev, x, ...)

(E) N1967 / Atomic functions:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(E.1) OLD argument in atomic functions

It should be clarified that this argument of an atomic function is not updated atomically. Perhaps using coindexed entities should be prohibited here?

(F) Partial synchronization and teams
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section attempts a discussion of the consequences of relaxing the synchronization properties of CHANGE TEAM with respect to partial-synchronization image control statements. The following text, unless explicitly stated otherwise, assumes that the global barrier at the beginning and end of a CHANGE TEAM block construct is removed, and only SYNC MEMORY is performed on all images.

In order to achieve the aim stated in N1930/T1, "When a block of code is executed on images executing as a team, it should execute on those images as if the program contained no other images", the following third sub-item needs to be added to T1:

- Activities that involve partial synchronization of images inside a team, such as SYNC IMAGES, events and locks, need to be clearly separated from any such activities that are invoked in a parent team.

The following subsections suggest usage restrictions on the partial synchronization constructs that are necessary to guarantee this.

(F.1) Events

For a first example, assume that the following program is executed by 3 images:

  integer :: data[*], id
  type(local_event_type) :: ev[*]
  type(team_type) :: tm

  data = this_image()
  id = 1 ; if (data == 3) id = 2
  form team ( id, tm )
  ! one subteam, assume image numbering 1 --> 1, 2 --> 2 for first team
  if (data == 1) then
    data[3] = 4
    event post (ev[2])                 ! (1)
  end if
  change team (tm)
    if (data == 2) event wait(ev)      ! (2)
  end team
  if (data == 2) write(*,*) data[3]    ! (3)

Two questions arise here:

* Is the event post / wait sequence (1), (2) that crosses team execution contexts valid?
* If yes, is the coindexed access in (3) valid?

I think that the answer to the first question should be "no", because otherwise either the answer to the second question would be "no" (somewhat counterintuitive for the programmer), or statement (2) would effectively need to perform synchronization that extends outside the currently executing team. Therefore, the following restriction should be added:

* The event variable used in EVENT POST or EVENT WAIT statements must be associated with the team that executes these statements.

As a consequence, an EVENT WAIT statement that can match an EVENT POST statement must be executed in the same current team. Note that "associated with" as used above will need a proper definition.
Also, the restriction would essentially be implied if the coindexing suppression suggested in (B.4) is accepted.

Note that the following two usage patterns would be permitted

  if () event post            if () event wait
  change team                 change team
    :                           :
  end team                    end team
  if () event wait            if () event post

(The first would also be permitted with the big barrier in place, but would have no effect. The second one would always deadlock with the present semantics; it may or may not deadlock under relaxed synchronization, depending on what syncs are applied inside the CHANGE TEAM block. A deadlock detection tool will be helpful in identifying such issues.)

(F.2) Locks and CRITICAL blocks

For locks, the same restriction will be needed as for events:

* A lock variable used in LOCK or UNLOCK statements must be associated with the team that executes these statements.

Since CHANGE TEAM is a (collectively executed) image control statement, its appearance inside a CRITICAL block is already prohibited via the rules in section 8.1.5 of N1830. The general rules on block structure nesting prevent undesirable interleaving of CHANGE TEAM and CRITICAL.

(F.3) SYNC IMAGES

This statement can be understood in terms of pairwise notifying events; in Fortran 2008 a single global event variable would be available. In order to avoid interference of SYNC IMAGES statements that appear outside and inside a CHANGE TEAM block, it must be explicitly spelled out that on each team, SYNC IMAGES gets its own synchronization context (i.e. a team-specific event variable, that might be created when FORM TEAM is executed).

As an example, consider the following code, run with 3 images and (in order to avoid complications due to image reindexing) a subteam decomposition that contains a single team with the same 3 images:

  me = this_image()
  if (me == 1) sync images ([2])
  change team (...)
    sync images ([1,3])    ! (X)
  end team
  if (me == 2) sync images ([1])

This code would execute just fine, while it would deadlock if the barrier on CHANGE TEAM is in place. It would also be considered "bad practice", because it is very likely to produce deadlocks, for example by replacing [1,3] by [1,2] in statement (X). Note that without separated contexts, the latter would not deadlock, but very likely not produce the desired results.

(F.4) Team variables and CHANGE TEAM construct

For proper nesting of CHANGE TEAM constructs the statement sequence

  form subteam (a)
  change team (a)
    form subteam (b)
    change team (b)
    end team
  end team

should be enforced by requiring that a team decomposition is created (and perhaps even declared) in the same context (current team) that uses it. For nested team use this effectively brings back the big barrier, unless FORM SUBTEAMS only synchronizes by subteam (cf (B.3.2)); it may be more appropriate to consider a more powerful form of FORM SUBTEAMS in the future that is capable of generating nested subteam sequences with a single invocation.

Finally, consider the following situation:

  real, allocatable, save :: a(:)[:]
  type(team_type) :: t
  integer :: id

  if (this_image() == 1) then
    id = 1
  else
    id = 2
  end if
  form subteams(id, t)
  change team (t)
    if (id == 2) then
      allocate(a(1000)[*])   ! subteam 2 only
      :                      ! work with a
      deallocate(a)
    end if
  end team
  allocate(a(2000)[*])       ! all images

Here image 1 proceeds to the second ALLOCATE statement, while all other images execute the code inside the CHANGE TEAM block. From the application point of view this is fine: since the second ALLOCATE statement performs synchronization on exit, it will only complete once all images have executed it. However, there may (depending on the implementation) exist a race condition on the descriptor for a, which can only be prevented by also synchronizing upon entry to ALLOCATE.
This, however, is an implementation issue, and prescribing CHANGE TEAM to be executed collectively gives the implementation the opportunity to do the necessary (image-local) bookkeeping that helps to decide whether or not to perform such an extra synchronization.
_______________________________________________________________________

Daniel Chen

There are a few technical issues raised by others that need more discussion and consideration.
_______________________________________________________________________

Malcolm Cohen

It would be slightly nicer if the text describing the features indicated whether certain things were or were not image control statements. I understand that this would complicate the edits, but perhaps there could be an overview at an earlier stage.

It seems to be possible to copy event variables by argument association, e.g.

  CALL sub(event,(event))

This should probably be prevented by requiring INTENT(INOUT) on event dummy arguments.

6.4 says "If the count of a event variable increases through the execution of an EVENT POST statement on image M and later decreases through the execution of an EVENT WAIT statement on image T, the segments preceding the EVENT POST statement on image M precede the segments following the EVENT WAIT statement on image T." which is all very well, but the very definition of "later" can only be interpreted as having segments already ordered, i.e. it is assuming a stronger fact than the result that it requires. Consider

  image 1 segment i does POST EVENT(x)
  image 2 segment j does POST EVENT(x)
  image 3 segment k does WAIT EVENT(x)

for unordered segments i, j, k; then image 3 segment k+1 follows image 1 segment i or image 2 segment j, but which? Both? Neither? One of them but no-one knows which? The obvious semantics would be that it follows both, but that i,i+1,j,j+1,k are all unordered. That is, if the event counter has value N, posted by segments ii(1) to ii(N), then image k+1 follows all of ii(1) to ii(N).
Obviously this needs to be rewritten to avoid assuming linear time ("later" forsooth) and to clarify the ordering that results.

I would slightly prefer C605 to be reworded as "An in an that is". Yes, it has the BNF rule number on it anyway, but we have gotten that wrong so often in the past that it is best to spend the ink to make it more readable.

7.2 STAT para has "argument" twice and "variable" twice. Please be consistent.

An "unsuccessful" collective with no STAT= does not cause error termination. Why? If STAT_STOPPED_IMAGE is good enough to terminate SYNC ALL, it should be good enough to terminate a CO_BROADCAST.

Why permit STAT to be present on some images and not others?

Are there any error conditions for collectives apart from FAILED/STOPPED image? I see nothing about this being processor dependent. If there are possible error conditions, the draft requires the processor to compute the correct result and perform the correct action regardless ... surely some mistake.

The stated design goal for performance is that collectives are not required to "wait" for completion, except on the image receiving the result. However, if there are error conditions, presence of STAT (and maybe ERRMSG) will surely force such a wait to occur. It seems unsatisfactory to have go-faster features that don't work if one uses the reliability features.

CO_BROADCAST does not require the same type parameters for SOURCE on all images. Also, VARIABLE would be a better name than SOURCE since it is INTENT(INOUT) and receives the result.

CO_MAX of a scalar does not require the same type parameters for SOURCE on all images.

It is unsatisfactory for CO_MAX et al to require the SOURCE to be a definable variable, thus preventing collective max/sum/etc. of INTENT(IN) or PROTECTED variables (or indeed, of expressions). If we can't have unambiguous syntax that handles both in-place collectives and result collectives, perhaps we should have two names, e.g.
CO_SUM(SOURCE [ ,RESULT_IMAGE,STAT,ERRMSG ]) CO_SUM_RESULT(SOURCE,RESULT [ ,RESULT_IMAGE,STAT,ERRMSG ]) As specified, EVENT_QUERY seems completely useless in that one would not be permitted to use it from a segment that is unordered with respect to any EVENT POST statement that updates it. Indeed, events seem useless if there are multiple images that might want to post an event, since it would modify the variable from an unordered segment. Presumably events are meant to be excluded from the unordered modification rules, but I see no text that describes such semantics. _______________________________________________________________________ Bill Long No, N1967 is not ready for a DTS ballot because, based on the ballots submitted so far, it would likely fail such a ballot. Several people have raised issues that require more discussion and consensus before the TS is ready for a DTS vote. I'll not repeat all of the other ballot comments here, but would like to point out a few - 1) Should we add a CO_PRODUCT collective subroutine? Editorial disruption for this is minimal, so the question is between need/value and additional clutter in Clause 13. 2) A proposed modification to the TEAM facility needs discussion. If we adopt the idea, there appear to be material side-effects to the base memory model (such as SYNC ALL statements not executing on all images that could affect local variable values). 3) There are general concerns that the memory model aspects of the new features are not adequately specified. 4) Additional examples in the Annex would be helpful. (This was a known deficiency going into the ballot.) _______________________________________________________________________ Nick Maclaren I have not had time to cross-check on all of the details of N1967 against Fortran 2008, so these are not necessarily all of my objections. At the end of my reasons, I append some proposals for improvement, but the largest one is a rough draft. 
REASONS FOR VOTING NO
---------------------
---------------------

Generic
-------

1.1) The wording refers to cases when the execution of a statement is not successful, but Fortran 2008 refers to error conditions. This is confusing, at best, and they should use compatible terminology. It is more serious when one considers node failure.

1.2) That is not the only aspect in which the details differ. The wording and other details need a systematic check and improvement.

1.3) I am distinctly unhappy about the number of places where semantics are defined for error conditions that are caused by infrastructure failure, which is not in accordance with the Fortran standard's previous practice. STAT_FAILED_IMAGE is mentioned later, but this is also done for events.

1.4) The current dominating standard for parallel processing is MPI, and its basic model has proven to be solid over many years. This TS provides many comparable facilities, but does not seem to have included the comparable constraints needed for correctness and implementability. This applies particularly to teams, but also to collectives.

Teams
-----

I have serious difficulty even understanding the basic model, and it appears to make little sense. FORM SUBTEAM is specified to be an ordinary statement creating a variable, and all synchronisation is in CHANGE TEAM, using a variable defined by a previous FORM SUBTEAM statement. All of the descriptions of which team is being referred to are in terms of a variable, and not a value. The following are a few of the issues this causes.

2.1) What happens if only some images in the current team have called FORM SUBTEAM? How does CHANGE TEAM know which other images to wait on?

2.2) In the following, do alf and bert indicate the same subteam? And is it allowed to create two different teams at the same level, as in bert and colin? And how do other images know which of these FORM SUBTEAM statements matches the FORM SUBTEAM statement on their image?
    TEAM_TYPE alf, bert, colin, dave
    FORM SUBTEAM (13, alf)
    FORM SUBTEAM (13, bert)
    FORM SUBTEAM (666, colin)

2.3) Fortran defines intrinsic assignment for derived types; even if that were locked out, several argument passing mechanisms imply implicit copying. The nearest that Fortran has to the concept of two variables being the same is association. It is not within the remit of this TS to add major new, fundamental semantic concepts to Fortran, such as unassociatable objects. For example, in:

    TEAM_TYPE alf, bert
    FORM SUBTEAM (13, alf)
    bert = alf
    FORM SUBTEAM (42, bert)

or:

    TEAM_TYPE alf
    FORM SUBTEAM (13, alf)
    CALL ugh(alf)

    SUBROUTINE ugh (TEAM_TYPE arg)
        FORM SUBTEAM (42, arg)
    END SUBROUTINE ugh

2.4) The following is allowed by the specification, but it makes no sense. Specifying synchronisation by how often CHANGE TEAM is called doesn't work if its argument may be variable and there are no further constraints.

    TEAM_TYPE array(NUM_IMAGES()) // Set up somehow
    REAL :: junk
    CALL RANDOM_NUMBER(junk)
    CHANGE TEAM (array(junk*THIS_IMAGE()+1))
        ...
    END TEAM

2.5) Allowing subteam variables in CHANGE TEAM with no further constraints allows non-hierarchical team usage, which was not the intent of N1930 T3.

    TEAM_TYPE alf, bert, colin
    FORM SUBTEAM (13, alf)
    CHANGE TEAM (alf)
        FORM SUBTEAM (42, bert)
    END TEAM
    FORM SUBTEAM (666, colin)
    CHANGE TEAM (colin)
        CHANGE TEAM (bert)
            ...
        END TEAM
    END TEAM

2.6) The issue described in 2.4 also allows SYNC TEAM to synchronise teams which are not the current team or one of its descendants. This is, at best, a recipe for deadlock. Even allowing it on ancestors introduces a conflict with N1930 T1 and T2. Also, I cannot see that the statement is required by N1930, or actually necessary. It can be done by temporarily changing team and calling SYNC ALL.
There are other problems, too, such as: 2.7) In the specification of CHANGE TEAM, the current team when the CHANGE TEAM was executed is not necessarily the parent of the team that is being changed to, so specifying synchronisation of the parent team is incorrect. 2.8) I have tried to convince myself that correct programs will not deadlock, and I have tried to convince myself that correct programs can deadlock, and have failed with both. The design is just too complicated to be sure it is correct. 2.9) The design very dubiously meets the requirement N1930 T2, because an image belongs to all of the teams that it has formed and can use them, which is the cause of the SYNC TEAM problems. 2.10) There are some very nasty issues to do with resource leakage if these facilities are used in a library. FORM SUBTEAMS creates a handle to something or other, but there is no way to release that handle. This would be easily soluble only if its function were subsumed into CHANGE TEAM. 2.11) It has omitted the qualification in LOCK and UNLOCK that semantics are defined only for successful execution of the statements. This is a variant of reason 1.1. Conclusion: the constraints on team actions and the semantics of teams need a complete rethink. Collectives ----------- 3.1) CO_REDUCE requires commutativity but not associativity of OPERATOR, which makes no sense. MPI requires associativity but not commutativity, which at least makes sense. It should require both. 3.2) Also, it does not specify anything about the consistency of OPERATOR, which is a recipe for problems. I have serious difficulty in understanding the combination of C730, C1218, C1220, C1234 and 12.4.3.6 paragraph 7, but can believe that the requirement for an elemental function means that it must be the same function. However, that is not enough (semantically) because of global or parent scope variables and THIS_IMAGE(). This should be improved. 
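The distinction drawn in 3.1 can be illustrated with a small Python sketch. It is illustrative only: the operator `avg`, the operator `concat` and the two fold helpers are hypothetical names chosen here, not anything defined by N1967 or MPI. Pairwise averaging is commutative but not associative, so the result depends on how the implementation groups the contributions; string concatenation is associative but not commutative, so any grouping gives the same result provided the image order is kept.

```python
from functools import reduce


def avg(a, b):
    """Commutative but not associative: avg(a, b) == avg(b, a)."""
    return (a + b) / 2


def concat(a, b):
    """Associative but not commutative."""
    return a + b


def left_fold(op, values):
    """Combine strictly left to right: ((v0 op v1) op v2) op v3."""
    return reduce(op, values)


def tree_fold(op, values):
    """Combine neighbouring pairs repeatedly, as a tree-shaped reduction might."""
    values = list(values)
    while len(values) > 1:
        paired = [op(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd element is carried to the next round
            paired.append(values[-1])
        values = paired
    return values[0]


# One contribution per (hypothetical) image, in image order.
numbers = [1.0, 2.0, 3.0, 4.0]
avg_left = left_fold(avg, numbers)
avg_tree = tree_fold(avg, numbers)

letters = ["a", "b", "c", "d"]
concat_left = left_fold(concat, letters)
concat_tree = tree_fold(concat, letters)

print(avg_left, avg_tree)        # the two groupings disagree
print(concat_left, concat_tree)  # the two groupings agree
```

Under these assumptions, a commutativity-only requirement leaves the reduction result dependent on the grouping chosen by the processor, whereas an associativity requirement does not.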
3.3) The specification of the ordering of collective subroutines makes sense and is what we agreed, but remains confusing. A NOTE should be added to clarify our intent.

3.4) There is a potentially serious interaction between collectives and atomics, as far as consistency goes, because both can be used to pass information between unordered segments. See Data Consistency below.

Events
------

4.1) I am baffled by the reference to INTENT(INOUT) in C602 and C603. In particular, both EVENT POST and EVENT WAIT necessarily both read and write the variable, so it seems bizarre to lock out the only case that makes semantic sense. Neither of those statements makes any reference to whether their event-variable may be INTENT(IN), INTENT(OUT) or PROTECTED, none of which make semantic sense. The only thing that I can assume is that the sense of the condition has got inverted. This needs fixing.

4.2) Page 14 (6.4 EVENT WAIT) lines 7-11 are surely erroneous in the case where the EVENT POST fails, and probably when the EVENT WAIT does. This matter is not as simple as it appears to be, because it has a significant impact on permitted serial optimisations. See Data Consistency below.

4.3) The word 'later' is thoroughly ill-defined in a parallel context, especially when it is applied to general semaphores. In particular, it begs the question of which one of several possible uses of EVENT POST does the EVENT WAIT synchronise with? As the specification stands, this means that they must NOT be image control statements, because that would introduce a logically recursive definition into the standard. I.e. the sequence of their execution controls the ordering, but the ordering controls the sequence of their execution! This needs specifying properly, and would be vastly simplified if events were changed from being general semaphores to being binary ones, though that would conflict with N1930. See Data Consistency below.
4.4) There is nothing said about global consistency, which is well-known to be a potential problem with such actions (as with atomics). In particular, it might seem obvious to assume sequential consistency, but that does NOT immediately follow. Whatever model is chosen needs specifying. See Data Consistency below. Obviously, that choice has a major impact on the EVENT_QUERY intrinsic, especially as it is defined only when it is ordered with respect to all EVENT POST and EVENT WAIT statements. Atomic Intrinsics ----------------- There are at least two structural problems with these. 5.1) The first is that their ATOM argument is not required to be a coarray, unlike ATOMIC_DEFINE and ATOMIC_REF, which is undefined if an atomic coarray object is an actual argument to a procedure which does not define it as a coarray but then calls these procedures. That needs fixing. 5.2) The second is that these extensions are truly baffling in the context of Fortran 2008 13.1 paragraph 3 and Note 13.1. I am not sure what to do, but supporting the fetch-and-operate paradigm means that the global data consistency problem simply has to be addressed. There are several options, but all have extremely unobvious and serious consequences. See Data Consistency below. Data Consistency ---------------- 6.1) This is not a simple matter, and WG5 will be making a serious mistake if it proceeds to add facilities of the nature proposed in the TS without putting some serious thought into the data consistency model. In Fortran 2008, we evaded this by selecting a simple and extremely proscriptive model for SYNC IMAGES and kicking the consistency of atomics into the long grass. This is no longer viable, for three reasons: 1) Upon doing a bit more research, I realise that I may have been wrong in believing that all DAGs can be serialised, though I have so far been unable to work out or track down an example of a DAG that cannot be. 
It turns out that this is a currently active area in computer science, and is known (at least by some people) as DAG consistency. Fortran's model is what the following papers define as the WW model; Fortran is not fixated about determinism, unlike most modern computer science. See, for example: http://supertech.csail.mit.edu/papers/frigo-ms-thesis.pdf http://www.fftw.org/~athena/papers/ipps96.ps.gz http://www.fftw.org/~athena/papers/spaa98-memory.ps.gz http://www.fftw.org/~athena/papers/spaa96.ps.gz My concern is that events may be the straw that breaks the camel's back, and we may have stepped over the boundary into an inconsistent model. Unfortunately, I am not an expert in this area, though I know enough to know that apparently obvious truths are often false. 2) Fortran events are general semaphores. While these are well-studied, they are nowadays usually avoided in favour of other mechanisms. However, I have so far been unable to find any precise description of the ordering semantics for general semaphores, or convince myself that I understand that aspect. Dijkstra himself pointed out that they are no more powerful than binary semaphores. See 4.2 in: http://www.cs.utexas.edu/users/EWD/transcriptions/EWD01xx/EWD123.html Worse, the current specification defines behaviour if they are unsuccessful, and I have absolutely no idea what implications that might have. In particular, events affect segment ordering and, if we do not specify anything, that is going to break the properties of the data consistency model that we agreed (after much discussion!) in Fortran 2008. Either we need to simplify them very considerably (probably to binary, unconditional semaphores), or we need to call in some specialist expertise, or both. I cannot convince myself that the current specification is consistent. 3) I really can't see any way to make sense of these atomics except by enforcing consistency. 
In particular, it is NOT automatic that operations on even a single atomic variable are sequentially consistent, which was the topic of such debate in Fortran 2008. However, thinking about what various forms of relaxed consistency might mean with the fetch-and-operate intrinsics makes my head hurt. However, recently, I have found some leads. There is also the point that the simple, inconsistent atomics that we defined in Fortran 2008 are extremely useful on systems that have no hardware or operating system support for global consistency, because they can often be implemented efficiently, whereas consistent atomics need to be emulated by using locks or equivalent. There are also a lot of uses for atomics that do not require any particular consistency. 6.2) When it comes to consistency, there is no logical difference between a PGAS model and shared memory, and one of the few good designs I have seen is the C++ memory model. For a clear and fairly simple description of the issues, see sections 1 to 3 of: http://www.hpl.hp.com/techreports/2008/HPL-2008-56.html in: http://www.hpl.hp.com/personal/Hans_Boehm/pubs.html Evidence of the model's consistency is in: http://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf in: http://www.cl.cam.ac.uk/~pes20/ Regrettably, I cannot claim to be enough of an expert to guarantee to validate such a design for Fortran, though I am enough of one to spot obvious flaws. There is the question of whether the atomics should also synchronise other data uses. I believe that our assumption is that they should NOT necessarily do so, which is a significant difference from the C++ model. This is the point at which I get out of my depth. I am almost certain that the Fortran design makes sense, but parallel semantics is so deceptive that I am chary of assuming I am right. 6.3) While I believe that the time has come to do it properly, I accept the argument that this would derail the schedule. 
However, I believe that it is essential that we (a) integrate events with the basic DAG consistent segment model, (b) decide and define something about atomics and (c) try to avoid taking decisions that will prevent a proper solution later. 6.4) Arising from this, I believe that the easiest solution to the event problem is to simplify them to be binary semaphores and explicitly require all image control statements to be DAG consistent. This last is not a change, but merely a more explicit statement of what the standard currently says. I am a bit nervous about even allowing EVENT POST on an already posted event variable to return an indication of the fact, but suspect that it will be so often demanded that it is unavoidable, and it does not seem to add any problems that have not already been introduced by the ACQUIRED_LOCK= specifier in the LOCK statement. 6.5) I previously believed that the simplest solution to the atomics problem is to explicitly define sequential consistency on a single atomic variable for the new atomics, and to explicitly state that the sequence for two variables need not be consistent. However, since then I have thought of realistic examples where that will not work, and Mark Batty has pointed out others to me. He has, however, suggested an alternative model that might work, which he refers to as the release/acquire model. I am continuing my discussions with him. This leaves ATOMIC_DEFINE and ATOMIC_REF as anomalies, and I believe that we should provide alternative store and fetch atomics that are consistent with the new ones, and leave ATOMIC_DEFINE and ATOMIC_REF as processor-dependent. Alternatively, we could use those names for the consistent versions, and add new names for the relaxed ones. Image Failure ------------- 7.1) This is not a minor addition. No language has ever managed to standardise recovery of an application from general system-generated errors or infrastructure failure, and even POSIX does not attempt it. 
There are fundamental reasons why this should not be attempted in a portable language.

Fortran has not so far defined any recovery facilities from even the simplest cases of I/O errors (e.g. erroneous data in formatted input), which have been supported by many languages for the past 40 years. Image failure is significantly harder to recover from than even system-generated I/O errors such as a disk failure. Furthermore, I/O error recovery is required by a vastly larger number of programs than even use coarrays, let alone those that want this facility.

Fortran has not defined any form of the now-conventional exception handling, because of difficulties in integrating it with the existing language. Even Ada has not attempted to define recovery from system-generated errors, let alone infrastructure failure. C's signal facility permits this sort of failure trapping, but is merely a standardised syntax to entirely undefined (sic) semantics, and implementations that simply crash exist and are conforming.

When an image fails, it will usually be while it is active, which at the very least means that any data it might have defined in its active segment (including coarray data on other images) becomes undefined. Any other images that may have been interacting with the image when it failed (whether via coarray data that it owns, collectives, events or other) also reach an undefined state. Because Fortran permits the buffering of file operations over image control statements, the output and error units, and any shared files also become undefined, even if they were not being accessed at the time or even used in the image that failed.

Lastly, there is nothing stopping processors from providing this facility as an extension, and that would be a far better way to do it, at least until such time as this is shown to be feasible in at least most processors.

This feature should be removed.
PROPOSALS
---------
---------

This is mainly in the form of changes to the main body of N1967, excluding the editorial changes, but the teams proposal is much looser.

Generic
-------

A.1) Throughout the document, the wording should be changed to be the same as in Fortran 2008, and any missing constraints, restrictions and similar wording added.

Image Failure
-------------

B.1) This should be removed, as should all references to it. If it is to be preserved, it is critical that it spells out exactly what a processor must do, and exactly what a programmer may assume, when STAT_FAILED_IMAGE is returned. This may well be "nothing" and "nothing", but it must be in normative text.

Collectives
-----------

C.1) On page 19, lines 28 and 28, "mathematically commutative operation. If the operation implemented by OPERATOR is not associative, the computed value of the reduction is processor dependent" should be replaced by "mathematically associative and commutative operation".

C.2) On page 19, line 36, the following should be added:

If the same pair of values is passed as arguments to OPERATOR on two images, the results shall be equivalent. If SOURCE is numeric, equivalent means mathematically equivalent [Fortran 2008 7.1.5.2.4]; otherwise, its meaning is processor-dependent.

C.3) On page 15, after line 16, the following should be added:

NOTE Calls to collective subroutines are not image control statements and do not perform any synchronisation, to allow maximum scope for optimisation. If synchronisation is needed, programs must call SYNC ALL explicitly.

C.4) It is critical that we say something in normative code about the consistency they require, to prevent optimisation creating causal loops.

Events
------

I am proposing a lesser facility than the one agreed in N1930, to avoid the synchronisation definition problems I mention above.

D.1) On page 13, lines 10 to 13 should be replaced by:

A scalar variable of type EVENT_TYPE or LOCAL_EVENT_TYPE is an event variable.
An event variable includes a boolean state of whether an event has been posted. The initial value of the state of an event variable is that no event has been posted.

D.2) On page 13, lines 20 and 23 (C602 and C603), "where" should be replaced by "unless".

D.3) On page 13, line 27 should be replaced by:

    R601 event-post-stmt is EVENT POST( event-variable [POSTED=event-posted-variable, sync-stat-list] )

After line 28, the following should be added:

    R60x event-posted-variable is scalar-logical-variable

Line 31 should be replaced by:

The EVENT POST statement is an image control statement. If the state of the event variable is not posted, successful execution of an EVENT POST statement causes its state to be set to posted. If a POSTED= specifier is present, it also causes the scalar logical variable to become defined with the value true.

If the state of the event variable is posted, successful execution of an EVENT POST statement without a POSTED= specifier causes the executing image to wait until it is not posted, and then causes its state to be set to not posted. [[[ This is not quite the wording used in LOCK, but this situation is a bit more complicated. ]]]

If the state of the event variable is posted, successful execution of an EVENT POST statement with a POSTED= specifier does not change its state and causes the scalar logical variable to become defined with the value false.

D.4) On page 14, lines 5 to 6 should be replaced by:

The EVENT WAIT statement is an image control statement. If the state of the event variable is not posted, successful execution of an EVENT WAIT statement causes the executing image to wait until it is set to posted, and then causes its state to be set to not posted. If the state of the event variable is posted, successful execution of an EVENT WAIT statement causes its state to be set to not posted.
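The binary event proposed in D.3 and D.4 can be modelled with a few lines of Python. This is only a sketch under stated assumptions: `BinaryEvent`, `post` and `wait` are hypothetical names, threads stand in for images, `report=True` stands in for the POSTED= specifier, and sync-stat handling is omitted. It also assumes that a blocking post leaves the state posted once it proceeds, which is the usual behaviour of a binary semaphore and may differ from the literal wording proposed for line 31.

```python
import threading


class BinaryEvent:
    """Toy model of a binary event variable (hypothetical, not from N1967)."""

    def __init__(self):
        self._posted = False                 # initial state: no event posted
        self._cond = threading.Condition()

    def post(self, report=False):
        """Model of EVENT POST; report=True plays the role of POSTED=."""
        with self._cond:
            if report:
                if self._posted:
                    return False             # already posted: state unchanged
                self._posted = True
                self._cond.notify_all()
                return True
            while self._posted:              # no POSTED=: block until consumed
                self._cond.wait()
            self._posted = True              # assumption: ends in posted state
            self._cond.notify_all()
            return None

    def wait(self):
        """Model of EVENT WAIT: block until posted, then reset to not posted."""
        with self._cond:
            while not self._posted:
                self._cond.wait()
            self._posted = False
            self._cond.notify_all()


ev = BinaryEvent()
first = ev.post(report=True)    # state was not posted
second = ev.post(report=True)   # state was already posted
ev.wait()                       # consumes the post without blocking
third = ev.post(report=True)    # state was not posted again
print(first, second, third)
```

In this model the non-blocking form of post never waits, so it behaves much like LOCK with ACQUIRED_LOCK=, which is the comparison drawn in 6.4 above.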
D.5) In Fortran 2008, on page 189, after paragraph 2, the following should be added:

During the execution of a program, a processor shall ensure that all uses of image control statements are consistent with themselves and with the program execution order on each image. [[[ This is not sufficient, unfortunately - see below. ]]]

NOTE Successful execution of a LOCK statement with an ACQUIRED_LOCK= specifier or an EVENT POST statement with a POSTED= specifier that sets the scalar logical variable to false has the semantics of a SYNC MEMORY statement.

[[[ Unfortunately, the explicit requirement for consistency is necessary. At best, parallel time has all of the event ordering problems of special relativity - and when one allows optimisation, as is critical for the larger systems, it's nearly as bad as general relativity, including the Cauchy problem. And, yes, I do mean apparent time warps. We swept the problem under the carpet in Fortran 2008, because the only way that problems could be caused was by direct use of SYNC MEMORY or by arcane uses of locks, but unfortunately we can do so no longer. Also, though it is not spelled out here, we need to specify what we mean by consistent. ]]]

D.6) On page 14, lines 7 to 11 should be replaced by:

During the execution of the program, the state of an event variable is changed by the execution of EVENT POST and EVENT WAIT statements in a sequentially consistent order [Fortran 2008 8.5.2]. If the state of an event variable is set to posted through the execution of an EVENT POST statement on image M and is subsequently in that order set to not posted through the execution of an EVENT WAIT statement on image T, the segments preceding the EVENT POST statement on image M precede the segments following the EVENT WAIT statement on image T.

NOTE When there are more than two images using EVENT POST and EVENT WAIT statements on the same event variable, programs should regard the precise execution order as being unpredictable.
D.7) On page 21, after line 2, the following should be added:

NOTE A use of EVENT_QUERY is defined only if its segment is ordered against all uses of EVENT POST and EVENT QUERY on its event variable argument, and should not be used to track the progress of other images.

Atomics
-------

E.1) The ATOM argument should be required to be a coarray.

E.2) The wording should be adapted from that in Fortran 2008.

E.3) On page 15, lines 304, the procedure names ATOMIC_SET and ATOMIC_VALUE should be added.

E.4) Somewhere on page 15, the following should be added:

A processor shall ensure that all uses of ATOMIC_ADD, ATOMIC_AND, ATOMIC_CAS, ATOMIC_OR, ATOMIC_SET, ATOMIC_VALUE and ATOMIC_XOR on a single atomic variable are consistent, and are consistent with the program execution order on each image and with segment order [8.5.2]. [[[ This is not sufficient, unfortunately - see below. ]]]

NOTE Programmers should not assume that the apparent sequences of actions on two different atomic variables are compatible, nor that any uses of ATOMIC_DEFINE or ATOMIC_REF are consistent with uses of the other atomic procedures, not even on the same variable. Also, using changes in the value of atomic variables together with the SYNC MEMORY statement will not create a well-defined segment ordering, though it may appear to.

[[[ My remarks under D.5 are equally relevant here. We need to specify what we mean by consistent in this context as well. ]]]

E.5) On page 17, after line 13, definitions of ATOMIC_SET and ATOMIC_VALUE should be added. These are essentially replications of ATOMIC_DEFINE and ATOMIC_REF with the semantic difference described above.

TEAMS
-----

This is very much a rough draft. I have followed N1967 as closely as possible, and it is largely in the form of differences from the current specification. Much of the syntax is the same, but I am proposing changing the constraints and semantics quite drastically in order to solve the problems I and some other people raised.
It models itself on the proven design of MPI (and especially communicators and MPI_Split).

Unlike in N1967, the parent team of a subteam is the current team at the time the CHANGE TEAM statement that created it was executed, and not the FORM_SUBTEAMS intrinsic. The FORM_SUBTEAMS intrinsic is a collective that returns a handle that describes a team, and is nothing more.

The only actions that can be performed on any team other than the current one are to return to the parent by executing an END TEAM statement, to change to a descendent by executing a CHANGE TEAM statement, and to call inquiry functions (N1930 T6) on a local object, which can be guaranteed not to cause synchronisation problems.

I have not provided a collective to release a subteam created by FORM_SUBTEAMS, as I can't see how to specify it at all concisely. This is something that needs discussing. There are many possibilities, but all of the ones I can see involve drastic changes to N1967's model, or would be simply documenting that a program can use Fortran's scoping rules and DEALLOCATE to release resources.

The TEAM_NONE constant
----------------------

The value of the TEAM_TYPE constant TEAM_NONE shall indicate a conceptual team containing no images.

The CHANGE TEAM construct
-------------------------

The only syntactic change is:

    R502 change-team-stmt is [team-construct-name:] CHANGE TEAM ( subteam [, sync-stat-list] )

This has extra restrictions:

The subteam shall be of type TEAM_TYPE, shall have been set up by the FORM_SUBTEAMS intrinsic, shall not be TEAM_NONE and shall be a descendent of the current team.

In matching CHANGE TEAM statements, all values of subteam shall describe the same team. [[[ This needs phrasing properly. ]]]

The synchronisation at both the CHANGE TEAM and END TEAM statements is between all images of the team represented by the value of the subteam variable. [[[ This needs phrasing properly. ]]]

Note that the current team or any parent team are not involved.
This provides a considerable performance enhancement in the case where
a subset of the teams needs to call a collective without interacting
with the rest of the current team.


The FORM_SUBTEAMS intrinsic
---------------------------

This is now a collective subroutine and not a statement, with a
specification like:

    FORM_SUBTEAMS ( IMAGE-ID , RESULT )

    Description. Create values for subteams.

    Class. Collective subroutine.

    Arguments.

    IMAGE-ID  shall be scalar and of type integer and shall be
              non-negative. Two images will be in the same subteam if
              and only if IMAGE-ID is the same on those images, except
              that IMAGE-ID may be zero, which indicates that the
              invoking image is not involved in any subteam.

    RESULT    shall be scalar and of type TEAM_TYPE. It is an
              INTENT(OUT) argument. If IMAGE-ID is zero, RESULT
              becomes defined with the value TEAM_NONE. Otherwise,
              RESULT becomes defined with a value that identifies the
              subteam that includes the calling image.

I have omitted STAT and ERRMSG, as I don't see what use they are.
They could be restored, as for ALLOCATE.


The SYNC TEAM statement
-----------------------

This is removed, on the grounds that it is almost impossible to
specify constraints on its use that make it implementable and give it
well-defined semantics. The only use that is definable is that of
using it for a descendent team, where it would simply be syntactic
sugar for:

    CHANGE TEAM (team)
        SYNC ALL
    END TEAM


The SUBTEAM_ID intrinsic
------------------------

This is removed.


The NUM_IMAGES and THIS_IMAGE intrinsics
----------------------------------------

The descriptions of the intrinsic functions NUM_IMAGES() and
THIS_IMAGE() in ISO/IEC 1539-1:2010 are changed by adding optional
arguments DISTANCE and TEAM and a modified result if either is
present.

The DISTANCE argument shall be a scalar integer. It shall be
nonnegative. The TEAM argument shall be a scalar of type TEAM_TYPE,
and shall have a value that was returned by the FORM_SUBTEAMS
intrinsic.
If DISTANCE is present, TEAM shall not be present.

If neither argument is present, the result value is the image index of
the invoking image in the current team.

If DISTANCE is present with a value less than or equal to the team
distance between the current team and the initial team, the result has
the value of the image index in the team of which the invoking image
was last a member with a team distance of DISTANCE from the current
team; otherwise, the result has the value -1.

If TEAM is present and the invoking image is a member of that team,
the result has the value of the image index of the invoking image in
that team; otherwise, the result has the value -1.

[[[ Returning -1 is cleaner than choosing some random team, whether or
not that is the current one. ]]]

_______________________________________________________________________

Toon Moene

I have to vote no.

In addition to all the arguments not to pass this TS, I have
reconsidered my example in A2, Clause 6 notes, "Example 2: Producer
consumer program."

As far as I can see, it is correct with

    TYPE(LOCAL_EVENT_TYPE) :: EVENT[*]

As I tried very hard to come up with an example to show the need for
(non_local) EVENT_TYPE, I question whether we need two event types.

Please give TS 18508 another round at the meeting in Delft.

_______________________________________________________________________

David Muxworthy

Technical:
Given the issues already raised by others, the document is clearly not
yet ready for forwarding.

Editorial:

Subclause 2
At [3:5+] add:
    ISO/IEC 1539-1:2010/Cor 1:2012
    ISO/IEC 1539-1:2010/Cor 2:2013

Subclause 3
The 'ISO_FORTRAN_ENV' module referenced at 3.2 and 3.3 is not the
module from Fortran 2008. At [5:3] add sentence "The intrinsic module
ISO_FORTRAN_ENV is extended by this document."
At 3.3 'team variable' should either follow 3.4 'team' or be subsumed
under 'team'.

Subclause 4
[7:8] Replace "This" by "Except as identified in 4.1 above, this".
Subclause 8.3
The items should be in numerical order.

______________________________________________________________________

John Reid

1. The introduction does not mention failed images.

2. There is no explanation of how codes might be written to continue
execution in the presence of failed images.

3. I do not understand why SOURCE is required to be a coarray for
CO_BROADCAST but not for the other collectives.

4. The operation defined by OPERATOR of CO_REDUCE should be required
to be mathematically associative, because the sequence of partial
results to which it is applied is (purposely) undefined.

5. There need to be more examples and notes to explain the features.
For instance, there are no examples of the use of SYNC TEAM.

6. The complexity of the whole feature is greater than was envisioned
in Markham. Some reduction would be desirable.

_______________________________________________________________________

Van Snyder

{The constraint against branching out of a CHANGE TEAM construct (C501
in 13-251/N1967), and parallel constraints on CRITICAL (C811 in
12-007) and DO CONCURRENT (C824 in 12-007) ought to be gathered into a
single constraint on branching, e.g.,

    C845a A branch within a CHANGE TEAM, CRITICAL, or DO CONCURRENT
          construct shall not have a branch target that is outside the
          construct.

but that can be done during integration.}

[13-251/N1967:9:32+ C501+] Insert a constraint

    "C501a A RETURN statement shall not appear within a CHANGE TEAM
           construct."

{This, and similar constraints on CRITICAL (C810 in 12-007) and DO
CONCURRENT (C822 in 12-007), ought to be gathered into a single
constraint on the RETURN statement, e.g.

    C1269a A return-stmt shall not appear within a CHANGE TEAM,
           CRITICAL, or DO CONCURRENT construct.

but that can be done during integration.}

[13-251/N1967:10:21 5.4p1] Replace "greater than zero" by "not be
negative". But why even that? Better yet, delete "greater ...
is"

[13-251/N1967:10 Note 5.2] After the previous change, replace
"2-MOD(ME,2)" by "MOD(ME,2)".

[13-251/N1967:10 5.4] {The description of FORM SUBTEAM, and its
relationship to CHANGE TEAM, are inadequate. Is it necessary for
every image of the current team to execute a FORM SUBTEAM statement,
even though it's not an image control statement? What happens if
only, say, odd-numbered images do it? And if so, then what happens
when a CHANGE TEAM statement is executed? The CHANGE TEAM statement
is an image control statement, so one could not have, e.g.

    IF ( MOD(ME,2) /= 0 ) THEN
      FORM SUBTEAM ( 2-MOD(ME,2), ODD_EVEN )
      CHANGE TEAM ( ODD_EVEN )
      etc.
    END IF

Is prohibition against, e.g.,

    IF ( MOD(ME,2) /= 0 ) FORM SUBTEAM ( 2-MOD(ME,2), ODD_EVEN )
    CHANGE TEAM ( ODD_EVEN )
    etc.

implied by the prohibition to reference an undefined variable? What
if ODD_EVEN isn't undefined, but inconsistent on even-numbered images
with the result of executing FORM SUBTEAM on odd-numbered images?
E.g.,

    FORM SUBTEAM ( 1+MOD(ME,4), ODD_EVEN )
    IF ( MOD(ME,2) /= 0 ) FORM SUBTEAM ( 2-MOD(ME,2), ODD_EVEN )
    CHANGE TEAM ( ODD_EVEN )
    etc.

I have no suggestions how to repair this problem, because I don't know
the answers to the questions.
}

[13-251/N1967:10 5.5] {Description of SYNC TEAM is confusing. What
does "images of the team specified by team-variable" mean? What does
"each other image of the specified team" mean? Does it mean something
like "If image M executes a SYNC TEAM statement, and the value of
SUBTEAM_ID() for image M is S, then execution of the segment on image
M of the segment following the SYNC TEAM statement is delayed until
every image of the current team for which the value of SUBTEAM_ID() is
S executes a SYNC TEAM statement specifying the same team...."? Since
a team variable cannot be a coindexed object, is that what "specifying
the same team" means?

Suppose one has:

    FORM SUBTEAM ( 1+MOD(ME,4), ODD_EVEN )
    IF ( MOD(ME,2) /= 0 ) SYNC TEAM ( ODD_EVEN )

What happens? Who waits?
I have no suggestions how to repair this problem, because I don't know
the answers to the questions.
}

[13-251/N1967:15:14 7.2p1] Replace "calls to" by "invocations of".

[13-251/N1967:15:15-16 7.2p1] Delete "A call ... image control
statement." The effect will reappear in an edit for page 27 below.

[13-251/N1967:27:3+] Insert edit to 12-007:C821 in 8.1.6.6.3 CYCLE
statement

    "{Replace C821 in 8.1.6.6.3 CYCLE statement}

    "C821 (R831) A cycle-stmt within a CHANGE TEAM, CRITICAL, or DO
          CONCURRENT construct shall not belong to an outer construct."

[13-251/N1967:27:3+] Insert edit to replace 12-007:C845 in 8.1.10 EXIT
statement (see also 201-wvs-002).

    "{Replace C845 in 8.1.10 EXIT statement}

    "C845 An exit-stmt within a DO CONCURRENT construct shall not
          belong to that construct or an outer construct; an exit-stmt
          within a CHANGE TEAM or CRITICAL construct shall not belong
          to an outer construct."

{This is a slight extension of 12-007, in that it allows an EXIT
statement to belong to a CRITICAL construct but not to an outer
construct. 12-007 prohibits an EXIT statement from belonging to a
CRITICAL construct or an outer one. In e-mail discussion of CHANGE
TEAM, it was argued that it is better to allow an EXIT statement to
belong to a CHANGE TEAM construct than to expect one to put a label on
the END TEAM statement and branch to it, or to enclose the block of
the CHANGE TEAM construct in a BLOCK construct and exit the BLOCK
construct. The same analysis applies to the CRITICAL construct, and
for consistency the same should be allowed for it.
}

[13-251/N1967:27:9+ or thereabouts] Insert a bullet (compare to edits
for F08/0040 concerning MOVE_ALLOC)

    "o a CALL statement that invokes a collective intrinsic
       subroutine;"

This could be combined with the edit from F08/0040

    "o a CALL statement that invokes the intrinsic subroutine
       MOVE_ALLOC with coarray arguments, or a collective intrinsic
       subroutine;"

--------------050203040506080908050508--