From owner-sc22wg5+sc22wg5-dom8=www.open-std.org@open-std.org  Wed Nov 12 13:36:18 2014
Date: 12 Nov 2014 12:36:10 +0000
From: "N.M. Maclaren" <nmm1@cam.ac.uk>
To: Van.Snyder@jpl.nasa.gov
Cc: sc22wg5 <sc22wg5@open-std.org>
Subject: Re: [ukfortran] (SC22WG5.5365) Nondeterminacy of reductions
Message-ID: <Prayer.1.3.5.1411121236100.5534@hermes-1.csi.cam.ac.uk>
In-Reply-To: <20141111192952.4EFD33588A2@www.open-std.org>
References: <20141111192952.4EFD33588A2@www.open-std.org>
X-Mailer: Prayer v1.3.5
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; charset=ISO-8859-1
Sender: owner-sc22wg5@open-std.org
Precedence: bulk

On Nov 11 2014, Van Snyder wrote:
>
>Sylvain Collange et al remark in ...
>
>that parallel computations, especially reductions, are non-deterministic
>due to floating-point computations not being computationally
>associative.

Which has been well known for at least half a century.  Indeed, it is
best to regard ALL floating-point computations as non-deterministic,
because optimisation and the choice of underlying algorithm both
affect the result.  That is the model used for traditional numerical
analysis, after all.
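
A minimal sketch of the non-associativity in question, written in Python
purely for brevity (the language and the operand values are chosen here
for illustration and assume IEEE binary64 floats):

```python
from functools import reduce
import operator

# Same three operands, different grouping.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# Same operands, different order, strict left-to-right addition.
# reduce() is used rather than the built-in sum(), which in recent
# Python versions may compensate for rounding error.
order_a = reduce(operator.add, [1e16, 1.0, -1e16])
order_b = reduce(operator.add, [1e16, -1e16, 1.0])

print(left == right)     # grouping changes the result
print(order_a, order_b)  # so does ordering
```

A parallel reduction is free to choose either the grouping or the
ordering, which is all that is needed for results to differ between runs.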

>This method can accumulate an exact dot product as fast as data can be
>provided.  With a super accumulator, the method is somewhat simpler than
>a floating-point fused-multiply-add.  The size of the superaccumulator
>advocated therein is 536 bytes (4288 bits) for IEEE binary64 format.
>Contemporary processors have 16k of registers that could be organized
>into a super accumulator.

Which, inter alia, will increase the network and memory bandwidth
required by a factor of nearly 70 (4288 bits per value in place of 64).
Or, in the case of IEEE 128-bit, a factor of over 500.
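
Those factors can be checked with a rough estimate.  The sketch below
assumes that an accumulator for exact products has to span twice the
format's exponent range plus its precision; the function name is
hypothetical, and the 92 guard bits are simply what is needed to reach
the 4288-bit figure quoted above, not a value taken from the paper.

```python
def product_accumulator_bits(emax, emin, precision, guard_bits=0):
    """Rough width of a fixed-point accumulator holding exact products.

    A product of two values spans twice the exponent range of one value,
    from the square of the smallest subnormal to the square of the largest
    finite number.  This is an estimate, not a specification.
    """
    return 2 * (emax - emin + precision) + guard_bits

# IEEE binary64: emax = 1023, emin = -1022, 53-bit precision.
binary64_bits = product_accumulator_bits(1023, -1022, 53, guard_bits=92)
ratio64 = binary64_bits / 64

# IEEE binary128: emax = 16383, emin = -16382, 113-bit precision.
# No guard bits are added, so this is a lower bound.
binary128_bits = product_accumulator_bits(16383, -16382, 113)
ratio128 = binary128_bits / 128

print(binary64_bits, ratio64)
print(binary128_bits, ratio128)
```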

More importantly, this trick (and it is simply a trick) handles
ONLY reduction by summation.  It can't be extended to multiplication,
let alone to more complicated reductions.
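
One way to see the difficulty with multiplication, sketched in Python
with exact rationals standing in for an exact accumulator (an assumption
made for illustration, not the paper's mechanism): the exact sum of
these values needs no more significant bits than one operand, whereas
the exact product needs roughly as many bits as all the operands put
together.

```python
from fractions import Fraction
from functools import reduce
import operator

x = 1.0 + 2.0**-52          # uses all 53 significand bits of binary64
values = [Fraction(x)] * 8  # Fraction(float) converts exactly

exact_sum = reduce(operator.add, values)
exact_product = reduce(operator.mul, values)

# The denominators are powers of two, so the numerator's bit length is
# the number of significant bits needed to hold each result exactly.
sum_bits = exact_sum.numerator.bit_length()
product_bits = exact_product.numerator.bit_length()

print(sum_bits, product_bits)
```

The width needed for the product grows with the number of operands, so
no accumulator of fixed size can hold it.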

>The present descriptions of REDUCE and CO_REDUCE do not accommodate the
>use of EXACT_DOT_PRODUCT.  If EXACT_SUM is parallel to SUM, it also
>cannot be used in those contexts.  An alternative to EXACT_SUM that
>takes two scalar arguments that are independently either floating-point
>(of any kind) or complete, and produces a complete result, would allow
>it to be used in those contexts.

Yes, it can.  All you need to do is the following:

    Write a call to EXACT_DOT_PRODUCT or EXACT_SUM

    Expand each product or number in that call into a suitable
    derived type or array

    Call CO_REDUCE on that expanded form with a suitable operation

    Reduce the result to normal precision and store it back
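
The steps above can be sketched as follows.  This is Python rather than
Fortran, and everything in it is a stand-in chosen for illustration:
Fraction plays the part of the expanded derived type, functools.reduce
plays the part of CO_REDUCE, and the names expand, combine and
exact_reduce are hypothetical.

```python
from fractions import Fraction
from functools import reduce
import operator

def expand(x):
    """Expand one floating-point number to an exact representation."""
    return Fraction(x)

def combine(a, b):
    """The operation handed to the reduction: exact addition."""
    return a + b

def exact_reduce(values):
    """Expand, reduce exactly, then round once to normal precision."""
    return float(reduce(combine, map(expand, values)))

data = [1e16, 1.0, -1e16, 1.0]
reordered = sorted(data)

# Ordinary left-to-right floating-point addition depends on the order.
naive_a = reduce(operator.add, data)
naive_b = reduce(operator.add, reordered)

# The expanded form gives the same answer whatever the order.
exact_a = exact_reduce(data)
exact_b = exact_reduce(reordered)

print(naive_a, naive_b)
print(exact_a, exact_b)
```

Because the combining operation is exact, it is associative, and the
only rounding happens in the final conversion back to normal precision.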

If it were regarded as desirable that the standard should include such
a facility, it would be FAR cleaner to have specific intrinsics, and/or
an optional argument to CO_SUM, because this is not a general facility.


Regards,
Nick.

