From owner-sc22wg5+sc22wg5-dom8=www.open-std.org@open-std.org  Fri Jul 26 03:56:34 2019
Return-Path: <owner-sc22wg5+sc22wg5-dom8=www.open-std.org@open-std.org>
X-Original-To: sc22wg5-dom8
Delivered-To: sc22wg5-dom8@www.open-std.org
Received: by www.open-std.org (Postfix, from userid 521)
	id C309B9DB195; Fri, 26 Jul 2019 03:56:34 +0200 (CEST)
Delivered-To: sc22wg5@open-std.org
X-Greylist: delayed 4275 seconds by postgrey-1.34 at www5.open-std.org; Fri, 26 Jul 2019 03:56:34 CEST
Received: from ppa02.jpl.nasa.gov (ppa02.jpl.nasa.gov [128.149.137.113])
	(using TLSv1 with cipher AES256-SHA (256/256 bits))
	(No client certificate requested)
	by www.open-std.org (Postfix) with ESMTP id 3F80C356840
	for <sc22wg5@open-std.org>; Fri, 26 Jul 2019 03:56:32 +0200 (CEST)
Received: from pps.filterd (ppa02.jpl.nasa.gov [127.0.0.1])
	by ppa02.jpl.nasa.gov (8.16.0.27/8.16.0.27) with SMTP id x6Q1tF5P082986
	for <sc22wg5@open-std.org>; Thu, 25 Jul 2019 18:56:30 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=jpl.nasa.gov; h=subject : from :
 reply-to : to : content-type : date : message-id : mime-version :
 content-transfer-encoding; s=InSight1906;
 bh=7qpxJral7RlZCpouIfHjm6AgGXXZq+dh+gJNJqTkCfc=;
 b=H54nhsck38bQiEFKEFTmvnvbMmrTA8AORf6+rNWMDuS1KE0HTC2AEAhoYO4EpjR3RSv6
 jfG7qObZKRhGj8Dpz8QM63xozd9xa0c7NmQseoo4FintMSobUM8Cct+nDwhOCgquOF1W
 af7R1IAApj+RHxFDBASMgZqEB02wq3o7LKgTCjK5/khKTmSOijdmnCmF0jbDQVqKBsta
 VoAkS3sxLkKjOp3AiFtm1dFiBWDZsgoN0hFmhO3rtUTyNtZFEN6XubXBVXSOe9Z4AT5H
 SG0YfDKqVVKlmi9AZJKbyq+EK9iJ3xXUC2+BOq2aItKRWFQmKR+zj9TFNPrfrB3GENvw lA== 
Received: from mail.jpl.nasa.gov (altphysenclup02.jpl.nasa.gov [128.149.137.53])
	by ppa02.jpl.nasa.gov with ESMTP id 2ty6b3bv0c-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT)
	for <sc22wg5@open-std.org>; Thu, 25 Jul 2019 18:56:30 -0700
Received: from [137.79.7.57] (math.jpl.nasa.gov [137.79.7.57])
	by smtp.jpl.nasa.gov (Sentrion-MTA-4.3.1/Sentrion-MTA-4.3.1) with ESMTP id x6Q1uSIo032699
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128 bits) verified NO)
	for <sc22wg5@open-std.org>; Thu, 25 Jul 2019 18:56:30 -0700
Subject: Two things from IFIP WG 2.5 meeting
From: Van Snyder <Van.Snyder@jpl.nasa.gov>
Reply-To: Van.Snyder@jpl.nasa.gov
To: sc22wg5 <sc22wg5@open-std.org>
Content-Type: text/plain; charset="ISO-8859-1"
Organization: Yes
Date: Thu, 25 Jul 2019 18:56:28 -0700
Message-ID: <1564106188.31099.82.camel@math.jpl.nasa.gov>
Mime-Version: 1.0
X-Mailer: Evolution 2.32.3 (2.32.3-37.el6) 
Content-Transfer-Encoding: 7bit
X-Source-IP: math.jpl.nasa.gov [137.79.7.57]
X-Source-Sender: Van.Snyder@jpl.nasa.gov
X-AUTH: Authorized
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2019-07-26_01:,,
 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501
 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0
 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0
 mlxlogscore=580 adultscore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.0.1-1906280000 definitions=main-1907260023
Sender: owner-sc22wg5@open-std.org
Precedence: bulk

Two things that ought to be of interest to the Fortran standard were
discussed at the recent meeting of IFIP Working Group 2.5 (Numerical
Software).

1. The dot product is an ubiquitous operation in mathematically-related
problems, not just in library modules.  Floating-point arithmetic is not
associative.  The dot product can be arbitrarily badly conditioned.  The
numerical software community would like to have either an exact dot
product or a correctly-rounded dot product.  Using a super-accumulator
as originally described by Kulisch, the exact dot product can be
computed as fast as operands can be fetched.  IBM implemented the exact
dot product in hardware for the 370, in an option called ACRITH.
Kulisch and his students implemented the exact dot product on a PCI
card.  Demmel and his students implemented the exact dot product on a
PCI card.  Both Kulisch and Demmel observed that the PCI card computed
an exact dot product six times faster than a double-precision floating
point method using a Pentium.

Computing a correctly-rounded dot product is possible using conventional
floating point arithmetic, at a cost of up to six times the cost of the
obvious method.  It is difficult for users to implement it because
processors sometimes do not respect parentheses.  For example, a
compensated-summation loop might be written

  sum = 0.0; comp = 0.0
  do k = 1, 10
    oldsum = sum
    comp = comp + term(k)
    sum = comp + oldsum
1   comp = ( oldsum - sum ) + comp
  end do
  sum = sum + comp ! The final compensated result

but a too-clever optimizer might realize that statement 1 is a clever
way to compute comm = 0, and therefore scrub out all appearances of
comp.  (Siegfried Rump's correctly-rounded dot product algorithm uses
similar methods).

It would be useful to have an optional logical argument to DOT_PRODUCT
to request a correctly-rounded dot product.  Under the covers, the
processor could compute this using either a Kulisch super accumulator,
or (e.g.) Siegfried Rump's method.

It would be useful to have an EXACT_DOT_PRODUCT intrinsic function that
produces a result of a type specified in ISO_Fortran_ENV.  That type
should have type-bound operator(+) and operator(-) operations, that take
either two exact dot-product result objects, or one such object and a
floating-point object, producing an exact dot-product object.  This
object would be a representation of a Kulisch super accumulator.  It
should have a kind type parameter, allowed to have any of the values
allowed for type REAL.  One might hope that eventually there would be
hardware support to compute the exact dot product.  Software simulation
would be very slow.

2. It was observed (again) that parenthesization of matrix products has
a profound impact on performance.  For example, in ABx, where A and B
are NxN matrices and x is an n-vector, (AB)x has N^3 complexity, while
A(Bx) has N^2 complexity.

Determining the optimal parenthesization is a simple dynamic-programming
problem, but producing a general procedure to handle an arbitrary number
of factors is non-trivial for user programs.  It would not be difficult
for a processor to compute and execute the optimal parenthesization if
the MAMTUL intrinsic function allowed an arbitrary number of arguments.

One might hope that products involving the transpose of a matrix would
be noticed by a processor, and optimized so as not to produce the
transpose as a separate object.  For example, MATMUL(TRANSPOSE(A),B)
ought not to create a representation of TRANSPOSE(A); rather, it ought
to do the multiplication differently.  xGEMM (or xGEMV) can do this, so
if MATMUL simply uses xGEMM under the covers, there should be no
representation of TRANSPOSE(A).  This should not be part of normative
text; rather it ought to be  quality-of-implementation issue, perhaps
mentioned in a note, or Annex C.

Neither one of these seems to be a large project.