Subject: Re: [EXTERNAL] Re: [J3] (SC22WG5.6121) Two things from IFIP WG 2.5 meeting
From: Van Snyder <van.snyder@jpl.nasa.gov>
To: sc22wg5 <sc22wg5@open-std.org>
Date: Mon, 29 Jul 2019 13:43:20 -0700


On Mon, 2019-07-29 at 13:29 +0000, Bill Long wrote:

> > On Jul 27, 2019, at 10:20 AM, Van Snyder via J3
> > <j3@mailman.j3-fortran.org> wrote:
> >
> > Specification that DOT_PRODUCT produce a correctly-rounded result
> > does not depend upon the kind of the arguments.  If a processor has
> > a super accumulator that only works for binary64 (or binary32), it
> > could use it for those precisions, and use a software method
> > otherwise.  The processor could detect at run time whether the CPU
> > (or a coprocessor) provides a super accumulator, and use the
> > appropriate method.
> >
>
> The focus on dot product seems odd for a couple of reasons.  Most
> processors use IEEE floating-point representations, where the exponent
> range for 64-bit reals (used whenever accuracy is important) is
> ~10**300.  The number of particles in the universe is ~10**80, so
> overflow seems very unlikely for any reasonably formed problem.
> Similarly for underflow, as tiny values are ~10**-300.  The potential
> numerical problem I see is a very long vector sum where the values are
> significantly different in size or have lots of cancellation because
> of alternating signs.  There are a couple of techniques that can help
> using existing hardware.  First, vectorize the operation, which
> results in parallel accumulation of partial sums that are combined at
> the end.  (Note that this only happens at non-zero optimization
> levels.)  Vectorization alone is sufficient for the vast majority of
> users.  For terms of widely varying size in the summation, resulting
> in truncation errors in the additions, you could sort the list and sum
> starting at the small end.


There are more things in heaven and earth, Horatio, than are dreamt of
in your philosophy.

As the paper remarks, dot product is ubiquitous.

The issue isn't overflow.  It's getting correct bits in the result.
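
To make "correct bits" concrete, here is a minimal demonstration
(binary64 assumed; the program name is mine).  The exact sum below is 1,
but left-to-right accumulation returns 0, with no overflow anywhere:

    program lost_bits
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      real(dp) :: x(3), s
      integer :: i
      ! Exact sum is 1, but the 1 is absorbed into 1.0e100 and the
      ! subsequent subtraction cannot recover it.
      x = [1.0e100_dp, 1.0_dp, -1.0e100_dp]
      s = 0.0_dp
      do i = 1, 3
        s = s + x(i)
      end do
      print *, s   ! prints 0.0, not 1.0
    end program lost_bits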

Somehow, I trust experts who are specialists in this area (Kulisch,
Rump, Ogita, Oishi, Dekker, Neumaier, Knuth, Kahan...) who say that such
simplistic solutions aren't adequately reliable.  Sorting is O(n log n),
so for vectors longer than eight elements or so, Rump's algorithm (based
on compensated summation) would be faster.  Sorting requires branches.
Bin sorting on the exponent is faster but requires work space.  Summing
the bins might be slower than algorithms based upon compensated
summation.  If there are no sign changes, accumulating small-to-large
works best.  If there are sign changes, accumulating large-to-small
works best.  But none of those simplistic approaches are reliable.
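
For readers who haven't met compensated summation: here is a minimal
sketch of the classic Kahan variant, the simplest member of the family
(module and function names are mine; Rump's algorithm is built from the
stronger error-free transformations sketched further below):

    module compensated_sum_m
      implicit none
      integer, parameter :: dp = kind(1.0d0)
    contains
      ! Classic Kahan compensated summation: c carries the low-order
      ! bits lost in each addition and feeds them back into the next.
      pure function kahan_sum(x) result(s)
        real(dp), intent(in) :: x(:)
        real(dp) :: s, c, y, t
        integer :: i
        s = 0.0_dp
        c = 0.0_dp
        do i = 1, size(x)
          y = x(i) - c       ! reintroduce the previous correction
          t = s + y          ! low-order bits of y are lost here
          c = ((t - s) - y)  ! recover them; the parentheses matter
          s = t
        end do
      end function kahan_sum
    end module compensated_sum_m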

The paper cited in the proposal considers Bill's simplistic methods.
It's in Tutorials/Ogita-Rump-Oishi.pdf.  Nick Higham devoted an entire
chapter to this in "Accuracy and Stability of Numerical Algorithms."

Dekker's algorithm has branching.  Neumaier uses Dekker's algorithm.
Knuth's doesn't branch, but requires six flops per add.  Even so it's
typically 50% faster than Dekker's.  Extended precision (e.g. XBLAS) can
be silently inadequate (and Rump's algorithm is 40% faster).
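
To make the comparison concrete, here is a sketch of the two error-free
additions contrasted above, plus the Dot2 compensated dot product from
the Ogita-Rump-Oishi paper built on them.  The Fortran names are mine;
binary64 is assumed, and over/underflow in the products is ignored:

    module error_free_m
      implicit none
      integer, parameter :: dp = kind(1.0d0)
    contains
      ! Dekker's FastTwoSum: three flops, exact (s + e == a + b), but
      ! valid only when |a| >= |b|, so callers need a branch or prior
      ! knowledge to order the operands.
      elemental subroutine fast_two_sum(a, b, s, e)
        real(dp), intent(in)  :: a, b
        real(dp), intent(out) :: s, e
        s = a + b
        e = b - (s - a)
      end subroutine fast_two_sum

      ! Knuth's TwoSum: six flops, exact, branch-free, no ordering.
      elemental subroutine two_sum(a, b, s, e)
        real(dp), intent(in)  :: a, b
        real(dp), intent(out) :: s, e
        real(dp) :: z
        s = a + b
        z = s - a
        e = (a - (s - z)) + (b - z)
      end subroutine two_sum

      ! Veltkamp splitting for binary64 (53-bit significand).
      elemental subroutine split(a, hi, lo)
        real(dp), intent(in)  :: a
        real(dp), intent(out) :: hi, lo
        real(dp), parameter :: factor = 2.0_dp**27 + 1.0_dp
        real(dp) :: c
        c = factor * a
        hi = c - (c - a)
        lo = a - hi
      end subroutine split

      ! Dekker's TwoProduct: p + e == a*b exactly.
      elemental subroutine two_product(a, b, p, e)
        real(dp), intent(in)  :: a, b
        real(dp), intent(out) :: p, e
        real(dp) :: a1, a2, b1, b2
        p = a * b
        call split(a, a1, a2)
        call split(b, b1, b2)
        e = a2*b2 - (((p - a1*b1) - a2*b1) - a1*b2)
      end subroutine two_product

      ! Dot2 (Ogita, Rump, Oishi): result is as accurate as if the dot
      ! product were computed in twice the working precision and then
      ! rounded once.  Assumes size(x) >= 1 and size(x) == size(y).
      pure function dot2(x, y) result(res)
        real(dp), intent(in) :: x(:), y(:)
        real(dp) :: res, p, s, h, r, t, q
        integer :: i
        call two_product(x(1), y(1), p, s)
        do i = 2, size(x)
          call two_product(x(i), y(i), h, r)
          call two_sum(p, h, t, q)
          p = t
          s = s + (q + r)
        end do
        res = p + s
      end function dot2
    end module error_free_m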

Ogita and Oishi were at Waseda University in Tokyo in 2005.  Maybe we
could invite them to give a presentation, if they're still nearby.


> The relevant question for WG5 is whether this issue is in the scope of
> the Fortran standard.  For the small number of cases where these
> round-off issues matter to a program, the user might want to either
> write separate code, or use one of the professionally written
> libraries (NAG, IMSL, …) for the dot_product computation.  Fortran
> specifies the result in mathematical terms.  I think it is unwise to
> be specifying implementation details.


The description of NORM2 includes a recommendation "without undue
overflow or underflow."  That's a tiny bit more emphatic than "processor
dependent approximation."

We've done many things for small numbers of users.  HYPOT, for example,
is entirely redundant with CABS and NORM2.
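
For readers unfamiliar with the phrase, "without undue overflow or
underflow" usually means scaling.  A minimal sketch of the classic
trick (my names, not any vendor's actual implementation):

    ! sqrt(x**2 + y**2) without undue overflow or underflow: scale by
    ! the larger magnitude so the squared ratio stays in [0, 1].
    elemental function safe_hypot(x, y) result(h)
      integer, parameter :: dp = kind(1.0d0)
      real(dp), intent(in) :: x, y
      real(dp) :: h, big, small
      big   = max(abs(x), abs(y))
      small = min(abs(x), abs(y))
      if (big == 0.0_dp) then
        h = 0.0_dp
      else
        h = big * sqrt(1.0_dp + (small/big)**2)
      end if
    end function safe_hypot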

The problem with implementing reliable algorithms is that optimizers are
eager to subvert them, and subverting optimizers is difficult.
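
A concrete instance of that subversion (assuming gfortran; its
-ffast-math option enables the floating-point reassociation that the
standard's parentheses-integrity rule otherwise forbids): compile the
program below with and without the flag and compare.  The correction
term is mathematically zero, so a reassociating optimizer may delete it
and quietly turn the compensated loop back into naive summation.

    program subverted
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, parameter :: n = 10000000
      real(dp) :: naive, s, c, y, t
      integer :: i
      naive = 0.0_dp
      s = 0.0_dp
      c = 0.0_dp
      do i = 1, n
        naive = naive + 0.1_dp
        y = 0.1_dp - c
        t = s + y
        c = ((t - s) - y)  ! mathematically zero; numerically the lost
                           ! bits.  Reassociation folds it to 0.0.
        s = t
      end do
      ! Exact value is 1.0e6; compare how close each sum comes.
      print *, 'naive:', naive, '  compensated:', s
    end program subverted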

Does NAG (or IMSL) provide a correctly-rounded dot product procedure?
Maybe they believe XBLAS is sufficient?

If one has a super accumulator in the (co)processor, would a NAG (or
IMSL) dot product function detect it and exploit it?

The proposal EXPLICITLY DOES NOT SPECIFY IMPLEMENTATION DETAILS!  DID
YOU READ IT OR JUST JERK YOUR KNEE?


> > I recall descriptions of neural network training failing to
> > converge in even 100,000 iterations without correct linear algebra.
> > Iterative refinement was tried, but made only a small dent, because
> > of poor conditioning of the dot product.  It finally worked when an
> > accurate dot product was used.  IIRC, this was described in a paper
> > by Victor Pereyra, in connection with variable projection and
> > separable nonlinear least-squares problems.  I've asked
> > correspondents for more examples.
>
> The linear algebra needs in neural network training codes are beyond
> the skill set of most of the Python programmers who dominate the
> field.  They typically just (ultimately) call the library routines
> supplied by the GPU vendors appropriate for the target hardware.  If
> those routines are insufficient for some reason, comments to that
> effect should be directed to the vendors.


Neural network training is only one example, not the entire universe of
problems.

But it's always handy to throw in a red herring when you don't have much
else.


