ISO/IEC JTC1/SC22/WG14 N829

WG14/N829 J11/98-028

Previous version: WG14/N808. 


DR 63 (and 56): Floating-Point Accuracy

Proposed wording for the Response for DR 63 (and 56). 

The following wording (approved by the editorial review
committee) should be incorporated into a future version of
the standard in section 5.2.4.2.2 <float.h>, page 25,
between paragraphs 3 and 4:

The accuracy of the floating-point operations (+, -, *, /)
and of the math library (<math.h> and <complex.h>) functions
that return a floating-point result is
implementation-defined.  The implementation shall document
the accuracy, but is allowed to state that the accuracy is
unknown.


Rationale (section 5.2.4.2.2 <float.h>)

The overflow and/or underflow thresholds may not be the same
for all arithmetic operations.  For example, there is at
least one machine where the overflow threshold for addition
is twice as big as for multiplication.  Another
implementation uses a pair of doubles to represent a long
double.  In that implementation, the next representable long
double value after 1.0L is 1.0L + LDBL_MIN, yet the
difference between those two numbers (LDBL_MIN) is not
b^(1-p), otherwise known as LDBL_EPSILON.  Because of
anomalies like these, there are
few hard requirements on the <float.h> values.  But the
values in <float.h> should be in terms of the hardware
representation used to store floating-point values in memory
(not in terms of the effective accuracy of operations, nor
in terms of registers) and should apply to all operations. 
The representation stored in memory may have padding bits
and/or bytes that do not contribute to the value.  The
padding should not be included in the <float.h> values. 
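
As a concrete check of the long double anomaly described
above, a program can compare the gap just above 1.0L with
LDBL_EPSILON.  A minimal sketch, assuming the C9X function
nextafterl from <math.h>:

    #include <stdio.h>
    #include <float.h>
    #include <math.h>       /* C9X: nextafterl */

    int main(void)
    {
        /* Gap between 1.0L and the next representable
           long double. */
        long double gap = nextafterl(1.0L, 2.0L) - 1.0L;

        /* On most implementations gap equals LDBL_EPSILON;
           on a pair-of-doubles implementation it can be as
           small as LDBL_MIN, as described above. */
        printf("gap          = %Lg\n", gap);
        printf("LDBL_EPSILON = %Lg\n", LDBL_EPSILON);
        return 0;
    }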

Because of the practical difficulty involved in defining a
uniform metric that applies to both real and complex types
and that all vendors would be willing to follow (just
computing the accuracy reliably could be a significant
burden that varies with the required metric), and because
the importance of floating-point accuracy differs greatly
among users, the standard allows a great deal of latitude in
how an implementation documents the accuracy of the real and
complex floating-point operations and functions.

Here are some ways that an implementation might address the
need to define the accuracy ... 

   digits correct
   digits wrong
   maximum Units in the Last Place (ULPs) error
   maximum absolute error
   maximum relative error

For complex values, some methods are ... 

  error in terms of both real and imaginary parts
  error in terms of Euclidean norm, 
    ||a + b*i|| = sqrt(a*a + b*b).
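
A sketch of how these measures might be computed, assuming
the C9X <complex.h> facilities; the helper names are
hypothetical:

    #include <math.h>
    #include <complex.h>    /* C9X: double complex, cabs */

    /* Absolute error: |computed - correct|. */
    double abs_err(double computed, double correct)
    {
        return fabs(computed - correct);
    }

    /* Relative error, assuming correct is nonzero. */
    double rel_err(double computed, double correct)
    {
        return fabs(computed - correct) / fabs(correct);
    }

    /* Error in the Euclidean norm: cabs computes
       ||a + b*i|| = sqrt(a*a + b*b) directly. */
    double norm_err(double complex computed,
                    double complex correct)
    {
        return cabs(computed - correct);
    }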

There are two usages of the term ULP.  One is in the context
of differences between two numbers, e.g., f(x) differs from
F(x) by 3 ULPs.  The other is the value of the ULP of a
number, e.g., the ULP of the value 1.0 is DBL_EPSILON.  For
this discussion, we are interested in the former: the
difference between the computed value and the infinitely
precise value.

The error between two floating-point numbers in ULPs depends
on the radix and the precision used in representing the
number, but not the exponent.  With a decimal radix and 3
digits of precision, the computed value 0.314e+1 differs
from the value 0.31416e+1 by 0.16 ULPs.  If both numbers are
scaled by the same power of the radix, e.g., 0.314e+49 and
0.31416e+49, they still differ by 0.16 ULPs. 
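
The same calculation can be done in binary.  A minimal
sketch of an error-in-ULPs function for doubles, assuming
finite, nonzero arguments ("correct" stands in for the best
available approximation of the infinitely precise value):

    #include <math.h>
    #include <float.h>

    /* Error of 'computed' measured in ULPs of 'correct'. */
    double ulp_error(double computed, double correct)
    {
        int exp;

        /* correct == f * 2^exp with 0.5 <= |f| < 1. */
        frexp(correct, &exp);

        /* Spacing of doubles in [2^(exp-1), 2^exp) is
           DBL_EPSILON * 2^(exp-1). */
        double one_ulp = ldexp(DBL_EPSILON, exp - 1);

        return fabs(computed - correct) / one_ulp;
    }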

When the two numbers being compared span a power of the
radix, the two possible ULP error calculations differ by a
factor of the radix.  For a decimal radix and 3 digits of
precision, consider the two values 9.99e2 and 1.01e3.  These
are the two values adjacent to the value 1.00e3, a power of
the radix, in this number system.  If 999 is the correct
value and 1010 is the computed value, the error is 11 ULPs.
But if 1.01e3 is the correct value and 0.999e3 is the
computed value, then the error is 1.1 ULPs.

Some math functions, such as those that do argument
reduction modulo an approximation of pi, have good accuracy
for small arguments, but poor accuracy for large arguments. 
It is not unusual for an implementation of the trig
functions to have zero bits correct in the computed result
for large arguments.  For cases like this, an implementation
might break the domain of the function into disjoint
regions and specify the accuracy in each region. 
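
The loss comes from the reduction step: the quotient
x / (2*pi) is so large that a tiny error in the value used
for 2*pi is magnified into an error of order 1 in the
reduced argument.  A sketch that exhibits the effect,
assuming an implementation whose double is the IEC 559
64-bit format:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x = 1.0e16;

        /* Double-precision approximation of 2*pi; it is off
           from the true value by about 2e-16. */
        double two_pi = 6.283185307179586;

        /* fmod is exact, but x/two_pi is about 1.6e15, so
           the representation error in two_pi alone shifts
           the reduced argument by a few tenths of a
           radian. */
        double reduced = fmod(x, two_pi);

        /* The two results typically disagree wildly. */
        printf("sin(x)           = %.17g\n", sin(x));
        printf("sin(fmod(x,2pi)) = %.17g\n", sin(reduced));
        return 0;
    }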

If an implementation documents worst case error, there is no
requirement that it be the minimum worst case error.  That
is, a vendor who believes that the worst case error for a
function is around 5 ULPs could document it as 7 ULPs to be
on the safe side.

The committee could not agree on error bounds that all
conforming implementations must meet, e.g., addition is no
worse than 2 ULPs for all implementations.  It is a quality
of implementation issue.

Implementations that conform to IEC 559 have 0.5 ULP
accuracy in round-to-nearest mode, and 1 ULP accuracy in the
other three rounding modes, for the basic arithmetic
operations and sqrt.  For other floating-point arithmetics,
it is a rare implementation that has worse than 1 ULP
accuracy for the basic arithmetic operations.
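
Correct rounding means that performing the same operation
rounded downward and rounded upward brackets the exact
result.  A minimal sketch, assuming the C9X <fenv.h>
interface on an IEC 559 implementation:

    #include <stdio.h>
    #include <fenv.h>       /* C9X: fesetround, FE_* */

    #pragma STDC FENV_ACCESS ON

    int main(void)
    {
        /* volatile keeps the division from being folded at
           translation time under the default rounding
           mode. */
        volatile double one = 1.0, three = 3.0;

        fesetround(FE_DOWNWARD);
        printf("1/3 rounded down: %.17g\n", one / three);

        fesetround(FE_UPWARD);
        printf("1/3 rounded up:   %.17g\n", one / three);

        /* The two printed values are adjacent doubles that
           bracket the exact value 1/3. */
        fesetround(FE_TONEAREST);
        return 0;
    }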

The accuracy of binary-decimal conversions and format
conversions is discussed elsewhere in the standard.

For the math library functions, fast correctly rounded
(0.5 ULP) accuracy is still a research problem.  Some
implementations provide two math libraries: one being faster
but less accurate than the other. 

The C9X committee discussed the idea of allowing the
programmer to find out the accuracy of floating-point
operations and math functions during compilation (say via
macro symbols) or during execution (via a function call),
but neither got enough support (even though this is what
some users want) to warrant the change to the standard.  The
use of macro symbols would require over one hundred macro
symbols to name every math function (such as ULP_SINF,
ULP_SIN, and ULP_SIND just for the real-valued sin
function).  One possible function implementation might be a
function that takes the name of the operation or math
function as a string, e.g., ulp_err( "sin" ) that would
return a double such as 3.5 to indicate the worst case error
or the value -1.0 to indicate unknown error.  But such a
"simple" scheme would likely be of very limited use given
that so many functions have accuracies that differ
significantly across their domains.  Constrained to worst
case error across the entire domain, most implementations
would wind up reporting either unknown or else a uselessly
large value for a very large percentage of functions:
useless because most programs that care about accuracy are
written in the first place to try to compensate for accuracy
problems that typically arise when pushing domain
boundaries.  And implementing something more useful, like
the worst case error for a user-specified partition of the
domain, would be a nightmare.
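
A sketch of the string-based query described above; all
names are hypothetical and the table entries are
placeholders, not measured data:

    #include <string.h>

    /* Hypothetical table of documented worst case errors. */
    struct ulp_entry { const char *name; double worst_ulps; };

    static const struct ulp_entry table[] = {
        { "sin", 3.5 },     /* placeholder value */
        { "exp", 2.0 },     /* placeholder value */
    };

    /* Return the worst case error in ULPs for the named
       operation or math function, or -1.0 if unknown. */
    double ulp_err(const char *name)
    {
        size_t i;

        for (i = 0; i < sizeof table / sizeof table[0]; i++)
            if (strcmp(name, table[i].name) == 0)
                return table[i].worst_ulps;
        return -1.0;        /* unknown error */
    }

Even this sketch answers only for the whole domain; the
per-region reporting described above would need a far richer
interface.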