Audience: SG6
S. Davis Herring <herring@lanl.gov>
Los Alamos National Laboratory
January 10, 2025
The standard says very little about the actual results of floating-point evaluations. [basic.fundamental]/12 says the “accuracy of operations” is implementation-defined; [expr.pre]/6 makes the situation even less clear by suggesting that floating-point operands and results are somehow not even values of their types. Indeed, it is of practical value that implementations often interpret an expression involving floating-point types as a mathematical expression in order to improve the performance and accuracy of computing its overall result. Common techniques include fusing multiplications and additions, discarding canceling terms, and temporarily using extra precision (e.g., using x87 registers). Strict application of well-defined floating-point operations is of course critical to other numerical algorithms; the footnote in /6 suggests that “The cast and assignment operators” may be used for the purpose.
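As a concrete illustration of contraction (the values here are assumed for the example, not taken from any wording), the two evaluations of a * b + c below differ because the fused form rounds only once:

#include <cmath>
#include <cstdio>

int main() {
  // a*b is exactly 1 - 2^-54, which rounds to 1.0 in binary64
  double a = 1 + 0x1p-27, b = 1 - 0x1p-27, c = -1;
  double separate = a * b + c;      // product rounded first: 0 under strict binary64
  double fused = std::fma(a, b, c); // single rounding: -0x1p-54
  std::printf("%a %a\n", separate, fused);
}

Whether a compiler contracts a * b + c on its own depends on options such as -ffp-contract; std::fma merely makes the fused evaluation explicit here.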
These ideas are derived from C, which additionally defines the FLT_EVAL_METHOD macro to describe the implementation’s choices about such transformations. Matthias Kretz presented Floating-Point Excess Precision to SG6 and EWG seeking guidance on how to most consistently interpret these ideas in the context of C++’s stronger type system, constant evaluation, and the larger set of contemporary floating-point types. No clear direction has yet been reached, suggesting that further research may be needed.
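For reference, the macro’s values (per the C standard) can be inspected directly; this sketch only reports the implementation’s choice:

#include <cfloat>
#include <cstdio>

int main() {
  // -1: indeterminable; 0: evaluate to the type; 1: float/double as double;
  // 2: everything as long double (e.g., traditional x87 code generation)
  std::printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
}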
The idea that an operator result of a type does not have one of the values of that type is obviously problematic from the perspective of defining semantics for such an operator. Moreover, the idea that assigning to a variable forces extended precision to be discarded is problematic in C++ because of the routine use of wrapper class types in mathematical expressions. The creation of every such object involves the initialization of its member variable, which seems to be just as strong as assignment in terms of incompatibility with extralinguistic extended precision.
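A minimal sketch of such a wrapper (the type and operators here are hypothetical, not from any particular library):

struct Wrapped {
  double v; // a member initialization occurs for every temporary below
  friend Wrapped operator*(Wrapped a, Wrapped b) { return {a.v * b.v}; }
  friend Wrapped operator+(Wrapped a, Wrapped b) { return {a.v + b.v}; }
};

Wrapped axpy(Wrapped a, Wrapped x, Wrapped y) {
  // under the footnote's reading, initializing the member of the a * x
  // temporary would force its excess precision to be discarded before the +
  return a * x + y;
}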
An alternative approach is to extend the set of values for a floating-point type beyond those that can even theoretically be stored in the memory that an object of that type occupies. The result of the subexpression in a * b + c (all doubles) might then have a value outside the set of values that can be stored in a double’s space in memory (typically the binary64 set); the choice of that value conveys the additional information needed to obtain the correctly rounded result of the overall expression as computed, of course, by an FMA instruction. Similar careful choices of value from a larger set might capture the bits (or finiteness) lost in a + b - b; for x87 implementations, the larger set is simply the values supported by the register format.
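For example (with values assumed here to exhibit the cancellation), strict binary64 evaluation of a + b - b loses a entirely, while an extended-precision or value-extended evaluation can preserve it:

#include <cstdio>

int main() {
  double a = 1, b = 0x1p53; // ulp(b) == 2, so a is below b's precision
  double r = a + b - b;     // strict binary64: a + b rounds to b, so r == 0
  std::printf("%g\n", r);   // 0 under binary64; 1 with x87 extended precision
}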
The crucial specification technology is the same as used for pointer provenance: the values are put in a many-to-one correspondence with the value representations of the type. (The presence of multiple rounding modes might require the formal duplication of values based on the representable value to which they round, but this matters only if the value representation is examined.) Note that every operation and object still has a single value. Aside from merely being tractable semantics, the stability of values prevents unfortunate practical results like taking both branches in
const double x = /* ... */, y = x + epsilon / 4;
if(x < y) { // x87 comparison
  // ...
}
// further operations that cause spilling...
if(x == y) { // binary64 comparison
  // ...
}
or failing the assert in
#include <cassert>

float id(float f) { return f; } // no computations
void call() {
  const float x = /* ... */;
  assert(3 * x == 3 * id(x));
}
Note the implication that if id is not inlined, x must be given a representable value for consistency; passing +x would avoid that coupling but might fail the assert.
For obvious practical reasons, a value that escapes past an optimization frontier cannot actually store information beyond its bit pattern. The stability requirement implies that any such value must be normalized to its “memory type” upon computation. However, even the member variables of wrapper objects can have an extended value if the optimizer sees their entire lifetime (as is typical for temporaries in the sorts of expressions we want to optimize freely), because they truly are members of the type. Similarly, assignment does not need the normalization effect described in the [expr.pre]/6 footnote; even a value updated in a loop may be extended so long as its intermediate values do not escape. Values passed to and returned from standard-library mathematical functions can also be extended.
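A sketch of that latitude (the function names are assumed for illustration): the intermediate values of sum may stay extended across iterations, and only the value that escapes to consume must be normalized:

void consume(double); // assumed opaque to the optimizer

void accumulate(const double* data, int n) {
  double sum = 0;
  for (int i = 0; i != n; ++i)
    sum += data[i]; // assignment need not discard excess precision
  consume(sum);     // the escaping value is normalized to its memory type
}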
As there are no opaque functions (merely insufficiently clever optimizers), it is only prudent to retain an explicit means of requiring normalization; static_cast is the obvious candidate (the other part of the footnote), although std::memcpy would also have the effect of selecting the canonical value associated with a value representation. (For pointers, std::memcpy needs to be able to preserve the abstract-machine information to prevent undefined behavior, but here it would be unnecessary and difficult to specify, since it does affect observable behavior.)
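Under the proposed semantics, both spellings below would therefore produce the canonical value (the function names are mine, for illustration):

#include <cstring>

double normalize_cast(double d) {
  return static_cast<double>(d); // proposed: rounds to the canonical value
}

double normalize_memcpy(double d) {
  double out;
  std::memcpy(&out, &d, sizeof d); // reading back the value representation
  return out;                      // selects the canonical value
}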
Modify the definition of trivially copyable types to allow floating-point types to have multiple values (one of which is canonical) per value representation. Specify that acquiring a value representation gives a floating-point object the corresponding canonical value.
Replace the “greater precision and range” provision ([expr.pre]/6) with a note about the contextual dependence of rounding. Specify that unary + can, and that static_cast does, round floating-point values to their corresponding canonical values.