1. Abstract
This proposal is the less evolutionary part of [P1468], and attempts to ultimately provide the same functionality as [P0192] in a way that we expect to be more acceptable to the committee than the previous attempt.
This paper introduces the notion of extended floating-point types, modeled after extended integer types. To accommodate them, this paper also attempts to rewrite the current rules for floating-point types, to enable well-defined interactions between all the floating-point types. The end goal of this paper, together with [P1468], is to have a language that enables <cstdint>-like aliases for implementation-specific floating-point types, which can model more binary layouts than a single fundamental type (the previously proposed short float) can provide for.
It also attempts to rewrite the existing specification for both the core language and the library so that it does not spell out all standard floating-point types every time.
2. Revision history
2.1. R0 -> R1 (pre-Cologne)
Applied SG6 guidance:
- Make the floating-point conversion rank not ordered between types with overlapping (but not subsetted) ranges of finite values. This makes the ranking a partial order.
- Narrowing conversions are now based on floating-point conversion rank instead of ranges of finite values, which preserves the current narrowing conversions relations between standard floating-point types; it also interacts favorably with the rank being a partial ordering.
- Operations that deal with floating-point types whose conversion ranks are unordered are now ill-formed.
- The relevant parts of the guidance have been applied to the library wording section as well.
Afterwards, applied suggestions from EWGI (this modifies some of the points above):
- Apply the suggestion to make types where one has a wider range of finite values, but a lower precision than the other, unordered in their conversion rank, and therefore make operations that mix them ill-formed. The motivating example was IEEE-754 binary16 and bfloat16; see Floating-point conversion rank for more details. This change also caused this paper to drop the term "range of finite values", since the modified semantics are better expressed in terms of sets of values of the types.
- Add a change to narrowing conversions, to only allow exact conversions to happen (see the last paragraph of Narrowing conversions).
- Explicitly list parts of the language that are not changed by this paper; provide a more detailed analysis of the standard library impact.
3. Motivation
The motivation for the general effort of this paper is the same as for [P0192]. The entire motivation is not repeated here, but the quick summary is that 16-bit floating-point support is becoming more widely available, both in hardware (ARM CPUs and NVIDIA GPUs) and software (OpenGL, CUDA, and LLVM IR). Providing a standard way for implementations to support 16-bit floating-point types will result in better code, more portable code, and wider use of those types.
The motivation for taking the currently proposed approach comes from the result of discussion on the previous paper. Several
people raised concerns about introducing just a single new fundamental type without a well-defined layout; those same people
were not satisfied with the option of having a dual ABI for that type when both IEEE-754 binary16 and bfloat16 are needed in the same application.
This paper legitimizes implementation-specific floating-point types, which makes standardizing an existing practice an additional motivation for solving the need in the way described below.
4. Proposed approach
In a nutshell:
- Introduce the notion of extended floating-point types.
- Redefine usual arithmetic conversions in terms of floating-point conversion rank, closely modeled after the integer equivalent.
- Redefine narrowing conversions for floating-point types in terms of the floating-point conversion rank.
- Rewrite parts of the standard library spec as appropriate to use the new floating-point terms and rules.
4.1. Aspects of the core language we aren’t proposing to change
Implementations currently define whether or not each type supports infinity and NaN. This paper does not change that, still leaving those decisions up to the implementation.
Implementations currently define the radix of the exponent of each floating-point type. This paper does not change that, still leaving those decisions up to the implementation.
4.2. Finer design details
Here’s a list of the details of the design of this paper that we think are important; we’d like guidance on whether the committee likes the decision we’ve made, or if a change to them is requested; please consider them as proposed polls to determine that.
4.2.1. Floating-point conversion rank
The standard has always defined conversion rank for integral types. This paper extends the notion of conversion rank to floating-point types.
R0 of this paper used the range of finite values to determine conversion rank, the rationale being that conversions from types with narrower ranges (excluding infinities) to types with wider ranges should be preferred, even if there is some loss of precision. The paper provided a total order for floating-point types, with an implementation-defined order for types whose ranges do not have a proper subset relationship.
In Kona, SG6 recommended leaving types unordered when neither type’s range of finite values is a subset of the other’s. Any operation that mixes unordered types would be ill-formed. This would leave the door open for inventing semantics for such operations in the future.
Later in Kona, during EWGI discussions, a concern was raised about conversions from a lower-ranked type to a higher-ranked type (as
defined in R0) that would cause a loss in precision. The concern is that it is not clear how to handle types where there is a proper
subset between their ranges of finite values, but not a matching subset relationship between the sets of finite values in that range.
The specific case where this would happen today is between IEEE-754 binary16 and bfloat16. bfloat16, with an 8-bit exponent, has a much greater range than binary16, with a 5-bit exponent, making conversions from binary16 to bfloat16 implicit according to the rules in R0.
This implicit conversion was worrisome because it results in a significant loss of precision, going from an 11-bit mantissa to an 8-bit mantissa.
It has been suggested that the paper be revised so that implicit conversions do not result in a loss of precision.
These issues are resolved in this revision of the paper by changing the definition of floating-point conversion rank to use the set of
values rather than the range of finite values. If two types have sets of values where neither set is a subset of the other, then the types
are unordered by conversion rank and neither type will be converted into the other during the usual arithmetic conversions.
Mixing unordered types in an operation is ill-formed. Conversions between unordered types are still possible with an explicit cast such as static_cast, but there won’t be any implicit conversions in either direction.
(The use of sets of values rather than sets of finite values is intentional. When using ranges, infinities get in the way because all types that support infinity have the same range; it was the finite ranges that were more interesting. But when using sets of values rather than ranges, infinity is just another value. So the word finite is no longer needed when defining conversion rank.)
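As an illustration only (not part of the proposed wording), the subset-based ordering can be sketched in portable C++. The Format struct, the helper names, and the simplifying assumption that every type is a radix-2, IEEE-754-style format with subnormals are ours; under that assumption one set of values is a subset of another exactly when the precision, the maximum exponent, and the smallest subnormal exponent all fit.

```cpp
// Hypothetical description of a radix-2, IEEE-754-style format (an assumption
// made for this sketch; real implementations may define other layouts).
struct Format {
    int exponent_bits;  // width of the exponent field
    int precision;      // significand bits, including the implicit leading bit
};

// Largest unbiased exponent of a normal value.
constexpr int emax(Format f) { return (1 << (f.exponent_bits - 1)) - 1; }

// Exponent of the smallest positive subnormal value.
constexpr int min_subnormal_exp(Format f) {
    return (1 - emax(f)) - (f.precision - 1);
}

// True if every value of `a` is also a value of `b` under the assumptions above.
constexpr bool is_subset(Format a, Format b) {
    return a.precision <= b.precision
        && emax(a) <= emax(b)
        && min_subnormal_exp(a) >= min_subnormal_exp(b);
}

enum class RankOrder { lower, higher, same_values, unordered };

// Relative conversion rank of `a` with respect to `b`, derived from sets of values.
constexpr RankOrder compare_rank(Format a, Format b) {
    const bool a_in_b = is_subset(a, b);
    const bool b_in_a = is_subset(b, a);
    if (a_in_b && b_in_a) return RankOrder::same_values;
    if (a_in_b) return RankOrder::lower;
    if (b_in_a) return RankOrder::higher;
    return RankOrder::unordered;
}

constexpr Format binary16{5, 11};
constexpr Format bfloat16{8, 8};
constexpr Format binary32{8, 24};

// binary16 has more precision, bfloat16 has more range: neither is a subset.
static_assert(compare_rank(binary16, bfloat16) == RankOrder::unordered, "");
// Both 16-bit formats are subsets of binary32.
static_assert(compare_rank(binary16, binary32) == RankOrder::lower, "");
static_assert(compare_rank(bfloat16, binary32) == RankOrder::lower, "");
```

In this model, mixing binary16 and bfloat16 operands corresponds to the unordered case, which the paper makes ill-formed.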
4.2.2. Narrowing conversions
Revision R0 of this paper proposed a definition of narrowing conversion that was not based on conversion rank and that allowed a conversion from one floating-point type to another to be a non-narrowing conversion if the two types had the same representation. SG6 in Kona didn’t like that idea, and that approach has been abandoned in favor of what was originally presented as an alternative.
This paper now defines a narrowing conversion as a conversion from a type with higher floating-point conversion rank to one with a lower conversion rank, or as a conversion between two types that are unordered by conversion rank. This preserves the existing behavior for standard floating-point types while extending that behavior to extended floating-point types in a consistent way.
A topic of discussion for the committee is what to do when attempting a narrowing conversion where the source is a constant expression. The paper currently leaves unchanged the wording in [dcl.init.list] p7.2: "except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly)." This behavior cannot be changed for standard floating-point types, but it might be reasonable to mandate that the value must be represented exactly when converting to an extended floating-point type.
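The existing behavior for standard floating-point types, which the rank-based definition is meant to preserve, can be observed with a small detection trait. This is a sketch: is_non_narrowing is a hypothetical helper of ours, and it relies on narrowing in list-initialization being a substitution failure.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical trait: true when list-initializing To from a non-constant
// value of type From is well-formed, i.e. the conversion is not narrowing.
template <class From, class To, class = void>
struct is_non_narrowing : std::false_type {};

template <class From, class To>
struct is_non_narrowing<From, To,
    std::void_t<decltype(To{std::declval<From>()})>> : std::true_type {};

// Lower rank to higher rank: not narrowing.
static_assert(is_non_narrowing<float, double>::value, "");
static_assert(is_non_narrowing<double, long double>::value, "");

// Higher rank to lower rank: narrowing, regardless of representation.
static_assert(!is_non_narrowing<double, float>::value, "");
static_assert(!is_non_narrowing<long double, double>::value, "");

// The constant-expression exception quoted above: 0.1 is within the range
// of float, so this is accepted even though the value is not exact.
constexpr float tenth{0.1};
```

The last declaration is the case under discussion: for an extended floating-point destination, the committee could choose to require an exact result instead.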
4.3. Standard library impact
Specification changes in the standard library will be required in:
- <cmath>: because operations on smaller floating-point types are the primary motivation for this feature.
- <complex>: for the same reason.
- <charconv>: because there should be some support for I/O of extended floating-point types, and because the existing specification already supports extended integer types.
- <format>: once [P0645] is adopted.
There are parts of the standard library that mention floating-point types collectively rather than listing float, double, and long double explicitly.
The implementations of those things might need to change to handle extended floating-point types, but no specification changes are necessary.
- std::numeric_limits,
- std::is_floating_point,
- std::midpoint.
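Generic code written against these collective facilities names no concrete type, which is why no specification change is needed. A minimal sketch follows; significand_digits is a hypothetical helper of ours, and whether it accepts a given extended type depends on the implementation updating the traits.

```cpp
#include <limits>
#include <type_traits>

// Hypothetical helper: number of radix digits in the significand of T.
// It mentions floating-point types only collectively, through the traits.
template <class T>
constexpr int significand_digits() {
    static_assert(std::is_floating_point<T>::value,
                  "T must be a floating-point type");
    return std::numeric_limits<T>::digits;
}

// Works for every standard floating-point type without listing them.
static_assert(significand_digits<double>() >= significand_digits<float>(), "");
static_assert(significand_digits<long double>() >= significand_digits<double>(), "");
```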
Intentional omissions in standard library support:
- The header <cfloat> provides a set of C-style macros informing of the properties of float, double and long double. Since this paper does not introduce any new standard floating-point types, no changes to this header are proposed.
- No streaming operations are supported for extended floating-point types. Properly (that is, without losing precision and/or range of values) supporting streaming operations would require support in num_get and num_put; those classes use virtual functions, so adding an extended floating-point type would necessitate an ABI break for the standard library.
- std::stof and family, because no new standard floating-point types are introduced.
- std::to_string family. They are defined in terms of snprintf; we do not propose changing the legacy C formatting facilities for this feature.
- [rand.req], for consistency with extended integer types.
- The header <atomic> provides specializations for std::atomic and std::atomic_ref for float, double and long double. This list will be expanded, similarly to how it currently includes all other types necessary for aliases in <cstdint>, in the companion paper of this paper, [P1468], which proposes a close analogue for <cstdint>.
5. Proposed wording
The wording changes in this paper are relative to N4810.
5.1. Core language
Modify Fundamental types [basic.fundamental] paragraph 12:
There are three standard floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of standard floating-point types is implementation-defined. There may also be implementation-defined extended floating-point types. The standard and extended floating-point types are collectively called floating-point types. [...]
Rename Integer conversion rank [conv.rank] to Conversion ranks and insert a new paragraph at the end:
Every floating-point type has a floating-point conversion rank defined as follows:
(2.1) The rank of a floating-point type T shall be greater than the rank of any floating-point type whose set of values is a proper subset of the set of values of T.
(2.2) The rank of long double shall be greater than the rank of double, which shall be greater than the rank of float.
(2.3) The rank of any standard floating-point type shall be greater than the rank of any extended floating-point type with the same set of values.
(2.4) The rank of any extended floating-point type relative to another extended floating-point type with the same set of values is implementation-defined, but still subject to the other rules for determining the floating-point conversion rank.
(2.5) For all floating-point types T1, T2 and T3, if T1 has greater rank than T2 and T2 has greater rank than T3, then T1 shall have greater rank than T3.
[ Note: The conversion ranks of extended floating-point types T1 and T2 will be unordered if the set of values of T1 is neither a subset nor a superset of the set of values of T2. This can happen when one type has both a larger range and a lower precision than the other. -- end note ]
[ Note: The floating-point conversion rank is used in the definition of the usual arithmetic conversions ([expr.arith.conv]). -- end note ]
Modify Floating-point promotion [conv.fpprom] paragraph 1, replacing the restriction to type float with a rank-based condition:
A prvalue of a floating-point type whose floating-point conversion rank ([conv.rank]) is less than the rank of double can be converted to a prvalue of type double. The value is unchanged.
Modify Usual arithmetic conversions [expr.arith.conv] paragraph 1:
(1.1) If either operand is of scoped enumeration type, no conversions are performed; if the other operand does not have the same type, the expression is ill-formed.
Remove items (1.2) through (1.4):
(1.2) If either operand is of type long double, the other shall be converted to long double.
(1.3) Otherwise, if either operand is double, the other shall be converted to double.
(1.4) Otherwise, if either operand is float, the other shall be converted to float.
Insert in their place:
(1.2) Otherwise, if either operand has a floating-point type, the following rules shall be applied:
(1.2.1) If both operands have the same type, no further conversion is needed.
(1.2.2) Otherwise, if one of the operands has a type that is not a floating-point type, that operand shall be converted to the type of the operand with floating-point type.
(1.2.3) Otherwise, if the floating-point conversion ranks ([conv.rank]) of the types of the operands are ordered, then the operand with the type of the lower floating-point conversion rank shall be converted to the type of the other operand.
(1.2.4) Otherwise, the expression is ill-formed.
Renumber (1.5) to (1.3):
(1.3) Otherwise, the integral promotions [...]
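For the standard floating-point types the rewritten rules are meant to give the same results as today. A minimal check of that current behavior, using a hypothetical alias arith_result_t of ours:

```cpp
#include <type_traits>
#include <utility>

// Hypothetical alias: the type of a binary arithmetic expression after the
// usual arithmetic conversions have been applied to operands of type A and B.
template <class A, class B>
using arith_result_t = decltype(std::declval<A>() + std::declval<B>());

// (1.2.1) Same type: no further conversion.
static_assert(std::is_same<arith_result_t<float, float>, float>::value, "");

// (1.2.2) A non-floating-point operand is converted to the floating-point type.
static_assert(std::is_same<arith_result_t<int, float>, float>::value, "");
static_assert(std::is_same<arith_result_t<long long, double>, double>::value, "");

// (1.2.3) Ordered ranks: the lower-ranked operand is converted.
static_assert(std::is_same<arith_result_t<float, double>, double>::value, "");
static_assert(std::is_same<arith_result_t<double, long double>, long double>::value, "");
```

Case (1.2.4) cannot be shown with standard types alone, since their ranks are totally ordered; it only arises between extended types such as binary16 and bfloat16.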
Modify the definition of narrowing conversions in List-initialization [dcl.init.list] paragraph 7 item 2:
Remove from (7.2): "from long double to double or float, or from double to float". The item then reads:
(7.2) from a floating-point type T to another floating-point type whose floating-point conversion rank is not greater than that of T, except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly), or
5.2. Library
Modify Header <charconv> synopsis [charconv.syn]:
    [...]
    to_chars_result to_chars(char* first, char* last, see below value, int base = 10);

    // Removed by this paper:
    to_chars_result to_chars(char* first, char* last, float value);
    to_chars_result to_chars(char* first, char* last, double value);
    to_chars_result to_chars(char* first, char* last, long double value);
    to_chars_result to_chars(char* first, char* last, float value, chars_format fmt);
    to_chars_result to_chars(char* first, char* last, double value, chars_format fmt);
    to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt);
    to_chars_result to_chars(char* first, char* last, float value, chars_format fmt, int precision);
    to_chars_result to_chars(char* first, char* last, double value, chars_format fmt, int precision);
    to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt, int precision);

    // Added by this paper:
    to_chars_result to_chars(char* first, char* last, see below value);
    to_chars_result to_chars(char* first, char* last, see below value, chars_format fmt);
    to_chars_result to_chars(char* first, char* last, see below value, chars_format fmt, int precision);
    [...]
    from_chars_result from_chars(const char* first, const char* last, see below& value, int base = 10);

    // Removed by this paper:
    from_chars_result from_chars(const char* first, const char* last, float& value, chars_format fmt = chars_format::general);
    from_chars_result from_chars(const char* first, const char* last, double& value, chars_format fmt = chars_format::general);
    from_chars_result from_chars(const char* first, const char* last, long double& value, chars_format fmt = chars_format::general);

    // Added by this paper:
    from_chars_result from_chars(const char* first, const char* last, see below& value, chars_format fmt = chars_format::general);
    [...]
Modify Primitive numeric output conversion [charconv.to.chars]:
    [...]
    // Removed by this paper:
    to_chars_result to_chars(char* first, char* last, float value);
    to_chars_result to_chars(char* first, char* last, double value);
    to_chars_result to_chars(char* first, char* last, long double value);
    // Added by this paper:
    to_chars_result to_chars(char* first, char* last, see below value);
Effects: value is converted to a string in the style of printf in the "C" locale. The conversion specifier is f or e, chosen according to the requirement for a shortest representation (see above); a tie is resolved in favor of f.
Throws: Nothing.
Remarks (added): The implementation shall provide overloads for all floating-point types as the type of parameter value.
    // Removed by this paper:
    to_chars_result to_chars(char* first, char* last, float value, chars_format fmt);
    to_chars_result to_chars(char* first, char* last, double value, chars_format fmt);
    to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt);
    // Added by this paper:
    to_chars_result to_chars(char* first, char* last, see below value, chars_format fmt);
Requires: fmt has the value of one of the enumerators of chars_format.
Effects: value is converted to a string in the style of printf in the "C" locale.
Throws: Nothing.
Remarks (added): The implementation shall provide overloads for all floating-point types as the type of parameter value.
    // Removed by this paper:
    to_chars_result to_chars(char* first, char* last, float value, chars_format fmt, int precision);
    to_chars_result to_chars(char* first, char* last, double value, chars_format fmt, int precision);
    to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt, int precision);
    // Added by this paper:
    to_chars_result to_chars(char* first, char* last, see below value, chars_format fmt, int precision);
Requires: fmt has the value of one of the enumerators of chars_format.
Effects: value is converted to a string in the style of printf in the "C" locale with the given precision.
Throws: Nothing.
Remarks (added): The implementation shall provide overloads for all floating-point types as the type of parameter value.
Modify Primitive numeric input conversions [charconv.from.chars]:
    [...]
    // Removed by this paper:
    from_chars_result from_chars(const char* first, const char* last, float& value, chars_format fmt = chars_format::general);
    from_chars_result from_chars(const char* first, const char* last, double& value, chars_format fmt = chars_format::general);
    from_chars_result from_chars(const char* first, const char* last, long double& value, chars_format fmt = chars_format::general);
    // Added by this paper:
    from_chars_result from_chars(const char* first, const char* last, see below& value, chars_format fmt = chars_format::general);
Requires: fmt has the value of one of the enumerators of chars_format.
Effects: The pattern is the expected form of the subject sequence in the "C" locale, as described for strtod, except that
(7.1) the sign '+' may only appear in the exponent part;
(7.2) if fmt has chars_format::scientific set but not chars_format::fixed, the otherwise optional exponent part shall appear;
(7.3) if fmt has chars_format::fixed set but not chars_format::scientific, the optional exponent part shall not appear; and
(7.4) if fmt is chars_format::hex, the prefix "0x" or "0X" is assumed. [ Example: The string 0x123 is parsed to have the value 0 with remaining characters x123. -- end example ]
In any case, the resulting value is one of at most two floating-point values closest to the value of the string matching the pattern.
Throws: Nothing.
Remarks (added): The implementation shall provide overloads for all floating-point types as the type of parameter value.
Modify Complex numbers [complex.numbers] paragraph 2, replacing "other than float, double, or long double" and "The specializations complex<float>, complex<double>, and complex<long double>":
The effect of instantiating the template complex for any type that is not a floating-point type is unspecified. The specializations of complex for floating-point types are literal types.
Modify Header <complex> synopsis [complex.syn]:
    [...]
    // Removed by this paper:
    // [complex.special], specializations
    template<> class complex<float>;
    template<> class complex<double>;
    template<> class complex<long double>;
Modify Class template complex [complex]:
    namespace std {
      template<class T> class complex {
      public:
        using value_type = T;

        constexpr complex(const T& re = T(), const T& im = T());
        // Removed by this paper:
        constexpr complex(const complex&);
        template<class X> constexpr complex(const complex<X>&);
        // Added by this paper:
        constexpr complex(const complex&) = default;
        template<class X> constexpr explicit(see below) complex(const complex<X>& other);

        constexpr T real() const;
        constexpr void real(T);
        constexpr T imag() const;
        constexpr void imag(T);

        constexpr complex& operator= (const T&);
        constexpr complex& operator+=(const T&);
        constexpr complex& operator-=(const T&);
        constexpr complex& operator*=(const T&);
        constexpr complex& operator/=(const T&);

        constexpr complex& operator=(const complex&);
        template<class X> constexpr complex& operator= (const complex<X>&);
        template<class X> constexpr complex& operator+=(const complex<X>&);
        template<class X> constexpr complex& operator-=(const complex<X>&);
        template<class X> constexpr complex& operator*=(const complex<X>&);
        template<class X> constexpr complex& operator/=(const complex<X>&);
      };
    }
Remove Specializations [complex.special]:
    namespace std {
      template<> class complex<float> {
      public:
        using value_type = float;

        constexpr complex(float re = 0.0f, float im = 0.0f);
        constexpr complex(const complex<float>&) = default;
        constexpr explicit complex(const complex<double>&);
        constexpr explicit complex(const complex<long double>&);

        constexpr float real() const;
        constexpr void real(float);
        constexpr float imag() const;
        constexpr void imag(float);

        constexpr complex& operator= (float);
        constexpr complex& operator+=(float);
        constexpr complex& operator-=(float);
        constexpr complex& operator*=(float);
        constexpr complex& operator/=(float);

        constexpr complex& operator=(const complex&);
        template<class X> constexpr complex& operator= (const complex<X>&);
        template<class X> constexpr complex& operator+=(const complex<X>&);
        template<class X> constexpr complex& operator-=(const complex<X>&);
        template<class X> constexpr complex& operator*=(const complex<X>&);
        template<class X> constexpr complex& operator/=(const complex<X>&);
      };

      template<> class complex<double> {
      public:
        using value_type = double;

        constexpr complex(double re = 0.0, double im = 0.0);
        constexpr complex(const complex<float>&);
        constexpr complex(const complex<double>&) = default;
        constexpr explicit complex(const complex<long double>&);

        constexpr double real() const;
        constexpr void real(double);
        constexpr double imag() const;
        constexpr void imag(double);

        constexpr complex& operator= (double);
        constexpr complex& operator+=(double);
        constexpr complex& operator-=(double);
        constexpr complex& operator*=(double);
        constexpr complex& operator/=(double);

        constexpr complex& operator=(const complex&);
        template<class X> constexpr complex& operator= (const complex<X>&);
        template<class X> constexpr complex& operator+=(const complex<X>&);
        template<class X> constexpr complex& operator-=(const complex<X>&);
        template<class X> constexpr complex& operator*=(const complex<X>&);
        template<class X> constexpr complex& operator/=(const complex<X>&);
      };

      template<> class complex<long double> {
      public:
        using value_type = long double;

        constexpr complex(long double re = 0.0L, long double im = 0.0L);
        constexpr complex(const complex<float>&);
        constexpr complex(const complex<double>&);
        constexpr complex(const complex<long double>&) = default;

        constexpr long double real() const;
        constexpr void real(long double);
        constexpr long double imag() const;
        constexpr void imag(long double);

        constexpr complex& operator= (long double);
        constexpr complex& operator+=(long double);
        constexpr complex& operator-=(long double);
        constexpr complex& operator*=(long double);
        constexpr complex& operator/=(long double);

        constexpr complex& operator=(const complex&);
        template<class X> constexpr complex& operator= (const complex<X>&);
        template<class X> constexpr complex& operator+=(const complex<X>&);
        template<class X> constexpr complex& operator-=(const complex<X>&);
        template<class X> constexpr complex& operator*=(const complex<X>&);
        template<class X> constexpr complex& operator/=(const complex<X>&);
      };
    }
Modify Member functions [complex.members] by inserting the following after paragraph 2:
    template<class X> constexpr explicit(see below) complex(const complex<X>& other);
Effects: Constructs an object of class complex.
Ensures: real() == other.real() && imag() == other.imag().
Remarks: The expression inside explicit evaluates to false if and only if the floating-point conversion rank of T is greater than the floating-point conversion rank of X.
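The conditionally explicit constructor can be approximated in C++17 with a pair of constrained constructors. The sketch below is an assumption-laden model, not the proposed library text: cplx and fp_rank are hypothetical names, and fp_rank assigns ranks to the three standard types only.

```cpp
#include <type_traits>

// Hypothetical rank trait covering only the standard floating-point types.
template <class T> struct fp_rank;
template <> struct fp_rank<float>       : std::integral_constant<int, 1> {};
template <> struct fp_rank<double>      : std::integral_constant<int, 2> {};
template <> struct fp_rank<long double> : std::integral_constant<int, 3> {};

// Toy stand-in for std::complex, reduced to what the constructor rule needs.
template <class T>
class cplx {
public:
    constexpr cplx(const T& re = T(), const T& im = T()) : re_(re), im_(im) {}

    // Implicit when the rank of T is greater than the rank of X.
    template <class X,
              std::enable_if_t<(fp_rank<T>::value > fp_rank<X>::value), int> = 0>
    constexpr cplx(const cplx<X>& other)
        : re_(other.real()), im_(other.imag()) {}

    // Explicit in every other case.
    template <class X,
              std::enable_if_t<!(fp_rank<T>::value > fp_rank<X>::value), int> = 0>
    constexpr explicit cplx(const cplx<X>& other)
        : re_(static_cast<T>(other.real())), im_(static_cast<T>(other.imag())) {}

    constexpr T real() const { return re_; }
    constexpr T imag() const { return im_; }

private:
    T re_;
    T im_;
};

// Widening is implicit; narrowing needs an explicit construction.
static_assert(std::is_convertible<cplx<float>, cplx<double>>::value, "");
static_assert(!std::is_convertible<cplx<double>, cplx<float>>::value, "");
static_assert(std::is_constructible<cplx<float>, cplx<double>>::value, "");
```

With a partial order, two unordered types fall into the explicit branch in both directions, because neither rank is greater than the other.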
Modify Additional overloads [cmplx.over] paragraph 2 and 3:
The additional overloads shall be sufficient to ensure:
Remove:
(2.1) If the argument has type long double, then it is effectively cast to complex<long double>.
(2.2) Otherwise, if the argument has type double or an integer type, then it is effectively cast to complex<double>.
(2.3) Otherwise, if the argument has type float, then it is effectively cast to complex<float>.
Insert:
(2.1) If the argument has a floating-point type T, then it is effectively cast to complex<T>.
(2.2) Otherwise, if the argument has an integer type, then it is effectively cast to complex<double>.
Function template pow shall have additional overloads sufficient to ensure, for a call with at least one argument of type complex<T>:
Remove:
(3.1) If either argument has type complex<long double> or type long double, then both arguments are effectively cast to complex<long double>.
(3.2) Otherwise, if either argument has type complex<double>, double, or an integer type, then both arguments are effectively cast to complex<double>.
(3.3) Otherwise, if either argument has type complex<float> or float, then both arguments are effectively cast to complex<float>.
Insert:
(3.1) If the type of one of the arguments is complex<T1> and the type of the other is complex<T2>, then both arguments are effectively cast to complex<TR>, where TR is T1 if T1 has a higher floating-point conversion rank than T2, otherwise T2. If the floating-point conversion ranks of T1 and T2 are not ordered, the program is ill-formed.
(3.2) Otherwise, if the type of one of the arguments is complex<T1> and the type of the other is a floating-point type T2, then both arguments are effectively cast to complex<TR>, where TR is T1 if T1 has a higher floating-point conversion rank than T2, otherwise T2.
(3.3) Otherwise, both arguments are effectively cast to complex<T>.
Modify Header <cmath> synopsis [cmath.syn] paragraph 2 and add paragraph 3:
For each set of overloaded functions within <cmath>, with the exception of abs, there shall be additional overloads sufficient to ensure:
Remove:
- If any argument of arithmetic type corresponding to a double parameter has type long double, then all arguments of arithmetic type corresponding to double parameters are effectively cast to long double.
- Otherwise, if any argument of arithmetic type corresponding to a double parameter has type double or an integer type, then all arguments of arithmetic type corresponding to double parameters are effectively cast to double.
Insert:
- If all arguments of arithmetic types corresponding to double parameters have floating-point types, then all arguments of arithmetic type corresponding to double parameters have type that is the type among the argument types with the highest floating-point conversion rank. If that type is an extended floating-point type, then the return type is also that type. If any two types T1 and T2 among the arithmetic type arguments have floating-point conversion ranks that are not ordered, the program is ill-formed.
- Otherwise, if any argument of arithmetic type corresponding to a double parameter has a floating-point type, then all arguments of arithmetic type corresponding to double parameters are effectively cast to that of parameters of floating-point type that is the type with the highest floating-point conversion rank among those of argument types that are floating-point.
Unchanged:
- Otherwise, all arguments of arithmetic type corresponding to double parameters have type float.
Insert as paragraph 3:
There shall be additional overloads of abs for each extended floating-point type T. Those overloads shall have the signature T abs(T j) and return the absolute value of j.
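The behavior that this wording tries to capture can be checked for the standard types with the existing std::pow overloads. In the sketch below, pow_result_t and highest_rank_t are hypothetical aliases of ours, and highest_rank_t models only mixes of floating-point arguments.

```cpp
#include <cmath>
#include <type_traits>
#include <utility>

// Hypothetical alias: return type selected by the <cmath> overload set.
template <class A, class B>
using pow_result_t = decltype(std::pow(std::declval<A>(), std::declval<B>()));

// Hypothetical alias: for two standard floating-point types, the type with
// the highest conversion rank (obtained here via the usual arithmetic conversions).
template <class A, class B>
using highest_rank_t = decltype(std::declval<A>() + std::declval<B>());

// All arguments are floating-point: the highest-ranked argument type is used.
static_assert(std::is_same<pow_result_t<float, float>, float>::value, "");
static_assert(std::is_same<pow_result_t<float, long double>,
                           highest_rank_t<float, long double>>::value, "");
static_assert(std::is_same<pow_result_t<double, float>,
                           highest_rank_t<double, float>>::value, "");

// Current behavior when an integer argument is mixed with float: double.
static_assert(std::is_same<pow_result_t<float, int>, double>::value, "");
```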
Note: LWG question: should the signatures be somehow added to the synopsis itself?
Note: We have tried to capture what the current specification says, without having to add three identical items into this wording to cover, respectively, EFPTs ranked above the standard floating-point types, EFPTs ranked between them, and EFPTs ranked below them. We don’t have anything against reverting to that, but wanted to try this more generic way of describing the behavior.
Note: We are pretty sure this new paragraph 3 is not the way to spell it, so we will welcome any suggestions.