1. Abstract
This proposal is the less evolutionary part of [P1468], which ultimately attempts to provide the same functionality as [P0192] in a way that we expect to be more acceptable to the committee than the previous attempt.
This paper introduces the notion of extended floating-point types, modeled after extended integer types. To accommodate them, this paper also attempts to rewrite the current rules for floating-point types, to enable well-defined interactions between all the floating-point types. The end goal of this paper, together with [P1468], is a language that enables <cstdint>-like aliases for implementation-specific floating-point types, which can model more binary layouts than a single fundamental type (the previously proposed short float) can provide for.
It also attempts to rewrite the existing specification for both the core language and the library so that it does not spell out all the standard floating-point types every time.
2. Motivation
The motivation for the general effort of this paper is the same as for [P0192], so we decided to avoid repeating it here, for brevity.
The motivation for taking the currently proposed approach comes from the discussion of the previous paper. Several people raised concerns about introducing just a single new fundamental type whose layout is not well defined; those same people were not satisfied with the option of having a dual ABI for that type when, for instance, both IEEE 754 binary16 and bfloat16 are needed in the same application.
This paper legitimizes implementation-specific floating-point types, which makes standardizing an existing practice an additional motivation for solving the need in the way described below.
3. Proposed approach
In a nutshell:

Introduce the notion of extended floating-point types.

Redefine usual arithmetic conversions in terms of floating-point conversion rank, closely modeled after the integer equivalent.

Redefine narrowing conversions for floating-point types in terms of value ranges, instead of fixing them for the standard floating-point types.

Rewrite as much of the standard library specification as possible to use the new notion, wherever that makes sense.
3.1. Finer design details
Here’s a list of the design details of this paper that we think are important. We’d like guidance on whether the committee likes the decisions we’ve made or would request a change to them; please consider them as proposed polls to determine that.
3.1.1. Floating-point conversion rank
At this time, the paper uses the range of finite values of a given floating-point type to determine the conversion rank. This is motivated by the fact that converting a value to a type that can’t represent it is undefined behavior. It is implementation-defined whether a floating-point type can represent infinities; if it can, the UB goes away, but we think that the range of finite values is still the useful way to determine the rank, even when the range of values is the entire set of real numbers. There are probably other acceptable behaviors, but this seems to be the most acceptable of them to the authors.
Since the definition this paper gives orders types by the relation of the ranges of finite values of different types, we included an item for when two types have ranges of finite values that are neither a subset nor a superset of each other. This doesn’t seem necessary in reality, but we decided to include it for completeness of the rules.
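As a rough illustration of the subset relation on ranges of finite values, the following sketch compares the endpoints reported by std::numeric_limits. The helper name finite_range_is_subset is hypothetical and not part of the proposed wording, and using long double as the common comparison type is an assumption that only holds for the standard floating-point types.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper (illustration only, not proposed wording):
// true when the range of finite values of From lies within that of To.
template <class From, class To>
constexpr bool finite_range_is_subset() {
    // Assumption: long double can represent the endpoints of both types,
    // which is the case for float, double and long double.
    using Common = long double;
    return static_cast<Common>(std::numeric_limits<To>::lowest()) <=
               static_cast<Common>(std::numeric_limits<From>::lowest()) &&
           static_cast<Common>(std::numeric_limits<From>::max()) <=
               static_cast<Common>(std::numeric_limits<To>::max());
}
```

On an implementation where double has a wider range of finite values than float, finite_range_is_subset<float, double>() is true and the reverse is false, so double would receive the greater rank.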
3.1.2. Narrowing conversions
This paper proposes to change the rules of narrowing conversions in a way that may change which expressions are well-formed or ill-formed on systems where float and double, and/or double and long double, have the same size and layout.
Currently, the rule for narrowing conversions reads: a conversion from long double to double or float, or from double to float, is a narrowing conversion. After the proposed change, that will only be the case if those types have different ranges of finite values. This change is made to simplify the rules: a rule that determines whether a conversion is narrowing based on the range of finite values is necessary for extended floating-point types, so it needs to appear in the text, and we decided to change the old rule and unify it with the new one. The situation where the two rules give a different result seems strange enough to justify this decision.
There’s another possible approach: to define that a floating-point conversion from a type with a higher floating-point conversion rank to a type with a lower floating-point conversion rank is always narrowing. This mostly follows the rule above, but it preserves the current narrowing conversion relations between standard floating-point types. This is not the approach currently worded by this paper, but we have no objection to moving to it if the committee prefers it.
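To make the discussion concrete, the sketch below detects whether list-initialization accepts a conversion between two types. The trait name list_initializable is hypothetical and the code assumes C++17 (for std::void_t). It shows the current rules, under which long double to double is rejected as narrowing regardless of layout; under the wording proposed here, the answer would instead depend on the ranges of finite values.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical trait (illustration only): true when To{from} is well-formed,
// i.e. list-initialization does not reject the conversion as narrowing.
template <class From, class To, class = void>
struct list_initializable : std::false_type {};

template <class From, class To>
struct list_initializable<From, To,
                          std::void_t<decltype(To{std::declval<From>()})>>
    : std::true_type {};

// Widening is accepted; the conversions named by the current rule are not.
static_assert(list_initializable<float, double>::value, "float -> double");
static_assert(!list_initializable<double, float>::value, "double -> float");
static_assert(!list_initializable<long double, double>::value,
              "long double -> double");
```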
3.1.3. Support throughout the library
Extended floating-point types are supported in some parts of the library, namely: <cmath> (because having access to operations on shorter floats is the entire point of this feature), <complex> (for the same reason), and <charconv> (because some way of doing I/O should be available for them, and because the existing specification already supports extended integer types). They are not supported in num_put and num_get, because (a) properly supporting them would require an ABI break (and then another one every time the implementation adds an extended floating-point type) and (b) extended integer types are not supported there.
Similarly, no stream support is included in this paper.
4. Proposed wording
The wording changes in this paper are relative to N4791.
4.1. Core language
Modify Fundamental types [basic.fundamental] paragraph 12:
There are three standard floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of standard floating-point types is implementation-defined. There may also be implementation-defined extended floating-point types. The range from the lowest finite value representable by a floating-point type to the maximum finite value representable by that type is called the range of finite values of that type. The standard and extended floating-point types are collectively called floating-point types. [...]
Rename Integer conversion rank [conv.rank] to Conversion ranks and insert a new paragraph at the end:
Every floating-point type has a floating-point conversion rank defined as follows:
(2.1) The rank of a floating-point type T shall be greater than the rank of any floating-point type whose range of finite values is a subset of the range of finite values of T.
(2.2) The rank of long double shall be greater than the rank of double, which shall be greater than the rank of float.
(2.3) The rank of any standard floating-point type shall be greater than the rank of any extended floating-point type with the same range of finite values.
(2.4) The rank of any extended floating-point type relative to another extended floating-point type with the same range of values is implementation-defined, but still subject to the other rules for determining the floating-point conversion rank.
(2.5) For extended floating-point types T1 and T2, if the range of finite values of T1 is neither a subset nor a superset of the range of finite values of T2, the rank of T1 relative to T2 is implementation-defined.
(2.6) For all floating-point types T1, T2 and T3, if T1 has greater rank than T2 and T2 has greater rank than T3, then T1 shall have greater rank than T3.
[ Note: The floating-point conversion rank is used in the definition of the usual arithmetic conversions ([expr.arith.conv]). - end note ]
Modify Floating-point promotion [conv.fpprom] paragraph 1:
A prvalue of [removed: type float] [added: a floating-point type whose floating-point conversion rank ([conv.rank]) is less than the rank of double] can be converted to a prvalue of type double. The value is unchanged.
Modify Usual arithmetic conversions [expr.arith.conv] paragraph 1:
(1.1) If either operand is of scoped enumeration type, no conversions are performed; if the other operand does not have the same type, the expression is ill-formed.
[removed] (1.2) If either operand is of type long double, the other shall be converted to long double.
[removed] (1.3) Otherwise, if either operand is double, the other shall be converted to double.
[removed] (1.4) Otherwise, if either operand is float, the other shall be converted to float.
[added] (1.2) Otherwise, if either operand has a floating-point type, the following rules shall be applied:
[added]  (1.2.1) If both operands have the same type, no further conversion is needed.
[added]  (1.2.2) Otherwise, if one of the operands has a type that is not a floating-point type, that operand shall be converted to the type of the operand with floating-point type.
[added]  (1.2.3) Otherwise, the operand with the type of lesser floating-point conversion rank shall be converted to the type of the operand with greater rank.
(1.5) Otherwise, the integral promotions [...]
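For the standard floating-point types the rank-based rules above are meant to give the same results as the removed items. The following sketch checks a few of those results on an existing compiler; the alias arithmetic_result_t is a hypothetical name introduced only for this illustration.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical alias (illustration only): the type of a + b after the
// usual arithmetic conversions have been applied to both operands.
template <class A, class B>
using arithmetic_result_t = decltype(std::declval<A>() + std::declval<B>());

// The operand of lesser rank is converted to the type of greater rank.
static_assert(std::is_same<arithmetic_result_t<float, double>, double>::value,
              "float + double is double");
static_assert(std::is_same<arithmetic_result_t<double, long double>,
                           long double>::value,
              "double + long double is long double");
// A non-floating-point operand is converted to the floating-point type.
static_assert(std::is_same<arithmetic_result_t<int, float>, float>::value,
              "int + float is float");
```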
Modify the definition of narrowing conversions in List-initialization [dcl.init.list] paragraph 7 item 2:
(7.2) [removed: from long double to double or float, or from double to float] [added: from a floating-point type T to another floating-point type whose range of finite values is not a superset of the range of finite values of T], except where the source is a constant expression and the actual value after conversion is within the range of finite values that can be represented (even if it cannot be represented exactly), or
4.2. Library
Modify Header <charconv> synopsis [charconv.syn]:
[...]
to_chars_result to_chars(char* first, char* last, see below value, int base = 10);
to_chars_result to_chars(char* first, char* last, float value);                                        // removed
to_chars_result to_chars(char* first, char* last, double value);                                       // removed
to_chars_result to_chars(char* first, char* last, long double value);                                  // removed
to_chars_result to_chars(char* first, char* last, float value, chars_format fmt);                      // removed
to_chars_result to_chars(char* first, char* last, double value, chars_format fmt);                     // removed
to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt);                // removed
to_chars_result to_chars(char* first, char* last, float value, chars_format fmt, int precision);       // removed
to_chars_result to_chars(char* first, char* last, double value, chars_format fmt, int precision);      // removed
to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt, int precision); // removed
to_chars_result to_chars(char* first, char* last, see below value);                                    // added
to_chars_result to_chars(char* first, char* last, see below value, chars_format fmt);                  // added
to_chars_result to_chars(char* first, char* last, see below value, chars_format fmt, int precision);   // added
[...]
from_chars_result from_chars(const char* first, const char* last, see below& value, int base = 10);
from_chars_result from_chars(const char* first, const char* last, float& value, chars_format fmt = chars_format::general);       // removed
from_chars_result from_chars(const char* first, const char* last, double& value, chars_format fmt = chars_format::general);      // removed
from_chars_result from_chars(const char* first, const char* last, long double& value, chars_format fmt = chars_format::general); // removed
from_chars_result from_chars(const char* first, const char* last, see below& value, chars_format fmt = chars_format::general);   // added
[...]
Modify Primitive numeric output conversion [charconv.to.chars]:
[...]
to_chars_result to_chars(char* first, char* last, float value);        // removed
to_chars_result to_chars(char* first, char* last, double value);       // removed
to_chars_result to_chars(char* first, char* last, long double value);  // removed
to_chars_result to_chars(char* first, char* last, see below value);    // added
Effects: value is converted to a string in the style of printf in the "C" locale. The conversion specifier is f or e, chosen according to the requirement for a shortest representation (see above); a tie is resolved in favor of f.
Throws: Nothing.
[added] Remarks: The implementation shall provide overloads for all floating-point types as the type of parameter value.
to_chars_result to_chars(char* first, char* last, float value, chars_format fmt);        // removed
to_chars_result to_chars(char* first, char* last, double value, chars_format fmt);       // removed
to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt);  // removed
to_chars_result to_chars(char* first, char* last, see below value, chars_format fmt);    // added
Requires: fmt has the value of one of the enumerators of chars_format.
Effects: value is converted to a string in the style of printf in the "C" locale.
Throws: Nothing.
[added] Remarks: The implementation shall provide overloads for all floating-point types as the type of parameter value.
to_chars_result to_chars(char* first, char* last, float value, chars_format fmt, int precision);        // removed
to_chars_result to_chars(char* first, char* last, double value, chars_format fmt, int precision);       // removed
to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt, int precision);  // removed
to_chars_result to_chars(char* first, char* last, see below value, chars_format fmt, int precision);    // added
Requires: fmt has the value of one of the enumerators of chars_format.
Effects: value is converted to a string in the style of printf in the "C" locale with the given precision.
Throws: Nothing.
[added] Remarks: The implementation shall provide overloads for all floating-point types as the type of parameter value.
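The sketch below shows how user code might call the shortest-representation overload generically, so that it would pick up any overload the implementation provides for a floating-point type. The helper name shortest_text and the 64-byte buffer are assumptions made for this illustration, and it requires a standard library that already implements floating-point std::to_chars.

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical helper (illustration only): formats value with the
// to_chars(first, last, value) overload selected for its type.
template <class Float>
std::string shortest_text(Float value) {
    char buffer[64];  // assumed large enough for the types used here
    const std::to_chars_result result =
        std::to_chars(buffer, buffer + sizeof buffer, value);
    if (result.ec != std::errc{}) {
        return std::string{};  // conversion did not fit in the buffer
    }
    return std::string(buffer, result.ptr);
}
```

For example, shortest_text(0.1) yields "0.1" and shortest_text(1.5f) yields "1.5".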
Modify Primitive numeric input conversions [charconv.from.chars]:
[...]
from_chars_result from_chars(const char* first, const char* last, float& value, chars_format fmt = chars_format::general);       // removed
from_chars_result from_chars(const char* first, const char* last, double& value, chars_format fmt = chars_format::general);      // removed
from_chars_result from_chars(const char* first, const char* last, long double& value, chars_format fmt = chars_format::general); // removed
from_chars_result from_chars(const char* first, const char* last, see below& value, chars_format fmt = chars_format::general);   // added
Requires: fmt has the value of one of the enumerators of chars_format.
Effects: The pattern is the expected form of the subject sequence in the "C" locale, as described for strtod, except that
(7.1) the sign '+' may only appear in the exponent part;
(7.2) if fmt has chars_format::scientific set but not chars_format::fixed, the otherwise optional exponent part shall appear;
(7.3) if fmt has chars_format::fixed set but not chars_format::scientific, the optional exponent part shall not appear; and
(7.4) if fmt is chars_format::hex, the prefix "0x" or "0X" is assumed. [ Example: The string 0x123 is parsed to have the value 0 with remaining characters x123. - end example ]
In any case, the resulting value is one of at most two floating-point values closest to the value of the string matching the pattern.
Throws: Nothing.
[added] Remarks: The implementation shall provide overloads for all floating-point types as the type of parameter value.
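The hex-format example in item (7.4) can be reproduced with a small wrapper such as the one below. The names parsed_hex and parse_hex are hypothetical, the wrapper is fixed to double for simplicity, and it assumes a standard library that implements floating-point std::from_chars.

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical result type (illustration only).
struct parsed_hex {
    double value;      // parsed value, 0.0 if parsing failed
    std::string rest;  // characters left unconsumed
    bool ok;           // true when from_chars reported no error
};

// Parses text with chars_format::hex, where the "0x" prefix is assumed
// and therefore must not be written in the input.
inline parsed_hex parse_hex(const std::string& text) {
    double value = 0.0;
    const char* first = text.data();
    const char* last = first + text.size();
    const std::from_chars_result result =
        std::from_chars(first, last, value, std::chars_format::hex);
    return parsed_hex{value, std::string(result.ptr, last),
                      result.ec == std::errc{}};
}
```

Calling parse_hex("0x123") gives the value 0 with "x123" left over, matching the example in the wording.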
Note: other conversion-to-string functions (from [strings]) are not rewritten to support extended floating-point types.
Modify Complex numbers [complex.numbers] paragraph 2:
The effect of instantiating the template complex for any type [removed: other than float, double, or long double] [added: that is not a floating-point type] is unspecified. The specializations [removed: complex<float>, complex<double>, and complex<long double>] [added: of complex for floating-point types] are literal types.
Modify Header <complex> synopsis [complex.syn]:
[...]
// [complex.special], specializations          // removed
template<> class complex<float>;               // removed
template<> class complex<double>;              // removed
template<> class complex<long double>;         // removed
Modify Class template complex [complex]:
namespace std {
  template<class T> class complex {
  public:
    using value_type = T;

    constexpr complex(const T& re = T(), const T& im = T());
    constexpr complex(const complex&);                                                 // removed
    template<class X> constexpr complex(const complex<X>&);                            // removed
    constexpr complex(const complex&) = default;                                       // added
    template<class X> constexpr explicit(see below) complex(const complex<X>& other);  // added

    constexpr T real() const;
    constexpr void real(T);
    constexpr T imag() const;
    constexpr void imag(T);

    constexpr complex& operator=(const T&);
    constexpr complex& operator+=(const T&);
    constexpr complex& operator-=(const T&);
    constexpr complex& operator*=(const T&);
    constexpr complex& operator/=(const T&);

    constexpr complex& operator=(const complex&);
    template<class X> constexpr complex& operator=(const complex<X>&);
    template<class X> constexpr complex& operator+=(const complex<X>&);
    template<class X> constexpr complex& operator-=(const complex<X>&);
    template<class X> constexpr complex& operator*=(const complex<X>&);
    template<class X> constexpr complex& operator/=(const complex<X>&);
  };
}
Remove Specializations [complex.special]:
namespace std {
  template<> class complex<float> {
  public:
    using value_type = float;

    constexpr complex(float re = 0.0f, float im = 0.0f);
    constexpr complex(const complex<float>&) = default;
    constexpr explicit complex(const complex<double>&);
    constexpr explicit complex(const complex<long double>&);

    constexpr float real() const;
    constexpr void real(float);
    constexpr float imag() const;
    constexpr void imag(float);

    constexpr complex& operator=(float);
    constexpr complex& operator+=(float);
    constexpr complex& operator-=(float);
    constexpr complex& operator*=(float);
    constexpr complex& operator/=(float);

    constexpr complex& operator=(const complex&);
    template<class X> constexpr complex& operator=(const complex<X>&);
    template<class X> constexpr complex& operator+=(const complex<X>&);
    template<class X> constexpr complex& operator-=(const complex<X>&);
    template<class X> constexpr complex& operator*=(const complex<X>&);
    template<class X> constexpr complex& operator/=(const complex<X>&);
  };

  template<> class complex<double> {
  public:
    using value_type = double;

    constexpr complex(double re = 0.0, double im = 0.0);
    constexpr complex(const complex<float>&);
    constexpr complex(const complex<double>&) = default;
    constexpr explicit complex(const complex<long double>&);

    constexpr double real() const;
    constexpr void real(double);
    constexpr double imag() const;
    constexpr void imag(double);

    constexpr complex& operator=(double);
    constexpr complex& operator+=(double);
    constexpr complex& operator-=(double);
    constexpr complex& operator*=(double);
    constexpr complex& operator/=(double);

    constexpr complex& operator=(const complex&);
    template<class X> constexpr complex& operator=(const complex<X>&);
    template<class X> constexpr complex& operator+=(const complex<X>&);
    template<class X> constexpr complex& operator-=(const complex<X>&);
    template<class X> constexpr complex& operator*=(const complex<X>&);
    template<class X> constexpr complex& operator/=(const complex<X>&);
  };

  template<> class complex<long double> {
  public:
    using value_type = long double;

    constexpr complex(long double re = 0.0L, long double im = 0.0L);
    constexpr complex(const complex<float>&);
    constexpr complex(const complex<double>&);
    constexpr complex(const complex<long double>&) = default;

    constexpr long double real() const;
    constexpr void real(long double);
    constexpr long double imag() const;
    constexpr void imag(long double);

    constexpr complex& operator=(long double);
    constexpr complex& operator+=(long double);
    constexpr complex& operator-=(long double);
    constexpr complex& operator*=(long double);
    constexpr complex& operator/=(long double);

    constexpr complex& operator=(const complex&);
    template<class X> constexpr complex& operator=(const complex<X>&);
    template<class X> constexpr complex& operator+=(const complex<X>&);
    template<class X> constexpr complex& operator-=(const complex<X>&);
    template<class X> constexpr complex& operator*=(const complex<X>&);
    template<class X> constexpr complex& operator/=(const complex<X>&);
  };
}
Modify Member functions [complex.members] by inserting the following after paragraph 2:
template<class X> constexpr explicit(see below) complex(const complex<X>& other);
Effects: Constructs an object of class complex.
Ensures: real() == other.real() && imag() == other.imag().
Remarks: The expression inside explicit evaluates to false if and only if the range of finite values of T is a superset of the range of finite values of X.
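For the standard floating-point types, the conditionally explicit constructor is intended to keep the behavior the removed specializations already have: widening conversions are implicit and the others are explicit. The sketch below checks that existing behavior; the helper names converts_implicitly and converts_explicitly_only are hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <type_traits>

// Hypothetical helper (illustration only): implicit conversion exists.
template <class From, class To>
constexpr bool converts_implicitly() {
    return std::is_convertible<std::complex<From>, std::complex<To>>::value;
}

// Hypothetical helper (illustration only): conversion needs explicit syntax.
template <class From, class To>
constexpr bool converts_explicitly_only() {
    return !converts_implicitly<From, To>() &&
           std::is_constructible<std::complex<To>, std::complex<From>>::value;
}

static_assert(converts_implicitly<float, double>(), "widening is implicit");
static_assert(converts_explicitly_only<double, float>(),
              "the reverse direction is explicit");
```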
Modify Additional overloads [cmplx.over] paragraphs 2 and 3:
The additional overloads shall be sufficient to ensure:
[removed] (2.1) If the argument has type long double, then it is effectively cast to complex<long double>.
[removed] (2.2) Otherwise, if the argument has type double or an integer type, then it is effectively cast to complex<double>.
[removed] (2.3) Otherwise, if the argument has type float, then it is effectively cast to complex<float>.
[added] (2.1) If the argument has a floating-point type T, then it is effectively cast to complex<T>.
[added] (2.2) Otherwise, if the argument has an integer type, then it is effectively cast to complex<double>.
Function template pow shall have additional overloads sufficient to ensure, for a call with at least one argument of type complex<T>:
[removed] (3.1) If either argument has type complex<long double> or type long double, then both arguments are effectively cast to complex<long double>.
[removed] (3.2) Otherwise, if either argument has type complex<double>, double, or an integer type, then both arguments are effectively cast to complex<double>.
[removed] (3.3) Otherwise, if either argument has type complex<float> or float, then both arguments are effectively cast to complex<float>.
[added] (3.1) If the type of one of the arguments is complex<T1> and the type of the other is complex<T2>, then both arguments are effectively cast to complex<TR>, where TR is T1 if T1 has a higher floating-point conversion rank than T2, otherwise T2.
[added] (3.2) Otherwise, if the type of one of the arguments is complex<T1> and the type of the other is a floating-point type T2, then both arguments are effectively cast to complex<TR>, where TR is T1 if T1 has a higher floating-point conversion rank than T2, otherwise T2.
[added] (3.3) Otherwise, both arguments are effectively cast to complex<T>.
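For the standard floating-point types, items (3.1) and (3.2) are expected to select the same result types as the removed text. The sketch below inspects the result type of std::pow for mixed arguments; the alias complex_pow_result_t is a hypothetical name used only here.

```cpp
#include <cassert>
#include <complex>
#include <type_traits>
#include <utility>

// Hypothetical alias (illustration only): the type returned by std::pow
// for arguments of types A and B.
template <class A, class B>
using complex_pow_result_t =
    decltype(std::pow(std::declval<A>(), std::declval<B>()));

// complex<T1> with complex<T2>: the type of higher rank wins.
static_assert(std::is_same<complex_pow_result_t<std::complex<float>,
                                                std::complex<double>>,
                           std::complex<double>>::value,
              "complex<float>, complex<double> -> complex<double>");
// complex<T1> with a floating-point type T2: same selection.
static_assert(std::is_same<complex_pow_result_t<std::complex<float>, double>,
                           std::complex<double>>::value,
              "complex<float>, double -> complex<double>");
```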
Modify Header <cmath> synopsis [cmath.syn] paragraph 2 and add paragraph 3:
For each set of overloaded functions within <cmath>, with the exception of abs, there shall be additional overloads sufficient to ensure:
[removed] If any argument of arithmetic type corresponding to a double parameter has type long double, then all arguments of arithmetic type corresponding to double parameters are effectively cast to long double.
[removed] Otherwise, if any argument of arithmetic type corresponding to a double parameter has type double or an integer type, then all arguments of arithmetic type corresponding to double parameters are effectively cast to double.
[added] If all arguments of arithmetic type corresponding to double parameters have floating-point types, then all arguments of arithmetic type corresponding to double parameters have the type, among the argument types, with the highest floating-point conversion rank. If that type is an extended floating-point type, then the return type is also that type.
[added] Otherwise, if any argument of arithmetic type corresponding to a double parameter has a floating-point type, then all arguments of arithmetic type corresponding to double parameters are effectively cast to the floating-point type with the highest floating-point conversion rank among the floating-point argument types.
Otherwise, all arguments of arithmetic type corresponding to double parameters have type float.
[added] There shall be additional overloads of abs for each extended floating-point type T. Those overloads shall have the signature T abs(T j) and return the absolute value of j.
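The behavior that paragraph 2 is trying to preserve for the standard types can be observed on an existing implementation with a sketch like the following. The alias cmath_pow_result_t is hypothetical and std::pow is used only as a representative two-argument <cmath> function.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>
#include <utility>

// Hypothetical alias (illustration only): the type returned by std::pow
// for arithmetic arguments of types A and B.
template <class A, class B>
using cmath_pow_result_t =
    decltype(std::pow(std::declval<A>(), std::declval<B>()));

// All arguments floating-point: the highest-ranked argument type is used.
static_assert(std::is_same<cmath_pow_result_t<float, float>, float>::value,
              "float, float -> float");
static_assert(
    std::is_same<cmath_pow_result_t<float, long double>, long double>::value,
    "float, long double -> long double");
// Only integer arguments: the double overload is used.
static_assert(std::is_same<cmath_pow_result_t<int, int>, double>::value,
              "int, int -> double");
```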
Note: LWG question: should the signatures be somehow added to the synopsis itself?
Note: We have tried to capture what the current specification says without having to add three nearly identical items to this wording to cover, respectively, EFPTs ranked above, between, and below the standard floating-point types. We don’t have anything against reverting to that, but wanted to try this more generic way of describing the behavior.
Note: We are pretty sure this new paragraph 3 is not the right way to spell it, so we welcome any suggestions.