1. Abstract
This proposal follows P0192R1 in proposing a new fundamental type,
short float: a floating-point type of
unspecified length, no longer than that of
float. In addition, it proposes standard library aliases for fixed-width
floating-point types, required to conform to [IEEE-754-2008], including
. Library support for
and
is also included.
2. Motivation
One may wonder: why would a programming language need yet another floating-point type after so many years of doing just fine without one? The times are changing. Demand and support for small binary floating-point representations are growing rapidly, and efficient support for hardware that uses the new formats is becoming mission-critical for major software products.
2.1. Application use is growing
Application areas include computer graphics, image representation and machine learning. For example, a 16-bit floating-point number represents the dynamic range of images better than 16-bit or even 32-bit integers, and it adequately covers the human perceptual range. In 2012, Adobe defined a new HDR [DNG] image file format that most commonly uses 16-bit floats. This gives those 16 bits much more dynamic range than a traditional file stored as 16-bit or 32-bit integer data. Both Photoshop and Lightroom, as well as every professional camera produced in 2015, make use of it.
2.2. Software support is growing
-
OpenGL has had the GL_HALF_FLOAT format since [OpenGL3.0].
-
The [OpenEXR] software distribution includes Half, a C++ class for manipulating half float values almost as if they were a built-in C++ data type.
-
NVIDIA’s [CUDA7.5] platform header cuda_fp16.h defines the half and half2 data types and defines __half2float() and __float2half() for conversion between them and float. [4]
-
The GCC compiler provides an __fp16 native data type extension for ARM. Values of that type are promoted to float for computation ([GCCFP16]).
-
The LLVM IR provides a 16-bit floating-point type called half.
-
Recently, a new 16-bit float format appeared, called [bfloat16] - truncated 32-bit float, used in Google TPUs and TensorFlow
.half
2.3. Hardware support is growing
-
NVIDIA was the first to implement 16-bit floating point in silicon, with the GeForce FX, released in late 2002.
-
Intel provides instructions for converting between 16-bit and 32-bit floats and C-level intrinsics to use them (see [INTEL-HALF-PERF], an Intel article on half precision performance benefits).
-
ARM provides support as an optional extension to the VFPv3 architecture: [ARM-HF].
Standards defining 16-bit floating-point math already exist: the [IEEE-754-2008] standard defined a
16-bit float format in 2008, and
ISO/IEC 60559 ratified that standard in 2011. However, support for just the IEEE
format does not cover
existing use cases.
-
OpenGL provides 11-bit and 10-bit float channels in GL_R11F_G11F_B10F and 14-bit float in GL_RGB9_E5.
-
ARM, alongside an [IEEE-754-2008]-compatible 16-bit floating-point format, provides a 16-bit floating-point format that differs from IEEE ’binary16’ by dropping support for NaN and Infinity and then extending the range of values.
-
The TI MSP430X architecture provides a 20-bit word-addressed machine. Short floating-point support on that machine would naturally use a 20-bit format.
-
Google TPU and TensorFlow use [bfloat16].
3. Proposed solution
We propose adding a new fundamental type,
short float, for a floating-point type of unspecified (platform-defined) bit
size, no longer than that of
float. The language needs
short float to represent "shorter than
float" math that may
be natively available on the platform. This name looks intuitive by
analogy. Most importantly, it does not
introduce any new keywords.
The proposed suffixes for short float literals are sf and SF.
This proposal also extends the definition of floating-point promotions, including a promotion from short float to float.
We also propose adding a set of conditionally supported type aliases in namespace
:
,
and
.
Those types would be guaranteed to be respectively 16, 32 and 64 bits long in their representation, and would be required
to implement [IEEE-754-2008]
,
and
formats, respectively. We propose to put those aliases
into a new header,
for consistency with
, and to make this new header a freestanding header.
Several people have suggested not having a new fundamental type, but only exposing it as the
alias. This
approach has problems. All the library changes would still need to be done, but substituting
with
,
plus additional wording guaranteeing that the additional overloads don’t exist if, for some reason,
has 16 bits.
Additionally, the definition of a floating-point type would still need to be extended to include that - conditionally supported! -
type, and to give it the same capabilities that other floating-point types have. The authors of this paper find this to
be a change that is almost as complex as adding a new fundamental type, but with more corner cases around overload resolution
for standard library functions.
3.1. Implementation options
As for the storage and bit layout of a short float number, we would expect most implementations to follow the [IEEE-754-2008] or [bfloat16] half-precision floating-point formats. On platforms that gain no advantage from a shorter
float, short float may be implemented as a storage-only type, like
__fp16 on GCC/ARM today. For example, it can be stored in
a shorter format in memory (occupying fewer bytes than
float), converted to native 32-bit
float on read from memory,
operated on using native 32-bit floating-point math operations and converted back to
the shorter format on store to memory. Or, the
platform may choose not to take any advantage of
short float and represent it using
float in both memory and registers.
3.2. Implementation experience
Since CUDA 7.5 introduced a
16-bit floating-point type, applications can benefit by storing up to 2x larger models in GPU
memory. Applications that are bottlenecked by memory bandwidth may get up to 2x speedup. And applications bottlenecked by FP32
computation may benefit from 2x faster computation on
data. NVIDIA GPUs implement the [IEEE-754-2008] floating point
standard, which defines half-precision numbers as follows:
-
Sign: 1 bit
-
Exponent width: 5 bits
-
Significand precision: 11 bits (10 explicitly stored)
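As an illustration of the layout listed above, the following is a minimal sketch that decodes a binary16 bit pattern into a float. The names half_bits_to_float and half_decode_example are hypothetical and not part of this proposal; the exponent bias of 15 is taken from [IEEE-754-2008] rather than from the list above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: decode an IEEE binary16 bit pattern into a float.
// Assumed layout: 1 sign bit, 5 exponent bits (bias 15), 10 stored significand bits.
inline float half_bits_to_float(std::uint16_t h) {
    const bool negative = (h & 0x8000u) != 0;
    const int exponent = (h >> 10) & 0x1F;
    const int fraction = h & 0x3FF;
    float magnitude;
    if (exponent == 0x1F) {
        // All-ones exponent encodes infinity (zero fraction) or NaN.
        magnitude = fraction == 0 ? std::numeric_limits<float>::infinity()
                                  : std::numeric_limits<float>::quiet_NaN();
    } else if (exponent == 0) {
        // Zero or subnormal: no implicit leading 1, value is fraction * 2^-24.
        magnitude = std::ldexp(static_cast<float>(fraction), -24);
    } else {
        // Normal: the implicit leading 1 gives 11 bits of precision,
        // value is (1024 + fraction) * 2^(exponent - 25).
        magnitude = std::ldexp(static_cast<float>(fraction + 1024), exponent - 25);
    }
    return negative ? -magnitude : magnitude;
}

inline void half_decode_example() {
    assert(half_bits_to_float(0x3C00) == 1.0f);
    assert(half_bits_to_float(0xC000) == -2.0f);
    assert(half_bits_to_float(0x7BFF) == 65504.0f);  // largest finite binary16 value
    assert(std::isinf(half_bits_to_float(0x7C00)));
}
```

Every binary16 value is exactly representable as a 32-bit IEEE float, so this direction of the conversion never rounds.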
Google TPUs implement the [bfloat16]
format, which defines half-precision numbers as the upper 16 bits of a 32-bit IEEE float:
-
Sign: 1 bit
-
Exponent width: 8 bits
-
Significand precision: 8 bits (7 explicitly stored)
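Because this format is the upper half of a 32-bit IEEE float, the conversion can be sketched with bit shifts. The helper names below are hypothetical; the sketch assumes float is a 32-bit IEEE type and truncates the low bits rather than rounding, which real implementations may do differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: keep only the upper 16 bits of a 32-bit float (truncation).
inline std::uint16_t float_to_bfloat16_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: widen by placing the 16 bits in the upper half and zero-filling.
inline float bfloat16_bits_to_float(std::uint16_t b) {
    const std::uint32_t bits = static_cast<std::uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

inline void bfloat16_example() {
    // Values that fit in 8 significand bits survive the round trip exactly.
    assert(bfloat16_bits_to_float(float_to_bfloat16_bits(1.0f)) == 1.0f);
    assert(bfloat16_bits_to_float(float_to_bfloat16_bits(-2.5f)) == -2.5f);
    // Other values lose their low 16 bits.
    assert(bfloat16_bits_to_float(float_to_bfloat16_bits(3.14159f)) == 3.140625f);
}
```

The exponent field is untouched by the conversion, so the range of float is preserved and only precision is lost.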
4. Proposed wording
4.1. Wording for a new fundamental type
4.1.1. Core language
Modify Floating literals [lex.fcon] by adding short float suffixes to floating-suffix:
floating-suffix: one of
sf f l SF F L
In paragraph 1, modify sentence 13:
The type of a floating literal is double unless explicitly specified by a suffix. The suffixes sf and SF specify short float, the suffixes f and F specify float, the suffixes l and L specify long double. [...]
Modify Fundamental types [basic.fundamental] paragraph 8:
There are four floating-point types: short float, float, double, and long double. The type float provides at least as much precision as short float, the type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type short float is a subset of the set of values of the type double, the set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [...]
Modify Floating-point promotion [conv.fpprom] as follows:
A prvalue of type short float can be converted to a prvalue of type float. The value is unchanged.
A prvalue of type float can be converted to a prvalue of type double. The value is unchanged.
These conversions are called floating-point promotions.
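The effect of a promotion on overload resolution can be observed today with the existing float to double promotion; under the wording above, a short float argument would presumably prefer a float overload in the same way. A minimal sketch using only types available in current C++, with hypothetical names (selected, which, promotion_example):

```cpp
#include <cassert>

enum class selected { as_double, as_long_double };

// Two viable candidates for a float argument.
inline selected which(double) { return selected::as_double; }
inline selected which(long double) { return selected::as_long_double; }

inline void promotion_example() {
    float f = 0.1f;
    // float -> double is a promotion, float -> long double is a conversion,
    // so overload resolution picks the double overload without ambiguity.
    assert(which(f) == selected::as_double);
    // The promoted value is unchanged, so converting back yields the original.
    assert(static_cast<float>(static_cast<double>(f)) == f);
}
```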
Modify Usual arithmetic conversions [expr.arith.conv] paragraph 1 as follows:
[...]
If either operand is of type long double, the other shall be converted to long double.
Otherwise, if either operand is float, the other shall be converted to float.
Otherwise, if either operand is short float, the other shall be converted to short float.
Otherwise, the integral promotions shall be performed on both operands. [...]
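The existing rules can be checked at compile time, which shows where the new item slots in. The following is a sketch with hypothetical names (arith_result_t, arithmetic_conversion_example); it uses only current types, and the short float behaviour is described in a comment because it cannot be compiled today.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Type of the expression a + b after the usual arithmetic conversions.
template <class A, class B>
using arith_result_t = decltype(std::declval<A>() + std::declval<B>());

static_assert(std::is_same<arith_result_t<float, long double>, long double>::value,
              "long double wins over float");
static_assert(std::is_same<arith_result_t<float, double>, double>::value,
              "double wins over float");
static_assert(std::is_same<arith_result_t<float, int>, float>::value,
              "float wins over integer types");
// Integer types narrower than int are promoted first: short + short is int.
static_assert(std::is_same<arith_result_t<short, short>, int>::value,
              "integral promotion applies");
// Under the proposed item, short float + int would have type short float,
// and short float + float would have type float.

inline void arithmetic_conversion_example() {
    float f = 1.5f;
    auto r = f + 1;  // the int operand is converted to float
    static_assert(std::is_same<decltype(r), float>::value, "result is float");
    assert(r == 2.5f);
}
```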
In Simple type specifiers [dcl.type.simple], modify table 11 as follows:
Specifier(s)    Type
[...]           [...]
wchar_t         "wchar_t"
short float     "short float"
float           "float"
[...]           [...]
Modify List-initialization [dcl.init.list] paragraph 7 item 2:
from a higher precision floating-point type to a lower precision one, except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly), or
This replaces the previous enumeration "from long double to double or float, or from double to float".
In Standard conversion sequences [over.ics.scs], modify table 13 as follows:
[...] Floating-point promotions [...]
In Predefined macro names [cpp.predefined], modify table 16 as follows:
Macro name                  Value
[...]                       [...]
__cpp_rvalue_references     200610L
__cpp_short_float           201810L
__cpp_sized_deallocation    201309L
[...]                       [...]
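User code could test the macro to select the new type where it is available. A minimal sketch, in which the alias name compact_real and the function name are hypothetical; on current compilers the fallback branch is taken.

```cpp
#include <cassert>

#if defined(__cpp_short_float) && __cpp_short_float >= 201810L
using compact_real = short float;  // proposed type, if the implementation provides it
#else
using compact_real = float;        // fallback for implementations without short float
#endif

inline void feature_test_example() {
    // Holds in both branches: short float is proposed to be no longer than float.
    assert(sizeof(compact_real) <= sizeof(float));
}
```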
4.1.2. Library wording
Modify Header <cstdlib> synopsis [cstdlib.syn] as follows:
[...]
long long int abs(long long int j);
short float abs(short float j);
float abs(float j);
[...]
Note:
is not added on purpose, since the
family is owned by C.
is provided for the new type,
since C++ already extends the overload set of this function.
Modify Header <limits> synopsis [limits.syn] as follows:
[...]
template<> class numeric_limits<unsigned long long>;
template<> class numeric_limits<short float>;
template<> class numeric_limits<float>;
[...]
Do not modify Header <cfloat> synopsis [cfloat.syn].
Note: no macros are added to <cfloat>, on purpose, because that header’s contents are fully owned by C.
Modify Header <charconv> synopsis [charconv.syn] as follows:
It is possible that, instead of adding new overloads, the overloads could be folded into a single specified function, the way the integer overloads of all of these are specified.
[...]
// [charconv.to.chars], primitive numerical output conversion
struct to_chars_result {
  char* ptr;
  errc ec;
};

to_chars_result to_chars(char* first, char* last, see below value, int base = 10);

to_chars_result to_chars(char* first, char* last, short float value);
to_chars_result to_chars(char* first, char* last, float value);
to_chars_result to_chars(char* first, char* last, double value);
to_chars_result to_chars(char* first, char* last, long double value);

to_chars_result to_chars(char* first, char* last, short float value, chars_format fmt);
to_chars_result to_chars(char* first, char* last, float value, chars_format fmt);
to_chars_result to_chars(char* first, char* last, double value, chars_format fmt);
to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt);

to_chars_result to_chars(char* first, char* last, short float value, chars_format fmt, int precision);
to_chars_result to_chars(char* first, char* last, float value, chars_format fmt, int precision);
to_chars_result to_chars(char* first, char* last, double value, chars_format fmt, int precision);
to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt, int precision);

// [charconv.from.chars], primitive numerical input conversion
struct from_chars_result {
  const char* ptr;
  errc ec;
};

from_chars_result from_chars(const char* first, const char* last, see below& value, int base = 10);

from_chars_result from_chars(const char* first, const char* last, short float& value, chars_format fmt = chars_format::general);
from_chars_result from_chars(const char* first, const char* last, float& value, chars_format fmt = chars_format::general);
from_chars_result from_chars(const char* first, const char* last, double& value, chars_format fmt = chars_format::general);
from_chars_result from_chars(const char* first, const char* last, long double& value, chars_format fmt = chars_format::general);
}
Modify Primitive numeric output conversion [charconv.to.chars] as follows, by adding overloads to the lists of signatures:
[...]
to_chars_result to_chars(char* first, char* last, short float value);
to_chars_result to_chars(char* first, char* last, float value);
to_chars_result to_chars(char* first, char* last, double value);
to_chars_result to_chars(char* first, char* last, long double value);
[...]
to_chars_result to_chars(char* first, char* last, short float value, chars_format fmt);
to_chars_result to_chars(char* first, char* last, float value, chars_format fmt);
to_chars_result to_chars(char* first, char* last, double value, chars_format fmt);
to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt);
[...]
to_chars_result to_chars(char* first, char* last, short float value, chars_format fmt, int precision);
to_chars_result to_chars(char* first, char* last, float value, chars_format fmt, int precision);
to_chars_result to_chars(char* first, char* last, double value, chars_format fmt, int precision);
to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt, int precision);
[...]
Note: no changes in the descriptions are needed, since all the overloads of each of these functions are already specified together.
Modify Primitive numeric input conversions [charconv.from.chars] as follows, by adding an overload to the list of signatures:
[...]
from_chars_result from_chars(const char* first, const char* last, short float& value, chars_format fmt = chars_format::general);
from_chars_result from_chars(const char* first, const char* last, float& value, chars_format fmt = chars_format::general);
from_chars_result from_chars(const char* first, const char* last, double& value, chars_format fmt = chars_format::general);
from_chars_result from_chars(const char* first, const char* last, long double& value, chars_format fmt = chars_format::general);
[...]
Note: no changes in the descriptions are needed, since all the overloads of each of these functions are already specified together.
Modify Header <string> synopsis [string.syn] as follows:
[...]
string to_string(unsigned long long val);
string to_string(short float val);
string to_string(float val);
[...]
wstring to_wstring(unsigned long long val);
wstring to_wstring(short float val);
wstring to_wstring(float val);
[...]
the definitions of the
family depend on C functions in the
family. Should an overload for
be added?
Modify Numeric conversions [string.conversions] as follows:
[...]
string to_string(int val);
string to_string(unsigned val);
string to_string(long val);
string to_string(unsigned long val);
string to_string(long long val);
string to_string(unsigned long long val);
string to_string(short float val);
string to_string(float val);
string to_string(double val);
string to_string(long double val);
Returns: Each function returns a string object holding the character representation of the value of its argument that would be generated by calling sprintf(buf, fmt, val) with a format specifier of "%d", "%u", "%ld", "%lu", "%lld", "%llu", "%f", "%f", "%f", or "%Lf", respectively, where buf designates an internal character buffer of sufficient size.
[...]
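The Returns clause maps each overload to a printf-style format specifier. The following minimal sketch shows that mapping for the "%f" case; the helper format_like_to_string is hypothetical, and it uses std::snprintf rather than sprintf so that the write is bounded. A float argument reaches the formatting call as a double, which is why the helper takes double.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: the text that the "%f" specifier produces for a value.
inline std::string format_like_to_string(double val) {
    char buf[512];  // large enough for any double printed with "%f"
    std::snprintf(buf, sizeof buf, "%f", val);
    return std::string(buf);
}

inline void to_string_example() {
    // "%f" always prints six digits after the decimal point (in the default "C" locale).
    assert(format_like_to_string(1.5f) == "1.500000");
    assert(format_like_to_string(-0.25) == "-0.250000");
}
```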
wstring to_wstring(int val);
wstring to_wstring(unsigned val);
wstring to_wstring(long val);
wstring to_wstring(unsigned long val);
wstring to_wstring(long long val);
wstring to_wstring(unsigned long long val);
wstring to_wstring(short float val);
wstring to_wstring(float val);
wstring to_wstring(double val);
wstring to_wstring(long double val);
Returns: Each function returns a wstring object holding the character representation of the value of its argument that would be generated by calling sprintf(buf, fmt, val) with a format specifier of "%d", "%u", "%ld", "%lu", "%lld", "%llu", "%f", "%f", "%f", or "%Lf", respectively, where buf designates an internal character buffer of sufficient size.
Modify Complex numbers [complex.numbers] paragraph 2:
The effect of instantiating the template complex for any type other than short float, float, double, or long double is unspecified. The specializations complex<short float>, complex<float>, complex<double>, and complex<long double> are literal types.
Modify Header <complex> synopsis [complex.syn] as follows:
[...]
// [complex.special], complex specializations
template<> class complex<short float>;
template<> class complex<float>;
template<> class complex<double>;
template<> class complex<long double>;
[...]
// [complex.literals], complex literals
inline namespace literals {
  inline namespace complex_literals {
    constexpr complex<long double> operator""il(long double);
    constexpr complex<long double> operator""il(unsigned long long);
    constexpr complex<double> operator""i(long double);
    constexpr complex<double> operator""i(unsigned long long);
    constexpr complex<float> operator""if(long double);
    constexpr complex<float> operator""if(unsigned long long);
    constexpr complex<short float> operator""isf(long double);
    constexpr complex<short float> operator""isf(unsigned long long);
  }
}
Modify complex specializations [complex.special] as follows:
namespace std {
  template<> class complex<short float> {
  public:
    using value_type = short float;

    constexpr complex(short float re = 0.0sf, short float im = 0.0sf);
    constexpr complex(const complex<short float>&) = default;
    constexpr explicit complex(const complex<float>&);
    constexpr explicit complex(const complex<double>&);
    constexpr explicit complex(const complex<long double>&);

    constexpr short float real() const;
    constexpr void real(short float);
    constexpr short float imag() const;
    constexpr void imag(short float);

    constexpr complex& operator=(short float);
    constexpr complex& operator+=(short float);
    constexpr complex& operator-=(short float);
    constexpr complex& operator*=(short float);
    constexpr complex& operator/=(short float);

    constexpr complex& operator=(const complex&);
    template<class X> constexpr complex& operator=(const complex<X>&);
    template<class X> constexpr complex& operator+=(const complex<X>&);
    template<class X> constexpr complex& operator-=(const complex<X>&);
    template<class X> constexpr complex& operator*=(const complex<X>&);
    template<class X> constexpr complex& operator/=(const complex<X>&);
  };

  template<> class complex<float> {
  public:
    using value_type = float;

    constexpr complex(float re = 0.0f, float im = 0.0f);
    constexpr complex(const complex<short float>&);
    constexpr complex(const complex<float>&) = default;
    constexpr explicit complex(const complex<double>&);
    constexpr explicit complex(const complex<long double>&);

    constexpr float real() const;
    constexpr void real(float);
    constexpr float imag() const;
    constexpr void imag(float);

    constexpr complex& operator=(float);
    constexpr complex& operator+=(float);
    constexpr complex& operator-=(float);
    constexpr complex& operator*=(float);
    constexpr complex& operator/=(float);

    constexpr complex& operator=(const complex&);
    template<class X> constexpr complex& operator=(const complex<X>&);
    template<class X> constexpr complex& operator+=(const complex<X>&);
    template<class X> constexpr complex& operator-=(const complex<X>&);
    template<class X> constexpr complex& operator*=(const complex<X>&);
    template<class X> constexpr complex& operator/=(const complex<X>&);
  };

  template<> class complex<double> {
  public:
    using value_type = double;

    constexpr complex(double re = 0.0, double im = 0.0);
    constexpr complex(const complex<short float>&);
    constexpr complex(const complex<float>&);
    constexpr complex(const complex<double>&) = default;
    constexpr explicit complex(const complex<long double>&);

    constexpr double real() const;
    constexpr void real(double);
    constexpr double imag() const;
    constexpr void imag(double);

    constexpr complex& operator=(double);
    constexpr complex& operator+=(double);
    constexpr complex& operator-=(double);
    constexpr complex& operator*=(double);
    constexpr complex& operator/=(double);

    constexpr complex& operator=(const complex&);
    template<class X> constexpr complex& operator=(const complex<X>&);
    template<class X> constexpr complex& operator+=(const complex<X>&);
    template<class X> constexpr complex& operator-=(const complex<X>&);
    template<class X> constexpr complex& operator*=(const complex<X>&);
    template<class X> constexpr complex& operator/=(const complex<X>&);
  };

  template<> class complex<long double> {
  public:
    using value_type = long double;

    constexpr complex(long double re = 0.0L, long double im = 0.0L);
    constexpr complex(const complex<short float>&);
    constexpr complex(const complex<float>&);
    constexpr complex(const complex<double>&);
    constexpr complex(const complex<long double>&) = default;

    constexpr long double real() const;
    constexpr void real(long double);
    constexpr long double imag() const;
    constexpr void imag(long double);

    constexpr complex& operator=(long double);
    constexpr complex& operator+=(long double);
    constexpr complex& operator-=(long double);
    constexpr complex& operator*=(long double);
    constexpr complex& operator/=(long double);

    constexpr complex& operator=(const complex&);
    template<class X> constexpr complex& operator=(const complex<X>&);
    template<class X> constexpr complex& operator+=(const complex<X>&);
    template<class X> constexpr complex& operator-=(const complex<X>&);
    template<class X> constexpr complex& operator*=(const complex<X>&);
    template<class X> constexpr complex& operator/=(const complex<X>&);
  };
}
Add an item to Additional overloads [cmplx.over] paragraph 2:
[...]
Otherwise, if the argument has type float, then it is effectively cast to complex<float>.
Otherwise, if the argument has type short float, then it is effectively cast to complex<short float>.
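The existing items of this paragraph can be observed through the return types of the additional overloads; the new item would presumably make a short float argument behave like the float case below. A sketch using only current types, with a hypothetical function name:

```cpp
#include <cassert>
#include <complex>
#include <type_traits>

// A float argument is treated as complex<float>, so real() yields float.
static_assert(std::is_same<decltype(std::real(1.0f)), float>::value,
              "float argument stays float");
// An integer argument is treated as complex<double>, so real() yields double.
static_assert(std::is_same<decltype(std::real(1)), double>::value,
              "integer argument is treated as double");

inline void additional_overloads_example() {
    // A real argument has itself as real part and zero as imaginary part.
    assert(std::real(2.0f) == 2.0f);
    assert(std::imag(2.0f) == 0.0f);
}
```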
Add an item to Additional overloads [cmplx.over] paragraph 3:
[...]
Otherwise, if either argument has type complex<float> or float, then both arguments are effectively cast to complex<float>.
Otherwise, if either argument has type complex<short float> or short float, then both arguments are effectively cast to complex<short float>.
Modify Suffixes for complex number literals [complex.literals] as follows:
This subclause describes literal suffixes for constructing complex number literals. The suffixes il, i, if, and isf create complex numbers of the types complex<long double>, complex<double>, complex<float>, and complex<short float> respectively, with their imaginary part denoted by the given literal number and the real part being zero.
constexpr complex<long double> operator""il(long double d);
constexpr complex<long double> operator""il(unsigned long long d);
Returns: complex<long double>{0.0L, static_cast<long double>(d)}.
constexpr complex<double> operator""i(long double d);
constexpr complex<double> operator""i(unsigned long long d);
Returns: complex<double>{0.0, static_cast<double>(d)}.
constexpr complex<float> operator""if(long double d);
constexpr complex<float> operator""if(unsigned long long d);
Returns: complex<float>{0.0f, static_cast<float>(d)}.
constexpr complex<short float> operator""isf(long double d);
constexpr complex<short float> operator""isf(unsigned long long d);
Returns: complex<short float>{0.0sf, static_cast<short float>(d)}.
Modify General requirements [rand.req.genl] paragraph 1 d):
d) that has a template type parameter named RealType is undefined unless the corresponding template argument is cv-unqualified and is one of short float, float, double, or long double.
Modify Header <cmath> synopsis [cmath.syn] as follows:
[...]
namespace std {
  short float acos(short float x);  // see [library.c]
  float acos(float x);  // see [library.c]
  double acos(double x);
  long double acos(long double x);  // see [library.c]
  float acosf(float x);
  long double acosl(long double x);

  short float asin(short float x);  // see [library.c]
  float asin(float x);  // see [library.c]
  double asin(double x);
  long double asin(long double x);  // see [library.c]
  float asinf(float x);
  long double asinl(long double x);

  short float atan(short float x);  // see [library.c]
  float atan(float x);  // see [library.c]
  double atan(double x);
  long double atan(long double x);  // see [library.c]
  float atanf(float x);
  long double atanl(long double x);

  short float atan2(short float y, short float x);  // see [library.c]
  float atan2(float y, float x);  // see [library.c]
  double atan2(double y, double x);
  long double atan2(long double y, long double x);  // see [library.c]
  float atan2f(float y, float x);
  long double atan2l(long double y, long double x);

  short float cos(short float x);  // see [library.c]
  float cos(float x);  // see [library.c]
  double cos(double x);
  long double cos(long double x);  // see [library.c]
  float cosf(float x);
  long double cosl(long double x);

  short float sin(short float x);  // see [library.c]
  float sin(float x);  // see [library.c]
  double sin(double x);
  long double sin(long double x);  // see [library.c]
  float sinf(float x);
  long double sinl(long double x);

  short float tan(short float x);  // see [library.c]
  float tan(float x);  // see [library.c]
  double tan(double x);
  long double tan(long double x);  // see [library.c]
  float tanf(float x);
  long double tanl(long double x);

  short float acosh(short float x);  // see [library.c]
  float acosh(float x);  // see [library.c]
  double acosh(double x);
  long double acosh(long double x);  // see [library.c]
  float acoshf(float x);
  long double acoshl(long double x);

  short float asinh(short float x);  // see [library.c]
  float asinh(float x);  // see [library.c]
  double asinh(double x);
  long double asinh(long double x);  // see [library.c]
  float asinhf(float x);
  long double asinhl(long double x);

  short float atanh(short float x);  // see [library.c]
  float atanh(float x);  // see [library.c]
  double atanh(double x);
  long double atanh(long double x);  // see [library.c]
  float atanhf(float x);
  long double atanhl(long double x);

  short float cosh(short float x);  // see [library.c]
  float cosh(float x);  // see [library.c]
  double cosh(double x);
  long double cosh(long double x);  // see [library.c]
  float coshf(float x);
  long double coshl(long double x);

  short float sinh(short float x);  // see [library.c]
  float sinh(float x);  // see [library.c]
  double sinh(double x);
  long double sinh(long double x);  // see [library.c]
  float sinhf(float x);
  long double sinhl(long double x);

  short float tanh(short float x);  // see [library.c]
  float tanh(float x);  // see [library.c]
  double tanh(double x);
  long double tanh(long double x);  // see [library.c]
  float tanhf(float x);
  long double tanhl(long double x);

  short float exp(short float x);  // see [library.c]
  float exp(float x);  // see [library.c]
  double exp(double x);
  long double exp(long double x);  // see [library.c]
  float expf(float x);
  long double expl(long double x);

  short float exp2(short float x);  // see [library.c]
  float exp2(float x);  // see [library.c]
  double exp2(double x);
  long double exp2(long double x);  // see [library.c]
  float exp2f(float x);
  long double exp2l(long double x);

  short float expm1(short float x);  // see [library.c]
  float expm1(float x);  // see [library.c]
  double expm1(double x);
  long double expm1(long double x);  // see [library.c]
  float expm1f(float x);
  long double expm1l(long double x);

  short float frexp(short float value, int* exp);  // see [library.c]
  float frexp(float value, int* exp);  // see [library.c]
  double frexp(double value, int* exp);
  long double frexp(long double value, int* exp);  // see [library.c]
  float frexpf(float value, int* exp);
  long double frexpl(long double value, int* exp);

  int ilogb(short float x);  // see [library.c]
  int ilogb(float x);  // see [library.c]
  int ilogb(double x);
  int ilogb(long double x);  // see [library.c]
  int ilogbf(float x);
  int ilogbl(long double x);

  short float ldexp(short float x, int exp);  // see [library.c]
  float ldexp(float x, int exp);  // see [library.c]
  double ldexp(double x, int exp);
  long double ldexp(long double x, int exp);  // see [library.c]
  float ldexpf(float x, int exp);
  long double ldexpl(long double x, int exp);

  short float log(short float x);  // see [library.c]
  float log(float x);  // see [library.c]
  double log(double x);
  long double log(long double x);  // see [library.c]
  float logf(float x);
  long double logl(long double x);

  short float log10(short float x);  // see [library.c]
  float log10(float x);  // see [library.c]
  double log10(double x);
  long double log10(long double x);  // see [library.c]
  float log10f(float x);
  long double log10l(long double x);

  short float log1p(short float x);  // see [library.c]
  float log1p(float x);  // see [library.c]
  double log1p(double x);
  long double log1p(long double x);  // see [library.c]
  float log1pf(float x);
  long double log1pl(long double x);

  short float log2(short float x);  // see [library.c]
  float log2(float x);  // see [library.c]
  double log2(double x);
  long double log2(long double x);  // see [library.c]
  float log2f(float x);
  long double log2l(long double x);

  short float logb(short float x);  // see [library.c]
  float logb(float x);  // see [library.c]
  double logb(double x);
  long double logb(long double x);  // see [library.c]
  float logbf(float x);
  long double logbl(long double x);

  short float modf(short float value, short float* iptr);  // see [library.c]
  float modf(float value, float* iptr);  // see [library.c]
  double modf(double value, double* iptr);
  long double modf(long double value, long double* iptr);  // see [library.c]
  float modff(float value, float* iptr);
  long double modfl(long double value, long double* iptr);

  short float scalbn(short float x, int n);  // see [library.c]
  float scalbn(float x, int n);  // see [library.c]
  double scalbn(double x, int n);
  long double scalbn(long double x, int n);  // see [library.c]
  float scalbnf(float x, int n);
  long double scalbnl(long double x, int n);

  short float scalbln(short float x, long int n);  // see [library.c]
  float scalbln(float x, long int n);  // see [library.c]
  double scalbln(double x, long int n);
  long double scalbln(long double x, long int n);  // see [library.c]
  float scalblnf(float x, long int n);
  long double scalblnl(long double x, long int n);

  short float cbrt(short float x);  // see [library.c]
  float cbrt(float x);  // see [library.c]
  double cbrt(double x);
  long double cbrt(long double x);  // see [library.c]
  float cbrtf(float x);
  long double cbrtl(long double x);

  // [c.math.abs], absolute values
  int abs(int j);
  long int abs(long int j);
  long long int abs(long long int j);
  short float abs(short float j);
  float abs(float j);
  double abs(double j);
  long double abs(long double j);

  short float fabs(short float x);  // see [library.c]
  float fabs(float x);  // see [library.c]
  double fabs(double x);
  long double fabs(long double x);  // see [library.c]
  float fabsf(float x);
  long double fabsl(long double x);

  short float hypot(short float x, short float y);  // see [library.c]
  float hypot(float x, float y);  // see [library.c]
  double hypot(double x, double y);
  long double hypot(long double x, long double y);  // see [library.c]
  float hypotf(float x, float y);
  long double hypotl(long double x, long double y);

  // [c.math.hypot3], three-dimensional hypotenuse
  short float hypot(short float x, short float y, short float z);
  float hypot(float x, float y, float z);
  double hypot(double x, double y, double z);
  long double hypot(long double x, long double y, long double z);

  short float pow(short float x, short float y);  // see [library.c]
  float pow(float x, float y);  // see [library.c]
  double pow(double x, double y);
  long double pow(long double x, long double y);  // see [library.c]
  float powf(float x, float y);
  long double powl(long double x, long double y);

  short float sqrt(short float x);  // see [library.c]
  float sqrt(float x);  // see [library.c]
  double sqrt(double x);
  long double sqrt(long double x);  // see [library.c]
  float sqrtf(float x);
  long double sqrtl(long double x);

  short float erf(short float x);  // see [library.c]
  float erf(float x);  // see [library.c]
  double erf(double x);
  long double erf(long double x);  // see [library.c]
  float erff(float x);
  long double erfl(long double x);

  short float erfc(short float x);  // see [library.c]
  float erfc(float x);  // see [library.c]
  double erfc(double x);
  long double erfc(long double x);  // see [library.c]
  float erfcf(float x);
  long double erfcl(long double x);

  short float lgamma(short float x);  // see [library.c]
  float lgamma(float x);  // see [library.c]
  double lgamma(double x);
  long double lgamma(long double x);  // see [library.c]
  float lgammaf(float x);
  long double lgammal(long double x);

  short float tgamma(short float x);  // see [library.c]
  float tgamma(float x);  // see [library.c]
  double tgamma(double x);
  long double tgamma(long double x);  // see [library.c]
  float tgammaf(float x);
  long double tgammal(long double x);

  short float ceil(short float x);  // see [library.c]
  float ceil(float x);
// see [library.c] double ceil ( double x ); long double ceil ( long double x ); // see [library.c] float ceilf ( float x ); long double ceill ( long double x ); short float floor ( short float x ); // see [library.c] float floor ( float x ); // see [library.c] double floor ( double x ); long double floor ( long double x ); // see [library.c] float floorf ( float x ); long double floorl ( long double x ); short float nearbyint ( short float x ); // see [library.c] float nearbyint ( float x ); // see [library.c] double nearbyint ( double x ); long double nearbyint ( long double x ); // see [library.c] float nearbyintf ( float x ); long double nearbyintl ( long double x ); short float rint ( short float x ); // see [library.c] float rint ( float x ); // see [library.c] double rint ( double x ); long double rint ( long double x ); // see [library.c] float rintf ( float x ); long double rintl ( long double x ); long int lrint ( short float x ); // see [library.c] long int lrint ( float x ); // see [library.c] long int lrint ( double x ); long int lrint ( long double x ); // see [library.c] long int lrintf ( float x ); long int lrintl ( long double x ); long long int llrint ( short float x ); // see [library.c] long long int llrint ( float x ); // see [library.c] long long int llrint ( double x ); long long int llrint ( long double x ); // see [library.c] long long int llrintf ( float x ); long long int llrintl ( long double x ); short float round ( short float x ); // see [library.c] float round ( float x ); // see [library.c] double round ( double x ); long double round ( long double x ); // see [library.c] float roundf ( float x ); long double roundl ( long double x ); long int lround ( short float x ); // see [library.c] long int lround ( float x ); // see [library.c] long int lround ( double x ); long int lround ( long double x ); // see [library.c] long int lroundf ( float x ); long int lroundl ( long double x ); long long int llround ( short float x ); // see 
[library.c] long long int llround ( float x ); // see [library.c] long long int llround ( double x ); long long int llround ( long double x ); // see [library.c] long long int llroundf ( float x ); long long int llroundl ( long double x ); short float trunc ( short float x ); // see [library.c] float trunc ( float x ); // see [library.c] double trunc ( double x ); long double trunc ( long double x ); // see [library.c] float truncf ( float x ); long double truncl ( long double x ); short float fmod ( short float x , short float y ); // see [library.c] float fmod ( float x , float y ); // see [library.c] double fmod ( double x , double y ); long double fmod ( long double x , long double y ); // see [library.c] float fmodf ( float x , float y ); long double fmodl ( long double x , long double y ); short float remainder ( short float x , short float y ); // see [library.c] float remainder ( float x , float y ); // see [library.c] double remainder ( double x , double y ); long double remainder ( long double x , long double y ); // see [library.c] float remainderf ( float x , float y ); long double remainderl ( long double x , long double y ); short float remquo ( short float x , short float y , int * quo ); // see [library.c] float remquo ( float x , float y , int * quo ); // see [library.c] double remquo ( double x , double y , int * quo ); long double remquo ( long double x , long double y , int * quo ); // see [library.c] float remquof ( float x , float y , int * quo ); long double remquol ( long double x , long double y , int * quo ); short float copysign ( short float x , short float y ); // see [library.c] float copysign ( float x , float y ); // see [library.c] double copysign ( double x , double y ); long double copysign ( long double x , long double y ); // see [library.c] float copysignf ( float x , float y ); long double copysignl ( long double x , long double y ); double nan ( const char * tagp ); float nanf ( const char * tagp ); long double nanl ( const 
char * tagp ); short float nextafter ( short float x , short float y ); // see [library.c] float nextafter ( float x , float y ); // see [library.c] double nextafter ( double x , double y ); long double nextafter ( long double x , long double y ); // see [library.c] float nextafterf ( float x , float y ); long double nextafterl ( long double x , long double y ); short float nexttoward ( short float x , long double y ); // see [library.c] float nexttoward ( float x , long double y ); // see [library.c] double nexttoward ( double x , long double y ); long double nexttoward ( long double x , long double y ); // see [library.c] float nexttowardf ( float x , long double y ); long double nexttowardl ( long double x , long double y ); short float fdim ( short float x , short float y ); // see [library.c] float fdim ( float x , float y ); // see [library.c] double fdim ( double x , double y ); long double fdim ( long double x , long double y ); // see [library.c] float fdimf ( float x , float y ); long double fdiml ( long double x , long double y ); short float fmax ( short float x , short float y ); // see [library.c] float fmax ( float x , float y ); // see [library.c] double fmax ( double x , double y ); long double fmax ( long double x , long double y ); // see [library.c] float fmaxf ( float x , float y ); long double fmaxl ( long double x , long double y ); short float fmin ( short float x , short float y ); // see [library.c] float fmin ( float x , float y ); // see [library.c] double fmin ( double x , double y ); long double fmin ( long double x , long double y ); // see [library.c] float fminf ( float x , float y ); long double fminl ( long double x , long double y ); short float fma ( short float x , short float y , short float z ); // see [library.c] float fma ( float x , float y , float z ); // see [library.c] double fma ( double x , double y , double z ); long double fma ( long double x , long double y , long double z ); // see [library.c] float fmaf ( float x 
, float y , float z ); long double fmal ( long double x , long double y , long double z ); // [c.math.fpclass], classification / comparison functions int fpclassify ( short float x ); int fpclassify ( float x ); int fpclassify ( double x ); int fpclassify ( long double x ); bool isfinite ( short float x ); bool isfinite ( float x ); bool isfinite ( double x ); bool isfinite ( long double x ); bool isinf ( short float x ); bool isinf ( float x ); bool isinf ( double x ); bool isinf ( long double x ); bool isnan ( short float x ); bool isnan ( float x ); bool isnan ( double x ); bool isnan ( long double x ); bool isnormal ( short float x ); bool isnormal ( float x ); bool isnormal ( double x ); bool isnormal ( long double x ); bool signbit ( short float x ); bool signbit ( float x ); bool signbit ( double x ); bool signbit ( long double x ); bool isgreater ( short float x , short float y ); bool isgreater ( float x , float y ); bool isgreater ( double x , double y ); bool isgreater ( long double x , long double y ); bool isgreaterequal ( short float x , short float y ); bool isgreaterequal ( float x , float y ); bool isgreaterequal ( double x , double y ); bool isgreaterequal ( long double x , long double y ); bool isless ( short float x , short float y ); bool isless ( float x , float y ); bool isless ( double x , double y ); bool isless ( long double x , long double y ); bool islessequal ( short float x , short float y ); bool islessequal ( float x , float y ); bool islessequal ( double x , double y ); bool islessequal ( long double x , long double y ); bool islessgreater ( short float x , short float y ); bool islessgreater ( float x , float y ); bool islessgreater ( double x , double y ); bool islessgreater ( long double x , long double y ); bool isunordered ( short float x , short float y ); bool isunordered ( float x , float y ); bool isunordered ( double x , double y ); bool isunordered ( long double x , long double y ); [...]
Note: mathematical special functions for short float are not provided, out of concern about precision. They are still callable with a short float value thanks to promotion.
Modify Header <cmath> synopsis [cmath.syn] paragraph 2:
For each set of overloaded functions within <cmath>, with the exception of abs, there shall be additional overloads sufficient to ensure:
- If any argument of arithmetic type corresponding to a double parameter has type long double, then all arguments of arithmetic type corresponding to double parameters are effectively cast to long double.
- Otherwise, if any argument of arithmetic type corresponding to a double parameter has type double or an integer type, then all arguments of arithmetic type corresponding to double parameters are effectively cast to double.
- Otherwise, if any argument of arithmetic type corresponding to a double parameter has type float, then all arguments of arithmetic type corresponding to double parameters are effectively cast to float.
- Otherwise, all arguments of arithmetic type corresponding to double parameters have type short float.
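The first bullets of this rule already apply to the three existing floating-point types, so their effect can be observed today. The sketch below inspects the return type of std::pow for mixed arguments. The alias pow_result_t is a name made up for this illustration, and the new short float bullet cannot be checked because no shipping compiler provides that type; it is only mentioned in a comment.

```cpp
#include <cmath>
#include <type_traits>
#include <utility>

// Hypothetical helper alias: the type returned by std::pow(A, B).
template <class A, class B>
using pow_result_t = decltype(std::pow(std::declval<A>(), std::declval<B>()));

// Any long double argument: everything is cast to long double.
static_assert(std::is_same<pow_result_t<float, long double>, long double>::value,
              "float, long double -> long double");

// Otherwise, any double or integer argument: everything is cast to double.
static_assert(std::is_same<pow_result_t<float, int>, double>::value,
              "float, int -> double");
static_assert(std::is_same<pow_result_t<float, double>, double>::value,
              "float, double -> double");

// All arguments float: the float overload is selected.
static_assert(std::is_same<pow_result_t<float, float>, float>::value,
              "float, float -> float");

// Under the proposed wording, a (short float, float) call would be cast to
// float, and only an all-short-float call would stay in short float.
```

Note that an integer argument still forces double, so under the proposed wording a call mixing short float with an int would presumably be evaluated in double as well.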
Modify Absolute values [c.math.abs] as follows:
[ Note: The headers <cstdlib> and <cmath> declare the functions described in this subclause. — end note ]
int abs ( int j );
long int abs ( long int j );
long long int abs ( long long int j );
float abs ( float j );
double abs ( double j );
long double abs ( long double j );
Effects: The abs functions have the semantics specified in the C standard library for the functions abs, labs, llabs, fabsf, fabs, and fabsl.
Remarks: If abs() is called with an argument of type X for which is_unsigned_v<X> is true and if X cannot be converted to int by integral promotion, the program is ill-formed. [ Note: Arguments that can be promoted to int are permitted for compatibility with C. — end note ]
short float abs ( short float j );
Effects: as if by static_cast<short float>(abs(static_cast<float>(j))).
Modify Three-dimensional hypotenuse [c.math.hypot3] as follows:
short float hypot ( short float x , short float y , short float z );
float hypot ( float x , float y , float z );
[...]
Modify Classification / comparison functions [c.math.fpclass] paragraph 1:
The classification / comparison functions behave the same as the C macros with the corresponding names defined in the C standard library. Each function is overloaded for the four floating-point types.
Modify Class template num_get [locale.num.get], adding new overloads:
[...]
iter_type get ( iter_type in , iter_type end , ios_base & , ios_base :: iostate & err , unsigned long long & v ) const ;
iter_type get ( iter_type in , iter_type end , ios_base & , ios_base :: iostate & err , short float & v ) const ;
iter_type get ( iter_type in , iter_type end , ios_base & , ios_base :: iostate & err , float & v ) const ;
[...]
virtual iter_type do_get ( iter_type , iter_type , ios_base & , ios_base :: iostate & err , unsigned long long & v ) const ;
virtual iter_type do_get ( iter_type , iter_type , ios_base & , ios_base :: iostate & err , short float & v ) const ;
virtual iter_type do_get ( iter_type , iter_type , ios_base & , ios_base :: iostate & err , float & v ) const ;
Modify num_get members [facet.num.get.members], mentioning the new overload:
[...]
iter_type get ( iter_type in , iter_type end , ios_base & str , ios_base :: iostate & err , unsigned long long & val ) const ;
iter_type get ( iter_type in , iter_type end , ios_base & str , ios_base :: iostate & err , short float & val ) const ;
iter_type get ( iter_type in , iter_type end , ios_base & str , ios_base :: iostate & err , float & val ) const ;
[...]
Modify num_get virtual functions [facet.num.get.virtuals], mentioning the new overload:
iter_type do_get ( iter_type in , iter_type end , ios_base & str , ios_base :: iostate & err , unsigned long long & val ) const ;
iter_type do_get ( iter_type in , iter_type end , ios_base & str , ios_base :: iostate & err , short float & val ) const ;
iter_type do_get ( iter_type in , iter_type end , ios_base & str , ios_base :: iostate & err , float & val ) const ;
In num_get virtual functions [facet.num.get.virtuals] paragraph 3 stage 3, insert a new item before item 3:
[...]
- For an unsigned integer value, the function strtoull.
- For a short float value, the function strtof.
- For a float value, the function strtof.
[...]
Modify Class template basic_istream [istream], adding a new overload:
[...]
basic_istream < charT , traits >& operator >> ( unsigned long long & n );
basic_istream < charT , traits >& operator >> ( short float & f );
basic_istream < charT , traits >& operator >> ( float & f );
[...]
Modify Arithmetic extractors [istream.formatted.arithmetic] to mention the new overload:
[...]
operator >> ( unsigned long long & val );
operator >> ( short float & val );
operator >> ( float & val ); [...]
Modify Class template basic_ostream [ostream], adding a new overload:
[...]
basic_ostream < charT , traits >& operator << ( unsigned long long n );
basic_ostream < charT , traits >& operator << ( short float f );
basic_ostream < charT , traits >& operator << ( float f );
[...]
Modify Arithmetic inserters [ostream.inserters.arithmetic] as follows:
operator << ( bool val );
operator << ( short val );
operator << ( unsigned short val );
operator << ( int val );
operator << ( unsigned int val );
operator << ( long val );
operator << ( unsigned long val );
operator << ( long long val );
operator << ( unsigned long long val );
operator << ( short float val );
operator << ( float val );
operator << ( double val );
operator << ( long double val );
operator << ( const void * val );
Effects: The classes num_get<> and num_put<> handle locale-dependent numeric formatting and parsing. These inserter functions use the imbued locale value to perform numeric formatting. When val is of type bool, long, unsigned long, long long, unsigned long long, double, long double, or const void*, the formatting conversion occurs as if it performed the following code fragment:
bool failed = use_facet < num_put < charT , ostreambuf_iterator < charT , traits >> > ( getloc ()). put ( * this , * this , fill (), val ). failed ();
When val is of type short the formatting conversion occurs as if it performed the following code fragment:
ios_base :: fmtflags baseflags = ios_base :: flags () & ios_base :: basefield ;
bool failed = use_facet < num_put < charT , ostreambuf_iterator < charT , traits >> > ( getloc ()). put ( * this , * this , fill (), baseflags == ios_base :: oct || baseflags == ios_base :: hex ? static_cast < long > ( static_cast < unsigned short > ( val )) : static_cast < long > ( val )). failed ();
When val is of type int the formatting conversion occurs as if it performed the following code fragment:
ios_base :: fmtflags baseflags = ios_base :: flags () & ios_base :: basefield ;
bool failed = use_facet < num_put < charT , ostreambuf_iterator < charT , traits >> > ( getloc ()). put ( * this , * this , fill (), baseflags == ios_base :: oct || baseflags == ios_base :: hex ? static_cast < long > ( static_cast < unsigned int > ( val )) : static_cast < long > ( val )). failed ();
When val is of type unsigned short or unsigned int the formatting conversion occurs as if it performed the following code fragment:
bool failed = use_facet < num_put < charT , ostreambuf_iterator < charT , traits >> > ( getloc ()). put ( * this , * this , fill (), static_cast < unsigned long > ( val )). failed ();
When val is of type short float or float the formatting conversion occurs as if it performed the following code fragment:
bool failed = use_facet < num_put < charT , ostreambuf_iterator < charT , traits >> > ( getloc ()). put ( * this , * this , fill (), static_cast < double > ( val )). failed ();
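The last fragment means that a narrow floating-point value is widened to double before num_put formats it. This is already how float is inserted, so the behaviour can be illustrated with float alone; the proposal extends the same treatment to short float. In the sketch below, insert_direct, insert_widened and usage_example are names invented for the illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Formats val through the float inserter of basic_ostream.
inline std::string insert_direct(float val) {
    std::ostringstream os;
    os << val;
    return os.str();
}

// Formats val after an explicit widening to double, mirroring the
// static_cast<double>(val) in the specified code fragment.
inline std::string insert_widened(float val) {
    std::ostringstream os;
    os << static_cast<double>(val);
    return os.str();
}

inline void usage_example() {
    // Both paths go through num_put's double overload, so they agree.
    assert(insert_direct(0.1f) == insert_widened(0.1f));
    // 1.5 is exactly representable, so the default formatting prints "1.5".
    assert(insert_direct(1.5f) == "1.5");
}
```

Because the widening is value-preserving, no separate num_put overload for short float is needed for output.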
Modify Specializations for floating-point types [atomics.ref.float] paragraph 1:
There are specializations of the atomic_ref class template for the floating-point types short float, float, double, and long double. For each such type floating-point, the specialization atomic_ref<floating-point> provides additional atomic operations appropriate to floating-point types.
Modify Specializations for floating-point types [atomics.types.float] paragraph 1:
There are specializations of the atomic class template for the floating-point types short float, float, double, and long double. For each such type floating-point, the specialization atomic<floating-point> provides additional atomic operations appropriate to floating-point types.
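The additional operations include an atomic addition (fetch_add) on the stored value. As a rough idea of what such an operation does, the sketch below emulates it for float with a compare-exchange loop that compiles as C++11. The names fetch_add_sketch and usage_example are hypothetical; the real specializations expose fetch_add as a member, and a short float specialization would presumably behave the same way.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

// Hypothetical helper: atomically adds delta to target and returns the
// value held before the addition.
inline float fetch_add_sketch(std::atomic<float>& target, float delta) {
    float expected = target.load();
    // On failure compare_exchange_weak stores the current value into
    // expected, so the sum is recomputed on every retry.
    while (!target.compare_exchange_weak(expected, expected + delta)) {
    }
    return expected;
}

inline void usage_example() {
    std::atomic<float> total(1.0f);
    const float previous = fetch_add_sketch(total, 2.5f);
    assert(previous == 1.0f);
    assert(total.load() == 3.5f);
}

}  // namespace sketch
```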
4.2. Wording for library aliases
4.2.1. Library wording
Modify Headers [headers], table 18, by adding the new header to the list of C++ headers:
[...]
< contract >
< cstdfloat >
< deque > [...]
Modify Freestanding implementation [compliance], table 21:
Subclause                        Header(s)
[...]                            [...]
16.4  Integer types              <cstdint>
16.?  Floating-point types       <cstdfloat>
16.5  Start and termination      <cstdlib>
[...]                            [...]
Modify General [support.general] table 34:
Subclause                        Header(s)
[...]                            [...]
16.4  Integer types              <cstdint>
16.?  Floating-point types       <cstdfloat>
16.5  Start and termination      <cstdlib>
[...]                            [...]
Insert a new subclause into Language support library [language.support] after Integer types [cstdint]:
16.? Floating-point types [cstdfloat]
16.?.1 Header <cstdfloat> synopsis [cstdfloat.syn]
16.?.1.1 Exact-width floating-point types
namespace std {
  using float16_t = floating-point type; // optional
  using float32_t = floating-point type; // optional
  using float64_t = floating-point type; // optional
}
- The typedef name std::floatX_t designates a floating-point type with width X, no padding bits, and a representation conforming to that defined as binaryX format in ISO/IEC/IEEE 60559.
- These types are optional. However, if an implementation provides floating-point types with widths of 16, 32, or 64 bits, no padding bits, and that have a representation conforming to that defined above, it shall define the corresponding typedef names.
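One way an implementation might decide whether an existing type qualifies for an alias is to compare its width and precision against the binaryX parameters. The sketch below is only an approximation of the requirement above: has_binary_format and the would_define_* constants are hypothetical names, and the check relies on numeric_limits rather than proving the absence of padding bits.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

namespace sketch {

// Hypothetical predicate: true when T reports IEC 60559 conformance and has
// the storage width and significand precision of the requested format.
// Precision is 11 bits for binary16, 24 for binary32 and 53 for binary64.
template <class T>
constexpr bool has_binary_format(std::size_t width_bits, int precision_bits) {
    return std::numeric_limits<T>::is_iec559
        && sizeof(T) * CHAR_BIT == width_bits
        && std::numeric_limits<T>::digits == precision_bits;
}

}  // namespace sketch

// On mainstream platforms float is binary32 and double is binary64, so an
// implementation there would have to define float32_t and float64_t.
constexpr bool would_define_float32_t = sketch::has_binary_format<float>(32, 24);
constexpr bool would_define_float64_t = sketch::has_binary_format<double>(64, 53);
```

Whether float16_t is defined would then depend on the implementation providing a 16-bit type, such as the proposed short float, that satisfies the same predicate with 16 and 11.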