ISO/IEC JTC1 SC22 WG21 N4448 - 2015-04-12
Lawrence Crowl, Lawrence@Crowl.org
C++ currently provides relatively poor facilities for controlling rounding. It has even fewer facilities for controlling overflow. The lack of such facilities often leads programmers to ignore the issue, making software less robust than it could be (and should be).
This paper presents the issues and provides some candidate enumerations and operations. The intent of the paper is to gather feedback on support for and direction of future work.
Rounding is necessary whenever the resolution of a variable is coarser than the resolution of a value to be placed in that variable.
The numeric_limits field round_style provides information on the style of rounding employed by a type.
namespace std {
  enum float_round_style {
    round_indeterminate       = -1, // indeterminable
    round_toward_zero         =  0, // toward zero
    round_to_nearest          =  1, // to the nearest representable value
    round_toward_infinity     =  2, // toward [positive] infinity
    round_toward_neg_infinity =  3  // toward negative infinity
  };
}
This specification is incomplete in that it fails to specify what happens when the value is equally far from the two nearest representable values.
The standard also says "Specializations for integer types shall return round_toward_zero." This requirement is somewhat misleading, as a right-shift operation on a two's complement representation does not round toward zero.
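The mismatch can be seen in a minimal sketch. The helper names below are illustrative only. Right shift of a negative value is an arithmetic shift as of C++20; before that it is formally implementation-defined, though mainstream two's-complement compilers behave the same way.

```cpp
#include <cassert>
#include <limits>

// Integer division truncates: the quotient moves toward zero.
constexpr int halve_by_division(int x) { return x / 2; }

// Arithmetic right shift: the quotient moves toward negative infinity.
// (Guaranteed from C++20; implementation-defined for negative x before that.)
inline int halve_by_shift(int x) { return x >> 1; }

// The standard requires integer specializations to report round_toward_zero.
static_assert(std::numeric_limits<int>::round_style == std::round_toward_zero,
              "integer specializations report round_toward_zero");

inline void example() {
    assert(halve_by_division(-7) == -3);  // -3.5 rounded toward zero
    assert(halve_by_shift(-7) == -4);     // -3.5 rounded toward negative infinity
}
```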
Headers <cfenv> and <fenv.h> provide functions for setting and getting the floating-point rounding mode, fesetround and fegetround, respectively. The mode is specified via a macro constant:
Constant        Explanation
FE_DOWNWARD     rounding towards negative infinity
FE_TONEAREST    rounding towards the nearest representable value
FE_TOWARDZERO   rounding towards zero
FE_UPWARD       rounding towards positive infinity
Again, the specification is incomplete with respect to FE_TONEAREST.
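As a rough illustration of these facilities, the sketch below rounds a value to an integral value under a chosen mode using std::nearbyint, which honors the current rounding mode. The helper name round_under is hypothetical. The volatile qualifiers are there only to discourage the compiler from folding or moving the computation; strictly conforming code would also need floating-point environment access enabled. The FE_TONEAREST results assume an IEEE implementation, where ties go to even.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Round x to an integral value under the given <cfenv> rounding mode,
// restoring the previously active mode afterwards.
inline double round_under(int fe_mode, double x) {
    const int saved = std::fegetround();
    std::fesetround(fe_mode);
    volatile double input = x;                      // force a run-time load
    volatile double result = std::nearbyint(input); // uses the current mode
    std::fesetround(saved);
    return result;
}

inline void example() {
    assert(round_under(FE_DOWNWARD, 2.5) == 2.0);
    assert(round_under(FE_UPWARD, 2.5) == 3.0);
    assert(round_under(FE_TOWARDZERO, -2.5) == -2.0);
}
```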
The base requirements on a round function are:
Given a value x and two adjacent representable values y < z such that y ≤ x ≤ z then
if x = y then round(x) = y,
if x = z then round(x) = z,
and otherwise round(x) = y or round(x) = z.
Given an additional value w such that y ≤ w ≤ x ≤ z then
y ≤ round(w) ≤ round(x) ≤ z
The number of rounding modes is perhaps unlimited. However, we can explore the space of reasonably efficient rounding modes with two notions: a mode's direction and its domain.
There are six precisely-defined rounding directions and at least three additional practical directions. They are:
towards negative infinity  towards positive infinity 
towards zero  away from zero 
towards even  towards odd 
fastest execution time  smallest generated code 
whatever, I'm not picky 
Of these directions, only towards even and towards odd are unbiased.
Rounding towards odd has two desirable properties. First, the direction will not induce a carry out of the units position. This property avoids overflow and increased representation size. Second, because most operations tend to preserve zeros in the lowest bit, the towards-even direction carries less information than towards-odd. This effect increases as the number of bits decreases. However, rounding towards even produces numbers that are "nicer" than those produced by rounding towards odd. For example, you are more likely to get 10 than 9.9999999 with rounding towards even.
There are at least two direction domains:
 all
All values between two representable values move in the given direction.
 tie
Only values midway between two representable values move in the given direction. Other values move to the nearest representable value. That is, the direction is a tie breaker.
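To make the difference between the two domains concrete, here is a small sketch for the direction "towards negative infinity", rounding a double to an integral value. The function names are hypothetical, and the tie variant assumes values of moderate magnitude so that subtracting 0.5 is exact.

```cpp
#include <cassert>
#include <cmath>

// Domain "all": every value between two integers moves down.
inline double all_to_neg_inf(double x) { return std::floor(x); }

// Domain "tie": values move to the nearest integer; only exact midpoints
// are pushed towards negative infinity.
inline double tie_to_neg_inf(double x) { return std::ceil(x - 0.5); }

inline void example() {
    assert(all_to_neg_inf(2.75) == 2.0);  // moves down regardless of distance
    assert(tie_to_neg_inf(2.75) == 3.0);  // nearest integer wins
    assert(tie_to_neg_inf(2.5) == 2.0);   // midpoint: direction breaks the tie
}
```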
Several of the precise rounding modes are in current use.
direction                  domain: all                                                    domain: tie
towards negative infinity  interval arithmetic lower bound; two's complement right shift
towards positive infinity  interval arithmetic upper bound
towards zero               C/C++ integer division; signed-magnitude right shift
away from zero                                                                            schoolbook rounding; the <cmath> round functions
towards nearest even                                                                      IEEE floating-point default
towards nearest odd                                                                       some accounting rules
We represent the mode in C++ as an enumeration:
enum class rounding {
all_to_neg_inf, all_to_pos_inf,
all_to_zero, all_away_zero,
all_to_even, all_to_odd,
all_fastest, all_smallest,
all_unspecified,
tie_to_neg_inf, tie_to_pos_inf,
tie_to_zero, tie_away_zero,
tie_to_even, tie_to_odd,
tie_fastest, tie_smallest,
tie_unspecified
};
Some of these modes may not be needed.
Within the definition of the following functions, we use a defining function, which we do not expect will be directly represented in C++. It is T round(mode,U) where U either
has a finer resolution than T or
is evaluated as a real number expression.
We already have rounding functions for converting floating-point numbers to integers. However, the facility extends to different sizes of floating-point and between other numeric types.
template<rounding mode, typename T, typename U>
T convert(U value)
The result is round(mode, value).
A division function has obvious utility.
template<rounding mode, typename T>
T divide(T dividend, T divisor)
The result is round(mode, dividend/divisor). Remember that the division is evaluated as a real number. Obviously, the implementation will use a different strategy, but it must yield the same result.
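One possible integer strategy is sketched below. It is not the paper's implementation: the enumeration is a hypothetical subset of the proposed one, and the function assumes a non-zero divisor and operands whose negation does not overflow. It starts from the built-in truncating quotient and then adjusts by one where the mode requires.

```cpp
#include <cassert>

// Hypothetical subset of the rounding enumeration, for illustration only.
enum class rounding { all_to_neg_inf, all_to_pos_inf, all_to_zero,
                      tie_away_zero, tie_to_even };

template <rounding mode, typename T>
T divide(T dividend, T divisor) {
    const T q = dividend / divisor;  // built-in division truncates toward zero
    const T r = dividend % divisor;  // remainder carries the dividend's sign
    if (r == 0) return q;            // exact quotient: nothing to round

    const bool negative = (dividend < 0) != (divisor < 0);  // sign of exact quotient
    const T away = negative ? q - 1 : q + 1;                // neighbour farther from zero

    // Compare |r| with |divisor| - |r| instead of forming 2*|r|,
    // which could overflow.
    const T ar = r < 0 ? -r : r;
    const T ad = divisor < 0 ? -divisor : divisor;
    const bool above_half = ar > ad - ar;
    const bool tie = ar == ad - ar;

    switch (mode) {
    case rounding::all_to_zero:    return q;
    case rounding::all_to_neg_inf: return negative ? away : q;
    case rounding::all_to_pos_inf: return negative ? q : away;
    case rounding::tie_away_zero:  return (above_half || tie) ? away : q;
    case rounding::tie_to_even:
        if (above_half) return away;
        if (tie) return (q % 2 == 0) ? q : away;
        return q;
    }
    return q;  // not reached for the enumerators above
}

inline void example() {
    assert(divide<rounding::all_to_neg_inf>(-7, 2) == -4);  // -3.5 moves down
    assert(divide<rounding::tie_to_even>(5, 2) == 2);       // 2.5 ties to even
}
```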
Division by a power of two has substantial implementation efficiencies, and is used heavily in fixed-point arithmetic as a scaling mechanism. We represent the conjunction of these approaches with a rounding right shift.
template<rounding mode, typename T>
T rshift(T value, int bits)
The result is round(mode, value/2^{bits}).
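A rounding right shift can inspect the discarded low bits rather than divide. The sketch below is one way to do that under simplifying assumptions: it fixes the type to a 32-bit unsigned integer instead of taking a type parameter, supports only a hypothetical subset of modes, and requires 0 < bits < 32.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of the rounding enumeration, for illustration only.
enum class rounding { all_to_zero, all_to_pos_inf, tie_to_even };

// Computes round(mode, value / 2^bits) for unsigned values; 0 < bits < 32.
template <rounding mode>
std::uint32_t rshift(std::uint32_t value, int bits) {
    const std::uint32_t q = value >> bits;                               // truncated quotient
    const std::uint32_t rem = value & ((std::uint32_t{1} << bits) - 1);  // discarded bits
    const std::uint32_t half = std::uint32_t{1} << (bits - 1);           // exact midpoint
    switch (mode) {
    case rounding::all_to_zero:
        return q;
    case rounding::all_to_pos_inf:
        return rem != 0 ? q + 1 : q;
    case rounding::tie_to_even:
        if (rem > half) return q + 1;
        if (rem < half) return q;
        return q + (q & 1);  // midpoint: bump only when q is odd
    }
    return q;  // not reached for the enumerators above
}

inline void example() {
    assert(rshift<rounding::tie_to_even>(10, 2) == 2);  // 2.5 ties to even
    assert(rshift<rounding::tie_to_even>(14, 2) == 4);  // 3.5 ties to even
}
```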
We can add other functions as needed.
Overflow is possible whenever the range of an expression exceeds the range of a variable.
Signed integer overflow is undefined behavior. Programmers attempting to detect and handle overflow often get it wrong, in that they end up using overflow to detect overflow. Suffice it to say that present solutions are inadequate.
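A typical mistake is to test `a + b < a` after the fact, which already relies on the overflowing addition. The sketch below (the function name is hypothetical) instead checks against the limits before adding, so no overflowing operation is ever evaluated.

```cpp
#include <cassert>
#include <limits>

// Returns true if a + b would overflow int, without performing the addition.
inline bool add_would_overflow(int a, int b) {
    if (b > 0) return a > std::numeric_limits<int>::max() - b;
    return a < std::numeric_limits<int>::min() - b;  // b <= 0, so min - b cannot overflow
}

inline void example() {
    assert(add_would_overflow(std::numeric_limits<int>::max(), 1));
    assert(!add_would_overflow(1, 2));
}
```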
Unsigned integer overflow is defined to be mod 2^{bits-in-type}. While this definition is exactly right when coding in modular arithmetic, it is counterproductive when one is using unsigned arithmetic to state that the value is nonnegative. In the latter environment, undefined behavior on overflow is better, as it enables tools to detect problems.
Floating-point overflow can be detected and altered via fegetexceptflag, fesetexceptflag, feclearexcept, and feraiseexcept with the value FE_OVERFLOW. However, such checking requires additional out-of-band effort. That is, any checking takes place in code separate from the operations themselves.
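The out-of-band nature of this checking shows up in even a small sketch. The example below uses std::fetestexcept, a <cfenv> function not named in the text above, to query the flag. The function name doubling_overflows is hypothetical, and the volatile qualifiers only discourage the compiler from folding the multiplication away; it assumes an implementation that supports the FE_OVERFLOW flag.

```cpp
#include <cassert>
#include <cfenv>
#include <limits>

// Returns true if computing 2*x raised the floating-point overflow flag.
inline bool doubling_overflows(double x) {
    std::feclearexcept(FE_OVERFLOW);            // step 1: clear the flag
    volatile double input = x;
    volatile double result = input * 2.0;       // step 2: the actual operation
    (void)result;
    return std::fetestexcept(FE_OVERFLOW) != 0; // step 3: inspect the flag
}

inline void example() {
    assert(doubling_overflows(std::numeric_limits<double>::max()));
    assert(!doubling_overflows(1.0));
}
```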
The base requirements on an overflow function are:
Given a value x and a representable range y ≤ z such that y ≤ x ≤ z, an overflow does not occur and overflow(x) = x.
Otherwise, an overflow has occurred and the function may, for all overflow values, choose one of:
Consider the expression an error, handling it or not as appropriate.
Return a normal value w = overflow(x) such that y ≤ w ≤ z.
Return a special value indicating overflow, e.g. IEEE infinities. This choice implies defining the result of operations given this special value as an argument.
Several overflow modes are possible. We categorize them based on the choices in the base requirements. Other modes may be possible or desirable as well.
Some error modes are as follows.
 impossible
Mathematically, overflow cannot occur. This mode is useful when an overflow specification is necessary, but compiler-based range propagation is insufficient to eliminate a check. The mode is an assertion on the part of the programmer. It invites reviewers to examine the accompanying proof. Ignoring overflow and letting the program stray into undefined behavior is a suitable implementation.
 undefined
The programmer states that overflow is sufficiently rare so that overflow is not a concern. Aborting on overflow is a suitable implementation. So is ignoring the issue and letting the program stray into undefined behavior.
 abort
Abort the program on overflow. Detection is required.
 exception
Throw an exception on overflow. Detection is required.
A special substitution mode is as follows. Detection is required.
 special
Return one of possibly several special values indicating overflow.
Some normal substitution modes are as follows. Detection is required.
 saturate
Return the nearest value within the valid range.
 modulo with shifted scale
For unsigned arguments and range from 0 to z, the result is simply x mod (z+1). Shifting the range such that 0 < y ≤ z requires a more complicated expression, y + ((x–y) mod (z–y+1)). We can also use this expression when y < 0. That is, it is a general purpose definition. However, it may not yield results consistent with division.
 modulo with sign from dividend
With y = –z, the expression x–(z+1)×trunc(x/(z+1)) produces values consistent with truncated division, i.e. normal C/C++ division. For unbalanced ranges, e.g. the range of a two's-complement representation, the situation is more complicated. A significant property of this approach is that the sign of the remainder matches the sign of the dividend, enabling a strategy of using two different methods depending on the sign of the value. One can either use the smallest bound as the divisor, or use the bound corresponding to the sign of the dividend. The former fails to cycle through all elements of the range. The latter produces different periods depending on sign. The situation is yet more complicated when the range does not span zero.
 modulo with sign from divisor
With y = –z, the expression x–(z+1)×floor(x/(z+1)) produces values consistent with floored division. Given that z is positive, all results are nonnegative, using only half the range. Many of the same issues arise here as well.
 modulo with positive sign
With y ≤ 0 < z, the expression x–(z+1)×sgn(z+1)×floor(x/abs(z+1)) produces values consistent with Euclidean division. All results are nonnegative, using only half the range. Many of the same issues arise here as well.
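Several of the normal substitution modes can be sketched directly from the expressions above. The function names are hypothetical, the type is fixed to long long for brevity, and "mod" in the shifted-scale expression is read as the non-negative remainder, which is an assumption of this sketch. Intermediate results are assumed not to overflow.

```cpp
#include <cassert>

// saturate: return the nearest value within [lower, upper].
inline long long saturate(long long lower, long long upper, long long x) {
    return x < lower ? lower : (x > upper ? upper : x);
}

// modulo with shifted scale: lower + ((x - lower) mod (upper - lower + 1)).
inline long long modulo_shifted(long long lower, long long upper, long long x) {
    const long long span = upper - lower + 1;
    long long m = (x - lower) % span;  // C++ remainder takes the dividend's sign
    if (m < 0) m += span;              // convert to a non-negative remainder
    return lower + m;
}

// modulo with sign from dividend, balanced range [-z, z]:
// x - (z+1)*trunc(x/(z+1)) is what the built-in % computes.
inline long long modulo_dividend(long long z, long long x) {
    return x % (z + 1);
}

// modulo with sign from divisor, balanced range [-z, z]:
// x - (z+1)*floor(x/(z+1)), which is never negative for positive z.
inline long long modulo_divisor(long long z, long long x) {
    const long long d = z + 1;
    const long long r = x % d;
    return r < 0 ? r + d : r;
}

inline void example() {
    assert(saturate(0, 255, 300) == 255);
    assert(modulo_shifted(-128, 127, 128) == -128);  // two's-complement wraparound
    assert(modulo_dividend(127, -130) == -2);        // sign follows the dividend
    assert(modulo_divisor(127, -130) == 126);        // result is non-negative
}
```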
Various overflow modes are in current use.
mode                            uses
impossible                      well-analyzed programs
undefined                       C/C++ signed integers; C (TR 18037) unsaturated fixed-point types; most programs
abort
exception                       Ada integers; C# integers in checked context
special                         IEEE floating-point
saturate                        C (TR 18037) saturated fixed-point types; digital signal processing hardware
modulo with shifted scale       two's-complement wraparound; C/C++ unsigned integers; C# integers in unchecked context; Java signed integers
modulo with sign from dividend
modulo with sign from divisor
modulo with positive sign
We represent the mode in C++ as an enumeration:
enum class overflow {
impossible, undefined, abort, exception,
special,
saturate, modulo_shifted, modulo_dividend, modulo_divisor, modulo_positive
};
Within the definition of the following functions, we use a defining function, which we do not expect will be directly represented in C++. It is T overflow(mode,T lower,T upper,U value) where U either
has a wider range than T or
is evaluated as a real number expression.
Many C++ conversions already reduce the range of a value, but they do not provide programmer control of that reduction. We can give programmers control.
template<overflow mode, typename T, typename U>
T convert(U value)
The result is overflow(mode, numeric_limits<T>::min, numeric_limits<T>::max, value).
Being able to specify overflow between variables of the same type is also helpful.
template<overflow mode, typename T>
T limit(T lower, T upper, T value)
The result is overflow(mode, lower, upper, value).
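A minimal sketch of limit for three of the modes follows. The enumeration is a hypothetical subset of the proposed one, and the choice of std::overflow_error as the exception type is an assumption of this example, not something the paper specifies.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>

// Hypothetical subset of the overflow enumeration, for illustration only.
enum class overflow { abort, exception, saturate };

template <overflow mode, typename T>
T limit(T lower, T upper, T value) {
    if (lower <= value && value <= upper) return value;  // in range: no overflow
    switch (mode) {
    case overflow::abort:
        std::abort();
    case overflow::exception:
        throw std::overflow_error("value outside [lower, upper]");
    case overflow::saturate:
        return value < lower ? lower : upper;
    }
    return value;  // not reached for the enumerators above
}

inline void example() {
    assert(limit<overflow::saturate>(0, 100, 150) == 100);
    assert(limit<overflow::exception>(0, 100, 42) == 42);
}
```

Because the mode is a template argument, the switch is resolved at compile time and each instantiation contains only the code for its own mode.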
Common arguments can be elided with convenience functions.
template<overflow mode, typename T>
T limit_positive(T upper, T value)
The result is overflow(mode, 0, upper, value).
template<overflow mode, typename T>
T limit_signed(T upper, T value)
The result is overflow(mode, -upper, upper, value).
Two's-complement numbers are a slight variant on the above.
template<overflow mode, typename T>
T limit_twoscomp(T upper, T value)
The result is overflow(mode, -upper-1, upper, value).
For binary representations, we can also specify bits instead. While this specification may seem redundant, it enables faster implementations.
template<overflow mode, typename T>
T limit_positive_bits(T upper, T value)
The result is overflow(mode, 0, 2^{upper}-1, value).
template<overflow mode, typename T>
T limit_signed_bits(T upper, T value)
The result is overflow(mode, -(2^{upper}-1), 2^{upper}-1, value).
template<overflow mode, typename T>
T limit_twoscomp_bits(T upper, T value)
The result is overflow(mode, -2^{upper}, 2^{upper}-1, value).
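The three bit-count variants differ only in how the bounds are derived from the bit count. The sketch below shows them for the saturate mode only, with the type fixed to a 64-bit integer and the bit count passed as an int; these simplifications, and the requirement 0 <= bits < 63, are assumptions of the example.

```cpp
#include <cassert>
#include <cstdint>

// Saturate value into [lower, upper].
inline std::int64_t saturate(std::int64_t lower, std::int64_t upper, std::int64_t value) {
    return value < lower ? lower : (value > upper ? upper : value);
}

// Range [0, 2^bits - 1].
inline std::int64_t limit_positive_bits(int bits, std::int64_t value) {
    const std::int64_t top = (std::int64_t{1} << bits) - 1;
    return saturate(0, top, value);
}

// Range [-(2^bits - 1), 2^bits - 1].
inline std::int64_t limit_signed_bits(int bits, std::int64_t value) {
    const std::int64_t top = (std::int64_t{1} << bits) - 1;
    return saturate(-top, top, value);
}

// Range [-2^bits, 2^bits - 1].
inline std::int64_t limit_twoscomp_bits(int bits, std::int64_t value) {
    const std::int64_t top = (std::int64_t{1} << bits) - 1;
    return saturate(-top - 1, top, value);
}

inline void example() {
    assert(limit_positive_bits(8, 300) == 255);
    assert(limit_signed_bits(7, -200) == -127);
    assert(limit_twoscomp_bits(7, -200) == -128);
}
```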
Embedding overflow detection within regular operations can lead to enhanced performance. In particular, left shift is an important candidate operation within fixed-point arithmetic.
template<overflow mode, typename T>
T lshift(T value, int count)
The result is overflow(mode, numeric_limits<T>::min, numeric_limits<T>::max, value×2^{count}).
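For the saturate mode on 32-bit values, one simple strategy is to compute the product in a wider type and then clamp. The sketch below does that; the function name is hypothetical and it assumes 0 <= count <= 31 so that the 64-bit intermediate cannot itself overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Saturating left shift: overflow(saturate, min, max, value * 2^count).
inline std::int32_t lshift_saturate(std::int32_t value, int count) {
    // Multiply rather than shift so that negative values are well defined
    // in all language versions.
    const std::int64_t wide =
        static_cast<std::int64_t>(value) * (std::int64_t{1} << count);
    const std::int64_t lo = std::numeric_limits<std::int32_t>::min();
    const std::int64_t hi = std::numeric_limits<std::int32_t>::max();
    if (wide < lo) return std::numeric_limits<std::int32_t>::min();
    if (wide > hi) return std::numeric_limits<std::int32_t>::max();
    return static_cast<std::int32_t>(wide);
}

inline void example() {
    assert(lshift_saturate(1, 4) == 16);
    assert(lshift_saturate(std::numeric_limits<std::int32_t>::max(), 1) ==
           std::numeric_limits<std::int32_t>::max());
}
```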
As before, finer specification of the limits is reasonable.
We can add other functions as needed.
Some operations may reasonably both require rounding and require overflow detection.
First and foremost, conversion from floating-point to integer may require handling a floating-point value that has both a finer resolution and a larger range than the integer can handle. The problem generalizes to arbitrary numeric types.
template<overflow omode, rounding rmode, typename T, typename U>
T convert(U value)
The result is overflow(omode, numeric_limits<T>::min, numeric_limits<T>::max, round(rmode, value)).
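The ordering in this definition, round first and then apply the overflow mode, can be sketched for a double-to-int32 conversion. The example fixes the overflow mode to saturation, supports a hypothetical two-mode subset of the rounding enumeration, and assumes the input is not NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical subset of the rounding enumeration, for illustration only.
enum class rounding { all_to_neg_inf, tie_away_zero };

// Saturating conversion: overflow(saturate, min, max, round(rmode, value)).
template <rounding rmode>
std::int32_t convert_saturate(double value) {
    // Step 1: round to an integral value while still in the wide type.
    const double rounded = (rmode == rounding::all_to_neg_inf)
                               ? std::floor(value)
                               : std::round(value);  // ties away from zero
    // Step 2: clamp to the target range before the narrowing cast.
    const double lo = std::numeric_limits<std::int32_t>::min();
    const double hi = std::numeric_limits<std::int32_t>::max();
    if (rounded <= lo) return std::numeric_limits<std::int32_t>::min();
    if (rounded >= hi) return std::numeric_limits<std::int32_t>::max();
    return static_cast<std::int32_t>(rounded);
}

inline void example() {
    assert(convert_saturate<rounding::tie_away_zero>(2.5) == 3);
    assert(convert_saturate<rounding::all_to_neg_inf>(-2.5) == -3);
    assert(convert_saturate<rounding::tie_away_zero>(1e20) ==
           std::numeric_limits<std::int32_t>::max());
}
```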
Consider shifting as multiplication by a power of two. It has an analogy in a bidirectional shift, where a positive power is a left shift and a negative power is a right shift.
template<overflow omode, rounding rmode, typename T>
T bshift(T value, int count)
The result is count < 0 ? round(rmode, value×2^{count}) : overflow(omode, numeric_limits<T>::min, numeric_limits<T>::max, value×2^{count}).
The above functions pass the modes as template arguments. This approach seems to be the primary use case. It also permits incremental development of both modes and the types they apply to. Furthermore, it permits leaving unspecified those combinations of mode and type that make no sense. In the event that dynamic dispatch is needed, writing a dispatch function is not a significant task.
The problem with using template parameters is that the functions need to be partially specialized. They cannot be overloaded because the mode does not appear in the function signature. Unfortunately, there is no direct support for function template partial specialization. Working around this problem requires defining an artificial class to attach the partial specialization. This will increase the complexity of specification.
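The workaround described here might look like the following sketch. The helper class divide_impl and the two-mode enumeration are hypothetical names chosen for illustration. The primary class template is declared but never defined, so a combination of mode and type with no partial specialization fails to compile.

```cpp
#include <cassert>

// Hypothetical subset of the rounding enumeration, for illustration only.
enum class rounding { all_to_zero, all_to_neg_inf };

// Artificial class that carries the partial specializations.
// Declared only: unsupported (mode, type) pairs are compile-time errors.
template <rounding mode, typename T>
struct divide_impl;

template <typename T>
struct divide_impl<rounding::all_to_zero, T> {
    static T apply(T dividend, T divisor) { return dividend / divisor; }
};

template <typename T>
struct divide_impl<rounding::all_to_neg_inf, T> {
    static T apply(T dividend, T divisor) {
        T q = dividend / divisor;
        // Step down when the quotient is inexact and mathematically negative.
        if (dividend % divisor != 0 && ((dividend < 0) != (divisor < 0))) --q;
        return q;
    }
};

// The user-facing function template simply forwards to the class.
template <rounding mode, typename T>
T divide(T dividend, T divisor) {
    return divide_impl<mode, T>::apply(dividend, divisor);
}

inline void example() {
    assert(divide<rounding::all_to_zero>(-7, 2) == -3);
    assert(divide<rounding::all_to_neg_inf>(-7, 2) == -4);
}
```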