This proposal adds support for low level bitwise and logical operations to C++.
This proposal is a pure library extension.
It does not require any changes in the core language and does not depend on any other library extensions.
The proposal is composed entirely of free functions. The proposed functions are added to the <cmath>
and <memory>
headers. No new headers are introduced.
While this proposal can be implemented entirely in standard C++14, optimal implementations will require additional support from the compiler to detect and replace function calls with native instructions when available. See [BitOpsRef] for a reference implementation written in C++14.
The C and C++ languages provide an abstraction over the machine. The machine provides the common arithmetic and logical operations, which are accessed using the built in operators inherited from C. These operations are the primitives which are used to implement higher level abstractions.
We construct algorithms by combining these basic operations. Sometimes significant performance benefits can be gained by directly manipulating the binary quantities contained within the registers which make up this numerical abstraction. Many online and print references including [Anderson01], [Dietz01], [Neumann01], [Warren01], and [HACKMEM] are devoted to discovering these algorithms and implementing them efficiently.
Hardware vendors have understood the importance of high performance bitwise manipulation routines. Many of them have provided additional hardware which can perform these bitwise operations directly with a single instruction. These instructions are often much more efficient than computing the algorithm manually in C or assembly.
Other bitwise manipulation algorithms can be implemented using clever but non-intuitive combinations of arithmetic and logical operations. Most importantly, for some bitwise algorithms, the most efficient implementation varies between hardware platforms. These differences create an unreasonably large maintenance burden on the programmer who wishes to write efficient and portable code.
As a motivating example, consider the various implementations of
the count trailing zeroes algorithm presented in [Kostjuchenko01].
In order to implement an SSE2 optimized strlen()
function,
the author had to implement, test, and profile many different versions of
count trailing zeroes. None of them take advantage of native instructions.
One who wishes to exploit
the bsf
or tzcnt
instructions on Intel must rely on non-standard
compiler intrinsics or inline assembly. One must also provide
backup implementations for other platforms which do not have such instructions.
Adding support for native instructions requires a nest of #ifdef
s and
deep knowledge of different processor architectures. This is a heavy cost
in programmer time.
Bitwise algorithms are general purpose tools which can be used in a wide variety of domains and are the key to unlocking high performance in many important algorithms. A bitwise operations library has been badly needed in the C and C++ standard libraries for many years.
We present a bitwise operations library which exposes these native instructions wherever possible and provides backup implementations in C++ if they do not exist. Because this library would be part of the standard, these commonly used routines could be implemented once and profiled on each platform. Finally, this library offers an interface which takes full advantage of the latest processor features including ARMv8 and Intel BMI2.
There are seemingly endless ways one can manipulate binary quantities. How does one go about choosing which ones to include in a library and which ones to exclude? How does one choose proper names for each function? Which algorithms can be trivially converted into single instructions by the optimizer and which actually require the programmer to declare their use through a function call? We will address these questions with the following design goals.
In 1970, the Digital Equipment Corporation announced the PDP11.
This 16 bit machine has 3 instructions of interest, ROR
(rotate right),
ROL
(rotate left), and SWAB
(swap bytes).
These operations along with their later 32 and 64 bit variants are provided
by many more modern machines as will be shown. As of 2013,
the programmer still does not have direct access to these instructions in modern C and C++.
Therefore, the first and most important goal of this proposal is to provide the programmer with better access to the machine via a new set of primitives which go beyond simple arithmetic and logical operations. We will present new functions for the standard library which can be implemented using only few instructions if supported by the machine, using backup implementations if no such support is provided by the hardware.
In designing this proposal, we wish not just to limit ourselves to operations which may have native machine instruction implementations on at least one platform. We would like to provide a standard library of primitives which are commonly found to be reimplemented time and time again in different code bases. The standard library already provides a rich set of generic containers and algorithms. What is missing is a set of bitwise manipulation primitives.
Of particular emphasis are algorithms whose most efficient implementations depend on the implementations of other bitwise operations. A motivating example is checking whether a number is a power of 2.
Consider the following implementations:
bool ispow2(unsigned x) { return popcount(x) == 1; }
bool ispow2(unsigned x) { return x != 0 && (x & (x -1)) == 0; }
In the above example, popcount()
is the population count or number of 1 bits in x
.
On a machine with a popcount instruction, the first implementation uses less instructions
and no branches. Without a popcount instruction, the second version is the better choice
as computing popcount requires much more than a few logical operations and comparisons
[Dietz01]. In order to implement ispow2()
, the programmer is faced with
the same set of dilemnas as with the count trailing zeroes example from the Motivation
section.
The following terminology is used in the remainder of this document to describe the technical aspects of this proposal.
true
if its value is 1, otherwise return false
.CHAR_BIT
, usually 8~T(0)
: This statement represents a quantity of type T
where all of the bits are 1. We avoid the more commonly used T(-1)
as it assumes 2's compliment signed integers and is a little less intuitive to the unitiated.We will now describe the additions to <cmath>
and <memory>
. This is a procedural library implemented
entirely using constexpr
templated free functions.
In addition, each function has been qualified with noexcept
. These
operations are most often used in highly optimized numerical code where the overhead of exception
handling would be inappropriate. For functions which have pre-conditions on their inputs, we
have opted for undefined return values if these pre-conditions are ever violated.
The functions are classified into different groups to aid analysis and discussion
and each group will be presented one at a time.
We have chosen to support all signed and unsigned integral types in this proposal. It
is often suggested that signed integers represent "numbers" and unsigned integers
represent "bit fields" and that they should never be used together. While we
generally agree with this philosophy, many of these algorithms have real use cases
for signed and unsigned integral values.
The primary danger of using both
signed and unsigned integers comes from the pitfalls of comparing signed and unsigned values. None of the functions in this
proposal require or encourage comparing or combining of signed and unsigned types.
The template arguments for each proposed function are named integral
to indicate generic support
for all builtin integral types, signed and unsigned. Functions which take more than one
integral argument of different types will use a single letter suffix, for example
integrall
and integralr
for left and right hand arguments.
With regards to signed integers, this proposal does not require signed integers be implemented using 2's compliment. However, the design of this proposal considers the practical reality that almost all modern hardware does in fact use 2's compliment. All example code, including the reference implementation [BitOpsRef] assume 2's compliment signed integers with undefined behavior on overflow and underflow. Adding support for other signed representations is an exercise left for the reader.
Each section will describe the full technical specifications of the functions in that group, noting their return values and undefined behavior if any. We will also discuss the background and justification for each of the functions and list some applications where necessary. For the more complicated algorithms, examples will be provided to help illustrate how they work.
The following sections describe the additions to the <cmath>
header.
Bit shifting is provided in C++ with operator<<
and operator>>
for integral types. It is
a very simplistic abstraction with many deficiencies and some subtle caveats.
First as noted earlier, there is no primitive for rotational shifts even though these shifts can be found in the instruction
set of almost every machine. Second,
operator>>
for signed types has implementation defined behavior with regards to filling in the high order bits, making
it nearly useless when writing portable code.
Writing a portable arithmetic right shift cumbersome at best and inefficient at worst. Finally, performing a logical right shift on a signed
quantity is also cumbersome because it requires casts which obscure the meaning of the code.
//SHift Logical Left
template <class integral>
constexpr integral shll(integral x, int s) noexcept;
x << s
s < 0 || s > sizeof(x) * CHAR_BIT
//SHift Logical Right
template <class integral>
constexpr integral shlr(integral x, int s) noexcept;
x
with all of it's bits shifted right by s
positions. The s
high order bits of the result are reset.s < 0 || s > sizeof(x) * CHAR_BIT
//SHift Arithmetic Left
template <class integral>
constexpr integral shal(integral x, int s) noexcept;
x << s
s < 0 || s > sizeof(x) * CHAR_BIT
shll()
and is only provided for symmetry.//SHift Arithmetic Right
template <class integral>
constexpr integral shar(integral x, int s) noexcept;
x
with all of it's bits shifted right by s
positions. The s
high order bits of the result are set to the value of most significant bit of x
.s < 0 || s > sizeof(x) * CHAR_BIT
//ROTate Left
template <class integral>
constexpr integral rotl(integral x, int s) noexcept;
x
with all of it's bits shifted left by s
positions.
The s
low order bits are set to the s
high order bits of x
.s < 0 || s > sizeof(x) * CHAR_BIT
//ROTate Right
template <class integral>
constexpr integral rotr(integral x, int s) noexcept;
x
with all of it's bits shifted right by s
positions.
The s
high order bits are set to the s
low order bits of x
.s < 0 || s > sizeof(x) * CHAR_BIT
Bit counting is used to construct efficient implementations of other higher level algorithms. Many of these operations have native support on a wide variety of modern and antiquated hardware. Some example applications will be provided below.
//CouNT Trailing 0's
template <class integral>
constexpr int cntt0(integral x) noexcept;
x
, or sizeof(x) * CHAR_BIT
if x == 0
.//CouNT Leading 0's
template <class integral>
constexpr int cntl0(integral x) noexcept;
x
, or sizeof(x) * CHAR_BIT
if x == 0
.//CouNT Trailing 1's
template <class integral>
constexpr int cntt1(integral x) noexcept;
x
, or sizeof(x) * CHAR_BIT
if x == ~integral(0)
.//CouNT Leading 1's
template <class integral>
constexpr int cntl1(integral x) noexcept;
x
, or sizeof(x) * CHAR_BIT
if x == ~integral(0)
.//POPulation COUNT
template <class integral>
constexpr int popcount(integral x) noexcept;
x
.//PARITY
template <class integral>
constexpr int parity(integral x) noexcept;
x
is odd, otherwise 0.One application of cntt0
is in computing the greatest common divisor of 2 numbers. Credit goes
to Howard Hinnant for bringing this to our attention.
template <typename unsigned-integral>
T gcd(T x, T y)
{
if (x == 0)
return y;
if (y == 0)
return x;
int cf2 = std::cntt0(x | y);
x >>= std::cntt0(x);
while (true)
{
y >>= std::cntt0(y);
if (x == y)
break;
if (x > y)
std::swap(x, y);
if (x == 1)
break;
y -= x;
}
return x << cf2;
}
As mentioned earlier, we can use popcount()
to detect whether or not an integer (signed or unsigned) is a power of 2.
template <typename integral>
bool ispow2(integral x) {
return x > 0 && popcount(x) == 1;
}
The following functions perform simple manipulations on the rightmost bits of the given quantity. All of these operations can be trivially implemented using a few arithmetic and logical operations. Therefore, these functions are only provided as usability wrappers in order to allow the programmer to avoid having to spend time looking them up, reimplementing, and/or unit testing them. Because their simple implementations, we have included the implementations in the function listing. Credit goes to [Warren01] and [ChessProg] for providing these implementations and insight into their usage.
Most of the operations in the section are implemented in hardware on Intel and AMD machines which have Intel BMI and/or AMD TBM extensions (see Survey of Hardware Support). All of these functions were tested on gcc 4.8 using the provided C++ implementations. We found that on BMI and TBM enabled compilation, the optimizer was successfully able to compile the C++ expression a single BMI or TBM instruction. Therefore an implementation of this section can simply use the provided trivial implementations and rely on the optimizer for hardware support if available.
//ReSeT Least Significant 1 Bit
template <class integral>
constexpr integral rstls1b(integral x) noexcept;
x
with it's least significant 1 bit reset, or 0 if x == 0
.x & (x - 1)
//SET Least Significant 0 Bit
template <class integral>
constexpr integral setls0b(integral x) noexcept;
x
with it's least significant 0 bit set, or x
if x == ~integral(0)
.x | (x + 1)
//ISOlate Least Significant 1 Bit
template <class integral>
constexpr integral isols1b(integral x) noexcept;
x
is set and all of the other bits reset. Returns 0 if x == 0
.(~x) & (-x)
//ISOlate Least Significant 0 Bit
template <class integral>
constexpr integral isols0b(integral x) noexcept;
x
is set and all of the other bits reset. Returns x
if x == ~integral(0)
.(~x) & (x + 1)
//ReSeT Trailing 1's
template <class integral>
constexpr integral rstt1(integral x) noexcept;
x
, or 0 if x == ~integral(0)
.x & (x + 1)
//SET Trailing 0's
template <class integral>
constexpr integral sett0(integral x) noexcept;
x
, or x
if x == 0
.x | (x - 1)
//MaSK Trailing 0's
template <class integral>
constexpr integral maskt0(integral x) noexcept;
x
are set, the remaining bits reset.(~x) & (x - 1)
//MaSK Trailing 1's
template <class integral>
constexpr integral maskt1(integral x) noexcept;
x
are set, the remaining bits reset.~((~x) | (x + 1))
//MaSK Trailing 0's and Least Significant 1 Bit
template <class integral>
constexpr integral maskt0ls1b(integral x) noexcept;
x
are set, the remaining bits reset.x ^ (x - 1)
//MaSK Trailing 1's and Least Significant 0 Bit
template <class integral>
constexpr integral maskt1ls0b(integral x) noexcept;
x
are set, the remaining bits are reset.x ^ (x + 1)
Most programmers have at some point in their career have needed to index and manipulate a single bit within a given integral quantity. These functions are trivial to implement and are provided only for usability.
//SET BIT
template <class integral>
constexpr integral setbit(integral x, int b) noexcept;
b
of x
.b < 0 || b >= sizeof(x) * CHAR_BIT
.x | (integral(1) << b)
//ReSeT BIT
template <class integral>
constexpr integral rstbit(integral x, int b) noexcept;
b
of x
.b < 0 || b >= sizeof(x) * CHAR_BIT
.x & ~(integral(1) << b)
//FLIP BIT
template <class integral>
constexpr integral flipbit(integral x, int b) noexcept;
b
of x
.b < 0 || b >= sizeof(x) * CHAR_BIT
.x ^ (integral(1) << b)
//TEST BIT
template <class integral>
constexpr bool testbit(integral x, int b) noexcept;
b
of x
is set, otherwise 0.b < 0 || b >= sizeof(x) * CHAR_BIT
.bool(x & (integral(1) << b))
The following operations manipulate ranges of bits above or below a given index. One of them is implemented in hardware (see Survey of Hardware Support), the rest are provided for usability and completeness.
//ReSeT BITS Greater than or Equal to
template <class integral>
constexpr integral rstbitsge(integral x, int b) noexcept;
x
in positions greater than or equal to b
.b < 0 || b >= sizeof(x) * CHAR_BIT
.//ReSeT BITS Less than or Equal to
template <class integral>
constexpr integral rstbitsle(integral x, int b) noexcept;
x
in positions less than or equal to b
.b < 0 || b >= sizeof(x) * CHAR_BIT
.//SET BITS Greater than or Equal to
template <class integral>
constexpr integral setbitsge(integral x, int b) noexcept;
x
in positions greater than or equal to b
.b < 0 || b >= sizeof(x) * CHAR_BIT
.//SET BITS Less than or Equal to
template <class integral>
constexpr integral setbitsle(integral x, int b) noexcept;
x
in positions less than or equal to b
.b < 0 || b >= sizeof(x) * CHAR_BIT
.//FLIP BITS Greater than or Equal to
template <class integral>
constexpr integral flipbitsge(integral x, int b) noexcept;
x
in positions greater than or equal to b
.b < 0 || b >= sizeof(x) * CHAR_BIT
.//FLIP BITS Less than or Equal to
template <class integral>
constexpr integral flipbitsle(integral x, int b) noexcept;
x
in positions less than or equal to b
.b < 0 || b >= sizeof(x) * CHAR_BIT
.These functions provide a generic interface for permuting the bits and bytes in a word. Each function takes the following form:
template <typename integral>
constexpr integral permute_bits(integral x, /*arg2, arg3, ...*/ subword_bits=1, num_swar_words=1) noexcept;
In the above example, x
is the value to permute, with additional arguments following it if needed.
Each function operates on subwords of size subword_bits
measured in bits. In this case,
permute_bits(x)
will perform the permutation on all of the bits of x
, permute_bits(x, 2)
will
permute each pair of bits in x
, permute_bits(x, 4)
each nibble in x
, and finally permute_bits(x, CHAR_BIT)
will permute the bytes of x
.
The num_swar_words
parameter (Number of Simd Within A Register Words) enables parallel operation
on multiple words within x
. The size in bits of each individual word to permute will be sizeof(integral) * CHAR_BIT / num_swar_words
.
For example, permute_bits<uint32_t>(x, 1, 2)
will independently permute the 16 high order bits of x
and the 16 low order bits of x
. Another example, permute_bits<uint32_t>(x, 2, 4)
will permute the
pairs of bits in each byte (assuming CHAR_BIT == 8
) of x
.
Finally, for each bitwise permutation routine, we also provide a corresponding bytewise permutation routine. These are provided for usability. They operate exactly like their bitwise cousins, except that their subword size is computed in bytes instead of bits.
template <typename integral>
constexpr integral permute_bytes(integral x, /*arg2, arg3, ...*/ subword_bytes=1, num_swar_words=1) noexcept;
The bytewise routines are trivially implemented as simple wrappers over the bitwise routines where
we perform the simple conversion: subword_bits = subword_bytes * CHAR_BITS
.
num_swar_words < 0 || num_swar_words > sizeof(integral) * CHAR_BIT
(sizeof(integral) * CHAR_BIT) % num_swar_words != 0
subword_bits < 0 || subword_bits > ((sizeof(integral) * CHAR_BIT) / num_swar_words)
((sizeof(integral) * CHAR_BIT) / num_swar_words) % subword_bits != 0
template <class integral>
constexpr integral reverse_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
x
into num_swar_words
"words" of equal size and then independently reverse the subwords of size subword_bits
bits in each word.template <class integral>
constexpr integral reverse_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
reverse_bits(x, subword_bytes * CHAR_BIT, num_swar_words);
template <class integral>
constexpr integral outer_perfect_shuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
x
into num_swar_words
"words" of equal size and then independently outer perfect shuffle the subwords of size subword_bits
bits in each word.template <class integral>
constexpr integral outer_perfect_shuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
outer_perfect_shuffle_bits(x, subword_bytes * CHAR_BIT, num_swar_words);
template <class integral>
constexpr integral outer_perfect_unshuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
x
into num_swar_words
"words" of equal size and then independently outer perfect unshuffle the subwords of size subword_bits
bits in each word.template <class integral>
constexpr integral outer_perfect_unshuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
outer_perfect_unshuffle_bits(x, subword_bytes * CHAR_BIT, num_swar_words);
template <class integral>
constexpr integral inner_perfect_shuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
x
into num_swar_words
"words" of equal size and then independently inner perfect shuffle the subwords of size subword_bits
bits in each word.template <class integral>
constexpr integral inner_perfect_shuffle_bytes(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
inner_perfect_shuffle_bits(x, subword_bytes * CHAR_BIT, num_swar_words);
template <class integral>
constexpr integral inner_perfect_unshuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
x
into num_swar_words
"words" of equal size and then independently inner perfect unshuffle the subwords of size subword_bits
bits in each word.template <class integral>
constexpr integral inner_perfect_unshuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
inner_perfect_unshuffle_bits(x, subword_bytes * CHAR_BIT, num_swar_words);
template <class integral>
constexpr integral deposit_bits_right(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
x
into num_swar_words
"words" of equal size and then independently deposit the subwords of size subword_bits
bits in each word identified by the bits of mask
to the low order subwords of the result. The remaining subwords are set to 0.template <class integral>
constexpr integral deposit_bytes_right(integral x, integral mask, int subword_right=1, int num_swar_words=1) noexcept;
deposit_bits_right(x, mask, subword_bytes * CHAR_BIT, num_swar_words);
template <class integral>
constexpr integral deposit_bits_left(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
x
into num_swar_words
"words" of equal size and then independently deposit the subwords of size subword_bits
bits in each word identified by the bits of mask
to the high order subwords of the result. The remaining subwords are set to 0.template <class integral>
constexpr integral deposit_bytes_left(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;
deposit_bits_left(x, mask, subword_bytes * CHAR_BIT, num_swar_words);
template <class integral>
constexpr integral extract_bits_right(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
x
into num_swar_words
"words" of equal size and then independently extract the low order subwords of size subword_bits
bits in each word to the subwords of the result identified by the bits of mask
. The remaining subwords are set to 0.template <class integral>
constexpr integral extract_bytes_right(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;
extract_bits_right(x, mask, subword_bytes * CHAR_BIT, num_swar_words);
template <class integral>
constexpr integral extract_bits_left(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
x
into num_swar_words
"words" of equal size and then independently extract the high order subwords of size subword_bits
bits in each word to the subwords of the result identified by the bits of mask
. The remaining subwords are set to 0.template <class integral>
constexpr integral extract_bytes_left(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;
extract_bits_left(x, mask, subword_bytes * CHAR_BIT, num_swar_words);
The following table shows how some example 8 bit binary values would be permuted by each function. We use the C++14 binary literal syntax here. The bits of a given value are represented by letters to show how a generic value would be permuted. For a more detailed treatment of these operations, refer to [Neumann01]] and Chapter 7 of [Warren01].
reverse_bits(ABCDEFGHb)
-> GHFEDCBA
reverse_bits(ABCDEFGHb, 1, 2)
-> DCBAHGFEb
reverse_bits(ABCDEFGHb, 1, 4)
-> BADCFEHGb
reverse_bits(ABCDEFGHb, 2)
-> GHEFCDABb
reverse_bits(ABCDEFGHb, 2, 2)
-> CDABGHEFb
reverse_bits(ABCDEFGHb, 4)
-> EFGHABCDb
outer_perfect_shuffle(ABCDEFGHb)
-> EAFBGCHDb
inner_perfect_shuffle(ABCDEFGHb)
-> AEBFCGEHb
outer_perfect_unshuffle(ABCDEFGHb)
-> BDFHACEGb
inner_perfect_unshuffle(ABCDEFGHb)
-> ACEGBDFHb
outer_perfect_unshuffle(outer_perfect_shuffle(x))
-> x
inner_perfect_unshuffle(inner_perfect_shuffle(x))
-> x
deposit_bits_right(ABCDEFGHb, 01110110b)
-> 0000CDFGb
deposit_bits_right(ABCDEFGHb, 10001000b)
-> 000000AEb
deposit_bits_left(ABCDEFGHb, 00010110b)
-> DEF00000b
deposit_bits_left(ABCDEFGHb, 10111000b)
-> ACDE0000b
extract_bits_right(ABCDEFGHb, 11000110b)
-> EF000GH0b
extract_bits_right(ABCDEFGHb, 10101010b)
-> E0F0G0H0b
extract_bits_left(ABCDEFGHb, 00110110b)
-> 00AB0CD0b
extract_bits_left(ABCDEFGHb, 11111001b)
-> ABCDE00Fb
The following functions detect and compute powers of 2.
//IS POWer of 2
template <class integral>
constexpr bool ispow2(integral x) noexcept;
true
if x
is a positive power of 2, otherwise false
.//CEILing Power of 2
template <class integral>
constexpr integral ceilp2(integral x) noexcept;
n
where ispow2(n) && n >= x
.n
is too large to be represented by type integral
.//FLOOR Power of 2
template <class integral>
constexpr integral floorp2(integral x) noexcept;
n
where ispow2(n) && N <= x
.x <= 0
.Saturated arithmetic is useful in digital signal processing applications [EmbedGuru01]. It is also provided as a hardware instruction on some machines. In our efforts to better expose hardware features, we have included saturated addition and subtraction functions in this proposal.
//SATurated ADDition
template <class integral_l, class integral_r>
constexpr auto satadd(integral_l l, integral_r r) noexcept -> decltype(l + r);
l + r
std::numeric_limits<decltype(l + r)>::max()
std::numeric_limits<decltype(l + r)>::min()
//SATurated SUBtraction
template <class integral_l, class integral_r>
constexpr auto satsub(integral_l l, integral_r r) noexcept -> decltype(l - r);
l - r
std::numeric_limits<decltype(l + r)>::max()
std::numeric_limits<decltype(l + r)>::min()
This section describes the additions to the <memory>
header.
These are primitives used for aligning objects in memory. They supplement use cases with which std::align
is
not designed to handle. These are very useful operations and all of them have trivial implementations.
They can often be found in operating system kernels and device drivers reimplemented time and time
again as macros.
template <class integral>
constexpr bool is_aligned(integral x, size_t align) noexcept;
true
if x
is a multiple of align
.(x & (a - 1)) == 0
bool is_aligned(void* val, size_t align) noexcept;
is_aligned(uintptr_t(val), align)
.template <class integral>
constexpr integral align_up(integral x, size_t align) noexcept;
n
such that is_aligned(n, align) && n >= x
.(x + (a - 1)) & -a
void* align_up(void* val, size_t align) noexcept;
(void*)align_up(uintptr_t(val), align)
.template <class integral>
constexpr integral align_down(integral x, size_t align) noexcept;
n
such that is_aligned(n, align) && n <= x
.x & (-a)
void* align_down(void* val, size_t align) noexcept;
(void*)align_down(uintptr_t(val), align)
.We currently have std::align
in the standard for doing alignment calculations.
The function std::align
has one very specific use case, that is to carve out an aligned buffer of a known size within a larger buffer.
In order to use std::align
, the user must a priori know the size of the aligned buffer
they require. Unfortunately in some use cases, even calculating the size of this buffer
as an input to std::align
itself requires doing alignment calculations.
Consider the following example of using aligned SIMD registers to process a memory buffer.
The alignment calculations here cannot be done with std::align
.
void process(char* b, char* e) {
char* pb = std::min((char*)std::align_up(b, sizeof(simd16)), e);
char* pe = (char*)std::align_down(e, sizeof(simd16));
for(char* p = b; p < pb; ++p) {
process1(p);
}
for(char* p = pb; p < pe; p += sizeof(simd16)) {
simd16 x = simd16_aligned_load(p);
process16(x);
simd16_aligned_store(x, p);
}
for(char* p = pe; p < e; ++p) {
process1(p);
}
}
We conclude that std::align
is much too specific for general alignment calculations. It has a very narrow
use case and should only be considered as a helper function for when that use case is needed.
Those who wish to implement the functions provided by this proposal must consider the following guidelines:
bsf
instruction followed by a cmov
instruction to handle the case where the input is 0. In many contexts
the optimizer is able to prove that the input is never 0 and thus the
cmov instruction can be omitted. These optimizations are often impossible with inline assembly.The following is a list of compiler intrinsics and native instructions which can be used to implement the proposal on various platforms. Several machine architectures were surveyed for their instruction references. The purpose of this section is to demonstrate the current state of the art on many different machines. We have also noted when one operation is trivially implementable from another bitops proposal operation.
cntt0(x)
bsf
, cmov
tzcnt
cttz
x == 0 ? sizeof(x) * CHAR_BIT : __builtin_ctz(x)
cntt1(~x)
cntl0(x)
bsr
, cmov
lzcnt
CLZ
clz
cntlzd
CLZ
(x == 0 ? sizeof(x) * CHAR_BIT : __builtin_clz(x))
cntl1(~x)
cntt1(x)
cntt0(~x)
cntl1(x)
CLS
SIGNBITS
NORM
SBC
CLO
cntl0(~x)
popcount(x)
popcnt
popcnt
CTPOP
popcntb
POPC
__builtin_popcount(x)
parity(x)
__builtin_parity(x)
popcount(x) & 1
rstls1b(x)
BLSR
setls0b(x)
BLCS
isols0b(x)
BLCI
, NOT
BLCIC
isols1b(x)
BLSI
BLSIC
, NOT
rstt1(x)
BLCFILL
sett0(x)
BLSFILL
maskt0(x)
TZMSK
maskt1(x)
T1MSKC
, NOT
maskt0ls1b(x)
BLSMSK
maskt1ls0b(x)
BLCMSK
rstbitsge(x, b)
BZHI
reverse_bits<uint32_t>(x)
RBIT
BITR
reverse_bits<uint64_t>(x)
RBIT
reverse_bits<uint8_t>(x, 4)
SWAP
reverse_bytes<uint16_t>(x)
SWAB
__builtin_bswap16(x)
reverse_bytes<uint32_t>(x)
bswap
REV
__builtin_bswap32(x)
reverse_bytes<uint64_t>(x)
bswap
REV
__builtin_bswap64(x)
reverse_bytes<uint32_t>(x, 1, 2)
REV16
reverse_bytes<uint64_t>(x, 1, 4)
REV16
reverse_bytes<uint64_t>(x, 1, 2)
REV32
reverse_bytes<uint32_t>(x, 2)
SWAP
int32_t(reverse_bytes<int16_t>(x))
REVSH
int32_t(reverse_bytes<uint16_t>(x))
REVSH
deposit_bits_right(x)
extract_bits_right(x)
satadd(l, r)
satsub(l, r)
Naming is one of the most difficult problems in software development.
One one extreme are terse names such as std::ctz()
for Count
Trailing Zeroes. This naming style mimics assembler mnemonics and is also
an artifact of the old days when programming languages had limits on the
length of names of identifiers.
These short names do have some merits. They reduce the amount of typing required by
the programmer and more importantly they can be used within complex expressions.
The downside is the ambiguity that can come with some short names. Consider
a hypothetical Count Leading Sign bits function std::cls()
. This name could
be interpreted in other contexts such as CLear Screen.
On the other extreme are verbose names such as std::mask_least_significant_1_bit_and_trailing_zeroes()
. While these names
remove all ambiguity they are very cumbersome to type. They also cannot be
used easily in complex expressions with other operations.
We have opted to make a compromise. The current naming scheme adheres to the following rules:
cnttz()
is pretty obviously zero, the o in cntto()
is not so obviously meant to be 1. Therefore we stick to the numbers to remove all ambiguity.As always, the naming question is continuously up for debate and reconsideration. Some other styles have been suggested on the std-proposals discussion forum.
ctz()
ct0()
cntt0()
countt0()
count_t0()
count_trailing_zeroes()
count_trailing_0_bits()
count_trailing<bool>()
Many people have suggesting adding support for std::bitset
. While this is certainly a good idea, we believe that is outside of the scope of
this proposal. Once this proposal is finished and the interface is agreed upon, adding a follow up proposal for std::bitset
would be easy to do.
This library would also be very useful for the C community. Many of these bitwise operations
are used by embedded developers and they often choose to implement in C. While C compatibility is a noble goal, we
do not want to make sacrifices to the C++ interface in the name of C compatibility. Particularly with regards to
templates, overloading, and constexpr
.
This is first and foremost a C++ proposal which takes advantage of the latest C++ techniques
to provide a modern procedural interface.
If the C community shows interest we may consider a C interface that uses the generic macro feature. This
may allow interoperability, using macros for C and templates for C++. The constexpr
qualifier
could be used in the C++ version while inline
is used in the C version. If the C community shows interest,
we will consider a joint C proposal and flesh out the technical details of the interface and compatibility.
Thank you to everyone on the std proposals forum for feedback and suggestions.