N3330
More Modern Bit Utilities

Published Proposal,

Previous Revisions:
n3104 (r2), n3094 (r1), n3074 (r0)
Predecessor:
n3022 (r4)
Author:
Latest:
https://thephd.dev/_vendor/future_cxx/papers/C - More Modern Bit Utilities.html
Paper Source:
github.com/ThePhD/future_cxx
Issue Tracking:
GitHub
Proposal Category:
Feature Request
Target:
C2y
Project:
ISO/IEC 9899 Programming Languages — C, ISO/IEC JTC1/SC22/WG14

Abstract

Byte swapping, big endian / little endian load and store, and bit rotate functions are widely implemented and optimized by existing compiler infrastructure today.

1. Changelog

1.1. Revision 3 - September 4th, 2024

1.2. Revision 2 - February 7th, 2023

1.3. Revision 1 - January 31st, 2023

1.4. Revision 0 - December 14th, 2022

2. Polls

No polls were taken beyond the ones noted in the older version of this paper targeting C23 ([n3022]).

3. Introduction & Motivation

There are many proposals and much work that go into figuring out the "byte order" of integer values that occupy more than 1 octet (8 bits). This is nominally important when dealing with data that comes over network interfaces and is read from files, where the data can be laid out in various orders of octets for 2-, 3-, 4-, 6-, or 8-tuples of octets. The most well-known endian structures on existing architectures include "Big Endian", where the least significant octet comes "last" and which is featured prominently in network protocols and file protocols; and "Little Endian", where the least significant octet comes "first" and which is typically the orientation of data for the processor and user architectures most prevalent today.

In more legacy architectures (Honeywell, PDP), there also exist other orientations called "mixed" or "middle" endian. The uses of such endianness are of dubious benefit and are vanishingly rare amongst commodity and readily available hardware today, but they nevertheless still represent an applicable ordering of octets.

In other related programming interfaces, the C functions/macros ntoh ("network to host") and hton ("host to network") (usually suffixed with l or ul or others to specify which native data type the operation was being performed on, such as long) were used to change the byte order of a value ([ntohl]). This became such a common operation that many compilers - among them Clang and GCC - optimized the code down to an intrinsic __builtin_byteswap(...)/__builtin_bswap(...) (for MSVC, for Clang, and for GCC). These intrinsics often compiled into binary code representing cheap, simple, and fast byte swapping instructions available on many CPUs for 16, 32, 64, and sometimes 128 bit numbers. The bswap/byteswap intrinsics were used as the fundamental underpinning for the ntoh and hton functions, where a check for the translation-time endianness of the program determined if the byte order would be flipped or not.
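
For illustration only (this helper, its name, and the endianness check are not part of this proposal), the common shape of such an hton-style wrapper over a byteswap intrinsic looks roughly like the following, here using the GCC/Clang __builtin_bswap32 intrinsic and predefined byte-order macros:

#include <stdint.h>

// Illustrative sketch: an htonl-style helper layered over a compiler byteswap
// intrinsic. The translation-time endianness check decides whether a swap is
// needed at all; on big endian targets the value is already in "network" order.
static uint32_t to_network_order_u32(uint32_t host_value) {
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
  return host_value;
#else
  return __builtin_bswap32(host_value);
#endif
}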

This proposal puts forth the fundamentals that make a homegrown implementation of htonl, ntoh, and other endianness-based functions possible in Standard C code. It also addresses many of the compiler-based intrinsics found to generate efficient machine code, with a few simpler utilities layered on top of it.

4. Design

This is a library addition. It is meant to expose both macros and functions that can be used for translation time-suitable checks. It provides a way to check endianness within the preprocessor, and gives definitive names that allow for knowing whether the endianness is big, little, or neither. We state big, little, or neither, because there is no settled-upon name for the legacy endianness of "middle" or "mixed", nor any agreed upon ordering for such a "middle" or "mixed" endianness between architectures. This is not the case for big endian or little endian, where one is simply the reverse of the other, always, in every case, across architectures, file protocols, and network specifications.

The next part of the design is functions for working with groupings of 8 bits. They are meant to communicate with network or file protocols and formats that have become ubiquitous in computing for the last 30 years.

Finally, this design adds new left/right bit rotation functions, all within the #include <stdbit.h> header.

4.1. Preliminary: Why the stdc_ prefix?

We use the stdc_ prefix for these functions so that we do not have to struggle with taking common words away from the end user. Because we now have 31 bytes of linker name significance, we can afford to have some sort of prefix rather than spend all of our time carving out reserved words or header-specific extensions. This will let us have good names that very clearly map to industry practice, without replacing industry code or being forced to be compatible with existing code that already has taken the name with sometimes-conflicting argument conventions.

4.2. Charter: unsigned char const ptr[static sizeof(uintN_t)] and More?

There are 2 choices on how to represent sized pointer arguments. The first is a void* ptr convention for function arguments in this proposal. The second is an unsigned char ptr[static n]/unsigned char ptr[static sizeof(uintN_t)] convention.

To start, we still put any size + ptr arguments in the proper "size first, pointer second" configuration so that implementation extensions which allow void [static n] can exist no matter what choice is made here. That part does not change. The void* argument convention means that pointers to structures, or similar, can be passed to these functions without needing a cast. This represents the totality of the ease of use argument. The unsigned char ptr[static n] argument convention can produce both better compile-time safety and articulate the requirements using purely the function declaration, without needing to look up prose from the C Standard or implementation documentation. The cost is that any use of the function will require a cast in strictly conforming code.

One of the tipping arguments in favor of our choice of unsigned char ptr[static n] is that void* can be dangerous, especially since we still do not have a nullptr constant in the language and 0 can be used for both the size and the pointer argument. (Which is, very sadly, an actual bug that happens in existing code, especially when users mix memset and memcpy calls and use the wrong 0 argument because of writing one and meaning the other, copying values over a large part of their 0-pointer in their low-level driver code.) Using an unsigned char* (or its statically-sized array function argument form) means that usage of the functions below would require explicit casting on the part of the user. This is, in fact, the way it is presented in [portable-endianness]: as far as existing practice is concerned, users of the code would rather cast and preserve safety than easily use something like stdc_memreverse8 with the guts of their structure.
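
As a small usage sketch (the structure and field names here are hypothetical), this is what the unsigned char convention asks of strictly conforming callers when used with the stdc_memreverse8 function proposed below:

#include <stdbit.h>

struct wire_header { unsigned int sequence; unsigned int length; };

// The explicit cast to unsigned char * is required by the
// unsigned char ptr[static n] convention and keeps the byte-level access visible.
void reverse_sequence_bytes(struct wire_header *header) {
  stdc_memreverse8(sizeof header->sequence, (unsigned char *)&header->sequence);
}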

4.3. Signed vs. Unsigned

This paper has gone back and forth between signed vs. unsigned count offsets for the rotl/rotr instruction-based functions, and similarly the return types for many of the functions which return purely a "count"-style value. Some important properties and facts follow:

This brings up a lot of questions about whether or not the functions here should be signed or unsigned. We will analyze this primarily from the standpoint of rotate_left and rotate_right, as that has the greatest impact on the portability and semantics of the code presented here.

4.3.1. In Defense of Signed Integers

Let us consider a universe where stdc_rotate_left and friends take a signed count. This allows negative numbers to be passed to the count value for the rotate left. So, when stdc_rotate_left_uc(1, -1) is called, it behaves as stdc_rotate_right_uc(value, -count); likewise, when (e.g.) stdc_rotate_right_uc(1, -1) is called, it behaves as stdc_rotate_left_uc(value, -count). This is because, specification-wise, these functions are symmetric and cyclical in what they are meant to do. This matches the behavior from C++ and avoids undefined behavior for negative numbers, while also avoiding too-large shift errors from signed-to-unsigned conversions.

SDCC and several other compilers optimize for left and right shifts ([sdcc]). Texas Instruments and a handful of other specialist architectures also have "variable shift" instructions (SSHVL), which use the sign of the argument to shift in one direction or the other ([ti-tms320c64x]). Having a rotate_left where a negative number produces the opposite rotate_right cyclic operation (and vice-versa) means that both of these architectures can optimize efficiently in the case of hardcoded constants, and still produce well-defined behavior otherwise (SSHVL instructions just deploy a "negated by default" for the count value or not, depending on whether the left or right variant is called; other architectures propagate the information to shift left or right). This also follows existing practice with analogous functions from the C++ standard library.

To test code generation for using a signed integer and 2’s complement arithmetic, we used both C++ and C code samples. The C++ code is based on the accepted C++20 proposal for these functions, [p0553]. It’s a fairly accurate predictor of how notable compilers handle this kind of specification. The generated assembly for the compilers turns out to be optimal, so long as an implementation does not do a literal copy-paste of the specification’s text.

Using non-constant offset, with generated x86_64 assembly:

#include <bit>

extern unsigned int x;
extern int offset;

int main () {
  int l = std::rotl(x, offset);
  int r = std::rotr(x, offset);
  return l + r;
}
main: # @main
  mov     eax, dword ptr [rip + x]
  mov     cl, byte ptr [rip + offset]
  mov     edx, eax
  rol     edx, cl
  ror     eax, cl
  add     eax, edx
  ret

— And, using a constant offset, with generated x86_64 assembly:

#include <bit>

extern unsigned int x;

int main () {
  int l = std::rotl(x, -13);
  int r = std::rotr(x, -13);
  return l + r;
}
main: # @main
  mov     eax, dword ptr [rip + x]
  mov     ecx, eax
  rol     ecx, 19
  rol     eax, 13
  add     eax, ecx
  ret
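
As a quick arithmetic check of the constant case (an illustrative aside, not part of the paper's test code), the negative count reduces under unsigned conversion exactly as the assembly shows:

#include <assert.h>

int main(void) {
  // (unsigned int)-13 is 2^W - 13 for the width W of unsigned int; reduced
  // modulo 32 this is 19, matching the "rol 19" emitted for std::rotl(x, -13),
  // while std::rotr(x, -13) becomes a rotate by 13 in the other direction.
  unsigned int count = (unsigned int)-13;
  assert(count % 32u == 19u);
  return 0;
}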

The generated code shows that the compiler understands the symmetric nature of the operations (from the constant code) and also shows that it will appropriately handle it even when it cannot see through constant values. The same can be shown when writing C code using a variety of styles, as shown here:

// The c == 0 guards below avoid an out-of-range shift by the full width (32).
#if UNSIGNED_COUNT == 1

static unsigned int rotate_right(unsigned int value, unsigned int count);

inline static unsigned int rotate_left(unsigned int value, unsigned int count) {
  unsigned int c = count % 32;
  if (c == 0) {
    return value;
  }
  return value << c
    | value >> (32 - c);
}

inline static unsigned int rotate_right(unsigned int value, unsigned int count) {
  unsigned int c = count % 32;
  if (c == 0) {
    return value;
  }
  return value >> c
    | value << (32 - c);
}

#elif TWOS_COMPLEMENT_CAST == 1

static unsigned int rotate_right(unsigned int value, int count);

inline static unsigned int rotate_left(unsigned int value, int count) {
  unsigned int c = (unsigned int)count;
  c = c % 32;
  if (c == 0) {
    return value;
  }
  return value << c
    | value >> (32 - c);
}

inline static unsigned int rotate_right(unsigned int value, int count) {
  unsigned int c = (unsigned int)count;
  c = c % 32;
  if (c == 0) {
    return value;
  }
  return value >> c
    | value << (32 - c);
}

#else

static unsigned int rotate_right(unsigned int value, int count);

inline static unsigned int rotate_left(unsigned int value, int count) {
  int c = count % 32;
  if (c < 0) {
    return rotate_right(value, -c);
  }
  if (c == 0) {
    return value;
  }
  return value << c
    | value >> (32 - c);
}

inline static unsigned int rotate_right(unsigned int value, int count) {
  int c = count % 32;
  if (c < 0) {
    return rotate_left(value, -c);
  }
  if (c == 0) {
    return value;
  }
  return value >> c
    | value << (32 - c);
}

#endif

#if UNSIGNED_COUNT == 1
unsigned int f (unsigned int x, unsigned int offset) {
#else
unsigned int f (unsigned int x, int offset) {
#endif
  unsigned int l = rotate_left(x, offset);
  unsigned int r = rotate_right(x, offset);
  return l + r;
}

When using the various definitions, we find that the generated assembly for f is identically good using either the internal unsigned "two’s complement" cast or just an unsigned count. Because basic arithmetic with unsigned numbers can behave poorly, we want to avoid a situation where negation or subtraction on unsigned quantities yields undesirable results or promotions. Therefore, this argument favors signed integers for both the offset count and the return values of these functions. Note that even in purely standard C, converting from a signed integer to an unsigned integer is perfectly well-defined behavior and does not raise any signals:

2   Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.

— §6.3.1.3, ¶2, ISO/IEC 9899:202x "C2x" Standard

Finally, the vast majority of existing practice takes the offset value in as a signed integer, and all the return types are also still some form of signed integer (unless the intrinsic returns the exact same unsigned value that was put in and manipulated). It also allows "plain math" done on the type to naturally produce negative numbers without accidentally causing round-tripping or signed/unsigned conversion issues.

4.3.2. In Defense of Unsigned

Unsigned, on the other hand, has existing practice in hardware. While the intrinsics defined by glibc, C++'s standard libraries, and many more use signed integers, they are conceptually unsigned in their implementations. For example, for a 32-bit rotate, most standard libraries taking an int offset parameter perform:

count = count & 31;

This is critical for optimization here. Note that, if we were to provide a specification using a signed offset, our specification has to very deliberately specify that we are going to negate the value and then pass it to the rotate of the opposite direction. This is, effectively, the same as obliterating the sign value and then calling the (symmetrical, cyclical) rotate: a 32-bit rotate can therefore get identical codegen to a signed variant by using a bitwise AND (NOT a plain % 32 on the signed value, as that preserves the sign, which we do NOT want). For an unsigned variant, no such trickery is necessary. Simply truncating the value using:

count = count % 32;

produces optimal code generation on most compilers, as they understand that a bitwise AND with hexadecimal 0x1F (decimal 31) is identical to a modulus by decimal 32 for unsigned operands. This means that, by default, unsigned values behave the same here. Abusing 2’s complement, one can recover this behavior for a signed count by simply doing unsigned u_count = (unsigned)count; and then performing the modulus, which is the same as performing a bitwise AND with 31. The "obvious" code is the efficient code here, as shown by the assembly example above.
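
A minimal sketch of that rescue for a signed count (illustrative only; the name is hypothetical, and a real rotate implementation would then use the reduced count directly):

// Converting the signed count to unsigned first is well-defined (see the quoted
// 6.3.1.3 rule earlier), and makes the modulus equivalent to a bitwise AND with 31.
static unsigned int reduce_count_32(int count) {
  unsigned int u_count = (unsigned int)count;
  return u_count % 32u;
}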

Rust is one of the few languages that provides optimal versions of this code using unsigned [rust-rotate_left]. Their code is optimal under both optimizations and a lack thereof, compared to C and C++ code which struggles with function call elision and similar. This may be aided in the future by having this paper put into the C standard, which would allow compilers to treat standard-specific rotate calls as intrinsics to be replaced with the instructions directly.

All in all, unsigned naturally optimizes better and matches the size type of C. It has no undefined behavior on overflow and produces better assembly in general when it comes to bit intrinsics. Shifting behavior is also well-defined for unsigned types and not signed types, further marking unsigned types as far better than their signed counterparts here.

4.3.3. Which Does This Paper Choose?

Ultimately, this paper chooses unsigned integer types. The core of the reasoning is that this avoids undefined behavior in the specification and is also morally and spiritually what architectures expect (at least for 2’s complement). Since C23 embraces 2’s complement, using unsigned here works more or less identically to a signed integer anyway, and therefore it does not matter enough to justify writing an exception for INT_MIN/INT_MAX into the specification for handling e.g. the count parameter of stdc_rotate_left or stdc_rotate_right.

4.4. Generic 8-bit Memory Reverse and Exact-width 8-bit Memory Reverse

In order to accommodate both a wide variety of architectures but also support minimum-width integer optimized intrinsics, this proposal takes from the industry 2 forms of byteswap:

These end up inhabiting the stdbit.h header and have the following interface:

#include <stdbit.h>
#include <limits.h>
#include <stdint.h>

#if (CHAR_BIT) == 8
void stdc_memreverse8(size_t n, unsigned char ptr[static n]);
uintN_t stdc_memreverse8uN(uintN_t value);
#endif

where N is one of the exact-width integer types such as 8, 16, 24, 32, 64, 128, and others. On most architectures, this matches the builtins ([msvc-builtins], [clang-builtins], [gcc-builtins]) and the result of compiler optimizations that produce byte swap instructions for many existing architectures, as shown in the README of the portable endianness function implementation ([portable-endianness]). We use the exact-width values for the uN-suffixed functions because we expect that C compilers would want to lower the stdc_memreverse8uN call to the existing practice of byteswapN instructions and compiler intrinsics. Using uint_leastN_t reduces the ability to match these existing optimizations in the case where the uintN_t types are not defined. This means the following code does not trigger an assertion failure on most implementations:

// On typical CHAR_BIT == 8, i686/x86_32/x86_64/ARM32/AARCH64 implementations
const uint32_t normal = 0xAABBCCDDu;
const uint32_t reversed = 0xDDCCBBAAu;
assert(stdc_memreverse8u32(normal) == reversed);
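
For reference, a minimal portable sketch of the generic 8-bit memory reversal (mirroring the description in the wording below; it is not the required implementation, which is expected to lower the uN-suffixed forms to byteswap instructions where possible):

#include <stddef.h>

// Swap bytes from both ends toward the middle; the middle byte of an odd-sized
// object stays in place. Iterating while index < n / 2 covers every swapped pair.
static void memreverse8_sketch(size_t n, unsigned char ptr[static n]) {
  for (size_t index = 0; index < n / 2; ++index) {
    unsigned char tmp = ptr[index];
    ptr[index] = ptr[n - index - 1];
    ptr[n - index - 1] = tmp;
  }
}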

4.4.1. But Memory Reverse Is Dangerous?

Byte swapping, by itself, is absolutely dangerous in terms of code portability. Users often program strictly for their own architecture when doing serialization, and do not take into consideration that their endianness can change. This means that, while stdc_memreverse functions can compile down to intrinsics, those intrinsics get employed to change "little endian" to "big endian" without performing the necessary "am I already in the right endianness" check. Values that are already in the proper byte order for their target serialization get swapped, resulting in an incorrect byte order for the target network protocol, file format, or other binary serialization target.

The inclusion of the <stdbit.h> header reduces this problem by giving access to the __STDC_ENDIAN_NATIVE__ macro definition, but does not fully eliminate it. This is why many Linux and BSD systems include functions which directly transcribe from one endianness to another. This is why the Byte Order Fallacy ([endian-fallacy]) has spread so far in Systems Programming communities, and why many create their own versions of this both in official widespread vendor code ([linux-endian]) and in more personal code used for specific distributions ([portable-endianness]). Thus, this proposal includes some endianness functions, specified further below.

4.4.2. Vetting the Implementation / Algorithm for memreverse

In previous iterations of the paper, there were various off-by-one errors in transcribing the algorithm used to get the job done. Therefore, we more directly lifted the code for the algorithm from the example implementation here. To further prove that it works on "bytes" that may be larger than 8 bits, we also took the following steps.

All of the tests pass across the three major compilers (MSVC, GCC, and Clang) and across platforms (Windows, Linux, Mac OS). We find this to be compelling enough to ensure that the implementation and the algorithm in the wording is suitably correct. Nevertheless, any wording failures present here represent the authors' collective inability to properly serialize wording, not that an implementation is not possible or too inventive.

Regardless of how well it’s tested and checked, the paper still restricts the content to CHAR_BIT == 8 bits, as we know that this has both wide implementation experience and is implementable.

4.5. stdc_load8_*/stdc_store8_* Endian-Aware Functions

Functions meant to transport bytes to a specific endianness need 3 pieces of information:

To represent any operation that goes from/to the byte order that things like long longs are kept in, the Linux/BSD/etc. APIs use the term "host", represented by h. Every other operation is represented by explicitly naming it, particularly as be or le for "big endian" or "little endian". Again, because of the severe confusion over the exact byte order a "mixed endian" multi-byte scalar is meant to be in, there seems to be no widely available practice regarding what to call a PDP/Honeywell endian configuration. Therefore, mixed/bi/middle-endian is not included in this proposal. It can be added at a later date if the community ever settles on a well-defined naming convention that can be shared between codebases, standards, and industries.

The specification for the endianness functions borrows from many different sources listed above, and is as follows:

#include <stdbit.h>
#include <limits.h>
#include <stdint.h>

#if (CHAR_BIT) == 8
void stdc_store8_leuN(uint_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_beuN(uint_leastN_t value,
  unsigned char ptr[static (N / 8)]);
uint_leastN_t stdc_load8_leuN(
  const unsigned char ptr[static (N / 8)]);
uint_leastN_t stdc_load8_beuN(
  const unsigned char ptr[static (N / 8)]);
void stdc_store8_aligned_leuN(uint_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_aligned_beuN(uint_leastN_t value,
  unsigned char ptr[static (N / 8)]);
uint_leastN_t stdc_load8_aligned_leuN(
  const unsigned char ptr[static (N / 8)]);
uint_leastN_t stdc_load8_aligned_beuN(
  const unsigned char ptr[static (N / 8)]);

void stdc_store8_lesN(int_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_besN(int_leastN_t value,
  unsigned char ptr[static (N / 8)]);
int_leastN_t stdc_load8_lesN(
  const unsigned char ptr[static (N / 8)]);
int_leastN_t stdc_load8_besN(
  const unsigned char ptr[static (N / 8)]);
void stdc_store8_aligned_lesN(int_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_aligned_besN(int_leastN_t value,
  unsigned char ptr[static (N / 8)]);
int_leastN_t stdc_load8_aligned_lesN(
  const unsigned char ptr[static (N / 8)]);
int_leastN_t stdc_load8_aligned_besN(
  const unsigned char ptr[static (N / 8)]);
#endif

Thanks to some feedback from implementers and librarians, this proposal also adds signed variants of the load and store functions, as well as aligned and unaligned loads and stores. While C23 mandates a two’s complement representation for integers, because we are using the uint_leastN_t types (which may be wider than the intended N == 24 or N == 32 specification), it is important for the sign bit to be properly serialized and transported. Therefore, during stdc_load8_(le/be)sN/stdc_store8_(le/be)sN operations, the sign bit is directly serialized into the resulting signed value or byte array where necessary.

This specification is marginally more complicated than the stdc_memreverse8uN functions because these functions operate on uint_leastN_t, where N is the minimum-width bit value. These functions, on most normal implementations, will just fill in the exact number of 8, 16, 32, 64, etc. bits.

We are fine with not making these precisely uintN_t/intN_t because the C23 Standard includes a specific requirement that if uintN_t/intN_t exist, then uint_leastN_t/int_leastN_t must match their exact-width counterparts exactly, which has been existing practice on almost all implementations for quite some time now.
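
A short usage sketch of the proposed functions follows, assuming an implementation where CHAR_BIT == 8 and the 32-bit variants are provided:

#include <assert.h>
#include <stdint.h>
#include <stdbit.h>

int main(void) {
  unsigned char buffer[4];
  // Serialize into big-endian ("network") byte order: most significant octet first.
  stdc_store8_beu32(0xAABBCCDDu, buffer);
  assert(buffer[0] == 0xAA && buffer[3] == 0xDD);
  // Reading the same bytes back as little endian reverses the octet order.
  uint_least32_t reloaded = stdc_load8_leu32(buffer);
  assert(reloaded == 0xDDCCBBAAu);
  return 0;
}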

4.5.1. Vetting the Implementation / Algorithm for 8-bit loads and stores

In previous iterations of the paper, getting the algorithm written down properly in a way that does not rely on any kind of implementation-defined behavior for signed and unsigned endian-aware loads and stores was tough and resulted in many errors in the wording. Still, we know that the implementation is solid because we have tested it (both theoretically and factually) by writing implementations whose base "unit" for writing has a width greater than CHAR_BIT.

All of the tests pass across the three major compilers (MSVC, GCC, and Clang) and across platforms (Windows, Linux, Mac OS). We find this to be compelling enough to ensure that the implementation is suitably correct, even if the wording may not be proper or ideal. Therefore, we hope this can serve as a good basis in establishing that, at the very least, this is both implementable and usable. This is also corroborated by additional materials outside of compilers that always target CHAR_BIT == 8, such as the F2838x/F28069 and C28x series of chips from Texas Instruments. For example, the TMS320C28x reference guide gives a listing for how to properly and effectively swap the 8-bit bytes of a 32-bit integer, despite being a 16-bit architecture (page 292). It is, at least in some cases, important enough to include in reference material and programming guides for these chips, even if the authors could not personally find publicly-discussable compilers which provide a C-style intrinsic for a CHAR_BIT == 16/32/64 platform.

Regardless of how well it’s tested and checked, the paper still restricts the content to CHAR_BIT == 8 bits, as we know that this has both wide implementation experience and is implementable.

4.6. Additional Modern Bit Utilities

In addition to this, upon first review of the paper there was a strong groundswell of support for bit operations that have long been present in both hardware and as compiler intrinsics. This idea progressed naturally from the bswap and __builtin_bswap discussion. As indicated in [p0553] (merged into C++20 already), here’s a basic rundown of some common architectures and their support for various bit functionality:

operation   Intel/AMD   ARM         PowerPC
rotl        ROL         -           rldicl
rotr        ROR         ROR, EXTR   -

Many of the bit functions below are defined to ease portability to these architectures. For places where specific compiler idioms and automatic detection are not possible, similar assembly tricks or optimized implementations can be provided in C. Further bit functions were also merged into C++, resulting in the current state of the C++ bit header.

There is also a somewhat "infamous" page amongst computer scientists, Bit Twiddling Hacks ([anderson-bit-hacks]). These may not all map directly to instructions, but they provide a broad set of useful functionality commonly found not only in CPU-based programming libraries, but in GPU-based programming libraries and other high performance computing resources as well.

We try to take the most useful subset of these functions that most closely represent functionality on both old and new CPU architectures as well as common, necessary operations that have been around in the last 25 years for various industries. We have left out operations such as sign extension, parity computation, bit merging, clear/setting bits, fast negation, bit swaps, lexicographic next bit permutation, and bit interleaving. The rest are more common and appear across a wide range of industries from cryptography to graphics to simulation to efficient property lookup and kernel scheduling.

4.6.1. Type-Generic Macros and Counts for Types

All of the functions below have type-generic macros associated with them. This can bring up an interesting question: if the return value depends on the type of the argument going into the function (i.e. for rotate_left and rotate_right), is it bad for literal arguments? The answer to this question, however, is the same as it’s always been when dealing with literal values in C: use the suffix for the appropriate type, or cast, or put it in a const variable so that it can be used with the expected semantics. We cannot sink macro-based generic code use cases in the off-chance that someone calls stdc_rotate_left(0b01100, 2) and thinks it returns a dependable answer. Integers (and their literals) are the least portable part of Standard C code: use the exact-width types if you are expecting exact-width semantics. Or, call the fundamental-type suffixed versions to get answers dependable for that given fundamental type (e.g., stdc_rotate_left_ui(0b01100, 2)), even if it means the size of that fundamental type might change.
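
For example (an illustrative sketch; the variable names are arbitrary), pinning down the argument type in the ways suggested above:

#include <stdint.h>
#include <stdbit.h>

void rotate_literal_examples(void) {
  // A suffix, a cast, or a typed object each give the type-generic macro a
  // definite width to operate at.
  unsigned long wide = stdc_rotate_left(0x0CUL, 2);
  uint32_t exact = stdc_rotate_left((uint32_t)0x0C, 2);
  const unsigned int typed = 0x0C;
  unsigned int narrow = stdc_rotate_left(typed, 2);
  (void)wide; (void)exact; (void)narrow;
}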

4.6.2. Argument Types

Many of the functions below are defined over the fundamental unsigned integer types, rather than their minimum width or exact width counterparts. This is done to provide maximum portability: users can combine information from the recently-introduced (S/U)CHAR/(U)SHRT/(U)INT/(U)LONG/(U)LLONG_WIDTH macros to determine the width of the sizes at translation time as well as enjoy a disjoint and distinct set of fundamental types over which generic selection will always work.

The (u)int_leastN_t types also have WIDTH macros, but those macros are not exactly guaranteed to cover a wide range of actual bit sizes either (if the uintN_t types do not exist, then a conforming implementation can simply name all of those types as typedefs for (unsigned) long long and call it a day). While an implementation could also define each of the distinct fundamental types from (unsigned) char to (unsigned) long long to all be the same width as well, we are at the very least guaranteed that they are, in fact, distinct types. This makes selection over the fundamental types in _Generic predictable and usable, whereas e.g. _Generic(x, uint_least32_t: 0, uint_least64_t: 1) is not guaranteed to compile since those types are not required to form a mutually exclusive or disjoint set.
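
As an illustration (a hedged sketch using C23’s <limits.h> width macros; the macro name is hypothetical), a _Generic selection can list all of the fundamental unsigned types without conflict because they are guaranteed to be distinct types, even when some of them share a width:

#include <limits.h>

// Maps each fundamental unsigned type to its translation-time width
// (the argument must have one of the listed types).
#define UNSIGNED_WIDTH_OF(x) _Generic((x), \
  unsigned char: UCHAR_WIDTH,              \
  unsigned short: USHRT_WIDTH,             \
  unsigned int: UINT_WIDTH,                \
  unsigned long: ULONG_WIDTH,              \
  unsigned long long: ULLONG_WIDTH)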

The exact-width types suffer from non-availability on specific platforms, which makes little sense for functions which do not depend on a no-padding bits requirement. As long as the values read from the array only involve N bits (including the sign bit), and the rest are zero-initialized, we can have predictable semantics.

Extended integer types, least-width integer types, and exact-width integer types, can all be used with the type-generic macros since the type-generic macros are required to work over all standard (unsigned) integer types and extended (unsigned) integer types, while excluding bool and bit-precise (_BitInt(N)) integer types that do not match pre-existing type widths. This provides a complete set of functionality that is maximally portable while also allowing for precise semantic control with exact or least-width types.

Finally, in general bool objects are disallowed from the above functions. There is just not a meaningful body of functionality that can be provided, and there is a fundamental difference between something that is expected to be a boolean value and something that is expected to be a 1-bit number (even if they can both serve similar purposes). It is also questionable to compute things such as rotation for bool objects. If we can grow a consistent set of answers for these operations across the industry, then we can weaken the requirements and add the behavior in. (Note that if we put it in now and choose a behavior, we cut off any improvements made in the future, so it is best to be conservative here.)

4.6.3. Return Types

Ostensibly, part of the motivation to capture here should be that the types used to do things such as rotations should be identical to the return type used to do things like count zeros, e.g. stdc_rotate_left(value, stdc_count_zeros(value));. This is mostly non-problematic until someone uses _BitInt: Clang already supports _BitInt types several megabytes in size. On platforms where unsigned int is actually 16 bits, that return type is far too small to accommodate the bit counts of even a 1 MB _BitInt.

At the moment, the functions do not accept all bit-precise integer types (just the ones that are bit-width equivalent to the existing standard and extended integer types), so this is technically a non-issue. But, if and when bit-precise integer types are given better handling in _Generic macros or similar features that make them more suitable for type-generic macro implementations, this could become a problem. At the moment, we use wording to defer the issue by saying that type-generic macros return a type suitably large for the range of the computed value. This gives us forward compatibility while fixing the non-type-generic-macro return types to unsigned int. The type-generic macros will have the flexibility from the specification to return larger signed integer types to aid in a smooth transition once bit-precise integer types see more standard support.

4.6.4. stdc_rotate_left/stdc_rotate_right

stdc_rotate_left/stdc_rotate_right are common CPU instructions and are the forms of the commonly-used circular shifts. They are common operations with applications in cyclic codes. They are commonly expressed (for 32-bit numbers) as value << count | value >> (32 - count) (rotate left) or value >> count | value << (32 - count) (rotate right).

#include <stdbit.h>

unsigned char stdc_rotate_left_uc(unsigned char value,
  unsigned int count);
unsigned short stdc_rotate_left_us(unsigned short value,
  unsigned int count);
unsigned int stdc_rotate_left_ui(unsigned int value,
  unsigned int count);
unsigned long stdc_rotate_left_ul(unsigned long value,
  unsigned int count);
unsigned long long stdc_rotate_left_ull(unsigned long long value,
  unsigned int count);

unsigned char stdc_rotate_right_uc(unsigned char value,
  unsigned int count);
unsigned short stdc_rotate_right_us(unsigned short value,
  unsigned int count);
unsigned int stdc_rotate_right_ui(unsigned int value,
  unsigned int count);
unsigned long stdc_rotate_right_ul(unsigned long value,
  unsigned int count);
unsigned long long stdc_rotate_right_ull(unsigned long long value,
  unsigned int count);

// type-generic macro
generic_value_type stdc_rotate_left(
  generic_value_type value, generic_count_type count);
generic_value_type stdc_rotate_right(
  generic_value_type value, generic_count_type count);

They cover all of the built-in unsigned integer types. A discussion of signed vs. unsigned integer types for the count type and the return type can be found in § 4.3 Signed vs. Unsigned.
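
For concreteness, a hedged sketch of the unsigned int variant following the specification in §5.3 below (not a mandated implementation; real libraries are expected to lower this to a single rotate instruction):

#include <limits.h>

static unsigned int rotate_left_ui_sketch(unsigned int value, unsigned int count) {
  unsigned int r = count % UINT_WIDTH;
  if (r == 0) {
    // Returning early also avoids an out-of-range shift by UINT_WIDTH below.
    return value;
  }
  return (value << r) | (value >> (UINT_WIDTH - r));
}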

As for choosing a single function like stdc_rotate(unsigned-integer-type value, some-integer-type count); that chooses left / right based on the sign of the count, it unfortunately imposes the worst code generation properties of all the options. When using entirely runtime values, unless you deliberately have a variable-rotate/shift instruction, you are required to emit a branch in order to handle the two cases, as rotate left / right - despite being symmetric - need some help. Here is the assembly for a technically optimal left/right rotate:

f:  # @f
  mov     r8d, edi
  mov     ecx, esi
  rol     r8d, cl
  mov     edx, edi
  ror     edx, cl
  mov     ecx, esi
  neg     ecx
  mov     eax, edi
  rol     eax, cl
  ror     edi, cl
  test    esi, esi
  cmovs   edx, r8d
  cmovle  eax, edi
  add     eax, edx
  ret

This is more than double the size of the rotates found using left/right directly in § 4.3 Signed vs. Unsigned. Due to this, we decided that it was not advantageous to have a signed count with an unknown left/right direction: it is important to be able to bias the optimizer toward whether a given rotate is left- or right-oriented.

5. Wording

The following wording is relative to the latest draft standard.

5.1. Add new §7.18.✨17 and §7.18.✨18 sub-sub-clauses for "8-bit Memory Reversal" in §7.18

7.18.✨17 8-bit Memory Reversal
Synopsis
#include <stdbit.h>
#include <limits.h>

#if (CHAR_BIT) == 8
void stdc_memreverse8(size_t n, unsigned char ptr[static n]);
#endif
Description

The stdc_memreverse8 function reverses the order of the n bytes pointed to by ptr: for every unsigned char at offset index in the half-open range [0, n/2), the values at ptr[index] and ptr[n - index - 1] are swapped. (For odd n, the middle byte, where index == n - index - 1, is left in place.)

§7.18.✨18 Exact-width 8-bit Memory Reversal
Synopsis
#include <stdbit.h>
#include <limits.h>
#include <stdint.h>

#if (CHAR_BIT) == 8
uintN_t stdc_memreverse8uN(uintN_t value);
#endif
Description

The stdc_memreverse8uN functions provide an interface to swap the bytes of a corresponding uintN_t object, where N matches one of the exact-width integer types (7.20.1.1). If an implementation provides the corresponding uintN_t typedef, it shall define the corresponding exact-width memory reversal function for that value of N.

Returns

The stdc_memreverse8uN functions return the 8-bit memory reversed uintN_t value, as if by invoking stdc_memreverse8(sizeof(value), (unsigned char*)&value) and then returning value.

5.2. Add new §7.18.✨19 and §7.18.✨20 sub-sub-clauses for "Endian-Aware" functions in §7.18

7.18.✨19 Endian-Aware 8-bit Load
Synopsis
#include <stdbit.h>

#if (CHAR_BIT) == 8
uint_leastN_t stdc_load8_leuN(const unsigned char ptr[static (N / 8)]);
uint_leastN_t stdc_load8_beuN(const unsigned char ptr[static (N / 8)]);
uint_leastN_t stdc_load8_aligned_leuN(const unsigned char ptr[static (N / 8)]);
uint_leastN_t stdc_load8_aligned_beuN(const unsigned char ptr[static (N / 8)]);

int_leastN_t stdc_load8_lesN(const unsigned char ptr[static (N / 8)]);
int_leastN_t stdc_load8_besN(const unsigned char ptr[static (N / 8)]);
int_leastN_t stdc_load8_aligned_lesN(const unsigned char ptr[static (N / 8)]);
int_leastN_t stdc_load8_aligned_besN(const unsigned char ptr[static (N / 8)]);
#endif
Description

The 8-bit load family of functions reads an int_leastN_t or uint_leastN_t object from the provided ptr in an endian-aware (7.18.2) manner, where N matches an existing minimum-width integer type (7.20.1.2). If an implementation provides the corresponding uint_leastN_t typedef, it shall define the corresponding endian-aware load functions for that value of N. If this function is present, N shall be a multiple of 8 and CHAR_BIT shall be 8. The functions containing _aligned in the name shall assume that ptr is suitably aligned to access a signed or unsigned integer of width N for a signed or unsigned variant of the function, respectively. If the function name contains the sN suffix, it is a signed variant. Otherwise, the function is an unsigned variant. If the function name contains the lesN or leuN suffix, it is a little-endian variant. Otherwise, if the function name contains the besN or beuN suffix, it is a big-endian variant.

Returns

Let the computed value $result$ be:

$$\sum_{index = 0}^{(N \div{} 8) - 1} b_{index} \times{} 2^{8 \times{} index}$$

where $b_{index}$ is:

— ptr[index], if the function is a little-endian variant;

— otherwise, ptr[(N / 8) - index - 1], if the function is a big-endian variant.

If the function is an unsigned variant, return $result$. Otherwise, if the function is a signed variant, return:

  • $result$, if $result$ is less than $2^{N-1}$;

  • otherwise, $result - 2^{N}$.

7.18.✨20 Endian-Aware 8-bit Store
Synopsis
#include <stdbit.h>

#if (CHAR_BIT) == 8
void stdc_store8_leuN(uint_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_beuN(uint_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_aligned_leuN(uint_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_aligned_beuN(uint_leastN_t value,
  unsigned char ptr[static (N / 8)]);

void stdc_store8_lesN(int_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_besN(int_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_aligned_lesN(int_leastN_t value,
  unsigned char ptr[static (N / 8)]);
void stdc_store8_aligned_besN(int_leastN_t value,
  unsigned char ptr[static (N / 8)]);
#endif
Description

The 8-bit store family of functions writes an int_leastN_t or uint_leastN_t object into the provided ptr in an endian-aware (7.18.2) manner, where N matches an existing minimum-width integer type (7.20.1.2). If an implementation provides the corresponding uint_leastN_t typedef, it shall define the corresponding endian-aware store functions for that value of N. If this function is present, N shall be a multiple of 8 and CHAR_BIT shall be 8. The functions containing _aligned in the name shall assume that ptr is suitably aligned to access a signed or unsigned integer of width N. If the function name contains the sN suffix, it is a signed variant. Otherwise, the function is an unsigned variant. If the function name contains the lesN or leuN suffix, it is a little-endian variant. Otherwise, if the function name contains the besN or beuN suffix, it is a big-endian variant.

Let value_unsigned be value if the function is an unsigned variant. Otherwise, let value_unsigned be the conversion of value to its corresponding unsigned type, if the function is a signed variant.

Let index be an integer in a sequence that

— starts from 0 and increments by 8 in the range of [0, N), if the function is a little-endian variant;

— starts from N - 8 and decrements by 8 in the range of [0, N), if the function is a big-endian variant.

Let ptr_index be an integer that starts from 0. For each index in the order of the above-specified sequence:

  1. Sets the 8 bits in ptr[ptr_index] to (value_unsigned >> index) & 0xFF.

  2. Increments ptr_index by 1.

5.3. Add new §7.18.✨21 and §7.18.✨22 sub-sub-clauses for Rotate Left and Rotate Right Bit Utilities in §7.18

7.18.✨21 Rotate Left
Synopsis
unsigned char stdc_rotate_left_uc(unsigned char value, unsigned int count);
unsigned short stdc_rotate_left_us(unsigned short value, unsigned int count);
unsigned int stdc_rotate_left_ui(unsigned int value, unsigned int count);
unsigned long stdc_rotate_left_ul(unsigned long value, unsigned int count);
unsigned long long stdc_rotate_left_ull(unsigned long long value, unsigned int count);

generic_value_type stdc_rotate_left(
  generic_value_type value, generic_count_type count);
Description
The stdc_rotate_left functions perform a bitwise rotate left. This operation is typically known as a left circular shift.
Returns

Let N be the width corresponding to the type of the input value. Let no_promote_value be (unsigned _BitInt(N))value. Let r be count % N.

— If r is 0, returns value;

— otherwise, returns (no_promote_value << r) | (no_promote_value >> (N - r))FN0✨).

The type-generic function (marked by its generic_value_type argument) returns the above described result for a given input value so long as the generic_value_type is

— a standard unsigned integer type, excluding bool;

— an extended unsigned integer type;

— or, a bit-precise unsigned integer type whose width matches any standard or extended integer type, excluding bool.

The generic_count_type count argument shall be a non-negative value of signed or unsigned integer type, or char.

FN0✨) Bit-precise integer types do not perform integer promotion in these expressions and therefore prevent potential undefined behavior from a literal implementation which would use value in place of no_promote_value (e.g., from a function call such as stdc_rotate_left_us(0xFFFF, 15) with an unsigned short width of 16 and an int width of 24).


7.18.✨22 Rotate Right
Synopsis
unsigned char stdc_rotate_right_uc(unsigned char value, unsigned int count);
unsigned short stdc_rotate_right_us(unsigned short value, unsigned int count);
unsigned int stdc_rotate_right_ui(unsigned int value, unsigned int count);
unsigned long stdc_rotate_right_ul(unsigned long value, unsigned int count);
unsigned long long stdc_rotate_right_ull(unsigned long long value, unsigned int count);

generic_value_type stdc_rotate_right(
  generic_value_type value, generic_count_type count);
Description
The stdc_rotate_right functions perform a bitwise rotate right. This operation is typically known as a right circular shift.
Returns

Let N be the width corresponding to the type of the input value. Let no_promote_value be (unsigned _BitInt(N))value. Let r be count % N.

— If r is 0, returns value;

— otherwise, returns (no_promote_value >> r) | (no_promote_value << (N - r))FN1✨).

FN1✨) Bit-precise integer types do not perform integer promotion in these expressions and therefore prevent potential undefined behavior from a literal implementation which would use value in place of no_promote_value (e.g., from a function call such as stdc_rotate_right_us(0xFFFF, 15) with an unsigned short width of 16 and an int width of 24).

The type-generic function (marked by its generic_value_type argument) returns the above described result for a given input value so long as the generic_value_type is

— a standard unsigned integer type, excluding bool;

— an extended unsigned integer type;

— or, a bit-precise unsigned integer type whose width matches any standard or extended integer type, excluding bool.

The generic_count_type count argument shall be a non-negative value of signed or unsigned integer type, or char.

6. Appendix

A collection of miscellaneous and helpful bits of information and implementation.

6.1. Example Implementations in Publicly-Available Libraries

Optimized routines following the naming conventions present in this paper can be found in the Shepherd’s Oasis Industrial Development Kit (IDK) library, compilable with a conforming C11 compiler and tested on MSVC, GCC, and Clang on Windows, Mac, and Linux:

Optimized routines following the basic principles present in this paper and used as motivation to improve several C++ Standard Libraries can be found in the Itsy Bitsy Bit Libraries, compilable with a conforming C++17 compiler and tested on MSVC, GCC, and Clang on Windows, Mac, and Linux:

Endianness routines and original motivation that spawned this proposal came from David Seifert’s Portable Endianness library and its deep dive into compiler optimizations and efficient code generation when alignment came into play:

7. Acknowledgements

Many thanks to David Seifert, Aaron Bachmann, Jens Gustedt, Tony Finch, Erin AO Shepherd, and many others who helped fight to get the semantics and wording into the right form, providing motivation, giving example code, pointing out existing libraries, and helping to justify this proposal.

References

Informative References

[ANDERSON-BIT-HACKS]
Sean Eron Anderson. Bit Twiddling Hacks. May 5th, 2005. URL: https://graphics.stanford.edu/~seander/bithacks.html
[CLANG-BUILTINS]
LLVM Foundation; Clang Contributors. Clang Language Extensions: Clang Documentation. September 1st, 2021. URL: https://clang.llvm.org/docs/LanguageExtensions.html#intrinsics-support-within-constant-expressions
[ENDIAN-FALLACY]
Rob Pike. The Byte Order Fallacy. April 3rd, 2012. URL: https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
[GCC-BUILTINS]
GCC Contributors. Other Built-in Functions Provided by GCC. September 1st, 2021. URL: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
[LINUX-ENDIAN]
Linux; BSD. endian(3). September 1st, 2021. URL: https://linux.die.net/man/3/endian
[MSVC-BUILTINS]
Microsoft. _byteswap_uint64, _byteswap_ulong, _byteswap_ushort. November 4th, 2016. URL: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/byteswap-uint64-byteswap-ulong-byteswap-ushort?view=msvc-160
[N3022]
JeanHeyd Meneide. N3022 - Modern Bit Utilities. July 6th, 2022. URL: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3022.htm
[NTOHL]
Linux. ntohl(3). September 30th, 2021. URL: https://linux.die.net/man/3/ntohl
[P0463]
Howard E. Hinnant. endian. Just endian.. July 13th, 2017. URL: https://wg21.link/p0463
[P0553]
Jens Maurer. Bit operations. March 1st, 2019. URL: https://wg21.link/p0553
[PORTABLE-ENDIANNESS]
David Seifert. portable-endianness. May 16th, 2021. URL: https://github.com/SoapGentoo/portable-endianness
[RUST-ROTATE_LEFT]
Rust Standard Library Collaborators. u32 methods: rotate_left. November 10th, 2021. URL: https://doc.rust-lang.org/std/primitive.u32.html#method.rotate_left
[SDCC]
Dr. Philipp K. Krause. SDCC Manual §8.1.9 - Bit Rotations. September 25th, 2021. URL: http://sdcc.sourceforge.net/doc/sdccman.pdf
[TI-TMS320C64X]
Texas Instruments. TMS320C64x/C64x+ DSP: CPU and Instruction Set. July 31st, 2010. URL: https://www.ti.com/lit/ug/spru732j/spru732j.pdf