Document number: P3330R0
Date: 2024-6-17.
Authors: Gonzalo Brito Gadeschi, Damien Lebrun-Grandie.
Reply to: Gonzalo Brito Gadeschi <gonzalob _at_ nvidia.com>.
Audience: SG1.

Table of Contents

User-defined Atomic Read-Modify-Write Operations

We propose extending the read-modify-write APIs of std::atomic and std::atomic_ref with user-defined read-modify-write atomic operations. These move the responsibility of writing CAS-like loops from user code into the implementation, improving the portability, performance, and forward-progress of C++ applications.

Motivation

Without these extensions, users are required to implement user-defined read-modify-write operations with compare_exchange loops (CAS-loops from now on), which:

Design

Note: a portable reference implementation is provided here for exposition purposes.

We propose extending std::atomic and std::atomic_ref with user-defined read-modify-write atomic operations:

template <typename T, typename UnaryOp>
bool fetch_update(T& old, UnaryOp uop, 
                  memory_order order = memory_order::seq_cst) const noexcept;

template <typename T, typename UnaryOp>
bool fetch_update(T& old, UnaryOp uop, 
                  memory_order success = memory_order::seq_cst,
                  memory_order failure = memory_order::seq_cst) const noexcept;

The result of the unary operation uop is used to atomically update the value refered by the atomic type and it may be of type T or optional<T>. If uop returns:

In both cases, the value fetched by fetch_update and passed to uop is written to old.

This API is similar to that of compare_exchange operations and enables exposing vendor-specific atomic operations (e.g., fetch_fp_exp):

Before

After

#include <atomic> atomic<float> a = 0.f; int main() { float old = a.load(), next; do { next = std::fexp(old, 2.f); } while(a.compare_exchange_strong(old, next)); return 0; }
#include <atomic> atomic<float> a = 0.f; int main() { float old; a.fetch_update(old, [](float o) { return std::fexp(o, 2.f); }); return 0; }

The following example shows how to use the API to update a pair depending on the value of one of its sub-objects, aborting the update and turning the operation into an atomic read when a certain condition is met:

atomic<pair<char, short>> atom; pair<char, short> old; bool success = atom.fetch_update(old, [](pair<char, short> p) -> optional<par<char, short>> { if (p.first > 42) return nullopt; return make_pair(p.first+1, p.second+2)}; }); assert((success && old.first <= 42) || (!success && old.first > 42));

Implementation strategies

The semantics this API may provide are restricted by the implementation strategies allowed and vice-versa. This design chooses to support, among others, following implementation strategies:

  1. CAS-loop based implementation using atomic CAS operations.
  2. Lock based implementations that take the lock:
    • Multiple-times (e.g. lock-based CAS-loop).
    • Only once.
  3. LL/SC based implementations.
  4. Hardware Transactional Memory (HTM) based implementations.

The design also aims at simplifying the implementation requirements for leveraging hardware acceleration.

Requirements on the user-defined operation

To support these implementation strategies, the UnaryOp(T) call-operator and uop(old) call-expression are required to:

This requirements should imply that:

Requirements on the implementation

The effect of fetch_update is to call uop(old) exactly once. Implementation strategies that call uop(old) multiple times are allowed as long as they behave "as if" uop(old) was called exactly once.

For example, CAS-loop based implementations must use that UnaryOp is implicit-lifetime to back up the uop object, and call uop(old) on a fresh copy of the backup every iteration, to discard any modifications that uop(old) may have done to the uop object itself.

Limitations for users

The major downsides of these requirements are that:

Do the requirements protect from ABA? (TODO: add an example showing that they do not).

Alternatives

Does restricting UnaryOp to be constexpr suffice?

No, constexpr functions in C++26 may not return in a finite set of steps, e.g., on free-standing implementations that do not replace the empty iteration statement in while(1); with a call to the non-constexpr API std::this_thread::yield(), enabling while(1) within constant evaluation:

constexpr void hang() { while(1); }

Furthermore, P3309 proposes constexpr atomics and synchronization operations.

Notes on Forward progress

Notice that relaxing UnaryOp constraints to:

would make the following program legal:

// Litmus test 0: discovered concurrency
atomic<int> atom, counter = 0;
static_assert(atomic<int>::is_lockfree);

int main() {
  std::jthread t0([&] { atom.fetch_update([&](int) {
    counter++;
    while(counter.load() != 2);
    return 0;
  })});

  std::jthread t1([&] { atom.fetch_update([&](int) {
    counter++;
    while(counter.load() != 2);
    return 0;
  })});

  return 0;
}

If the specification would require that this program terminates, then the following implementations become invalid for this execution:

  1. Taking a single lock for the whole duration of fetch_update:
    • Such an implementation deadlocks for the execution above.
  2. Hardware transactional memory (HTM):
    • Such an implementation does not make progress: as transaction time out is exceeded, the transaction is reverted, including the counter++ update, before yielding to the other thread.

It would not prevent a lock-based implementation that uses a CAS-loop, since an implementation that takes the lock multiple times within fetch_update makes progress.

The restrictions above preserve these two implementation opportunities.

Portable CAS-loop

The proposed fetch_update semantics could be weakened by removing all restrictions on UnaryOp, since atomic read-modify-write operations are atomic w.r.t. the modification order of the memory location being operated on. This is the implementation pursued by, e.g., Rust's atomic fetch_update APIs.

Then concurrent forward progress guarantees that Litmust test 0 (discovered concurrency) terminates, which implementations that take a single-lock for the whole fetch_update or use HTM fail to uphold in that case.

That does not mean that single-lock or HTM cannot be used, but rather that compiler analysis is required to prove that UnaryOp does not perform any of the operations that would enable those implementations to be valid (if the compiler can prove that, then they can still be used).

The proposed semantics are forward compatible with this weakening, i.e., it can be done later without breaking any valid code.

Transactional guarantees

The proposed fetch_update semantics could be strengthened by guaranteeing that every UnaryOp call happens-before another fetch_update call as part of an atomic access to the same memory location.

This would require implementing fetch_update as an atomic transaction (e.g. as per P1875: Transactional Memory Lite), and would guarantee the absence of data-races in the following litmus test:

// Litmus Test 1: atomic transaction atomic<int> atom; int counter = 0; int main() { std::jthread t0([&] { atom.fetch_update([&](int) { counter++; return 0; })}); std::jthread t1([&] { atom.fetch_update([&](int) { counter++; return 0; })}); return 0; }

This strengthening would require implementing fetch_update:

The proposed semantics are forward compatible with this strengthening, i.e., it can be done later without breaking any valid code.

std::expected-based API

An alternative API choice is to use std::expected to return the value in the success or failure case:

template <typename UnaryOp>
expected<T, T> fetch_update(UnaryOp uop, 
               memory_order order = memory_order::seq_cst) const noexcept;

atomic<pair<char, short>> atom;
    
auto r = atom.fetch_update([](pair<char, short> p) -> optional<pair<char, short>> {
    if (p.first > 42) return nullopt;
    return make_pair(p.first+1, p.second+2)};
});
if (r) assert(r.value().first <= 42);
else assert(r.error().first > 42);

Function object return type

The function object return type is mandated to be T or optional<T>.

Alternatives considered:

Prior art

Rust's fetch_update. Kokkos uses a similar API internally.

Wording

Add feature test macro to <version> synopsis:

#define __cpp_lib_atomic_fetch_update 202XXXL // also in <atomic>

For each of the following specializations [atomics.ref.int], [atomics.ref.float], [atomics.types.int], [atomics.types.float], and the generic [atomics.ref.generic.general] and [atomics.types.generic.general], add the following public function after their last public function:

... template <typename T, typename UnaryOp> bool fetch_update(T& old, UnaryOp uop, memory_order order = memory_order::seq_cst) const noexcept; template <typename T, typename UnaryOp> bool fetch_update(T& old, UnaryOp uop, memory_order success = memory_order::seq_cst, memory_order failure = memory_order::seq_cst) const noexcept; ...