Document number: P3330R0
Date: 2024-6-17.
Authors: Gonzalo Brito Gadeschi, Damien Lebrun-Grandie.
Reply to: Gonzalo Brito Gadeschi <gonzalob _at_ nvidia.com>.
Audience: SG1.
Table of Contents
User-defined Atomic Read-Modify-Write Operations
We propose extending the read-modify-write APIs of std::atomic
and std::atomic_ref
with user-defined read-modify-write atomic operations. These move the responsibility of writing CAS-like loops from user code into the implementation, improving the portability, performance, and forward-progress of C++ applications.
Motivation
Without these extensions, users are required to implement user-defined read-modify-write operations with compare_exchange
loops (CAS-loops from now on), which:
- are hard to write correctly without compromising the QoI of forward progress,
- complicate compiler optimizations (no compiler optimizes a CAS-loop that does an
add
into an atomic read-modify-write fetch_add
), and
- do not perform well for non-lockfree atomics since CAS-loops take and release the lock multiple times per iteration.
Design
Note: a portable reference implementation is provided here for exposition purposes.
We propose extending std::atomic
and std::atomic_ref
with user-defined read-modify-write atomic operations:
template <typename T, typename UnaryOp>
bool fetch_update(T& old, UnaryOp uop,
memory_order order = memory_order::seq_cst) const noexcept;
template <typename T, typename UnaryOp>
bool fetch_update(T& old, UnaryOp uop,
memory_order success = memory_order::seq_cst,
memory_order failure = memory_order::seq_cst) const noexcept;
The result of the unary operation uop
is used to atomically update the value refered by the atomic type and it may be of type T
or optional<T>
. If uop
returns:
- a value of type
T
, or of type optional<T>
with has_value()
true, then fetch_update
writes the value returned by uop
to memory and returns true
to indicate its effect is that of an atomic read-modify-write operation,
nullopt
, then fetch_update
returns false
to indicate its effect is that of an atomic read operation.
In both cases, the value fetched by fetch_update
and passed to uop
is written to old
.
This API is similar to that of compare_exchange
operations and enables exposing vendor-specific atomic operations (e.g., fetch_fp_exp
):
Before
|
After
|
#include <atomic>
atomic<float> a = 0.f;
int main() {
float old = a.load(), next;
do {
next = std::fexp(old, 2.f);
} while(a.compare_exchange_strong(old, next));
return 0;
}
|
#include <atomic>
atomic<float> a = 0.f;
int main() {
float old;
a.fetch_update(old, [](float o) {
return std::fexp(o, 2.f);
});
return 0;
}
|
The following example shows how to use the API to update a pair
depending on the value of one of its sub-objects, aborting the update and turning the operation into an atomic read when a certain condition is met:
atomic<pair<char, short>> atom;
pair<char, short> old;
bool success
= atom.fetch_update(old, [](pair<char, short> p) -> optional<par<char, short>> {
if (p.first > 42) return nullopt;
return make_pair(p.first+1, p.second+2)};
});
assert((success && old.first <= 42) || (!success && old.first > 42));
Implementation strategies
The semantics this API may provide are restricted by the implementation strategies allowed and vice-versa. This design chooses to support, among others, following implementation strategies:
- CAS-loop based implementation using atomic CAS operations.
- Lock based implementations that take the lock:
- Multiple-times (e.g. lock-based CAS-loop).
- Only once.
- LL/SC based implementations.
- Hardware Transactional Memory (HTM) based implementations.
The design also aims at simplifying the implementation requirements for leveraging hardware acceleration.
Requirements on the user-defined operation
To support these implementation strategies, the UnaryOp(T)
call-operator and uop(old)
call-expression are required to:
- satisfy
regular_invocable
(for same inputs, produces same output), and
- be an implicit-lifetime type, and
noexcept(declval<UnaryOp>()(declval<T>()))
is true, and,
- only accesses its operands or its non-static data-members, and
- not perform a library I/O function call, a synchronization operation, or an atomic operation, and
- eventually return.
This requirements should imply that:
- the only side-effects
uop(old)
may have are modifying its non-static data members,
- for the same inputs it always produces the same outputs, independetly of whether other state of the program has changed because
uop(old)
can't access such state.
Requirements on the implementation
The effect of fetch_update
is to call uop(old)
exactly once. Implementation strategies that call uop(old)
multiple times are allowed as long as they behave "as if" uop(old)
was called exactly once.
For example, CAS-loop based implementations must use that UnaryOp
is implicit-lifetime to back up the uop
object, and call uop(old)
on a fresh copy of the backup every iteration, to discard any modifications that uop(old)
may have done to the uop
object itself.
Limitations for users
The major downsides of these requirements are that:
- they prevent the use of
printf
-based debugging within UnaryOp
,
Do the requirements protect from ABA? (TODO: add an example showing that they do not).
Alternatives
Does restricting UnaryOp
to be constexpr
suffice?
No, constexpr
functions in C++26 may not return in a finite set of steps, e.g., on free-standing implementations that do not replace the empty iteration statement in while(1);
with a call to the non-constexpr
API std::this_thread::yield()
, enabling while(1)
within constant evaluation:
constexpr void hang() { while(1); }
Furthermore, P3309 proposes constexpr
atomics and synchronization operations.
Notes on Forward progress
Notice that relaxing UnaryOp
constraints to:
- allowing
UnaryOp
to write to memory outside the UnaryOp
itself, and
- calling synchronization functions (like atomics),
would make the following program legal:
atomic<int> atom, counter = 0;
static_assert(atomic<int>::is_lockfree);
int main() {
std::jthread t0([&] { atom.fetch_update([&](int) {
counter++;
while(counter.load() != 2);
return 0;
})});
std::jthread t1([&] { atom.fetch_update([&](int) {
counter++;
while(counter.load() != 2);
return 0;
})});
return 0;
}
If the specification would require that this program terminates, then the following implementations become invalid for this execution:
- Taking a single lock for the whole duration of
fetch_update
:
- Such an implementation deadlocks for the execution above.
- Hardware transactional memory (HTM):
- Such an implementation does not make progress: as transaction time out is exceeded, the transaction is reverted, including the counter++ update, before yielding to the other thread.
It would not prevent a lock-based implementation that uses a CAS-loop, since an implementation that takes the lock multiple times within fetch_update
makes progress.
The restrictions above preserve these two implementation opportunities.
Portable CAS-loop
The proposed fetch_update
semantics could be weakened by removing all restrictions on UnaryOp
, since atomic read-modify-write operations are atomic w.r.t. the modification order of the memory location being operated on. This is the implementation pursued by, e.g., Rust's atomic fetch_update APIs.
Then concurrent forward progress guarantees that Litmust test 0 (discovered concurrency) terminates, which implementations that take a single-lock for the whole fetch_update
or use HTM fail to uphold in that case.
That does not mean that single-lock or HTM cannot be used, but rather that compiler analysis is required to prove that UnaryOp
does not perform any of the operations that would enable those implementations to be valid (if the compiler can prove that, then they can still be used).
The proposed semantics are forward compatible with this weakening, i.e., it can be done later without breaking any valid code.
Transactional guarantees
The proposed fetch_update
semantics could be strengthened by guaranteeing that every UnaryOp
call happens-before another fetch_update
call as part of an atomic access to the same memory location.
This would require implementing fetch_update
as an atomic transaction (e.g. as per P1875: Transactional Memory Lite), and would guarantee the absence of data-races in the following litmus test:
atomic<int> atom;
int counter = 0;
int main() {
std::jthread t0([&] { atom.fetch_update([&](int) {
counter++;
return 0;
})});
std::jthread t1([&] { atom.fetch_update([&](int) {
counter++;
return 0;
})});
return 0;
}
This strengthening would require implementing fetch_update
:
- with a single lock for the whole
fetch_update
, or
- hardware transactional memory.
The proposed semantics are forward compatible with this strengthening, i.e., it can be done later without breaking any valid code.
std::expected
-based API
An alternative API choice is to use std::expected
to return the value in the success or failure case:
template <typename UnaryOp>
expected<T, T> fetch_update(UnaryOp uop,
memory_order order = memory_order::seq_cst) const noexcept;
atomic<pair<char, short>> atom;
auto r = atom.fetch_update([](pair<char, short> p) -> optional<pair<char, short>> {
if (p.first > 42) return nullopt;
return make_pair(p.first+1, p.second+2)};
});
if (r) assert(r.value().first <= 42);
else assert(r.error().first > 42);
Function object return type
The function object return type is mandated to be T
or optional<T>
.
Alternatives considered:
UnaryOp(T, bool&) -> T
and signal failure through bool&
: requires constructing a T
dead value for the return.
UnaryOp(T) -> T
and signal failure by throwing an exception: no.
- Don't support failure: e.g. by returning the same value as passed in, would either perform an extra write, or whether an atomic read-modify-write vs atomic read happens would be "implicit" depending on whether the value
memcmp
's the same.
Prior art
Rust's fetch_update. Kokkos uses a similar API internally.
Wording
Add feature test macro to <version>
synopsis:
#define __cpp_lib_atomic_fetch_update 202XXXL
For each of the following specializations [atomics.ref.int], [atomics.ref.float], [atomics.types.int], [atomics.types.float], and the generic [atomics.ref.generic.general] and [atomics.types.generic.general], add the following public function after their last public function:
...
template <typename T, typename UnaryOp>
bool fetch_update(T& old, UnaryOp uop,
memory_order order = memory_order::seq_cst) const noexcept;
template <typename T, typename UnaryOp>
bool fetch_update(T& old, UnaryOp uop,
memory_order success = memory_order::seq_cst,
memory_order failure = memory_order::seq_cst) const noexcept;
...
-
Mandates: regular_invocable<UnaryOp, T>
is true, noexcept(uop(declval<T>()))
is true, and either is_same_v<invoke_result_t<UnaryOp, T>, T>
or is_same_v<invoke_result_t<UnaryOp, T>, optional<T>>
is true.
-
Preconditions: failure
is memory_order::relaxed
, memory_order::consume
, memory_order::acquire
, or memory_order::seq_cst
. The UnaryOp
call operator selected by the uop(old)
call expression:
- only accesses its operand or the non-static data-members of
uop
, and
- does not perform a library I/O function call, a synchronization operation, or an atomic operation, and
- eventually returns [intro.progress].
-
Effects: Atomically:
- retrieves the value referenced by
*ptr
into old
, and
- if
invoke_result_t<UnaryOp, T>
is T
, replaces the value referenced by *ptr
with the result of uop(old)
, in which case memory is affected according to the value of success
, or
uop(old).has_value()
is true
, replaces the value referenced by *ptr
with the result of uop(old).value()
, in which case memory is affected according to the value of success
, or
- memory is affected according to the value of
failure
.
When only one memory_order
argument is supplied, the value of success
is order
, and the value of failure
is order
except that a value of memory_order::acq_rel
shall be replaced by the value memory_order::acquire
and a value of memory_order::release
shall be replaced by the value memory_order::relaxed
.
-
Returns: true
if the operation replaced the value of *ptr
and false
otherwise.
[Note: if fetch_update
returns true
, it is an atomic read-modify-write operation according to [intro.races], and an atomic read otherwise - end note]
Document number: P3330R0
Date: 2024-6-17.
Authors: Gonzalo Brito Gadeschi, Damien Lebrun-Grandie.
Reply to: Gonzalo Brito Gadeschi <gonzalob _at_ nvidia.com>.
Audience: SG1.
Table of Contents
User-defined Atomic Read-Modify-Write Operations
We propose extending the read-modify-write APIs of
std::atomic
andstd::atomic_ref
with user-defined read-modify-write atomic operations. These move the responsibility of writing CAS-like loops from user code into the implementation, improving the portability, performance, and forward-progress of C++ applications.Motivation
Without these extensions, users are required to implement user-defined read-modify-write operations with
compare_exchange
loops (CAS-loops from now on), which:add
into an atomic read-modify-writefetch_add
), andDesign
We propose extending
std::atomic
andstd::atomic_ref
with user-defined read-modify-write atomic operations:The result of the unary operation
uop
is used to atomically update the value refered by the atomic type and it may be of typeT
oroptional<T>
. Ifuop
returns:T
, or of typeoptional<T>
withhas_value()
true, thenfetch_update
writes the value returned byuop
to memory and returnstrue
to indicate its effect is that of an atomic read-modify-write operation,nullopt
, thenfetch_update
returnsfalse
to indicate its effect is that of an atomic read operation.In both cases, the value fetched by
fetch_update
and passed touop
is written toold
.This API is similar to that of
compare_exchange
operations and enables exposing vendor-specific atomic operations (e.g.,fetch_fp_exp
):Before
After
The following example shows how to use the API to update a
pair
depending on the value of one of its sub-objects, aborting the update and turning the operation into an atomic read when a certain condition is met:Implementation strategies
The semantics this API may provide are restricted by the implementation strategies allowed and vice-versa. This design chooses to support, among others, following implementation strategies:
The design also aims at simplifying the implementation requirements for leveraging hardware acceleration.
Requirements on the user-defined operation
To support these implementation strategies, the
UnaryOp(T)
call-operator anduop(old)
call-expression are required to:regular_invocable
(for same inputs, produces same output), andnoexcept(declval<UnaryOp>()(declval<T>()))
is true, and,This requirements should imply that:
uop(old)
may have are modifying its non-static data members,uop(old)
can't access such state.Requirements on the implementation
The effect of
fetch_update
is to calluop(old)
exactly once. Implementation strategies that calluop(old)
multiple times are allowed as long as they behave "as if"uop(old)
was called exactly once.For example, CAS-loop based implementations must use that
UnaryOp
is implicit-lifetime to back up theuop
object, and calluop(old)
on a fresh copy of the backup every iteration, to discard any modifications thatuop(old)
may have done to theuop
object itself.Limitations for users
The major downsides of these requirements are that:
printf
-based debugging withinUnaryOp
,Do the requirements protect from ABA? (TODO: add an example showing that they do not).
Alternatives
Does restricting
UnaryOp
to beconstexpr
suffice?No,
constexpr
functions in C++26 may not return in a finite set of steps, e.g., on free-standing implementations that do not replace the empty iteration statement inwhile(1);
with a call to the non-constexpr
APIstd::this_thread::yield()
, enablingwhile(1)
within constant evaluation:Furthermore, P3309 proposes
constexpr
atomics and synchronization operations.Notes on Forward progress
Notice that relaxing
UnaryOp
constraints to:UnaryOp
to write to memory outside theUnaryOp
itself, andwould make the following program legal:
If the specification would require that this program terminates, then the following implementations become invalid for this execution:
fetch_update
:It would not prevent a lock-based implementation that uses a CAS-loop, since an implementation that takes the lock multiple times within
fetch_update
makes progress.The restrictions above preserve these two implementation opportunities.
Portable CAS-loop
The proposed
fetch_update
semantics could be weakened by removing all restrictions onUnaryOp
, since atomic read-modify-write operations are atomic w.r.t. the modification order of the memory location being operated on. This is the implementation pursued by, e.g., Rust's atomic fetch_update APIs.Then concurrent forward progress guarantees that Litmust test 0 (discovered concurrency) terminates, which implementations that take a single-lock for the whole
fetch_update
or use HTM fail to uphold in that case.That does not mean that single-lock or HTM cannot be used, but rather that compiler analysis is required to prove that
UnaryOp
does not perform any of the operations that would enable those implementations to be valid (if the compiler can prove that, then they can still be used).The proposed semantics are forward compatible with this weakening, i.e., it can be done later without breaking any valid code.
Transactional guarantees
The proposed
fetch_update
semantics could be strengthened by guaranteeing that everyUnaryOp
call happens-before anotherfetch_update
call as part of an atomic access to the same memory location.This would require implementing
fetch_update
as an atomic transaction (e.g. as per P1875: Transactional Memory Lite), and would guarantee the absence of data-races in the following litmus test:This strengthening would require implementing
fetch_update
:fetch_update
, orThe proposed semantics are forward compatible with this strengthening, i.e., it can be done later without breaking any valid code.
std::expected
-based APIAn alternative API choice is to use
std::expected
to return the value in the success or failure case:Function object return type
The function object return type is mandated to be
T
oroptional<T>
.Alternatives considered:
UnaryOp(T, bool&) -> T
and signal failure throughbool&
: requires constructing aT
dead value for the return.UnaryOp(T) -> T
and signal failure by throwing an exception: no.memcmp
's the same.Prior art
Rust's fetch_update. Kokkos uses a similar API internally.
Wording
Add feature test macro to
<version>
synopsis:For each of the following specializations [atomics.ref.int], [atomics.ref.float], [atomics.types.int], [atomics.types.float], and the generic [atomics.ref.generic.general] and [atomics.types.generic.general], add the following public function after their last public function:
Mandates:
regular_invocable<UnaryOp, T>
is true,noexcept(uop(declval<T>()))
is true, and eitheris_same_v<invoke_result_t<UnaryOp, T>, T>
oris_same_v<invoke_result_t<UnaryOp, T>, optional<T>>
is true.Preconditions:
failure
ismemory_order::relaxed
,memory_order::consume
,memory_order::acquire
, ormemory_order::seq_cst
. TheUnaryOp
call operator selected by theuop(old)
call expression:uop
, andEffects: Atomically:
*ptr
intoold
, andinvoke_result_t<UnaryOp, T>
isT
, replaces the value referenced by*ptr
with the result ofuop(old)
, in which case memory is affected according to the value ofsuccess
, oruop(old).has_value()
istrue
, replaces the value referenced by*ptr
with the result ofuop(old).value()
, in which case memory is affected according to the value ofsuccess
, orfailure
.When only one
memory_order
argument is supplied, the value ofsuccess
isorder
, and the value offailure
isorder
except that a value ofmemory_order::acq_rel
shall be replaced by the valuememory_order::acquire
and a value ofmemory_order::release
shall be replaced by the valuememory_order::relaxed
.Returns:
true
if the operation replaced the value of*ptr
andfalse
otherwise.[Note: if
fetch_update
returnstrue
, it is an atomic read-modify-write operation according to [intro.races], and an atomic read otherwise - end note]