Document number: P3111R0.
Date: 2024-05-16.
Authors: Gonzalo Brito Gadeschi, Simon Cooksey, Daniel Lustig.
Reply to: Gonzalo Brito Gadeschi <gonzalob _at_ nvidia.com>.
Audience: SG1, SG6.
Atomic Reduction Operations
Atomic Reduction Operations are atomic read-modify-write (RMW) operations (like fetch_add) that do not "fetch" the old value and are not reads from the Memory Model perspective. This enables implementations to leverage hardware acceleration available in modern CPU and GPU architectures.
Furthermore, we propose to allow atomic memory operations that aren't reads in unsequenced execution, and to extend atomic arithmetic reduction operations for floating-point types with operations that assume floating-point arithmetic is associative.
Introduction
Concurrent algorithms performing atomic RMW operations that discard the old fetched value are very common in high-performance computing, e.g., finite-element matrix assembly, data analytics (e.g. building histograms), etc.
Consider the following parallel algorithm to build a histogram (full implementation):
using T = unsigned;  // element type of the input data
span<T> data;
array<atomic<unsigned>, N> buckets;
constexpr T bucket_sz = numeric_limits<T>::max() / (T)N;
unsigned nthreads = thread::hardware_concurrency();
for_each_n(execution::par_unseq, views::iota(0).begin(), nthreads,
  [&](int thread) {
    unsigned data_per_thread = data.size() / nthreads;
    T* data_thread = data.data() + data_per_thread * thread;
    for (auto e : span<T>(data_thread, data_per_thread))
      buckets[e / bucket_sz].fetch_add(1, memory_order_relaxed);
  });
This program has two main issues:
- Correctness (undefined behavior): The program should have used the par execution policy to avoid undefined behavior, since the atomic operation is not:
  - Potentially concurrent, and it therefore exhibits a data race in unsequenced contexts ([intro.execution]).
  - Vectorization-safe ([algorithms.parallel.defns#5]), since it is specified to synchronize with other function invocations.
- Performance: Sophisticated compiler analysis is required to optimize the above program for scalable hardware architectures with atomic reduction operations.
Atomic reduction operations address both shortcomings:
Before (compiler-explorer):

#include <algorithm>
#include <atomic>
#include <execution>
#include <vector>
using namespace std;
using execution::par_unseq;

int main() {
  size_t N = 10000;
  vector<int> v(N, 0);
  atomic<int> atom = 0;
  for_each_n(par_unseq,
             v.begin(), N,
             [&](auto& e) {
               atom.fetch_add(e);
             });
  return atom.load();
}

After:

#include <algorithm>
#include <atomic>
#include <execution>
#include <vector>
using namespace std;
using execution::par_unseq;

int main() {
  size_t N = 10000;
  vector<int> v(N, 0);
  atomic<int> atom = 0;
  for_each_n(par_unseq,
             v.begin(), N,
             [&](auto& e) {
               atom.add(e);
             });
  return atom.load();
}
This new operation can then be used in the Histogram Example (Example 0) to replace the fetch_add with just add.
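For illustration, under this proposal the inner loop of the Histogram Example (Example 0) becomes (sketch, using the proposed add member):

for (auto e : span<T>(data_thread, data_per_thread))
  buckets[e / bucket_sz].add(1, memory_order_relaxed);  // no value is fetched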
Motivation
Hardware Exposure
The following ISAs provide Atomic Reduction Operations:
Some of these instructions lack a destination operand (red, STADD, AADD). Others change semantics when the destination register used discards the result (Arm's LDADD RZ, SWP RZ, CAS RZ).
All ISAs provide the same semantics: these are not loads from the point of view of the Memory Model, and therefore do not participate in acquire sequences, but they do participate in release sequences (e.g., red is "not a read operation"). These architectures provide both "relaxed" and "release" orderings for the reductions (e.g., red.relaxed/red.release, STADD/STADDL).
Performance
On hardware architectures that implement these as far atomics, the exposed latency of Atomic Reduction Operations may be as low as half that of "fetch_<key>" operations.
Example: on an NVIDIA Hopper H100 GPU, replacing atomic.fetch_add with atomic.add on the Histogram Example (Example 0) improves throughput by 1.2x.
Design
For each atomic fetch_{OP} in the atomic<T> and atomic_ref<T> class templates and their specializations, we introduce new {OP} member functions that return void.
These can be conservatively implemented on top of the fetch_{OP} APIs.
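For exposition only, a conservative (unaccelerated) implementation of the proposed void-returning add can simply forward to fetch_add and discard the result; the free-function name below is hypothetical:

#include <atomic>

template <class T>
void reduce_add(std::atomic<T>& a, T operand,
                std::memory_order order = std::memory_order_seq_cst) {
  (void)a.fetch_add(operand, order);  // conforming, but gains no hardware acceleration
}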
Alternative: optimizations for fetch_<key>
Attempting to improve application performance by implementing compiler optimizations that leverage Atomic Reduction Operations from fetch_<key> APIs has become a rite of passage for compiler engineers, e.g., GCC#509632, LLVM#68428, LLVM#72747, … Unfortunately, "simple" optimization strategies break backward compatibility in the following litmus tests (among others).
Litmus Test 0: from Will Deacon. Performing the optimization of replacing the fetch_add with a non-fetching add introduces the y == 2 && r0 == 1 && r1 == 0 outcome:
void thread0(atomic_int* y,atomic_int* x) {
atomic_store_explicit(x,1,memory_order_relaxed);
atomic_thread_fence(memory_order_release);
atomic_store_explicit(y,1,memory_order_relaxed);
}
void thread1(atomic_int* y,atomic_int* x) {
atomic_fetch_add_explicit(y,1,memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
int r0 = atomic_load_explicit(x,memory_order_relaxed);
}
void thread2(atomic_int* y) {
int r1 = atomic_load_explicit(y,memory_order_relaxed);
}
Litmus Test 1: from Luke Geeson. Performing the optimization of replacing the exchange with a store introduces the r0 == 0 && y == 2 outcome:
void thread0(atomic_int* y,atomic_int* x) {
atomic_store_explicit(x,1,memory_order_relaxed);
atomic_thread_fence(memory_order_release);
atomic_store_explicit(y,1,memory_order_relaxed);
}
void thread1(atomic_int* y,atomic_int* x) {
atomic_exchange_explicit(y,2,memory_order_release);
atomic_thread_fence(memory_order_acquire);
int r0 = atomic_load_explicit(x,memory_order_relaxed);
}
On some architectures, Atomic Reduction Operations can write to memory pages that are not readable; such uses need a reliable programming model that does not depend on compiler optimizations for functionality.
Forward progress
Currently, all atomic memory operations are vectorization-unsafe and therefore not allowed in element access functions of parallel algorithms when the unseq or par_unseq execution policies are used (see [algorithms.parallel.exec.5] and [algorithms.parallel.exec.7]). Atomic memory operations that "read" (e.g. load, fetch_<key>, compare_exchange, exchange, …) enable building synchronization edges that block, which within unseq/par_unseq leads to deadlocks. N4070 solved this by tightening the wording to disallow any synchronization API from being called from within unseq/par_unseq.
Allowing Atomic Writes and Atomic Reduction Operations in unsequenced execution increases the set of concurrent algorithms that can be implemented on the lowest common denominator of hardware that C++ supports. In particular, many hardware architectures that can accelerate unseq/par_unseq but cannot accelerate par (e.g., most non-NVIDIA GPUs) provide acceleration for atomic reduction operations.
We propose making lock-free atomic operations that are not reads vectorization-safe, to enable calling them from unsequenced execution. Atomic operations that read remain vectorization-unsafe and therefore UB:
for_each(par_unseq, ..., [&](...) {
  assert(atom.is_lock_free());
  atom.store(42);            // OK: not a read
  atom.add(1);               // OK: proposed atomic reduction operation
  atom.fetch_add(1);         // UB: reads the old value
  atom.exchange(42);         // UB: reads the old value
  while (atom.load() < 42);  // UB: reads (and may block)
});
The rest of this proposal is not blocked on this extension, which could be split into a separate proposal.
Implementation impact
There is no impact on implementations using:
- The OpenMP simd pragma for unseq and par_unseq, since OpenMP supports atomics within simd regions (see the sketch below).
- #pragma clang loop, since it is only a hint.
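For example (illustrative sketch, OpenMP 5.0 or later), an atomic update remains valid inside a simd region, so the atomic can stay in the vectorized loop:

#include <cstddef>

void histogram_simd(unsigned* buckets, const unsigned* key, std::size_t n) {
  #pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    #pragma omp atomic update
    buckets[key[i]] += 1;  // atomic update allowed within the simd construct
  }
}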
Generalized Atomic Reduction Operations
The outcome x == a + (b + c) is not valid for the following litmus test, because either the atomic reduction of thread0 happens-before that of thread1, or vice versa, and floating-point arithmetic is not associative:
atomic<float> x = a;
void thread0() { x.add(b, memory_order_relaxed); }
void thread1() { x.add(c, memory_order_relaxed); }
The value of this limitation seems small, since each execution may pick a different non-deterministic order.
The cost of this limitation is significant, since it requires implementations to perform reductions sequentially using O(N) operations instead of with O(log(N)) tree-reduction algorithms. On GPU architectures, performing a horizontal reduction within the thread group and then issuing a single atomic operation per thread group reduces the number of atomic operations issued by up to the size of the thread group.
We propose providing generalized atomic reduction operations, defined in an analogous way to GENERALIZED_SUM (see [numerics.defns]). The "add" and "add_generalized" operations below are different, and the latter provides implementations with the flexibility to perform a tree reduction.
We could either provide only operations with the GENERALIZED_... semantics for floating-point, or provide them as separate methods:

template <floating-point>
class atomic {
  void add(floating-point, memory_order);
  void add_generalized(floating-point, memory_order);
};
Given the non-determinism inherent to concurrent atomic operations, the value of providing a version that differs from GENERALIZED_... for floating-point seems low. That is, we give the atomic<floating-point>::add/sub/... methods GENERALIZED_... semantics, and do not provide separate methods to choose between the two.
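For illustration, the following sketch (assuming the proposed atomic<float>::add with the semantics above) shows the rewrite that generalized arithmetic permits: an implementation may combine the per-element form into a local tree reduction followed by a single atomic reduction per thread group.

#include <atomic>
#include <numeric>
#include <span>

// What the user writes: one atomic reduction per element.
void per_element(std::span<const float> elems, std::atomic<float>& total) {
  for (float v : elems)
    total.add(v, std::memory_order_relaxed);  // proposed API
}

// What an implementation may produce for a thread group: a local reduction
// (any association is allowed) followed by a single atomic reduction.
void per_group(std::span<const float> elems, std::atomic<float>& total) {
  float partial = std::reduce(elems.begin(), elems.end(), 0.0f);
  total.add(partial, std::memory_order_relaxed);  // proposed API
}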
The rest of this proposal is not blocked on this extension, which could be split into a separate proposal.
Memory Ordering
We choose to support memory_order_relaxed, memory_order_release, and memory_order_seq_cst, since we do not see any issues with doing that. If this proves to be controversial, only exposing memory_order_relaxed would still be valuable.
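For illustration, a release reduction still heads a release sequence, so the usual publish pattern works (sketch, assuming the proposed add member):

#include <atomic>

int data[64];
std::atomic<int> done{0};

void writer(int i) {                         // called by 64 threads, one index each
  data[i] = 42;                              // plain write
  done.add(1, std::memory_order_release);    // proposed: non-fetching release reduction
}

void reader() {
  while (done.load(std::memory_order_acquire) < 64) { /* spin */ }
  // The acquire load synchronizes with the release reductions, so all
  // writes to data[] happen-before this point.
}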
Formalization
Herd already supports these for STADD on Arm, and the NVIDIA Volta Memory Model supports these for red on PTX. If we decide to pursue this exposure direction, this proposal should be blocked on extending Herd's RC11 with this extension to ensure it is sound.
Wording
Unsequenced support
Add to [intro.execution]:
Except where noted, evaluations of operands of individual operators and of subexpressions of individual expressions are unsequenced.
[Note 5: In an expression that is evaluated more than once during the execution of a program, unsequenced and indeterminately sequenced evaluations of its subexpressions need not be performed consistently in different evaluations. — end note]
The value computations of the operands of an operator are sequenced before the value computation of the result of the operator. If a side effect on a memory location ([intro.memory]) is unsequenced relative to either another side effect on the same memory location or a value computation using the value of any object in the same memory location, and they are not lock-free atomic read operations ([atomics]) or potentially concurrent ([intro.multithread]), the behavior is undefined.
[Note 6: The next subclause imposes similar, but more complex restrictions on potentially concurrent computations. — end note]
[Example 3:
void g(int i) {
i = 7, i++, i++;
i = i++ + 1;
i = i++ + i;
i = i + 1;
}
— end example]
Add to [algorithms.parallel.defns]:
Note: we should consider exempting stores as well.
A standard library function is vectorization-unsafe if it is specified to synchronize with another function invocation, or another function invocation is specified to synchronize with it, and if it is not a memory allocation or deallocation function.
[Note 2: Implementations must ensure that internal synchronization inside standard library functions does not prevent forward progress when those functions are executed by threads of execution with weakly parallel forward progress guarantees. — end note]
[Example 2:
int x = 0;
std::mutex m;
void f() {
int a[] = {1,2};
std::for_each(std::execution::par_unseq, std::begin(a), std::end(a), [&](int) {
std::lock_guard<mutex> guard(m);
++x;
});
}
The above program may result in two consecutive calls to m.lock() on the same thread of execution (which may deadlock), because the applications of the function object are not guaranteed to run on different threads of execution. — end example]
Forward progress
Modify intro.progress#1 as follows:
Note: we should consider exempting stores as well.
The implementation may assume that any thread will eventually do one of the following:
- terminate,
- make a call to a library I/O function,
- perform an access through a volatile glvalue, or
- perform a synchronization operation or an atomic operation that is not an atomic reduction operation.
[Note 1: This is intended to allow compiler transformations such as removal of empty loops, even when termination cannot be proven. — end note]
Modify intro.progress#3 as follows:
Note: we should consider exempting stores as well.
During the execution of a thread of execution, each of the following is termed an execution step:
- termination of the thread of execution,
- performing an access through a volatile glvalue, or
- completion of a call to a library I/O function, a synchronization operation, or an atomic operation that is not an atomic reduction operation.
No acquire sequences support
Modify [atomics.fences] as follows:
33.5.11 Fences[atomics.fences]
- This subclause introduces synchronization primitives called fences. Fences can have acquire semantics, release semantics, or both. A fence with acquire semantics is called an acquire fence. A fence with release semantics is called a release fence.
- A release fence A synchronizes with an acquire fence B if there exist an atomic operation X and a non-reduction atomic operation Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
- A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
- An atomic operation A that is a release operation on an atomic object M synchronizes with an acquire fence B if there exists some atomic operation X on M such that X is sequenced before B and reads the value written by A or a value written by any side effect in the release sequence headed by A.
Atomic Reduction Operation APIs
The following operations perform arithmetic computations. The correspondence among key, operator, and computation is specified in Table 145:
key    op    computation
add    +     addition
sub    -     subtraction
max          maximum
min          minimum
and    &     bitwise and
or     |     bitwise inclusive or
xor    ^     bitwise exclusive or
Add to [atomics.ref.int]:
void key(integral-type operand, memory_order order = memory_order_seq_cst) const noexcept;
- Preconditions: order is memory_order_relaxed, memory_order_release, or memory_order_seq_cst.
- Effects: Atomically replaces the value referenced by *ptr with the result of the computation applied to the value referenced by *ptr and the given operand. Memory is affected according to the value of order. These operations are atomic read-modify-write operations ([intro.races]) and atomic reduction operations ([atomics.fences]). Lock-free atomic reduction operations are vectorization-safe ([algorithms.parallel.defns]).
- Remarks: For signed integer types, the result is as if the object value and parameters were converted to their corresponding unsigned types, the computation performed on those types, and the result converted back to the signed type.
[Note 2: There are no undefined results arising from the computation. — end note]
- Remarks: For floating-point types, add and sub perform generalized ([numerics.defns]) arithmetic.
Analogously for all other std::atomic and std::atomic_ref types and specializations: [atomics.types.int], [atomics.types.float], [atomics.types.pointer] (with difference_type operand), [atomics.ref.float], [atomics.ref.pointer] (with difference_type operand), etc.
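For illustration, a usage sketch of the proposed members through atomic_ref (add, sub, and max shown; none returns a value):

#include <atomic>

void update(int& counter) {
  std::atomic_ref<int> c(counter);
  c.add(1, std::memory_order_relaxed);   // proposed: like fetch_add, but returns void
  c.max(42, std::memory_order_relaxed);  // proposed: like fetch_max, but returns void
  c.sub(3);                              // default ordering is memory_order_seq_cst
}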