Document Number: P2628R0
Date: 2022-07-01
Reply to: Gonzalo Brito Gadeschi <gonzalob _at_ nvidia.com>
Authors: Gonzalo Brito Gadeschi
Audience: Concurrency
Extend barrier APIs with memory_order
Motivation
std::barrier is a synchronization primitive that orders memory visibility across threads. Its operations - arrive, wait, arrive_and_wait, arrive_and_drop - guarantee visibility of all memory operations performed before the arrive to all threads that are unblocked from the wait (after they are unblocked from it).
Sometimes, guaranteeing the visibility of all memory operations is neither required nor desired.
Examples:
- Communicating “outside” the C++ abstract machine. Examples:
  - Every thread participating in the barrier opens, writes to, and closes a file. Threads use the barrier to wait until all files have been closed, using bar.arrive_and_wait(1, memory_order_relaxed). The memory visibility is ensured by the filesystem, outside the C++ abstract machine. This is similar to how, e.g., MPI_Ibarrier does not establish cumulativity across MPI ranks for memory operations on MPI_Windows.
  - Every thread participating in the barrier is responsible for configuring a part of a machine via volatile operations. After the machine is configured, one thread should start it before any thread is released from the barrier. Threads achieve this by using memory_order_relaxed arrive/wait operations together with a CompletionFunction that runs after the last thread has arrived.
- Object fences: P2535 - and P0153 before it - proposes atomic_object_fence and atomic_message_fence. These fences apply only to a subset of all objects and memory operations. This paper enables applications to compose std::barrier with P2535 fences. This is explored in its own section further below.
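The machine-configuration example above can be sketched as follows. This is only an illustration under stated assumptions: sketch_barrier is a hypothetical spin barrier written for this example (it is not std::barrier and not the reference implementation), it omits arrival tokens and arrive_and_drop, and run_machine_demo is an invented driver. With memory_order_relaxed the barrier still orders control flow (nobody leaves before the completion function has run) but makes no promise about the visibility of ordinary memory operations.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical spin barrier for illustration only; NOT std::barrier.
class sketch_barrier {
public:
    sketch_barrier(std::ptrdiff_t expected, std::function<void()> completion)
        : expected_(expected), completion_(std::move(completion)) {}

    // Assumes update never crosses a phase boundary (the paper's precondition).
    void arrive_and_wait(std::ptrdiff_t update = 1,
                         std::memory_order order = std::memory_order_acq_rel) {
        const bool relaxed = (order == std::memory_order_relaxed);
        // Arrivals are counted monotonically, so no counter is ever reset.
        const std::ptrdiff_t before = arrived_.fetch_add(
            update, relaxed ? std::memory_order_relaxed : std::memory_order_acq_rel);
        const std::ptrdiff_t my_phase = before / expected_;
        if (before % expected_ + update == expected_) {
            completion_();  // phase completion step, run by the last arriver
            phase_.store(my_phase + 1, relaxed ? std::memory_order_relaxed
                                               : std::memory_order_release);
        } else {
            while (phase_.load(relaxed ? std::memory_order_relaxed
                                       : std::memory_order_acquire) <= my_phase) {
                std::this_thread::yield();
            }
        }
    }

private:
    const std::ptrdiff_t expected_;
    std::function<void()> completion_;
    std::atomic<std::ptrdiff_t> arrived_{0};
    std::atomic<std::ptrdiff_t> phase_{0};
};

// Each thread "configures" its own register, then all threads meet at a
// relaxed barrier whose completion function "starts the machine".
// Returns the number of starts, or -1 if a register was left unconfigured.
int run_machine_demo(int threads, int phases) {
    std::atomic<int> starts{0};
    std::vector<int> config(threads, 0);
    sketch_barrier bar(threads, [&] { starts.fetch_add(1, std::memory_order_relaxed); });
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (int p = 0; p < phases; ++p) {
                volatile int* reg = &config[t];  // stands in for a device register
                *reg = p + 1;
                bar.arrive_and_wait(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    for (int c : config) {
        if (c != phases) return -1;
    }
    return starts.load();
}
```

Calling run_machine_demo(4, 3) runs three phases on four threads; the completion function runs once per phase, on whichever thread arrives last.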
Other synchronization primitives could be extended in an analogous way, but we choose to focus on std::barrier during the initial revisions of this paper. Some synchronization primitives, like std::atomic::wait, already expose a memory_order parameter.
Tony Tables
Before (does not benefit from atomic_object_fence, since the barrier operations insert full-memory fences):
  // Thread 0:
  x = 1;
  atomic_object_fence(memory_order_release, x);
  bar.arrive(); // release fence
  // Thread 1:
  bar.arrive_and_wait(); // acquire fence
  atomic_object_fence(memory_order_acquire, x);
  assert(x == 1);

After (benefits from atomic_object_fence, since the barrier inserts no fences):
  // Thread 0:
  x = 1;
  atomic_object_fence(memory_order_release, x);
  bar.arrive(1, memory_order_relaxed); // no fence
  // Thread 1:
  bar.arrive_and_wait(1, memory_order_relaxed); // no fence
  atomic_object_fence(memory_order_acquire, x);
  assert(x == 1);
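The "After" pattern can be approximated with facilities that exist today. The sketch below makes two assumptions: std::atomic_thread_fence stands in for the proposed atomic_object_fence (it is stronger, since it covers all objects rather than only x), and one_shot_barrier is a hypothetical single-phase, two-participant stand-in whose operations are all relaxed. The fences alone establish the ordering that makes the read of x well defined.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical single-phase barrier for two participants; every operation on
// it is relaxed, mirroring the "no fence" barrier calls of the After column.
struct one_shot_barrier {
    std::atomic<int> pending{2};

    void arrive_relaxed() { pending.fetch_sub(1, std::memory_order_relaxed); }

    void arrive_and_wait_relaxed() {
        pending.fetch_sub(1, std::memory_order_relaxed);
        while (pending.load(std::memory_order_relaxed) != 0) {
            std::this_thread::yield();
        }
    }
};

// Returns the value of x observed by thread 1 after the barrier.
int run_fence_demo() {
    int x = 0;
    int seen = -1;
    one_shot_barrier bar;

    std::thread t0([&] {
        x = 1;
        // Stand-in for atomic_object_fence(memory_order_release, x).
        std::atomic_thread_fence(std::memory_order_release);
        bar.arrive_relaxed();
    });
    std::thread t1([&] {
        bar.arrive_and_wait_relaxed();
        // Stand-in for atomic_object_fence(memory_order_acquire, x).
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = x;
    });

    t0.join();
    t1.join();
    return seen;
}
```

Thread 1 leaves the barrier only after an atomic operation of its own has read the value written by thread 0's arrival, so the release fence synchronizes with the acquire fence and the read of x observes 1.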
Wording
Note: an implementation is available here in the barrier_memory_order.hpp file.
thread.barrier.class:
namespace std {
  template<class CompletionFunction = see below>
  class barrier {
  public:
    using arrival_token = see below;

    static constexpr ptrdiff_t max() noexcept;

    constexpr explicit barrier(ptrdiff_t expected,
                               CompletionFunction f = CompletionFunction());
    ~barrier();

    barrier(const barrier&) = delete;
    barrier& operator=(const barrier&) = delete;

    [[nodiscard]] arrival_token arrive(ptrdiff_t update = 1, memory_order = memory_order_release);
    void wait(arrival_token&& arrival, memory_order = memory_order_acquire) const;

    void arrive_and_wait(ptrdiff_t update = 1, memory_order = memory_order_acq_rel);
    void arrive_and_drop(ptrdiff_t update = 1, memory_order = memory_order_release);

  private:
    CompletionFunction completion;      // exposition only
  };
}
Unresolved question: should we add update = 1 parameters to the APIs that lack them? This revision does so only to show what that would look like.
- Each barrier phase consists of the following steps:
  - The expected count is decremented by each call to arrive or arrive_and_drop.
  - When the expected count reaches zero, the phase completion step is run. For the specialization with the default value of the CompletionFunction template parameter, the completion step is run as part of the call to arrive or arrive_and_drop that caused the expected count to reach zero. For other specializations, the completion step is run on one of the threads that arrived at the barrier during the phase.
  - When the completion step finishes, the expected count is reset to what was specified by the expected argument to the constructor, possibly adjusted by calls to arrive_and_drop, and the next phase starts.
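The bookkeeping in these steps can be modeled without any threads. The sketch below is a single-threaded illustration only: phase_model and its members are hypothetical names, there is no blocking or synchronization, and a counter stands in for invoking completion(). It assumes the update parameter proposed in this paper for arrive_and_drop.

```cpp
#include <cassert>
#include <cstddef>

// Single-threaded model of the expected-count bookkeeping of a barrier.
// Hypothetical names; this does not block or synchronize anything.
struct phase_model {
    std::ptrdiff_t initial;    // expected count used when a new phase starts
    std::ptrdiff_t current;    // expected count remaining in the current phase
    int completed_phases = 0;  // how often the completion step has run

    explicit phase_model(std::ptrdiff_t expected)
        : initial(expected), current(expected) {}

    // Precondition (as in the wording): 0 < update <= current.
    void arrive(std::ptrdiff_t update = 1) {
        current -= update;
        finish_phase_if_done();
    }

    // Also lowers the count that every subsequent phase starts from.
    void arrive_and_drop(std::ptrdiff_t update = 1) {
        initial -= update;
        current -= update;
        finish_phase_if_done();
    }

private:
    void finish_phase_if_done() {
        if (current == 0) {
            ++completed_phases;  // stands in for running completion()
            current = initial;   // reset, adjusted by earlier drops
        }
    }
};
```

For instance, with an expected count of 3, an arrive() followed by arrive(2) completes the first phase; a subsequent arrive_and_drop() leaves two arrivals outstanding in the current phase and makes every later phase expect only 2.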
- Each phase defines a phase synchronization point. Threads that arrive at the barrier during the phase can block on the phase synchronization point by calling wait, and will remain blocked until the phase completion step is run.
- The phase completion step that is executed at the end of each phase has the following effects:
  - Invokes the completion function, equivalent to completion().
  - Unblocks all threads that are blocked on the phase synchronization point.

  UNRESOLVED QUESTION: do we need to change something else around here, or does the “as if” in paragraph 4 below (on concurrent invocations) suffice?

  The end of the completion step strongly happens before the returns from all calls that were unblocked by the completion step. For specializations that do not have the default value of the CompletionFunction template parameter, the behavior is undefined if any of the barrier object’s member functions other than wait are called while the completion step is in progress.
- Concurrent invocations of the member functions of barrier, other than its destructor, do not introduce data races, as if they were atomic operations performed with the memory_order associated with them. The member functions arrive and arrive_and_drop execute atomically.
- CompletionFunction shall meet the Cpp17MoveConstructible (Table 30) and Cpp17Destructible (Table 34) requirements. is_nothrow_invocable_v<CompletionFunction&> shall be true.
- The default value of the CompletionFunction template parameter is an unspecified type, such that, in addition to satisfying the requirements of CompletionFunction, it meets the Cpp17DefaultConstructible requirements (Table 29) and completion() has no effects.
- barrier::arrival_token is an unspecified type, such that it meets the Cpp17MoveConstructible (Table 30), Cpp17MoveAssignable (Table 32), and Cpp17Destructible (Table 34) requirements.
static constexpr ptrdiff_t max() noexcept;
- Returns: The maximum expected count that the implementation supports.
constexpr explicit barrier(ptrdiff_t expected,
CompletionFunction f = CompletionFunction());
- Preconditions: expected >= 0 is true and expected <= max() is true.
- Effects: Sets both the initial expected count for each barrier phase and the current expected count for the first phase to expected. Initializes completion with std::move(f). Starts the first phase.
[Note 1: If expected is 0 this object can only be destroyed. — end note]
- Throws: Any exception thrown by CompletionFunction’s move constructor.
[[nodiscard]] arrival_token arrive(ptrdiff_t update = 1, memory_order order = memory_order_release);
- Preconditions: update > 0 is true, update is less than or equal to the expected count for the current barrier phase, and order is memory_order_relaxed or memory_order_release.
- Effects: Constructs an object of type arrival_token that is associated with the phase synchronization point for the current phase. Then, decrements the expected count by update.
- Synchronization: The call to arrive strongly happens before the start of the phase completion step for the current phase.
- Returns: The constructed arrival_token object.
- Throws: system_error when an exception is required ([thread.req.exception]).
- Error conditions: Any of the error conditions allowed for mutex types ([thread.mutex.requirements.mutex]).
[Note 2: This call can cause the completion step for the current phase to start. — end note]
void wait(arrival_token&& arrival, memory_order order = memory_order_acquire) const;
- Preconditions: arrival is associated with the phase synchronization point for the current phase or the immediately preceding phase of the same barrier object, and order is memory_order_relaxed or memory_order_acquire.
- Effects: Blocks at the synchronization point associated with std::move(arrival) until the phase completion step of the synchronization point’s phase is run.
[Note 3: If arrival is associated with the synchronization point for a previous phase, the call returns immediately. — end note]
- Throws: system_error when an exception is required ([thread.req.exception]).
- Error conditions: Any of the error conditions allowed for mutex types ([thread.mutex.requirements.mutex]).
void arrive_and_wait(ptrdiff_t update = 1, memory_order order = memory_order_acq_rel);
- Preconditions: order is memory_order_relaxed or memory_order_acq_rel.
void arrive_and_drop(ptrdiff_t update = 1, memory_order order = memory_order_release);
- Preconditions: The expected count for the current barrier phase is greater than zero, update > 0 is true, update is less than or equal to the expected count for the current barrier phase, and order is memory_order_relaxed or memory_order_release.
  Rationale for applying update to both the initial and the current phase counts: safety. This prevents the initial count from accidentally dropping below the current count.
- Effects: Decrements the initial expected count for all subsequent phases by update. Then decrements the expected count for the current phase by update.
- Synchronization: The call to arrive_and_drop strongly happens before the start of the phase completion step for the current phase.
- Throws: system_error when an exception is required ([thread.req.exception]).
- Error conditions: Any of the error conditions allowed for mutex types ([thread.mutex.requirements.mutex]).
[Note 4: This call can cause the completion step for the current phase to start. — end note]
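The order preconditions in the wording above can be summarized as small predicates. This is a sketch only: the helper names are hypothetical and not part of the proposed wording, and the predicate for arrive_and_wait assumes that it accepts memory_order_relaxed or its default, memory_order_acq_rel.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helpers that restate the order preconditions of each member.
// arrive and arrive_and_drop: relaxed or release.
constexpr bool valid_arrive_order(std::memory_order order) {
    return order == std::memory_order_relaxed || order == std::memory_order_release;
}

// wait: relaxed or acquire.
constexpr bool valid_wait_order(std::memory_order order) {
    return order == std::memory_order_relaxed || order == std::memory_order_acquire;
}

// arrive_and_wait: relaxed or acq_rel (assumed, see the lead-in).
constexpr bool valid_arrive_and_wait_order(std::memory_order order) {
    return order == std::memory_order_relaxed || order == std::memory_order_acq_rel;
}

// The default argument of each member satisfies its own precondition.
static_assert(valid_arrive_order(std::memory_order_release), "arrive default");
static_assert(valid_wait_order(std::memory_order_acquire), "wait default");
static_assert(valid_arrive_and_wait_order(std::memory_order_acq_rel),
              "arrive_and_wait default");
```

An implementation could use checks of this kind in debug assertions; passing, say, memory_order_acquire to arrive violates the precondition.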
Compatibility with P2535
P2535 (and P0153 before it) proposes extending C++ with object fences. If object fences were added to C++, it could make sense to further extend the barrier APIs to support them, e.g., as follows:
int& data;
data = 42;
bar_x.arrive(1, obj_fence, data);
bar_y.arrive(1, obj_fence, data);
These APIs would be semantically similar - although not identical - to the following code using the APIs in this proposal.
int& data;
data = 42;
atomic_object_fence(memory_order_release, data);
bar_x.arrive(1, memory_order_relaxed);
atomic_object_fence(memory_order_release, data);
bar_y.arrive(1, memory_order_relaxed);
Combining both APIs is still valuable, since it allows power users to write code with fewer fences where necessary:
int& data;
data = 42;
atomic_object_fence(memory_order_release, data);
bar_x.arrive(1, memory_order_relaxed);
bar_y.arrive(1, memory_order_relaxed);
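The single-fence variant can be tried out with existing facilities under two assumptions: std::atomic_thread_fence stands in for atomic_object_fence (it orders all objects, not only data), and relaxed_gate is a hypothetical stand-in for a barrier with a single relaxed arrival. The sketch shows one release fence covering two subsequent relaxed arrivals.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical stand-in for a barrier with exactly one relaxed arrival.
struct relaxed_gate {
    std::atomic<bool> open{false};

    void arrive() { open.store(true, std::memory_order_relaxed); }

    void wait() const {
        while (!open.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
    }
};

// Returns the values of data seen by the waiters on bar_x and bar_y.
std::pair<int, int> run_shared_fence_demo() {
    int data = 0;
    int seen_x = -1;
    int seen_y = -1;
    relaxed_gate bar_x;
    relaxed_gate bar_y;

    std::thread producer([&] {
        data = 42;
        // One release fence precedes both relaxed arrivals.
        std::atomic_thread_fence(std::memory_order_release);
        bar_x.arrive();
        bar_y.arrive();
    });
    auto consume = [&](relaxed_gate& bar, int& seen) {
        bar.wait();
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = data;
    };
    std::thread waiter_x([&] { consume(bar_x, seen_x); });
    std::thread waiter_y([&] { consume(bar_y, seen_y); });

    producer.join();
    waiter_x.join();
    waiter_y.join();
    return {seen_x, seen_y};
}
```

Both waiters observe 42: the release fence is sequenced before both relaxed stores, so it synchronizes with the acquire fence of whichever waiter reads either store.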