Document Number: P2629R0
Date: 2022-07-08
Reply to: Gonzalo Brito Gadeschi <gonzalob _at_ nvidia.com>
Authors: Gonzalo Brito Gadeschi
Audience: Concurrency
barrier token-less split arrive/wait
Motivation
This proposal extends std::barrier with better support for data-pipelining. These pipelines are very common in, e.g., Deep Learning and HPC applications [0].
In Deep Learning pipelines targeting modern NUMA systems, two groups of threads, “producers” and “consumers”, synchronize access to the shared resources of a pipeline stage using a pair of barriers. Consumer threads wait on the “consumer barrier”, process data once unblocked, and arrive on the “producer barrier” to signal that shared resources are safe to reuse. Producer threads wait on the “producer barrier”, reuse shared resources to load the next data set, and arrive on the “consumer barrier” to signal consumer threads that this data set is ready for use.
In this synchronization pattern, consumer threads wait on the consumer barrier without arriving on it, and arrive on the producer barrier without waiting on it; and vice-versa for producer threads.
However, std::barrier APIs require threads to arrive on a barrier before waiting on it. This proposal allows producer and consumer threads to synchronize efficiently.
Tony tables
Real applications implement multi-stage pipelines with multiple shared resources. The following table shows a single stage of a consumer-producer pipeline involving one consumer thread, one producer thread, one data buffer as a shared resource, and a pair of std::barriers.
Note: the code of each thread runs in an infinite loop.
| Before: Producer Thread | Before: Consumer Thread | After: Producer Thread | After: Consumer Thread |
|---|---|---|---|
| producer.arrive_and_wait(); produce(data); consumer.arrive(); | consumer.arrive_and_wait(); consume(data); producer.arrive(); | producer.wait_parity(pp); produce(data); consumer.arrive_and_discard(); | consumer.wait_parity(pp); consume(data); producer.arrive_and_discard(); |
Before this proposal:
- Producer thread:
  - Waits on the consumer thread to signal that data can be reused by arriving and waiting on the producer barrier.
  - Produces data.
  - Signals that data is ready by arriving at the consumer barrier; throws token away.
- Consumer thread:
  - Waits on data becoming ready by arriving and waiting on the consumer barrier.
  - Consumes data.
  - Signals the producer thread that it can reuse data by arriving on the producer barrier; throws token away.
These steps cause threads to unnecessarily:
- modify a barrier: by calling arrive_and_wait when they only need to wait,
- stall at the arrive until the arrival_token is constructed, even though they will immediately discard it afterwards, and
- cast [[nodiscard]] away: to avoid the warning produced by discarding the token (see the sketch below).
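For illustration, the workaround for the last point looks like this today (a minimal sketch; signal_done and its parameter are hypothetical, and the commented-out call is the API proposed in this paper):

#include <barrier>

// Hypothetical helper, for illustration only: signal a barrier that this
// thread will never wait on.
void signal_done(std::barrier<>& consumer) {
  // Today: the returned token must be cast away to silence [[nodiscard]].
  static_cast<void>(consumer.arrive());
  // With this proposal: no token is produced at all.
  // consumer.arrive_and_discard();
}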
After this proposal:
- Producer thread:
  - Waits on the consumer thread to signal that data can be reused by waiting on the producer barrier.
  - Produces data.
  - Signals that data is ready by arriving at the consumer barrier without producing an arrival_token.
- Consumer thread:
  - Waits on data being ready by waiting on the consumer barrier.
  - Consumes data.
  - Signals the producer thread that it can reuse data by arriving on the producer barrier without producing a token.
This proposal addresses the three pre-existing issues raised above.
User guide
The arrive_and_discard API is semantically equivalent to static_cast<void>(barrier.arrive()) but is easier to implement efficiently.
This section therefore focuses on how to use:
void wait_parity(bool phase_parity) const;
The phase completion step of std::barrier advances the barrier to the “next phase” (thread.barrier#class-1.3).
This proposal guarantees that:
- the initial barrier phase is “phase 0”, and
- the successor of “phase N” is “phase N+1”.
This proposal defines that the parity of:
- even barrier phases is false, i.e., the parity of “phase 0” is false (or 0);
- odd barrier phases is true, i.e., the parity of “phase 1” is true (or 1).
The wait_parity(bool phase_parity) API waits on the barrier phase parity changing from the specified phase_parity to a different parity. This is analogous to the atomic::wait(value) API, which waits on the atomic value changing from value to a different value.
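The analogy can be sketched as follows (a hedged illustration; wait_for_signal, flag, and b are hypothetical, and the commented-out call uses the proposed API, which does not exist in C++ today):

#include <atomic>
#include <barrier>

void wait_for_signal(std::atomic<int>& flag, std::barrier<>& b) {
  flag.wait(0);            // blocks while the atomic still holds the value 0
  // b.wait_parity(false); // proposed: blocks while the phase parity is still false
}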
Teaching examples
This example uses a barrier to print text in order:
void houston() {
  std::barrier barrier(2);
  std::thread apollo13([&barrier] {
    std::cout << "Okay, Houston, we've had a problem here" << std::endl;
    static_cast<void>(barrier.arrive());
  });
  barrier.arrive_and_wait();
  std::cout << "This is Houston. Say again, please." << std::endl;
  apollo13.join();
}
The initial phase of a freshly initialized barrier is 0 and its parity is false. To implement this example with the proposed APIs we can:
- Line A: use an expected_count of 1 instead of 2, since only 1 thread will arrive on each barrier phase.
- Line B: use arrive_and_discard, since this thread will not wait on the barrier.
- Line C: use wait_parity(false) to wait on the phase parity of the initial barrier phase, which is “phase 0” and has the phase parity false, to change to true (i.e. advancing to “phase 1”):
void houston() {
  std::barrier barrier(1);                                                // A
  std::thread apollo13([&barrier] {
    std::cout << "Okay, Houston, we've had a problem here" << std::endl;
    barrier.arrive_and_discard();                                         // B
  });
  barrier.wait_parity(false);                                             // C
  std::cout << "This is Houston. Say again, please." << std::endl;
  apollo13.join();
}
The following example shows a thread that uses wait_parity to wait on all barrier phases by alternating the phase_parity it waits on, on every iteration:
bool phase_parity = false;
while (true) {
  barrier.wait_parity(phase_parity);
  phase_parity = !phase_parity;
  other_barrier.arrive_and_discard();
}
This thread arrives on some other_barrier, to make sure that other threads arriving on barrier don’t cause it to “skip” waiting on a particular phase.
Producer-consumer pipeline example
The following code snippet (try it out in compiler-explorer) shows a full working example of the producer-consumer pipeline described above and showcases how to combine these APIs:
#include <barrier>
#include <cassert>
#include <thread>

int data = -1;
constexpr int max_count = 5;
std::barrier producer(1);
std::barrier consumer(1);

void produce() {
  int count = 0;
  bool phase = false; // the initial phase of both barriers is "phase 0" (parity false)
  do {
    data = count++;                 // produce the next data set
    consumer.arrive_and_discard();  // signal the consumer that data is ready
    if (count == max_count) return;
    producer.wait_parity(phase);    // wait until data can be reused
    phase = !phase;
  } while (true);
}

void consume() {
  int count = 0;
  bool phase = false; // the initial phase of both barriers is "phase 0" (parity false)
  while (true) {
    consumer.wait_parity(phase);    // wait until data is ready
    phase = !phase;
    assert(data == count++);        // consume the data
    producer.arrive_and_discard();  // signal the producer that data can be reused
    if (count == max_count) return;
  }
}
int main() {
  std::thread t(produce);
  consume();
  t.join();
  return 0;
}
Rationale
Applications cannot reuse std::barrier to solve these problems, and instead must re-implement it to add these APIs.
This proposal chooses to extend std::barrier with two APIs to enable applications to re-use it.
Rationale for waiting without a token
The proposed wait_parity API is a replacement for arrive_and_wait in certain situations.
The arrive_and_wait API has significant latency (sketched below):
- Sends a message to modify the memory to arrive at the barrier.
- Waits on the response with the old value, required to construct the arrival_token that the API returns.
- Finally calls wait to wait on the barrier phase to flip.
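This is visible in the specification itself: arrive_and_wait is equivalent to wait(arrive()) (see the proposed wording below), so the round trip for the token sits on the critical path. A sketch of that specified behavior, not of any particular implementation:

#include <barrier>
#include <utility>

// Sketch: what arrive_and_wait() is specified to be equivalent to.
void arrive_and_wait_equivalent(std::barrier<>& b) {
  auto token = b.arrive();  // must receive the old value to build the token
  b.wait(std::move(token)); // only then can the thread block on the phase flip
}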
The wait_parity API enables applications that “know” the parity of the barrier phase they want to wait on, and that are able to do so safely, to just wait on the barrier, without having to arrive on it first.
Rationale for arriving without a token
The arrive_and_discard API is a replacement for arrive in situations where the calling thread is not going to wait at the barrier.
The arrive API:
- Sends a message to modify the memory to arrive at the barrier.
- Waits on the response with the old value, required to construct the arrival_token that the API returns.
The roundtrip latency of sending the message and waiting for the response is significant in large NUMA architectures.
P2588 “barrier’s phase completion guarantees” proposes relaxing the specification of barrier to enable implementations to complete the phase in one of the threads that waits at the barrier, by requiring that at least one thread waits for the phase to flip.
On an implementation that completes the phase on a thread that waits, and on a proper hardware architecture with support for far atomics, arrive_and_discard is fire-and-forget. It sends a message to modify the barrier count, but it does not need to wait for a response.
During the call to wait, the implementation then picks one thread, and uses it to complete the phase, unblocking all other threads when the phase completes.
The reference implementation of this proposal on compiler-explorer provides such an implementation.
On large NUMA architectures and data-pipelining applications, the consumer barrier can be placed close to the consumer threads, and the producer barrier can be placed close to the producer threads.
This enables threads to poll the barrier locally and benefit from HW acceleration to do so, but increases the latency of arriving at the barrier.
The arrive_and_discard API brings the latency to zero on such an implementation, because the thread does not need to wait for a response to be able to make progress.
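As a toy sketch (hypothetical; this is not the reference implementation and not std::barrier), the key point is that the arriving thread never consumes the old count, so the atomic update can be posted without waiting for its result:

#include <atomic>
#include <cstddef>

// Toy arrival counter only; a real barrier also tracks the phase and resets
// the expected count in the completion step (run by a waiting thread, as
// P2588 permits).
struct toy_arrival_counter {
  std::atomic<std::ptrdiff_t> expected{1};

  void arrive_and_discard(std::ptrdiff_t update = 1) {
    // The result of the fetch_sub is discarded: with far atomics, the update
    // can be fire-and-forget from the arriving thread's perspective.
    expected.fetch_sub(update, std::memory_order_release);
  }
};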
Impact
Impact on users
None, the semantics of existing code do not change, and it does not change the ABI.
Impact on implementations
The arrive_and_discard API imposes minimal overhead on implementations, since it can be naively implemented as:
void arrive_and_discard(ptrdiff_t update = 1) {
  static_cast<void>(arrive(update));
}
The implementation impact of wait_parity is “to be determined”. In the reference implementation, it is implemented naively as:
void wait_parity(bool phase_parity) const {
  wait(arrival_token { .phase = phase_parity });
}
Impact on/from other proposals
P2588 proposes relaxing std::barrier’s phase completion guarantees. Depending on the outcome of that proposal, wait_parity would, just like wait, be able to complete the barrier’s phase. In that case, it might make sense to not make wait_parity a const member function.
Unresolved questions
- Do we need the “diff” in the barrier constructor to state that it initializes the barrier to the 0-th phase? That is already addressed in thread.barrier.class.1.
- Impact on existing implementations, to be resolved by prototyping the feature in libc++ and libstdc++.
References
- P2588: barrier’s phase completion guarantees.
Proposed wording
thread.barrier.class:
namespace std {
  template<class CompletionFunction = see below>
  class barrier {
  public:
    using arrival_token = see below;

    static constexpr ptrdiff_t max() noexcept;

    constexpr explicit barrier(ptrdiff_t expected,
                               CompletionFunction f = CompletionFunction());
    ~barrier();
    barrier(const barrier&) = delete;
    barrier& operator=(const barrier&) = delete;

    [[nodiscard]] arrival_token arrive(ptrdiff_t update = 1);
    void arrive_and_discard(ptrdiff_t update = 1);
    void wait(arrival_token&& arrival) const;
    void wait_parity(bool phase_parity) const;

    void arrive_and_wait();
    void arrive_and_drop();

  private:
    CompletionFunction completion; // exposition only
  };
}
- The initial phase of a barrier is the 0-th phase. The successor of the n-th barrier phase is the (n+1)-th phase.
- Each barrier phase consists of the following steps:
  - The expected count is decremented by each call to arrive or arrive_and_drop.
  - When the expected count reaches zero, the phase completion step is run. For the specialization with the default value of the CompletionFunction template parameter, the completion step is run as part of the call to arrive or arrive_and_drop that caused the expected count to reach zero. For other specializations, the completion step is run on one of the threads that arrived at the barrier during the phase.
  - When the completion step finishes, the expected count is reset to what was specified by the expected argument to the constructor, possibly adjusted by calls to arrive_and_drop, and the successor of the current phase starts.
- Each phase defines a phase synchronization point. Threads that arrive at the barrier during the phase can block on the phase synchronization point by calling wait, and will remain blocked until the phase completion step is run.
- The phase completion step that is executed at the end of each phase has the following effects:
  - Invokes the completion function, equivalent to completion().
  - Unblocks all threads that are blocked on the phase synchronization point.

The end of the completion step strongly happens before the returns from all calls that were unblocked by the completion step. For specializations that do not have the default value of the CompletionFunction template parameter, the behavior is undefined if any of the barrier object’s member functions other than wait are called while the completion step is in progress.
- Concurrent invocations of the member functions of barrier, other than its destructor, do not introduce data races. The member functions arrive and arrive_and_drop execute atomically.

CompletionFunction shall meet the Cpp17MoveConstructible (Table 30) and Cpp17Destructible (Table 34) requirements. is_nothrow_invocable_v<CompletionFunction&> shall be true.
- The default value of the CompletionFunction template parameter is an unspecified type, such that, in addition to satisfying the requirements of CompletionFunction, it meets the Cpp17DefaultConstructible requirements (Table 29) and completion() has no effects.

barrier::arrival_token is an unspecified type, such that it meets the Cpp17MoveConstructible (Table 30), Cpp17MoveAssignable (Table 32), and Cpp17Destructible (Table 34) requirements.
static constexpr ptrdiff_t max() noexcept;
- Returns: The maximum expected count that the implementation supports.
constexpr explicit barrier(ptrdiff_t expected,
                           CompletionFunction f = CompletionFunction());

- Preconditions: expected >= 0 is true and expected <= max() is true.
- Effects: Sets both the initial expected count for each barrier phase and the current expected count for the first phase to expected. Initializes completion with std::move(f). Starts the first phase which is the 0-th barrier phase.
- [Note 1: If expected is 0 this object can only be destroyed. — end note]
- Throws: Any exception thrown by CompletionFunction’s move constructor.
[[nodiscard]] arrival_token arrive(ptrdiff_t update = 1);
- Preconditions: update > 0 is true, and update is less than or equal to the expected count for the current barrier phase.
- Effects: Constructs an object of type arrival_token that is associated with the phase synchronization point for the current phase. Then, decrements the expected count by update.
- Synchronization: The call to arrive strongly happens before the start of the phase completion step for the current phase.
- Returns: The constructed arrival_token object.
- Throws: system_error when an exception is required (thread.req.exception).
- Error conditions: Any of the error conditions allowed for mutex types (thread.mutex.requirements.mutex).
- [Note 2: This call can cause the completion step for the current phase to start. — end note]
void arrive_and_discard(ptrdiff_t update = 1);
- Effects: Equivalent to: arrive(update).
void wait(arrival_token&& arrival) const;
- Preconditions: arrival is associated with the phase synchronization point for the current phase or the immediately preceding phase of the same barrier object.
- Effects: Blocks at the synchronization point associated with std::move(arrival) until the phase completion step of the synchronization point’s phase is run.
- [Note 3: If arrival is associated with the synchronization point for a previous phase, the call returns immediately. — end note]
- Throws: system_error when an exception is required (thread.req.exception).
- Error conditions: Any of the error conditions allowed for mutex types (thread.mutex.requirements.mutex).
void wait_parity(bool phase_parity) const;
- Effects: Equivalent to: wait(token), where token is associated with the synchronization point of the current phase if the parity of the phase ordinal matches phase_parity, or the previous phase otherwise.

UNRESOLVED QUESTION: if P2588 is accepted, we might not want to make wait_parity const.
void arrive_and_wait();
- Effects: Equivalent to: wait(arrive()).
void arrive_and_drop();
- Preconditions: The expected count for the current barrier phase is greater than zero.
- Effects: Decrements the initial expected count for all subsequent phases by one. Then decrements the expected count for the current phase by one.
- Synchronization: The call to arrive_and_drop strongly happens before the start of the phase completion step for the current phase.
- Throws: system_error when an exception is required (thread.req.exception).
- Error conditions: Any of the error conditions allowed for mutex types (thread.mutex.requirements.mutex).
- [Note 4: This call can cause the completion step for the current phase to start. — end note]