Document Number: P2588R3
Date: 2023-02-07
Reply to: Gonzalo Brito Gadeschi <gonzalob _at_ nvidia.com>
Authors: Gonzalo Brito Gadeschi, Eric Niebler, Anthony Williams, Thomas Rodgers
Audience: LWG
barrier’s phase completion guarantees
Revisions
Revision 3:
- Update wording with LWG suggestions. Add feature-test macro and Annex C entry.
Revision 2:
- Discussed R1 at SG1 on 2022-11-07, where polls were taken (outcomes: no consensus; consensus; unanimous consent to not take the poll).
- Updated to forbid the implementation from running the CompletionFunction on a new thread, by removing the following statement from the proposed wording: “, or it is a new thread”.
Revision 1:
- […] std::barrier in the earliest standard, and whether we should consider doing so as a defect report.
Revision 0:
- Polls taken at the SG1 2022-08-29 meeting:
  - Poll 2: We want to apply this change to std::barrier in the earliest Standard.
  - Poll 3: We want to get a DR on C++ for this change.
- Link to implementation: https://godbolt.org/z/W95Y1q1vY
Abstract
Unintended consequences of std::barrier’s specification constrain implementations to run the CompletionFunction on the last thread that arrives at the barrier during the phase. This prevents std::barrier from benefiting from hardware acceleration for thread synchronization. Removing these constraints is a breaking change. This paper aims to find a sweet spot for the barrier specification that delivers the functionality that applications need while allowing efficient implementations.
Introduction
The specification of std::barrier<CompletionFunction> requires the phase completion step to run, when the expected count becomes zero, on one of the threads that arrived at the barrier during the phase (thread.barrier.class-1.2):

When the expected count reaches zero, the phase completion step is run. For the specialization with the default value of the CompletionFunction template parameter, the completion step is run as part of the call to arrive or arrive_and_drop that caused the expected count to reach zero. For other specializations, the completion step is run on one of the threads that arrived at the barrier during the phase.
The specification of std::barrier does not require any thread that arrives at the barrier to call std::barrier::wait, but calling barrier::wait is necessary to observe phase completion (thread.barrier.class-3.sentence-3):

[…] the behavior is undefined if any of the barrier object’s member functions other than wait are called while the completion step is in progress.

A thread that never calls wait can still arrive at the barrier again via synchronization through some other thread that does call wait. This is very useful in practice, as the following example shows:
Example 0: a producer / consumer pipeline (godbolt)
std::barrier<CF0> b0(2, cf0);
std::barrier<CF1> b1(2, cf1);

void thread_0() {
  while (true) {
    produce_data();          // “A”: produce the data
    b0.arrive();             // “B”: signal thread_1 that the data is ready
    b1.arrive_and_wait();    // wait until thread_1 has consumed the data
  }
}

void thread_1() {
  while (true) {
    b0.arrive_and_wait();    // wait for the data to be ready
    consume_data();
    b1.arrive_and_wait();    // signal thread_0 that the data has been consumed
  }
}
In this example, thread_0 at “A” produces some data, and then at “B” signals thread_1 that the data is ready. thread_0 never waits on barrier b0. thread_0 then waits for thread_1 to consume the data, and proceeds to generate new data.
The [[nodiscard]] attribute on arrive shows that barrier’s designers were not expecting threads to arrive at the barrier without calling wait.
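For illustration, the intended pattern keeps the arrival token and hands it to wait; a minimal sketch (two participants assumed; not taken from the paper):

std::barrier b(2);         // two participating threads assumed

void participant() {
  auto tok = b.arrive();   // keep the [[nodiscard]] arrival_token...
  b.wait(std::move(tok));  // ...and pass it to wait() to observe phase completion
}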
The current standard wording in thread.barrier.class-1.2.sentence-3:

For other specializations, the completion step is run on one of the threads that arrived at the barrier during the phase.

aims to provide implementations with enough freedom to run the CompletionFunction on any thread that participates in the barrier during the phase. Unfortunately, the current wording requires all implementations to run the CompletionFunction as part of the call to arrive performed by the last thread that arrives at the barrier during the phase, as the following example shows:
Example 1: Guarantee that CompletionFunction runs if no thread waits (godbolt)
std::barrier<CF> b{2, cf};
using tok_t = decltype(b.arrive());

void thread() {
  new tok_t(b.arrive());         // “A”: arrive at the barrier and leak the token
}                                // “B”: exit

auto t0 = std::thread(thread);   // “C”: spawn the two threads
auto t1 = std::thread(thread);   // “C”
t0.join();                       // “D”: join both threads
t1.join();                       // “D”
                                 // “E”: neither thread exists anymore
This example spawns two threads at “C”, both of which execute the same sequence of operations:
- at “A”, arrive at the barrier and leak the token, and
- at “B”, exit.
After joining both threads at “D”, they no longer exist.
The standard guarantees in thread.barrier.class-1.2.sentence-1 that the completion function runs:

When the expected count reaches zero, the phase completion step is run.

on one of the threads that arrived at the barrier during the phase (thread.barrier.class-1.2.sentence-3). At “E”, these threads no longer exist. Therefore, the CompletionFunction must have run before “E”; more precisely, it must have run before the last thread that arrived at the barrier during the phase exits. That is, there is only one place in which it makes sense for conforming standard library implementations to run the phase completion step: as part of the arrive performed by the last thread that arrives at the barrier during the phase.
This is an unintended consequence of the interaction between:
- the freedom for threads to never call wait, and
- the guarantees about when and where the phase completion step runs.
These consequences are problematic in practice, where Amdahl’s law limits the scalability of massively parallel applications with small “serial” overheads on modern NUMA architectures with millions of hardware threads. While std::barrier’s split arrive/wait APIs enable threads to hide the cost of synchronization behind independent work, the amount of independent computation available in real applications is limited. Clauses like thread.barrier.class-1.2.sentence-3 aim to enable std::barrier to use hardware accelerators for synchronization, such as those available in NVIDIA GPUs, so that applications can hide the cost of synchronization behind small amounts of independent work; but due to the unintended consequences explained above, implementations currently cannot do so.
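For illustration, here is a minimal sketch of that latency-hiding pattern (produce_shared_data and do_independent_work are hypothetical names):

void step(std::barrier<>& b) {
  produce_shared_data();   // work that the other threads depend on
  auto tok = b.arrive();   // signal arrival without blocking
  do_independent_work();   // overlaps with the other threads’ arrivals
  b.wait(std::move(tok));  // block only once the independent work runs out
}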
Fixing this requires a breaking change. There are clear engineering trade-offs between varying degrees of guarantees, which determine the set of well-formed programs, and varying degrees of implementation flexibility, which determine performance.
This paper aims to help the reader answer the question: what is the sweet spot for std::barrier semantics, and what are the consequences of the breaking changes required to get there?
The following section analyzes the semantics of the current barrier specification and the different changes that we could make to balance functionality and performance. Then, we evaluate changes according to the functionality provided, their ability to leverage hardware acceleration, and the impact of a change on both existing and potential standard library implementations and end-user applications.
Finally, the authors suggest changes that balance functionality and performance to deliver the functionality that applications need in practice while allowing efficient implementations and propose wording for these changes.
Analysis of barrier semantics
The “core” semantics of the std::barrier phase completion step are:
- The last arrival happens-before phase completion, which happens-before any thread unblocks from wait.
- Cumulativity is established from all threads that arrive at the barrier during the phase to all threads unblocked from wait, through the thread that runs the CompletionFunction.
These “core” semantics enable applications to, e.g., perform a reduction in a critical section in between arriving and waiting:
Example 2: reduction in critical section (godbolt)
std::vector<int> data(nthreads);
int reduction;
auto reduce = [&]() noexcept {
  reduction = std::accumulate(data.begin(), data.end(), 0);
};
std::barrier<decltype(reduce)> b(nthreads, reduce);

void thread(size_t i) {
  data[i] = produce_data();   // “A”: produce this thread’s element
  b.arrive_and_wait();        // “B”: reduce runs between the last arrival and any unblock
  consume(reduction);         // “C”: read the reduction result
}
Here, threads produce some data at “A” and arrive and wait at the barrier at “B”. reduce is then called inside a critical section, after the last thread arrives and before any thread is unblocked from the wait. Since all threads participating in the barrier are “stalled” at the wait, reduce can access data safely and without data races. Finally, reduce happens-before any thread is unblocked from the wait. That is, all uses of reduction at “C” observe the value that reduce initialized it with, safely and without data races.
The main design choices that this paper concerns itself with are “when” and “where” the C++ standard should guarantee that the CompletionFunction runs. These two properties, “where” and “when”, are intertwined. Table 1 below explores some of the main options, the choices they enable, and their impact on hardware acceleration, existing implementations, and user applications.
Table 1: Design trade-offs: “When does the CompletionFunction run?”, “Where is it allowed to run (on which threads)?”, “Does the CompletionFunction run if no thread calls wait?”, “What are the hardware acceleration opportunities of these constraints?”, “What’s the impact on implementations and users?”

| When | Where | Runs if no thread waits? | Hardware acceleration opportunities | Implementation impact | User impact |
| --- | --- | --- | --- | --- | --- |
| Last arrive | Last thread to arrive | Yes | Very low | None | None |
| Any arrive | Any thread that arrives | Yes | Very low | None | None |
| Any wait | Any thread that calls wait | No | Low | All | None for any application surveyed, minimal otherwise. |
| Any arrive or wait | A thread that arrives or waits | Options: “Yes”, “No”, “Unspecified”, “Implementation-defined” | Medium | If the answer to “Runs if no thread waits?” is “No”: none. Otherwise: all. | None for any application surveyed, minimal otherwise. |
| After the last arrive, before any thread unblocks from wait | Options: “Unspecified”, “A thread that arrives or waits”, “A new thread”, “A thread that arrives or waits, or a new thread” | Options: “Yes”, “No”, “Unspecified”, “Implementation-defined” | High | If the answer to “Runs if no thread waits?” is “No”: none. Otherwise: all. | None for any application surveyed, minimal otherwise. |
Restricting the choice of “when” significantly constrains the threads on which it makes sense for implementations to run the CompletionFunction. For example, restricting “when” to particular API calls, including the broad “any arrive or wait”, restricts implementations to running the phase completion step only within those API calls (on the threads making those calls).
An important question impacting Example 1 is whether the CompletionFunction runs if no thread waits. The third column in Table 1 answers this question for the options considered. For the last two rows in the table, the standard is free to define these semantics as it wishes. Some options are provided inline in the table, and they vary from well-defined semantics (“Yes” or “No”) to “implementation-defined” or “unspecified”.
On hardware acceleration
The “Hardware acceleration opportunities” column describes how effective hardware barrier-synchronization accelerators can be at accelerating synchronization via std::barrier. Hardware accelerators that have more freedom to pick which thread completes the barrier phase (up to the freedom of creating a new thread to do so) can be significantly more effective than those that cannot make this choice. As the number of hardware threads available on computer hardware increases, the benefits of these accelerators outweigh their costs. NVIDIA GPUs have been shipping barrier accelerators for many years. Being descriptive rather than prescriptive enables hardware vendors to innovate.
Impact on existing implementations
If the standard allows the CompletionFunction to run even if no thread ever waits, no mainstream standard library implementation (libc++, libstdc++, and the MSVC STL) needs to change. Otherwise, these standard library implementations would need to change.
For the last two rows, selecting “implementation-defined” behavior would require the existing implementations to document their behavior.
Largest possible semantic change
To evaluate the impact of breaking changes, this section defines the “largest” possible semantic change we could make. It allows the CompletionFunction to run:
- When: after the last arrive and before any thread unblocks from wait (no other requirements about “when”).
- Where: unspecified.
- Runs if no thread waits? Unspecified.
This change breaks three std::barrier guarantees:
- That the phase completion step always runs when the expected count reaches 0.
  - With this change, the step is only guaranteed to run if at least one thread calls wait. The change does not, however, guarantee that the step does not run if no thread calls wait: even if no thread calls wait, this change allows the implementation to run the CompletionFunction, but it does not require the implementation to do so.
- That the phase completion step runs on the last thread that arrived at the barrier during the phase.
  - With this change, the phase completion step could run on any thread (it is unspecified where it runs). It can run on the last thread that arrived, but it can also run on a thread that waits, or on a new thread.
- That a barrier with a default completion function can be arrived at indefinitely without calling wait.
  - With this change, this becomes undefined behavior (see the sketch below).
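A minimal sketch of that third pattern: under the current specification this loop is valid, while under the largest change it has undefined behavior, because each arrive may race with a previous phase’s completion step that is no longer guaranteed to have run:

std::barrier b(1);     // default CompletionFunction, one participant

void tick() {
  for (int i = 0; i < 3; ++i)
    (void)b.arrive();  // today, completes a phase on each iteration; wait() is never called
}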
Potential impact on implementations
Thomas Rodgers mentioned that libdispatch barriers potentially complete on a different thread in the thread pool: all the work in the queue before the enqueuing of the barrier work item completes before the barrier work item, whose completion may happen on some other thread. The largest change would enable such an implementation.
NVIDIA’s barrier accelerators can run the phase completion step very close to the memory where the barrier resides. The largest change enables such an implementation.
Impact on existing applications
We surveyed all public uses of std::barrier on GitHub, as well as some of NVIDIA’s code bases that use it heavily.
The impact of all changes considered, including the largest change, on all existing applications we surveyed was non-existent: std::barrier is a new C++20 feature that has only recently become available in standard library implementations, and it is a relatively niche and sharp tool.
All applications we discovered were either educational or small toy programs for learning purposes. They all had one thread that both arrived and waited at the barrier during all barrier phases. None of them relied on a particular thread running the CompletionFunction (we explore the uses we can imagine in the next section).
Potential impact on applications
While the authors could not find any application in the wild that would break due to the largest change, we, together with the experts we polled, can imagine some.
Portability concerns
Anthony Williams raised the following concern: if most standard library implementations run the phase completion step as part of the last thread that arrives at the barrier during the phase, applications might end up silently and accidentally relying on this guarantee. Therefore, applications will not be portable to implementations that do something else.
Count-down with effect
P2300’s when_all algorithm accepts a variable number of asynchronous tasks and executes a continuation when they all complete. Given std::barrier as specified today, one valid implementation of when_all simply arrives at the barrier from each asynchronous task. After the last task arrives, the continuation runs. No thread calls wait, and the barrier is never used again.
The largest change would silently break the semantics of such a when_all implementation, causing the application to hang on implementations that do not run the phase completion step as part of the last thread that arrives.
This implementation of when_all would be using std::barrier as a “single-use count-down with effect”: a counter initialized to the number of asynchronous tasks, which each task decrements by 1. When the counter reaches 0, it runs some function for its effect.
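A minimal sketch of this count-down use (hypothetical names; note that a barrier’s CompletionFunction must be nothrow invocable, and the barrier is deliberately leaked so that it outlives the detached tasks):

#include <barrier>
#include <thread>
#include <utility>

template <class Continuation>  // must satisfy is_nothrow_invocable_v<Continuation&>
void when_all_sketch(int n, Continuation cont) {
  auto* b = new std::barrier<Continuation>(n, std::move(cont));  // leaked on purpose
  for (int i = 0; i < n; ++i)
    std::thread([b] {
      // ... this task’s asynchronous work ...
      (void)b->arrive();  // count down; the token is discarded and nobody waits
    }).detach();
}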
Since the counter is “single-use”, it would make more sense for when_all to use std::latch here. However, when_all cannot do so because std::latch does not support CompletionFunctions. A way to enable this use case would be to add support for CompletionFunctions to std::latch. Furthermore, since std::latch is single-use, the semantics that make sense for std::barrier do not necessarily make sense for std::latch. This is worth exploring.
Barriers and latches are synchronization primitives intended to synchronize groups of threads with each other. However, this is not what the when_all example needs. The when_all example synchronizes a set of threads with the thread that runs some effect; it does not synchronize the threads themselves with each other. So another alternative that might be worth exploring is to provide such a synchronization primitive in the standard library, as sketched below.
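As a sketch of what such a primitive could look like (hypothetical, built on std::atomic): the last task to count down runs the effect, so the tasks synchronize with the effect, not with each other:

#include <atomic>
#include <utility>

template <class Effect>
class countdown_effect {  // hypothetical single-use “count-down with effect”
  std::atomic<int> count_;
  Effect effect_;
public:
  countdown_effect(int n, Effect e) : count_(n), effect_(std::move(e)) {}
  void count_down() {
    // acq_rel ordering makes every task’s prior writes visible to the effect
    if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1)
      effect_();  // runs on whichever thread counts down last
  }
};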
Thread id
Anthony Williams mentioned that they commonly see applications in which a handle, like a database handle, is stored in thread_local storage (TLS). Such an application could use the knowledge that the CompletionFunction is only executed by a thread that arrived at the barrier during the phase to access this handle directly from the CompletionFunction.
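A hypothetical sketch of this pattern (DatabaseHandle, open_handle, do_transaction_work, and nthreads are assumed names); the completion function reads a thread_local, which is only valid if it runs on one of the arriving threads:

thread_local DatabaseHandle* tls_handle = nullptr;  // one handle per thread

auto commit = []() noexcept {
  tls_handle->commit();  // valid only on a thread that installed its own handle
};
std::barrier<decltype(commit)> b(nthreads, commit);

void worker() {
  tls_handle = open_handle();  // handle tied to this thread
  do_transaction_work();
  b.arrive_and_wait();         // commit() runs on one of the arriving threads
}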
Thomas Rodgers recognized that such an application could, in some cases, store a shared_ptr<Handle*> in the CompletionFunction itself and access the database handle through it. However, if the database handle is tied to the thread that created it, this approach would not work. He also recognized that under the libdispatch barrier model, such an application could not depend on TLS either.
This is an example of a larger class of applications that rely on the guarantee that one of the threads that arrived at the barrier during the phase runs the CompletionFunction. Thread locals are one example; inspecting the thread id and doing something with it would be another. Any change that relaxes the guarantees about “where” (on which thread) the CompletionFunction may run will turn some of these use cases from well-formed programs into non-portable or illegal programs. We can imagine these applications but have not found them in the wild.
Suggestion
Our suggestion is to pursue some flavor of the largest semantic change, restricting ourselves to the options that do not impact any existing standard library implementation or application we surveyed in the wild. While we can imagine theoretical applications that would be impacted by any option we take, it is very hard to write such applications correctly. Given the novelty and niche nature of this feature, we expect the impact on such applications to be minimal to non-existent. All experts we polled, from the domains of applications using std::barrier, standard library implementation, and std::barrier’s design, were surprised by the current semantics. These semantics were not intended; the intention was to enable std::barrier to leverage hardware acceleration. If feasible, our recommendation is to backport this change to C++20.
That is, our suggestion is to guarantee that:
- The last thread arriving at a barrier phase happens-before the CompletionFunction runs, which happens-before any thread is unblocked from that phase.
- Cumulativity is established between all threads arriving at the barrier during the phase and the thread running the phase completion step, and between that thread and any thread that observes phase completion via a call to wait.
- If no thread observes phase completion, it is implementation-defined whether the phase completion step runs.
- The phase completion step runs on one of the threads that arrived or waited at the barrier during the phase.
Suggested wording
The proposed change modifies thread.barrier.class-1 as follows; removed text is marked [strike: …] and added text is marked [insert: …]:
Each barrier phase consists of the following steps:
- (1.1) The expected count is decremented by each call to arrive or arrive_and_drop.
- (1.2) [strike: When the expected count reaches zero, the phase completion step is run. For the specialization with the default value of the CompletionFunction template parameter, the completion step is run as part of the call to arrive or arrive_and_drop that caused the expected count to reach zero. For other specializations, the completion step is run on one of the threads that arrived at the barrier during the phase.] [insert: Exactly once after the expected count reaches zero, a thread executes the completion step during its call to arrive, arrive_and_drop, or wait, except that it is implementation-defined whether the step executes if no thread calls wait.]
- (1.3) When the completion step finishes, the expected count is reset to what was specified by the expected argument to the constructor, possibly adjusted by calls to arrive_and_drop, and the next phase starts.
In [version.syn], update the value of the feature test macro __cpp_lib_barrier:

#define __cpp_lib_barrier [strike: 201907L] [insert: new-value] // also in <barrier>
This change also needs an Annex C entry, so add the following to [diff]:
C.1 C++ and ISO C++ 2023 [diff.cpp23]
C.1.X [thread]: Concurrency support library [diff.cpp23.thread]
Change: In this revision of C++, it is implementation-defined whether a barrier’s phase completion step runs if no thread calls wait. Previously, the phase completion step was guaranteed to run on the last thread that calls arrive or arrive_and_drop during the phase; in this revision of C++, it can run on any of the threads that arrived or waited at the barrier during the phase.
Affected subclause: [thread.barrier]
Rationale: Correct contradictory wording and improve implementation flexibility for performance.
Effect on original feature: Valid C++ 2020 code using a barrier might have different semantics in this revision of C++ if it depends on a completion function’s side effects occurring exactly once, on a specific thread running the phase completion step, or on a completion function’s side effects occurring without wait having been called.
For example:
auto b0 = std::barrier(1);
b0.arrive();
b0.arrive();        // implementation-defined; previously well-defined

int data = 0;
auto b1 = std::barrier(1, [&]() noexcept { data++; });
b1.arrive();
assert(data == 1);  // implementation-defined; previously well-defined
b1.arrive();        // implementation-defined; previously well-defined
Acknowledgements
Everyone that helped, in particular, Olivier Giroux, Eric Niebler, David Olsen, Anthony Williams, and Thomas Rodgers.