bulk’s forward progress

Document #: P3564R0
Date: 2025-01-13
Project: Programming Language C++ (LEWG)
Reply-to: Mark Hoemmen <mhoemmen@nvidia.com>,
          Bryce Adelstein Lelbach <brycelelbach@gmail.com>,
          Michael Garland <mgarland@nvidia.com>
Scheduler-generic std::execution code can never assume concurrent forward progress in bulk’s function invocations, even if the scheduler’s get_forward_progress_guarantee query returns concurrent. This is because a scheduler’s forward progress guarantee relates to distinct execution agents on the scheduler, but nothing specifies when the different f(k) invocations of the user’s function f execute on distinct execution agents. Two function invocations that happen on the same agent cannot have a forward progress guarantee stronger than parallel, because they cannot perform blocking synchronization without the possibility of deadlock.

We propose fixing this by specifying that each of bulk’s function invocations happens in a distinct execution agent. This would make bulk more effective as a basis function for implementing asynchronous parallel algorithms. Since default bulk is sequential, we propose making it ill-formed to use default bulk with a scheduler that promises concurrent forward progress.

We will begin by explaining why we think that saying bulk’s function invocations happen in distinct execution agents implies that those function invocations have the scheduler’s forward progress guarantee. Then, we will explain our interpretation of the current wording and how it limits both users and implementers. Next, we will go through our proposed solution and contrast it with alternative designs. After that, we will present our proposed wording changes.
bulk’s forward progress

We propose saying that each of bulk’s N function invocations happens in a distinct execution agent. This section explains why we think that suffices to ensure that bulk’s function invocations have the same forward progress guarantee as the scheduler.
Execution agents do not necessarily correspond to hardware or operating system features. “These operations happen on distinct execution agents on the scheduler” is just the language that the Standard gives us to express that “the scheduler’s forward progress guarantee applies to these operations.” This establishes a context for us to speak of forward progress of bulk’s function invocations, even when bulk is executed on a single-threaded execution resource.

For a scheduler using a single-threaded resource, bulk’s function invocations would still execute on distinct execution agents, but all those agents would run on the same thread. The agents would not necessarily ever be reified as an actual hardware or software entity; they are just a way to talk about forward progress of operations on the scheduler. A single-threaded scheduler like this could promise no stronger than parallel forward progress.
The Standard defines “execution agent” in the context of synchronization. [thread.req.lockable.general] 1 defines execution agent as “an entity such as a thread that may perform work in parallel with other execution agents.” Note 1 points out that “[i]mplementations or users can introduce other kinds of agents such as processes or thread-pool tasks.” The section goes on to explain that a “given execution agent” can “acquire or release ownership of a lock” ([thread.req.lockable.general] 3). The Standard generally uses “execution agent” when referring to locks and synchronization. For example, [thread.mutex.requirements.general] 1 says: “A mutex object facilitates protection against data races and allows safe synchronization of data between execution agents.” See also [thread.lock.general] 1 and [thread.sharedmutex.requirements.general] 2; the latter explains a situation in which “execution agents block” until some condition is met.
We therefore read other uses of “execution agent” in the Standard as relating to synchronization. For example, coroutines can be resumed on an execution agent ([coroutine.handle.resumption] 1); Note 2 explains that “[a] concurrent resumption of the coroutine can result in a data race.” This informs our reading of [exec.get.fwd.progress] 1 and 3, which explain that the result of querying get_forward_progress_guarantee on a scheduler describes the forward progress of distinct execution agents created by the scheduler.
The phrase “execution agent” differs from thread of execution ([intro.multithread.general] 1), “also known as a thread,” which “is a single flow of control within a program, including the initial invocation of a specific top-level function, and recursively including every function invocation subsequently executed by the thread.” Execution agents are more general. For example, a program could use a SIMD (Single Instruction Multiple Data) lane as an “execution agent,” even though multiple SIMD lanes may share a single instruction stream (“flow of control”). This is, in fact, one of the motivations for the weakly parallel forward progress guarantee, as explained in the 2014 paper N4156, “Light-Weight Execution Agents.”
This interpretation of “execution agents” lets us speak of “forward progress” even for a single-threaded execution resource. A scheduler on a single-threaded resource would execute bulk as a sequential for loop (for example, using default bulk). If we talk about each loop iteration as a distinct execution agent, then we can say that this bulk implementation promises at most parallel forward progress. That is, this scheduler’s bulk could deadlock if any iteration blocks on any other iteration.

In general, execution agents do not necessarily correspond to hardware or operating system features. “These function invocations happen on distinct execution agents on the scheduler” is just a way to say “the scheduler’s forward progress guarantee applies to these function invocations.” This is the context for us to talk about forward progress of bulk. The sketch below makes the deadlock risk concrete.
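Here is a minimal sketch of that deadlock. The sequential_bulk function below is a hypothetical stand-in for default bulk’s sequential loop, not the actual library facility. The std::barrier expects two agents to arrive, but both invocations share one agent, so the first invocation blocks forever and the program hangs.

#include <barrier>
#include <cstdio>

// Hypothetical stand-in for default bulk: all N invocations run
// sequentially on a single execution agent.
template<class Shape, class Fun>
void sequential_bulk(Shape n, Fun f) {
  for (Shape k = 0; k < n; ++k) f(k);
}

int main() {
  std::barrier sync_point(2); // expects two agents to arrive
  sequential_bulk(2, [&](int k) {
    std::printf("invocation %d arrived\n", k);
    // Invocation 0 blocks here waiting for invocation 1,
    // which never starts: deadlock on a single agent.
    sync_point.arrive_and_wait();
  });
}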
Users cannot assume forward progress of bulk

Scheduler-generic code cannot assume that bulk’s N function invocations have a forward progress guarantee stronger than parallel, even if a scheduler’s get_forward_progress_guarantee query returns concurrent. The result of querying get_forward_progress_guarantee on the scheduler describes forward progress of distinct execution agents created by the scheduler ([exec.get.fwd.progress] 1, 3). The default bulk runs sequentially on a single execution agent ([exec.bulk] 4) and permits any number N of function invocations. As a result, even if the scheduler promises concurrent forward progress, default bulk cannot have forward progress stronger than parallel.
For a customization of bulk, the current Working Draft does not specify how many agents it uses, just that those agents have the scheduler’s forward progress guarantee. For example, if a scheduler promises concurrent forward progress, but its custom bulk “tiles” the N function invocations across fewer than N execution agents, each invocation of bulk could promise at most parallel forward progress.
Users want bulk to have the strongest possible forward progress guarantee

Users want to write various algorithms. Some of those will require blocking synchronization between different activities, such as function invocations in bulk. Right now, users have no way of doing that in scheduler-generic code, except by launching successive back-to-back bulk operations. We want to make it possible for them to use the synchronization of their choice, if the scheduler supports it.
One motivation for permitting synchronization in bulk is to reduce algorithmic complexity. For example, bulk execution on some platforms has a launch cost that is independent of the number of execution agents. If we take away bulk and make users express execution over N agents via recursive calls to when_all, we force all concurrent execution into a log N launch cost. If we force users to launch multiple bulk operations, typical parallel algorithms with tree-shaped local communication patterns end up costing log N bulk launches instead of a constant number of launches, as the sketch below illustrates.
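The following sketch illustrates the launch-count argument with a tree reduction. The launch_bulk function is a hypothetical sequential stand-in for one bulk launch; because iterations may not block on one another within a launch under parallel forward progress, each reduction round must be its own launch, costing log2(N) launches in total.

#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for one bulk launch. With only parallel
// forward progress, rounds cannot synchronize within a launch,
// so each round below must be a separate launch.
template<class Fun>
void launch_bulk(int n, Fun f) {
  for (int k = 0; k < n; ++k) f(k);
}

int main() {
  std::vector<double> x(1024, 1.0);
  int launches = 0;
  for (std::size_t stride = 1; stride < x.size(); stride *= 2) {
    // One launch per round: each element at distance `stride`
    // is folded into its left neighbor.
    launch_bulk(static_cast<int>(x.size() / (2 * stride)), [&](int k) {
      x[2 * stride * k] += x[2 * stride * k + stride];
    });
    ++launches;
  }
  // 10 launches for N = 1024; with concurrent agents and a barrier,
  // a single launch would suffice.
  std::printf("sum = %g after %d bulk launches\n", x[0], launches);
}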
SG1 polled with unanimous consent in its Wrocław 2024 review of P3481R0, “Summarizing std::execution::bulk() issues,” that “[w]e need a version of the bulk API that creates an execution agent per iteration.”
bulk != “parallelize this loop”

A bulk that runs on N execution agents is a unique facility. Nothing else in Standard C++ lets users execute on N distinct execution agents “all at once.” We consider this a distinct capability from parallelizing a loop, and therefore deserving of a separate interface, for the following reasons.
1. Loop parallelization involves two separate tasks: mapping loop iterations to execution agents, and executing work on each of those agents.
2. Users want, and popular programming models provide, more complicated mappings of iterations to agents than simple chunking. Users often want to know, and even to specify, the mapping.
Regarding (1), one could build loop parallelization as a straightforward composition of the mapping (a function from agent index to a set of loop indices) with bulk (which executes a function of agent index on a distinct agent). Appendix B shows an example of how to implement a chunked parallel for_each algorithm in this way.
Regarding (2), examples of more complicated mapping facilities include the variety of schedule modifiers for OpenMP’s for clause, the many array mapping options in High Performance Fortran, and the generality of ScaLAPACK’s array distributions. Users need control over data distribution for best performance, so that they can exploit knowledge of data locality relative to the various execution agents. Giving “implementation freedom” to bulk to hide loop distribution inside would take away users’ freedom to build their own chunking schemes on top of bulk’s basic execution facility.
Regarding P3481R0’s proposed bulk_chunked interface: given an arbitrary mapping, it would be easier to implement loop parallelization with bulk than with bulk_chunked. For example, users might want to assign loop index k to agent k % N (out of N agents). As a result, each agent would only see contiguous “chunks” of size one. A sketch of such a round-robin mapping built atop bulk follows.
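Here is a minimal sketch of that composition. The bulk function below is a hypothetical sequential stand-in for a scheduler’s bulk that runs each invocation on a distinct agent; only the mapping lambda would change for a different distribution.

#include <cstdio>

// Hypothetical stand-in for bulk: one invocation per agent.
template<class Fun>
void bulk(int num_agents, Fun f) {
  for (int a = 0; a < num_agents; ++a) f(a);
}

// Round-robin loop distribution built on top of bulk:
// agent a owns iterations a, a + N, a + 2N, ...
template<class Fun>
void round_robin_for(int num_iterations, int num_agents, Fun f) {
  bulk(num_agents, [&](int agent) {
    for (int k = agent; k < num_iterations; k += num_agents) f(k);
  });
}

int main() {
  round_robin_for(10, 3, [](int k) {
    std::printf("iteration %d\n", k);
  });
}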
Customizations of bulk should be able to promise concurrent forward progress, because popular parallel programming models try to provide both concurrent and parallel progress, and give users a way to ask for the one they want. Models offer users this choice in three different ways.
1. They expose different programming constructs for “running on N concurrent threads” versus “parallelizing a loop with M iterations over N threads” (analogous to parallel for_each). Some models only have a way to run on N concurrent threads.
2. They expose a way to ask for nondefault “extra” forward progress guarantees, even though doing so might only have a performance benefit in special cases.
3. They expose a more complicated programming model, so that users can get concurrent forward progress between at least some execution agents.
Many parallel programming models expose different programming constructs for “launching N concurrent threads” versus “parallelizing a loop with M iterations over N threads.” The former lets users write algorithms that assume concurrent forward progress, while the latter makes the common case of distributing a large number of function invocations over a smaller number of execution agents easier.
OpenMP[1] distinguishes parallel regions (e.g., #pragma omp parallel, with concurrent forward progress over N threads, with caveats; see below) from parallel loops (e.g., #pragma omp parallel for).
MPI (the Message Passing Interface standard for distributed-memory parallel programming, first released in June 1994) only has a way to launch some number of processes (the distributed-memory analog of execution agents) with the equivalent of a concurrent forward progress guarantee. Users are expected to distribute parallel loop iterations over processes by hand.[2]
PVM (the Parallel Virtual Machine): MPI’s ability to start a program with the desired number N of execution agents contrasts with another programming model of its day, PVM. PVM users who wanted N execution agents generally had to write boilerplate code that spawned agents in a binary tree (analogous to recursive when_all invocations), while MPI users simply got that from the beginning (after calling MPI_Init or the equivalent). This was a factor in MPI’s popularity over PVM.
HPX includes both parallel algorithms like those in the C++ Standard Library, and different ways to enumerate execution agents of various kinds and execute work in bulk fashion with concurrent forward progress on them. For example, users can define a “parallel section” (analogous to a section of code executing on possibly multiple processes of an MPI communicator), and perform blocking synchronization across subsets of “images” (execution agents) in the parallel section. Users can launch their parallel section (via define_spmd_block) with a given number of images per “locality” (e.g., node of a cluster, or NUMA (Non-Uniform Memory Access) domain).
Some parallel programming models expose a way to ask for nondefault “extra” forward progress guarantees, even though doing so might only offer a performance benefit in special cases. CUDA[3] distinguishes ordinary bulk execution (“device kernel”) launch (with at best parallel forward progress across thread blocks) from so-called “cooperative” bulk execution launch (with concurrent forward progress across thread blocks), as exposed by CUDA run-time functions such as cudaLaunchCooperativeKernel. The cooperative launch feature is not “free”: it takes away the latency-hiding feature of ordinary kernel launch, and thus only has performance benefits for specialized algorithms.
Finally, some programming models expose complication so that users can get concurrent forward progress between at least some execution agents. This complication generally takes the form of a hierarchy, with different levels having possibly different forward progress guarantees. OpenMP 6.0 (released in November 2024) introduces “progress units” of some implementation-defined number of consecutive hardware threads. OpenMP’s parallel regions (#pragma omp parallel) offer concurrent forward progress across progress units, but only weakly parallel progress within a progress unit. Exposing this lets OpenMP support more architectures, at the cost of users needing to query the progress unit size (which might be one, indicating that all threads make concurrent forward progress). CUDA exposes two levels of parallelism: thread blocks and threads within a block. By default, thread blocks have the parallel guarantee, while threads within a block have the concurrent guarantee.[4] CUDA provides various synchronization operations on groups of threads within a thread block, including synchronous barriers, that would only make sense given concurrent forward progress.
The “design intent” for such a thing as std::execution is not a single, static thought. Instead, it represents desiderata from a variety of coauthors from different institutions. These interests have been carried over through three major consolidations of proposals: the unification of three “executors” proposals that led to P0443R0, the second unification (as expressed by P1658 and P1660) that led to P0443R11, and the third that led to replacing P0443 with P2300. Nevertheless, one can observe a common set of principles from the various proposals and their reviews.
Bulk execution is a “basis operation,” that is, one of some minimal set of operations on a scheduler that could be used to construct asynchronous parallel algorithms.
Bulk execution may have different forward progress guarantees when running on different execution resources.
Bulk execution’s forward progress guarantees matter to users and should be defined.
Please see Appendix A for a review of this history.
Proposed solution: specify bulk as running on N distinct execution agents

bulk is not just another spelling of for_each, but a way to create execution agents in order to implement parallel algorithms. Therefore, bulk should offer the strongest possible forward progress guarantee, by being specified such that each function invocation runs in a distinct execution agent. As SG1 polled with unanimous consent in its Wrocław 2024 review of P3481R0, “[w]e need a version of the bulk API that creates an execution agent per iteration.” bulk should not have a different forward progress guarantee than any other operation on a scheduler, because otherwise it hinders reasoning about the forward progress guarantee of an operation defined by an arbitrary graph of senders.
Make it ill-formed to invoke default bulk with a custom scheduler that claims the concurrent forward progress guarantee. Make it ill-formed, no diagnostic required, to invoke default bulk with a scheduler that claims the parallel forward progress guarantee, unless default bulk on that scheduler has that behavior.
We considered the following alternative designs.

1. Status quo: bulk uses an unspecified number of execution agents; those agents have their scheduler’s forward progress guarantee, but individual function invocations in bulk have at best parallel forward progress.
2. Let bulk take a minimum forward progress guarantee requirement as a parameter. bulk fails if it cannot satisfy the requirement.
3. Treat bulk as a special case, with a separate query for its per-invocation forward progress guarantee. (This was P0443’s design.)
4. Let each sender algorithm have its own scheduler-dependent forward progress guarantee, so that the get_forward_progress_guarantee query depends on the operation as well as the scheduler.
5. Require bulk to be customized, and require customizations to have the same forward progress guarantee as the scheduler’s.
We reject the status quo, Option (1), for reasons discussed above. Both default bulk and bulk customizations must provide parallel forward progress for their function invocations as long as their scheduler promises at least parallel forward progress. This is because bulk creates an asynchronous operation that performs its function invocations on a value completion operation ([exec.bulk] 6), and asynchronous operations on a scheduler “execute value completion operations on an execution agent belonging to the scheduler’s associated execution resource” ([exec.async.ops] 10). For example, it would not be legal for a scheduler’s get_forward_progress_guarantee query to return parallel, yet for bulk on that scheduler to use an OpenMP “simd” construct (which can only promise weakly_parallel forward progress).
None of Options (2), (3), and (4) would give users a way to know the mapping from function invocation to execution agent. Users need to know that for performance as well as correctness reasons. Given that loop distributions can be arbitrary one-to-one functions from [0, N) to the set of execution agents, an interface for users to discover the implementation’s mapping would be complicated. An interface for users to specify the mapping (analogous to OpenMP’s various loop distribution options) would also be complicated, and would be separable from an interface for running on N execution agents. (OpenMP likewise separates running on N execution agents, #pragma omp parallel, from loop parallelization, #pragma omp parallel for.) The fundamental interface is “run on N execution agents.”
Regarding Option (2), P3481R0 proposes making bulk take an execution policy. This would have a similar effect, except that it would exclude concurrent forward progress.[5] The effect of Option (2) is to make bulk like parallel for_each, in that it has a parameter analogous to an execution policy. However, if bulk is a basis operation – a way to implement parallel algorithms – and not just another way to spell parallel for_each, then developers would want to use bulk in a different way. When using parallel algorithms, execution policies like par and par_unseq communicate a “lower bound” forward progress requirement, often to expose opportunities for optimization. When implementing parallel algorithms, developers instead want to communicate the exact requirement of a particular code path. They might query the scheduler for its forward progress guarantee, and use it to dispatch to different code paths, as in the example below. This is because the parallel forward progress path might be less efficient than the concurrent forward progress path, for example because it replaces blocking synchronization with multiple bulk invocations.
constexpr auto g = get_forward_progress_guarantee(scheduler);
if constexpr (g == forward_progress_guarantee::concurrent) {
// ... concurrent implementation ...
} else if constexpr (g == forward_progress_guarantee::parallel) {
// ... parallel implementation ...
} else {
static_assert(false, "Need at least parallel forward progress");
}
Regarding Option (4), it’s clear to us what a forward progress guarantee means for bulk’s function invocations that complete on a single scheduler. However, it’s not clear what that would mean for an arbitrary graph composed of various asynchronous operations that might complete on an arbitrary variety of senders. Thus, having the get_forward_progress_guarantee query depend on the sender algorithm would complicate the syntax without actually being useful for sender algorithms other than bulk. This is why forward progress is a property of a scheduler.
Regarding Option (5), requiring bulk to be customized could be done either by removing the default implementation entirely, or by making it still selected for overload resolution but ill-formed. Users of bulk would either need to rely on a scheduler’s customization, or find another scheduler that customizes bulk and then transition to that scheduler whenever they use bulk. Either approach to removing default bulk would conflict with the P2300 design principle that every sender algorithm has a default. Sender algorithms are only customized for specific schedulers, and only because doing so is necessary for correctness or performance on that scheduler. If a scheduler has nothing to say about bulk, it should just leave the default in place. Removing default bulk would also have the unfortunate effect of bifurcating the set of schedulers into “bulk-capable” and “not-bulk-capable.” SG1 has polled to include bulk execution in the “minimal set” of an executor’s operations, for example in Kona 2017. In addition, removing default bulk would prevent users from doing reasonable things like benchmarking default bulk against a customization.
Scheduler authors who want to promise concurrent forward progress but do not care about bulk can do either of the following.

1. Nothing (do not customize bulk). This proposal will make it ill-formed to use default bulk with their scheduler (see the next section). The scheduler may thus ignore bulk when reporting its forward progress guarantee.
2. Customize bulk, but have the customization always complete with an error if the number of agents N is greater than 1 (see the section after the next).
Make it ill-formed to use default bulk with a scheduler promising concurrent forward progress

We propose that it be ill-formed to use default bulk with a scheduler whose get_forward_progress_guarantee query returns concurrent. This is because default bulk cannot possibly fulfill concurrent forward progress for any number of agents N other than one. This will prevent a common error for scheduler authors who do not care about bulk, yet still want to provide a concurrent scheduler. It also keeps default bulk usable for users who don’t mind a weaker forward progress guarantee.
Permit a bulk customization to fail if it cannot fulfill its forward progress for a given number of agents

Permit a bulk customization to fail at run time if it cannot provide the promised forward progress guarantee for a given number of agents N. This lets bulk still promise the strongest forward progress guarantee that it can provide for any value of N. bulk fails by propagating a scheduler-dependent error completion.

Letting bulk fail if it cannot fulfill its forward progress guarantee for a given N would still permit bulk to promise the strongest forward progress guarantee that it can provide for any value of N. By analogy, std::thread still promises an implementation-defined forward progress guarantee, even though std::thread’s constructor is permitted to fail. Just as std::thread’s constructor does not devolve to running code inline if it fails to launch a thread, bulk should not devolve to a weaker forward progress guarantee if it cannot meet its promise.
The idea that execution should “indicate failure” if it cannot provide the requested forward progress guarantee occurs in P0058R1, one of the three predecessors of P0443 (see historical overview in Appendix A). SG1, in its review of P0443R0 at the Issaquah 2016 meeting, expressed the wish for task launch to fail if it could not be launched with a concurrent progress guarantee.
Permitting bulk to fail for N “too large” makes this proposal trivial to implement. Scheduler authors who don’t care about bulk have two minimal-effort choices, the second of which is sketched below.

1. Do nothing. Default bulk will be ill-formed if their scheduler reports concurrent forward progress, and both well-formed and correct if their scheduler reports weaker forward progress than concurrent.
2. Customize bulk for their scheduler, and make bulk fail at run time for N greater than 1.
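Here is a minimal sketch of the second choice, under simplifying assumptions: the on_value and on_error callbacks stand in for set_value and set_error on a receiver, and my_bulk stands in for a real customization wired into the std::execution machinery.

#include <cstdio>
#include <stdexcept>

// Sketch of a bulk customization for a scheduler that promises
// concurrent forward progress but never creates more than one agent.
template<class Shape, class Fun, class OnValue, class OnError>
void my_bulk(Shape n, Fun f, OnValue on_value, OnError on_error) {
  if (n > Shape(1)) {
    // Cannot run more than one agent concurrently: error completion.
    on_error(std::runtime_error("bulk: cannot provide n concurrent agents"));
    return;
  }
  if (n == Shape(1)) {
    f(Shape(0)); // a single agent trivially has concurrent progress
  }
  on_value();
}

int main() {
  auto print_error = [](std::runtime_error const& e) {
    std::printf("error: %s\n", e.what());
  };
  my_bulk(1, [](int) { std::puts("ran one invocation"); },
          [] { std::puts("value completion"); }, print_error);
  my_bulk(2, [](int) {}, [] { std::puts("unreachable"); }, print_error);
}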
How should bulk fail? The idiomatic error reporting method for senders in [exec] is to use the error channel. The sender adaptors starts_on, schedule_from, and on all handle failure to schedule an operation on a scheduler by executing an error completion on an unspecified execution agent. Why shouldn’t other operations that may require the creation of execution agents, like bulk, do the same? This behavior is analogous to that of std::thread’s constructor, which throws std::system_error “if unable to start the new thread” ([thread.thread.constr] 8).
What error should bulk report? We should not require the error to be an exception. This permits use of std::execution with custom schedulers in freestanding environments, or environments where exceptions are disabled. We should also let different schedulers report different error types. For example, a scheduler that creates execution agents one at a time using jthread’s constructor (not necessarily wise from a performance perspective, though it would guarantee concurrent forward progress) would reasonably just pass along the thrown system_error if thread creation fails. Forcing different schedulers to use the same error type would prevent custom schedulers from passing along information that might help users handle errors.
Implementing error handling by reporting down the error channel is practical in at least some cases. For example, OpenMP has a run-time function omp_get_max_threads that a bulk implementation could use to check whether N is too large, as in the sketch below.
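A sketch of that feasibility check, assuming an OpenMP-backed scheduler: omp_get_max_threads and omp_get_thread_num are real OpenMP run-time functions, while openmp_bulk and its on_error callback are hypothetical stand-ins for a customization and its error channel.

#include <omp.h>
#include <cstdio>

// Sketch: check up front whether n distinct threads are available,
// and report an error completion before invoking anything if not.
template<class Fun, class OnError>
void openmp_bulk(int n, Fun f, OnError on_error) {
  if (n > omp_get_max_threads()) {
    on_error("bulk: cannot provide n distinct execution agents");
    return;
  }
  #pragma omp parallel num_threads(n)
  {
    f(omp_get_thread_num()); // one invocation per OpenMP thread
  }
}

int main() {
  openmp_bulk(2, [](int k) { std::printf("agent %d\n", k); },
              [](char const* msg) { std::puts(msg); });
}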
Customizations of bulk would need to know that they have N distinct execution agents available before any of those agents may invoke the user’s function. Any error reporting would need to happen before any function invocations. For example, it would be nonconforming for a customization to start some smaller number n of agents opportunistically, permit them to begin invoking the function for indices [0, n), and only thereafter try creating more agents with the possibility of failure. In our view, this is an implication of the current specification; this proposal would not change that.
We do not accept any design permitting bulk to change its forward progress guarantee depending on N, for example by tiling function invocations across available execution agents if N is greater than the maximum number of concurrent agents. This would effectively make bulk a parallel loop interface. As we discuss elsewhere, we consider “running on N execution agents” separate functionality from “parallelize this loop,” and think the two should have distinct interfaces.
The alternative designs discussed below address the question of how users would prevent errors due to the number of agents N being too large. Different implementations’ upper bounds for N may vary from 1 (an OpenMP implementation running on a single-processor system that is not permitted to create threads) to numbers too large to fit in 32 bits (the maximum product of the x, y, and z dimensions of a grid of thread blocks in a CUDA kernel launch). Here are three designs. We do not propose any of them, but if we had to pick one, it would be the third design.
1. Let users query the scheduler’s maximum number of execution agents that can execute with the promised forward progress guarantee: size_t get_max_num_agents().
2. Add a variant of bulk for which users do not specify the number of execution agents N, but instead receive it as a second argument of their function.
3. Add a variant of bulk for which users specify a maximum number of execution agents N. They then receive the actual number of execution agents n, where n is no larger than N, as the second argument of their function.
Regarding Option (1), we do not think a “maximum N” query would be helpful for arbitrary user-defined schedulers. First, users may not need to run on the maximum number of agents, because they might only have a small number of tasks. Asking for more agents than needed might waste resources. Second, the query could only ever be an upper bound, because the actual maximum might change during the program’s lifetime (as system resources are claimed by other processes or become available again) and might depend on the context (e.g., whether this is a nested bulk invocation, assuming that those are allowed).[6] Third, the best upper bound for performance, or even a practically attainable upper bound, might be smaller than the maximum value. Just because std::vector<T, Allocator>::max_size() is huge doesn’t necessarily mean that you can allocate that much memory on every system. Likewise, just because a scheduler can create millions of concurrent execution agents doesn’t necessarily mean that doing so would perform acceptably.
To elaborate on (2) and (3), bulk’s function would have two parameters, Shape k and Shape N. The argument for k is the invocation index in [0, N), and the argument for N is the actual number of execution agents. A sketch follows.
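Here is a small sketch of what the user’s function would look like under Options (2) and (3). The name bulk_with_count and its negotiation behavior are hypothetical; the point is that the function receives both its agent index k and the actual agent count n, and must branch when agents outnumber tasks.

#include <cstdio>

// Hypothetical variant of bulk: the caller gives a maximum agent
// count, the scheduler picks the actual count n <= max_agents, and
// the function receives both its index and n.
template<class Shape, class Fun>
void bulk_with_count(Shape max_agents, Fun f) {
  Shape n = max_agents; // a real scheduler might pick a smaller n
  for (Shape k = 0; k < n; ++k) f(k, n); // sequential stand-in
}

int main() {
  constexpr int num_tasks = 5;
  bulk_with_count(8, [](int k, int n) {
    if (k < num_tasks) { // forced branch: agents may outnumber tasks
      std::printf("agent %d of %d does task %d\n", k, n, k);
    }
  });
}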
Option (2) would correspond to using #pragma omp parallel without a num_threads(N) clause, where users can call omp_get_num_threads to query the number of threads that the scheduler offered. The bulk customization would provide some number N of execution agents (not necessarily the maximum – e.g., the scheduler may reduce the number to help with load balancing). Default bulk would provide N=1, thus ensuring that it has the same progress guarantee as any scheduler.
Option (2) has the same issue as Option (1): users may only have a small number of function invocations to do, and getting more agents than that might waste resources. Not giving users a way to specify N would force a branch if (k < N) do_work(k) into the user’s function. Also, the optimal N value might depend on some properties of the user’s function (e.g., its stack size requirement, or how many registers it uses) that the system might not be able to determine even at run time (e.g., if the function was loaded from a dynamically linked library). Thus, users still need a way to specify N.
Option (3) – letting users specify a maximum number of agents – would still force a branch in the user’s function. However, it would expose an opportunity for users and the system to negotiate over the number of agents. Our one concern is that this would turn “failure to fulfill the forward progress guarantee” from an error into a non-error that is nevertheless the user’s problem. Implementations could abuse this to claim concurrent forward progress while only ever offering N = 1 execution agent. This is why we do not propose this feature, though we object to it the least of the three options here.
Implementations would only need to make three changes to conform with this proposal.
1. Schedulers must only promise concurrent forward progress if they have a bulk customization and that customization promises concurrent forward progress for its function invocations.
2. Default bulk must be ill-formed when used with a scheduler that promises concurrent forward progress.
3. If a scheduler has a bulk customization, and the bulk customization can fail to provide the scheduler’s forward progress guarantee for a given Shape argument shape, then the bulk customization must report an error (via set_error).
The custom schedulers in NVIDIA’s reference implementation of std::execution already conform with this proposal, because they do not promise concurrent forward progress. For example, its CPU-based static_thread_pool::scheduler reports parallel forward progress, because it uses a thread pool with a fixed number of threads. The GPU-based schedulers report weakly parallel forward progress. Thus, the only change left for us to implement would be (2).
Text in blockquotes is not proposed wording, but rather instructions for generating proposed wording. The � character is used to denote a placeholder section number which the editor shall determine.
In [version.syn], change the value of the __cpp_lib_senders macro to YYYYMML, where the placeholder value YYYYMML denotes this proposal’s date of adoption.
Default bulk

Change [exec.bulk] 4 as follows, so that it is ill-formed to complete default bulk if its completion scheduler promises concurrent forward progress. (The only changes to this section are in the body of the lambda assigned to impls-for<bulk_t>::complete.)
4 The member impls-for<bulk_t>::complete is initialized with a callable object equivalent to the following lambda:

[]<class Index, class State, class Rcvr, class Tag, class... Args>
  (Index, State& state, Rcvr& rcvr, Tag, Args&&... args) noexcept -> void requires see below {
    if constexpr (same_as<Tag, set_value_t>) {
      constexpr auto guarantee = get_forward_progress_guarantee(
        decltype(get_completion_scheduler<set_value_t>(get_env(rcvr))));
      static_assert(guarantee != forward_progress_guarantee::concurrent);
      auto& [shape, f] = state;
      constexpr bool nothrow = noexcept(f(auto(shape), args...));
      TRY-EVAL(rcvr, [&]() noexcept(nothrow) {
        for (decltype(auto(shape)) i = 0; i < shape; ++i) {
          f(auto(i), args...);
        }
        Tag()(std::move(rcvr), std::forward<Args>(args)...);
      }());
    } else {
      Tag()(std::move(rcvr), std::forward<Args>(args)...);
    }
  }
5 The expression in the requires-clause of the lambda above is true if and only if Tag denotes a type other than set_value_t or if the expression f(auto(shape), args...) is well-formed.
bulk customizations

Change [exec.bulk] 6 as follows, so that bulk customizations complete with an error completion if the customization is unable to execute on shape distinct execution agents. (The definition of the pack args has been moved from subparagraph 6.1 to paragraph 6. Subparagraph 6.1 has been rewritten and now includes two new subparagraphs 6.1.1 and 6.1.2.)
6 Let the subexpression out_sndr denote the result of the invocation bulk(sndr, shape, f) or an object equal to such, and let the subexpression rcvr denote a receiver such that the expression connect(out_sndr, rcvr) is well-formed. Let args denote a pack of lvalue subexpressions referring to the value completion result datums of the input sender sndr. The expression connect(out_sndr, rcvr) has undefined behavior unless it creates an asynchronous operation ([exec.async.ops]) that, when started,

(6.1) if sndr propagated a value completion operation, does exactly one of the following:

(6.1.1) completes with an error completion operation with a scheduler-specific error on rcvr, if the asynchronous operation’s value completion operation would have been unable to invoke f(i, args...) on a distinct execution agent for every i of type Shape in [0, shape); or

(6.1.2) invokes f(i, args...) on a distinct execution agent for every i of type Shape in [0, shape); and

[Struck text, formerly subparagraph (6.1): “on a value completion operation, invokes f(i, args...) for every i of type Shape in [0, shape), where args is a pack of lvalue subexpressions referring to the value completion result datums of the input sender, and”]

(6.2) propagates all completion operations sent by sndr.
Appendix A reviews the history of std::execution in the context of bulk execution.
P0443, the predecessor proposal of P2300, was itself a unification of three different “executors” proposals:
“Google’s executor model for interfacing with thread pools” N4414 (2015),
“Chris Kohlhoff’s executor model for the Networking TS” N4370 (2015), and
“NVIDIA’s executor model for the Parallelism TS” P0058.
Of these three proposals, only P0058 includes bulk execution. It defines different “tags” for specifying different levels of forward progress of bulk function invocations, including concurrent_execution_tag, as well as different kinds of executors (including concurrent_executor) that offer different levels of forward progress. It explains that different strengths of forward progress guarantee can be used to construct algorithms that communicate in different ways.

For example, parallel agents might communicate via a shared atomic<T> parameter, while concurrent agents might use a shared barrier object to synchronize their communication (Section 4.5).

concurrent_execution_tag – … The basic idea is that the function invocations executed by a group of concurrent execution agents are permitted to block each others’ forward progress. This guarantee allows concurrent execution agents to communicate and synchronize (Section 4.10).
N4370 and its foundational proposals on asynchronous operations, N4045 (2014) and N4242 (2014), do not speak of forward progress. N4414 makes forward progress a property of the specific executor – e.g., the system_executor has concurrent forward progress. However, it leaves a query for that property to future work.

An interface like that suggested around Execution Agents in N4156 (and related papers) could provide a generic mechanism for checking the traits of an executor (whether it is concurrent, parallel, or weakly parallel, as well as the behavior of thread local storage). Discussion of whether that concept should be applied to all executors is left for a subsequent paper but some form of execution agent traits is reasonable as [sic].
It may help to refer to N4156 (“Light-Weight Execution Agents,” 2014), which defines “execution agent” in terms of the three levels of forward progress guarantees that later became part of the C++ Standard. It gives examples of implementations that provide different levels of forward progress for their execution agents.
SG1’s discussions of P0443 cover forward progress guarantees extensively, including guarantees of bulk execution. See, e.g., SG1 discussion in Issaquah 2016, Kona 2017, and Albuquerque 2017. Some reviewers mentioned wanting bulk execution to be able to provide concurrent forward progress if possible, for example if a bounded thread pool sometimes has enough threads for it. In Kona 2017, SG1 polled “Should the bulk two-way[7] functions be part of the minimal set for a final TS?” with results 10/3/1/0/1.
P0688R0 responded to Kona 2017 feedback on P0443R1. That feedback aimed to simplify P0443. P0688 does not clarify the forward progress of bulk execution’s individual function invocations; “blocking” for bulk_*_execute refers to the caller, as in, “[m]ay block forward progress of the caller until one or more invocations of f finish execution.” Minutes of SG1 discussion at the Toronto 2017 meeting (https://wiki.edg.com/bin/view/Wg21toronto2017/P0688) refer to the guarantees “within the bulk group” as distinct from the properties of the “spawning thread.” The minutes also point out that whether an operation blocks the caller (in P0443, operations could be blocking or nonblocking; P2300 only has asynchronous operations) means something different than the forward progress guarantee of the operation.
Later versions of P0443 introduced a customization point bulk_execute, which is the analog and predecessor of P2300’s bulk. bulk_execute lets callers specify a separate forward progress guarantee for bulk execution, as a bulk_guarantee query. It has offered three different guarantees (unsequenced, sequenced, and parallel) since R4 of the proposal. The parallel guarantee means parallel forward progress, sequenced means a sequential loop, and unsequenced means something like std::execution::par_unseq.
The 2019 paper P1658 led to the bulk_execute customization point design in later versions of P0443 (starting with R11). P1658 summarizes bulk_execute as “eagerly creat[ing] multiple execution agents in bulk.” P1660R0 (“A Compromise Executor Design Sketch”) talks about different ways to express bulk execution – either eager or lazy – but leaves the bulk execution interface as future work for P0443.
SG1’s Prague 2020 review of P1897R2, “Towards C++23 executors: A proposal for an initial set of algorithms” (see minutes) reflects a debate on whether bulk execution is a fundamental building block. Much of this debate revolves around P0443’s separation of bulk executors and non-bulk executors. The final version P1897R3 removes indexed_for (the “cleaned up version of bulk_execute”), with the suggestion that higher-level algorithms over ranges could be built atop bulk execution.
Discussion of P1898, “Forward progress delegation for executors,” points out that forward progress guarantees are not as well-defined for asynchronous algorithms as they are for synchronous algorithms. This matters, for example, for launching nested work, like bulk iteration being launched inside bulk iteration.
We consider P2181, “Correcting the Design of Bulk Execution” (R0 published June 2020, R1 published November 2020), to describe the last pre-P2300 state of affairs of bulk execution. It proposes changes to P0443R14 (which was the last published version of P0443).[8] P0443R14 includes an “Editorial note” that says, “We should probably define what ‘execute N invocations of the function object F on the executor S in bulk’ means more carefully.” Section 4.1 of P2181 proposes defining that phrase in the following way.
A group of execution agents created in bulk has a shape. Execution agents within a group are identified by indices, whose unique values are the set of contiguous indices spanned by the group’s shape.
An executor bulk executes an expression by scheduling the creation of a group of execution agents on which the expression executes in bulk. Invocable expressions are invoked with each execution agent’s index. Bulk execution of expressions that are not invocables is executor-defined.
SG1 ended up giving feedback on P2181R1, but not on bulk’s forward progress. P2181 review stopped there, as the effort was folded into P2300.
In P2300, the “Asynchronous inclusive scan” example that uses bulk (Section 1.3.2) says that it “creat[es]” or “spawn[s]” a number of “execution agents” equal to its input argument N. The example goes on to build an asynchronous inclusive scan algorithm out of a sequence of bulk and then operations. The example tiles the bulk operations by hand. However, the later nonwording description of bulk leaves unspecified the number of execution agents that bulk uses.

Each invocation of function runs in an execution agent whose forward progress guarantees are determined by the scheduler on which they are run. All agents created by a single use of bulk execute with the same guarantee. The number of execution agents used by bulk is not specified. This allows a scheduler to execute some invocations of the function in parallel.
The above text does not appear until R6 of P2300. R2 elaborates on the brief description in R0 and R1 with the following text, which supports the “N execution agents” intent of bulk.

Each invocation of function runs in an execution agent whose forward progress guarantees are determined by the scheduler on which they are run. All agents created by a single use of bulk execute with the same guarantee. This allows, for instance, a scheduler to execute all invocations of the function in parallel. The bulk operation is intended to be used at the point where the number of agents to be created is known and provided to bulk via its shape parameter.
A possible reason for the change in R6 is that bulk has always had a default sequential implementation, since R0 of P2300. Making sender algorithms customizable and providing default implementations for them are key parts of P2300’s design, as explained in Section 5.4, “Sender algorithms are customizable.” However, bulk having a sequential default means that it cannot offer a concurrent forward progress guarantee, even if the scheduler does.

We want to restore P2181’s definition of bulk execution as running each function invocation on a distinct execution agent. We also want to preserve default bulk’s sequential implementation. We propose resolving this apparent contradiction by using “execution agent” just as a way to talk about forward progress. This means that default bulk promises at best parallel forward progress.
Appendix B: Chunked parallel for_each

This Compiler Explorer example shows how to build loop parallelization as a straightforward composition of the mapping (a function from agent index to a set of loop indices) with bulk (which executes a function of agent index on a distinct agent). We show the example’s complete source code below.
#include <algorithm>   // ranges::for_each
#include <cassert>
#include <functional>  // invoke
#include <iostream>
#include <ranges>
#include <tuple>       // tuple (returned by my_chunk)
#include <type_traits>
#include <vector>

namespace mystd {

template<class I, class Fun>
struct for_each_result {
  I in;
  Fun fun;
};

// Sequential implementation of something like ranges::for_each.
struct seq_for_each_fn {
  template<
    std::random_access_iterator I,
    std::sentinel_for<I> S,
    class Proj = std::identity,
    std::indirectly_unary_invocable<std::projected<I, Proj>> Fun>
  for_each_result<I, Fun>
  operator() (I first, S last, Fun f, Proj proj = {}) const {
    for (; first != last; ++first) {
      std::invoke(f, std::invoke(proj, *first));
    }
    return {std::move(first), std::move(f)};
  }
};
inline constexpr seq_for_each_fn seq_for_each;

// Given the index in [0, num_agents) of an agent,
// return the half-open interval of indices [first, last)
// in the range [0, num_elements)
// for which that agent is responsible.
//
// In case num_agents does not evenly divide num_elements,
// load-balance so that the number of indices per agent
// cannot vary by more than one.
template<std::integral Integral>
constexpr auto my_chunk(Integral num_elements, Integral agent_index, Integral num_agents)
{
  constexpr Integral ZERO{0};
  constexpr Integral ONE{1};
  if (num_elements == 0) {
    return std::tuple{ZERO, ZERO};
  }
  auto quotient = num_elements / num_agents;
  // The first `remainder` agents get `quotient + 1` elements;
  // the rest of the agents get `quotient` elements.
  auto remainder = num_elements - quotient * num_agents;
  auto my_chunk_size = quotient + (agent_index < remainder ? ONE : ZERO);
  // Cast back to Integral to undo effects of integer promotion;
  // otherwise, no matching call for std::min with Integral=short.
  auto num_incremented = std::min(agent_index, static_cast<Integral>(remainder));
  auto num_original = (agent_index >= remainder) ? (agent_index - remainder) : ZERO;
  auto my_chunk_index = num_incremented * (quotient + ONE) + num_original * quotient;
  assert(
    my_chunk_index < num_elements ||
    (my_chunk_index == num_elements && my_chunk_size == ZERO)
  );
  // Cast back to Integral to undo effects of integer promotion
  // (for e.g., Integral=short).
  return std::tuple{
    static_cast<Integral>(my_chunk_index),
    static_cast<Integral>(my_chunk_index + my_chunk_size)
  };
}

// For the range [iterator, sentinel), return
// the given agent's [my_iterator, my_sentinel).
template<
  std::random_access_iterator I,
  std::sentinel_for<I> S>
constexpr auto /* iterator, sentinel (not necessarily S) */
my_chunk(I first, S last,
         std::iter_difference_t<I> agent_index,
         std::iter_difference_t<I> num_agents)
{
  auto num_elements = std::ranges::distance(first, last);
  auto [my_first, my_last] = my_chunk(num_elements, agent_index, num_agents);
  return std::tuple{first + my_first, first + my_last};
}

// Implementation corresponding to default bulk in [exec.bulk].
template<std::integral Shape, std::invocable<Shape> Fun>
void bulk(Shape N, Fun f) {
  for (Shape k = Shape(0); k < N; ++k) {
    std::invoke(f, k);
  }
}

// Chunked "parallel" implementation of something like ranges::for_each.
// It uses the above "bulk" inside.
struct chunked_for_each_fn {
  template<
    std::random_access_iterator I,
    std::sentinel_for<I> S,
    class Proj = std::identity,
    std::indirectly_unary_invocable<std::projected<I, Proj>> Fun>
  for_each_result<I, Fun>
  operator() (I first, S last, std::iter_difference_t<I> num_agents,
              Fun f, Proj proj = {}) const {
    bulk(num_agents, [&] (std::iter_difference_t<I> agent_index) {
      auto [my_first, my_last] = my_chunk(first, last, agent_index, num_agents);
      seq_for_each(my_first, my_last, f, proj);
    });
    return {std::move(first), std::move(f)};
  }
};
inline constexpr chunked_for_each_fn chunked_for_each;

} // namespace mystd

namespace test {

template<std::integral Integral>
void test_my_chunk_once(Integral num_elements, Integral num_agents)
{
  constexpr Integral ZERO{0};
  constexpr Integral ONE{1};
  Integral cur_first{0};
  Integral cur_last{0};
  for (Integral agent_index = ZERO; agent_index < num_agents; ++agent_index) {
    auto [first, last] = mystd::my_chunk(num_elements, agent_index, num_agents);
    assert(first == cur_last);
    assert(last >= first);
    auto quotient = num_elements / num_agents;
    assert(last - first == quotient || last - first == quotient + ONE);
    cur_first = first;
    cur_last = last;
  }
  assert(cur_last == num_elements);
  assert(cur_first <= cur_last);
}

template<std::integral Integral>
void test_my_chunk_tmpl() {
  constexpr Integral ZERO{0};
  constexpr Integral ONE{1};
  constexpr Integral TWO{2};
  constexpr Integral FIVE{5};
  constexpr Integral FIFTEEN{15};
  constexpr Integral SEVENTEEN{17};
  test_my_chunk_once(ZERO, ONE);
  test_my_chunk_once(ONE, ONE);
  test_my_chunk_once(ONE, SEVENTEEN);
  test_my_chunk_once(SEVENTEEN, ONE);
  test_my_chunk_once(SEVENTEEN, TWO);
  test_my_chunk_once(FIFTEEN, FIVE);
  test_my_chunk_once(SEVENTEEN, FIVE);
}

void test_my_chunk() {
  test_my_chunk_tmpl<short>();
  test_my_chunk_tmpl<unsigned short>();
  test_my_chunk_tmpl<int>();
  test_my_chunk_tmpl<unsigned int>();
  test_my_chunk_tmpl<long long>();
  test_my_chunk_tmpl<unsigned long long>();
}

} // namespace test

int main() {
  test::test_my_chunk();

  auto x = std::ranges::to<std::vector>(std::views::iota(1, 21));

  std::cout << "std::ranges::for_each:\n";
  [[maybe_unused]] auto result =
    std::ranges::for_each(x.begin(), x.end(), [] (float x_k) {
      std::cout << "x_k: " << x_k << "\n";
    });

  std::cout << "\nseq_for_each:\n";
  [[maybe_unused]] auto result_seq_0 =
    mystd::seq_for_each(x.begin(), x.end(), [] (float x_k) {
      std::cout << "x_k: " << x_k << "\n";
    });

  std::cout << "\nseq_for_each with nontrivial projection:\n";
  [[maybe_unused]] auto result_seq_1 =
    mystd::seq_for_each(x.begin(), x.end(), [] (float x_k) {
      std::cout << "10 * x_k: " << x_k << "\n";
    }, [] (float x_k) { return 10 * x_k; });

  int num_agents = 3;
  std::cout << "\nchunked_for_each (num_agents=" << num_agents << "):\n";
  [[maybe_unused]] auto result3 =
    mystd::chunked_for_each(x.begin(), x.end(), num_agents, [] (float x_k) {
      std::cout << "x_k: " << x_k << "\n";
    });

  return 0;
}
L. S. Blackford et al., “ScaLAPACK Users’ Guide,” Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997. Available online: https://netlib.org/scalapack/slug/ [last accessed 2025/01/02].
G. A. Geist, J. A. Kohl, and P. M. Papadopoulos, “PVM and MPI: a Comparison of Features,” Calculateurs Paralleles, Vol. 8, No. 2, 1996. Available online: http://www.csm.ornl.gov/pvm/PVMvsMPI.ps [last accessed 2024/12/31].
Hartmut Kaiser et al., “HPX - The C++ Standard Library for Parallelism and Concurrency,” Journal of Open Source Software, September 2020. Available online: https://joss.theoj.org/papers/10.21105/joss.02352 [last accessed 2025/01/10].
HPX Documentation, v1.10.0. Available online: https://hpx-docs.stellar-group.org/latest/pdf/HPX.pdf [last accessed 2025/01/10].
Ken Kennedy, Charles Koelbel, and Hans Zima, “The Rise and Fall of High Performance Fortran,” Communications of the ACM, Vol. 54, No. 11, pp. 74-82, November 2011.
Message Passing Interface Forum, “MPI: A Message-Passing Interface Standard, Version 4.1,” November 2023. Available online: https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report.pdf [last accessed 2024/12/31].
“OpenMP Application Programming Interface,” Version 6.0, November 2024. Available online: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-6-0.pdf [last accessed 2024/12/31].
Parallel Virtual Machine website. Available online: https://www.csm.ornl.gov/pvm/pvm_home.html [last accessed 2024/12/31].