1. Changes history
1.1. R1
-
Incorporate to the feedback from SG1 review, Poland, 2024.
-
Add wording section.
2. Motivation
When [P2300R10] was merged into the working draft, a subsequent review of the github issue tracker discovered several
outstanding issues relating to the use of the
algorithm that was in [P2300R10] which were not addressed during the
LEWG review. These issues mostly account for better expressing what
can and cannot do.
For most algorithms defined in [P2300R10], the default implementation typically describes the desired behavior.
However, this is not the case for
.
The expectation for
is that it will be customized for most use cases.
The default implementation is a common-denominator solution that is not necessarily optimal for the majority of scenarios.
The description in [exec.bulk] specifies the default implementation but does not outline any requirements for customizations.
This creates uncertainty for users regarding what they can expect when invoking
.
This paper addresses this issue by specifying what is permissible in a customization of
.
A second issue this paper seeks to resolve is the lack of an execution policy for the provided functor.
For example, imagine that we have scheduler that supports
execution policy only for parallelism (certain GPU
schedulers) and it wants to customize
. If the functor is unsuitable for the
execution policy, there is
currently no mechanism for users to express this with the current API shape.
A third issue this paper addresses is the absence of chunking in the default implementation of
.
Invoking the functor for every element in a range can lead to performance challenges in certain use cases.
2.1. Allowed behavior for bulk
customizations
From the specification of bulk in [exec.bulk], if we have customized
, the following are unclear:
-
Can
invoke the given functor concurrently?bulk -
Can
create decayed copies of the given functor?bulk -
Can
create decayed copies of the values produced by the previous sender?bulk -
Can
handle cancellation within the algorithm?bulk -
How should
respond to exceptions thrown by the given functor?bulk
2.2. Execution policy for the given functor
Similar to
([alg.foreach]) users might expect the
algorithm to take an execution policy as argument.
The execution policy would define the execution capabilities of the given function.
There are two main reasons for wanting an execution policy to accompany the function passed to
:
-
To better tailor the algorithm to the available hardware.
-
To improve usage in generic code.
Currently, we have hardware that supports both
and
execution policy semantics, as well as hardware that only supports
.
Standardizing the API without execution policies implies that the language chooses a default execution policy for
.
This decision could favor certain hardware vendors over others.
From a generic programming perspective, consider a scenario where a user implements a function like the following:
void process ( std :: vector < int >& data , auto std :: invocable < int > f ) { // call `f` for each element in `data`; try using `bulk`. // can we call `f` concurrently? }
In the body of the generic function process, we do not know the constraints imposed on
.
Can we assume that
can be invoked with the
execution policy?
Can it be invoked with the
execution policy?
Or should we default to the
execution policy?
Since there is no way to infer the execution policy that should apply to
, the implementation of process must be conservative and default to the
execution policy.
However, in most cases, the
or
execution policy would likely be more appropriate.
2.3. Chunking
The current specification of
([exec.bulk]) does not support chunking.
This limitation presents a performance issue for certain use cases.
If two iterations can potentially be executed together more efficiently than in isolation, chunking would provide a performance benefit.
Let us take an example:
std :: atomic < std :: uint32_t > sum { 0 }; std :: vector < std :: uint32_t > data ; ex :: sender auto s = ex :: bulk ( ex :: just (), std :: execution :: par , 100'000 , [ & sum , & data ]( std :: uint32_t idx ) { sum . fetch_add ( data [ idx ]); });
In this example, we perform 100,000 atomic operations. A more efficient implementation would allow a single atomic operation for each chunk of work, enabling each chunk to perform local summation. This approach might look like:
std :: atomic < std :: uint32_t > sum { 0 }; ex :: sender auto s = ex :: bulk_chunked ( ex :: just (), std :: execution :: par , 100'000 , [ & sum , & data ]( std :: uint32_t begin , std :: uint32_t end ) { std :: uint32_t partial_sum = 0 ; while ( begin != end ) { partial_sum += data [ idx ]; } sum . fetch_add ( partial_sum ); });
Other similar examples exist where grouping operations together can improve performance.
This is particularly important when the functor passed to
cannot be inlined or when the functor is expensive to invoke.
3. API
Define
to match the following API:
// NEW: algorithm to be used as a default basis operation template < execution :: sender Predecessor , typename ExecutionPolicy , std :: integral Size , std :: invocable < Size , Size , values - sent - by ( Predecessor )... > Func > execution :: sender auto bulk_chunked ( Predecessor pred , ExecutionPolicy && pol , Size size , Func f ); // NEW: algorithm to ensure one execution agent per iteration template < execution :: sender Predecessor , typename ExecutionPolicy , std :: integral Size , std :: invocable < Size , values - sent - by ( Predecessor )... > Func > execution :: sender auto bulk_unchunked ( Predecessor pred , ExecutionPolicy && pol , Size size , Func f ); template < execution :: sender Predecessor , typename ExecutionPolicy , std :: integral Size , std :: invocable < Size , values - sent - by ( Predecessor )... > Func > execution :: sender auto bulk ( Predecessor pred , ExecutionPolicy && pol , // NEW Size size , Func f ) { // Default implementation return bulk_chunked ( std :: move ( pred ), std :: forward < ExecutionPolicy > ( pol ), size , [ func = std :: move ( func )] < typename ... Vs > ( Size begin , Size end , Vs && ... vs ) noexcept ( std :: is_nothrow_invocable_v < Func , Size , Vs && ... > ) { while ( begin != end ) { f ( begin ++ , std :: forward < Vs > ( vs )...); } }); }
The
algorithm is a customization point returning a sender that describes a work with following properties:
-
Let
be the values sent by the predecessor.args ... -
In the absence of any errors, call
zero or multiple times, such as, for everyf ( b , e , args ...)
in range [0,i
), there is exactly one call tosize
with i ∈ [f ( b , e , args ...)
,b
). All the calls toe
strongly happen before any call tof
.set_value -
If any call to
throws, then the sender completes on receiverf
withrcvr
, whereset_error ( std :: move ( rcvr ), e )
is one of the exceptions thrown, or is an exception derived frome
. Before callingstd :: runtime_error
, the operation may callset_error
zero or multiple times, such as, for everyf ( b , e , args ...)
in range [0,i
), there is maximum one call tosize
withf ( b , e , args ...)
∈ [i
,b
).e
The
algorithm is a customization point returning a sender that describes a work with following properties:
-
Let
be the values sent by the predecessor.args ... -
In the absence of any errors, call
zero or multiple times, such as, for everyf ( i , args ...)
in range [0,i
), there is exactly one call tosize
. All the calls tof ( i , args ...)
strongly happen before any call tof
.set_value -
If any call to
throws, then the sender completes on receiverf
withrcvr
, whereset_error ( std :: move ( rcvr ), e )
is one of the exceptions thrown, or is an exception derived frome
. Before callingstd :: runtime_error
, the operation may callset_error
zero or multiple times, such as, for everyf ( i , args ...)
in range [0,i
), there is maximum one call tosize
.f ( i , args ...)
The
algorithm is a customization point.
The default implementation calls
as shown above.
4. Design discussions
4.1. Use chunked version as a basis operation
To address the performance problem described in the motivation, we propose to add a chunked version for
.
This allows implementations to process the iteration space in chunks, and take advantage of the locality of the chunk processing.
This is useful when publishing the effects of the functor may be expensive (the example from the motivation) and when the given functor cannot be inlined.
The implementation of
can use dynamically sized chunks that adapt to the workload at hand.
For example, computing the results of a Mandelbrot fractal on a line is an unbalanced process.
Some values can be computed very fast, while others take longer to compute.
Passing a range as parameters to the functor passed to
is similar to the way Intel TBB’s
functions (see [parallel_for]).
An implementation of
can be easily built on top of
without losing performance (as shown in the Proposal section).
4.2. Also define an unchunked version
For GPUs is important to have a version of
that ensures one execution agent per iteration.
Defining
in terms of
would not be efficient for GPUs.
Moreover, for composability reasons, we need a way to specify which version of
will not do any chunking.
For example, an algorithm tailored for GPUs that depends on
would not want to have chunking, and, without controlling the specialization for
, it must not accidentally call a chunked version of
.
For this reasons, we need to have a version of
that doesn’t do any chunking.
4.3. Using execution policies
As discussed in the § 2 Motivation section, not having the possibility to specify execution policies for the
algorithm is a downside.
The only reasonable default execution policy choice is
because defaulting to anything else might lead to deadlocks.
On contrary, it does not make sense because the use cases,
(and
/
) is aimed for would
expect parallelism, that is using
,
or
policies. Thus,
has to have an execution policy
parameter.
One criticism of this solution is that each invocation of
needs to contain an extra parameter that is typically verbose to type.
The idea considered by the authors to solve it is to provide versions of the algorithms that have the execution policies already baked in. Something like:
auto bulk_seq ( auto prev , auto size , auto f ); auto bulk_unseq ( auto prev , auto size , auto f ); auto bulk_par ( auto prev , auto size , auto f ); auto bulk_par_unseq ( auto prev , auto size , auto f ); auto bulk_chunked_seq ( auto prev , auto size , auto f ); auto bulk_chunked_unseq ( auto prev , auto size , auto f ); auto bulk_chunked_par ( auto prev , auto size , auto f ); auto bulk_chunked_par_unseq ( auto prev , auto size , auto f ); auto bulk_unchunked_seq ( auto prev , auto size , auto f ); auto bulk_unchunked_unseq ( auto prev , auto size , auto f ); auto bulk_unchunked_par ( auto prev , auto size , auto f ); auto bulk_unchunked_par_unseq ( auto prev , auto size , auto f );
We dropped this idea, as this isn’t scalable.
On the other hand, it’s quite easy to write a thin wrapper on top of
/
on the user side that calls the algorithm
with the execution policy of choice.
Something like:
auto user_bulk_par ( auto prev , auto size , auto f ) { return std :: execution :: bulk ( std :: execution :: par , prev , size , f ); }
4.4. Conflicting execution policies
For [P2500R2] design we previously agreed that there might be an execution policy attached to a
. The
question that comes to mind is what should we do if we have execution policy passed via
parameters that conflicts with
execution policy attached to a scheduler?
This is another area to explore. It’s quite obvious that execution policy passed to
describes possible parallelization
ways of
callable object, while it’s not 100% obvious what execution policy attached to a scheduler means outside of [P2500R2] proposal.
We might choose different strategies (not all of them are mutually exclusive):
-
Reconsider the decisions we made for [P2500R2] by stopping attaching policies to schedulers.
-
Reduce policies: choose the most conservative one. For example, for the passed policy
and attached policypar
we reduce it the resulting policy toseq
because it’s the most conservative one.seq -
For "peer" conflicting policies:
andpar
we might either want to reduce it tounseq
or might want to give a compile-time error.seq
-
-
Give a compile-time error right away for any pair of conflicting policies.
Anyway, exploration of this problem is out of scope of this paper.
For the purpose of this paper, we need to ensure that the way the given function is called is compatible with the execution policy passed to
.
4.5. Expectations on customizations
The current specification of
in [exec.bulk] doesn’t define any constraints/expectations for the customization of the algorithm.
The default implementation (a serial version of
) is expected to be very different that its customizations.
Having customizations in mind, we need to define the minimal expectations for calling
.
We propose the following:
-
Mandate
to be copy-constructible.f -
Allow
to be invoked according to the specified execution policy (and don’t assume it’s going to be called serially).f -
Allow the values produced by the previous sender to be decay copied (if the values produced are copyable).
-
Allow the algorithm to handle cancellation requests, but don’t mandate it.
We want to require that the given functor be copy-constructible so that
requires the same type of function as a
with an execution policy.
Adding an execution policy to
/
/
would also align them with
.
If
is allowed to be invoked concurrently, then implementations may want to copy the arguments passed to it.
Another important point for a
operation is how it should react to cancellation requests.
We want to specify that
should be able to react to cancellation requests, but not make this mandatory.
It shall be also possible for vendors to completely ignore cancellation requests (if, for example, supporting cancellation would actually slow down the main uses-cases that the vendor is optimizing for).
Also, the way cancellation is implemented should be left unspecified.
Checking the cancellation token every iteration may be expensive for certain cases. A cancellation check requires an acquire memory barrier (which may be too much for really small functors), and might prevent vectorization. Thus, implementations may want to check the cancellation every N iterations; N can be a statically known number, or can be dynamically determined.
As
,
, and
are customization points, we need to specify these constraints to all three of them.
4.6. Exception handling
While there are multiple solutions to specify which errors can be thrown by
, the most sensible ones seem to be:
-
pick an arbitrary exception thrown by one of the invocations of
(maybe using a atomic operations to select the first thrown exception),f -
reduce the exceptions to another exception that can carry one or more of the thrown exceptions using a user-defined reduction operation (similar to using
as discussed by [P0333R0]),exception_list -
allow implementations to produce a new exception type (e.g., to represent failures outside of
, or to represent failure to transmit the original exception to the receiver)f
One should note that option 2 can be seen as a particular case of option 3.
Also, option 2 seems to be more complex than option 1, without providing any palpable benefits to the users.
The third option is very useful when implementations may fail outside of the calls to the given functor. Also, there may be cases in which exceptions cannot be transported from the place they were thrown to the place they need to be reported.
Based on the experience with the existing parallel frameworks we incline to recommend the option one because
-
In a parallel execution the common behavior is non-deterministic by nature. Thus, we can peak arbitrary exception to throw
-
Catching any exception already indicates that something went wrong, thus a good implementation might initiate a cancellation mechanism to finish already failed work as soon as possible.
On top of that, for special cases we also want to allow option 3.
5. Specification
5.1. In [execution.syn]
After
, add:
struct bulk_chunked_t { unspecified }; struct bulk_unchunked_t { unspecified };
After
, add:
inline constexpr bulk_chunked_t bulk_chunked {}; inline constexpr bulk_unchunked_t bulk_unchunked {};
5.2. In [exec.bulk]
Rename the title of the section from “execution::bulk” to “execution::bulk, execution::bulk_chunked, and execution::bulk_unchunked”.
Apply the following changes (no track changes for paragraph numbering):
-
,bulk
, andbulk_chunked
runbulk_unchunked sa task repeatedly for every index in an index space. -
The
namenames
,bulk
, andbulk_chunked bulk_unchunked denotes adenote pipeable sender adaptorobjectobjects . Let
be eitherbulk - algo
,bulk
orbulk_chunked
. For subexpressionsbulk_unchunked
,sndr
,policy
, andshape
, letf
bePolicy
andremove_cvref_t < decltype ( policy ) >
beShape
.decltype ( auto ( shape )) IfThe expression
does not satisfydecltype (( sndr ))
, or ifsender
does not satisfyShape
, or ifintegral
does not satisfydecltype (( f ))
,movable - value
is ill-formed.bulk ( sndr , shape , f ) bulk - algo
is ill-formed if any of the following is true:( sndr , policy , shape , f ) -
does not satisfydecltype (( sndr ))
,sender -
isis_execution_policy_v < remove_cvref_t < Policy >> false
, -
does not satisfyShape
,integral -
does not model thedecltype (( f ))
concept.copy_constructible
-
-
Otherwise, the expression
bulk ( sndr , shape , f ) bulk - algo
is expression-equivalent to:( sndr , policy , shape , f ) transform_sender ( get - domain - early ( sndr ), make - sender ( bulk , product - type { shape , f }, sndr )) transform_sender ( get - domain - early ( sndr ), make - sender ( bulk - algo , product - type { policy , shape , f }, sndr )) except that
is evaluated only once.sndr -
The exposition-only class template
([exec.snd.general]) is specialized forimpls - for
,bulk_t
orbulk_chunked_t
as follows:bulk_unchunked_t namespace std :: execution { template <> struct impls - for < bulk_t bulk - algo > : default - impls { static constexpr auto complete = see below ; }; } -
The member
is initialized with a callable object equivalent to the following lambda:impls - for < bulk_t bulk_unchunked_t >:: complete [] < class Index , class State , class Rcvr , class Tag , class ... Args > ( Index , State & state , Rcvr & rcvr , Tag , Args && ... args ) noexcept -> void requires see below { if constexpr ( same_as < Tag , set_value_t > ) { auto & [ shape , f ] = state ; auto & [ policy , shape , f ] = state ; constexpr bool nothrow = noexcept ( f ( auto ( shape ), args ...)); TRY - EVAL ( rcvr , [ & ]() noexcept ( nothrow ) { for ( decltype ( auto ( shape )) i = 0 ; i < shape ; ++ i ) { f ( auto ( i ), args ...); } Tag ()( std :: move ( rcvr ), std :: forward < Args > ( args )...); }()); } else { Tag ()( std :: move ( rcvr ), std :: forward < Args > ( args )...); } } -
The expression in the requires-clause of the lambda above is
true
if and only if
denotes a type other thanTag
or if the expressionset_value_t
is well-formed.f ( auto ( shape ), args ...)
-
-
The member
is initialized with a callable object equivalent to the following lambda:impls - for < bulk_chunked_t >:: complete [] < class Index , class State , class Rcvr , class Tag , class ... Args > ( Index , State & state , Rcvr & rcvr , Tag , Args && ... args ) noexcept -> void requires see below { if constexpr ( same_as < Tag , set_value_t > ) { auto & [ policy , shape , f ] = state ; constexpr bool nothrow = noexcept ( f ( auto ( shape ), auto ( shape ), args ...)); TRY - EVAL ( rcvr , [ & ]() noexcept ( nothrow ) { f ( 0 , auto ( shape ), args ...); Tag ()( std :: move ( rcvr ), std :: forward < Args > ( args )...); }()); } else { Tag ()( std :: move ( rcvr ), std :: forward < Args > ( args )...); } } -
The expression in the requires-clause of the lambda above is
true
if and only if
denotes a type other thanTag
or if the expressionset_value_t
is well-formed.f ( auto ( shape ), auto ( shape ), args ...)
-
-
The member
is initialized with a callable object equivalent to the following lambda:impls - for < bulk_t >:: complete [] < class Index , class State , class Rcvr , class Tag , class ... Args > ( Index , State & state , Rcvr & rcvr , Tag , Args && ... args ) noexcept -> void requires see below { auto & [ policy , shape , f ] = state ; constexpr bool nothrow = noexcept ( f ( auto ( shape ), args ...)); auto new_f = [ func = std :: move ( f )] ( Size begin , Size end , Vs & ... vs ) noexcept ( nothrow ) { while ( begin != end ) f ( begin ++ , std :: forward ( vs )...); } impls - for < bulk_chunked_t >:: complete ( Index (), product - type { policy , shape , std :: move ( new_f )}, rcvr , Tag (), std :: forward < Args > ( args )...); } -
The expression in the requires-clause of the lambda above is
true
if and only if
denotes a type other thanTag
or if the expressionset_value_t
is well-formed.f ( auto ( shape ), args ...)
-
-
-
If
isbulk - algo
,bulk_chunked
, orbulk_unchunked
, letbulk Letthe subexpression
denote the result of the invocationout_sndr bulk bulk - algo ( sndr , policy , shape , f )
denote a receiver such that the expressionrcvr
is well-formed. The expressionconnect ( out_sndr , rcvr )
has undefined behavior unless it creates an asynchronous operation ([async.ops]) that, when started:connect ( out_sndr , rcvr ) -
on a value completion operation, invokes
for everyf ( i , args ...)
of typei
fromShape
to0
, whereshape
is a pack of lvalue subexpressions referring to the value completion result datums of the input sender, andargs -
propagates all completion operations sent by
.sndr
-
if
has a scuccessful completion, wheresndr
is a pack of lvalue subexpressions referring to the value completion result datums ofargs
, then:sndr -
if
also completes successfully, then:out_sndr -
if
isbulk - algo
orbulk
, invokesbulk_unchunked
for everyf ( i , args ...)
of typei
fromShape
to0
;shape -
if
isbulk - algo
, invokesbulk_chunked
zero or multiple times with pairs off ( b , e , args ...)
andb
of typee
in range [Shape
,0
], such as for everyshape
of typei
fromShape
to0
, there is an invocation with a pairshape
andb
, such ase
;b <= i < e
-
-
if
completes without_sndr
, the asynchronous operation may invoke a subset of the invocations ofset_error ( rcvr , e )
as described above before the completion signal, andf
is either:e -
an exception thrown by an invocation of
, orf -
an exception derived from
;runtime_error
-
-
if
completes without_sndr
, the asynchronous operation may invoke a subset of the invocations ofset_stopped ( rcvr )
as described above before the completion signal;f
-
-
if
does not complete withsndr
, the completion signal is forwarded toset_value (...) recv -
the function
is invoked in a way that is compatible with the execution policyf
.policy
-
-
If
isbulk - algo
,bulk_chunked
, orbulk_unchunked
, then the asyncrhonous operation corresponding tobulk bulk - algo
is allowed (but not required) to:( sndr , policy , shape , f ) -
complete with
if cancellation is requested (through the connected receiver);set_stopped () -
ignore cancellation requests;
-
make decaying copies of the values produced by the predecessor sender, if the values are copyable.
-
6. Polls
6.1. SG1, Wrocław, Poland, 2024
[P3481R0] was presented at the SG1 meeting in November 2024 in Wrocław, Poland. SG1 provided the following feedback in the form of polls:
-
If we have chunking, we need to expose a control on the chunk size.
| SF | F | N | A | SA | | 1 | 2 | 3 | 1 | 1 | No consensus
-
We need a version of the
API that presents the chunk to the callable in order to implement the parallel algorithmsbulk | SF | F | N | A | SA | | 4 | 2 | 1 | 0 | 1 | Consensus in favor
-
We need a version of the bulk API that creates an execution agent per iteration.
| SF | F | N | A | SA | | 2 | 3 | 3 | 0 | 0 | Unanimous consent
-
We believe / desire:
-
Bulk needs an execution policy to describe the callable, IF the scheduler also has an execution policy (for debugging for example) then a conservative choice should be used (
is more conservative thanseq
)par -
No SG1 concerns with the proposed exception handling
-
No change to status quo on default implementation of bulk being serial
-
No change to status quo on bulk having a predecessor
-
Forms of bulk needed:
-
->bulk
->f ( i ) bulk_chunked -
->bulk_chunked f ( b , e ) -
->bulk_unchunked
"execution agent per iteration"f ( i )
-
| SF | F | N | A | SA | | 5 | 2 | 1 | 0 | 0 | Unanimous consent
-