task_scheduler support for parallel bulk execution

| Document #: | P3927R0 |
| Date: | 2026-01-14 |
| Project: | Programming Language C++ |
| Audience: | SG1 Concurrency and Parallelism Working Group; LEWG Library Evolution Working Group; LWG Library Working Group |
| Reply-to: | Eric Niebler <eric.niebler@gmail.com> |

By default, instances of the coroutine type std::execution::task
store the “current” scheduler in a type-erased scheduler wrapper called
std::execution::task_scheduler.
As with other type-erased wrappers, the goal of std::execution::task_scheduler
is presumably to behave as much like a drop-in replacement for the
object it wraps as is possible.
The task_scheduler falls short of
this ideal in one respect: if a
task_scheduler wraps a
parallel_scheduler and is used to
launch parallel work with a bulk
sender, the work is not parallelized as it would be had a
parallel_scheduler been used
directly. That is because the
task_scheduler does not treat the
bulk algorithms specially, as
parallel_scheduler does.
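
To make the gap concrete, here is a toy model in plain C++. It does not use std::execution, and every name in namespace toy is hypothetical. The wrapper type-erases only the "run one task" operation, so its bulk operation never reaches the wrapped scheduler's specialized path:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>

namespace toy {  // hypothetical names; this is not the std::execution API

// A scheduler with its own bulk implementation (a stand-in for
// parallel_scheduler's special handling of the bulk algorithms).
struct parallel_like_scheduler {
  void run(const std::function<void()>& task) const { task(); }
  std::string bulk(std::size_t n, const std::function<void(std::size_t)>& f) const {
    for (std::size_t i = 0; i < n; ++i) f(i);  // a real one would fan out to threads
    return "specialized";
  }
};

// A wrapper that type-erases only run(). bulk() is not part of the erased
// interface, so the wrapper can only offer a generic, sequential fallback.
class erased_scheduler {
  struct iface {
    virtual ~iface() = default;
    virtual void run(const std::function<void()>& task) const = 0;
  };
  template <class S>
  struct model final : iface {
    explicit model(S s) : s_(std::move(s)) {}
    void run(const std::function<void()>& task) const override { s_.run(task); }
    S s_;
  };
  std::shared_ptr<const iface> impl_;

 public:
  template <class S>
  explicit erased_scheduler(S s) : impl_(std::make_shared<model<S>>(std::move(s))) {}

  std::string bulk(std::size_t n, const std::function<void(std::size_t)>& f) const {
    // The wrapped scheduler's bulk() is unreachable from here.
    impl_->run([&] { for (std::size_t i = 0; i < n; ++i) f(i); });
    return "generic";
  }
};

}  // namespace toy
```

Calling bulk on a toy::parallel_like_scheduler directly reports "specialized", while the same call through toy::erased_scheduler reports "generic", even though both perform the same work.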
Fortunately, the
parallel_scheduler has been
specified in such a way that the
task_scheduler can reuse its
back-end helpers, making the job of specifying an improved
task_scheduler much easier.
Like task_scheduler, the
parallel_scheduler is a type-erased
wrapper for a scheduler-like object. It uses the abstract base classes
parallel_scheduler_backend and
receiver_proxy to punch the
schedule,
bulk_chunked, and
bulk_unchunked operations through
the type-erased interface. These are precisely the operations we would
like task_scheduler to handle.
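
The technique described above, punching a fixed set of operations through a type-erased interface by way of abstract base classes, can be sketched in plain C++. This is a simplified illustration under stated assumptions: toy::backend and toy::item_proxy are hypothetical stand-ins for parallel_scheduler_backend and bulk_item_receiver_proxy, the storage parameter is omitted, and the string return value exists only so the dispatch path can be observed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

namespace toy {  // hypothetical names; this is not the std::execution API

// Stand-in for bulk_item_receiver_proxy: runs the items in [begin, end).
struct item_proxy {
  virtual ~item_proxy() = default;
  virtual void execute(std::size_t begin, std::size_t end) = 0;
};

// Stand-in for parallel_scheduler_backend: the operations that must
// survive type erasure are virtual functions.
struct backend {
  virtual ~backend() = default;
  virtual std::string schedule_bulk_chunked(std::size_t shape, item_proxy& r) = 0;
};

// A concrete scheduler with its own chunking strategy (two halves).
struct two_chunk_scheduler {
  std::string bulk_chunked(std::size_t shape, item_proxy& r) const {
    const std::size_t mid = shape / 2;
    r.execute(0, mid);
    r.execute(mid, shape);
    return "two_chunk_scheduler";
  }
};

// Implements the abstract interface in terms of a concrete scheduler.
template <class Sch>
struct backend_for final : backend {
  explicit backend_for(Sch sch) : sched_(std::move(sch)) {}
  std::string schedule_bulk_chunked(std::size_t shape, item_proxy& r) override {
    return sched_.bulk_chunked(shape, r);
  }
  Sch sched_;
};

// The type-erased wrapper holds a pointer to the abstract backend, so a
// bulk request reaches the concrete scheduler's own implementation.
class erased_scheduler {
  std::shared_ptr<backend> sch_;

 public:
  template <class Sch>
  explicit erased_scheduler(Sch sch)
      : sch_(std::make_shared<backend_for<Sch>>(std::move(sch))) {}
  std::string bulk_chunked(std::size_t shape, item_proxy& r) const {
    return sch_->schedule_bulk_chunked(shape, r);
  }
};

// A proxy that records how it was driven.
struct counting_proxy final : item_proxy {
  std::size_t items = 0, calls = 0;
  void execute(std::size_t begin, std::size_t end) override {
    items += end - begin;
    ++calls;
  }
};

}  // namespace toy
```

Driving a toy::counting_proxy through toy::erased_scheduler{toy::two_chunk_scheduler{}} shows two execute calls covering every item, which is the concrete scheduler's strategy rather than a generic fallback.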
Currently, task_scheduler is
specified to have an exposition-only member
sch_ of type shared_ptr<void>.
If this is changed to shared_ptr<parallel_scheduler_backend>,
then the bulk algorithms can
dispatch through sch_->schedule_bulk_chunked(...)
and sch_->schedule_bulk_unchunked(...)
and be accelerated for free.
Well ok, not exactly free; some work is needed:
We need a class that inherits
parallel_scheduler_backend and
implements its abstract interface in terms of a concrete scheduler,
like:

    template<scheduler Sch>
    struct task-scheduler-backend : parallel_scheduler_backend { // exposition only
      explicit task-scheduler-backend(Sch sch) : sched_(std::move(sch)) {}

      void schedule(receiver_proxy& r, span<byte> s) noexcept override;
      void schedule_bulk_chunked(size_t shape, bulk_item_receiver_proxy& r,
                                 span<byte> s) noexcept override;
      void schedule_bulk_unchunked(size_t shape, bulk_item_receiver_proxy& r,
                                   span<byte> s) noexcept override;

      Sch sched_;
    };

The schedule override would
connect the result of calling execution::schedule(sched_)
with a receiver that wraps the
receiver_proxy and then calls
start on the resulting operation
state.
The schedule_bulk_[un]chunked
overrides would construct a bulk
sender whose predecessor is essentially the
just()
sender, but with a value completion scheduler of
sched_. It would then
connect that
bulk sender with a receiver that
wraps the bulk_item_receiver_proxy
and calls start on the resulting
operation state. Since the predecessor sender has
sched_ as its value completion
scheduler,
connect will
use sched_’s domain to transform the
bulk sender before connecting it
with the receiver, causing the sender to use a custom implementation as
appropriate.
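
The proposed wording later in this paper has schedule_bulk_chunked split the iteration space into num_chunks = (shape + chunk_size - 1) / chunk_size chunks, where chunk i covers the items from i * chunk_size up to the lesser of (i + 1) * chunk_size and shape. The arithmetic can be checked with a small standalone function. The name chunk_ranges is hypothetical, and the early return for a zero shape or chunk size is an assumption added here to avoid dividing by zero; the wording only requires chunk_size to be no greater than shape.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the half-open ranges [begin, end) that each chunk would pass to
// r.execute(begin, end). Hypothetical helper, for illustration only.
std::vector<std::pair<std::size_t, std::size_t>>
chunk_ranges(std::size_t shape, std::size_t chunk_size) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  if (shape == 0 || chunk_size == 0) return ranges;  // assumption: nothing to do

  // Ceiling division: the last chunk may be shorter than chunk_size.
  const std::size_t num_chunks = (shape + chunk_size - 1) / chunk_size;
  for (std::size_t i = 0; i < num_chunks; ++i) {
    const std::size_t begin = i * chunk_size;
    const std::size_t end = std::min((i + 1) * chunk_size, shape);
    ranges.emplace_back(begin, end);
  }
  return ranges;
}
```

For shape 10 and chunk_size 4 this yields the three ranges [0, 4), [4, 8), and [8, 10): every item is covered exactly once and only the final chunk is short.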
We also need task_scheduler
to have a completion domain with a
transform_sender member function
that accepts vanilla bulk_[un]chunked
senders and transforms them so that they use
sch_->schedule_bulk_chunked(...) and sch_->schedule_bulk_unchunked(...).

    struct task-scheduler-domain : default_domain {
      template<class BulkSndr, class Env>
      static constexpr auto transform_sender(set_value_t, BulkSndr&& bulk_sndr,
                                             const Env& env) noexcept;
    };

This member function would be constrained to accept only bulk_[un]chunked
senders and would return a new sender that, when connected and started,
would connect and start bulk_sndr’s
predecessor sender. Error and stopped completions are forwarded to the
receiver. Value completions are used to construct a
bulk_item_receiver_proxy which is
passed to sch_->schedule_bulk_chunked(...).
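
As a rough illustration of that last step, the sketch below shows how a proxy object could capture the bulk callable together with the predecessor's value results and replay them for each chunk. It is a simplified model, not the specified design: toy::item_proxy and toy::chunked_proxy are hypothetical names, the proxy forwards to a plain callable instead of a receiver, and the call shape f(begin, end, args...) is assumed to mirror how bulk_chunked invokes its callable.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

namespace toy {  // hypothetical names; this is not the std::execution API

// Stand-in for bulk_item_receiver_proxy.
struct item_proxy {
  virtual ~item_proxy() = default;
  virtual void execute(std::size_t begin, std::size_t end) = 0;
};

// Stores the callable and decay-copied value results, then hands them to
// the callable as lvalues each time the backend executes a chunk.
template <class F, class... Args>
struct chunked_proxy final : item_proxy {
  explicit chunked_proxy(F f, Args... args)
      : f_(std::move(f)), args_(std::move(args)...) {}

  void execute(std::size_t begin, std::size_t end) override {
    std::apply([&](Args&... as) { f_(begin, end, as...); }, args_);
  }

  F f_;
  std::tuple<Args...> args_;
};

}  // namespace toy
```

A backend only ever sees the item_proxy base, so it can drive any callable and argument pack through the same two-integer execute interface.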
The proposed solution has been implemented in NVIDIA’s CCCL library. The relevant
pull request can be found at https://github.com/NVIDIA/cccl/pull/5975,
and the source for the
task_scheduler is here.
[ Editor's note: Change 33.13.5 [exec.task.scheduler] as follows; removed text is shown as [-...-] and added text as {+...+}: ]

    namespace std::execution {
      class task_scheduler {
        class [-ts-sender-]{+ts-domain+};              // exposition only

        [-template<receiver R>
          class state;                   // exposition only-]
        {+template<scheduler Sch>
          class backend-for;             // exposition only+}

      public:
        using scheduler_concept = scheduler_t;

        template<class Sch, class Allocator = allocator<void>>
          requires (!same_as<task_scheduler, remove_cvref_t<Sch>>) && scheduler<Sch>
        explicit task_scheduler(Sch&& sch, Allocator alloc = {});

        [-ts-sender-]{+see below+} schedule();

        friend bool operator==(const task_scheduler& lhs, const task_scheduler& rhs) noexcept;
        template<class Sch>
          requires (!same_as<task_scheduler, Sch>) && scheduler<Sch>
        friend bool operator==(const task_scheduler& lhs, const Sch& rhs) noexcept;

      private:
        shared_ptr<[-void-]{+parallel_scheduler_backend+}> sch_; // exposition only
                                                    {+// see [exec.sysctxrepl.psb]+}
      };
    }

task_scheduler is a class that models scheduler (33.6 [exec.sched]). Given an object s of type task_scheduler, let SCHED(s) be the {+sched_ member of the+} object owned by s.sch_. The expression get_forward_progress_guarantee(s) is equivalent to get_forward_progress_guarantee(SCHED(s)). {+The expression get_completion_domain<set_value_t>(s) is equivalent to task_scheduler::ts-domain().+}

    template<class Sch, class Allocator = allocator<void>>
      requires (!same_as<task_scheduler, remove_cvref_t<Sch>>) && scheduler<Sch>
    explicit task_scheduler(Sch&& sch, Allocator alloc = {});

Effects: Initialize sch_ with allocate_shared<{+backend-for<+}remove_cvref_t<Sch>{+>+}>(alloc, std::forward<Sch>(sch)).

Recommended practice: Implementations should avoid the use of dynamically allocated memory for small scheduler objects.

Remarks: Any allocations performed by [-construction of ts-sender or state objects resulting from-] calls on *this are performed using a copy of alloc.

    [-ts-sender schedule();-]

[-Effects: Returns an object of type ts-sender containing a sender initialized with schedule(SCHED(*this)).-]

    bool operator==(const task_scheduler& lhs, const task_scheduler& rhs) noexcept;

Effects: Equivalent to: return lhs == SCHED(rhs);

    template<class Sch>
      requires (!same_as<task_scheduler, Sch>) && scheduler<Sch>
    bool operator==(const task_scheduler& lhs, const Sch& rhs) noexcept;

Returns: false if the type of SCHED(lhs) is not Sch, otherwise SCHED(lhs) == rhs.

[ Editor's note: Remove paragraphs 8-12 and add the following paragraphs: ]

For an lvalue r of type derived from receiver_proxy, let WRAP-RCVR(r) be an object of a type that models receiver and whose completion handlers result in invoking the corresponding completion handlers of r.

    namespace std::execution {
      template<scheduler Sch>
      class task_scheduler::backend-for : public parallel_scheduler_backend { // exposition only
      public:
        explicit backend-for(Sch sch) : sched_(std::move(sch)) {}

        void schedule(receiver_proxy& r, span<byte> s) noexcept override;
        void schedule_bulk_chunked(size_t shape, bulk_item_receiver_proxy& r,
                                   span<byte> s) noexcept override;
        void schedule_bulk_unchunked(size_t shape, bulk_item_receiver_proxy& r,
                                     span<byte> s) noexcept override;

        Sch sched_; // exposition only
      };
    }

Let just-sndr-like be a sender whose only value completion signature is set_value_t() and for which the expression get_completion_scheduler<set_value_t>(get_env(just-sndr-like)) == sched_ is true.

    void schedule(receiver_proxy& r, span<byte> s) noexcept override;

Effects: Constructs an operation state os with connect(schedule(sched_), WRAP-RCVR(r)) and calls start(os).

    void schedule_bulk_chunked(size_t shape, bulk_item_receiver_proxy& r,
                               span<byte> s) noexcept override;

Effects: Let chunk_size be an integer less than or equal to shape, let num_chunks be (shape + chunk_size - 1) / chunk_size, and let fn be a function object such that for an integer i, fn(i) calls r.execute(i * chunk_size, m), where m is the lesser of (i + 1) * chunk_size and shape. Constructs an operation state os as if with connect(bulk(just-sndr-like, par, num_chunks, fn), WRAP-RCVR(r)) and calls start(os).

    void schedule_bulk_unchunked(size_t shape, bulk_item_receiver_proxy& r,
                                 span<byte> s) noexcept override;

Effects: Let fn be a function object such that for an integer i, fn(i) is equivalent to r.execute(i, i + 1). Constructs an operation state os as if with connect(bulk(just-sndr-like, par, shape, fn), WRAP-RCVR(r)) and calls start(os).

    see below schedule();

Returns: a prvalue ts-sndr whose type models sender such that:

(8.1) get_completion_scheduler<set_value_t>(get_env(ts-sndr)) is equal to *this.

(8.2) get_completion_domain<set_value_t>(get_env(ts-sndr)) is expression-equivalent to ts-domain().

(8.3) If a receiver rcvr is connected to ts-sndr and the resulting operation state is started, calls sch_->schedule(r, s), where

(8.3.1) r is a proxy for rcvr with base system_context_replaceability::receiver_proxy (33.15 [exec.par.scheduler]) and

(8.3.2) s is a preallocated backend storage for r.

(8.4) completion_signatures_of_t<Sndr> denotes:

    completion_signatures<
      set_value_t(),
      set_error_t(error_code),
      set_error_t(exception_ptr),
      set_stopped_t()
    >

    namespace std::execution {
      class task_scheduler::ts-domain : public default_domain { // exposition only
      public:
        template<class BulkSndr, class Env>
          static constexpr auto transform_sender(set_value_t, BulkSndr&& bulk_sndr,
                                                 const Env& env) noexcept;
      };
    }

    template<class BulkSndr, class Env>
      static constexpr see below transform_sender(BulkSndr&& bulk_sndr, const Env& env) noexcept;

Constraints: sender_in<BulkSndr, Env> is true, auto(std::forward<BulkSndr>(bulk_sndr)) is well-formed, and either sender-for<BulkSndr, bulk_chunked_t> or sender-for<BulkSndr, bulk_unchunked_t> is true.

Effects: Equivalent to:

    auto& [_, data, child] = bulk_sndr;
    auto& [_, shape, fn] = data;
    auto sch = call-with-default(get_completion_scheduler<set_value_t>,
                                 not-a-scheduler(), get_env(child), FWD-ENV(env));
    return e;

where e is not-a-sender() if the type of sch is not task_scheduler; otherwise, it is a prvalue whose type models sender such that, if it is connected to rcvr and the resulting operation state is started, child is connected to an unspecified receiver R and started. If child completes with an error or a stopped completion, the completion operation is forwarded unchanged to rcvr. Otherwise, let args be a pack of lvalue subexpressions designating objects decay-copied from the value result datums. Then

(15.1) If bulk_sndr was the result of the evaluation of an expression equivalent to bulk_chunked(child, policy, shape, f) or a copy of such, then sch_->schedule_bulk_chunked(shape, r, s) is called where r is a bulk chunked proxy (33.15 [exec.par.scheduler]) for rcvr with callable f and arguments args, and s is a preallocated backend storage for r.

(15.2) Otherwise, calls sch_->schedule_bulk_unchunked(shape, r, s) where r is a bulk unchunked proxy for rcvr with callable f and arguments args, and s is a preallocated backend storage for r.

Recommended practice: The returned sender should hold references to the parts of bulk_sndr that it needs.

Remarks: The expression get_env(R) is expression-equivalent to FWD-ENV(get_env(rcvr-copy)), where rcvr-copy is an lvalue subexpression designating an object decay-copied from rcvr.