1. Motivation
Standard parallel algorithms with execution policies, which set semantic requirements on user-provided callable objects, were a good start for supporting parallelism in the C++ standard.
The C++ Ranges library - ranges, views, etc. - is a powerful facility to produce lazily evaluated pipelines that can be processed by range-based algorithms. Together they provide a productive and expressive API with the room for extra optimizations.
Combining these two powerful features by adding support for execution policies to the range-based algorithms opens an opportunity to fuse several computations into one parallel algorithm call, thus reducing the overhead of parallelism. That is especially valuable for heterogeneous implementations of parallel algorithms, for which the range-based API helps reduce the number of kernels submitted to an accelerator.
Users are already using ranges and range adaptors by passing range iterators to the existing non-range parallel algorithms. [P2408R5] was adopted to enable this. This pattern is often featured when teaching C++ parallel algorithms and appears in many codebases.
However, passing range iterators to non-range algorithms is unwieldy and verbose. It is surprising to users that they cannot simply pass the ranges to the parallel algorithms as they would for serial algorithms.
Two before/after comparisons illustrate the difference: a scalar-vector multiply and a matrix transpose.
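A minimal sketch of the scalar-vector multiply case, assuming an in-place scaling of a std::vector<float> v by a factor a (the "after" form uses the overloads proposed in § 4 Proposed API), might look like:

```cpp
// Before: C++17 parallel algorithm, iterators must be extracted from the container
std::for_each(std::execution::par, v.begin(), v.end(),
              [a](float& x) { x *= a; });

// After (proposed): the parallel range algorithm takes the range directly
std::ranges::for_each(std::execution::par, v,
                      [a](float& x) { x *= a; });
```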
Earlier, [P2500R2] proposed to add the range-based C++ parallel algorithms alongside its primary goal of extending algorithms with schedulers. We have decided to split those parts into separate papers, which can progress independently.
2. Design overview
This paper proposes execution policy support for C++ range-based algorithms. In a nutshell, the proposal extends C++ range algorithms with overloads taking any standard C++ execution policy as a function parameter. These overloads are further referred to as parallel range algorithms.
The proposal is targeted to C++26.
2.1. Design summary
2.1.1. Differences to serial range algorithms
Compared to the C++20 serial range algorithms, we propose the following modifications:
- 
     The execution policy parameter is added. 
- 
     for_each and for_each_n return only an iterator, not the callable object. (§ 2.3 Algorithm return types) 
- 
     Parallel range algorithms take a range as the output, or an iterator and a sentinel where the input is also passed that way. (§ 2.6 Taking range as an output) 
- 
     Until better parallelism-friendly abstractions are proposed, parallel algorithms require random_access_iterator and random_access_range. (§ 2.5 Requiring random_access_iterator and random_access_range) 
- 
     At least one of the input sequences, as well as the output sequence, must be bounded. (§ 2.7 Requiring ranges to be bounded) 
2.1.2. Differences to C++17 parallel algorithms
In addition to data sequences being passed as either ranges or "iterator and sentinel" pairs, the following differences to the C++17 parallel algorithms are proposed:
- 
     for_each returns an iterator rather than void. (§ 2.3 Algorithm return types) 
- 
     Algorithms require random_access_iterator and random_access_range rather than forward iterators. (§ 2.5 Requiring random_access_iterator and random_access_range) 
- 
     At least one of the input sequences, as well as the output sequence, must be bounded. 
2.1.3. Other design aspects
- 
     Except as mentioned above, the parallel range algorithms should return the same type as the corresponding serial range algorithms. (§ 2.3 Algorithm return types) 
- 
     The proposed algorithms should follow the design of serial range algorithms with regard to name lookup. (§ 2.4 Non ADL-discoverable functions) 
- 
     The proposed algorithms should require the callable object passed to an algorithm to be regular_invocable, where applicable. (§ 2.8 Requirements for callable parameters) 
- 
     The proposed APIs are not customization points. (§ 2.9 Parallel range algorithms are not customization points) 
- 
     The proposed algorithms should follow the design of C++17 parallel algorithms with regard to constexpr support. (§ 2.10 constexpr support) 
2.2. Coexistence with schedulers
We believe that adding parallel range algorithms does not have the risk of conflict with anticipated scheduler-based algorithms, because an execution policy does not satisfy the requirements for a policy-aware scheduler ([P2500R2]), a sender ([P3300R0]), or really anything else from [P2300R9] that can be used to specify such algorithms.
At this point we do not, however, discuss how the appearance of schedulers may or should impact the execution rules for parallel algorithms specified in [algorithms.parallel.exec], and just assume that the same rules apply to the range algorithms with execution policies.
2.3. Algorithm return types
We explored possible algorithm return types and came to the conclusion that returning the same type as the serial range algorithms is the preferred option, as it keeps the changes needed to enable parallelism minimal.
```cpp
auto res = std::ranges::sort(v);
```
becomes:
```cpp
auto res = std::ranges::sort(std::execution::par, v);
```
However, for_each and for_each_n deserve special consideration.
The following table summarizes the return types of the existing variants of these two algorithms:
| API | Return type | 
|---|---|
| std::for_each | Function (the callable) | 
| Parallel std::for_each | void | 
| std::for_each_n | Iterator | 
| Parallel std::for_each_n | Iterator | 
| std::ranges::for_each | for_each_result<ranges::borrowed_iterator_t<R>, Fun> | 
| std::ranges::for_each, I + S overload | for_each_result<I, Fun> | 
| std::ranges::for_each_n | for_each_n_result<I, Fun> | 
While the serial std::ranges::for_each returns both an iterator and the function object, parallel execution does not permit efficient (or race-free) state accumulation in the callable, so returning the callable would have little value; that is also why the C++17 parallel std::for_each returns void rather than the function object.
Based on the analysis above and the feedback from SG9, we think that the most reasonable return type
for the parallel variants of for_each and for_each_n is an iterator:
| API | Return type | 
|---|---|
| Parallel std::ranges::for_each | ranges::borrowed_iterator_t<R> | 
| Parallel std::ranges::for_each, I + S overload | I | 
| Parallel std::ranges::for_each_n | I | 
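As a usage sketch based on the signatures in § 4 Proposed API (v is assumed to be a std::vector<int>), the difference shows up as follows:

```cpp
// Serial range algorithm: returns for_each_result, i.e. an iterator and the callable
auto [it_serial, fn] = std::ranges::for_each(v, [](int& x) { ++x; });

// Proposed parallel range algorithm: returns only the iterator
auto it_parallel = std::ranges::for_each(std::execution::par, v, [](int& x) { ++x; });
```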
2.4. Non ADL-discoverable functions
We believe the proposed functionality should have the same behavior as serial range algorithms regarding the name lookup. For now, the new overloads are supposed to be special functions that are not discoverable by ADL (the status quo of the standard for serial range algorithms).
[P3136R0] suggests respecifying range algorithms to be actual function objects. If adopted, that proposal will
apply to all algorithms in the ranges namespace, including the parallel versions we propose.
Either way, adding parallel versions of the range algorithms should not be a problem. Please see § 4.1 Possible implementation of a parallel range algorithm for more information.
2.5. Requiring random_access_iterator and random_access_range
C++17 parallel algorithms minimally require LegacyForwardIterator for data sequences, which in our opinion is not really suitable for an efficient parallel implementation. Therefore, for parallel range algorithms we propose to require random access ranges and iterators.
Though the feedback we received in Tokyo requested support for forward ranges, we would like this question to be discussed in more detail. Using parallel algorithms with forward ranges will in most cases give little to no benefit, and may even reduce performance due to extra overheads. We believe that forward ranges and iterators are bad abstractions for parallel data processing, and allowing them could result in wrong expectations and an unsatisfactory user experience with parallel algorithms.
Many parallel programming models that are well known and widely used in the industry, including OpenMP, OpenCL, CUDA, SYCL, and oneTBB, define iteration or data spaces for their parallel constructs in ways that allow creating sufficient parallel work quickly and efficiently. A key property for this is the ability to split the work into smaller chunks. These programming models allow controlling the amount of work per chunk, and sometimes the way chunks are created and/or scheduled. All of them also support iteration spaces of up to at least 3 dimensions.
Except for 
These very programming models are often used as backends to implement C++ standard parallelism. Not surprisingly,
most implementations fall back to serial processing if data sequences do not provide random access. Of GNU libstdc++,
LLVM libc++, and MSVC’s standard library, only the latter attempts to process forward-iterator-based sequences in parallel,
for which it first needs to serially iterate over the whole sequence once or even twice.
The oneAPI Data Parallel C++ Library (oneDPL) supports forward iterators only for a very few algorithms.
Returning to the SG1/SG9 feedback, there seem to be two main reasons why others do not want to restrict parallel algorithms to only random access ranges:
- 
     That would prevent some useful views, such as filter_view, from being used with parallel algorithms. 
- 
     That would be inconsistent with the C++17 parallel algorithms. 
Given the other aspects of the proposed design, we believe some degree of inconsistency with C++17 parallel algorithms is inevitable and should not become a gating factor for important design decisions.
The question of supporting the standard views that do not provide random access is very important. We think though
that it should better be addressed through proper abstractions and new concepts defining iteration spaces, including
multidimensional ones, suitable for parallel algorithms. We intend to work on developing these (likely in another paper),
however it requires time and effort to make it right, and we think trying to squeeze that into C++26 adds significant risks.
For now, random access ranges with known bounds (see § 2.7 Requiring ranges to be bounded) are probably the best approximation
that exists in the standard. Starting from that and gradually enabling other types of iteration spaces
in a source-compatible manner seems to us better than a blanket allowance of any forward range.
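For illustration, a sketch of what the proposed constraints would and would not accept (v, f, and is_even are assumed here; the exact requirements are shown in § 4 Proposed API):

```cpp
// OK: take_view over a vector is a random access range with a sized sentinel
std::ranges::for_each(std::execution::par, v | std::views::take(1000), f);

// Rejected by the proposed constraints: a filter view is at best bidirectional,
// so it does not model random_access_range
// std::ranges::for_each(std::execution::par, v | std::views::filter(is_even), f);
```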
2.6. Taking range as an output
   We would like to propose a range as the output for the overloads that take ranges for input. Similarly, we propose a sentinel for output where input is passed as "iterator and sentinel". See § 4 Proposed API for the examples.
The reasons for that are:
- 
     It creates a safer API where all the data sequences have known limits. 
- 
     The output size is not defined by the input size for all algorithms; copy_if is an example. 
- 
     Passing a range for output makes code a bit simpler in the cases typical for parallel execution. 
It is worth noting that to various degrees these reasons are also applicable to serial algorithms.
There are already range algorithms that take a range for output, for example ranges::fill and ranges::generate.
We think that in practice parallel algorithms mainly write the output data into a container or storage
with preallocated space, for efficiency reasons. So, typically parallel algorithms receive an output sequence whose size is known upfront.
Also, classes such as std::span conveniently represent such preallocated storage as a range.
All in all, we think for parallel algorithms taking ranges and sentinels for output makes more sense than only taking an iterator.
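A sketch of this usage with the binary transform overload proposed in § 4 Proposed API, assuming input vectors a and b of equal size:

```cpp
std::vector<float> out(a.size());  // preallocated output of a known size

// The output is passed as a range, so all three sequences have known bounds
std::ranges::transform(std::execution::par, a, b, out, std::plus{});
```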
The main concern we have heard about this approach is the mismatch between serial and parallel variations. That is, if serial range algorithms only take iterators for output and parallel range algorithms only take ranges, switching between those will always require code changes. That can be resolved by:
- 
     (A) adding output-as-range to serial range algorithms, 
- 
     (B) adding output-as-iterator to parallel range algorithms 
or both.
The option (A) gives some of the described benefits to serial range algorithms as well; one could argue that it would be a useful addition on its own. The option (B) does not seem to have benefits besides the aligned semantics, while it has the downside of not enforcing the requirements we propose in § 2.7 Requiring ranges to be bounded.
With either (A) or (B), the output parameter for range algorithm overloads could be either a range or an iterator. In the formal wording, this could be represented either as two separate overloads with different requirements on that parameter, or with an exposition-only range-or-iterator concept that combines the requirements by logical disjunction, as its name suggests. We did not explore which makes more sense; at a glance, there seems to be little practical difference for library implementors.
For "iterator and sentinel" overloads we prefer to always require a sentinel for output, despite the mismatch with the corresponding serial overloads.
2.7. Requiring ranges to be bounded
One of the requirements we want to put on the parallel range algorithms is to disallow unbounded input and output. The reasons for that are:
- 
     First, for efficient parallel implementation we need to know the iteration space bounds. Otherwise, it’s hard to apply the "divide and conquer" strategy for creating work for multiple execution threads. 
- 
     Second, while serial range algorithms allow passing an "infinite" range like std::ranges::views::iota(0), accepting such a range in a parallel algorithm would more likely be a source of programming errors than a useful capability. 
We have evaluated a few options to specify such a requirement, and for now decided to express it via the sized_sentinel_for concept (see § 4 Proposed API).
In the case of two or more input ranges or sequences, it is sufficient for just one to be bounded.
The other input ranges are then assumed to have at least as many elements as the bounded one.
This enables unbounded ranges such as views::repeat to be used for some of the arguments, as in the following example:
```cpp
void normalize_parallel(range auto&& v)
{
    // reduce and transform here denote parallel range algorithms;
    // a range-based reduce does not exist yet (see § 5)
    auto mx = reduce(execution::par, v, ranges::max);
    transform(execution::par, v, views::repeat(mx), v, divides{});
}
```
At the same time, for an output range (that we propose in § 2.6 Taking range as an output) our preference is to have a boundary independently of the input range(s). The main motivation is to follow established practices of secure coding, which recommend or even require always specifying the size of the output in order to prevent out-of-range data modifications. We think this will not impose any practical limitation on which ranges can be used for the output of a parallel algorithm, as we could not find or invent an example of a random-access writable range which would also be unbounded.
If several provided ranges or sequences are bounded, an algorithm should stop as soon as the end is reached for the shortest one.
There are already precedents in the standard that an algorithm takes two sequences with potentially different input sizes
and chooses the smaller size as the number of iterations it is going to make, such as std::ranges::transform taking two input ranges.
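For illustration, a sketch of mixing bounded and unbounded inputs under the proposed constraints (using the binary transform from § 4 Proposed API):

```cpp
std::vector<int> v(100, 7);
std::vector<int> out(v.size());

// v and out are bounded; views::iota(0) is unbounded but is assumed to provide
// at least as many elements as the shortest bounded sequence
std::ranges::transform(std::execution::par,
                       v, std::views::iota(0), out,
                       [](int x, int idx) { return x * idx; });
```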
2.8. Requirements for callable parameters
In [P3179R0] we proposed that parallel range algorithms should require function objects for predicates, comparators, etc.
to have a const-qualified operator().
We did extra investigation and decided that requiring regular_invocable, as the serial range algorithms already do, is sufficient:
- 
     The vast majority of the serial range algorithms requires function objects to be regular_invocable (or a concept derived from it), whose semantic requirements already forbid the invocation from modifying the callable, so an additional const requirement on operator() adds little. 
- 
     Remaining algorithms should be considered individually. For example, for_each is allowed to have a mutating operator(), and the callable passed to generate is by design expected to return a different value on each invocation. 
The following example works fine for serial code. While it compiles for parallel code, users should not assume that the
semantics remain intact. Since the parallel version of std::for_each purposefully returns void, the state accumulated by the callable cannot be retrieved, and invoking the callable concurrently is a data race anyway.
```cpp
struct callable
{
    void operator()(int& x)
    {
        ++x;
        ++i; // a data race if the callable is executed concurrently
    }

    int get_i() const { return i; }

private:
    int i = 0;
};

callable c;

// serial for_each call
auto fun = std::for_each(v.begin(), v.end(), c);

// parallel for_each call
// The callable object cannot be read because the parallel for_each version
// purposefully returns void
std::for_each(std::execution::par, v.begin(), v.end(), c);

// serial range-based for_each call
auto [_, fun2] = std::ranges::for_each(v.begin(), v.end(), c);
```
We allow the same callable to be used in the proposed std::ranges::for_each overloads with an execution policy:
```cpp
// callable is used from the previous code snippet
callable c;

// The returned iterator is ignored
std::ranges::for_each(std::execution::par, v.begin(), v.end(), c);
```
Again, even though it compiles, the state accumulated by the callable cannot be retrieved, because the proposed overload returns only an iterator. Wrapping the callable with std::reference_wrapper to keep access to its state also compiles, but concurrent invocations would then race on the shared state:
```cpp
// callable is used from the previous code snippet
// Wrapping a callable object with std::reference_wrapper compiles,
// but might result in data races
callable c;
std::ranges::for_each(std::execution::par, v.begin(), v.end(), std::ref(c));
```
Our conclusion is that it is the user’s responsibility to provide a callable that avoids data races, the same as for the C++17 parallel algorithms.
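For example, one race-free way to accumulate a count alongside the element updates is to keep the shared state in a std::atomic; this is only an illustration of a user-side technique, not part of the proposal:

```cpp
std::atomic<int> count{0};
std::ranges::for_each(std::execution::par, v,
                      [&count](int& x) {
                          ++x;
                          count.fetch_add(1, std::memory_order_relaxed);
                      });
```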
2.9. Parallel range algorithms are not customization points
We do not propose the parallel range algorithms to be customization points because it’s unclear which parameter to customize for. One could argue that customizations may exist for execution policies, but we expect custom execution policies to become unnecessary once the C++ algorithms work with schedulers/senders/receivers.
2.10. constexpr support
[P2902R0] suggests allowing algorithms with execution policies to be used in constant expressions. We do not consider that a primary design goal for our work; however, we will happily align with that proposal once it progresses towards adoption into the working draft.
3. More examples
3.1. Change existing code to use parallel range algorithms
One of the goals is to require a minimal amount of changes when switching from the existing API to parallel range algorithms. However, that simplicity should not create hidden issues negatively impacting the overall user experience. We believe that the proposal provides a good balance in that regard.
As an example, let’s look at using for_each.
For the serial range-based for_each call:
```cpp
std::ranges::for_each(v, [](auto& x) { ++x; });
```
switching to the parallel version will look like:
```cpp
std::ranges::for_each(std::execution::par, v, [](auto& x) { ++x; });
```
In this simple case, the only change is an execution policy added as the first function argument. The same holds for
the "iterator and sentinel" overload of std::ranges::for_each.
The C++17 parallel for_each call:
```cpp
std::for_each(std::execution::par, v.begin(), v.end(), [](auto& x) { ++x; });
```
can be changed to one of the following:
```cpp
// Using iterator and sentinel
std::ranges::for_each(std::execution::par, v.begin(), v.end(), [](auto& x) { ++x; });

// Using vector as a range
std::ranges::for_each(std::execution::par, v, [](auto& x) { ++x; });
```
So, here only changing the namespace is necessary, though users might also switch from passing iterators to passing the container itself as a range.
However, for other algorithms more changes might be necessary.
3.2. Fewer parallel algorithm calls and better expressiveness
Let’s consider the following example:
```cpp
reverse(policy, begin(data), end(data));
transform(policy, begin(data), end(data), begin(result),
          [](auto i) { return i * i; });
auto res = any_of(policy, begin(result), end(result), pred);
```
It has three stages and eventually tries to answer the question of whether the input sequence, after being reversed and transformed, contains an element that satisfies a predicate. The interesting considerations are:
- 
     Since the example has three parallel stages, it incurs the overhead of setting up parallel computation three times, once per algorithm call. 
- 
     The first two stages will complete for all elements before the any_of call starts, even though any_of itself does not need to process all elements: it may return as soon as pred evaluates to true. Fusing the stages would allow stopping all the work early, thus providing more performance. 
Let’s make it better:
```cpp
// With fancy iterators
auto res = any_of(policy,
                  make_transform_iterator(make_reverse_iterator(end(data)),
                                          [](auto i) { return i * i; }),
                  make_transform_iterator(make_reverse_iterator(begin(data)),
                                          [](auto i) { return i * i; }),
                  pred);
```
Now there is only one parallel algorithm call, and the computation may stop early once the predicate is satisfied. However, this variation has its own issues:
- 
     First, it doesn’t compile. Each make_transform_iterator call is passed its own lambda, and because every lambda expression has a unique type, the two resulting transform_iterator objects have different types and cannot be used together as an iterator pair for any_of. 
- 
     Second, it requires using a non-standard iterator. 
- 
     Third, the expressiveness of the code is not good: it is hard to read, and it is easy to make a mistake like the one described in the first bullet. 
Let’s improve the example further with the proposed API:
```cpp
// With ranges
auto res = any_of(policy,
                  data | views::reverse
                       | views::transform([](auto i) { return i * i; }),
                  pred);
```
The example above does not have the drawbacks described for the previous variations:
- 
     There is only one algorithm call; 
- 
     The implementation might skip unnecessary work; 
- 
     There is no room for the lambda type mistake; 
- 
     The readability is much better compared to the second variation and not worse than in the first one. 
4. Proposed API
Note: only for_each and the binary transform are shown here to illustrate the proposed approach; other algorithms are expected to follow the same pattern.
```cpp
// for_each
template <class ExecutionPolicy,
          random_access_iterator I, sized_sentinel_for<I> S,
          class Proj = identity,
          indirectly_unary_invocable<projected<I, Proj>> Fun>
I ranges::for_each(ExecutionPolicy&& policy, I first, S last, Fun f, Proj proj = {});

template <class ExecutionPolicy,
          random_access_range R,
          class Proj = identity,
          indirectly_unary_invocable<projected<iterator_t<R>, Proj>> Fun>
  requires sized_sentinel_for<ranges::sentinel_t<R>, ranges::iterator_t<R>>
ranges::borrowed_iterator_t<R>
  ranges::for_each(ExecutionPolicy&& policy, R&& r, Fun f, Proj proj = {});

// binary transform with an output range and an output sentinel
template <typename ExecutionPolicy,
          random_access_iterator I1, sentinel_for<I1> S1,
          random_access_iterator I2, sentinel_for<I2> S2,
          random_access_iterator O, sized_sentinel_for<O> SO,
          copy_constructible F,
          class Proj1 = identity, class Proj2 = identity>
  requires indirectly_writable<O,
               indirect_result_t<F&, projected<I1, Proj1>, projected<I2, Proj2>>>
           && (sized_sentinel_for<S1, I1> || sized_sentinel_for<S2, I2>)
constexpr binary_transform_result<I1, I2, O>
  transform(ExecutionPolicy&& policy, I1 first1, S1 last1, I2 first2, S2 last2,
            O result, SO s, F binary_op, Proj1 proj1 = {}, Proj2 proj2 = {});

template <typename ExecutionPolicy,
          ranges::random_access_range R1, ranges::random_access_range R2,
          ranges::random_access_range RR,
          copy_constructible F,
          class Proj1 = identity, class Proj2 = identity>
  requires indirectly_writable<ranges::iterator_t<RR>,
               indirect_result_t<F&, projected<ranges::iterator_t<R1>, Proj1>,
                                     projected<ranges::iterator_t<R2>, Proj2>>>
           && (sized_sentinel_for<ranges::sentinel_t<R1>, ranges::iterator_t<R1>>
               || sized_sentinel_for<ranges::sentinel_t<R2>, ranges::iterator_t<R2>>)
           && sized_sentinel_for<ranges::sentinel_t<RR>, ranges::iterator_t<RR>>
constexpr binary_transform_result<ranges::borrowed_iterator_t<R1>,
                                  ranges::borrowed_iterator_t<R2>,
                                  ranges::borrowed_iterator_t<RR>>
  transform(ExecutionPolicy&& policy, R1&& r1, R2&& r2, RR&& result, F binary_op,
            Proj1 proj1 = {}, Proj2 proj2 = {});
```
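A hypothetical usage sketch of the overloads above, assuming vectors a, b, and out of the same size:

```cpp
// for_each, "iterator and sentinel" overload
std::ranges::for_each(std::execution::par, a.begin(), a.end(), [](int& x) { ++x; });

// binary transform with the output given as an iterator and a sentinel
std::ranges::transform(std::execution::par,
                       a.begin(), a.end(), b.begin(), b.end(),
                       out.begin(), out.end(),
                       std::plus{});
```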
4.1. Possible implementation of a parallel range algorithm
```cpp
// A possible implementation of std::ranges::for_each
namespace ranges {
namespace __detail {

struct __for_each_fn
{
    // ...
    // Existing serial overloads
    // ...

    // The overload for unsequenced and parallel policies. Requires random_access_iterator
    template <class ExecutionPolicy,
              random_access_iterator I, sized_sentinel_for<I> S,
              class Proj = identity,
              indirectly_unary_invocable<projected<I, Proj>> Fun>
        requires is_execution_policy_v<std::remove_cvref_t<ExecutionPolicy>>
    I operator()(ExecutionPolicy&& exec, I first, S last, Fun f, Proj proj = {}) const
    {
        // properly handle the execution policy;
        // for the reference, a serial implementation is provided
        for (; first != last; ++first)
        {
            std::invoke(f, std::invoke(proj, *first));
        }
        return first;
    }

    template <class ExecutionPolicy,
              random_access_range R,
              class Proj = identity,
              indirectly_unary_invocable<projected<iterator_t<R>, Proj>> Fun>
    ranges::borrowed_iterator_t<R>
    operator()(ExecutionPolicy&& exec, R&& r, Fun f, Proj proj = {}) const
    {
        return (*this)(std::forward<ExecutionPolicy>(exec),
                       std::ranges::begin(r), std::ranges::end(r), f, proj);
    }
}; // struct __for_each_fn

} // namespace __detail

inline namespace __for_each_fn_namespace {
    inline constexpr __detail::__for_each_fn for_each;
} // namespace __for_each_fn_namespace
} // namespace ranges
```
5. Absence of some serial range-based algorithms
We understand that some useful algorithms, for example those from the <numeric> header, do not yet exist in std::ranges. This paper does not propose adding their serial range-based versions; we expect that to be addressed separately.
6. Further exploration
6.1. Thread-safe views examination
We need to understand better whether using some standard views in parallel algorithms can result in data races.

We need to invest more time to understand the implications of sharing state between a view and its iterators, or between iterators of the same view, when those are used concurrently.
Here are questions we want to answer (potentially not a complete list):
- 
     Do users have enough control to guarantee absence of data races for such views? 
- 
     Are data races absent only because of the implementation strategies chosen by standard libraries? 
- 
     Do we need to add extra thread-safety requirements to the standard views? 
7. Revision history
7.1. R1 => R2
- 
     Summarize proposed differences from the serial range algorithms and from the non-range parallel algorithms 
- 
     Allow all but one of the input sequences to be unbounded 
- 
     List existing algorithms that take ranges for output 
- 
     Update arguments and mitigations for using ranges for output 
- 
     Add more arguments in support of random access ranges 
- 
     Fix the signatures of for_each 
7.2. R0 => R1
- 
     Address the feedback from SG1 and SG9 review 
- 
     Add more information about iterator constraints 
- 
     Propose a range as the output for the parallel range algorithms 
- 
     Require ranges to be bounded 
8. Polls
8.1. SG9, Tokyo 2024
Poll 1: 
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 2 | 4 | 2 | 0 | 0 | 
Poll 2: Parallel 
| Unanimous consent. | 
Poll 3: Parallel ranges algos should require 
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 3 | 2 | 3 | 1 | 1 | 
Poll 4: Range-based parallel algos should require const operator()
| SF | F | N | A | SA | 
|---|---|---|---|---|
| 0 | 7 | 2 | 0 | 0 |