Document Number: | |
---|---|
Date: | |
Revises: | |
Editor: | NVIDIA Corporation |
Note: this is an early draft. It’s known to be incomplet and incorrekt, and it has lots of bad formatting.
This Technical Specification describes requirements for implementations of an interface that computer programs written in the C++ programming language may use to invoke algorithms with parallel execution. The algorithms described by this Technical Specification are realizable across a broad class of computer architectures.
This Technical Specification is non-normative. Some of the functionality described by this Technical Specification may be considered for standardization in a future version of C++, but it is not currently part of any C++ standard. Some of the functionality in this Technical Specification may never be standardized, and other functionality may be standardized in a substantially changed form.
The goal of this Technical Specification is to build widespread existing practice for parallelism in the C++ standard algorithms library. It gives advice on extensions to those vendors who wish to provide them.
The following referenced document is indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 14882:2017, Programming Languages — C++, is herein called the C++ Standard.
The library described in ISO/IEC 14882:2017 clauses 20-33 is herein called
the C++ Standard Library. The C++ Standard Library components described in
ISO/IEC 14882:2017 clauses 28, 29.8 and 23.10.10 are herein called the C++ Standard
Algorithms Library.
Unless otherwise specified, the whole of the C++ Standard's Library
introduction (ISO/IEC 14882:2017, Clause 20) is included into this
Technical Specification by reference.
For the purposes of this document, the terms and definitions given in the C++ Standard and the following apply.
A parallel algorithm is a function template described by this Technical Specification, declared in namespace std::experimental::parallelism_v2, with a formal template parameter named ExecutionPolicy.
Parallel algorithms access objects indirectly accessible via their arguments by invoking the following functions: all operations of the categories of the iterators that the algorithm is instantiated with, operations on those sequence elements that are required by its specification, user-provided function objects to be applied during the execution of the algorithm, and operations on those function objects required by the specification. These functions are herein called element access functions.

[ Example: The sort function may invoke the following element access functions: operations of the RandomAccessIterator of the actual template argument, the swap function on the elements of the sequence (as per 25.4.1.1 [sort]/2), and the user-provided Compare function object. — end example ]
Since the extensions described in this Technical Specification are
experimental and not part of the C++ Standard Library, they should not be
declared directly within namespace std. Unless otherwise specified, all
components described in this Technical Specification are declared in namespace
std::experimental::parallelism_v2.

[ Note: Once standardized, the components described herein are expected to be promoted to namespace std. — end note ]
Unless otherwise specified, references to such entities described in this
Technical Specification are assumed to be qualified with
std::experimental::parallelism_v2, and references to entities described in the C++
Standard Library are assumed to be qualified with std::.
Extensions that are expected to eventually be added to an existing header
<meow>
are provided inside the <experimental/meow>
header,
which shall include the standard contents of <meow>
as if by
#include <meow>
An implementation that provides support for this Technical Specification shall define the feature test macro(s) in Table 1.
Doc. No. | Title | Primary Section | Macro Name | Value | Header
---|---|---|---|---|---
 | | | __cpp_lib_experimental_parallel_algorithm | | <experimental/algorithm> <experimental/exception_list> <experimental/execution_policy> <experimental/numeric>
P0155R0 | Task Block R5 | | __cpp_lib_experimental_parallel_task_block | 201711 | <experimental/exception_list> <experimental/task_block>
P0076R4 | Vector and Wavefront Policies | | __cpp_lib_experimental_execution_vector_policy | 201711 | <experimental/algorithm> <experimental/execution>
P0075R2 | Template Library for Parallel For Loops | | __cpp_lib_experimental_parallel_for_loop | 201711 | <experimental/algorithm>
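[ Example: A translation unit can guard its use of these extensions with the corresponding feature test macro. The following sketch (the fallback strategy and the chosen headers and algorithm are illustrative assumptions, not requirements of this Technical Specification) uses the parallel overload only when the macro is available.

#include <algorithm>
#include <vector>

#if defined(__cpp_lib_experimental_parallel_algorithm)
#include <experimental/algorithm>
#include <experimental/execution_policy>
#endif

void sort_values(std::vector<int>& v)
{
#if defined(__cpp_lib_experimental_parallel_algorithm)
  using namespace std::experimental::parallel;
  sort(par, v.begin(), v.end());   // parallel overload from this Technical Specification
#else
  std::sort(v.begin(), v.end());   // portable sequential fallback
#endif
}
— end example ]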
This clause describes classes that are execution policy types. An object of an execution policy type indicates the kinds of parallelism allowed in the execution of an algorithm and expresses the consequent requirements on the element access functions.
[ Example:
std::vector<int> v = ...

// standard sequential sort
std::sort(v.begin(), v.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, v.begin(), v.end());

// permitting parallel execution
sort(par, v.begin(), v.end());

// permitting vectorization as well
sort(par_vec, v.begin(), v.end());

// sort with dynamically-selected execution
size_t threshold = ...
execution_policy exec = seq;
if (v.size() > threshold)
{
  exec = par;
}
sort(exec, v.begin(), v.end());
— end example ]
<experimental/execution_policy>
synopsis

#include <execution>

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  // Execution policy type trait
  template<class T> struct is_execution_policy;
  template<class T> constexpr bool is_execution_policy_v = is_execution_policy<T>::value;

  // Sequential execution policy
  class sequential_execution_policy;

  // Parallel execution policy
  class parallel_execution_policy;

  // Parallel+vector execution policy
  class parallel_vector_execution_policy;

  // Dynamic execution policy
  class execution_policy;

  namespace execution {

    // 5.2, Unsequenced execution policy
    class unsequenced_policy;

    // 5.3, Vector execution policy
    class vector_policy;

    // 5.6, Execution policy objects
    inline constexpr unsequenced_policy unseq{ unspecified };
    inline constexpr vector_policy vec{ unspecified };

  }

}
}
}
template<class T> struct is_execution_policy { see below };
is_execution_policy
can be used to detect parallel
execution policies for the purpose of excluding function signatures from
otherwise ambiguous overload resolution participation.
is_execution_policy<T>
shall be a UnaryTypeTrait with a BaseCharacteristic of true_type
if T
is the type of a standard or implementation-defined execution policy, otherwise false_type
.
The behavior of a program that adds specializations for is_execution_policy
is undefined.
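[ Example: The trait can be used to remove a function template from overload resolution unless its first argument is an execution policy. A minimal sketch (the algorithm name my_algorithm and the particular use of enable_if are illustrative assumptions):

#include <type_traits>
#include <experimental/execution_policy>

using std::experimental::parallel::is_execution_policy;

// Participates in overload resolution only when ExecutionPolicy is an
// execution policy type.
template<class ExecutionPolicy, class Iterator,
         class = typename std::enable_if<
           is_execution_policy<typename std::decay<ExecutionPolicy>::type>::value
         >::type>
void my_algorithm(ExecutionPolicy&& exec, Iterator first, Iterator last);

// Selected otherwise, for example when the first argument is an iterator.
template<class Iterator>
void my_algorithm(Iterator first, Iterator last);
— end example ]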
class sequential_execution_policy{ unspecified };
The class sequential_execution_policy
is an
execution policy type used as a unique type to disambiguate parallel
algorithm overloading and require that a parallel algorithm's execution
may not be parallelized.
class parallel_execution_policy{ unspecified };
The class parallel_execution_policy
is an execution
policy type used as a unique type to disambiguate parallel algorithm
overloading and indicate that a parallel algorithm's execution may be
parallelized.
class parallel_vector_execution_policy{ unspecified };
The class parallel_vector_execution_policy
is
an execution policy type used as a unique type to disambiguate parallel
algorithm overloading and indicate that a parallel algorithm's execution
may be vectorized and parallelized.
class unsequenced_policy{ unspecified };
The class unsequenced_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be vectorized, e.g., executed on a single thread using
instructions that operate on multiple data items.
The invocations of element access functions in parallel algorithms invoked with an execution policy of type unsequenced_policy
are permitted to execute in an unordered fashion in the calling thread,
unsequenced with respect to one another within the calling thread.
During the execution of a parallel algorithm with the experimental::execution::unsequenced_policy
policy, if the invocation of an element access function exits via an uncaught exception, terminate()
shall be called.
class vector_policy{ unspecified };
The class vector_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be vectorized. Additionally, such vectorization will
result in an execution that respects the sequencing constraints of
wavefront application ([parallel.alg.general.wavefront]). [ Note: This means the implementation provides stronger sequencing guarantees than it does for unsequenced_policy, for example. — end note ]
The invocations of element access functions in parallel algorithms invoked with an execution policy of type vector_policy
are permitted to execute in unordered fashion in the calling thread,
unsequenced with respect to one another within the calling thread,
subject to the sequencing constraints of wavefront application ([parallel.alg.general.wavefront]) for the last argument to for_loop or for_loop_strided.
During the execution of a parallel algorithm with the experimental::execution::vector_policy
policy, if the invocation of an element access function exits via an uncaught exception, terminate()
shall be called.
class execution_policy
{
  public:
    // execution_policy construct/assign
    template<class T> execution_policy(const T& exec);
    template<class T> execution_policy& operator=(const T& exec);

    // execution_policy object access
    const type_info& type() const noexcept;
    template<class T> T* get() noexcept;
    template<class T> const T* get() const noexcept;
};
The class execution_policy
is a container for execution policy objects.
execution_policy
allows dynamic control over standard algorithm execution.
[ Example:
std::vector<float> sort_me = ...

using namespace std::experimental::parallel;

execution_policy exec = seq;

if (sort_me.size() > threshold)
{
  exec = par;
}

sort(exec, std::begin(sort_me), std::end(sort_me));
— end example ]
Objects of type execution_policy
shall be constructible and assignable from objects of
type T
for which is_execution_policy<T>::value
is true
.
execution_policy construct/assign

template<class T> execution_policy(const T& exec);

Requires: is_execution_policy<T>::value is true.

Effects: Constructs an execution_policy object with a copy of exec's state.

template<class T> execution_policy& operator=(const T& exec);

Requires: is_execution_policy<T>::value is true.

Effects: Assigns a copy of exec's state to *this.

Returns: *this.

execution_policy object access

const type_info& type() const noexcept;

Returns: typeid(T), such that T is the type of the execution policy object contained by *this.

template<class T> T* get() noexcept;
template<class T> const T* get() const noexcept;

Requires: is_execution_policy<T>::value is true.

Returns: If target_type() == typeid(T), a pointer to the stored execution policy object; otherwise a null pointer.

Execution policy objects

constexpr sequential_execution_policy seq{};
constexpr parallel_execution_policy par{};
constexpr parallel_vector_execution_policy par_vec{};
constexpr execution::unsequenced_policy unseq{};
constexpr execution::vector_policy vec{};
The header <experimental/execution_policy> declares a global object associated with each type of execution policy defined by this Technical Specification.
During the execution of a standard parallel algorithm,
if temporary memory resources are required and none are available,
the algorithm throws a std::bad_alloc
exception.
During the execution of a standard parallel algorithm, if the invocation of an element access function exits via an uncaught exception, the behavior of the program is determined by the type of execution policy used to invoke the algorithm:
- If the execution policy object is of type parallel_vector_execution_policy, unsequenced_policy, or vector_policy, std::terminate shall be called.
- If the execution policy object is of type sequential_execution_policy or parallel_execution_policy, the execution of the algorithm exits via an exception. The exception shall be an exception_list containing all uncaught exceptions thrown during the invocations of element access functions, or optionally the uncaught exception if there was only one.
[ Note: For example, when for_each is executed sequentially, if an invocation of the user-provided function object throws an exception, for_each can exit via the uncaught exception, or throw an exception_list containing the original exception. — end note ]
[ Note: These guarantees imply that, unless the algorithm fails to allocate memory and exits via std::bad_alloc, all exceptions thrown during the execution of the algorithm are communicated to the caller. It is unspecified whether an algorithm implementation will "forge ahead" after encountering and capturing a user exception. — end note ]
[ Note: The algorithm may exit via the std::bad_alloc exception even if one or more user-provided function objects have exited via an exception. For example, this can happen when an algorithm fails to allocate memory while creating or adding elements to the exception_list object. — end note ]
<experimental/exception_list>
synopsis

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  class exception_list : public exception
  {
    public:
      using iterator = unspecified;

      size_t size() const noexcept;
      iterator begin() const noexcept;
      iterator end() const noexcept;

      const char* what() const noexcept override;
  };

}
}
}
The class exception_list
owns a sequence of exception_ptr
objects. The parallel
algorithms may use the
exception_list
to communicate uncaught exceptions encountered during parallel execution to the
caller of the algorithm.
The type exception_list::iterator
shall fulfill the requirements of
ForwardIterator
.
size_t size() const noexcept;

Returns: The number of exception_ptr objects contained within the exception_list.

iterator begin() const noexcept;

Returns: An iterator referring to the first exception_ptr object contained within the exception_list.

iterator end() const noexcept;

Returns: An iterator that is past the end of the owned sequence.

const char* what() const noexcept override;

Returns: An implementation-defined NTBS.
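[ Example: A caller can examine the exception_ptr objects owned by an exception_list by rethrowing each one. A minimal sketch (the reporting strategy is an illustrative assumption):

#include <exception>
#include <iostream>
#include <experimental/exception_list>

void report(const std::experimental::parallel::exception_list& el)
{
  for (std::exception_ptr p : el) {    // iterate the owned exception_ptr objects
    try {
      std::rethrow_exception(p);       // rethrow to inspect the exception
    } catch (const std::exception& e) {
      std::cerr << e.what() << '\n';
    } catch (...) {
      std::cerr << "unknown exception\n";
    }
  }
}
— end example ]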
Function objects passed into parallel algorithms as objects of type BinaryPredicate
,
Compare
, and BinaryOperation
shall not directly or indirectly modify
objects via their arguments.
Parallel algorithms have template parameters named ExecutionPolicy
which describe
the manner in which the execution of these algorithms may be parallelized and the manner in
which they apply the element access functions.
The invocations of element access functions in parallel algorithms invoked with an execution
policy object of type sequential_execution_policy
execute in sequential order in
the calling thread.
The invocations of element access functions in parallel algorithms invoked with an execution
policy object of type parallel_execution_policy
are permitted to execute in an
unordered fashion in either the invoking thread or in a thread implicitly created by the library
to support parallel algorithm execution. Any such invocations executing in the same thread are
indeterminately sequenced with respect to each other.
[ Example:
using namespace std::experimental::parallel;
int a[] = {0,1};
std::vector<int> v;
for_each(par, std::begin(a), std::end(a), [&](int i) {
  v.push_back(i*2+1);
});
The program above has a data race because of the unsynchronized access to the container v. — end example ]
[ Example:
using namespace std::experimental::parallel;
std::atomic<int> x = 0;
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int n) {
  x.fetch_add(1, std::memory_order_relaxed);
  // spin wait for another iteration to change the value of x
  while (x.load(std::memory_order_relaxed) == 1) { }
});
The above example depends on the order of execution of the iterations, and is therefore undefined (may deadlock). — end example ]
[ Example:
using namespace std::experimental::parallel;
int x = 0;
std::mutex m;
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});
The above example synchronizes access to object x ensuring that it is incremented correctly. — end example ]
The invocations of element access functions in parallel algorithms invoked with an
execution policy of type unsequenced_policy
are permitted to execute
in an unordered fashion in the calling thread, unsequenced with respect to one another
within the calling thread.
The invocations of element access functions in parallel algorithms invoked with an
execution policy of type vector_policy are permitted to execute
in an unordered fashion in the calling thread, unsequenced with respect to one another
within the calling thread, subject to the sequencing constraints of wavefront application
([parallel.alg.general.wavefront]) for the last argument to for_loop or for_loop_strided.
The invocations of element access functions in parallel algorithms invoked with an execution
policy of type parallel_vector_execution_policy
are permitted to execute in an unordered fashion in unspecified threads, and unsequenced
with respect to one another within each thread.
Since parallel_vector_execution_policy allows the execution of element access functions to be
interleaved on a single thread, blocking synchronization, including the use of mutexes, risks deadlock. Thus, the
synchronization with parallel_vector_execution_policy is restricted as follows:
A standard library function is vectorization-unsafe if it is specified to synchronize with
another function invocation, or another function invocation is specified to synchronize with it, and if
it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions
may not be invoked by user code called from parallel_vector_execution_policy
algorithms.
[ Example:
using namespace std::experimental::parallel;
int x = 0;
std::mutex m;
int a[] = {1,2};
for_each(par_vec, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});
The above program is invalid because the applications of the function object are not guaranteed to run on different threads. — end example ]
[ Note: The applications of the function object may be interleaved on a single thread, so a second call to m.lock may occur on the same thread, which may deadlock. — end note ]
[ Note: The semantics of the parallel_execution_policy or the
parallel_vector_execution_policy invocation allow the implementation to fall back to
sequential execution if the system cannot parallelize an algorithm invocation due to lack of
resources. — end note ]
Algorithms invoked with an execution policy object of type execution_policy
execute internally as if invoked with the contained execution policy object.
The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are implementation-defined.
For the purposes of this section, an evaluation is a value computation or side effect of an expression, or an execution of a statement. Initialization of a temporary object is considered a subexpression of the expression that necessitates the temporary object.
An evaluation A contains an evaluation B if:
An evaluation A is ordered before an evaluation B if A is deterministically
sequenced before B.
For the purposes of this definition, an evaluation A is a vertical antecedent of an evaluation B, where A is ordered before B and both are contained in the same invocation of an element access function, if:
- there exists an evaluation S such that S contains A, S contains all evaluations C (if any) such that A is ordered before C and C is ordered before B, but S does not contain B, and
- control reached B from A without executing any of the following:
  - a goto statement or asm declaration that jumps to a statement outside of S, or
  - a switch statement executed within S that transfers control into a substatement of a nested selection or iteration statement, or
  - a throw, or
  - a longjmp.
In the following, Xi and Xj refer to evaluations of the same expression
or statement contained in the application of an element access function corresponding to the ith and
jth elements of the input sequence.
Horizontally matched is an equivalence relationship between two evaluations of the same expression. An evaluation Bi is horizontally matched with an evaluation Bj if:
Let f be a function called for each argument list in a sequence of argument lists. Wavefront application of f requires that evaluation Ai be sequenced before evaluation Bj if i < j and:
- Ai is a vertical antecedent of Bi, and
- Bi is horizontally matched with Bj.
ExecutionPolicy
algorithm overloads
The Parallel Algorithms Library provides overloads for each of the algorithms named in
the table below, corresponding to the algorithms with the same name in the C++ Standard Algorithms Library.
For each algorithm in the table, the overloads shall have an additional template type parameter named ExecutionPolicy, which shall be the first template parameter.
In addition, each such overload shall have the new function parameter as the
first function parameter of type ExecutionPolicy&&.
Unless otherwise specified, the semantics of ExecutionPolicy
algorithm overloads
are identical to their overloads without an ExecutionPolicy.
Parallel algorithms shall not participate in overload resolution unless
is_execution_policy<decay_t<ExecutionPolicy>>::value
is true
.
adjacent_difference | adjacent_find | all_of | any_of
---|---|---|---
copy | copy_if | copy_n | count
count_if | equal | exclusive_scan | fill
fill_n | find | find_end | find_first_of
find_if | find_if_not | for_each | for_each_n
generate | generate_n | includes | inclusive_scan
inner_product | inplace_merge | is_heap | is_heap_until
is_partitioned | is_sorted | is_sorted_until | lexicographical_compare
max_element | merge | min_element | minmax_element
mismatch | move | none_of | nth_element
partial_sort | partial_sort_copy | partition | partition_copy
reduce | remove | remove_copy | remove_copy_if
remove_if | replace | replace_copy | replace_copy_if
replace_if | reverse | reverse_copy | rotate
rotate_copy | search | search_n | set_difference
set_intersection | set_symmetric_difference | set_union | sort
stable_partition | stable_sort | swap_ranges | transform
transform_exclusive_scan | transform_inclusive_scan | transform_reduce | uninitialized_copy
uninitialized_copy_n | uninitialized_fill | uninitialized_fill_n | unique
unique_copy | | |
Define GENERALIZED_SUM(op, a1, ..., aN) as follows:
- a1 when N is 1;
- op(GENERALIZED_SUM(op, b1, ..., bK), GENERALIZED_SUM(op, bM, ..., bN)) where b1, ..., bN may be any permutation of a1, ..., aN and 1 < K+1 = M ≤ N.
Define GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aN) as follows:
- a1 when N is 1;
- op(GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aK), GENERALIZED_NONCOMMUTATIVE_SUM(op, aM, ..., aN)) where 1 < K+1 = M ≤ N.
<experimental/algorithm>
synopsis

#include <algorithm>

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  template<class ExecutionPolicy, class InputIterator, class Function>
    void for_each(ExecutionPolicy&& exec,
                  InputIterator first, InputIterator last, Function f);
  template<class InputIterator, class Size, class Function>
    InputIterator for_each_n(InputIterator first, Size n, Function f);
  template<class ExecutionPolicy, class InputIterator, class Size, class Function>
    InputIterator for_each_n(ExecutionPolicy&& exec,
                             InputIterator first, Size n, Function f);

  namespace execution {
    // 7.3.6, No vec
    template<class F>
      auto no_vec(F&& f) noexcept -> decltype(std::forward<F>(f)());

    // 7.3.7, Ordered update class
    template<class T> class ordered_update_t;

    // 7.3.8, Ordered update function template
    template<class T>
      ordered_update_t<T> ordered_update(T& ref) noexcept;
  }

  // Exposition only: Suppress template argument deduction.
  template<class T> struct no_deduce { using type = T; };
  template<class T> using no_deduce_t = typename no_deduce<T>::type;

  // 7.3.2, Support for reductions
  template<class T, class BinaryOperation>
    unspecified reduction(T& var, const T& identity, BinaryOperation combiner);
  template<class T> unspecified reduction_plus(T& var);
  template<class T> unspecified reduction_multiplies(T& var);
  template<class T> unspecified reduction_bit_and(T& var);
  template<class T> unspecified reduction_bit_or(T& var);
  template<class T> unspecified reduction_bit_xor(T& var);
  template<class T> unspecified reduction_min(T& var);
  template<class T> unspecified reduction_max(T& var);

  // 7.3.3, Support for inductions
  template<class T> unspecified induction(T&& var);
  template<class T, class S> unspecified induction(T&& var, S stride);

  // 7.3.4, for_loop
  template<class I, class... Rest>
    void for_loop(no_deduce_t<I> start, I finish, Rest&&... rest);
  template<class ExecutionPolicy, class I, class... Rest>
    void for_loop(ExecutionPolicy&& exec,
                  no_deduce_t<I> start, I finish, Rest&&... rest);
  template<class I, class S, class... Rest>
    void for_loop_strided(no_deduce_t<I> start, I finish, S stride, Rest&&... rest);
  template<class ExecutionPolicy, class I, class S, class... Rest>
    void for_loop_strided(ExecutionPolicy&& exec,
                          no_deduce_t<I> start, I finish, S stride, Rest&&... rest);
  template<class I, class Size, class... Rest>
    void for_loop_n(I start, Size n, Rest&&... rest);
  template<class ExecutionPolicy, class I, class Size, class... Rest>
    void for_loop_n(ExecutionPolicy&& exec, I start, Size n, Rest&&... rest);
  template<class I, class Size, class S, class... Rest>
    void for_loop_n_strided(I start, Size n, S stride, Rest&&... rest);
  template<class ExecutionPolicy, class I, class Size, class S, class... Rest>
    void for_loop_n_strided(ExecutionPolicy&& exec, I start, Size n, S stride, Rest&&... rest);

}
}
}
Each of the function templates in this subclause ([parallel.alg.reductions]) returns a reduction object of unspecified type having a reduction value type and encapsulating a reduction identity value for the reduction, a combiner function object, and a live-out object from which the initial value is obtained and into which the final value is stored.
An algorithm uses reduction objects by allocating an unspecified number of instances, known as accumulators, of the reduction value
type.
Modifications to the accumulator by the application of element
access functions accrue as partial results. At some point before the
algorithm
returns, the partial results are combined, two at a time, using
the reduction object’s combiner operation until a single value remains,
which
is then assigned back to the live-out object.

[ Note: In order to produce useful results, modifications to the accumulator should be limited to commutative operations closely related to the combiner operation. For example, if the combiner is plus<T>, incrementing the accumulator would be consistent with the combiner but doubling it or assigning to it would not. — end note ]
template<class T, class BinaryOperation>
unspecified reduction(T& var, const T& identity, BinaryOperation combiner);
Requires: T shall meet the requirements of CopyConstructible and MoveAssignable. The expression var = combiner(var, var) shall be well-formed.

Returns: A reduction object of unspecified type having reduction value type T, reduction identity identity, combiner function object combiner, and using the object referenced by var as its live-out object.

template<class T> unspecified reduction_plus(T& var);
template<class T> unspecified reduction_multiplies(T& var);
template<class T> unspecified reduction_bit_and(T& var);
template<class T> unspecified reduction_bit_or(T& var);
template<class T> unspecified reduction_bit_xor(T& var);
template<class T> unspecified reduction_min(T& var);
template<class T> unspecified reduction_max(T& var);
Requires: T shall meet the requirements of CopyConstructible and MoveAssignable.

Returns: A reduction object of unspecified type having reduction value type T, reduction identity and combiner operation as specified in the table below, and using the object referenced by var as its live-out object.

Function | Reduction Identity | Combiner Operation
---|---|---
reduction_plus | T() | x + y
reduction_multiplies | T(1) | x * y
reduction_bit_and | (~T()) | x & y
reduction_bit_or | T() | x \| y
reduction_bit_xor | T() | x ^ y
reduction_min | var | min(x, y)
reduction_max | var | max(x, y)
[ Example: The following code updates each element of y and sets s to the sum of the squares.

extern int n;
extern float x[], y[], a;
float s = 0;
for_loop(execution::vec, 0, n,
  reduction(s, 0.0f, plus<>()),
  [&](int i, float& accum) {
    y[i] += a*x[i];
    accum += y[i]*y[i];
  }
);
— end example ]
Each of the function templates in this section returns an induction object of unspecified type having an induction value type and encapsulating an initial value i of that type and, optionally, a stride.
For each element in the input range, an algorithm over input sequence S computes an induction value from an induction variable and ordinal position p within S by the formula i + p * stride if a stride was specified or i + p otherwise. This induction value is passed to the element access function.
An induction object may refer to a live-out object to hold the final value of the induction sequence. When the algorithm using the induction object completes, the live-out object is assigned the value i + n * stride, where n is the number of elements in the input range.
template<class T>
unspecified induction(T&& var); template<class T, class S>
unspecified induction(T&& var, S stride);
Returns: An induction object of unspecified type having induction value type remove_cv_t<remove_reference_t<T>>, initial value var, and (if specified) stride stride. If T is an lvalue reference to non-const type, then the object referenced by var becomes the live-out object for the induction object; otherwise there is no live-out object.
template<class I, class... Rest>
void for_loop(no_deduce_t<I> start, I finish, Rest&&... rest); template<class ExecutionPolicy,
class I, class... Rest>
void for_loop(ExecutionPolicy&& exec,
no_deduce_t<I> start, I finish, Rest&&... rest);
template<class I, class S, class... Rest>
void for_loop_strided(no_deduce_t<I> start, I finish,
S stride, Rest&&... rest); template<class ExecutionPolicy,
class I, class S, class... Rest>
void for_loop_strided(ExecutionPolicy&& exec,
no_deduce_t<I> start, I finish,
S stride, Rest&&... rest);
template<class I, class Size, class... Rest>
void for_loop_n(I start, Size n, Rest&&... rest); template<class ExecutionPolicy,
class I, class Size, class... Rest>
void for_loop_n(ExecutionPolicy&& exec,
I start, Size n, Rest&&... rest);
template<class I, class Size, class S, class... Rest>
void for_loop_n_strided(I start, Size n, S stride, Rest&&... rest); template<class ExecutionPolicy,
class I, class Size, class S, class... Rest>
void for_loop_n_strided(ExecutionPolicy&& exec,
I start, Size n, S stride, Rest&&... rest);
Requires: For the overloads with an ExecutionPolicy, I shall be an integral type
or meet the requirements of a forward iterator type; otherwise, I shall be an integral
type or meet the requirements of an input iterator type. Size shall be an integral type
and n shall be non-negative. S shall have integral type and stride
shall have non-zero value. stride shall be negative only if I has integral
type or meets the requirements of a bidirectional iterator. The rest parameter pack shall
have at least one element, comprising objects returned by invocations of reduction
([parallel.alg.reduction]) and/or induction ([parallel.alg.induction]) function templates
followed by exactly one invocable element-access function, f. For the overloads with an
ExecutionPolicy, f shall meet the requirements of CopyConstructible;
otherwise, f shall meet the requirements of MoveConstructible.
Effects: Applies f to each element in the input sequence, as described below, with additional arguments corresponding to the reductions and inductions in the rest parameter pack. The length of the input sequence is:
- n, if specified,
- otherwise finish - start if neither n nor stride is specified,
- otherwise 1 + (finish-start-1)/stride if stride is positive,
- otherwise 1 + (start-finish-1)/-stride.
The first element in the input sequence is start. Each subsequent element is generated by adding
stride to the previous element, if stride is specified, otherwise by incrementing
the previous element. [ Note: If I is an
iterator type, the iterators in the input sequence are not dereferenced before
being passed to f. — end note ]
If an element of the rest parameter pack is an object returned by a call to induction, then the additional argument passed to the corresponding application of f is the induction value for that induction object corresponding to the position of the application of f in the input sequence.
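[ Example: The following sketch (the arrays and their extents are illustrative assumptions) uses an induction object with for_loop; each application of the element access function receives the loop element i and the corresponding induction value j = 2*i as an additional argument.

#include <experimental/algorithm>

extern int n;
extern float in[], out[];

void gather_every_other()
{
  using namespace std::experimental::parallel;

  // Gather every other element of in into consecutive elements of out;
  // induction supplies the values 0, 2, 4, ... as the second argument j.
  for_loop(execution::vec, 0, n,
    induction(0, 2),
    [&](int i, int j) {
      out[i] = in[j];
    });
}
— end example ]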
template<class ExecutionPolicy,
class InputIterator, class Function>
void for_each(ExecutionPolicy&& exec,
InputIterator first, InputIterator last,
Function f);
Effects: Applies f to the result of dereferencing every iterator in the range [first,last).

[ Note: If the type of first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]
Complexity: Applies f exactly last - first times.

Remarks: If f returns a result, the result is ignored.

[ Note: Unlike its sequential form, the parallel overload of for_each does not return a copy of its Function parameter, since parallelization may not permit efficient state accumulation. — end note ]

Requires: Unlike its sequential form, the parallel overload of for_each requires Function to meet the requirements of CopyConstructible.
template<class InputIterator, class Size, class Function>
InputIterator for_each_n(InputIterator first, Size n,
Function f);
Requires: Function shall meet the requirements of MoveConstructible. [ Note: Function need not meet the requirements of CopyConstructible. — end note ]
Effects: Applies f to the result of dereferencing every iterator in the range [first,first + n), starting from first and proceeding to first + n - 1.

[ Note: If the type of first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]
Returns: first + n for non-negative values of n and first for negative values.

Remarks: If f returns a result, the result is ignored.
template<class ExecutionPolicy,
class InputIterator, class Size, class Function>
InputIterator for_each_n(ExecutionPolicy && exec,
InputIterator first, Size n,
Function f);
Effects: Applies f to the result of dereferencing every iterator in the range [first,first + n), starting from first and proceeding to first + n - 1.

[ Note: If the type of first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]
Returns: first + n for non-negative values of n and first for negative values.

Remarks: If f returns a result, the result is ignored.

Requires: Unlike its sequential form, the parallel overload of for_each_n requires Function to meet the requirements of CopyConstructible.
template<class F>
auto no_vec(F&& f) noexcept -> decltype(std::forward<F>(f)());
Effects: Evaluates std::forward<F>(f)(). When invoked within an element access function
in a parallel algorithm using vector_policy, if two calls to no_vec are
horizontally matched within a wavefront application of an element access function over input
sequence S, then the execution of f in the application for one element in S is
sequenced before the execution of f in the application for a subsequent element in
S; otherwise, there is no effect on sequencing.
Returns: The result of f.

Remarks: If f exits via an exception, then terminate will be called, consistent with all other potentially-throwing operations invoked with vector_policy execution.
[ Example:
extern int* p;
for_loop(vec, 0, n,
  [&](int i) {
    y[i] += y[i+1];
    if (y[i] < 0) {
      no_vec([&]{ *p++ = i; });
    }
  });
The updates *p++ = i will occur in the same order as if the policy were seq. — end example ]
template<class T>
class ordered_update_t
{
    T& ref_;  // exposition only

  public:
    ordered_update_t(T& loc) noexcept : ref_(loc) {}
    ordered_update_t(const ordered_update_t&) = delete;
    ordered_update_t& operator=(const ordered_update_t&) = delete;

    template <class U> auto operator=(U rhs) const noexcept { return no_vec([&]{ return ref_ = std::move(rhs); }); }
    template <class U> auto operator+=(U rhs) const noexcept { return no_vec([&]{ return ref_ += std::move(rhs); }); }
    template <class U> auto operator-=(U rhs) const noexcept { return no_vec([&]{ return ref_ -= std::move(rhs); }); }
    template <class U> auto operator*=(U rhs) const noexcept { return no_vec([&]{ return ref_ *= std::move(rhs); }); }
    template <class U> auto operator/=(U rhs) const noexcept { return no_vec([&]{ return ref_ /= std::move(rhs); }); }
    template <class U> auto operator%=(U rhs) const noexcept { return no_vec([&]{ return ref_ %= std::move(rhs); }); }
    template <class U> auto operator>>=(U rhs) const noexcept { return no_vec([&]{ return ref_ >>= std::move(rhs); }); }
    template <class U> auto operator<<=(U rhs) const noexcept { return no_vec([&]{ return ref_ <<= std::move(rhs); }); }
    template <class U> auto operator&=(U rhs) const noexcept { return no_vec([&]{ return ref_ &= std::move(rhs); }); }
    template <class U> auto operator^=(U rhs) const noexcept { return no_vec([&]{ return ref_ ^= std::move(rhs); }); }
    template <class U> auto operator|=(U rhs) const noexcept { return no_vec([&]{ return ref_ |= std::move(rhs); }); }

    auto operator++() const noexcept    { return no_vec([&]{ return ++ref_; }); }
    auto operator++(int) const noexcept { return no_vec([&]{ return ref_++; }); }
    auto operator--() const noexcept    { return no_vec([&]{ return --ref_; }); }
    auto operator--(int) const noexcept { return no_vec([&]{ return ref_--; }); }
};
An object of type ordered_update_t<T> is a proxy for an object of type T
intended to be used within a parallel application of an element access function using a
policy object of type vector_policy. Simple increments, assignments, and compound
assignments to the object are forwarded to the proxied object, but are sequenced as though
executed within a no_vec invocation.
template<class T>
ordered_update_t<T> ordered_update(T& loc) noexcept;

Returns: { loc }.
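[ Example: The following sketch (the array, its extent, and the counter are illustrative assumptions) uses ordered_update so that the increments of a shared counter inside a vec loop are sequenced as though the loop were executed with the seq policy.

#include <experimental/algorithm>

extern int n;
extern float a[];
extern int negatives;

void count_negatives()
{
  using namespace std::experimental::parallel;

  for_loop(execution::vec, 0, n,
    [&](int i) {
      if (a[i] < 0.0f) {
        // Equivalent to ++negatives executed inside a no_vec invocation.
        execution::ordered_update(negatives) += 1;
      }
    });
}
— end example ]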
<experimental/numeric>
synopsis

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  template<class InputIterator>
    typename iterator_traits<InputIterator>::value_type
      reduce(InputIterator first, InputIterator last);
  template<class ExecutionPolicy, class InputIterator>
    typename iterator_traits<InputIterator>::value_type
      reduce(ExecutionPolicy&& exec, InputIterator first, InputIterator last);
  template<class InputIterator, class T>
    T reduce(InputIterator first, InputIterator last, T init);
  template<class ExecutionPolicy, class InputIterator, class T>
    T reduce(ExecutionPolicy&& exec, InputIterator first, InputIterator last, T init);
  template<class InputIterator, class T, class BinaryOperation>
    T reduce(InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class T, class BinaryOperation>
    T reduce(ExecutionPolicy&& exec, InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator, class T>
    OutputIterator exclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result, T init);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator, class T>
    OutputIterator exclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result, T init);
  template<class InputIterator, class OutputIterator, class T, class BinaryOperation>
    OutputIterator exclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result, T init, BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator, class T,
           class BinaryOperation>
    OutputIterator exclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result, T init, BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator>
    OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator>
    OutputIterator inclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result);
  template<class InputIterator, class OutputIterator, class BinaryOperation>
    OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result, BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class BinaryOperation>
    OutputIterator inclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result, BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator, class BinaryOperation, class T>
    OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result, BinaryOperation binary_op, T init);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class BinaryOperation, class T>
    OutputIterator inclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result, BinaryOperation binary_op, T init);

  template<class InputIterator, class UnaryOperation, class T, class BinaryOperation>
    T transform_reduce(InputIterator first, InputIterator last,
                       UnaryOperation unary_op, T init, BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class UnaryOperation, class T,
           class BinaryOperation>
    T transform_reduce(ExecutionPolicy&& exec, InputIterator first, InputIterator last,
                       UnaryOperation unary_op, T init, BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator, class UnaryOperation, class T,
           class BinaryOperation>
    OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            T init, BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class UnaryOperation, class T, class BinaryOperation>
    OutputIterator transform_exclusive_scan(ExecutionPolicy&& exec,
                                            InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            T init, BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator, class UnaryOperation,
           class BinaryOperation>
    OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class UnaryOperation, class BinaryOperation>
    OutputIterator transform_inclusive_scan(ExecutionPolicy&& exec,
                                            InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator, class UnaryOperation,
           class BinaryOperation, class T>
    OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            BinaryOperation binary_op, T init);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class UnaryOperation, class BinaryOperation, class T>
    OutputIterator transform_inclusive_scan(ExecutionPolicy&& exec,
                                            InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            BinaryOperation binary_op, T init);

}
}
}
template<class InputIterator>
typename iterator_traits<InputIterator>::value_type
reduce(InputIterator first, InputIterator last);
Effects: Same as reduce(first, last, typename iterator_traits<InputIterator>::value_type{}).

template<class InputIterator, class T>
T reduce(InputIterator first, InputIterator last, T init);
Effects: Same as reduce(first, last, init, plus<>()).

template<class InputIterator, class T, class BinaryOperation>
T reduce(InputIterator first, InputIterator last, T init,
BinaryOperation binary_op);
Returns: GENERALIZED_SUM(binary_op, init, *first, ..., *(first + (last - first) - 1)).

Requires: binary_op shall not invalidate iterators or subranges, nor modify elements in the range [first,last).

Complexity: O(last - first) applications of binary_op.

[ Note: The primary difference between reduce and accumulate is that the behavior of reduce may be non-deterministic for non-associative or non-commutative binary_op. — end note ]
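[ Example: A minimal sketch of the remark above (the values are illustrative assumptions): because reduce may regroup and reorder the applications of binary_op, only an associative and commutative operation is guaranteed to produce the same value as accumulate.

#include <functional>
#include <vector>
#include <experimental/numeric>
#include <experimental/execution_policy>

void reduce_example(const std::vector<int>& v)
{
  using namespace std::experimental::parallel;

  // plus<> is associative and commutative: the grouping chosen by the
  // implementation cannot change the result.
  int sum = reduce(par, v.begin(), v.end(), 0, std::plus<>());

  // minus<> is neither associative nor commutative: the result may differ
  // from std::accumulate and may vary between executions.
  int diff = reduce(par, v.begin(), v.end(), 0, std::minus<>());

  (void)sum;
  (void)diff;
}
— end example ]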
template<class InputIterator, class OutputIterator, class T>
OutputIterator exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
T init);
Effects: Same as exclusive_scan(first, last, result, init, plus<>()).

template<class InputIterator, class OutputIterator, class T, class BinaryOperation>
OutputIterator exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
T init, BinaryOperation binary_op);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., *(first + (i - result) - 1)).

Returns: The end of the resulting range beginning at result.

Requires: binary_op shall not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications of binary_op.

[ Note: The difference between exclusive_scan and inclusive_scan is that exclusive_scan excludes the ith input element from the ith sum. If binary_op is not mathematically associative, the behavior of exclusive_scan may be non-deterministic. — end note ]
template<class InputIterator, class OutputIterator>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result);
Effects: Same as inclusive_scan(first, last, result, plus<>()).
template<class InputIterator, class OutputIterator, class BinaryOperation>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
BinaryOperation binary_op);

template<class InputIterator, class OutputIterator, class BinaryOperation, class T>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
BinaryOperation binary_op, T init);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, *first, ..., *(first + (i - result))) or GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., *(first + (i - result))) if init is provided.

Returns: The end of the resulting range beginning at result.

Requires: binary_op shall not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications of binary_op.

[ Note: The difference between exclusive_scan and inclusive_scan is that inclusive_scan includes the ith input element in the ith sum. If binary_op is not mathematically associative, the behavior of inclusive_scan may be non-deterministic. — end note ]
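[ Example: For the input sequence {1, 2, 3, 4} and the operation plus<>, an exclusive scan with initial value 0 produces {0, 1, 3, 6}, whereas an inclusive scan produces {1, 3, 6, 10}. A minimal sketch (the buffers are illustrative assumptions):

#include <functional>
#include <vector>
#include <experimental/numeric>

void scan_example()
{
  using namespace std::experimental::parallel;

  std::vector<int> in = {1, 2, 3, 4};
  std::vector<int> ex(in.size());
  std::vector<int> inc(in.size());

  // ex becomes {0, 1, 3, 6}: the ith input is excluded from the ith sum.
  exclusive_scan(in.begin(), in.end(), ex.begin(), 0, std::plus<>());

  // inc becomes {1, 3, 6, 10}: the ith input is included in the ith sum.
  inclusive_scan(in.begin(), in.end(), inc.begin(), std::plus<>());
}
— end example ]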
template<class InputIterator, class UnaryOperation, class T, class BinaryOperation>
T transform_reduce(InputIterator first, InputIterator last,
UnaryOperation unary_op, T init, BinaryOperation binary_op);
Returns: GENERALIZED_SUM(binary_op, init, unary_op(*first), ..., unary_op(*(first + (last - first) - 1))).

Requires: Neither unary_op nor binary_op shall invalidate subranges, or modify elements in the range [first,last).

Complexity: O(last - first) applications each of unary_op and binary_op.

[ Note: transform_reduce does not apply unary_op to init. — end note ]

template<class InputIterator, class OutputIterator,
class UnaryOperation,
class T, class BinaryOperation>
OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
UnaryOperation unary_op,
T init, BinaryOperation binary_op);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, unary_op(*first), ..., unary_op(*(first + (i - result) - 1))).

Returns: The end of the resulting range beginning at result.

Requires: Neither unary_op nor binary_op shall invalidate iterators or subranges, or modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications each of unary_op and binary_op.

[ Note: The difference between transform_exclusive_scan and transform_inclusive_scan is that transform_exclusive_scan excludes the ith input element from the ith sum. If binary_op is not mathematically associative, the behavior of transform_exclusive_scan may be non-deterministic. transform_exclusive_scan does not apply unary_op to init. — end note ]
template<class InputIterator, class OutputIterator,
class UnaryOperation,
class BinaryOperation>
OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
UnaryOperation unary_op,
BinaryOperation binary_op);

template<class InputIterator, class OutputIterator,
class UnaryOperation,
class BinaryOperation, class T>
OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
UnaryOperation unary_op,
BinaryOperation binary_op, T init);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, unary_op(*first), ..., unary_op(*(first + (i - result)))) or GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, unary_op(*first), ..., unary_op(*(first + (i - result)))) if init is provided.

Returns: The end of the resulting range beginning at result.

Requires: Neither unary_op nor binary_op shall invalidate iterators or subranges, or modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications each of unary_op and binary_op.

[ Note: The difference between transform_exclusive_scan and transform_inclusive_scan is that transform_inclusive_scan includes the ith input element in the ith sum. If binary_op is not mathematically associative, the behavior of transform_inclusive_scan may be non-deterministic. transform_inclusive_scan does not apply unary_op to init. — end note ]
<experimental/task_block>
synopsis

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  class task_cancelled_exception;

  class task_block;

  template<class F>
    void define_task_block(F&& f);

  template<class F>
    void define_task_block_restore_thread(F&& f);

}
}
}
task_cancelled_exception
namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  class task_cancelled_exception : public exception
  {
    public:
      task_cancelled_exception() noexcept;
      virtual const char* what() const noexcept override;
  };

}
}
}
The class task_cancelled_exception defines the type of objects thrown by
task_block::run or task_block::wait if they detect that an
exception is pending within the current parallel block. See the discussion of exception handling in task blocks, below.

task_cancelled_exception member function what
virtual const char* what() const noexcept
task_block
namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  class task_block
  {
    private:
      ~task_block();

    public:
      task_block(const task_block&) = delete;
      task_block& operator=(const task_block&) = delete;
      void operator&() const = delete;

      template<class F>
        void run(F&& f);

      void wait();
  };

}
}
}
The class task_block
defines an interface for forking and joining parallel tasks. The define_task_block
and define_task_block_restore_thread
function templates create an object of type task_block
and pass a reference to that object to a user-provided function object.
An object of class task_block
cannot be constructed,
destroyed, copied, or moved except by the implementation of the task
block library. Taking the address of a task_block
object via operator&
is ill-formed. Obtaining its address by any other means (including addressof
) results in a pointer with an unspecified value; dereferencing such a pointer results in undefined behavior.
A task_block
is active if it was created by the nearest enclosing task block, where “task block” refers to an
invocation of define_task_block
or define_task_block_restore_thread
and “nearest enclosing” means the most
recent invocation that has not yet completed. Code designated for execution in another thread by means other
than the facilities in this section (e.g., using thread
or async
) is not enclosed in the task block and a
task_block
passed to (or captured by) such code is not active within that code. Performing any operation on a
task_block
that is not active results in undefined behavior.
When the argument to task_block::run
is called, no task_block
is active, not even the task_block
on which run
was called.
(The function object should not, therefore, capture a task_block
from the surrounding block.)
[ Example:
define_task_block([&](auto& tb) {
  tb.run([&]{
    tb.run([] { f(); });               // Error: tb is not active within run
    define_task_block([&](auto& tb2) { // Define new task block
      tb2.run(f);
      ...
    });
  });
  ...
});
— end example ]
task_block
member function template run
template<class F> void run(F&& f);
Requires: F shall be MoveConstructible. DECAY_COPY(std::forward<F>(f))() shall be a valid expression.

Precondition: *this shall be the active task_block.
Effects: Evaluates DECAY_COPY(std::forward<F>(f))()
, where DECAY_COPY(std::forward<F>(f))
is evaluated synchronously within the current thread. The call to the resulting copy of the function object is
permitted to run on an unspecified thread created by the implementation in an unordered fashion relative to
the sequence of operations following the call to run(f)
(the continuation), or indeterminately sequenced
within the same thread as the continuation. The call to run
synchronizes with the call to the function
object. The completion of the call to the function object synchronizes with the next invocation of wait
on
the same task_block
or completion of the nearest enclosing task block (i.e., the define_task_block
or
define_task_block_restore_thread
that created this task_block
).
Throws: task_cancelled_exception, as described in the section on exception handling in task blocks, below.

Remarks: The run function may return on a thread other than the one on which it was called; in such cases,
completion of the call to run synchronizes with the continuation. [ Note: The return from run is ordered similarly to an ordinary function call in a single thread. — end note ]
[ Note: The invocation of the user-supplied function object f may be immediate or may be delayed until
compute resources are available. run might or might not return before the invocation of f completes. — end note ]
task_block
member function wait
void wait();
Precondition: *this shall be the active task_block.

Effects: Blocks until the tasks spawned using this task_block have completed.
Throws: task_cancelled_exception, as described in the section on exception handling in task blocks, below.

Remarks: The wait function may return on a thread other than the one on which it was called; in such cases, completion of the call to wait synchronizes with subsequent operations.
[ Example:
define_task_block([&](auto& tb) {
  tb.run([&]{ process(a, w, x); }); // Process a[w] through a[x]
  if (y < x) tb.wait();             // Wait if overlap between [w,x) and [y,z)
  process(a, y, z);                 // Process a[y] through a[z]
});
— end example ]
define_task_block
template<class F>
void define_task_block(F&& f);
template<class F>
void define_task_block_restore_thread(F&& f);
Requires: Given an lvalue tb of type task_block, the expression f(tb) shall be well-formed.

Effects: Constructs a task_block tb and calls f(tb).
Throws: exception_list, as specified in the section on exception handling in task blocks, below.

Postcondition: All tasks spawned from f have finished execution.
Remarks: The define_task_block function may return on a thread other than the one on which it was called,
unless there are no task blocks active on entry to define_task_block (see the definition of an active task_block above). When define_task_block returns on a different thread,
it synchronizes with operations following the call. The define_task_block_restore_thread
function always returns on the same thread as the one on which it was called.
[ Note: It is expected (but not mandated) that f will (directly or indirectly) call tb.run(function-object). — end note ]
Every task_block
has an associated exception list. When the task block starts, its associated exception list is empty.
When an exception is thrown from the user-provided function object passed to define_task_block
or
define_task_block_restore_thread
, it is added to the exception list for that task block. Similarly, when
an exception is thrown from the user-provided function object passed into task_block::run
, the exception
object is added to the exception list associated with the nearest enclosing task block. In both cases, an
implementation may discard any pending tasks that have not yet been invoked. Tasks that are already in
progress are not interrupted except at a call to task_block::run
or task_block::wait
as described below.
If the implementation is able to detect that an exception has been thrown by another task within
the same nearest enclosing task block, then task_block::run
or task_block::wait
may throw
task_cancelled_exception; these instances of task_cancelled_exception
are not added to the exception
list of the corresponding task block.
When a task block finishes with a non-empty exception list, the exceptions are aggregated into an exception_list
object, which is then thrown from the task block.
The order of the exceptions in the exception_list
object is unspecified.
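[ Example: Exceptions thrown by tasks spawned within a task block are collected and reported as an exception_list when the task block completes. A minimal sketch of observing this behavior (the work functions are illustrative assumptions):

#include <experimental/exception_list>
#include <experimental/task_block>

extern void may_throw_1();
extern void may_throw_2();

void run_tasks()
{
  using namespace std::experimental::parallel;

  try {
    define_task_block([&](task_block& tb) {
      tb.run([]{ may_throw_1(); });   // an uncaught exception here is added to
      tb.run([]{ may_throw_2(); });   // the task block's exception list
    });
  } catch (const exception_list& el) {
    // el owns one exception_ptr for each uncaught exception thrown by a task.
  }
}
— end example ]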