Document Number: | |
---|---|
Date: | |
Revises: | |
Editor: | NVIDIA Corporation |
Note: this is an early draft. It’s known to be incomplet and incorrekt, and it has lots of bad formatting.
This Technical Specification describes requirements for implementations of an interface that computer programs written in the C++ programming language may use to invoke algorithms with parallel execution. The algorithms described by this Technical Specification are realizable across a broad class of computer architectures.
This Technical Specification is non-normative. Some of the functionality described by this Technical Specification may be considered for standardization in a future version of C++, but it is not currently part of any C++ standard. Some of the functionality in this Technical Specification may never be standardized, and other functionality may be standardized in a substantially changed form.
The goal of this Technical Specification is to build widespread existing practice for parallelism in the C++ standard algorithms library. It gives advice on extensions to those vendors who wish to provide them.
The following referenced document is indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 14882:2017, Programming Languages — C++, is herein called the C++ Standard.
The library described in ISO/IEC 14882:2017 clauses 20-33 is herein called
the C++ Standard Library. The C++ Standard Library components described in
ISO/IEC 14882:2017 clauses 28, 29.8 and 23.10.10 are herein called the C++ Standard
Algorithms Library.
Unless otherwise specified, the whole of the C++ Standard's Library
introduction (ISO/IEC 14882:2017, Clause 20) is included into this
Technical Specification by reference.
For the purposes of this document, the terms and definitions given in the C++ Standard and the following apply.
A parallel algorithm is a function template described by this Technical Specification, declared in namespace std::experimental::parallelism_v2, with a formal template parameter named ExecutionPolicy.
Parallel algorithms access objects indirectly accessible via their arguments by invoking the following functions: all operations of the categories of the iterators that the algorithm is instantiated with, operations on those sequence elements that are required by its specification, user-provided function objects to be applied during the execution of the algorithm, and operations on those function objects required by the specification. These functions are herein called element access functions.

[ Example: The sort function may invoke the following element access functions: operations of the RandomAccessIterator of the actual template argument, the swap function on the elements of the sequence (as per 25.4.1.1 [sort]/2), and the user-provided Compare function object. — end example ]
Since the extensions described in this Technical Specification are
experimental and not part of the C++ Standard Library, they should not be
declared directly within namespace std. Unless otherwise specified, all
components described in this Technical Specification are declared in namespace
std::experimental::parallelism_v2.

[ Note: Once standardized, the components described herein are expected to be promoted to namespace std. — end note ]
Unless otherwise specified, references to such entities described in this
Technical Specification are assumed to be qualified with
std::experimental::parallelism_v2, and references to entities described in the C++
Standard Library are assumed to be qualified with std::.
Extensions that are expected to eventually be added to an existing header
<meow>
are provided inside the <experimental/meow>
header,
which shall include the standard contents of <meow>
as if by
#include <meow>
An implementation that provides support for this Technical Specification shall define the feature test macro(s) in Table 1.
Doc. No. | Title | Primary Section | Macro Name | Value | Header
---|---|---|---|---|---
 | | | __cpp_lib_experimental_parallel_algorithm | | <experimental/algorithm> <experimental/exception_list> <experimental/execution_policy> <experimental/numeric>
P0155R0 | Task Block R5 | | __cpp_lib_experimental_parallel_task_block | 201711 | <experimental/exception_list> <experimental/task_block>
P0076R4 | Vector and Wavefront Policies | | __cpp_lib_experimental_execution_vector_policy | 201711 | <experimental/algorithm> <experimental/execution>
P0075R2 | Template Library for Parallel For Loops | | __cpp_lib_experimental_parallel_for_loop | 201711 | <experimental/algorithm>
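[ Example: A translation unit can guard its use of these extensions with the corresponding feature test macro. The following sketch (the fallback strategy and the chosen headers and algorithm are illustrative assumptions, not requirements of this Technical Specification) uses the parallel overload only when the macro is available.

#include <algorithm>
#include <vector>

#if defined(__cpp_lib_experimental_parallel_algorithm)
#include <experimental/algorithm>
#include <experimental/execution_policy>
#endif

void sort_values(std::vector<int>& v)
{
#if defined(__cpp_lib_experimental_parallel_algorithm)
  using namespace std::experimental::parallel;
  sort(par, v.begin(), v.end());   // parallel overload from this Technical Specification
#else
  std::sort(v.begin(), v.end());   // portable sequential fallback
#endif
}
— end example ]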
This clause describes classes that are execution policy types. An object of an execution policy type indicates the kinds of parallelism allowed in the execution of an algorithm and expresses the consequent requirements on the element access functions.
[ Example:
std::vector<int> v = ...

// standard sequential sort
std::sort(v.begin(), v.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, v.begin(), v.end());

// permitting parallel execution
sort(par, v.begin(), v.end());

// permitting vectorization as well
sort(par_vec, v.begin(), v.end());

// sort with dynamically-selected execution
size_t threshold = ...
execution_policy exec = seq;
if (v.size() > threshold)
{
  exec = par;
}
sort(exec, v.begin(), v.end());
— end example ]
<experimental/execution_policy>
synopsis

#include <execution>

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  // Execution policy type trait
  template<class T> struct is_execution_policy;
  template<class T> constexpr bool is_execution_policy_v = is_execution_policy<T>::value;

  // Sequential execution policy
  class sequential_execution_policy;

  // Parallel execution policy
  class parallel_execution_policy;

  // Parallel+vector execution policy
  class parallel_vector_execution_policy;

  // Dynamic execution policy
  class execution_policy;

  namespace execution {

    // 5.2, Unsequenced execution policy
    class unsequenced_policy;

    // 5.3, Vector execution policy
    class vector_policy;

    // 5.6, Execution policy objects
    inline constexpr unsequenced_policy unseq{ unspecified };
    inline constexpr vector_policy vec{ unspecified };

  }

}
}
}
template<class T> struct is_execution_policy { see below };
is_execution_policy
can be used to detect parallel
execution policies for the purpose of excluding function signatures from
otherwise ambiguous overload resolution participation.
is_execution_policy<T>
shall be a UnaryTypeTrait with a BaseCharacteristic of true_type
if T
is the type of a standard or implementation-defined execution policy, otherwise false_type
.
The behavior of a program that adds specializations for is_execution_policy
is undefined.
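[ Example: The trait can be used to remove a function template from overload resolution unless its first argument is an execution policy. A minimal sketch (the algorithm name my_algorithm and the particular use of enable_if are illustrative assumptions):

#include <type_traits>
#include <experimental/execution_policy>

using std::experimental::parallel::is_execution_policy;

// Participates in overload resolution only when ExecutionPolicy is an
// execution policy type.
template<class ExecutionPolicy, class Iterator,
         class = typename std::enable_if<
           is_execution_policy<typename std::decay<ExecutionPolicy>::type>::value
         >::type>
void my_algorithm(ExecutionPolicy&& exec, Iterator first, Iterator last);

// Selected otherwise, for example when the first argument is an iterator.
template<class Iterator>
void my_algorithm(Iterator first, Iterator last);
— end example ]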
class sequential_execution_policy{ unspecified };
The class sequential_execution_policy
is an
execution policy type used as a unique type to disambiguate parallel
algorithm overloading and require that a parallel algorithm's execution
may not be parallelized.
class parallel_execution_policy{ unspecified };
The class parallel_execution_policy
is an execution
policy type used as a unique type to disambiguate parallel algorithm
overloading and indicate that a parallel algorithm's execution may be
parallelized.
class parallel_vector_execution_policy{ unspecified };
The class parallel_vector_execution_policy
is
an execution policy type used as a unique type to disambiguate parallel
algorithm overloading and indicate that a parallel algorithm's execution
may be vectorized and parallelized.
class unsequenced_policy{ unspecified };
The class unsequenced_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be vectorized, e.g., executed on a single thread using
instructions that operate on multiple data items.
The invocations of element access functions in parallel algorithms invoked with an execution policy of type unsequenced_policy
are permitted to execute in an unordered fashion in the calling thread,
unsequenced with respect to one another within the calling thread.
During the execution of a parallel algorithm with the experimental::execution::unsequenced_policy
policy, if the invocation of an element access function exits via an uncaught exception, terminate()
shall be called.
class vector_policy{ unspecified };
The class vector_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be vectorized. Additionally, such vectorization will
result in an execution that respects the sequencing constraints of
wavefront application ([parallel.alg.general.wavefront]). [ Note: This means the implementation provides stronger sequencing guarantees than it does for unsequenced_policy, for example. — end note ]
The invocations of element access functions in parallel algorithms invoked with an execution policy of type vector_policy
are permitted to execute in unordered fashion in the calling thread,
unsequenced with respect to one another within the calling thread,
subject to the sequencing constraints of wavefront application ([parallel.alg.general.wavefront]) for the last argument to for_loop or for_loop_strided.
During the execution of a parallel algorithm with the experimental::execution::vector_policy
policy, if the invocation of an element access function exits via an uncaught exception, terminate()
shall be called.
class execution_policy
{
  public:
    // execution_policy construct/assign
    template<class T> execution_policy(const T& exec);
    template<class T> execution_policy& operator=(const T& exec);

    // execution_policy object access
    const type_info& type() const noexcept;
    template<class T> T* get() noexcept;
    template<class T> const T* get() const noexcept;
};
The class execution_policy
is a container for execution policy objects.
execution_policy
allows dynamic control over standard algorithm execution.
[ Example:
std::vector<float> sort_me = ...

using namespace std::experimental::parallel;

execution_policy exec = seq;

if (sort_me.size() > threshold)
{
  exec = par;
}

sort(exec, std::begin(sort_me), std::end(sort_me));
— end example ]
Objects of type execution_policy
shall be constructible and assignable from objects of
type T
for which is_execution_policy<T>::value
is true
.
execution_policy construct/assign

template<class T> execution_policy(const T& exec);

Requires: is_execution_policy<T>::value is true.

Effects: Constructs an execution_policy object with a copy of exec's state.

template<class T> execution_policy& operator=(const T& exec);

Requires: is_execution_policy<T>::value is true.

Effects: Assigns a copy of exec's state to *this.

Returns: *this.

execution_policy object access

const type_info& type() const noexcept;

Returns: typeid(T), such that T is the type of the execution policy object contained by *this.

template<class T> T* get() noexcept;
template<class T> const T* get() const noexcept;

Requires: is_execution_policy<T>::value is true.

Returns: If target_type() == typeid(T), a pointer to the stored execution policy object; otherwise a null pointer.

Execution policy objects

constexpr sequential_execution_policy seq{};
constexpr parallel_execution_policy par{};
constexpr parallel_vector_execution_policy par_vec{};
constexpr execution::unsequenced_policy unseq{};
constexpr execution::vector_policy vec{};
The header <experimental/execution_policy> declares a global object associated with each type of execution policy defined by this Technical Specification.
During the execution of a standard parallel algorithm,
if temporary memory resources are required and none are available,
the algorithm throws a std::bad_alloc
exception.
During the execution of a standard parallel algorithm, if the invocation of an element access function exits via an uncaught exception, the behavior of the program is determined by the type of execution policy used to invoke the algorithm:
- If the execution policy object is of type parallel_vector_execution_policy, unsequenced_policy, or vector_policy, std::terminate shall be called.
- If the execution policy object is of type sequential_execution_policy or parallel_execution_policy, the execution of the algorithm exits via an exception. The exception shall be an exception_list containing all uncaught exceptions thrown during the invocations of element access functions, or optionally the uncaught exception if there was only one.
[ Note: For example, when for_each is executed sequentially, if an invocation of the user-provided function object throws an exception, for_each can exit via the uncaught exception, or throw an exception_list containing the original exception. — end note ]
[ Note: These guarantees imply that, unless the algorithm fails to allocate memory and exits via std::bad_alloc, all exceptions thrown during the execution of the algorithm are communicated to the caller. It is unspecified whether an algorithm implementation will "forge ahead" after encountering and capturing a user exception. — end note ]
[ Note: The algorithm may exit via the std::bad_alloc exception even if one or more user-provided function objects have exited via an exception. For example, this can happen when an algorithm fails to allocate memory while creating or adding elements to the exception_list object. — end note ]
<experimental/exception_list>
synopsis

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  class exception_list : public exception
  {
    public:
      using iterator = unspecified;

      size_t size() const noexcept;
      iterator begin() const noexcept;
      iterator end() const noexcept;

      const char* what() const noexcept override;
  };

}
}
}
The class exception_list
owns a sequence of exception_ptr
objects. The parallel
algorithms may use the
exception_list
to communicate uncaught exceptions encountered during parallel execution to the
caller of the algorithm.
The type exception_list::iterator
shall fulfill the requirements of
ForwardIterator
.
size_t size() const noexcept;

Returns: The number of exception_ptr objects contained within the exception_list.

iterator begin() const noexcept;

Returns: An iterator referring to the first exception_ptr object contained within the exception_list.

iterator end() const noexcept;

Returns: An iterator that is past the end of the owned sequence.

const char* what() const noexcept override;

Returns: An implementation-defined NTBS.
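[ Example: A caller can examine the exception_ptr objects owned by an exception_list by rethrowing each one. A minimal sketch (the reporting strategy is an illustrative assumption):

#include <exception>
#include <iostream>
#include <experimental/exception_list>

void report(const std::experimental::parallel::exception_list& el)
{
  for (std::exception_ptr p : el) {    // iterate the owned exception_ptr objects
    try {
      std::rethrow_exception(p);       // rethrow to inspect the exception
    } catch (const std::exception& e) {
      std::cerr << e.what() << '\n';
    } catch (...) {
      std::cerr << "unknown exception\n";
    }
  }
}
— end example ]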
Function objects passed into parallel algorithms as objects of type BinaryPredicate
,
Compare
, and BinaryOperation
shall not directly or indirectly modify
objects via their arguments.
Parallel algorithms have template parameters named ExecutionPolicy
which describe
the manner in which the execution of these algorithms may be parallelized and the manner in
which they apply the element access functions.
The invocations of element access functions in parallel algorithms invoked with an execution
policy object of type sequential_execution_policy
execute in sequential order in
the calling thread.
The invocations of element access functions in parallel algorithms invoked with an execution
policy object of type parallel_execution_policy
are permitted to execute in an
unordered fashion in either the invoking thread or in a thread implicitly created by the library
to support parallel algorithm execution. Any such invocations executing in the same thread are
indeterminately sequenced with respect to each other.
[ Example:
using namespace std::experimental::parallel;
int a[] = {0,1};
std::vector<int> v;
for_each(par, std::begin(a), std::end(a), [&](int i) {
  v.push_back(i*2+1);
});
The program above has a data race because of the unsynchronized access to the container v. — end example ]
[ Example:
using namespace std::experimental::parallel;
std::atomic<int> x = 0;
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int n) {
  x.fetch_add(1, std::memory_order_relaxed);
  // spin wait for another iteration to change the value of x
  while (x.load(std::memory_order_relaxed) == 1) { }
});
The above example depends on the order of execution of the iterations, and is therefore undefined (may deadlock). — end example ]
[ Example:
using namespace std::experimental::parallel;
int x = 0;
std::mutex m;
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});
The above example synchronizes access to object x ensuring that it is incremented correctly. — end example ]
The invocations of element access functions in parallel algorithms invoked with an
execution policy of type unsequenced_policy
are permitted to execute
in an unordered fashion in the calling thread, unsequenced with respect to one another
within the calling thread.
The invocations of element access functions in parallel algorithms invoked with an
execution policy of type vector_policy are permitted to execute
in an unordered fashion in the calling thread, unsequenced with respect to one another
within the calling thread, subject to the sequencing constraints of wavefront application
([parallel.alg.general.wavefront]) for the last argument to for_loop or for_loop_strided.
The invocations of element access functions in parallel algorithms invoked with an execution
policy of type parallel_vector_execution_policy
are permitted to execute in an unordered fashion in unspecified threads, and unsequenced
with respect to one another within each thread.
Since parallel_vector_execution_policy allows the execution of element access functions to be
interleaved on a single thread, blocking synchronization, including the use of mutexes, risks deadlock. Thus, the
synchronization with parallel_vector_execution_policy is restricted as follows:
A standard library function is vectorization-unsafe if it is specified to synchronize with
another function invocation, or another function invocation is specified to synchronize with it, and if
it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions
may not be invoked by user code called from parallel_vector_execution_policy
algorithms.
[ Example:
using namespace std::experimental::parallel;
int x = 0;
std::mutex m;
int a[] = {1,2};
for_each(par_vec, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});
The above program is invalid because the applications of the function object are not guaranteed to run on different threads. — end example ]
[ Note: The applications of the function object may be interleaved on a single thread, so a second call to m.lock may occur on the same thread, which may deadlock. — end note ]
[ Note: The semantics of the parallel_execution_policy or the
parallel_vector_execution_policy invocation allow the implementation to fall back to
sequential execution if the system cannot parallelize an algorithm invocation due to lack of
resources. — end note ]
Algorithms invoked with an execution policy object of type execution_policy
execute internally as if invoked with the contained execution policy object.
The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are implementation-defined.
For the purposes of this section, an evaluation is a value computation or side effect of an expression, or an execution of a statement. Initialization of a temporary object is considered a subexpression of the expression that necessitates the temporary object.
An evaluation A contains an evaluation B if:
An evaluation A is ordered before an evaluation B if A is deterministically
sequenced before B.
For the purposes of this definition, an evaluation A is a vertical antecedent of an evaluation B, where A is ordered before B and both are contained in the same invocation of an element access function, if:
- there exists an evaluation S such that S contains A, S contains all evaluations C (if any) such that A is ordered before C and C is ordered before B, but S does not contain B, and
- control reached B from A without executing any of the following:
  - a goto statement or asm declaration that jumps to a statement outside of S, or
  - a switch statement executed within S that transfers control into a substatement of a nested selection or iteration statement, or
  - a throw, or
  - a longjmp.
In the following, Xi and Xj refer to evaluations of the same expression
or statement contained in the application of an element access function corresponding to the ith and
jth elements of the input sequence.
Horizontally matched is an equivalence relationship between two evaluations of the same expression. An evaluation Bi is horizontally matched with an evaluation Bj if:
Let f be a function called for each argument list in a sequence of argument lists. Wavefront application of f requires that evaluation Ai be sequenced before evaluation Bj if i < j and:
- Ai is a vertical antecedent of Bi, and
- Bi is horizontally matched with Bj.
ExecutionPolicy
algorithm overloads
The Parallel Algorithms Library provides overloads for each of the algorithms named in
the table below, corresponding to the algorithms with the same name in the C++ Standard Algorithms Library.
For each algorithm in the table, the overloads shall have an additional template type parameter named ExecutionPolicy, which shall be the first template parameter.
In addition, each such overload shall have the new function parameter as the
first function parameter of type ExecutionPolicy&&.
Unless otherwise specified, the semantics of ExecutionPolicy
algorithm overloads
are identical to their overloads without an ExecutionPolicy.
Parallel algorithms shall not participate in overload resolution unless
is_execution_policy<decay_t<ExecutionPolicy>>::value
is true
.
adjacent_difference | adjacent_find | all_of | any_of
---|---|---|---
copy | copy_if | copy_n | count
count_if | equal | exclusive_scan | fill
fill_n | find | find_end | find_first_of
find_if | find_if_not | for_each | for_each_n
generate | generate_n | includes | inclusive_scan
inner_product | inplace_merge | is_heap | is_heap_until
is_partitioned | is_sorted | is_sorted_until | lexicographical_compare
max_element | merge | min_element | minmax_element
mismatch | move | none_of | nth_element
partial_sort | partial_sort_copy | partition | partition_copy
reduce | remove | remove_copy | remove_copy_if
remove_if | replace | replace_copy | replace_copy_if
replace_if | reverse | reverse_copy | rotate
rotate_copy | search | search_n | set_difference
set_intersection | set_symmetric_difference | set_union | sort
stable_partition | stable_sort | swap_ranges | transform
transform_exclusive_scan | transform_inclusive_scan | transform_reduce | uninitialized_copy
uninitialized_copy_n | uninitialized_fill | uninitialized_fill_n | unique
unique_copy | | |
Define GENERALIZED_SUM(op, a1, ..., aN) as follows:
- a1 when N is 1;
- op(GENERALIZED_SUM(op, b1, ..., bK), GENERALIZED_SUM(op, bM, ..., bN)) where b1, ..., bN may be any permutation of a1, ..., aN and 1 < K+1 = M ≤ N.
Define GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aN) as follows:
- a1 when N is 1;
- op(GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aK), GENERALIZED_NONCOMMUTATIVE_SUM(op, aM, ..., aN)) where 1 < K+1 = M ≤ N.
<experimental/algorithm>
synopsis

#include <algorithm>

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  template<class ExecutionPolicy, class InputIterator, class Function>
    void for_each(ExecutionPolicy&& exec,
                  InputIterator first, InputIterator last, Function f);
  template<class InputIterator, class Size, class Function>
    InputIterator for_each_n(InputIterator first, Size n, Function f);
  template<class ExecutionPolicy, class InputIterator, class Size, class Function>
    InputIterator for_each_n(ExecutionPolicy&& exec,
                             InputIterator first, Size n, Function f);

  namespace execution {
    // 7.3.6, No vec
    template<class F>
      auto no_vec(F&& f) noexcept -> decltype(std::forward<F>(f)());

    // 7.3.7, Ordered update class
    template<class T> class ordered_update_t;

    // 7.3.8, Ordered update function template
    template<class T>
      ordered_update_t<T> ordered_update(T& ref) noexcept;
  }

  // Exposition only: Suppress template argument deduction.
  template<class T> struct no_deduce { using type = T; };
  template<class T> using no_deduce_t = typename no_deduce<T>::type;

  // 7.3.2, Support for reductions
  template<class T, class BinaryOperation>
    unspecified reduction(T& var, const T& identity, BinaryOperation combiner);
  template<class T> unspecified reduction_plus(T& var);
  template<class T> unspecified reduction_multiplies(T& var);
  template<class T> unspecified reduction_bit_and(T& var);
  template<class T> unspecified reduction_bit_or(T& var);
  template<class T> unspecified reduction_bit_xor(T& var);
  template<class T> unspecified reduction_min(T& var);
  template<class T> unspecified reduction_max(T& var);

  // 7.3.3, Support for inductions
  template<class T> unspecified induction(T&& var);
  template<class T, class S> unspecified induction(T&& var, S stride);

  // 7.3.4, for_loop
  template<class I, class... Rest>
    void for_loop(no_deduce_t<I> start, I finish, Rest&&... rest);
  template<class ExecutionPolicy, class I, class... Rest>
    void for_loop(ExecutionPolicy&& exec,
                  no_deduce_t<I> start, I finish, Rest&&... rest);
  template<class I, class S, class... Rest>
    void for_loop_strided(no_deduce_t<I> start, I finish, S stride, Rest&&... rest);
  template<class ExecutionPolicy, class I, class S, class... Rest>
    void for_loop_strided(ExecutionPolicy&& exec,
                          no_deduce_t<I> start, I finish, S stride, Rest&&... rest);
  template<class I, class Size, class... Rest>
    void for_loop_n(I start, Size n, Rest&&... rest);
  template<class ExecutionPolicy, class I, class Size, class... Rest>
    void for_loop_n(ExecutionPolicy&& exec, I start, Size n, Rest&&... rest);
  template<class I, class Size, class S, class... Rest>
    void for_loop_n_strided(I start, Size n, S stride, Rest&&... rest);
  template<class ExecutionPolicy, class I, class Size, class S, class... Rest>
    void for_loop_n_strided(ExecutionPolicy&& exec, I start, Size n, S stride, Rest&&... rest);

}
}
}
Each of the function templates in this subclause ([parallel.alg.reductions]) returns a reduction object of unspecified type having a reduction value type and encapsulating a reduction identity value for the reduction, a combiner function object, and a live-out object from which the initial value is obtained and into which the final value is stored.
An algorithm uses reduction objects by allocating an unspecified number of instances, known as accumulators, of the reduction value
type.
Modifications to the accumulator by the application of element
access functions accrue as partial results. At some point before the
algorithm
returns, the partial results are combined, two at a time, using
the reduction object’s combiner operation until a single value remains,
which
is then assigned back to the live-out object.

[ Note: In order to produce useful results, modifications to the accumulator should be limited to commutative operations closely related to the combiner operation. For example, if the combiner is plus<T>, incrementing the accumulator would be consistent with the combiner but doubling it or assigning to it would not. — end note ]
template<class T, class BinaryOperation>
unspecified reduction(T& var, const T& identity, BinaryOperation combiner);
Requires: T shall meet the requirements of CopyConstructible and MoveAssignable. The expression var = combiner(var, var) shall be well-formed.

Returns: A reduction object of unspecified type having reduction value type T, reduction identity identity, combiner function object combiner, and using the object referenced by var as its live-out object.

template<class T> unspecified reduction_plus(T& var);
template<class T> unspecified reduction_multiplies(T& var);
template<class T> unspecified reduction_bit_and(T& var);
template<class T> unspecified reduction_bit_or(T& var);
template<class T> unspecified reduction_bit_xor(T& var);
template<class T> unspecified reduction_min(T& var);
template<class T> unspecified reduction_max(T& var);
Requires: T shall meet the requirements of CopyConstructible and MoveAssignable.

Returns: A reduction object of unspecified type having reduction value type T, reduction identity and combiner operation as specified in the table below, and using the object referenced by var as its live-out object.

Function | Reduction Identity | Combiner Operation
---|---|---
reduction_plus | T() | x + y
reduction_multiplies | T(1) | x * y
reduction_bit_and | (~T()) | x & y
reduction_bit_or | T() | x \| y
reduction_bit_xor | T() | x ^ y
reduction_min | var | min(x, y)
reduction_max | var | max(x, y)
[ Example: The following code updates each element of y and sets s to the sum of the squares.

extern int n;
extern float x[], y[], a;
float s = 0;
for_loop(execution::vec, 0, n,
  reduction(s, 0.0f, plus<>()),
  [&](int i, float& accum) {
    y[i] += a*x[i];
    accum += y[i]*y[i];
  }
);
— end example ]
Each of the function templates in this section returns an induction object of unspecified type having an induction value type and encapsulating an initial value i of that type and, optionally, a stride.
For each element in the input range, an algorithm over input sequence S computes an induction value from an induction variable and ordinal position p within S by the formula i + p * stride if a stride was specified or i + p otherwise. This induction value is passed to the element access function.
An induction object may refer to a live-out object to hold the final value of the induction sequence. When the algorithm using the induction object completes, the live-out object is assigned the value i + n * stride, where n is the number of elements in the input range.
template<class T>
unspecified induction(T&& var); template<class T, class S>
unspecified induction(T&& var, S stride);
Returns: An induction object of unspecified type having induction value type remove_cv_t<remove_reference_t<T>>, initial value var, and (if specified) stride stride. If T is an lvalue reference to non-const type, then the object referenced by var becomes the live-out object for the induction object; otherwise there is no live-out object.
template<class I, class... Rest>
void for_loop(no_deduce_t<I> start, I finish, Rest&&... rest); template<class ExecutionPolicy,
class I, class... Rest>
void for_loop(ExecutionPolicy&& exec,
no_deduce_t<I> start, I finish, Rest&&... rest);
template<class I, class S, class... Rest>
void for_loop_strided(no_deduce_t<I> start, I finish,
S stride, Rest&&... rest); template<class ExecutionPolicy,
class I, class S, class... Rest>
void for_loop_strided(ExecutionPolicy&& exec,
no_deduce_t<I> start, I finish,
S stride, Rest&&... rest);
template<class I, class Size, class... Rest>
void for_loop_n(I start, Size n, Rest&&... rest); template<class ExecutionPolicy,
class I, class Size, class... Rest>
void for_loop_n(ExecutionPolicy&& exec,
I start, Size n, Rest&&... rest);
template<class I, class Size, class S, class... Rest>
void for_loop_n_strided(I start, Size n, S stride, Rest&&... rest); template<class ExecutionPolicy,
class I, class Size, class S, class... Rest>
void for_loop_n_strided(ExecutionPolicy&& exec,
I start, Size n, S stride, Rest&&... rest);
Requires: For the overloads with an ExecutionPolicy, I shall be an integral type
or meet the requirements of a forward iterator type; otherwise, I shall be an integral
type or meet the requirements of an input iterator type. Size shall be an integral type
and n shall be non-negative. S shall have integral type and stride
shall have non-zero value. stride shall be negative only if I has integral
type or meets the requirements of a bidirectional iterator. The rest parameter pack shall
have at least one element, comprising objects returned by invocations of reduction
([parallel.alg.reduction]) and/or induction ([parallel.alg.induction]) function templates
followed by exactly one invocable element-access function, f. For the overloads with an
ExecutionPolicy, f shall meet the requirements of CopyConstructible;
otherwise, f shall meet the requirements of MoveConstructible.
Effects: Applies f to each element in the input sequence, as described below, with additional arguments corresponding to the reductions and inductions in the rest parameter pack. The length of the input sequence is:
- n, if specified,
- otherwise finish - start if neither n nor stride is specified,
- otherwise 1 + (finish-start-1)/stride if stride is positive,
- otherwise 1 + (start-finish-1)/-stride.
The first element in the input sequence is start. Each subsequent element is generated by adding
stride to the previous element, if stride is specified, otherwise by incrementing
the previous element. [ Note: If I is an
iterator type, the iterators in the input sequence are not dereferenced before
being passed to f. — end note ]
If an element of the rest parameter pack is an object returned by a call to induction, then the additional argument passed to the corresponding application of f is the induction value for that induction object corresponding to the position of the application of f in the input sequence.
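[ Example: The following sketch (the arrays and their extents are illustrative assumptions) uses an induction object with for_loop; each application of the element access function receives the loop element i and the corresponding induction value j = 2*i as an additional argument.

#include <experimental/algorithm>

extern int n;
extern float in[], out[];

void gather_every_other()
{
  using namespace std::experimental::parallel;

  // Gather every other element of in into consecutive elements of out;
  // induction supplies the values 0, 2, 4, ... as the second argument j.
  for_loop(execution::vec, 0, n,
    induction(0, 2),
    [&](int i, int j) {
      out[i] = in[j];
    });
}
— end example ]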
template<class ExecutionPolicy,
class InputIterator, class Function>
void for_each(ExecutionPolicy&& exec,
InputIterator first, InputIterator last,
Function f);
Effects: Applies f to the result of dereferencing every iterator in the range [first,last).

[ Note: If the type of first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]
Complexity: Applies f exactly last - first times.

Remarks: If f returns a result, the result is ignored.

[ Note: Unlike its sequential form, the parallel overload of for_each does not return a copy of its Function parameter, since parallelization may not permit efficient state accumulation. — end note ]

Requires: Unlike its sequential form, the parallel overload of for_each requires Function to meet the requirements of CopyConstructible.
template<class InputIterator, class Size, class Function>
InputIterator for_each_n(InputIterator first, Size n,
Function f);
Requires: Function shall meet the requirements of MoveConstructible. [ Note: Function need not meet the requirements of CopyConstructible. — end note ]
Effects: Applies f to the result of dereferencing every iterator in the range [first,first + n), starting from first and proceeding to first + n - 1.

[ Note: If the type of first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]
Returns: first + n for non-negative values of n and first for negative values.

Remarks: If f returns a result, the result is ignored.
template<class ExecutionPolicy,
class InputIterator, class Size, class Function>
InputIterator for_each_n(ExecutionPolicy && exec,
InputIterator first, Size n,
Function f);
Effects: Applies f to the result of dereferencing every iterator in the range [first,first + n), starting from first and proceeding to first + n - 1.

[ Note: If the type of first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]
Returns: first + n for non-negative values of n and first for negative values.

Remarks: If f returns a result, the result is ignored.

Requires: Unlike its sequential form, the parallel overload of for_each_n requires Function to meet the requirements of CopyConstructible.
template<class F>
auto no_vec(F&& f) noexcept -> decltype(std::forward<F>(f)());
Effects: Evaluates std::forward<F>(f)(). When invoked within an element access function
in a parallel algorithm using vector_policy, if two calls to no_vec are
horizontally matched within a wavefront application of an element access function over input
sequence S, then the execution of f in the application for one element in S is
sequenced before the execution of f in the application for a subsequent element in
S; otherwise, there is no effect on sequencing.
Returns: The result of f.

Remarks: If f exits via an exception, then terminate will be called, consistent with all other potentially-throwing operations invoked with vector_policy execution.
[ Example:
extern int* p;
for_loop(vec, 0, n,
  [&](int i) {
    y[i] += y[i+1];
    if (y[i] < 0) {
      no_vec([&]{ *p++ = i; });
    }
  });
The updates *p++ = i will occur in the same order as if the policy were seq. — end example ]
template<class T>
class ordered_update_t
{
    T& ref_;  // exposition only

  public:
    ordered_update_t(T& loc) noexcept : ref_(loc) {}
    ordered_update_t(const ordered_update_t&) = delete;
    ordered_update_t& operator=(const ordered_update_t&) = delete;

    template <class U> auto operator=(U rhs) const noexcept { return no_vec([&]{ return ref_ = std::move(rhs); }); }
    template <class U> auto operator+=(U rhs) const noexcept { return no_vec([&]{ return ref_ += std::move(rhs); }); }
    template <class U> auto operator-=(U rhs) const noexcept { return no_vec([&]{ return ref_ -= std::move(rhs); }); }
    template <class U> auto operator*=(U rhs) const noexcept { return no_vec([&]{ return ref_ *= std::move(rhs); }); }
    template <class U> auto operator/=(U rhs) const noexcept { return no_vec([&]{ return ref_ /= std::move(rhs); }); }
    template <class U> auto operator%=(U rhs) const noexcept { return no_vec([&]{ return ref_ %= std::move(rhs); }); }
    template <class U> auto operator>>=(U rhs) const noexcept { return no_vec([&]{ return ref_ >>= std::move(rhs); }); }
    template <class U> auto operator<<=(U rhs) const noexcept { return no_vec([&]{ return ref_ <<= std::move(rhs); }); }
    template <class U> auto operator&=(U rhs) const noexcept { return no_vec([&]{ return ref_ &= std::move(rhs); }); }
    template <class U> auto operator^=(U rhs) const noexcept { return no_vec([&]{ return ref_ ^= std::move(rhs); }); }
    template <class U> auto operator|=(U rhs) const noexcept { return no_vec([&]{ return ref_ |= std::move(rhs); }); }

    auto operator++() const noexcept    { return no_vec([&]{ return ++ref_; }); }
    auto operator++(int) const noexcept { return no_vec([&]{ return ref_++; }); }
    auto operator--() const noexcept    { return no_vec([&]{ return --ref_; }); }
    auto operator--(int) const noexcept { return no_vec([&]{ return ref_--; }); }
};
An object of type ordered_update_t<T> is a proxy for an object of type T
intended to be used within a parallel application of an element access function using a
policy object of type vector_policy. Simple increments, assignments, and compound
assignments to the object are forwarded to the proxied object, but are sequenced as though
executed within a no_vec invocation.
template<class T>
ordered_update_t<T> ordered_update(T& loc) noexcept;

Returns: { loc }.
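[ Example: The following sketch (the array, its extent, and the counter are illustrative assumptions) uses ordered_update so that the increments of a shared counter inside a vec loop are sequenced as though the loop were executed with the seq policy.

#include <experimental/algorithm>

extern int n;
extern float a[];
extern int negatives;

void count_negatives()
{
  using namespace std::experimental::parallel;

  for_loop(execution::vec, 0, n,
    [&](int i) {
      if (a[i] < 0.0f) {
        // Equivalent to ++negatives executed inside a no_vec invocation.
        execution::ordered_update(negatives) += 1;
      }
    });
}
— end example ]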
<experimental/numeric>
synopsis

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  template<class InputIterator>
    typename iterator_traits<InputIterator>::value_type
      reduce(InputIterator first, InputIterator last);
  template<class ExecutionPolicy, class InputIterator>
    typename iterator_traits<InputIterator>::value_type
      reduce(ExecutionPolicy&& exec, InputIterator first, InputIterator last);
  template<class InputIterator, class T>
    T reduce(InputIterator first, InputIterator last, T init);
  template<class ExecutionPolicy, class InputIterator, class T>
    T reduce(ExecutionPolicy&& exec, InputIterator first, InputIterator last, T init);
  template<class InputIterator, class T, class BinaryOperation>
    T reduce(InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class T, class BinaryOperation>
    T reduce(ExecutionPolicy&& exec, InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator, class T>
    OutputIterator exclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result, T init);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator, class T>
    OutputIterator exclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result, T init);
  template<class InputIterator, class OutputIterator, class T, class BinaryOperation>
    OutputIterator exclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result, T init, BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator, class T,
           class BinaryOperation>
    OutputIterator exclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result, T init, BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator>
    OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator>
    OutputIterator inclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result);
  template<class InputIterator, class OutputIterator, class BinaryOperation>
    OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result, BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class BinaryOperation>
    OutputIterator inclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result, BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator, class BinaryOperation, class T>
    OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                                  OutputIterator result, BinaryOperation binary_op, T init);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class BinaryOperation, class T>
    OutputIterator inclusive_scan(ExecutionPolicy&& exec,
                                  InputIterator first, InputIterator last,
                                  OutputIterator result, BinaryOperation binary_op, T init);

  template<class InputIterator, class UnaryOperation, class T, class BinaryOperation>
    T transform_reduce(InputIterator first, InputIterator last,
                       UnaryOperation unary_op, T init, BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class UnaryOperation, class T,
           class BinaryOperation>
    T transform_reduce(ExecutionPolicy&& exec, InputIterator first, InputIterator last,
                       UnaryOperation unary_op, T init, BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator, class UnaryOperation, class T,
           class BinaryOperation>
    OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            T init, BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class UnaryOperation, class T, class BinaryOperation>
    OutputIterator transform_exclusive_scan(ExecutionPolicy&& exec,
                                            InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            T init, BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator, class UnaryOperation,
           class BinaryOperation>
    OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            BinaryOperation binary_op);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class UnaryOperation, class BinaryOperation>
    OutputIterator transform_inclusive_scan(ExecutionPolicy&& exec,
                                            InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator, class UnaryOperation,
           class BinaryOperation, class T>
    OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            BinaryOperation binary_op, T init);
  template<class ExecutionPolicy, class InputIterator, class OutputIterator,
           class UnaryOperation, class BinaryOperation, class T>
    OutputIterator transform_inclusive_scan(ExecutionPolicy&& exec,
                                            InputIterator first, InputIterator last,
                                            OutputIterator result, UnaryOperation unary_op,
                                            BinaryOperation binary_op, T init);

}
}
}
template<class InputIterator>
typename iterator_traits<InputIterator>::value_type
reduce(InputIterator first, InputIterator last);
Effects: Same as reduce(first, last, typename iterator_traits<InputIterator>::value_type{}).

template<class InputIterator, class T>
T reduce(InputIterator first, InputIterator last, T init);
Effects: Same as reduce(first, last, init, plus<>()).

template<class InputIterator, class T, class BinaryOperation>
T reduce(InputIterator first, InputIterator last, T init,
BinaryOperation binary_op);
Returns: GENERALIZED_SUM(binary_op, init, *first, ..., *(first + (last - first) - 1)).

Requires: binary_op shall not invalidate iterators or subranges, nor modify elements in the range [first,last).

Complexity: O(last - first) applications of binary_op.

[ Note: The primary difference between reduce and accumulate is that the behavior of reduce may be non-deterministic for non-associative or non-commutative binary_op. — end note ]
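[ Example: A minimal sketch of the remark above (the values are illustrative assumptions): because reduce may regroup and reorder the applications of binary_op, only an associative and commutative operation is guaranteed to produce the same value as accumulate.

#include <functional>
#include <vector>
#include <experimental/numeric>
#include <experimental/execution_policy>

void reduce_example(const std::vector<int>& v)
{
  using namespace std::experimental::parallel;

  // plus<> is associative and commutative: the grouping chosen by the
  // implementation cannot change the result.
  int sum = reduce(par, v.begin(), v.end(), 0, std::plus<>());

  // minus<> is neither associative nor commutative: the result may differ
  // from std::accumulate and may vary between executions.
  int diff = reduce(par, v.begin(), v.end(), 0, std::minus<>());

  (void)sum;
  (void)diff;
}
— end example ]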
template<class InputIterator, class OutputIterator, class T>
OutputIterator exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
T init);
Effects: Same as exclusive_scan(first, last, result, init, plus<>()).

template<class InputIterator, class OutputIterator, class T, class BinaryOperation>
OutputIterator exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
T init, BinaryOperation binary_op);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., *(first + (i - result) - 1)).

Returns: The end of the resulting range beginning at result.

Requires: binary_op shall not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications of binary_op.

[ Note: The difference between exclusive_scan and inclusive_scan is that exclusive_scan excludes the ith input element from the ith sum. If binary_op is not mathematically associative, the behavior of exclusive_scan may be non-deterministic. — end note ]
template<class InputIterator, class OutputIterator>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result);
Effects: Same as inclusive_scan(first, last, result, plus<>()).
template<class InputIterator, class OutputIterator, class BinaryOperation>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
BinaryOperation binary_op);

template<class InputIterator, class OutputIterator, class BinaryOperation, class T>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
BinaryOperation binary_op, T init);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, *first, ..., *(first + (i - result))) or GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., *(first + (i - result))) if init is provided.

Returns: The end of the resulting range beginning at result.

Requires: binary_op shall not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications of binary_op.

[ Note: The difference between exclusive_scan and inclusive_scan is that inclusive_scan includes the ith input element in the ith sum. If binary_op is not mathematically associative, the behavior of inclusive_scan may be non-deterministic. — end note ]
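[ Example: For the input sequence {1, 2, 3, 4} and the operation plus<>, an exclusive scan with initial value 0 produces {0, 1, 3, 6}, whereas an inclusive scan produces {1, 3, 6, 10}. A minimal sketch (the buffers are illustrative assumptions):

#include <functional>
#include <vector>
#include <experimental/numeric>

void scan_example()
{
  using namespace std::experimental::parallel;

  std::vector<int> in = {1, 2, 3, 4};
  std::vector<int> ex(in.size());
  std::vector<int> inc(in.size());

  // ex becomes {0, 1, 3, 6}: the ith input is excluded from the ith sum.
  exclusive_scan(in.begin(), in.end(), ex.begin(), 0, std::plus<>());

  // inc becomes {1, 3, 6, 10}: the ith input is included in the ith sum.
  inclusive_scan(in.begin(), in.end(), inc.begin(), std::plus<>());
}
— end example ]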
template<class InputIterator, class UnaryOperation, class T, class BinaryOperation>
T transform_reduce(InputIterator first, InputIterator last,
UnaryOperation unary_op, T init, BinaryOperation binary_op);
Returns: GENERALIZED_SUM(binary_op, init, unary_op(*first), ..., unary_op(*(first + (last - first) - 1))).

Requires: Neither unary_op nor binary_op shall invalidate subranges, or modify elements in the range [first,last).

Complexity: O(last - first) applications each of unary_op and binary_op.

[ Note: transform_reduce does not apply unary_op to init. — end note ]

template<class InputIterator, class OutputIterator,
class UnaryOperation,
class T, class BinaryOperation>
OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
UnaryOperation unary_op,
T init, BinaryOperation binary_op);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, unary_op(*first), ..., unary_op(*(first + (i - result) - 1))).

Returns: The end of the resulting range beginning at result.

Requires: Neither unary_op nor binary_op shall invalidate iterators or subranges, or modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications each of unary_op and binary_op.

[ Note: The difference between transform_exclusive_scan and transform_inclusive_scan is that transform_exclusive_scan excludes the ith input element from the ith sum. If binary_op is not mathematically associative, the behavior of transform_exclusive_scan may be non-deterministic. transform_exclusive_scan does not apply unary_op to init. — end note ]
template<class InputIterator, class OutputIterator,
class UnaryOperation,
class BinaryOperation>
OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
UnaryOperation unary_op,
BinaryOperation binary_op);

template<class InputIterator, class OutputIterator,
class UnaryOperation,
class BinaryOperation, class T>
OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
UnaryOperation unary_op,
BinaryOperation binary_op, T init);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, unary_op(*first), ..., unary_op(*(first + (i - result)))) or GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, unary_op(*first), ..., unary_op(*(first + (i - result)))) if init is provided.

Returns: The end of the resulting range beginning at result.

Requires: Neither unary_op nor binary_op shall invalidate iterators or subranges, or modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications each of unary_op and binary_op.

[ Note: The difference between transform_exclusive_scan and transform_inclusive_scan is that transform_inclusive_scan includes the ith input element in the ith sum. If binary_op is not mathematically associative, the behavior of transform_inclusive_scan may be non-deterministic. transform_inclusive_scan does not apply unary_op to init. — end note ]
<experimental/task_block>
synopsis

namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  class task_cancelled_exception;

  class task_block;

  template<class F>
    void define_task_block(F&& f);

  template<class F>
    void define_task_block_restore_thread(F&& f);

}
}
}
task_cancelled_exception
namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  class task_cancelled_exception : public exception
  {
    public:
      task_cancelled_exception() noexcept;
      virtual const char* what() const noexcept override;
  };

}
}
}
The class task_cancelled_exception defines the type of objects thrown by
task_block::run or task_block::wait if they detect that an
exception is pending within the current parallel block. See the discussion of exception handling in task blocks, below.

task_cancelled_exception member function what
virtual const char* what() const noexcept
task_block
namespace std {
namespace experimental {
inline namespace parallelism_v2 {

  class task_block
  {
    private:
      ~task_block();

    public:
      task_block(const task_block&) = delete;
      task_block& operator=(const task_block&) = delete;
      void operator&() const = delete;

      template<class F>
        void run(F&& f);

      void wait();
  };

}
}
}
The class task_block
defines an interface for forking and joining parallel tasks. The define_task_block
and define_task_block_restore_thread
function templates create an object of type task_block
and pass a reference to that object to a user-provided function object.
An object of class task_block
cannot be constructed,
destroyed, copied, or moved except by the implementation of the task
block library. Taking the address of a task_block
object via operator&
is ill-formed. Obtaining its address by any other means (including addressof
) results in a pointer with an unspecified value; dereferencing such a pointer results in undefined behavior.
A task_block
is active if it was created by the nearest enclosing task block, where “task block” refers to an
invocation of define_task_block
or define_task_block_restore_thread
and “nearest enclosing” means the most
recent invocation that has not yet completed. Code designated for execution in another thread by means other
than the facilities in this section (e.g., using thread
or async
) is not enclosed in the task block and a
task_block
passed to (or captured by) such code is not active within that code. Performing any operation on a
task_block
that is not active results in undefined behavior.
When the argument to task_block::run
is called, no task_block
is active, not even the task_block
on which run
was called.
(The function object should not, therefore, capture a task_block
from the surrounding block.)
[ Example:
define_task_block([&](auto& tb) {
  tb.run([&]{
    tb.run([] { f(); });               // Error: tb is not active within run
    define_task_block([&](auto& tb2) { // Define new task block
      tb2.run(f);
      ...
    });
  });
  ...
});
— end example ]
task_block
member function template run
template<class F> void run(F&& f);
Requires: F shall be MoveConstructible. DECAY_COPY(std::forward<F>(f))() shall be a valid expression.

Precondition: *this shall be the active task_block.
Effects: Evaluates DECAY_COPY(std::forward<F>(f))()
, where DECAY_COPY(std::forward<F>(f))
is evaluated synchronously within the current thread. The call to the resulting copy of the function object is
permitted to run on an unspecified thread created by the implementation in an unordered fashion relative to
the sequence of operations following the call to run(f)
(the continuation), or indeterminately sequenced
within the same thread as the continuation. The call to run
synchronizes with the call to the function
object. The completion of the call to the function object synchronizes with the next invocation of wait
on
the same task_block
or completion of the nearest enclosing task block (i.e., the define_task_block
or
define_task_block_restore_thread
that created this task_block
).
Throws: task_cancelled_exception, as described in the section on exception handling in task blocks, below.

Remarks: The run function may return on a thread other than the one on which it was called; in such cases,
completion of the call to run synchronizes with the continuation. [ Note: The return from run is ordered similarly to an ordinary function call in a single thread. — end note ]
[ Note: The invocation of the user-supplied function object f may be immediate or may be delayed until
compute resources are available. run might or might not return before the invocation of f completes. — end note ]
task_block
member function wait
void wait();
Precondition: *this shall be the active task_block.

Effects: Blocks until the tasks spawned using this task_block have completed.
Throws: task_cancelled_exception, as described in the section on exception handling in task blocks, below.

Remarks: The wait function may return on a thread other than the one on which it was called; in such cases, completion of the call to wait synchronizes with subsequent operations.
[ Example:
define_task_block([&](auto& tb) {
  tb.run([&]{ process(a, w, x); }); // Process a[w] through a[x]
  if (y < x) tb.wait();             // Wait if overlap between [w,x) and [y,z)
  process(a, y, z);                 // Process a[y] through a[z]
});
— end example ]
define_task_block
template<class F>
void define_task_block(F&& f);
template<class F>
void define_task_block_restore_thread(F&& f);
Requires: Given an lvalue tb of type task_block, the expression f(tb) shall be well-formed.

Effects: Constructs a task_block tb and calls f(tb).
Throws: exception_list, as specified in the section on exception handling in task blocks, below.

Postcondition: All tasks spawned from f have finished execution.
Remarks: The define_task_block function may return on a thread other than the one on which it was called,
unless there are no task blocks active on entry to define_task_block (see the definition of an active task_block above). When define_task_block returns on a different thread,
it synchronizes with operations following the call. The define_task_block_restore_thread
function always returns on the same thread as the one on which it was called.
[ Note: It is expected (but not mandated) that f will (directly or indirectly) call tb.run(function-object). — end note ]
Every task_block
has an associated exception list. When the task block starts, its associated exception list is empty.
When an exception is thrown from the user-provided function object passed to define_task_block
or
define_task_block_restore_thread
, it is added to the exception list for that task block. Similarly, when
an exception is thrown from the user-provided function object passed into task_block::run
, the exception
object is added to the exception list associated with the nearest enclosing task block. In both cases, an
implementation may discard any pending tasks that have not yet been invoked. Tasks that are already in
progress are not interrupted except at a call to task_block::run
or task_block::wait
as described below.
If the implementation is able to detect that an exception has been thrown by another task within
the same nearest enclosing task block, then task_block::run
or task_block::wait
may throw
task_cancelled_exception; these instances of task_cancelled_exception
are not added to the exception
list of the corresponding task block.
When a task block finishes with a non-empty exception list, the exceptions are aggregated into an exception_list
object, which is then thrown from the task block.
The order of the exceptions in the exception_list
object is unspecified.
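[ Example: Exceptions thrown by tasks spawned within a task block are collected and reported as an exception_list when the task block completes. A minimal sketch of observing this behavior (the work functions are illustrative assumptions):

#include <experimental/exception_list>
#include <experimental/task_block>

extern void may_throw_1();
extern void may_throw_2();

void run_tasks()
{
  using namespace std::experimental::parallel;

  try {
    define_task_block([&](task_block& tb) {
      tb.run([]{ may_throw_1(); });   // an uncaught exception here is added to
      tb.run([]{ may_throw_2(); });   // the task block's exception list
    });
  } catch (const exception_list& el) {
    // el owns one exception_ptr for each uncaught exception thrown by a task.
  }
}
— end example ]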