Document Number: | |
---|---|
Date: | |
Revises: | |
Editor: | NVIDIA Corporation |
Note: this is an early draft. It’s known to be incomplet and incorrekt, and it has lots of bad formatting.
This Technical Specification describes requirements for implementations of an interface that computer programs written in the C++ programming language may use to invoke algorithms with parallel execution. The algorithms described by this Technical Specification are realizable across a broad class of computer architectures.
This Technical Specification is non-normative. Some of the functionality described by this Technical Specification may be considered for standardization in a future version of C++, but it is not currently part of any C++ standard. Some of the functionality in this Technical Specification may never be standardized, and other functionality may be standardized in a substantially changed form.
The goal of this Technical Specification is to build widespread existing practice for parallelism in the C++ standard algorithms library. It gives advice on extensions to those vendors who wish to provide them.
The following referenced document is indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 14882:— is herein called the C++ Standard.
The library described in ISO/IEC 14882:— clauses 17-30 is herein called
the C++ Standard Library. The C++ Standard Library components described in
ISO/IEC 14882:— clauses 25 and, 26.7 and 20.7.2 are herein called the C++ Standard
Algorithms Library.
Unless otherwise specified, the whole of the C++ Standard's Library
introduction (
Since the the extensions described in this Technical Specification are
experimental and not part of the C++ Standard Library, they should not be
declared directly within namespace std
. Unless otherwise specified, all
components described in this Technical Specification are declared in namespace
std::experimental::parallel::v1
.
std
.
— end note ]
Unless otherwise specified, references to such entities described in this
Technical Specification are assumed to be qualified with
std::experimental::parallel::v1
, and references to entities described in the C++
Standard Library are assumed to be qualified with std::
.
Extensions that are expected to eventually be added to an existing header
<meow>
are provided inside the <experimental/meow>
header,
which shall include the standard contents of <meow>
as if by
#include <meow>
For the purposes of this document, the terms and definitions given in the C++ Standard and the following apply.
A parallel algorithm is a function template described by this Technical Specification declared in namespace std::experimental::parallel::v1
with a formal template parameter named ExecutionPolicy
.
Parallel algorithms access objects indirectly accessible via their arguments by invoking the following functions:
sort
function may invoke the following element access functions:
RandomAccessIterator
.
swap
function on the elements of the sequence (as per 25.4.1.1 [sort]/2).
Compare
function object.
This clause describes classes that represent execution policies. An
execution policy is an object that expresses the requirements on the ordering
of functions invoked as a consequence of the invocation of a standard
algorithm. Execution policies afford standard algorithms the discretion to
execute in parallel.
This clause describes classes that are execution policy types. An object
of an execution policy type indicates to an algorithm whether it is allowed to execute
in parallel and expresses the requirements on the element access functions.
std::vector<int> v = ... // standard sequential sort std::sort(vec.begin(), vec.end()); using namespace std::experimental::parallel; // explicitly sequential sort sort(seq, v.begin(), v.end()); // permitting parallel execution sort(par, v.begin(), v.end()); // permitting vectorization as well sort(— end example ]vecpar_vec, v.begin(), v.end()); // sort with dynamically-selected execution size_t threshold = ... execution_policy exec = seq; if (v.size() > threshold) { exec = par; } sort(exec, v.begin(), v.end());
<experimental/execution_policy>
synopsisnamespace std { namespace experimental { namespace parallel { inline namespace v1 {// 2.3, Execution policy type trait template<class T> struct is_execution_policy; template<class T> constexpr bool is_execution_policy_v = is_execution_policy<T>::value;// 2.4, Sequential execution policy class sequential_execution_policy;// 2.5, Parallel execution policy class parallel_execution_policy;// 2.6, Parallel+Vector execution policy classvector_execution_policyparallel_vector_execution_policy;// 2.7, Dynamic execution policy class execution_policy; } } } }
namespace std { namespace experimental { namespace parallel {template<class T> struct is_execution_policy { see below };: integral_constant<bool, see below> { };} } }
is_execution_policy
can be used to detect parallel execution policies for the purpose of
excluding function signatures from otherwise ambiguous overload
resolution participation.
If T
is the type of a standard or implementation-defined execution policy, is_execution_policy<T>
shall be publicly derived from integral_constant<bool,true>
,
otherwise from integral_constant<bool,false>
.
is_execution_policy<T>
shall be a UnaryTypeTrait with a BaseCharacteristic of true_type
if T
is the type of a standard or implementation-defined execution policy, otherwise false_type
.
The behavior of a program that adds specializations for is_execution_policy
is undefined.
namespace std { namespace experimental { namespace parallel {class sequential_execution_policy{ unspecified };} } }
The class sequential_execution_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and require that a parallel algorithm's
execution may not be parallelized.
namespace std { namespace experimental { namespace parallel {class parallel_execution_policy{ unspecified };} } }
The class parallel_execution_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be parallelized.
namespace std { namespace experimental { namespace parallel {classvector_execution_policyparallel_vector_execution_policy{ unspecified };} } }
The class class
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be vectorized and parallelized.vector_execution_policyparallel_vector_execution_policy
namespace std { namespace experimental { namespace parallel {class execution_policy { public:// 2.7.1, execution_policy construct/assign template<class T> execution_policy(const T& exec); template<class T> execution_policy& operator=(const T& exec);// 2.7.2, execution_policy object access template<class T> T* get() noexcept; template<class T> const T* get() const noexcept; };} } }
The class execution_policy
is a container for execution policy objects.
execution_policy
allows dynamic control over standard algorithm execution.
std::vector<float> sort_me = ... using namespace std::experimental::parallel;— end example ]execution_policy exec =std::
seq; if(sort_me.size() > threshold) { exec = std::par; }std::
std::sort(exec, sort_me.begin(), sort_me.end());std::sort(exec, std::begin(sort_me), std::end(sort_me));
Objects of type execution_policy
shall be constructible and assignable from objects of
type T
for which is_execution_policy<T>::value
is true
.
execution_policy
construct/assigntemplate<class T> execution_policy(const T& exec);
execution_policy
object with a copy of exec
's state.is_execution_policy<T>::value
is true
.
is_execution_policy<T>::value
is true
.
template<class T> execution_policy& operator=(const T& exec);
exec
's state to *this
.*this
.
is_execution_policy<T>::value
is true
.
execution_policy
object access
const type_info& type() const noexcept;
typeid(T)
, such that T
is the type of the execution policy object contained by *this
.
template<class T> T* get() noexcept;
template<class T> const T* get() const noexcept;
target_type() == typeid(T)
, a pointer to the stored execution policy object; otherwise a null pointer.is_execution_policy<T>::value
is true
.
namespace std { namespace experimental { namespace parallel {constexpr sequential_execution_policy seq = sequential_execution_policy(); constexpr parallel_execution_policy par = parallel_execution_policy(); constexpr vector_execution_policy vec = vector_execution_policy();constexpr sequential_execution_policy seq{}; constexpr parallel_execution_policy par{}; constexpr parallel_vector_execution_policy par_vec{};} } }
The header <experimental/execution_policy>
declares a global object associated with each type of execution policy defined by this Technical Specification.
During the execution of a standard parallel algorithm,
Iif temporary memory resources are required by the algorithm and none are available,
the algorithm throws a std::bad_alloc
exception.
During the execution of a standard parallel algorithm, if the application of a function
objectinvocation of an element access function terminates with an uncaught
exception, the behavior of the program is determined by the type of execution policy used to
invoke the algorithm:
class vector_execution_policyparallel_vector_execution_policy
,
std::terminate
shall be called.
sequential_execution_policy
or
parallel_execution_policy
, the execution of the algorithm terminates with an
exception_list
exception. All uncaught exceptions thrown during
the exception_list
.
for_each
is unspecified. When for_each
is executed sequentially,
only one exception will be contained in the exception_list
object.
— end note ]
std::bad_alloc
, all
exceptions thrown during the execution of
the algorithm are communicated to the caller. It is
unspecified whether an algorithm implementation will "forge ahead" after
encountering and capturing a user exception.
— end note ]
std::bad_alloc
exception even if one or more
user-provided function objects have terminated with an
exception. For example, this can happen when an algorithm fails to
allocate memory while
creating or adding elements to the exception_list
object.
— end note ]
<experimental/exception_list>
synopsisnamespace std { namespace experimental { namespace parallel { inline namespace v1 { class exception_list : public exception { public:typedef exception_ptr value_type; typedef const value_type& reference; typedef const value_type& const_reference; typedef implementation-defined const_iterator; typedef const_iterator iterator; typedef typename iterator_traits::difference_type difference_type; typedef size_t size_type;typedef unspecified iterator; size_t size() const noexcept; iterator begin() const noexcept; iterator end() const noexcept; const char* what() const noexcept override;private: std::list<exception_ptr> exceptions_; // exposition only}; } } } }
The class exception_list
is a container owns a sequence of exception_ptr
objects. The parallel
algorithms may use the exception_list
to communicate uncaught exceptions encountered during parallel execution to the
caller of the algorithm.
The type exception_list::iterator
shall fulfill the requirements of
ForwardIterator
.
size_t size() const noexcept;
exception_ptr
objects contained within the exception_list
.
exception_list::iterator begin() const noexcept;
exception_ptr
object contained within the exception_list
.
exception_list::iterator end() const noexcept;
exception_list
const char* what() const noexcept override;
Function objects passed into parallel algorithms as objects of type BinaryPredicate
,
Compare
, and BinaryOperation
shall not directly or indirectly modify
objects via their arguments.
Parallel algorithms have template parameters named ExecutionPolicy
which describe
the manner in which the execution of these algorithms may be parallelized and the manner in
which they apply user-provided function objectsthe element access functions.
The applications of function objectsinvocations of element access functions
in parallel algorithms invoked with an execution policy object of type sequential_execution_policy
execute in sequential order in the calling thread.
The applications of function objectsinvocations of element access
functions in parallel algorithms invoked with an execution policy object of
type parallel_execution_policy
are permitted to execute in an unordered
fashion in unspecified threads, and indeterminately sequenced within each thread.
using namespace std::experimental::parallel; int a[] = {0,1}; std::vector<int> v; for_each(par, std::begin(a), std::end(a), [&](int i) { v.push_back(i*2+1); });The program above has a data race because of the unsynchronized access to the containerfoo bar
v
.
— end example ]
using namespace std::experimental::parallel; std::atomic<int> x = 0; int a[] = {1,2}; for_each(par, std::begin(a), std::end(a), [&](int n) { x.fetch_add(1, std::memory_order_relaxed); // spin wait for another iteration to change the value of x while (x.load(std::memory_order_relaxed) == 1) { } });The above example depends on the order of execution of the iterations, and is therefore undefined (may deadlock). — end example ]
using namespace std::experimental::parallel; int x=0; std::mutex m; int a[] = {1,2}; for_each(par, std::begin(a), std::end(a), [&](int) { m.lock(); ++x; m.unlock(); });The above example synchronizes access to object
x
ensuring that it is
incremented correctly.
— end example ]
The applications of function objectsinvocations of element access functions
in parallel algorithms invoked with an execution policy of type
are permitted to execute in an unordered fashion in unspecified threads, and unsequenced
with respect to one another within each thread.
vector_execution_policyparallel_vector_execution_policy
vector_execution_policy
policy must not synchronize with each other. Specifically, they must not acquire locks.
— end note ]
parallel_vector_execution_policy
allows the execution of element access functions to be
interleaved on a single thread, synchronization, including the use of mutexes, risks deadlock. Thus the
synchronization with parallel_vector_execution_policy
is restricted as follows:
A standard library function is vectorization-unsafe if it is specified to synchronize with
another function invocation, or another function invocation is specified to synchronize with it, and if
it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions
may not be invoked by user code called from parallel_vector_execution_policy
algorithms.
using namespace std::experimental::parallel; int x=0; std::mutex m; int a[] = {1,2}; for_each(The above program is invalid because the applications of the function object are not guaranteed to run on different threads. — end example ]vecpar_vec, std::begin(a), std::end(a), [&](int) { m.lock(); ++x; m.unlock(); });
m.lock
on the same thread, which may deadlock.
— end note ]
parallel_execution_policy
or the
vector_execution_policyparallel_vector_execution_policy
invocation allow the implementation to fall back to
sequential execution if the system cannot parallelize an algorithm invocation due to lack of
resources.
— end note ]
A parallel algorithm invoked with an execution policy object of type
parallel_execution_policy
or vector_execution_policy
may apply
iterator member functions of a stronger category than its specification requires. In this
case, the application of these member functions are subject to provisions 3. and 4. above,
respectively.
InputIterator
but
receives a concrete iterator of the category RandomAccessIterator
may use
operator[]
. In this case, it is the algorithm caller's responsibility to ensure
operator[]
is race-free.
— end note ]
Algorithms invoked with an execution policy object of type execution_policy
execute internally as if invoked with instances of type
the contained execution policy object.
sequential_execution_policy
,
parallel_execution_policy
, or an implementation-defined execution policy type depending
on the dynamic value of the execution_policy
object.
The semantics of parallel algorithms invoked with an execution policy object of
implementation-defined type are unspecifiedimplementation-defined.
ExecutionPolicy
algorithm overloads
Parallel algorithms coexist alongside their sequential counterparts as overloads
distinguished by a formal template parameter named
The Parallel Algorithms Library provides overloads for each of the algorithms named in
Table 1, corresponding to the algorithms with the same name in the C++ Standard Algorithms Library.
For each algorithm in ExecutionPolicy
. This
is the first template parameter and corresponds to the parallel algorithm's first function
parameter, whose type is ExecutionPolicy&&
.ExecutionPolicy
, which shall be the first template parameter.
In addition, each such overload shall have the new function parameter as the
first function parameter of type ExecutionPolicy&&
.
Unless otherwise specified, the semantics of ExecutionPolicy
algorithm overloads
are identical to their overloads without.
Parallel algorithms
have the requirement
shall not participate in overload resolution unless
is_execution_policy<ExecutionPolicy>::value
is true
is_execution_policy<decay_t<ExecutionPolicy>>::value
is true
.
The algorithms listed in ExecutionPolicy
overloads.
adjacent_difference |
adjacent_find |
all_of |
any_of |
copy |
copy_if |
copy_n |
count |
count_if |
equal |
exclusive_scan |
fill |
fill_n |
find |
find_end |
find_first_of |
find_if |
find_if_not |
for_each |
for_each_n |
generate |
generate_n |
includes |
inclusive_scan |
inner_product |
inplace_merge |
is_heap |
is_heap_until |
is_partitioned |
is_sorted |
is_sorted_until |
lexicographical_compare |
max_element |
merge |
min_element |
minmax_element |
mismatch |
move |
none_of |
nth_element |
partial_sort |
partial_sort_copy |
partition |
partition_copy |
reduce |
remove |
remove_copy |
remove_copy_if |
remove_if |
replace |
replace_copy |
replace_copy_if |
replace_if |
reverse |
reverse_copy |
rotate |
rotate_copy |
search |
search_n |
set_difference |
set_intersection |
set_symmetric_difference |
set_union |
sort |
stable_partition |
stable_sort |
swap_ranges |
transform |
uninitialized_copy |
uninitialized_copy_n |
uninitialized_fill |
uninitialized_fill_n |
unique |
unique_copy |
Define GENERALIZED_SUM(op, a1, ..., aN)
as follows:
a1
when N
is 1
op(GENERALIZED_SUM(op, b1, ..., bMK)
, GENERALIZED_SUM(op, bM, ..., bN))
where
b1, ..., bN
may be any permutation of a1, ..., aN
and0 < M < N
.1 < K+1 = M ≤ N
.
Define GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aN)
as follows:
a1
when N
is 1
op(GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aMK), GENERALIZED_NONCOMMUTATIVE_SUM(op, aM, ..., aN)
where 0 < M < N
1 < K+1 = M ≤ N
.
<experimental/algorithm>
synopsisnamespace std { namespace experimental { namespace parallel { inline namespace v1 { template<class ExecutionPolicy, class InputIterator, class Function> void for_each(ExecutionPolicy&& exec, InputIterator first, InputIterator last, Function f); template<class InputIterator, class Size, class Function> InputIterator for_each_n(InputIterator first, Size n, Function f); } } } }
template<class ExecutionPolicy,
class InputIterator, class Function>
void for_each(ExecutionPolicy&& exec,
InputIterator first, InputIterator last,
Function f);
f
to the result of dereferencing every iterator in the range [first,last)
.
first
satisfies the requirements of a mutable iterator, f
may
apply nonconstant functions through the dereferenced iterator.
— end note ]
f
exactly last - first
times.
f
returns a result, the result is ignored.
for_each
does not return a copy of
its Function
parameter, since parallelization may not permit efficient state
accumulation.
for_each
requires
Function
to meet the requirements of CopyConstructible
.
template<class InputIterator, class Size, class Function>
InputIterator for_each_n(InputIterator first, Size n,
Function f);
Function
shall meet the requirements of MoveConstructible
Function
need not meet the requirements of CopyConstructible
.
— end note ]
f
to the result of dereferencing every iterator in the range
[first,first + n)
, starting from first
and proceeding to first + n - 1
.
first
satisfies the requirements of a mutable iterator,
f
may apply nonconstant functions through the dereferenced iterator.
— end note ]
first + n
for non-negative values of n
and first
for negative values.
f
returns a result, the result is ignored.
template<class ExecutionPolicy,
class InputIterator, class Size, class Function>
InputIterator for_each_n(ExecutionPolicy && exec,
InputIterator first, Size n,
Function f);
f
to the result of dereferencing every iterator in the range
[first,first + n)
, starting from first
and proceeding to first + n - 1
.
first
satisfies the requirements of a mutable iterator,
f
may apply nonconstant functions through the dereferenced iterator.
— end note ]
first + n
for non-negative values of n
and first
for negative values.
f
returns a result, the result is ignored.
for_each_n
requires
Function
to meet the requirements of CopyConstructible
.
<experimental/numeric>
synopsisnamespace std { namespace experimental { namespace parallel { inline namespace v1 { template<class InputIterator> typename iterator_traits<InputIterator>::value_type reduce(InputIterator first, InputIterator last); template<class InputIterator, class T> T reduce(InputIterator first, InputIterator last, T init); template<class InputIterator, class T, class BinaryOperation> T reduce(InputIterator first, InputIterator last, T init, BinaryOperation binary_op);template<class InputIterator, class OutputIterator> OutputIterator exclusive_scan(InputIterator first, InputIterator last, OutputIterator result);template<class InputIterator, class OutputIterator, class T> OutputIterator exclusive_scan(InputIterator first, InputIterator last, OutputIterator result, T init); template<class InputIterator, class OutputIterator, class T, class BinaryOperation> OutputIterator exclusive_scan(InputIterator first, InputIterator last, OutputIterator result, T init, BinaryOperation binary_op); template<class InputIterator, class OutputIterator> OutputIterator inclusive_scan(InputIterator first, InputIterator last, OutputIterator result); template<class InputIterator, class OutputIterator, class BinaryOperation> OutputIterator inclusive_scan(InputIterator first, InputIterator last, OutputIterator result, BinaryOperation binary_op); template<class InputIterator, class OutputIterator,class T,class BinaryOperation, class T> OutputIterator inclusive_scan(InputIterator first, InputIterator last, OutputIterator result,T init,BinaryOperation binary_op, T init); } } } }
template<class InputIterator>
typename iterator_traits<InputIterator>::value_type
reduce(InputIterator first, InputIterator last);
reduce(first, last, typename iterator_traits<InputIterator>::value_type{})
.
reduce(first, last, typename iterator_traits<InputIterator>::value_type{})
typename iterator_traits<InputIterator>::value_type{}
shall be a valid expression. The operator+
function associated with
iterator_traits<InputIterator>::value_type
shall not invalidate iterators or
subranges, nor modify elements in the range [first,last)
.
last - first
) applications of operator+
.
reduce
and accumulate
is that the behavior
of reduce
may be non-deterministic for non-associative or non-commutative
operator+
.
template<class InputIterator, class T>
T reduce(InputIterator first, InputIterator last, T init);
reduce(first, last, init, plus<>())
.
reduce(first, last, init, plus<>())
operator+
function associated with T
shall not invalidate iterators
or subranges, nor modify elements in the range [first,last)
.
last - first
) applications of operator+
.
reduce
and accumulate
is that the behavior
of reduce
may be non-deterministic for non-associative or non-commutative operator+
.
template<class InputIterator, class T, class BinaryOperation>
T reduce(InputIterator first, InputIterator last, T init,
BinaryOperation binary_op);
GENERALIZED_SUM(binary_op, init, *first, ..., *(first + last - first - 1)*(first + (last - first) - 1))
.
binary_op
shall not invalidate iterators or subranges, nor modify elements in the
range [first,last)
.
last - first
) applications of binary_op
.
reduce
and accumulate
is that the behavior
of reduce
may be non-deterministic for non-associative or non-commutative operator+
binary_op
.
template<class InputIterator, class OutputIterator,
class T>
OutputIterator
exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
T init);
exclusive_scan(first, last, result, init, plus<>())
.
exclusive_scan(first, last, result, init, plus<>())
operator+
function associated with iterator_traits<InputIterator>::value_type
shall
not invalidate iterators or subranges, nor modify elements in the ranges [first,last)
or
[result,result + (last - first))
.
last - first
) applications of operator+
.
exclusive_scan
and inclusive_scan
is that
exclusive_scan
excludes the i
th input element from the i
th sum.
If the operator+
function is not mathematically associative, the behavior of
exclusive_scan
may be non-deterministic.
template<class InputIterator, class OutputIterator,
class T, class BinaryOperation>
OutputIterator
exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
T init, BinaryOperation binary_op);
i
in [result,result + (last - first))
the
value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., (*first + i - result - 1)*(first + (i - result) - 1))
.
result
.
binary_op
shall not invalidate iterators or subranges, nor modify elements in the
ranges [first,last)
or [result,result + (last - first))
.
last - first
) applications of binary_op
.
exclusive_scan
and inclusive_scan
is that
exclusive_scan
excludes the i
th input element from the i
th
sum. If binary_op
is not mathematically associative, the behavior of
exclusive_scan
may be non-deterministic.
template<class InputIterator, class OutputIterator>
OutputIterator
inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result);
inclusive_scan(first, last, result, plus<>())
.
inclusive_scan(first, last, result, plus<>())
operator+
function associated with
iterator_traits<InputIterator>::value_type
shall not invalidate iterators or
subranges, nor modify elements in the ranges [first,last)
or
[result,result + (last - first))
.
last - first
) applications of operator+
.
exclusive_scan
and inclusive_scan
is that
exclusive_scan
excludes the i
th input element from the i
th sum.
If the operator+
function is not mathematically associative, the behavior of
inclusive_scan
may be non-deterministic.
template<class InputIterator, class OutputIterator,
class BinaryOperation>
OutputIterator
inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
BinaryOperation binary_op);
template<class InputIterator, class OutputIterator,
class T, class BinaryOperation, class T>
OutputIterator
inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
T init, BinaryOperation binary_op, T init);
i
in [result,result + (last - first))
the value of
GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, *first, ..., (*first + i - result)*(first + (i - result)))
or
GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., (*first + i - result)*(first + (i - result)))
if init
is provided.
result
.
binary_op
shall not invalidate iterators or subranges, nor modify elements in the
ranges [first,last)
or [result,result + (last - first))
.
last - first)
applications of binary_op
.
exclusive_scan
and inclusive_scan
is that
inclusive_scan
includes the i
th input element in the i
th sum.
If binary_op
is not mathematically associative, the behavior of
inclusive_scan
may be non-deterministic.