| Document Number: | |
|---|---|
| Date: | |
| Revises: | |
| Editor: | NVIDIA Corporation | 
Note: this is an early draft. It’s known to be incomplet and incorrekt, and it has lots of bad formatting.
This Technical Specification describes requirements for implementations of an interface that computer programs written in the C++ programming language may use to invoke algorithms with parallel execution. The algorithms described by this Technical Specification are realizable across a broad class of computer architectures.
This Technical Specification is non-normative. Some of the functionality described by this Technical Specification may be considered for standardization in a future version of C++, but it is not currently part of any C++ standard. Some of the functionality in this Technical Specification may never be standardized, and other functionality may be standardized in a substantially changed form.
The goal of this Technical Specification is to build widespread existing practice for parallelism in the C++ standard algorithms library. It gives advice on extensions to those vendors who wish to provide them.
The following referenced document is indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO/IEC 14882:— is herein called the C++ Standard. The library described in ISO/IEC 14882:— clauses 17-30 is herein called the C++ Standard Library. The C++ Standard Library components described in ISO/IEC 14882:— clauses 25 and 26.7 are herein called the C++ Standard Algorithms Library.
Unless otherwise specified, the whole of the C++ Standard's Library
    introduction (
Since the the extensions described in this Technical Specification are
    experimental and not part of the C++ Standard Library, they should not be
    declared directly within namespace std. Unless otherwise specified, all
    components described in this Technical Specification are declared in namespace 
    std::experimental::parallel.
std. 
    
    — end note ]
  Unless otherwise specified, references to such entities described in this
    Technical Specification are assumed to be qualified with
    std::experimental::parallel, and references to entities described in the C++
    Standard Library are assumed to be qualified with std::.
Extensions that are expected to eventually be added to an existing header
    <meow> are provided inside the <experimental/meow> header,
    which shall include the standard contents of <meow> as if by
    #include <meow>
For the purposes of this document, the terms and definitions given in the C++ Standard and the following apply.
A parallel algorithm is a function template described by this Technical Specification declared in namespace std::experimental::parallel with a formal template parameter named ExecutionPolicy.
This subclause describes classes that represent execution policies. An execution policy is an object that expresses the requirements on the ordering of functions invoked as a consequence of the invocation of a standard algorithm. Execution policies afford standard algorithms the discretion to execute in parallel.
std::vector<int> v = ... // standard sequential sort— end example ]std::sort(vec.begin(), vec.end());std::sort(std::begin(vec), std::end(vec)); using namespace std::experimental::parallel; // explicitly sequential sortsort(seq, v.begin(), v.end());sort(seq, std::begin(v), std::end(v)); // permitting parallel executionsort(par, v.begin(), v.end());sort(par, std::begin(v), std::end(v)); // permitting vectorization as wellsort(vec, v.begin(), v.end());sort(vec, std::begin(v), std::end(v)); // sort with dynamically-selected execution size_t threshold = ... execution_policy exec = seq; if(v.size() > threshold) { exec = par; }sort(exec, v.begin(), v.end());sort(exec, std::begin(v), std::end(v));
<experimental/execution_policy> synopsisnamespace std {
namespace experimental {
namespace parallel {
  
  
    namespace std {
namespace experimental {
namespace parallel {
  template<class T> struct is_execution_policy
    : integral_constant<bool, see below> { };
}
}
}
    is_execution_policy can be used to detect parallel execution policies for the purpose of excluding function signatures from otherwise ambiguous overload resolution participation.
If T is the type of a standard or implementation-defined execution policy, is_execution_policy<T> shall be publicly derived from integral_constant<bool,true>,
    otherwise from integral_constant<bool,false>.
The behavior of a program that adds specializations for is_execution_policy is undefined.
namespace std {
namespace experimental {
namespace parallel {
  class sequential_execution_policy{};
}
}
}
    The class sequential_execution_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and require that a parallel algorithm's execution may not be parallelized.
namespace std {
namespace experimental {
namespace parallel {
 class parallel_execution_policy{};
}
}
}
    The class parallel_execution_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be parallelized.
namespace std {
namespace experimental {
namespace parallel {
 class vector_execution_policy{};
}
}
}
    The class vector_execution_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be vectorized.
namespace std {
namespace experimental {
namespace parallel {
  class execution_policy
  {
    public:
      
    The class execution_policy is a dynamic container for execution policy objects.
    execution_policy allows dynamic control over standard algorithm execution.
std::vector<float> sort_me = ...
        
using namespace std::experimental::parallel;
std::std::std::sort(exec, sort_me.begin(), sort_me.end());
std::sort(exec, std::begin(sort_me), std::end(sort_me));
    
    — end example ]
  Objects of type execution_policy shall be constructible and assignable from objects of
    type T for which is_execution_policy<T>::value is true.
execution_policy construct/assigntemplate<class T> execution_policy(const T& exec); execution_policy object with a copy of exec's state.is_execution_policy<T>::value is true.
          is_execution_policy<T>::value is true.
          template<class T> execution_policy& operator=(const T& exec); exec's state to *this.*this.
        is_execution_policy<T>::value is true.
          is_execution_policy<T>::value is true.
          execution_policy object access
          const type_info& type() const noexcept;
         typeid(T), such that T is the type of the execution policy object contained by *this.
          template<class T> T* get() noexcept;
          template<class T> const T* get() noexcept;
         target_type() == typeid(T), a pointer to the stored execution policy object; otherwise a null pointer.is_execution_policy<T>::value is true.
          is_execution_policy<T> is true.
          namespace std {
namespace experimental {
namespace parallel {
  constexpr sequential_execution_policy seq = sequential_execution_policy();
  constexpr parallel_execution_policy   par = parallel_execution_policy();
  constexpr vector_execution_policy     vec = vector_execution_policy();
}
}
}
    The header <execution_policy> declares a global object associated with each type of execution policy defined by this Technical Specification.
          If temporary memory resources are required by the algorithm and none are available,
          the algorithm throws a std::bad_alloc exception.
        
During the execution of a standard parallel algorithm, if the application of a function object terminates with an uncaught exception, the behavior of the program is determined by the type of execution policy used to invoke the algorithm:
vector_execution_policy,
              std::terminate shall be called.
            sequential_execution_policy or
              parallel_execution_policy, the execution of the algorithm terminates with an
              exception_list exception. All uncaught exceptions thrown during
              the application of user-provided function objects shall be contained in the
              exception_list.
              for_each is unspecified. When for_each is executed sequentially,
                only one exception will be contained in the exception_list object.
              
    — end note ]
  std::bad_alloc, all exceptions thrown during the execution of
                the algorithm are communicated to the caller. It is unspecified whether an algorithm implementation will "forge ahead" after 
                encountering and capturing a user exception.
              
    — end note ]
  std::bad_alloc exception even if one or more
                user-provided function objects have terminated with an exception. For example, this can happen when an algorithm fails to allocate memory while
                creating or adding elements to the exception_list object.
              
    — end note ]
  <experimental/exception_list> synopsisnamespace std {
namespace experimental {
namespace parallel {
  class exception_list : public exception
  {
    public:
      typedef exception_ptr                                             value_type;
      typedef const value_type&                                         reference;
      typedef const value_type&                                         const_reference;
      typedef implementation-defined                                    const_iterator;
      typedef const_iterator                                            iterator;
      typedef typename iterator_traits::difference_type                 difference_type;
      typedef size_t                                                    size_type;
  
      size_t size() const noexcept;
      iterator begin() const noexcept;
      iterator end() const noexcept;
  
    private:
      std::list<exception_ptr> exceptions_; // exposition only
  };
}
}
}
       
      
        The class exception_list is a container of exception_ptr objects parallel
        algorithms may use to communicate uncaught exceptions encountered during parallel execution to the
        caller of the algorithm.
      
        The type exception_list::const_iterator shall fulfill the requirements of
        ForwardIterator.
      
          size_t size() const noexcept;
         exception_ptr objects contained within the exception_list.
        
          exception_list::iterator begin() const noexcept;
         exception_ptr object contained within the exception_list.
        
          exception_list::iterator end() const noexcept;
         exception_list.
        
        Parallel algorithms have template parameters named ExecutionPolicy which describe
        the manner in which the execution of these algorithms may be parallelized and the manner in
        which they apply user-provided function objects.
      
        The applications of function objects in parallel algorithms invoked with an execution policy
        object of type sequential_execution_policy execute in sequential order in the
        calling thread.
      
        The applications of function objects in parallel algorithms invoked with an execution policy
        object of type parallel_execution_policy are permitted to execute in an unordered
        fashion in unspecified threads, and indeterminately sequenced within each thread. 
        
using namespace std::experimental::parallel;
int a[] = {0,1};
std::vector<int> v;
for_each(par, std::begin(a), std::end(a), [&](int i) {
  v.push_back(i*2+1);
});
foo bar
        The program above has a data race because of the unsynchronized access to the container
        v.
      
    — end example ]
  using namespace std::experimental::parallel; std::atomicThe above example depends on the order of execution of the iterations, and is therefore undefined (may deadlock). — end example ]x = 0; int a[] = {1,2}; for_each(par, std::begin(a), std::end(a), [](int n) { x.fetch_add(1, std::memory_order_relaxed); // spin wait for another iteration to change the value of x while (x.load(std::memory_order_relaxed) == 1) { } }); 
using namespace std::experimental::parallel;
int x;
std::mutex m;
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});
        The above example synchronizes access to object x ensuring that it is
        incremented correctly.
      
    — end example ]
  
        The applications of function objects in parallel algorithms invoked with an execution policy
        of type vector_execution_policy are permitted to execute in an unordered fashion
        in unspecified threads, and unsequenced within each thread.
      
        vector_execution_policy
          policy must not synchronize with each other. Specifically, they must not acquire locks.
        
    — end note ]
  
using namespace std::experimental::parallel;
int x;
std::mutex m;
int a[] = {1,2};
for_each(vec, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});
        The above program is invalid because the applications of the function object are not
        guaranteed to run on different threads.
      
    — end example ]
  m.lock on the same thread, which may deadlock.
      
    — end note ]
  parallel_execution_policy or the
        vector_execution_policy invocation allow the implementation to fall back to
        sequential execution if the system cannot parallelize an algorithm invocation due to lack of
        resources.
      
    — end note ]
  
        If they exist, a parallel algorithm invoked with an execution policy object of type
        parallel_execution_policy or vector_execution_policy may apply
        iterator member functions of a stronger category than its specification requires. In this
        case, the application of these member functions are subject to provisions 3. and 4. above,
        respectively.
      
InputIterator but
        receives a concrete iterator of the category RandomAccessIterator may use
        operator[]. In this case, it is the algorithm caller's responsibility to ensure
        operator[] is race-free.
      
    — end note ]
  
        Algorithms invoked with an execution policy object of type execution_policy
        execute internally as if invoked with instances of type 
        the contained execution policy object.
      sequential_execution_policy,
        parallel_execution_policy, or an implementation-defined execution policy type depending
        on the dynamic value of the execution_policy object.
The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are unspecified.
ExecutionPolicy algorithm overloads
        Parallel algorithms coexist alongside their sequential counterparts as overloads
        distinguished by a formal template parameter named ExecutionPolicy. This
        template parameter corresponds to the parallel algorithm's first function parameter, whose
        type is 
        is the first template parameter and corresponds to the parallel algorithm's first function
        parameter, whose type is ExecutionPolicyExecutionPolicy&&.
      
        Unless otherwise specified, the semantics of ExecutionPolicy algorithm overloads
        are identical to their overloads without.
      
        Parallel algorithms
        have the requirement 
        shall not participate in overload resolution unless
        is_execution_policy<ExecutionPolicy>::value is trueis_execution_policy<ExecutionPolicy>::value is true.
      
The algorithms listed in ExecutionPolicy overloads.
| adjacent_difference | adjacent_find | all_of | any_of | 
| copy | copy_if | copy_n | count | 
| count_if | equal | exclusive_scan | fill | 
| fill_n | find | find_end | find_first_of | 
| find_if | find_if_not | for_each | for_each_n | 
| generate | generate_n | includes | inclusive_scan | 
| inner_product | inplace_merge | is_heap | is_heap_until | 
| is_partitioned | is_sorted | is_sorted_until | lexicographical_compare | 
| max_element | merge | min_element | minmax_element | 
| mismatch | move | none_of | nth_element | 
| partial_sort | partial_sort_copy | partition | partition_copy | 
| reduce | remove | remove_copy | remove_copy_if | 
| remove_if | replace | replace_copy | replace_copy_if | 
| replace_if | reverse | reverse_copy | rotate | 
| rotate_copy | search | search_n | set_difference | 
| set_intersection | set_symmetric_difference | set_union | sort | 
| stable_partition | stable_sort | swap_ranges | transform | 
| uninitialized_copy | uninitialized_copy_n | uninitialized_fill | uninitialized_fill_n | 
| unique | unique_copy | 
      Define GENERALIZED_SUM(op, a1, ..., aN) as follows:
      
a1 when N is 1op(GENERALIZED_SUM(op, b1, ..., bM), GENERALIZED_SUM(op, bM, ..., bN)) where
           
          b1, ..., bN may be any permutation of a1, ..., aN and0 < M < N.
      Define GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aN) as follows:
      
a1 when N is 1op(GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aM), GENERALIZED_NONCOMMUTATIVE_SUM(op, aM, ..., aN) where 0 < M < N.
        <experimental/algorithm> synopsisnamespace std {
namespace experimental {
namespace parallel {
  template<class ExecutionPolicy,
           class InputIterator, class Function>
    void for_each(ExecutionPolicy&& exec,
                  InputIterator first, InputIterator last,
                  Function f);
  template<class InputIterator, class Size, class Function>
    InputIterator for_each_n(InputIterator first, Size n,
                             Function f);
}
}
}
    
    
          template<class ExecutionPolicy,
                   class InputIterator, class Function>
            void for_each(ExecutionPolicy&& exec,
                          InputIterator first, InputIterator last,
                          Function f);
         f to the result of dereferencing every iterator in the range [first,last).
          first satisfies the requirements of a mutable iterator, f may
            apply nonconstant functions through the dereferenced iterator.
          
    — end note ]
  f exactly last - first times.
        f returns a result, the result is ignored.
        for_each does not return a copy of
          its Function parameter, since parallelization may not permit efficient state
          accumulation.
          
            Unlike its sequential form, the parallel overload of for_each requires
            Function to meet the requirements of CopyConstructible, but not
            MoveConstructible.
          
        
          template<class InputIterator, class Size, class Function>
            InputIterator for_each_n(InputIterator first, Size n,
                                     Function f);
         Function shall meet the requirements of MoveConstructible
          
          Function need not meet the requirements of CopyConstructible.
          
    — end note ]
  f to the result of dereferencing every iterator in the range
          [first,first + n), starting from first and proceeding to first + n - 1.
          first satisfies the requirements of a mutable iterator,
            f may apply nonconstant functions through the dereferenced iterator.
          
    — end note ]
  first + n for non-negative values of n and first for negative values.
        f returns a result, the result is ignored.
        
          template<class ExecutionPolicy,
                   class InputIterator, class Size, class Function>
                   InputIterator for_each_n(ExecutionPolicy && exec,
                                            InputIterator first, Size n,
                                            Function f);
           f to the result of dereferencing every iterator in the range
            [first,first + n), starting from first and proceeding to first + n - 1.
            first satisfies the requirements of a mutable iterator,
              f may apply nonconstant functions through the dereferenced iterator.
            
    — end note ]
  first + n for non-negative values of n and first for negative values.
          f returns a result, the result is ignored.
          for_each_n requires
            Function to meet the requirements of CopyConstructible, but not
            MoveConstructible.
          <experimental/numeric>namespace std {
namespace experimental {
namespace parallel {
  template<class InputIterator>
    typename iterator_traits<InputIterator>::value_type
      reduce(InputIterator first, InputIterator last);
  template<class InputIterator, class T>
    T reduce(InputIterator first, InputIterator last T init);
  template<class InputIterator, class T, class BinaryOperation>
    T reduce(InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator>
    OutputIterator
      exclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result);
  template<class InputIterator, class OutputIterator,
           class T>
    OutputIterator
      exclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init);
  template<class InputIterator, class OutputIterator,
           class T, class BinaryOperation>
    OutputIterator
      exclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init, BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator>
    OutputIterator
      inclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result);
  template<class InputIterator, class OutputIterator,
           class BinaryOperation>
    OutputIterator
      inclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator,
           class T, class BinaryOperation>
    OutputIterator
      inclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init, BinaryOperation binary_op);
}
}
}
    
    
          template<class InputIterator>
            typename iterator_traits<InputIterator>::value_type
              reduce(InputIterator first, InputIterator last);
         reduce(first, last, typename iterator_traits<InputIterator>::value_type{})
        typename iterator_traits<InputIterator>::value_type{}
          shall be a valid expression. The operator+ function associated with
          iterator_traits<InputIterator>::value_type shall not invalidate iterators or
          subranges, nor modify elements in the range [first,last).
        last - first) applications of operator+.
        reduce and accumulate is that the behavior
          of reduce may be non-deterministic for non-associative or non-commutative
          operator+.
        
          template<class InputIterator, class T>
            T reduce(InputIterator first, InputIterator last, T init);
         reduce(first, last, init, plus<>())
        operator+ function associated with T shall not invalidate iterators
          or subranges, nor modify elements in the range [first,last).
        last - first) applications of operator+.
        reduce and accumulate is that the behavior
          of reduce may be non-deterministic for non-associative or non-commutative operator+.
        
          template<class InputIterator, class T, class BinaryOperation>
            T reduce(InputIterator first, InputIterator last, T init,
                     BinaryOperation binary_op);
         GENERALIZED_SUM(binary_op, init, *first, ..., *(first + last - first - 1)).
        binary_op shall not invalidate iterators or subranges, nor modify elements in the
          range [first,last).
        last - first) applications of binary_op.
        reduce and accumulate is that the behavior
          of reduce may be non-deterministic for non-associative or non-commutative operator+binary_op.
        
          template<class InputIterator, class OutputIterator,
                   class T>
            OutputIterator
              exclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result,
                             T init);
         exclusive_scan(first, last, result, init, plus<>())
        operator+ function associated with iterator_traits<InputIterator>::value_type shall
          not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or
          [result,result + (last - first)).
        last - first) applications of operator+.
        exclusive_scan and inclusive_scan is that
          exclusive_scan excludes the ith input element from the ith sum.
          If the operator+ function is not mathematically associative, the behavior of
          exclusive_scan may be non-deterministic.
        
          template<class InputIterator, class OutputIterator,
                   class T, class BinaryOperation>
            OutputIterator
              exclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result,
                             T init, BinaryOperation binary_op);
         i in [result,result + (last - first)) the
          value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., (*first + i - result - 1)).
        result.
        binary_op shall not invalidate iterators or subranges, nor modify elements in the
          ranges [first,last) or [result,result + (last - first)).
        last - first) applications of binary_op.
        exclusive_scan and inclusive_scan is that
          exclusive_scan excludes the ith input element from the ith
          sum. If binary_op is not mathematically associative, the behavior of
          exclusive_scan may be non-deterministic.
        
          template<class InputIterator, class OutputIterator>
            OutputIterator
              inclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result);
         inclusive_scan(first, last, result, plus<>())
        operator+ function associated with
          iterator_traits<InputIterator>::value_type shall not invalidate iterators or
          subranges, nor modify elements in the ranges [first,last) or
          [result,result + (last - first)).
        last - first) applications of operator+.
        exclusive_scan and inclusive_scan is that
          exclusive_scan excludes the ith input element from the ith sum.
          If the operator+ function is not mathematically associative, the behavior of
          inclusive_scan may be non-deterministic.
        
          template<class InputIterator, class OutputIterator,
                   class BinaryOperation>
            OutputIterator
              inclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result,
                             BinaryOperation binary_op);
          
          template<class InputIterator, class OutputIterator,
                   class T, class BinaryOperation>
            OutputIterator
              inclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result,
                             T init, BinaryOperation binary_op);
         i in [result,result + (last - first)) the value of
          GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, *first, ..., (*first + i - result)) or
          GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., (*first + i - result))
          if init is provided.
        result.
        binary_op shall not invalidate iterators or subranges, nor modify elements in the 
          ranges [first,last) or [result,result + (last - first)).
        last - first) applications of binary_op.
        exclusive_scan and inclusive_scan is that
          inclusive_scan includes the ith input element in the ith sum.
          If binary_op is not mathematically associative, the behavior of
          inclusive_scan may be non-deterministic.