While reviewing the C++17 CD and preparing my CppCon talk, I found some issues
revolving around transform_reduce() and the parallel form of inner_product().
1. inner_product() Cannot Be Parallelized
   The parallel version of inner_product() (e.g. ExecutionPolicy overload) is
not useful because the wording limits parallelization. 26.8.5 [inner.product]
p1 of the C++17 CD states that inner_product():
computes its result by initializing the accumulatoraccwith the initial valueinitand then modifying it withacc = acc + (*i1) * (*i2)oracc = binary_op1(acc, binary_op2(*i1,*i2))
Similar to accumulate(), while an inner_product() operation can be
parallelized in principle, the wording for inner_product() prevents
parallelization for reduction operations that are non-commutative and non-associative,
like floating point addition. For such operations, the wording forces each
iteration to depend on the prior iteration to ensure the correct ordering of
accumulation. This introduces a loop-carried dependency that makes it impossible
to parallelize.
2. inner_product() is a Form of transform_reduce()
   Just as reduce() is the parallel counterpart of accumulate() (same basic
operation but without a specific ordering), inner_product() has a natural
counter-part. Consider this possible implementation of inner_product():
template <class InputIt1, class InputIt2, class T, class ReductionOperation, class BinaryTransformOperation> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init, ReductionOperation reduce_op BinaryTransformOperation transform_op) { while (first1 != last1) { init = reduce_op(init, transform_op(*first1, *first2)); ++first1; ++first2; } return value; }
The application of transform_op to the sequence is a binary transform():
template <class InputIt1, class InputIt2, class OutputIt, class BinaryTransformOperation> OutputIt transform(InputIt1 first1, InputIt1 last1, InputIt2 first2, OutputIt o_first, BinaryTransformOperation transform_op) { while (first1 != last1) { *o_first++ = transform_op(*first1++, *first2++); } return o_first; }
And the application of reduce_op accumulates the transformed values:
template <class InputIt, class T, class ReductionOperation> T accumulate(InputIt first, InputIt last, T init, ReductionOperation reduce_op) { for (; first != last; ++first) { init = reduce_op(init, *first); } return init; }
    So, inner_product() is an 
    accumulate()sreduce()s
of a transform()ed
sequence. It’s a transform_reduce()!.
   
3. transform_reduce() is Missing an Overload
   However, transform_reduce() is missing an overload for a binary transform;
only overloads with unary transform operations are currently specified by the
C++17 CD.  I call this missing overload binary-binary transform_reduce(), since both the reduction and transformation
operation are binary operations.
This overload, which is equivalent to parallel inner_product() with weaker
ordering, is very useful. Many of the typical examples of transform_reduce() usage that I’ve seen use tricks to perform a binary-binary transform_reduce() using the unary-binary transform_reduce() that is in the C++17 CD.
The typical transform_reduce() dot product example (similar to what is
found in the original transform_reduce() proposal [N3960]) looks like this:
std::vector<std::tuple<double, double> > XY = // ... double dot_product = std::transform_reduce( std::par_unseq, // Input sequence. XY.begin(), XY.end(), // Unary transformation operation. [](std::tuple<double, double> const& xy) { // Array of struct means this is tricky to execute on vector hardware: // // memory layout: X[0] Y[0] X[1] Y[1] X[2] Y[2] X[3] Y[3] ... // // op #0: load a pack of Xs (they aren’t contiguous; the load will be // strided, may need to access multiple // cache lines and may be harder for the // hardware prefetcher to handle) // op #1: load a pack of Ys (same as above) // op #2: multiply the pack of Xs by vector of Ys return std::get<0>(xy) * std::get(xy); }, double(0.0), // Initial value for reduction. // Binary reduction operation. std::plus<double>() );
Note that this is array-of-structs NOT struct-of-arrays. The HPX and THRUST dot product examples use iterator tricks (zip iterators or counting iterators) to switch to a struct-of-arrays scheme. The zip iterator trick is shown below, using Boost:
std::vector<double> X = // ... std::vector<double> Y = // ... double dot_product = std::transform_reduce( std::par_unseq, // Input sequence. boost::make_zip_iterator(X.begin(), Y.begin()), boost::make_zip_iterator(X.end(), Y.end()), // Unary transformation operation. [](auto&& xy) // std::tuple<double, double>-esque { // Struct of arrays means this is easier to execute on vector hardware: // // memory layout: X[0] X[1] X[2] X[3] ... Y[0] Y[1] Y[2] Y[3] ... // // op #0: load a pack of Xs (elements are contiguous, load will access // only one cache line and the hardware // prefetcher will easily track the memory // stream) // op #1: load a pack of Ys (same as above) // op #2: multiply the pack of Xs by the pack of Ys return std::get<0>(xy) * std::get<1>(xy); }, double(0.0), // Initial value for reduction. // Binary reduction operation. std::plus<double>() );
More examples of this pattern can be found in HPX (zip iterator example and counting iterator example) and THRUST (zip iterator example).
4. transform_reduce() Parameter Order is Odd
   The missing binary-binary transform_reduce() would look like this:
template <class InputIt1, class InputIt2, class BinaryTransformOperation, class T, class ReductionOperation> // Always binary. T transform_reduce(InputIt1 first1, InputIt1 last1, InputIt2 first2, BinaryTransformOperation transform_op, T init, ReductionOperation reduce_op);
Note that the order of parameters, which is consistent with the other transform_reduce() overloads, is inconsistent with inner_product(), transform() and reduce():
template <class InputIt1, class InputIt2, class T, class ReductionOperation, class BinaryTransformOperation> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init, ReductionOperation reduce_op BinaryTransformOperation transform_op) template <class InputIt1, class InputIt2, class OutputIt, class BinaryTransformOperation> OutputIt transform(InputIt1 first1, InputIt1 last1, InputIt2 first2, OutputIt o_first, BinaryTransformOperation transform_op); template <class InputIt, class T, class ReductionOperation> T accumulate(InputIt first, InputIt last, T init, ReductionOperation reduce_op);
The inconsistencies are:
- 
     In inner_product(),accumulate()andreduce()(which is not shown but has the same interface asaccumulate()), the initial valueinitparameter comes after the iterator parameters and before the operation parameters. Intransform_reduce(), it is in between the transformation operation parameter and the reduction operation parameter.
- 
     In inner_product(), the reduction operation parameter comes before the transformation operation parameter.
- 
     inner_product(),accumulate()andreduce()have overloads which do not take any operations and useoperator+for reduction andoperator*for transformation.
5. Proposed Resolutions
I filed a US national body ballot comment about the issue with the parallel inner_product() overload. The following solutions would resolve that comment.
5.1. Resolution 1: Rename inner_product() to transform_reduce() and Fix transform_reduce()
   Rename the useless parallel inner_product() with a binary-binary transform_reduce(), and we should change the transform_reduce() interface
to have the same parameter order as inner_product(), accumulate(), reduce() and transform().
Specifically, we should:
- 
     Rename the ExecutionPolicyoverload forinner_product()totransform_reduce().
- 
     Add a new serial binary-binary transform_reduce()overload.
- 
     Add two new binary-binary transform_reduce()overloads (one serial overload and one parallelExecutionPolicyoverload).
- 
     Change the parameter order for all transform_reduce()overloads so that they are consistent withinner_product(),accumulate(),reduce()andtransform():- 
       The initial value parameter comes after the iterator parameters and before the operation parameters. 
- 
       The reduction operation parameter comes before the transformation operation parameters. 
 
- 
       
- 
     Add transform_reduce()overloads that do not take operation parameters and useoperator+for reduction andoperator*for transformation.
With these changes, a parallel dot product over struct-of-arrays data can be written as:
std::vector<double> X = // ... std::vector<double> Y = // ... double dot_product = std::transform_reduce(std::par_unseq, x.begin(), x.end(), y.begin(), double(0.0));
A parallel word count could be written with binary-binary transform_reduce() as:
bool is_word_beginning(char left, char right) { return std::isspace(left) && !std::isspace(right); } std::size_t word_count(std::string_view s) { if (s.empty()) return 0; // Count the number of characters that start a new word. return std::transform_reduce( std::par_unseq, // "Left" input sequence: s[0], s[1], ..., s[s.size() - 2] s.begin(), s.end() - 1, // "Right" input sequence: s[1], s[2], ..., s[s.size() - 1] s.begin() + 1, // Initial value for reduction: if the first character // is not a space, then it’s the beginning of a word. std::size_t(!std::isspace(s.front()) ? 1 : 0), // Binary transformation operation. std::plus<std::size_t>(), // Binary transformation operation: Return 1 when we hit // a new word. is_word_beginning ); }
5.2. Resolution 2: Weaken inner_product()
   Alternatively, we could weaken ordering required by inner_product(), which
would make it equivalent to binary-binary transform_reduce(). This approach
is not backwards compatible; implementations could
rewrite their serial inner_product() in a way that uses a different ordering
than the one previously required, breaking user code. Additionally, different
implementations might provide different orderings, which could cause portably
issues.
However, this would still leave transform_reduce() with an interface that is
inconsistent with other algorithms that implement very similar behavior. If we
ship transform_reduce() with a broken interface, it will be difficult to fix
it later because we would have to break users code.
5.3. Resolution 3: Weaken inner_product() and Fix transform_reduce()
   Adopt Resolution 2, and also the parts of Resolution 1 that fix the
inconsistencies in transform_reduce():
- 
     Change the parameter order for all transform_reduce()overloads so that they are consistent withinner_product(),accumulate(),reduce()andtransform():- 
       The initial value parameter comes after the iterator parameters and before the operation parameters. 
- 
       The reduction operation parameter comes before the transformation operation parameters. 
 
- 
       
- 
     Add transform_reduce()overloads that do not take operation parameters and useoperator+for reduction andoperator*for transformation.
5.4. Resolution 4: Remove Parallel inner_product()
   The simplest approach, and least desirable, is to simply remove the parallel inner_product() overloads. This removes useful functionality which was
intended to go into the standard (e.g. binary transform_reduce()).
Additionally, it would leave transform_reduce()s interface inconsistent with
the other algorithms, and it will be difficult to fix that interface in the
future because it would be a breaking change.
6. Acknowledgement
Thanks to:
- 
     Agustín Bergé for helping identify this issue. 
- 
     Hartmut Kaiser, JF Bastien, Michael Garland and Jared Hoberock for providing feedback.