1. Motivation
ISO/IEC TS 19570:2018 introduced data-parallel types in the C++ Extensions for
Parallelism TS, and [P1928R3] proposes merging them into standard C++. That
paper, and several ancillary papers, do an excellent job of setting out the
main features of an extension to C++ which allows generic data-parallel
programming on arbitrary targets. However, it is inevitable that the programmer
will want to make some use of target-specific intrinsics in order to unlock
some of the more unusual features of those specific platforms. This requires
that the programmer be able to pass a SIMD value to a call to a target
intrinsic, and that the result of the intrinsic call can be used to generate a
new SIMD value. This is already permitted in [P1928R3] by way of explicit
conversions between SIMD types and implementation-defined target-specific
types, but only for SIMD values which exactly fit a native register.
In this paper we will propose a function called simd_invoke which removes the
boiler-plate otherwise needed to call target intrinsics with SIMD values of
arbitrary size.
2. Background
Although [P1928R3] does not itself expose any target-specific operations, its basic_simd type already provides a pair of explicit conversions to and from an implementation-defined type:
constexpr explicit operator implementation-defined() const;
constexpr explicit basic_simd(const implementation-defined& init);
These allow a SIMD value to be converted into a target-specific type which can be used to call a compiler intrinsic. The result is then converted back into the equivalent SIMD type:
simd<float> addsub(simd<float> a, simd<float> b) {
    return static_cast<simd<float>>(
        _mm256_addsub_ps(static_cast<__m256>(a), static_cast<__m256>(b)));
}
In this example the inputs, which are native register-sized SIMD values, are explicitly converted into their target-specific types. Those target-specific types are used to call the intrinsic, and the target-specific return value is then converted back into the closest SIMD representation.
This example is straightforward, since the SIMD values are exactly the correct
native size. For targets which support several different register sizes (e.g.,
some variants of AVX support 128-, 256- and 512-bit registers) the code can use
an if constexpr cascade to select the appropriate intrinsic for the native
size:
simd<float> addsub(simd<float> a, simd<float> b) {
    if constexpr (simd<float>::size() == 4)
        return simd<float>(
            _mm_addsub_ps(static_cast<__m128>(a), static_cast<__m128>(b)));
    else if constexpr (simd<float>::size() == 8)
        return simd<float>(
            _mm256_addsub_ps(static_cast<__m256>(a), static_cast<__m256>(b)));
    else
        error(); // Invalid native register
}
Things become more tricky when dealing with SIMD values which are larger than their intrinsic types. Such values cannot be converted into a type which can be used to call an intrinsic. Instead, the big SIMD value must be broken down into small pieces which can be used to call the intrinsic, and the results then glued back together. Here is one way to do that for a SIMD value which is twice as big as a native register:
// Assumes AVX is in use, and that each native register is therefore 8 x float.
simd<float, 16> addsub(simd<float, 16> a, simd<float, 16> b) {
    // Get register-sized pieces.
    auto [lowA, highA] = simd_split<simd<float>>(a);
    auto [lowB, highB] = simd_split<simd<float>>(b);

    // Call the intrinsic on each pair of pieces.
    auto resultLow = simd<float>(
        _mm256_addsub_ps(static_cast<__m256>(lowA), static_cast<__m256>(lowB)));
    auto resultHigh = simd<float>(
        _mm256_addsub_ps(static_cast<__m256>(highA), static_cast<__m256>(highB)));

    // Glue the individual results back together.
    return simd_concat(resultLow, resultHigh);
}
This is now getting verbose, and it only handles SIMD value inputs which are
twice the size of a register. To use the intrinsic with larger SIMD values, or
SIMD values which don’t map onto native register-sized pieces, more work is
needed (e.g., a simd<float, 19> must be processed as two complete 8-element
registers followed by a 3-element tail).
The boiler-plate code needed to handle this is technically straightforward, but
verbose. A completely generic solution which could handle arbitrary SIMD value
sizes would also require additional mechanisms, such as immediately invoked
lambdas or index sequences, as sketched below. Rather than requiring every user
of target intrinsics to write this boiler-plate themselves, we propose a small
utility which does it for them.
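For reference, here is a minimal sketch of that hand-written generic mechanism, assuming the element count divides the native size evenly and that simd_split yields an array-like sequence of pieces; the helper name invoke_in_pieces is purely illustrative:
template<typename Fn, typename T, typename Abi>
auto invoke_in_pieces(Fn&& fn, basic_simd<T, Abi> a, basic_simd<T, Abi> b) {
    // Break both inputs into native register-sized pieces.
    auto piecesA = simd_split<simd<T>>(a);
    auto piecesB = simd_split<simd<T>>(b);
    // An immediately invoked lambda expands an index sequence so that fn can
    // be applied to every pair of pieces and the results glued back together.
    return [&]<std::size_t... Is>(std::index_sequence<Is...>) {
        return simd_concat(fn(piecesA[Is], piecesB[Is])...);
    }(std::make_index_sequence<std::tuple_size_v<decltype(piecesA)>>{});
}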
3. simd_invoke 
The simd_invoke function has the following declaration:
template<std::size_t BlockSize = 0, typename Fn, typename... Args>
constexpr auto simd_invoke(Fn&& fn, Args&&... args);
The simd_invoke utility accepts a Callable and a list of SIMD arguments. It breaks those arguments into register-sized pieces, invokes the Callable on each set of pieces, and glues the individual results back together. To use it, we first wrap the native-sized intrinsic call in a named function:
inline auto native_addsub(simd<float> lhs, simd<float> rhs) {
    auto nativeLhs = static_cast<__m256>(lhs);
    auto nativeRhs = static_cast<__m256>(rhs);
    return simd<float>(_mm256_addsub_ps(nativeLhs, nativeRhs));
}
Given this wrapper for native-sized intrinsic calls, we can now use simd_invoke to apply it to SIMD values of arbitrary size:
auto addsub(simd<float, 32> x, simd<float, 32> y) {
    return simd_invoke(native_addsub, x, y);
}
The simd_invoke call splits x and y into native register-sized pieces, calls native_addsub on each pair of pieces, and concatenates the individual results into a single simd<float, 32> return value.
Let’s look at how the example from § 2 Background looks with simd_invoke:
Before:

// Assumes AVX is in use, and that each native register is therefore 8 x float.
simd<float, 16> addsub(simd<float, 16> a, simd<float, 16> b) {
    auto [lowA, highA] = simd_split<simd<float>>(a);
    auto [lowB, highB] = simd_split<simd<float>>(b);
    auto resultLow = simd<float>(
        _mm256_addsub_ps(static_cast<__m256>(lowA), static_cast<__m256>(lowB)));
    auto resultHigh = simd<float>(
        _mm256_addsub_ps(static_cast<__m256>(highA), static_cast<__m256>(highB)));
    return simd_concat(resultLow, resultHigh);
}

After:

simd<float, 16> addsub(simd<float, 16> a, simd<float, 16> b) {
    return simd_invoke(native_addsub, a, b);
}
By default the function will use the native size for the element type, with the
aim of calling the intrinsic with the largest permitted builtin type. However,
simd_invoke also accepts an explicit block size as its first template argument.
In the following example the inputs are processed in blocks of 4 elements so
that the 128-bit intrinsic can be used:
auto addsub(simd<float, 32> x, simd<float, 32> y) {
    auto do_native = [](simd<float, 4> lhs, simd<float, 4> rhs) {
        return simd<float, 4>(
            _mm_addsub_ps(static_cast<__m128>(lhs), static_cast<__m128>(rhs)));
    };
    return simd_invoke<4>(do_native, x, y);
}
Being able to specify a different block size is useful for two reasons:
- We may wish to process data in smaller blocks than the native size. For example, if the data needs to be upcast to a larger element type for the operation, then it can be useful to choose a smaller block size to begin with so that the upcast data exactly fills a native register. This can be more efficient than accepting native register-sized data and then upcasting it into several registers.
- If the element types of the arguments differ then their native sizes differ too (e.g., a native register holds four times as many uint8_t elements as float elements), so simd_invoke cannot deduce a block size automatically and one must be supplied explicitly, as sketched below.
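As a sketch of the second case, assuming AVX 256-bit registers; the Callable widen_add and the block size of 8 are illustrative choices rather than part of the proposal:
auto widen_sum(simd<std::uint8_t, 64> bytes, simd<float, 64> scale) {
    // The argument types have different native sizes (32 x uint8_t versus
    // 8 x float in a 256-bit register), so the block size cannot be deduced.
    auto widen_add = [](simd<std::uint8_t, 8> b, simd<float, 8> s) {
        return simd<float, 8>(b) + s;  // Upcast the bytes, then add.
    };
    // Explicitly process the inputs in blocks of 8 elements.
    return simd_invoke<8>(widen_add, bytes, scale);
}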
So far we have only considered what happens when the size of the SIMD value is an exact multiple of the block size. When it is not, simd_invoke invokes the Callable on complete register-sized pieces for as long as possible, and then on a final smaller piece holding the remaining tail elements. The Callable can be written generically so that it handles whatever block size it is given:
auto addsub(simd<float, 19> x, simd<float, 19> y) {
    // Invoke the most appropriate intrinsic for the given simd types.
    auto do_native = []<typename T, typename ABI>(basic_simd<T, ABI> lhs,
                                                  basic_simd<T, ABI> rhs) {
        constexpr auto size = basic_simd<T, ABI>::size;
        if constexpr (size <= 4)
            return simd<float, size>(
                _mm_addsub_ps(static_cast<__m128>(lhs), static_cast<__m128>(rhs)));
        else
            return simd<float, size>(
                _mm256_addsub_ps(static_cast<__m256>(lhs), static_cast<__m256>(rhs)));
    };
    return simd_invoke(do_native, x, y);
}
In this example the local lambda function can accept SIMD inputs of any size,
and will choose the most appropriate intrinsic to use. For example, given the
block of 3 tail elements the lambda will convert those 3 elements to an __m128
and call the 128-bit _mm_addsub_ps intrinsic, while the two complete 8-element
blocks use the 256-bit _mm256_addsub_ps intrinsic. The generated code is
compact and efficient:
vmovups   ymm0, ymmword ptr [rsi]
vmovups   ymm1, ymmword ptr [rsi+32]
vmovups   xmm2, xmmword ptr [rsi+64]
mov       rax, rdi
vaddsubps ymm0, ymm0, ymmword ptr [rdx]
vaddsubps ymm1, ymm1, ymmword ptr [rdx+32]
vaddsubps xmm2, xmm2, xmmword ptr [rdx+64]
vmovups   ymmword ptr [rdi], ymm0
vmovups   ymmword ptr [rdi+32], ymm1
vmovups   xmmword ptr [rdi+64], xmm2
Notice how the load, addsub and store instructions work respectively in two 256-bit ymm registers for the first 16 elements, and in a single 128-bit xmm register for the remaining tail elements.
Having now described the basic operation of simd_invoke, we will look at some aspects of its behaviour in more detail.
3.1. Using indexed Callable invocations
When a large SIMD value is broken down into pieces to invoke the Callable function, the size of each piece can be obtained from the Callable function invocation’s parameter type, but it can also be useful to pass in the index of that piece too. For example, suppose a function is invoking an intrinsic to perform a special type of memory store operation. Each register-sized sub-piece of the incoming SIMD value needs to know its offset so that it can be written to the correct pointer offset. The following example code illustrates how this could happen:
auto special_memory_store(simd<float, 32> x, float* ptr) {
    // Invoke the most appropriate intrinsic for the given simd types.
    auto do_native = [=]<typename T, typename ABI>(basic_simd<T, ABI> data, auto idx) {
        _mm256_special_store_ps(ptr + idx, static_cast<__m256>(data));
    };
    simd_invoke_indexed(do_native, x);
}
The simd_invoke_indexed variant behaves like simd_invoke, except that each invocation of the Callable receives an extra trailing argument giving the element offset at which its piece begins. In the example above that offset is used to compute the pointer position for each store.
Note in this example that the Callable does not return a value itself, nor does the call to simd_invoke_indexed; when the Callable returns void there are no per-block results to collect, and the overall call returns void too.
3.1.1. Design option - probing the index capabilities
Rather than requiring the user to invoke an indexed Callable using a separate simd_invoke_indexed function, simd_invoke could instead probe the Callable to determine whether it can accept a trailing index argument, and pass the index only when it does.
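A minimal sketch of how such probing might look inside the implementation, assuming the index is passed as a std::size_t and ignoring result collection; the helper name invoke_one_block is purely illustrative:
template<typename Fn, typename... Pieces>
void invoke_one_block(Fn&& fn, std::size_t offset, Pieces... pieces) {
    // If the Callable can also accept a trailing index, pass the block's
    // element offset; otherwise invoke it with the pieces alone.
    // (Result collection is elided for brevity.)
    if constexpr (std::invocable<Fn, Pieces..., std::size_t>)
        fn(pieces..., offset);
    else
        fn(pieces...);
}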
3.2. Considerations in using simd_invoke 
These are a set of considerations for using simd_invoke and simd_invoke_indexed:
All the arguments that are passed to the Callable function must be SIMD-like objects (i.e., basic_simd or basic_simd_mask).
When more than one SIMD-like object is used as an argument, all of the SIMD-like
objects must have the same size (i.e., you can’t call simd_invoke with a
simd<float, 8> and a simd<float, 16> in the same call).
For the native block size to be deduced, all the SIMD-like values must have the same native size. This is to ensure that when the SIMD-like arguments are broken into pieces they will always map to the same respective sizes. If the size cannot be deduced like this then the user must explicitly supply the block size.
The order in which the Callable function is invoked for each sub-block is unspecified.
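As an illustration, here is a hedged sketch of an order-independent use, assuming AVX 8-element native blocks and that the index arrives as a std::size_t element offset. Each block writes through its own offset, so the unspecified invocation order cannot affect the result; accumulating results with something like push_back, by contrast, would depend on an ordering that is not guaranteed.
void store_blocks(simd<float, 32> x, std::vector<float>& out) {
    out.resize(x.size());
    // Each block writes to a position derived from its own element offset,
    // so it does not matter in which order the blocks are processed.
    auto store = [&](simd<float, 8> block, std::size_t idx) {
        block.copy_to(out.begin() + idx);
    };
    simd_invoke_indexed(store, x);
}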
When the Callable returns a value, it must return a SIMD-like object. This is
because there is no way to take non-SIMD results from the Callable and merge
them together, except by using some other user-provided mechanism, which is
outside the scope of this utility.
When the Callable returns a SIMD-like object, it need not have the same size as its input arguments. For example, the Callable function could perform an operation like extracting the elements at multiples of some index, where the smaller results are then concatenated back together, as in the sketch below.
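Here is a minimal sketch of such a size-changing Callable, assuming AVX 8-element native blocks; the name keep_even and the use of the generator constructor are illustrative only:
auto keep_even(simd<float, 32> x) {
    // Each 8-element block is reduced to its 4 even-indexed elements, and
    // simd_invoke concatenates the 4-element results into a simd<float, 16>.
    auto even_elements = [](simd<float, 8> block) {
        return simd<float, 4>([&](auto i) { return block[i * 2]; });
    };
    return simd_invoke(even_elements, x);
}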
When the Callable returns a SIMD-like object, every invocation of the Callable must return a SIMD-like object with the same element type. This is to ensure that the results can be glued together.
4. Wording
4.1. Add new section [simd.simd_invoke]
simd_invoke function calling utility [simd.simd_invoke]

template<std::size_t BlockSize = 0, typename Fn, typename... Args>
constexpr auto simd_invoke(Fn&& fn, Args&&... args);

template<std::size_t BlockSize = 0, typename Fn, typename... Args>
constexpr auto simd_invoke_indexed(Fn&& fn, Args&&... args);

Constraints:
- sizeof...(Args) > 0.
- Each argument in Args... is either basic_simd or basic_simd_mask.
- For every argument in Args..., Args::size must be identical.
- BlockSize is either non-zero, or every argument in Args... must have the same native size.
- The return type of Fn is either void, basic_simd, or basic_simd_mask.

Effects:
- Let split-size be BlockSize if that is non-zero, or the native size otherwise.
- Invokes Fn with args split by split-size. The number of invocations is Args::size / split-size + (Args::size % split-size == 0 ? 0 : 1). The invocation order of Fn over the split args is unspecified.
- If Fn returns a non-void type, collects all the Fn invocation results and returns them as a single value.

Returns:
- If Fn returns a non-void type, concatenates every Fn call result using simd_concat and returns that value. Otherwise, returns void.