1. Motivation
ISO/IEC 19570:2018 introduced data-parallel types to the C++ Extensions for
Parallelism TS [P1928R3]. That paper, and several ancillary papers, do an
excellent job of setting out the main features of an extension to C++ which
allows generic data-parallel programming on arbitrary targets. However, it is
inevitable that the programmer will want to make some use of target-specific
intrinsics in order to unlock some of the more unusual features of those
specific platforms. This requires that the programmer is able to pass a SIMD
value to a call to a target intrinsic, and that the result of the intrinsic
call can be used to generate a new SIMD value. This is already permitted in
std::simd for SIMD values which fit into a native register, but it is harder
to achieve when the SIMD value spans multiple registers. In this paper we
propose a function called simd_invoke which makes it easy to repeatedly apply
a target-specific intrinsic to native-sized pieces of large SIMD value
arguments, and to marshal their individual results back into a SIMD result.
2. Background
Although std::simd has been carefully crafted to include APIs which access all
of the common or desirable features of SIMD instruction sets, it is inevitable
that the user will sometimes want to take advantage of instructions which are
specific to a particular platform. For example, a DSP target may have a special
type of accumulator instruction, or an algorithm-specific instruction (e.g., AES
crypto). Clearly, calling these intrinsics results in non-portable code, but the
increase in hardware-accelerated performance on a given target could be a
worthwhile trade-off.

std::simd already includes some provision for interacting with target-specific
data types by providing the following ([P1928R4] 28.9.6.1-4):
constexpr explicit operator implementation-defined() const;
constexpr explicit basic_simd(const implementation-defined& init);
These allow a SIMD value to be converted into a target-specific type which can
be used to call a compiler intrinsic. The result is then converted back into
the equivalent SIMD type:
simd<float> addsub(simd<float> a, simd<float> b) {
    return static_cast<simd<float>>(_mm256_addsub_ps(static_cast<__m256>(a),
                                                     static_cast<__m256>(b)));
}
In this example the inputs, which are native register-sized SIMD values, are
explicitly converted into their target-specific types. Those target-specific
types are used to call the intrinsic, and the target-specific return value is
then converted back into the closest SIMD representation.
This example is straightforward, since the SIMD values are exactly the
correct native size. For targets which support several different register
sizes (e.g., some variants of AVX support 128-, 256- and 512-bit registers) the
code can use an if constexpr conditional to select which size to use:
simd<float> addsub(simd<float> a, simd<float> b) {
    if constexpr (simd<float>::size() == 4)
        return simd<float>(_mm_addsub_ps(static_cast<__m128>(a),
                                         static_cast<__m128>(b)));
    else if constexpr (simd<float>::size() == 8)
        return simd<float>(_mm256_addsub_ps(static_cast<__m256>(a),
                                            static_cast<__m256>(b)));
    else
        error(); // Invalid native register
}
Things become more tricky when dealing with SIMD values which are larger than
the native register. Such values cannot be converted into a type which can be
used to call an intrinsic. Instead, the big SIMD value must be broken down into
small pieces which can be used to call the intrinsic, and the results of those
calls glued back together. Here is one way to do that for a SIMD value which is
twice as big as a native register:
// Assumes AVX is in use, and that each native register is therefore 8 x float.
simd<float, 16> addsub(simd<float, 16> a, simd<float, 16> b) {
    // Get register-sized pieces.
    auto [lowA, highA] = simd_split<simd<float>>(a);
    auto [lowB, highB] = simd_split<simd<float>>(b);

    // Call the intrinsic on each pair of pieces.
    auto resultLow = simd<float>(_mm256_addsub_ps(static_cast<__m256>(lowA),
                                                  static_cast<__m256>(lowB)));
    auto resultHigh = simd<float>(_mm256_addsub_ps(static_cast<__m256>(highA),
                                                   static_cast<__m256>(highB)));

    // Glue the individual results back together.
    return simd_concat(resultLow, resultHigh);
}
This is now getting verbose, and it only handles SIMD value inputs which are
twice the size of a register value. To use the intrinsic with larger SIMD
values, or SIMD values which don't break down into native register-sized
pieces, more work is needed (e.g., if simd<float, 20> was used then the pieces
would be of sizes 8, 8, and 4 respectively, and each would require a suitable
call to an intrinsic of the appropriate size).
The boilerplate code needed to handle this is technically straightforward, but
verbose. A completely generic solution which could handle arbitrary SIMD value
sizes would also require additional mechanisms such as immediately invoked
lambdas or index sequences. Rather than requiring every user of std::simd to
write their own intrinsic call handlers, we can abstract the general mechanism
into something that is easily reused. In particular we want to break a set of
SIMD value arguments into smaller pieces, call an intrinsic on each, and then
glue the results of those intrinsic calls back together. We achieve this
through a proposed function called simd_invoke, which we will describe in the
remainder of this paper.
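For comparison, the kind of hand-rolled boilerplate that simd_invoke is
intended to replace might look something like the following sketch, using an
immediately invoked lambda and an index sequence. The invoke_in_pieces name is
hypothetical, and the sketch assumes the arguments split evenly into
native-sized pieces, so that simd_split returns a tuple-like sequence of
equally sized values:

template<typename Fn, typename T, typename Abi>
auto invoke_in_pieces(Fn&& fn, basic_simd<T, Abi> a, basic_simd<T, Abi> b) {
    // Break each argument into native register-sized pieces.
    auto piecesA = simd_split<simd<T>>(a);
    auto piecesB = simd_split<simd<T>>(b);
    constexpr auto numPieces = std::tuple_size_v<decltype(piecesA)>;
    return [&]<std::size_t... Is>(std::index_sequence<Is...>) {
        // Apply fn to each pair of pieces and glue the results back together.
        return simd_concat(fn(std::get<Is>(piecesA), std::get<Is>(piecesB))...);
    }(std::make_index_sequence<numPieces>{});
}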
3. simd_invoke description
The simd_invoke function is rather like std::invoke in that it takes a
callable object and a set of arguments. Its basic signature is as follows:
template<typename Fn, typename... Args>
auto simd_invoke(Fn&& fn, Args&&...);
The Fn parameter should be a callable that accepts arguments which can be
used to invoke an intrinsic on a native register. For example, to continue our
example from above, we could create a utility function which calls the
_mm256_addsub_ps intrinsic from native-sized SIMD values:
inline auto native_addsub(simd<float> lhs, simd<float> rhs) {
    auto nativeLhs = static_cast<__m256>(lhs);
    auto nativeRhs = static_cast<__m256>(rhs);
    return simd<float>(_mm256_addsub_ps(nativeLhs, nativeRhs));
}
Given this wrapper for native-sized intrinsic calls, we can now use
simd_invoke to break down a large SIMD value into individual register-sized
calls:
auto addsub(simd<float, 32> x, simd<float, 32> y) {
    return simd_invoke(native_addsub, x, y);
}
The simd_invoke function accepts any number of arguments and will break each
one down into native-sized pieces which are used to invoke the supplied
function, and then glue the results back together again on completion.
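Because simd_invoke accepts any number of SIMD arguments, the same pattern
extends to intrinsics with more operands. As a sketch (assuming a target with
FMA support; the fused_multiply_add name is hypothetical), a three-argument
fused multiply-add might look like:

auto fused_multiply_add(simd<float, 32> a, simd<float, 32> b,
                        simd<float, 32> c) {
    // Each invocation receives one native-sized piece of all three arguments.
    auto do_native = [](simd<float> x, simd<float> y, simd<float> z) {
        return simd<float>(_mm256_fmadd_ps(static_cast<__m256>(x),
                                           static_cast<__m256>(y),
                                           static_cast<__m256>(z)));
    };
    return simd_invoke(do_native, a, b, c);
}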
Comparing the original addsub example from § 2 Background with the
simd_invoke-based version, the explicit splitting, per-piece intrinsic calls,
and concatenation all collapse into a single call.
By default the function will use the native size for the element type, with the
aim of calling the intrinsic with the largest permitted builtin type. However,
simd_invoke can also be explicitly given the size of block to use. For
example, the following code breaks the values down into pieces of size 4
instead (and also supplies the callable as a local lambda, to make the function
self-contained):
auto addsub(simd<float, 32> x, simd<float, 32> y) {
    auto do_native = [](simd<float, 4> lhs, simd<float, 4> rhs) {
        return simd<float, 4>(_mm_addsub_ps(static_cast<__m128>(lhs),
                                            static_cast<__m128>(rhs)));
    };
    return simd_invoke<4>(do_native, x, y);
}
Being able to define a different size is useful for two reasons:

- We may wish to process data in smaller input sizes than the native size. For
example, if the data needs to be upcast to a larger element size for the
operation, then it can be useful to choose a smaller block size to begin with
so that the upcast data fills a native register. This could be more efficient
than accepting native register-sized data and then upcasting to several
registers.

- If the element types of the arguments are different then the simd_invoke
function cannot determine the appropriate native size for itself. For example,
the first argument might be a SIMD value containing float elements, and the
second argument have uint8_t elements. The number of native elements in each
argument cannot be inferred, and simd_invoke must be told how many elements to
use in each block (see the sketch after this list).
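To make the second point concrete, here is a small sketch (the scale function
name is hypothetical) in which the element types differ and a block size of 8
must be supplied explicitly:

// The arguments have different element types (float and uint8_t), so
// simd_invoke cannot deduce a common native block size; we supply 8 here.
auto scale(simd<float, 32> x, simd<std::uint8_t, 32> s) {
    auto do_native = [](simd<float, 8> px, simd<std::uint8_t, 8> ps) {
        return px * simd<float, 8>(ps);  // widen the bytes, then multiply
    };
    return simd_invoke<8>(do_native, x, s);
}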
So far we have only considered what happens when simd_invoke is given SIMD
value arguments which are multiples of some native register size, but as we saw
in our introduction, it might be useful to be able to call simd_invoke on SIMD
values of arbitrary size. For example, suppose that the add/sub is being called
on a SIMD value with 19 elements. In that case, on a target with a native size
of 8 elements, simd_invoke would need to break the calls down into pieces of
sizes 8, 8, and 3 respectively, and the callable would need to be able to
handle SIMD values of arbitrary size. The following example shows how this
might work:
auto addsub(simd<float, 19> x, simd<float, 19> y) {
    // Invoke the most appropriate intrinsic for the given simd types.
    auto do_native = []<typename T, typename ABI>(basic_simd<T, ABI> lhs,
                                                  basic_simd<T, ABI> rhs) {
        constexpr auto size = basic_simd<T, ABI>::size;
        if constexpr (size <= 4)
            return simd<float, size>(_mm_addsub_ps(static_cast<__m128>(lhs),
                                                   static_cast<__m128>(rhs)));
        else
            return simd<float, size>(_mm256_addsub_ps(static_cast<__m256>(lhs),
                                                      static_cast<__m256>(rhs)));
    };
    return simd_invoke(do_native, x, y);
}
In this example the local lambda function can accept SIMD inputs of any size,
and will choose the most appropriate intrinsic to use. For example, given the
block of 3 tail elements the lambda will convert the 3 elements to an __m128
register and call _mm_addsub_ps. This lambda function can then be called by
simd_invoke to enable a completely arbitrarily sized set of SIMD arguments to
be mapped onto their underlying intrinsics. For the example above, the
following code was generated:
vmovups   ymm0, ymmword ptr [rsi]
vmovups   ymm1, ymmword ptr [rsi+32]
vmovups   xmm2, xmmword ptr [rsi+64]
mov       rax, rdi
vaddsubps ymm0, ymm0, ymmword ptr [rdx]
vaddsubps ymm1, ymm1, ymmword ptr [rdx+32]
vaddsubps xmm2, xmm2, xmmword ptr [rdx+64]
vmovups   ymmword ptr [rdi], ymm0
vmovups   ymmword ptr [rdi+32], ymm1
vmovups   xmmword ptr [rdi+64], xmm2
Notice how the load, addsub and store instructions work respectively in ymm,
ymm and xmm registers to cope with the different piece sizes.
Having now described the basic operation of simd_invoke we can consider some
of the rules for using it, and a useful extension to it which makes certain
scenarios easier to deal with.
3.1. Using indexed Callable invocations
When a large SIMD value is broken down into pieces to invoke the Callable
function, the size of each piece can be obtained from the Callable's parameter
type, but it can also be useful to pass in the index of that piece too. For
example, suppose a function is invoking an intrinsic to perform a special type
of memory store operation. Each register-sized sub-piece of the incoming SIMD
value needs to know its offset so that it can be written to the correct pointer
offset. The following example code illustrates how this could happen:
auto special_memory_store(simd<float, 32> x, float* ptr) {
    // Store each register-sized piece at its own offset using a
    // (hypothetical) special store intrinsic.
    auto do_native = [=]<typename T, typename ABI>(basic_simd<T, ABI> data,
                                                   auto idx) {
        _mm256_special_store_ps(ptr + idx, static_cast<__m256>(data));
    };
    simd_invoke_indexed(do_native, x);
}
The simd_invoke_indexed function is now used instead, and this expects that
the Callable will take an extra parameter giving the compile-time offset of the
first element of the sub-SIMD value (i.e., 0, 8, 16 and 24 in this example).

Note in this example that the Callable does not return a value itself, nor does
the call to simd_invoke_indexed.
3.1.1. Design option - probing the index capabilities
Rather than requiring the user to invoke an indexed callable using
simd_invoke_indexed, an alternative would be to allow simd_invoke to probe
the Callable to see if it accepts one extra index parameter, and if it does, to
pass in that index to each Callable invocation. The advantage of doing this is
to have a single function which can use the index or not depending on the
Callable being used. The disadvantage is that the mechanism of probing the
Callable may be fragile and lead to obscure errors.
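For illustration, a minimal sketch of how such probing might work inside the
implementation, assuming Piece is the type of each sub-block (the invoke_piece
helper name is hypothetical):

// Probe whether the Callable accepts an extra compile-time index argument,
// and pass the offset only when it does.
template<std::size_t Offset, typename Fn, typename Piece>
auto invoke_piece(Fn&& fn, Piece piece) {
    constexpr std::integral_constant<std::size_t, Offset> idx{};
    if constexpr (std::invocable<Fn&, Piece, decltype(idx)>)
        return fn(piece, idx);  // Indexed Callable.
    else
        return fn(piece);       // Plain Callable; no index passed.
}

Note that a fully generic Callable whose parameters are declared auto... would
match both forms, which is one example of the fragility described above.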
3.2. Considerations in using simd_invoke
These are a set of considerations for using simd_invoke. In the following,
SIMD-like will be taken to mean a type that is either a basic_simd or a
basic_simd_mask, and whose native size can be queried.
All the arguments that are passed to the callable function must be basic_simd
or basic_simd_mask values.
When more than one SIMD-like object is used as an argument, all of the
SIMD-like objects must have the same size (i.e., you can't call simd_invoke
with SIMD values which don't break down into pieces whose respective sizes are
the same).
For the native block size to be deduced, all the SIMD-like values must have the
same native size. This is to ensure that when the SIMD-like arguments are
broken into pieces they will always map to the same respective sizes. If the
size cannot be deduced like this then the user must explicitly supply the block
size.
The order of invoking the Callable function for each sub-block is unspecified.
When the Callable returns a value, it must return a SIMD-like object. This is
because there is no way to take non-SIMD results from the Callable and merge
them together, except by using simd_invoke_indexed.
When the Callable returns a SIMD-like object, it need not have the same size as
its input arguments. For example, the Callable function could perform an
operation like extracting the elements at multiples of some index, where the
smaller results have to be concatenated back together (see the sketch after
these considerations).
When the Callable returns a SIMD-like object, every invocation of the callable must return a SIMD-like object with the same element type. This is to ensure that the results can be glued together.
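As an illustration of a size-changing Callable, the following sketch (the
low_halves name is hypothetical) keeps only the low half of each block, so the
result has half as many elements as the input:

// Each 8-element piece contributes only its low 4 elements, so the
// concatenated result is a simd<float, 16>.
auto low_halves(simd<float, 32> x) {
    auto do_native = [](simd<float, 8> piece) {
        auto [low, high] = simd_split<simd<float, 4>>(piece);
        return low;  // 4 elements out for every 8 in.
    };
    return simd_invoke<8>(do_native, x);
}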
4. Wording
4.1. Add new section [simd.invoke]
simd_invoke function calling utility [simd.simd_invoke]

template<std::size_t BlockSize = 0, typename Fn, typename... Args>
constexpr auto simd_invoke(Fn&& fn, Args&&... args);

template<std::size_t BlockSize = 0, typename Fn, typename... Args>
constexpr auto simd_invoke_indexed(Fn&& fn, Args&&... args);

Constraints:

- sizeof...(Args) > 0.
- Each argument in Args... is either basic_simd or basic_simd_mask.
- For every argument in Args..., the value of Args::size must be identical.
- BlockSize is either non-zero, or every argument in Args... must have the
same native size.
- The Fn return type is either void, or basic_simd, or basic_simd_mask.

Effects:

- Let split-size be BlockSize if that is non-zero, or the native size
otherwise.
- Invokes Fn with args split by split-size. The number of invocations is
Args::size / split-size + (Args::size % split-size == 0 ? 0 : 1). The
invocation order of Fn over the split args is unspecified.
- If Fn returns a non-void type, collects all Fn invocation results and
returns them as a single value.

Returns: If Fn returns non-void, concatenates every Fn call result using
simd_concat and returns that value. Otherwise, returns void.