1. Motivation
ISO/IEC 19570:2018 (1) introduced data-parallel types to the C++ Extensions for
Parallelism TS. [P1928R8] is proposing to make that a part of C++ IS. Intel
supports the concept of a standard interface to SIMD instruction capabilities
and in document [P2638R0] we made a number of suggestions for extending the
capabilities of 
2. Background
While 
Our aim in this paper is to accept that the generic permute API discussed in [P2664R6] does everything that we could possible want in a permute API, but
that it is useful to standardise a few specific and common types of permute so
that users of 
2.1. Original std :: simd 
   In [P1928R0] there were only a few functions which could be used to
achieve any sort of element permutation across or within 
simd < float , 5 > x ; …const auto [ piece0 , piece1 ] = simd_split < 3 > ( x ); 
Those smaller pieces could be acted on using any of the 
The 
Since the original proposal a few changes have taken place. The original 
As a small aside, the 
One of the purposes of the original 
2.2. Suggestion to include selected named permute functions
Two suggestions were made in [P0918R1]: add an arbitrary compile-time shuffle function, and add a specific type of permutation called an interleave. The arbitrary shuffle has been replaced by a generated-permute function for reasons explained in [P2664R6] and will not be considered further here.
The other proposal was to add a function called interleave which accepts two
SIMD values and interleaves respective elements into a new SIMD value which is
twice the size. For example, given inputs 
2.3. insert extract resize 
   In [P2638R0] Intel suggested three new functions:
- 
     insert std :: simd std :: simd 
- 
     extract std :: simd std :: simd 
- 
     resize std :: simd 
These all operate purely statically, using fixed sizes and positions. There was
a discussion of whether they should work dynamically too (e.g., insert data into
some run-time specified position of another 
There was feedback that the names should have more thought. For example, 
In Intel’s experience of using these functions in some of our software products
we have found that no-one is using 
Those functions can be implemented using the functions proposed in [P2664R6]. For example:
template < std :: size_t _NewSize , typename _Tp , typename _Abi > constexpr auto resize ( const basic_simd < _Tp , _Abi >& v ) { constexpr int _OldSize = basic_simd < _Tp , _Abi >:: size ; return permute < _NewSize > ( v , []( std :: size_t i ) { return ( i < _OldSize ) ? i : simd_zero_element ; }); } template < std :: size_t _Begin , std :: size_t _End , typename _Tp , typename _Abi > constexpr auto extract ( const simd_impl < _Tp , _OldSize >& v ) { return permute < _End - _Begin > ( v , []( std :: size_t i ) { return i + _Begin ; }); } 
2.4. Named permute functions in other data parallel libraries
Other data parallel libraries provide functions for common permutation
operations though often it will be to expose selected native instructions. For
example, [Highway] provides 
2.5. Powerful generic std :: simd 
   In [P2664R6] Intel proposed a comprehensive permutation API which covered the most common requirements of most users:
- 
     Static permute generator - using a callable generator (e.g., lambda) to generate a static permutation. This can allow the compiler to perform arbitrary permutes in what are often very sophisticated and efficient ways. 
- 
     Dynamic permute using another simd 
- 
     Mask-based permute where elements corresponding to the active elements of a simd 
- 
     Memory-based permute where values are gathered from or scattered to a specific memory region (i.e., safely, with no access permitted outside the specificed region). 
The subject of naming came into the discussions and at time of writing this hasn’t been completely
resolved (e.g., should there be a 
One problem with the generic API though is that although it can do anything
required, there are inevitably certain operations which are so common that every
user of 
3. Addition of common utility functions
In our experience, the most useful of all the generic permute functions proposed in [P2664R6] is the generated permute. Here is an example to refresh the reader:
const auto dupEvenIndex = []( auto idx ) { return ( idx / 2 ) * 2 ; }; const auto de = permute ( values , dupEvenIndex ); 
For every index position in the output value the generator lambda function is called with that index to determine what the source index should be for that element. In this particular example the generator is duplicating every even value.
The generator function can take both an index parameter and also a size parameter. This gives the ability to write a reversing permute generator:
auto rev = permute ( s , []( auto idx , auto size ) { return ( size - 1 ) - idx ; }); 
Finally, the output size of the generated simd may also be specified:
auto repeat_simd ( simd < float , 3 > s ) { return permute < 7 > ( s , []( auto idx , auto size ) { return idx % size ; }); } 
In this example a simd value such as 
If the onus is placed on the user of 
Another issue with requiring the user to provide their own generators is that it
is very easy to write something quite obscure. For example, 
One final issue is that inevitably bugs will creep in to generator functions, and providing a set of known-working standard permute generators will improve development time by reducing debug time.
An initial step (and indeed, one that was suggested at the C++ committee meeting) is to propose a set of generator functions that can be dropped in to a generated permute. This feels initially promising, such as creating a lambda for reverse:
namespace simd_perm { constexpr auto reverse = []( auto idx , auto size ) { return ( size - 1 ) - idx ; }; } auto rev = permute ( values , simd_perm :: reverse ); 
Note that we have assumed that the function is put into some suitable namespace. We shall return to the question of naming these functions shortly.
Unfortunately following the idea of a named permute generator as above quickly leads to a problem; permutation often (even mostly) requires a change in size as well as a change in index order. For example, suppose we build a generator which reads values which are multiples of some constant - a strided read. The generator can be defined and used as below. Note that we aren’t using a lambda for this example because we wish to be able to template parameterise the stride distance.
template < int N > struct stride { constexpr auto operator ()( auto idx ) const { return idx * N ; }; }; auto everyThird = permute < 3 > ( s , stride < 3 > {});; 
In this example we have to provide how many output values we require, as well as the generator. If we don’t provide the correct number then the permutation will fail as it tries to read invalid indexes. Therefore, for generic common permutation it isn’t sufficient to provide named permute generators, we may also need to provide wrappers to make them useful.
To complete this particular example we would want a 
namespace __details { template < int N > struct stride_helper { constexpr auto operator ()( auto idx ) { return idx * N ; }; }; } namespace simd_permute { template < int N , typename _Simd > constexpr auto stride ( const _Simd & s ) { return permute < _Simd :: size / N > ( s , __details :: stride_helper < N > {}); } } auto everyThird = simd_permute :: stride < 3 > ( s ); 
We are faced with some small dilemma with how this relates to existing functions
in the C++ libraries. There is already a range adaptor object 
If we used the same name, but placed it inside a suitable namespace, then we will have to rely on either namespace qualifiers (e.g., simd_permutes::stride) to identify the correct function to call, or overloading:
using namespace simd_permutes ; auto r0 = stride < 2 > ( s ); // Call the stride operation on a simd. auto r1 = stride ( c , 2 ); // Call ranges::stride. This might even convert a simd into a range 
For now we will assume that the 
In the existing proposal we have a mixture of naming conventions. For example:
permute compress expand simd_select simd_cat simd_split 
Clearly we need to agree on a common naming convention. Most recently [P1928R8] is using   prefix on functions, but it may be more
appropriate to provide namespaces. This topic will be discussed further in a
separate paper. Until that is resolved, this proposal’s main intent is to
discuss the ideas of creating common functions, with their naming to be decided
later.
3.1. Named permute functions overview
Next we can consider what functions should be provided as part of the standard 
and obvious that they will get wide-spread use. However, we should also avoid
the temptation of providing functions which map onto specific native hardware
implementations, unless the functions are truly useful on all targets. It isn’t
appropriate to expose functions which force a degree of target-specifity in a
portable library. The list of suggested functions below takes as its inspiration
suggestions from [P2664R6], [P2638R0], existing names from 
| Function | Behaviour | 
|---|---|
|  | Return a new containing the first N values. Replaceswhere the output must be smaller (i.e., a shrink). Can trap incorrect resizing (e.g., to something bigger). | 
|  | Return a new which is bigger than the original value by new elements with the given value, which defaults to value initialised. Replaceswhere the new size is bigger than the original. | 
|  | Return a new which contains every N’th element of the input. | 
|  | Existing renamed/moved. | 
|  | Return a new of the same size with its elements in reverse order. | 
|  | Create a new which is N copies of the original (e.g.,is) | 
|  | Create a new in which every element repeats N times (e.g.,is) | 
|  | Treat the as a row-major matrix with the given number of rows and columns and returns its transpose (i.e., in column major) | 
|  | (aka interleave?) Generate a new which interleaves or zips corresponding element together.returns. Useful for working with matrix-like values (e.g., signal processing). | 
|  | (aka deinterleave?) Generate a new which deinterleaves the input.gives. Useful for working with matrix-like values (e.g., signal processing). | 
|  | Renamed version of . Should it be called? Should it take a tuple of simd values instead? | 
|  | Extract a which contains the elements in the range [N..M). Equivalent to. | 
|  | Perform a left rotation on the elements of the such that the element at the MIDDLE index becomes the first element. This matches the behaviour ofbut not/which also allows negative rotations or rotations larger than the container (i.e., mod container size) | 
| / | return a of the same size as the input but with the elements shifted left or right and value-initialised elements inserted. These would behave like. The disadvantage of these is that shifting a mask using this function would be the opposite of what might be expected. Ais similar to a plain set of bits, and shifting a set of bits left would move them towards the MSB. In contrast, applyingto awould move the bits towards the LSB. Maybe this shouldn’t matter, because ais not an integer which can be shifted, but conceptually it might be confusing. Also, should value initialised elements be shifted in, or should the new values be left valid but undefined? | 
|  | Given two objects perform an extraction of afrom. Used to implement unaligned operations on targets where aligned operations are either inefficient or non-existant. | 
3.2. Design option - functions or function objects
Although the paper has described a collection of "functions" to perform named permutes, we could  make these all function objects which then gives us some more freedom in how they can be used. For example, in Intel’s example implementation of 
auto demo1 ( const simd < float >& x ) { return x | rotate < 5 > | reverse ; } auto demo2 ( const simd < float , 16 >& x ) { return x | transpose < 4 > ; } auto even ( const simd < float >& x ) { return x | stride < 2 > ; } auto odd ( const simd < float >& x ) { return x | drop < 1 > | stride < 2 > ; } auto dupEven ( const simd < float >& x ) { return x | stride < 2 > | repeat_each < 2 > ; } auto dupOdd ( const simd < float >& x ) { return x | drop < 1 > | stride < 2 > | repeat_each < 2 > ; } 
We are not currently proposing that such pipelines should be formalised into the proposal, but we do ask for direction in deciding whether the suggested named permute functions should be functions or function objects.
3.3. Design option - constant parameters
There is a discrepancy between the signatures of function like 
In order to more closely match the function signature from functions in 
template < typename _T , typename _Abi , int _N > constexpr auto take ( const basic_simd < _T , _Abi >& s , std :: integral_constant < int , _N > ); auto first3 = take ( s , std :: integral_constant < int , 3 > {}); 
This gives a more familiar call site, but is rather verbose for the common case.
4. Implementation Experience
One of the drivers for this paper was the observation that in software which
used Intel’s example implementation of 
5. Conclusion
In this paper we have set out an argument for providing a set of common
permutation functions which have wide-spread use across different types of
software which will use