1. Motivation
ISO/IEC 19570:2018 (1) introduced data-parallel types to the C++ Extensions for
Parallelism TS. [P1928R8] is proposing to make that a part of C++ IS. Intel
supports the concept of a standard interface to SIMD instruction capabilities
and in document [P2638R0] we made a number of suggestions for extending the
capabilities of
to handle named permutation for common operations
like
,
and
. Intel then also proposed in [P2664R6] an even
more capable generic API for handling permutes using generated static indexes,
dynamic indexes, selection by mask, and memory-based gather/scatter operations.
In this paper we will revisit the original proposals from [P2638R0] and see
how they can be modified in the light of the generic API, and from feedback
received from the community.
2. Background
While
is undoubtedly most useful for expressing parallel operations
(e.g., arithmetic) across entire sets of input arguments, many real-world
algorithms require the data to be marshalled into new positions before
operating on it. For example, modifying the order of values, concatenating or
splitting data, or pulling out specific elements for some purpose. In the
discussions around
there have been of number of attempts to defined
some sort of permute function, and these will be summarised below.
Our aim in this paper is to accept that the generic permute API discussed in [P2664R6] does everything that we could possible want in a permute API, but
that it is useful to standardise a few specific and common types of permute so
that users of
aren’t required to keep reinventing them. In all of
the summaries of other proposals below we can learn from the them and provide something that
mimics the original intent with the new generic API.
2.1. Original std :: simd
proposal
In [P1928R0] there were only a few functions which could be used to
achieve any sort of element permutation across or within
values.
Those functions were relatively modest in scope and they built upon the ability
to split and concatenate simd values at compile-time. For example, a simd value
containing 5 elements could be split into two smaller pieces of sizes 2 and 3
respectively:
simd < float , 5 > x ; …const auto [ piece0 , piece1 ] = simd_split < 3 > ( x );
Those smaller pieces could be acted on using any of the
features, and
then later recombined using
, possibly in a different order or with
duplication:
The
,
and
functions can be used to achieve several permutation
effects, but they are unwieldy for anything complicated, and since they
are also compile-time they preclude any dynamic runtime permutation.
Since the original proposal a few changes have taken place. The original
function shown above, which could accept an arbitrary set of indexes to perform
the split, was removed and the original
function was renamed to
. The new variant
now breaks a
value into a set of
identically sized pieces, and a remainder. The concatenation function remains in
its original state.
As a small aside, the
operation is equivalent to what is called chunking
in the views/ranges library, and
does something different. It
may be better to call simd’s function
instead to avoid confusion.
One of the purposes of the original
and
functions was to allow
a large
to be broken into pieces which had a native register size so
that builtin operations (intrinsics) could be called on individual pieces, and
the results glued back together. Such manual permutation for these purposes has
been superseded by [P2929R0] which proposes a function called
which automates this task.
2.2. Suggestion to include selected named permute functions
Two suggestions were made in [P0918R1]: add an arbitrary compile-time shuffle function, and add a specific type of permutation called an interleave. The arbitrary shuffle has been replaced by a generated-permute function for reasons explained in [P2664R6] and will not be considered further here.
The other proposal was to add a function called interleave which accepts two
SIMD values and interleaves respective elements into a new SIMD value which is
twice the size. For example, given inputs
and
,
the output would be
. Interleaving is a common
operation and is often directly supported by processors with explicit
instructions (e.g., Intel’s
), so as in [P0918R1] and also in
other SIMD libraries there will be named functions which expose that
instruction. This is a relatively common operations in certain workloads (e.g.,
signal processing) and so it was called out as a specific function to allow
programmers from having to keep defining their own versions of what is a
basic utility.
2.3. insert
, extract
and resize
In [P2638R0] Intel suggested three new functions:
-
- Place a copy of the elements from oneinsert
within anotherstd :: simd
, replacing the original elements at the given insertion position.std :: simd -
- Create a smallerextract
value from a sub-set of the elements of anotherstd :: simd
value.std :: simd -
- Grow or shrink aresize
to a new number of elements.std :: simd
These all operate purely statically, using fixed sizes and positions. There was
a discussion of whether they should work dynamically too (e.g., insert data into
some run-time specified position of another
) but there was no consensus to
pursue this.
There was feedback that the names should have more thought. For example,
doesn’t make the
value into which the insertion is taking
place any larger, but instead replaces elements. Also there was some concern
that
might silently do something unexpected, and that having functions
which explicitly truncate (and lose) elements to shrink a
, or explicitly
grow a
by adding value-initialised elements to reach the new size might be
clearer. However, against this is that other containers in C++ (e.g., vector)
have
methods which ambiguously (and dynamically too) grow or shrink a
container without warning.
In Intel’s experience of using these functions in some of our software products
we have found that no-one is using
, but
and
are both found in real code.
Those functions can be implemented using the functions proposed in [P2664R6]. For example:
template < std :: size_t _NewSize , typename _Tp , typename _Abi > constexpr auto resize ( const basic_simd < _Tp , _Abi >& v ) { constexpr int _OldSize = basic_simd < _Tp , _Abi >:: size ; return permute < _NewSize > ( v , []( std :: size_t i ) { return ( i < _OldSize ) ? i : simd_zero_element ; }); } template < std :: size_t _Begin , std :: size_t _End , typename _Tp , typename _Abi > constexpr auto extract ( const simd_impl < _Tp , _OldSize >& v ) { return permute < _End - _Begin > ( v , []( std :: size_t i ) { return i + _Begin ; }); }
2.4. Named permute functions in other data parallel libraries
Other data parallel libraries provide functions for common permutation
operations though often it will be to expose selected native instructions. For
example, [Highway] provides
,
,
,
, and so on, which map efficiently to Intel instructions. In
this case the choice of which functions to expose will guide the user to use
permutes which are known to be efficient on a particular target, but this may
come at the expense of portability.
2.5. Powerful generic std :: simd
API
In [P2664R6] Intel proposed a comprehensive permutation API which covered the most common requirements of most users:
-
Static permute generator - using a callable generator (e.g., lambda) to generate a static permutation. This can allow the compiler to perform arbitrary permutes in what are often very sophisticated and efficient ways.
-
Dynamic permute using another
value as an index.simd -
Mask-based permute where elements corresponding to the active elements of a
are compressed or expanded, and inactive elements discarded.simd -
Memory-based permute where values are gathered from or scattered to a specific memory region (i.e., safely, with no access permitted outside the specificed region).
The subject of naming came into the discussions and at time of writing this hasn’t been completely
resolved (e.g., should there be a
prefix or a namespace?). Fundamentally there is nothing
wrong with the set of permute functions or their behaviours.
One problem with the generic API though is that although it can do anything
required, there are inevitably certain operations which are so common that every
user of
will write their own version of it. We received feedback
that we should explore the provision of a set of common functions of this type,
which brings us to the purpose of this paper.
3. Addition of common utility functions
In our experience, the most useful of all the generic permute functions proposed in [P2664R6] is the generated permute. Here is an example to refresh the reader:
const auto dupEvenIndex = []( auto idx ) { return ( idx / 2 ) * 2 ; }; const auto de = permute ( values , dupEvenIndex );
For every index position in the output value the generator lambda function is called with that index to determine what the source index should be for that element. In this particular example the generator is duplicating every even value.
The generator function can take both an index parameter and also a size parameter. This gives the ability to write a reversing permute generator:
auto rev = permute ( s , []( auto idx , auto size ) { return ( size - 1 ) - idx ; });
Finally, the output size of the generated simd may also be specified:
auto repeat_simd ( simd < float , 3 > s ) { return permute < 7 > ( s , []( auto idx , auto size ) { return idx % size ; }); }
In this example a simd value such as
will be repeated to fill 7
elements, forming the value
.
If the onus is placed on the user of
to provide the generator
functions for all types of permutes then they will inevitably duplicate
common permute generators at any level from a single translation unit, in
multiple files across a whole project, and even across unrelated projects.
Within a file or project the user can obviously create a single permute
generator which can be reused, but we don’t want entirely different projects to
have to keep duplicating common functions either.
Another issue with requiring the user to provide their own generators is that it
is very easy to write something quite obscure. For example,
does not immediately tell the reader what is happening,
but
is much clearer. With discipline the user can
make the code clear with appropriate naming, but it adds to the workload of the
user to get this right, and it would be simpler to use a standardised function
provided by
itself.
One final issue is that inevitably bugs will creep in to generator functions, and providing a set of known-working standard permute generators will improve development time by reducing debug time.
An initial step (and indeed, one that was suggested at the C++ committee meeting) is to propose a set of generator functions that can be dropped in to a generated permute. This feels initially promising, such as creating a lambda for reverse:
namespace simd_perm { constexpr auto reverse = []( auto idx , auto size ) { return ( size - 1 ) - idx ; }; } auto rev = permute ( values , simd_perm :: reverse );
Note that we have assumed that the function is put into some suitable namespace. We shall return to the question of naming these functions shortly.
Unfortunately following the idea of a named permute generator as above quickly leads to a problem; permutation often (even mostly) requires a change in size as well as a change in index order. For example, suppose we build a generator which reads values which are multiples of some constant - a strided read. The generator can be defined and used as below. Note that we aren’t using a lambda for this example because we wish to be able to template parameterise the stride distance.
template < int N > struct stride { constexpr auto operator ()( auto idx ) const { return idx * N ; }; }; auto everyThird = permute < 3 > ( s , stride < 3 > {});;
In this example we have to provide how many output values we require, as well as the generator. If we don’t provide the correct number then the permutation will fail as it tries to read invalid indexes. Therefore, for generic common permutation it isn’t sufficient to provide named permute generators, we may also need to provide wrappers to make them useful.
To complete this particular example we would want a
permute function for
which was implemented as something like this:
namespace __details { template < int N > struct stride_helper { constexpr auto operator ()( auto idx ) { return idx * N ; }; }; } namespace simd_permute { template < int N , typename _Simd > constexpr auto stride ( const _Simd & s ) { return permute < _Simd :: size / N > ( s , __details :: stride_helper < N > {}); } } auto everyThird = simd_permute :: stride < 3 > ( s );
We are faced with some small dilemma with how this relates to existing functions
in the C++ libraries. There is already a range adaptor object
which has the same behaviour
(i.e., pick out every N’th value of some input), but which works on a dynamic
range. Our
-specific version of stride differs in having to be called with a
of a static size, return a
of a different static size, and provide a
template parameter rather than a variable parameter. This does change the
function’s signature and call-site use. However, the advantage of reusing an
existing name is that it is immediately clear what the intent and behaviour are
supposed to be. We are not trying to invent a new name for an existing operation
as that would require the user to learn an alternate name for every operations
which already exists.
If we used the same name, but placed it inside a suitable namespace, then we will have to rely on either namespace qualifiers (e.g., simd_permutes::stride) to identify the correct function to call, or overloading:
using namespace simd_permutes ; auto r0 = stride < 2 > ( s ); // Call the stride operation on a simd. auto r1 = stride ( c , 2 ); // Call ranges::stride. This might even convert a simd into a range
For now we will assume that the
version of such a function will be named to
match the existing C++ name, but use overloading to differentiate it. Note that
we could make the function signature more similar to what exists already if we
used a
as a parameter, instead of using a template
parameter, but this leads to verbose call sites. We discuss this option further
later in this proposal, but for now use a template parameter to give a concise
call site.
In the existing proposal we have a mixture of naming conventions. For example:
permute compress expand simd_select simd_cat simd_split
Clearly we need to agree on a common naming convention. Most recently [P1928R8] is using
prefix on functions, but it may be more
appropriate to provide namespaces. This topic will be discussed further in a
separate paper. Until that is resolved, this proposal’s main intent is to
discuss the ideas of creating common functions, with their naming to be decided
later.
3.1. Named permute functions overview
Next we can consider what functions should be provided as part of the standard
permute package. We should provide functions which are sufficiently common
and obvious that they will get wide-spread use. However, we should also avoid
the temptation of providing functions which map onto specific native hardware
implementations, unless the functions are truly useful on all targets. It isn’t
appropriate to expose functions which force a degree of target-specifity in a
portable library. The list of suggested functions below takes as its inspiration
suggestions from [P2664R6], [P2638R0], existing names from
, and the names of generator functions we use in Intel’s
software products. Note that all of these functions will apply to either
or
, since they map onto the underlying
functions which
also work on either of those types.
Function | Behaviour |
---|---|
| Return a new containing the first N values. Replaces where the output must be smaller (i.e., a shrink). Can trap incorrect resizing (e.g., to something bigger).
|
| Return a new which is bigger than the original value by new elements with the given value, which defaults to value initialised. Replaces where the new size is bigger than the original.
|
| Return a new which contains every N’th element of the input.
|
| Existing renamed/moved.
|
| Return a new of the same size with its elements in reverse order.
|
| Create a new which is N copies of the original (e.g., is )
|
| Create a new in which every element repeats N times (e.g., is )
|
| Treat the as a row-major matrix with the given number of rows and columns and returns its transpose (i.e., in column major)
|
| (aka interleave?) Generate a new which interleaves or zips corresponding element together. returns . Useful for working with matrix-like values (e.g., signal processing).
|
| (aka deinterleave?) Generate a new which deinterleaves the input. gives . Useful for working with matrix-like values (e.g., signal processing).
|
| Renamed version of . Should it be called ? Should it take a tuple of simd values instead?
|
| Extract a which contains the elements in the range [N..M). Equivalent to .
|
| Perform a left rotation on the elements of the such that the element at the MIDDLE index becomes the first element. This matches the behaviour of but not / which also allows negative rotations or rotations larger than the container (i.e., mod container size)
|
/
| return a of the same size as the input but with the elements shifted left or right and value-initialised elements inserted. These would behave like . The disadvantage of these is that shifting a mask using this function would be the opposite of what might be expected. A is similar to a plain set of bits, and shifting a set of bits left would move them towards the MSB. In contrast, applying to a would move the bits towards the LSB. Maybe this shouldn’t matter, because a is not an integer which can be shifted, but conceptually it might be confusing. Also, should value initialised elements be shifted in, or should the new values be left valid but undefined?
|
| Given two objects perform an extraction of a from . Used to implement unaligned operations on targets where aligned operations are either inefficient or non-existant.
|
3.2. Design option - functions or function objects
Although the paper has described a collection of "functions" to perform named permutes, we could make these all function objects which then gives us some more freedom in how they can be used. For example, in Intel’s example implementation of
we make them function objects which derive from a named permute base-class, which then allows us to compose them into pipelines of permute transformations:
auto demo1 ( const simd < float >& x ) { return x | rotate < 5 > | reverse ; } auto demo2 ( const simd < float , 16 >& x ) { return x | transpose < 4 > ; } auto even ( const simd < float >& x ) { return x | stride < 2 > ; } auto odd ( const simd < float >& x ) { return x | drop < 1 > | stride < 2 > ; } auto dupEven ( const simd < float >& x ) { return x | stride < 2 > | repeat_each < 2 > ; } auto dupOdd ( const simd < float >& x ) { return x | drop < 1 > | stride < 2 > | repeat_each < 2 > ; }
We are not currently proposing that such pipelines should be formalised into the proposal, but we do ask for direction in deciding whether the suggested named permute functions should be functions or function objects.
3.3. Design option - constant parameters
There is a discrepancy between the signatures of function like
and
. Most
permute
functions operate on statically sized
values with known element. It is
therefore not possible to permute with dynamic sizing. Consequently the permute
functions take constant template parameters, rather than dynamic parameters as
with the
functions.
In order to more closely match the function signature from functions in
, we could modify the signature of
to be
more similar:
template < typename _T , typename _Abi , int _N > constexpr auto take ( const basic_simd < _T , _Abi >& s , std :: integral_constant < int , _N > ); auto first3 = take ( s , std :: integral_constant < int , 3 > {});
This gives a more familiar call site, but is rather verbose for the common case.
4. Implementation Experience
One of the drivers for this paper was the observation that in software which
used Intel’s example implementation of
, different developers were
inventing their own versions of common permutes, often in slightly different
ways. We wanted to guide different users to use a common language in the sorts
of permutes that they should be using, and to avoid unnecessary duplication of
effort in building what are sometimes quite sophisticated features from the
underlying permute functions. We do still encounter permute functions which are
unique in their behaviour, and don’t deserve to be pulled out into a common
library, and we have disregarded those functions in this proposal.
5. Conclusion
In this paper we have set out an argument for providing a set of common
permutation functions which have wide-spread use across different types of
software which will use
. There are still some open questions about
naming conventions, but our main goal with this paper is to set out a proposed
direction for what sorts of functions should be commonly provide, and to get
feedback from the wider community.