P2664R9
Extend std::simd with permutation API

Published Proposal,

This version:
http://wg21.link/P2664R9
Authors:
(Intel)
(Intel)
Audience:
LEWG
Project:
ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21

Abstract

Proposal to extend std::simd with features enabling permutation of data within and across SIMD values.

1. Motivation

ISO/IEC 19570:2018 introduced data-parallel types to the C++ Extensions for Parallelism TS. [P1915R0] asked for feedback on std::simd in Parallelism TS 2. [P1928R4] aims to make that a part of C++ IS. Intel supports the concept of a standard interface to SIMD instruction capabilities and has made further suggestions for other APIs and facilities for std::simd in document [P2638R0].

One of the extra features we suggested was the ability to be able to permute elements across or within SIMD values. These operations are critical for supporting formatting of data in preparation for other operations (e.g., transpose, strided operations, interleaving, and many more), but are relatively poorly supported in the current std::simd proposal. Many modern SIMD processors include sophisticated support for permutation instructions so in this document we shall propose a set of API functions which make those underlying instructions accessible in a generic way.

In this document we shall start with some background into the current std::simd proposal and extensions, describe the state of the art on other simd libraries, and then describe a proposal for some new API functions to add to std::simd.

2. Revision History

R8 => R9

R7 => R8

R6 => R7

R5 => R6

R4 => R5

R3 => R4

R2 => R3

R1 => R2

R0 => R1

3. Background

In ISO/IEC 19570:2018 there were only a few functions which could be used to achieve any sort of element permutation across or within simd<> values. Those functions were relatively modest in scope and they built upon the ability to split and concatenate simd values at compile-time. For example, a simd value containing 5 elements could be split into two smaller pieces of sizes 2 and 3 respectively:

simd<float, 5> x;const auto [piece0, piece1] = simd_split<3>(x);

Those smaller pieces could be acted on using any of the std::simd features, and then later recombined using simd_cat, possibly in a different order or with duplication:

const auto output = simd_cat(piece1, piece1, piece0);

The simd_split and simd_cat functions can be used to achieve several permutation effects, but they are unwieldy for anything complicated, and since they are also compile-time they preclude any dynamic runtime permutation.

In [P2638R0] Intel suggested extra operations to insert and extract SIMD values to and from other SIMD containers:

const auto t = extract<2, 5>(v); // Read out 3 SIMD values, starting from position 2
insert<5>(v, t); // Put the 3 previously read values back into the container at position 5 

These are convenient, but they still don’t allow arbitrary or expressive permute operations.

Two suggestions were made in [P0918R1]: add an arbitrary compile-time shuffle function, and add a specific type of permutation called an interleave. The compile-time shuffle can be used as follows:

const auto p = shuffle<3, 4, 1, 1, 2, 3>(x);
// p::value_type will be the same as x::value_type
// p::size will be 6

The indexes can be arbitrary, and therefore allow any duplication and reordering of elements. The output std::simd object will have the same number of elements as there are supplied indexes, and the type of the SIMD output will be that of the input. It is noted also in the shuffle API that this function can be applied to basic_simd_mask values too, and that shuffling across multiple SIMD values can be achieved by first concatenating the values into a single larger SIMD:

const auto p = shuffle<10, 9, 0, 1>(simd_cat(x, y));

In addition to the shuffle function, [P0918R1] also proposes a function called interleave which accepts two SIMD values and interleaves respective elements into a new SIMD value which is twice the size. For example, given inputs [a0 a1 a2] and [b0 b1 b2], the output would be [a0 b0 a1 b1 a2 b2]. Interleaving is a common operation and is often directly supported by processors with explicit instructions (e.g., Intel’s punpcklwd), so as in [P0918R1] and also in other simd libraries there will be named functions which expose that instruction. There will be other common permutation operations which also have specialist hardware support and it is common for them to be exposed in named functions also. For example, [Highway] provides DupOdd, DupEven, ReverseBlocks, Shuffle0231, and so on, which map efficiently to underlying instructions. While this can ensure that good code is generated for specific function, it does mean that:

Neither of these is desirable, and in our suggestions below we provide an alternative.

4. Extending std::simd to support permutations

There are four classes of permutation which could be supported by std::simd;

Modern processor instruction sets typically support all 4 of these classes of instruction. In the following sections we shall examine in more detail each of these classes and the potential APIs needed to expose those features to the user. We shall also examine what it means to permute memory.

std::simd could be modified to allow permutations either by adding to the base definition of std::simd itself (specifically, to include a new simd::operator[] or simd_mask::operator[]), or by introducing overloaded free functions which extend std::simd. We shall consider both options below.

Note that a basic_simd_mask is a special case of a type of SIMD and can therefore be permuted in the same way as a conventional SIMD. In our discussions below we assume that the permutations would apply equally to a basic_simd_mask unless we note otherwise.

4.1. Using compile-time computed indexes

The first proposal is to provide a permute function which accepts a compile-time function which is used to generate the indexes to use in the permutation. This is a very powerful concept: firstly, it allows the task of computing the indexes to be offloaded to the compiler using an index-generating function, and secondly it works on simd<> values of arbitrary size. It’s definition would be something like:

template<std::size_t SizeSelector = 0, typename T, typename Abi, typename PermuteGenerator>
constexpr resize_simd_t<output-size, basic_simd<T, Abi>>
permute(const basic_simd<T, Abi>& v, PermuteGenerator&& fn)

It can be used as in the following example:

const auto dupEvenIndex = [](size_t idx) -> size_t { return (idx / 2) * 2; };
const auto de = permute(values, dupEvenIndex);

Note that the permutation function maps from the index of each output element to the index which should be used as the source for that position. So, for example, the dupEvenIndex function would map the output indexes [0, 1, 2...) to the source indexes [0 0 2 2 4 4). This example permutation is common enough that it has hardware support in Intel processors, and the compiler can map the above function directly to the corresponding instruction:

vmovsldup ymm0, ymm0

By default, the permute function will generate as many elements in the output simd as there were input values, but the output size can be explicitly specified if it cannot be inferred or it needs to be modified. For example, the following permute generates 4 values with a stride of 3 (i.e., indexes [0, 3, 6, 9]).

const auto stride3 = permute<4>(values, [](size_t idx) -> size_t { return idx * 3; });

4.1.1. Special generator indexes

The obvious index representation to return from each generator invocation is an integral value in the range [0...size). The compiler can enforce at compile-time that the index is in the valid range and dynamic indexes would not be permitted. In addition, there are three other types of index value which can be returned which give the compiler even more flexibility.

Here is an example where zeros are inserted into the even-indexed output positions of a SIMD value while the odd-indexed values are copied directly from the input.

permute(v, [](auto idx) { (idx % 2) ? simd_zero_element : idx; });

4.1.2. Generator function size argument

It can be useful to create libraries of common permutation generator functions to handle common use cases, but such functions need to be able to know the extent of the indexes that they are allowed to permute. The generator function is allowed to accept an optional std::size_t argument giving the size of the SIMD value being permuted. For example, the following generator extracts elements of a SIMD starting at the half-way point:

auto topHalf = [](std::integral auto idx, std::size_t simd_size) {
  return (simd_size / 2) + idx;
};

4.1.3. Design option - more special constants

There are other possible special constants that could be given special status in the compile-time permute, including the ability to insert NaN, +ve infinity, -ve infinite, max_value, min_value, and so on. Adding these is left as a future exercise and will not be considered further in this paper.

4.1.4. Implementation experience

In other simd or vector libraries which provide named specific permute functions, the reason often given is that it allows specific hardware instructions to be used to efficiently implement those permutes. For example, Intel has the vmovsldup instruction as noted above, so some libraries provide functions in their API with names which reflect this (e.g., such a function might be called DupLow). The disadvantages are that it hinders portability, and only allows access to the hardware features that the authors of the library have chosen to expose.

Using the generator-based approach makes it easier to access a wide range of underlying special purpose instructions, or to fall back to equivalent instructions sequences where they are not available, but it is desirable that such behavior should be efficient. To judge whether the generator-based permutation API was efficient and useful we implemented the extensions in the Intel std::simd library and looked at the performance of different compile-time patterns and their generated instruction sequences. We found that GCC and LLVM were able to determine the correct hardware instructions to use for a variety of common permutation patterns, and in some cases the compiler was even able to use the side-effects of other instructions to effect permutations in ways that human programmers had overlooked. In all cases, where the pattern was too complicated to map to native hardware features the compiler fell back to using general purpose permutation patterns, such as Intel’s permutex2var family of instructions.

The generator function works by creating a list of compile-time constants which are used to perform the permutation. Because the list is compile-time it can be generated using completely arbitrary (and potentially very inefficient) code, but the outcome will always be a list of constants. Therefore the compiler’s ability to generate a permutation is dependent upon how well it can convert a list of constants into a permutation, not how well the programmer is at writing clear permute functions.

As a small illustration of the compiler’s ability, the following table shows some compile-time function permute calls and the code that the LLVM 14.0.0 compiler has generated for each:

Permute call Purpose Output from clang 14.0.0
permute (x, [](auto idx) { return idx & ~1; }); Duplicate even elements vmovsldup zmm0, zmm0
permute (x, [](auto idx) { return idx ^ 1; }); Swap even/odd elements in each pair (complex-valued IQ swap) vpermilps ymm0, ymm0, 177
permute<8>(x, [](auto idx) { return idx + 8; }); Extract upper half of a 16-element vector. Note that the instruction sequence accepts a zmm input and returns a ymm output. vextractf64x4 ymm0, zmm0, 1

In each case the compiler has accepted the compile-time index constants and converted them into an instruction which efficiently implements the desired permutation. We can safely leave the compiler to determine the best instruction from an arbitrary compile-time function.

4.2. Using another SIMD as the dynamic index

The second proposal for the permute API is to allow the required indexes to be passed in using a second SIMD value containing the run-time indexes:

template<typename T, typename AbiT, std::integral Idx, typename AbiIdx>
constexpr resize_simd_t<basic_simd<Idx, AbiIdx>::size, basic_simd<T, AbiT>>
permute(const basic_simd<T, AbiT>& v, const basic_simd<Idx, AbiIdx>& indexes)

This can be used as in the following example:

simd<unsigned, 8> indexes = getIndexes();
simd<float, 5> values = getValues();
...
const auto result = permute(values, indexes);
// result::value_type is float
// result::size is 8

The permute function will return a new SIMD value which has the same element type as the input value being permuted, and the same number of elements as the indexing SIMD (i.e., float and 8 in the example above). The permute may duplicate or arbitrarily reorder the values. The index values must be of integral type, with no implicit conversions from other types permitted. The behavior is undefined if any index is outside the valid index range. Dynamic permutes across multiple input values are handled by concatenating the input values together. The indexes in the selector will only be treated as indexes, and will never have special properties as described above (e.g., no zero element selection)

In addition to the function called permute, a subscript operator will also be provided:

template <std::integral Idx, typename AbiIdx>
constexpr simd<T, basic_simd<Idx, AbiIdx>::size()>
simd::operator[](const basic_simd<Idx, AbiIdx> &indexes) const;

Here is an example of how this could be used:

simd<int, 8> indexes = getIndexes();
simd<float, 5> values = getValues();
...
const auto result = values[indexes];
// result::value_type is float
// result::size is 8

4.2.1. Design option - C++23 multi-index subscript

Sometimes the subscripts to use in a permute might be known at compile time, but to use them in a permutation requires them to be put into an index simd first:

constexpr simd<int> indexes = {3, 6, 1, 6}; // Note - assumes initialiser list from P2876R0.
auto output = values[indexes];

In C++23 multi-subscript operator was introduced and this could be used to allow a more compact representation:

auto output = values[3, 6, 1, 6];
// output::value_type is values::value_type
// output::size is 4

It would be good if it could also appear on the left-hand-side;

output_simd[2, 5, 6] = x; // Overwrite selected elements with a single value.

Should non-constant indexes be allowed too? Our experience with GCC and LLVM suggests that work would be needed on their code generation for non-constant variable indexes to be improved though, as they are both currently poor at this.

4.2.2. Implementation experience

We implemented the runtime-indexed permute API in Intel’s example std::simd implementation. Small index operations (e.g., indexing one native sized simd by another) were mapped directly onto the underlying native instruction and so were as efficient as calling an intrinsic. More complicated cases for permutation, such as where either the index or data SIMD parameters were bigger than a native register are also handled with comparable efficiency to hand-written intrinsic code.

4.3. Permutation using simd masks (compress/expand)

A third variant of permutation is where a basic_simd_mask is used as a proxy for an index sequence instead. The presence or absence of an element in the basic_simd_mask will cause the element to be used or not. The following diagrams illustrate this:

…

On the left the values in the SIMD are compressed; only those elements which have their respective basic_simd_mask element set will be copied into the next available space in the output SIMD. Their relative order will be maintained, and the selected elements will be stored contiguously in the output. The output value will have the same number of elements as the input value, with any unused elements being left in a valid but unspecified state. The behavior of the function is similar to std::copy_if, although other simd libraries may call this function compression or compaction instead. The expansion function on the right of the diagram has the opposite effect, copying values from a contiguous region to potentially non-contiguous positions indicated by a mask bit. The two functions have prototypes as follows:

template<typename T, typename Abi>
constexpr basic_simd<T, Abi>
compress(const typename basic_simd<T, Abi>::mask_type& mask, const basic_simd<T, Abi>& value);

template<typename T, typename Abi>
constexpr basic_simd<T, Abi>
compress(const typename basic_simd<T, Abi>::mask_type& mask,
         const basic_simd<T, Abi>& value, T fill_value);

template<typename T, typename Abi>
constexpr basic_simd<T, Abi>
expand(const typename basic_simd<T, Abi>::mask_type& mask,
       const basic_simd<T, Abi>& value, const simd<T, Abi>& original);

In both functions all basic_simd or basic_simd_mask values must have the same value_type and size.

A compression operation is typically used to throw away unwanted values (i.e., values which do not satisfy the mask condition). Unwanted positions in the output will be left unspecified, unless the user supplies a third parameter specifying the value which should be used to fill those unwanted positions.

An expansion operation is typically used to overwrite selected elements of some existing SIMD value with new values taken from the contiguous beginning of another SIMD value. The expansion operation therefore has a slightly different meaning for its fill value.

4.3.1. Unused value element initialisation

In the design shown above the compress function comes in two variants, the first of which leaves unused values in the output in an unitialised but valid state, and the second function provides a default value to use for the initialisation. Is is worth looking at this in more detail to understand why there isn’t a single function which always initializes its unused elements. Consider a compression operation:

simd<int, 6> values = ...;
simd_mask<int, 6> mask = ...;

auto compressed = compress(values, mask);
// compressed::value_type is int
// compressed::size is 6

The output value has the same size as the mask selector, but the actual mask bits won’t be known until run-time. This raises the question of what values should be put into the unused output elements.

The current proposal would leave the unused elements in a valid but unspecified state. This is comparable to functions like std::shift_left and std::shift_right. The alternative is for the compress function to leave the unused values in a value-initialized state.

The advantage of an unspecified state is that the code doesn’t need to make a special effort to insert values which might not be used anyway, but has the disadvantage that the unspecified state might catch out the unwary user. Using a value initialized state helps the unwary user by providing a default state which is sane and repeatable, but may come with a performance cost.

Intel have an example implementation of std::simd, and our experience with that demonstrates that when native support is available (e.g., with AVX-512 ISA) it makes no difference to performance whether the unused values are initialised to a specific value or left uninitialized, since the instruction itself inserts the values into unused positions as an integrated part of its operation. However, a synthesized code sequence is more expensive when setting the elements to a specific value since it is necessary to also compute the population count of the mask, turn that into a mask and then arrange for the unused values to be overwritten with something else using that mask.

Given that the compress function is essentially only extracting valid values from the input SIMD and discarding the others, and that there is a potential performance penalty that comes from initialising unused values, then we think that the default should be to leave unused elements in unspecified states.

Related to this is what to do when a programmer does want to initialize unused values to a known value. With the proposed interface the programmer would be required to compute a new partial mask from their original selection mask (i.e., compute the population count of the original mask, and then turn that into a mask of the lower elements), and then use that to conditionally blend in their desired value. This can be inefficient, and for targets like Intel AVX-512, it doesn’t exploit the hardware’s ability to automatically insert a value anyway. For this reason we propose that an overload of compress is provided that accepts a third parameter giving a value to insert into unused elements:

template<typename T, typename Abi>
basic_simd<T, Abi>
compress(const basic_simd<T, Abi>& value,
         const typename basic_simd<T, Abi>::mask_type& mask,
         T default_value);

This makes the programmer’s intent explicit in the code, it simplifies the code, and it allows those targets which support automatic insertion of a value to efficiently make use of that feature.

4.3.2. Implementation experience

In Intel’s example implementation of std::simd we have implemented permute-by-mask. On modern processors which support AVX-512 or VBMI2 we found the mapping from compress to be efficient, and allowed a natural representation of this compaction operation where needed. Implementing permute-by-mask on older processors which lacked native support was a little harder, but could still be implemented as efficiently as an experienced programmer could achieve using hand-written intrinsics. Determining the most efficient implementation under all conditions showed that less experienced programmers would struggle to implement this feature. Therefore, making this feature available in std::simd itself removed any non-portable calls to native compression/expansion instructions, and also allowed the library implementer to use efficient sequences for different scenarios.

4.4. Memory-based permutes

When permuting values which are stored in memory, the operations are normally called gather (reading data from memory), or scatter (writing values to memory). We propose that four functions are provided which implement these two operations, each provided in masked and unmasked forms.

4.4.1. Range-based gather function

The gather function is defined as follows:

template<typename ReturnSimdType = /*deduce*/,
         typename Range, std::integral Idx, typename AbiIdx, typename... Flags>
    requires std::ranges::sized_range<Range> && std::ranges::contiguous_range<Range>
constexpr auto
partial_gather_from(const Range&& in, const simd<Idx, AbiIdx>& indexes,
                    simd_flags<Flags...> f = {});

template<typename ReturnSimdType = /*deduce*/,
         typename Range, std::integral Idx, typename AbiIdx, typename... Flags>
    requires std::ranges::sized_range<Range> && std::ranges::contiguous_range<Range>
constexpr auto
unchecked_gather_from(const Range&& in, const simd<Idx, AbiIdx>& indexes,
                      simd_flags<Flags...> f = {});

template<typename ReturnSimdType = /*deduce*/,
         typename Range, std::integral Idx, typename AbiIdx, typename... Flags>
constexpr auto
partial_gather_from(Range&& in, const typename simd<Idx, AbiIdx>::mask_type& mask,
                    const simd<Idx, AbiIdx>& indexes, simd_flags<Flags...> f = {});

template<typename ReturnSimdType = /*deduce*/,
         typename Range, std::integral Idx, typename AbiIdx, typename... Flags>
constexpr auto
unchecked_gather_from(Range&& in, const typename simd<Idx, AbiIdx>::mask_type& mask,
                      const simd<Idx, AbiIdx>& indexes, simd_flags<Flags...> f = {});

The gather operation loads values from within a constrained contiguous finite range. By default it returns a new simd whose type is deduced from the input parameters: it will have the same element type as the input range elements and the same number of elements as the supplied index simd. Alternatively, the user may provide an template parameter specifying the return type. That enables the user to perform a conversion as part of the gather operation in common with the load/store operations in [P1928R15].

If an index is supplied which is outside the valid input range then we have two options for how the implementation should behave:

The safe version ensures that no memory is accessed when it shouldn’t be, but at the cost of reduced performance. If the programmer knows that the indexes are always valid, they may prefer to elide the safety checks. This tradeoff also exists in the load/store functions already found in [P1928R15] so we follow the same pattern in dealing with it:

For both of these overloads a masked equivalent is also available. In the masked overload the mask’s parameter type matches that of the supplied indexes since it is effectively turning indexes on and off. Any values which are unmasked will be value initialized in the result.

Both overloads accept a flags parameter which will mirror any control over the memory operations provided by simd_unchecked_load, simd_partial_load, and the memory-loading constructors (e.g., cache bypass, prefetching).

The default partial_gather_from function could be used as follows:

int array[1024];
simd<unsigned> indexes = GetIndexes();
...
const auto r1 = partial_gather_from(array, indexes);

By specifying a return type parameter a conversion can be implemented as part of the gather:

int array[1024];
simd<unsigned> indexes = GetIndexes();
using OutType = simd<float, simd<unsigned>::size>;
...
const auto r1 = partial_gather_from<OutType>(array, indexes);

When the programmer is certain that all the indexes are valid they may use the unchecked gather:

int array[1024];
simd<unsigned> indexes = GetIndexes();
using OutType = simd<float, simd<unsigned>::size>;
...

OutType result = {};
if (all_of(indexes < 1024))
  // Only gather if every index is valid. No need for the gather itself to check
  // element-by-element.
  result = unchecked_gather_from<OutType>(array, indexes);

4.4.2. Range-based scatter function

The scatter function is defined as:

template<typename Range,
         typename T, typename TAbi,
         std::integral Idx, typename IdxAbi, typename... Flags>
constexpr void
partial_scatter_to(const simd<T, TAbi>& values, Range&& out,
                   const simd<Idx, IdxAbi>& indexes, simd_flags<Flags...> f = {});
template<typename Range,
         typename T, typename TAbi,
         std::integral Idx, typename IdxAbi, typename... Flags>
constexpr void
unchecked_scatter_to(const simd<T, TAbi>& values, Range&& out,
                     const simd<Idx, IdxAbi>& indexes, simd_flags<Flags...> f = {});

template<std::contiguous_iterator Iter,
         typename T, typename TAbi,
         std::integral Idx, typename IdxAbi, typename... Flags>
constexpr void
partial_scatter_to(const simd<T, TAbi>& values, Range&& out,
                   const typename simd<Idx, AbiIdx>::mask_type& mask,
                   const simd<Idx, IdxAbi>& indexes,
                   simd_flags<Flags...> f = {});

template<std::contiguous_iterator Iter,
         typename T, typename TAbi,
         std::integral Idx, typename IdxAbi, typename... Flags>
constexpr void
unchecked_scatter_to(const simd<T, TAbi>& values, Range&& out,
                     const typename simd<Idx, AbiIdx>::mask_type& mask,
                     const simd<Idx, IdxAbi>& indexes,
                     simd_flags<Flags...> f = {});

The scatter function will write each input element value to the respective indexed position of the output range. The output range must be contiguous and finite. For partial_scatter_to the function will not write to memory if the corresponding index is out-of-range. For unchecked_scatter_to all indexed locations will be written to, even if they are out-of-range. The unchecked variant may be faster, but it should be used more carefully.

In the masked overload, the mask parameter type has been chosen to match the index type to be consistent with the gather function. The values written to memory will be of the range’s value type, with a conversion from the input type taking place if they are different. The conversion to the range value type must be value preserving, unless simd_flags_convert is passed as an option.

The partial_scatter_to and unchecked_scatter_to functions should accept any flags similar to those found in simd_unchecked_store or simd_partial_store, allowing control over effects such memory caching, write-through, and so on.

The memory writes performed by partial_scatter_to and unchecked_scatter_to are unsequenced. The programmer cannot rely on the values being written to the output range in any defined order.

The scatter could be used as follows:

int array[1024];
simd<int> values = ...;
simd<int> indexes = ...;
simd<int>::mask_type mask = ...;
...
partial_scatter_to(values, array, mask, indexes);

4.4.3. Design alternative for gather/scatter naming

In this proposal we have used the names partial_gather_from and unchecked_gather_from, which mirror the names simd_partial_load and simd_unchecked_load from [P1928R15]. The unchecked version clearly does what its name suggests, but the partial variant might not be quite the right name. While it is indeed potentially only loading some elements (or storing some elements in scatter), so is indeed a partial operation, maybe an alternative name might be better. For example, checked_gather_from, implicit_masked_gather_from, or just plain gather_from.

4.4.4. A note on iterator-based gather and scatter

In early revisions of this paper there was a proposal to make the gather and scatter functions accept contiguous iterators to indicate where to gather and scatter to memory. Although these functions directly mirrored common hardware instructions, which also accept a pointer and a vector of indexes, the functions are potentially very dangerous. Given a pointer, and a large enough index element type, any part of the execution space can be read or written, with writes being particularly catastrophic if mishandled. In common with the direction being taken in [P1928R15] to make other memory-related operations range-safe, we have replaced the iterator-based gather and scatter operations with range-based operations.

If a programmer only has an iterator referencing the region from which to gather or scatter, then they must convert that iterator to a suitable range. For example, if there exists a pointer to a lookup table, and it is known that the lookup table can only contain N entries, then the programmer could implement that as follows:

const uint16_t* ptr = ...;
const simd<unsigned> indexes = ...;
...
auto v = unchecked_gather_from(std::span(ptr, N), indexes);

Using a raw iterator and converting it to a span or range should be only a stop-gap measure, and ideally the code should be refactored to allow the appropriate input range to be used directly. Therefore to encourage good use of ranges no iterator-based overloads for gather or scatter are provided, unlike [P1928R15].

4.4.5. Implementation experience

Intel’s example implementation of std::simd has been modified to include both iterator-based and range-based versions of gather_from and scatter_to, with and without runtime checking. The code has been evaluated for performance.

Our benchmarking showed that the addition of range checking, value-initialization, and write suppression for invalid indexes did not significantly degrade execution performance on modern machines running AVX-512. In fact, we found that the choice of index values (i.e., what happens if the indexes are in the same cache line, adjacent cache lines, consecutive, reversed, etc.) had a much greater impact on execution performance than whether the range check was present or not.

When rewriting older iterator-based code in terms of the range-based API, we could implement the code as follows:

const uint16_t* ptr = ...;
const simd<unsigned> indexes = ...;

// Original iterator-based API
auto v_old = gather_from(ptr, indexes);

// New API
auto v_new = unchecked_gather_from(std::span(ptr, N), indexes);

The generated code for these two uses is identical, so the performance levels of the original API can be achieved while using an API which encourages the programmer to think about the ranges that they are using for scatter and gather.

4.4.6. A note on proposing that gather and scatter should be member functions

At Varna 2023 there was a suggestion to make gather_from and scatter_to become member functions. Reasons for this included:

Since that meeting [P3299R0] (range-based constructors) has been incorporated into [P1928R15]. In particular simd_unchecked_load and simd_partial_load replace copy_from, and simd_unchecked_store and simd_partial_store replace copy_to. We have incorporated the sameb mechanisms into this proposal, and so the gather and scatter functions now have improved memory safety and conversion capabilities, and we will abandon any further exploration of gather/scatter member functions.

5. Summary

In this document we have described four styles of interface for handling permutes: compile-time generated, simd-indexed, mask-indexed, and memory based. In combination, these interfaces allow all types of common permute to be expressed in a way which is clean, concise, and efficient for the compiler to implement.

6. Wording

6.1. Add new section [simd.static_permute]

simd static permute [simd.static_permute]

constexpr static int simd_zero_element   = /* Implementation defined */;
constexpr static int simd_uninit_element = /* Implementation defined */;

Constraints:

  • These cannot take values which could be mistaken for valid index values, so they must be outside the range [0..max-impl-size).

Remarks:

  • Special constants can be returned by the permutation generator function to indicate special behavior of that element.

// Free function to generate compile-time permute of simd
template<std::size_t SizeSelector = 0, typename T, typename Abi, typename PermuteGenerator>
constexpr resize_simd_t<output-size, basic_simd<T, Abi>>
permute(const basic_simd<T, Abi>& v, PermuteGenerator&& fn)

// Free function to generate compile-time permute of simd_mask
template<std::size_t SizeSelector = 0, std::size_t Bytes, typename Abi,
         typename PermuteGenerator>
constexpr resize_simd_t<output-size, basic_simd_mask<Bytes, Abi>>
permute(const basic_simd_mask<Bytes, Abi>& v, PermuteGenerator&& fn)

Let:

  • output-size is v.size if SizeSelector is zero, otherwise SizeSelector.

Constraints:

  • std::integral<std::invoke_result<PermuteGenerator, std::size_t> || std::integral<std::invoke_result<PermuteGenerator, std::size_t, std::size_t> is true.

Returns:

  • A basic_simd or basic_simd_mask object of size output-size where the ith element is initialized as follows. For each output index [0..output-size) the result of the expression simd(gen(integral_constant<size_t, i>())), or simd(gen(integral_constant<size_t, i>(), integral_constant<size_t,output-size)) is computed. The result index x is used to select a value from v as follows:

    • A value in the range [0..size) will use v[x].

    • The value simd_zero_element will use T().

    • The value simd_uninit_element will use a valid but unspecified value of the compiler’s choosing.

    • Any other value is undefined.

Remarks:

  • The calls to the generator are unsequenced with respect to each other.

6.2. Add new section [simd.dynamic_permute]

simd dynamic permute [simd.dynamic_permute]
// Free function to permute simd by dynamic index
template<typename T, typename AbiT, std::integral Idx, typename AbiIdx>
constexpr resize_simd_t<basic_simd<Idx, AbiIdx>::size, basic_simd<T, AbiT>>
permute(const basic_simd<T, AbiT>& v, const basic_simd<Idx, AbiIdx>& indexes)

// Free function to permute simd mask by dynamic index
template<std::size_t Bytes, typename AbiT, std::integral Idx, typename AbiIdx>
constexpr resize_simd_t<basic_simd_mask<Idx, AbiIdx>::size, basic_simd_mask<Bytes, AbiT>>
permute(const basic_simd_mask<Bytes, AbiT>& v, const basic_simd<Idx, AbiIdx>& indexes)

// Member function to permute simd by dynamic index using subscript operator
template<std::integral Idx, typename AbiIdx>
constexpr resize_simd_t<basic_simd<Idx, AbiIdx>::size, basic_simd<T, AbiT>>
basic_simd::operator[](const basic_simd<Idx, AbiIdx>& indexes)

// Member function to permute a simd mask by dynamic index using subscript operator
template<std::integral Idx, typename AbiIdx>
constexpr resize_simd_t<basic_simd_mask<Idx, AbiIdx>::size, basic_simd_mask<Bytes, AbiT>>
basic_simd_mask::operator[](const basic_simd<Idx, AbiIdx>& indexes)

Preconditions:

  • All values in indexes must be in the range [0..size).

  • All index elements must be integral types.

Returns:

  • A basic_simd or basic_simd_mask object of size simd_size_v<Idx, AbiIdx> where the ith element is a copy of v[indexes[i]] if i is in the range [0..v.size) or undefined otherwise.

6.3. Add new section [simd.mask_permute]

simd mask permute [simd.mask_permute]
// Free function to compress simd values using a mask selector
(1) template<typename T, typename Abi>
constexpr basic_simd<T, Abi>
compress(const basic_simd<T, Abi>::mask_type& selector, const basic_simd<T, Abi>& v)

// Free function to compress simd_mask elements using a mask selector
(2) template<std::size_t Bytes, typename Abi>
constexpr basic_simd_mask<Bytes, Abi>
compress(const basic_simd_mask<Bytes, Abi>& selector,
         const basic_simd_mask<Bytes, Abi>& v)

// Free function to compress simd values using a mask selector
// with a fill value for unused elements
(3) template<typename T, typename Abi>
constexpr basic_simd<T, Abi>
compress(const basic_simd<T, Abi>::mask_type& selector,
         const basic_simd<T, Abi>& v, T fill_value)

// Free function to compress simd_mask elements using a mask selector
// with a fill value for unused elements
(4) template<std::size_t Bytes, typename Abi>
constexpr basic_simd_mask<Bytes, Abi>
compress(const basic_simd_mask<Bytes, Abi>& selector,
         const basic_simd_mask<Bytes, Abi>& v, bool fill_value)

Returns:

  • A basic_simd or basic_simd_mask object in which the first [0..reduce_count(selector)) values are copies of those elements in v whose corresponding mask element is set. The relative order of the values from v is unchanged.

  • For (1) and (2) output values in the range [reduce_count(selector)...v.size)) will be valid but unspecified values of type T.

  • For (3) and (4) output values in the range [reduce_count(selector)...v.size)) will be copies of fill_value.

Remarks:

  • The compress operation is used to throw away unwanted values, so the fill_value is used to put sane values into unused positions. This is in contrast to expand where the operation is typically used to overwrite selected elements of an existing SIMD value and so unused positions come from another SIMD value.

// Free function to expand simd values using a mask selector
template<typename T, typename Abi>
constexpr basic_simd<T, Abi>
expand(const typename basic_simd<T, Abi>::mask_type& selector, const basic_simd<T, Abi>& v,
       const basic_simd<T, Abi>& original = ())

// Free function to expand simd_mask elements using a mask selector
template<std::size_t Bytes, typename Abi>
constexpr basic_simd_mask<Bytes, Abi>
expand(const basic_simd_mask<Bytes, Abi>& selector, const basic_simd_mask<Bytes, Abi>& v,
       const basic_simd_mask<Bytes, Abi>& original = ())

Returns:

  • A basic_simd or basic_simd_mask object in which the first [0..reduce_count(selector)) values are copied to those output elements whose respective selector elements are set. Any output elements whose respective selector elements are not set will have the corresponding element from original copied into that position.

Remarks:

  • expand is often used to overwrite selected positions in an existing SIMD value, so these functions take a complete SIMD value as the source from which to take unused mask selector elements. This is in contrast to compress where the compression operation is effectively throwing away unwanted elements and the fill_value is being used to put sane values into unused positions.

6.4. Add new section [simd.memory_permute]

simd memory permute [simd.memory_permute]
1) template<typename ReturnSimdType = /* deduce */,
         typename GatherRange, std::integral Idx, typename Abi, typename... Flags>
   constexpr simd<std::ranges::range_value_t<GatherRange>, simd<Idx, Abi>::size>
   partial_gather_from(GatherRange&& in, const simd<Idx, Abi>& indexes,
                       simd_flags<Flags...> f = {});

2) template<typename ReturnSimdType = /* deduce */,
            typename GatherRange, std::integral Idx, typename Abi, typename... Flags>
   constexpr simd<std::ranges::range_value_t<GatherRange>, simd<Idx, Abi>::size>
   unchecked_gather_from(GatherRange&& in, const simd<Idx, Abi>& indexes,
                         simd_flags<Flags...> f = {});

3) template<typename ReturnSimdType = /* deduce */,
            typename GatherRange, std::integral Idx, typename Abi, typename... Flags>
   constexpr simd<std::ranges::range_value_t<GatherRange>, simd<Idx, Abi>::size>
   partial_gather_from(GatherRange&& in, const typename simd<Idx, AbiIdx>::mask_type& mask,
                       const simd<Idx, Abi>& indexes, simd_flags<Flags...> f = {});

4) template<typename ReturnSimdType = /* deduce */,
            typename GatherRange, std::integral Idx, typename Abi, typename... Flags>
   constexpr simd<std::ranges::range_value_t<GatherRange>, simd<Idx, Abi>::size>
   unchecked_gather_from(GatherRange&& in, const typename simd<Idx, AbiIdx>::mask_type& mask,
                         const simd<Idx, Abi>& indexes, simd_flags<Flags...> f = {});

Let:

  • ReturnSimdType be simd<std::ranges::range_value_t<GatherRange>, basic_simd<Idx, AbiIdx>::size> if it isn’t specified.

  • mask be simd&lt;Idx, AbiIdx>::mask_type(true) for the overloads with no mask parameter.

Constraints:

  • std::integral<Idx> is true.

  • std::ranges::contiguous_range<GatherRange> is true.

  • std::ranges::sized_range<GatherRange> is true.

Preconditions:

For overloads 2 and 4 then all indexes are contained within the given range.

Mandates

  • If ReturnSimdType is specified at the call-site, then:

    • It must be a basic_simd whose size is basic_simd<Idx, AbiIdx>::size.

    • If the template parameter pack Flags does not contain the type identifying simd_flag_convert then the conversion from range__value_t<GatherRange> to ReturnSimdtype::value_type is value-preserving.

Returns:

  • For overloads 1 and 3 (checked), returns a value of type ReturnSimdType where the ith element will be mask[i] && (indexes[i] < in.size()) ? in[indexes[i]] : typename ReturnSimdType::value_type().

  • For overloads 2 and 4 (unchecked), returns a value of type ReturnSimdType where the ith element will be mask[i] ? in[indexes[i]] : typename ReturnSimdType::value_type().

1) template<typename T, typename TAbi,
            typename ScatterRange,
            std::integral Idx, typename IdxAbi, typename... Flags>
   constexpr void
   partial_scatter_to(const simd<T, TAbi>& value, ScatterRange&& out, const simd<Idx,
                      IdxAbi>& indexes, simd_flags f = {});

2) template<typename T, typename TAbi,
            typename ScatterRange,
            std::integral Idx, typename IdxAbi, typename... Flags>
   constexpr void
   unchecked_scatter_to(const simd<T, TAbi>& value, ScatterRange&& out, const simd<Idx,
                        IdxAbi>& indexes, simd_flags f = {});

3) template<typename T, typename TAbi,
            typename ScatterRange,
            std::integral Idx, typename IdxAbi, typename... Flags>
   constexpr void
   partial_scatter_to(const simd<T, TAbi>& value, ScatterRange&& out,
                      const typename simd<Idx, AbiIdx>::mask_type& mask,
                      const simd<Idx, IdxAbi>& indexes, simd_flags<Flags...> f = {});

4) template<typename T, typename TAbi,
            typename ScatterRange,
            std::integral Idx, typename IdxAbi, typename... Flags>
   constexpr void
   unchecked_scatter_to(const simd<T, TAbi>& value, ScatterRange&& out,
                        const typename simd<Idx, AbiIdx>::mask_type& mask,
                        const simd<Idx, IdxAbi>& indexes, simd_flags<Flags...> f = {});

Let:

mask be simd&lt;Idx, AbiIdx>::mask_type(true) for the overloads with no mask parameter.

Constraints:

  • std::integral<Idx> is true.

  • simd_size_v<simd<T, TAbi>> == simd_size_v<simd<Idx, IdxAbi>> is true.

  • std::ranges::contiguous_range<ScatterRange> is true.

  • std::ranges::sized_range<ScatterRange> is true.

Preconditions:

  • For overloads 2 and 4 then all indexes are contained within the given range.

  • No element value appears more than once in the indexes.

Mandates: If the template parameter pack Flags does not contain the type identifying simd_flag_convert, then the conversion from T to std::ranges::range_value_t<It> is value-preserving.

Effects:

  • For overloads 1 and 3 (partial), for all i in the range of [0, basic_simd<T, Abi>::size()), if mask[𝑖] && indexes[i] < ranges::size(r) is true, evaluates ranges::data(r)[𝑖ndexes[i]] = v[𝑖].

  • For overloads 2 and 4 (unchecked), for all i in the range of [0, basic_simd<T, Abi>::size()), if mask[𝑖] is true, evaluates ranges::data(r)[𝑖ndexes[i]] = v[𝑖].

7. Acknowledgments

Thank you to Matthias Kretz for useful offline discussions regarding the behaviour and contents of the permutation API.

8. Polls

8.1. Varna Meeting - 16th May 2023

POLL: SIMD permute should not support negative indices

SF F N A SA
2 10 0 0 0

POLL: SIMD permute should always pass both an index and a size to the invocable.

SF F N A SA
2 2 6 3 1

POLL: Rename gather to gather_from and scatter to scatter_to and make them member functions that do conversion and take simd_flags<...> (like copy_to/copy_from).

SF F N A SA
6 8 1 0 0

8.2. Telecon Meeting - 2nd May 2023

Poll: We should promise more committee time to pursuing P2664R2 (Proposal to extend std::simd with permutation API), knowing that our time is scarce and this will leave less time for other work.

SF F N A SA
6 4 1 0 0

Poll: Pursue extended indices in the permute callable (e.g., negative indices, special values for zero element, etc.).

SF F N A SA
5 4 1 0 0

Poll: Pursue multi-index subscripting for dynamic permutations in the initial revision.

SF F N A SA
0 1 8 2 0

Poll : Pursue operator[](simd) subscripting for dynamic permutations in the initial revision.

SF F N A SA
1 6 2 0 0

Poll: Pursue operator[](simd_mask) subscripting for dynamic expand (output[] = input) in the initial revision.

SF F N A SA
0 0 7 2 0

References

Informative References

[DNMSO]
Matthias Kretz. Non-Member Subscript Operator. URL: https://web-docs.gsi.de/~mkretz/DNMSO.pdf
[Highway]
Google. Google Highway. URL: https://github.com/google/highway/
[P0918R1]
Tim Shen. More simd<> Operations. 29 March 2018. URL: https://wg21.link/p0918r1
[P1915R0]
Matthias Kretz. Expected Feedback from simd in the Parallelism TS 2. 7 October 2019. URL: https://wg21.link/p1915r0
[P1928R4]
Matthias Kretz. std::simd - Merge data-parallel types from the Parallelism TS 2. 19 May 2023. URL: https://wg21.link/p1928r4
[P2638R0]
Daniel Towner. Intel's response to P1915R0 for std::simd parallelism in TS 2. 22 September 2022. URL: https://wg21.link/p2638r0
[P3299R0]
Daniel Towner, Matthias Kretz. Range constructors for std::simd. 22 May 2024. URL: https://wg21.link/p3299r0