1. Motivation
When iterating over large dynamic data sets using a simd ([P1928R8]) loop,
there will inevitably be situations where the very last block of data
doesn’t fill the entire simd object. This remainder needs to be processed
using a partially filled simd object. For example:
    void fn(float* ptr, std::size_t count) {
        // Process complete SIMD blocks.
        auto wholeBlocks = count / simd<float>::size;
        for (std::size_t i = 0; i < wholeBlocks; ++i) {
            // Load an entire simd-worth of data and process it.
            auto block = simd<float>(ptr + i * simd<float>::size);
            process(block);
        }

        // Process the remainder.
        auto remainder = count % simd<float>::size;
        if (remainder > 0) {
            // Only the first `remainder` elements of the mask are active.
            simd_mask<float> remainderMask([=](auto idx) { return idx < remainder; });
            auto remainderBlock = simd_load<simd<float>>(ptr + (count - remainder), remainder, simd_default_init_flag);
            process(remainderBlock, remainderMask); // Do the work on part of the SIMD only.
        }
    }
In this example the remainder has been handled by creating a mask in which only
the bits in the range [0..remainder) are active. Note that the partial load has
been handled using the simd_load function described in [P3299R1], which is
memory-safe and likely to be efficiently implemented. However, the processing
itself takes a simd and operates only on the subset of its elements which
correspond to the remainder, and for this processing a suitable remainder mask
must be generated.
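The block-plus-remainder pattern can be tried out without a simd implementation. The following is a minimal scalar sketch, not the code of any real simd library: kWidth, masked_sum and sum_in_blocks are hypothetical names, a std::array of float stands in for simd<float>, and a std::array of bool stands in for simd_mask<float>.

```cpp
#include <array>
#include <cstddef>

// Hypothetical block width standing in for simd<float>::size.
constexpr std::size_t kWidth = 8;

using Block = std::array<float, kWidth>;
using Mask = std::array<bool, kWidth>;

// Scalar stand-in for masked processing: only lanes whose mask entry is
// true contribute to the result.
float masked_sum(const Block& block, const Mask& mask) {
    float total = 0.0f;
    for (std::size_t lane = 0; lane < kWidth; ++lane) {
        if (mask[lane]) total += block[lane];
    }
    return total;
}

// Sums `count` floats block by block, using a remainder mask for the tail.
float sum_in_blocks(const float* ptr, std::size_t count) {
    const std::size_t wholeBlocks = count / kWidth;
    const std::size_t remainder = count % kWidth;
    float total = 0.0f;

    // Whole blocks: every lane is active.
    Mask fullMask{};
    fullMask.fill(true);
    for (std::size_t i = 0; i < wholeBlocks; ++i) {
        Block block{};
        for (std::size_t lane = 0; lane < kWidth; ++lane) {
            block[lane] = ptr[i * kWidth + lane];
        }
        total += masked_sum(block, fullMask);
    }

    // Remainder: only lanes [0..remainder) are loaded and active; the other
    // lanes keep their value-initialised contents and are masked off.
    if (remainder > 0) {
        const float* tail = ptr + (count - remainder);
        Block block{};
        Mask mask{};
        for (std::size_t lane = 0; lane < remainder; ++lane) {
            block[lane] = tail[lane];
            mask[lane] = true;
        }
        total += masked_sum(block, mask);
    }
    return total;
}
```

With kWidth equal to 8, a call with count equal to 19 processes two whole blocks and then a final block in which only three lanes are active.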
In the example the remainder mask has been created using a mask generator, where each bit in the mask is created using a comparison against the number of required bits. There are other ways of creating that mask, three variants of which are illustrated here:
    int numRemainderBits = ...;

    // Assume some sort of `iota` simd object [P3319R1].
    // This is quite compact, but will have some runtime conversion to deal
    // with the `float` comparison.
    auto remainder1 = simd<float>::iota() < numRemainderBits;

    // Like the previous, but explicitly avoid the runtime conversion to float.
    auto tmp = simd<uint32_t>::iota() < numRemainderBits; // Create an n-element mask.
    auto remainder2 = simd_mask<float>(tmp); // Convert to the correct type of mask.

    // Use the facilities of new constructors [P2876R1] to build a mask from an
    // integer bit set. This generates efficient code on compact mask
    // machines (e.g., Intel AVX-512, AVX-10). It doesn’t handle masks containing more than
    // 64 elements without a change in type.
    auto m = (uint64_t(1) << numRemainderBits) - 1;
    auto remainder3 = simd_mask<float>(m);
One serious issue with this selection of methods is that there is no single obvious style to use to generate the best code across a range of targets. For example, the last method works well on compact-mask targets (e.g., Intel AVX-512), but poorly on wide-mask targets (e.g., Intel SSE). Adding conditional code around the mask to reflect on the target and generate the mask differently just leads to a reduction in portability and an increase in verbosity.
Manual mask generation can introduce subtle issues for corner cases. For
example, constructing from a compact integer as in the last code snippet above
will work only if the integer itself is constructed properly. If the integer
type is too small (e.g., uint16_t for a 64-bit mask) it may silently fail for
some targets. Similarly, a wide-mask variant that generates the mask by
comparing an iota of a narrow element type (e.g., 8-bit) against the count will
fail if the simd has more than 256 elements, leading to future portability
issues.
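Two of these corner cases can be reproduced with plain integers. The sketch below uses hypothetical helper names (low_bits_naive, low_bits_checked, narrow_to_16) and does not touch any simd type. It shows that the shift-based bit set is only defined for counts below 64, since shifting a 64-bit value by 64 or more is undefined behaviour in C++, and that narrowing the bit set to a smaller integer type silently drops the upper lanes.

```cpp
#include <cstdint>

// Naive form of the bit-set construction: only valid for n in [0, 63],
// because a shift by 64 or more on a 64-bit value is undefined behaviour.
constexpr std::uint64_t low_bits_naive(unsigned n) {
    return (std::uint64_t{1} << n) - 1;
}

// Hypothetical guarded variant: saturates to an all-ones bit set when the
// requested count reaches or exceeds the 64 available bits.
constexpr std::uint64_t low_bits_checked(unsigned n) {
    return n >= 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << n) - 1;
}

// Storing the bit set in a type that is too narrow silently discards the
// bits for every lane above the 16th.
constexpr std::uint16_t narrow_to_16(std::uint64_t bits) {
    return static_cast<std::uint16_t>(bits);
}
```

A caller who writes the naive form by hand has to remember the guard on every use, which is the kind of detail a library-provided named constructor can take care of once.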
To avoid the issues with manual mask generation we propose that a named
constructor is provided which populates a basic_simd_mask with exactly N bits
active at positions [0..N). By making this function part of basic_simd_mask
itself the implementation can choose the most efficient implementation for the
target, and it can correctly handle all possible corner cases:
    static constexpr basic_simd_mask basic_simd_mask::n_elements(simd-size-type count);
Given a count of zero, an empty mask will be returned. When the count is
between zero and the size of the mask, a mask with just that many bits set will
be returned. When the count is larger than the mask, a mask with all bits set
will be returned.
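The intended behaviour can be modelled in a few lines of scalar code. This is only a sketch of the semantics described above and not the proposed implementation: n_elements_model is a hypothetical name, std::bitset stands in for basic_simd_mask, and std::ptrdiff_t is assumed as a stand-in for the signed simd-size-type.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical scalar model of the proposed n_elements: element i of the
// result is set exactly when i < count, so counts of zero or less give an
// empty mask and counts of N or more give a full mask.
template <std::size_t N>
std::bitset<N> n_elements_model(std::ptrdiff_t count) {
    std::bitset<N> mask;
    for (std::size_t i = 0; i < N; ++i) {
        mask[i] = static_cast<std::ptrdiff_t>(i) < count;
    }
    return mask;
}
```

For an 8-element mask, a count of 3 sets only the three lowest elements, while a count of 100 saturates to a full mask rather than failing.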
2. Implementation experience
Intel’s implementation of simd has had this named constructor since very
early on, and it is used throughout our example code base. It makes generating
mask remainders efficient and easy across all Intel targets, and it
makes the code’s intent very obvious.
3. Wording
3.1. Modify [simd.mask.overview]
Add the following to the basic_simd_mask synopsis in [simd.mask.overview]:
    static constexpr basic_simd_mask n_elements(simd-size-type count) noexcept;
3.2. Modify basic_simd_mask constructors [simd.mask.ctor]
    static constexpr basic_simd_mask n_elements(simd-size-type count) noexcept;

Effects: Initialises the ith element with the result of i < count for all i in the range of [0..size).

Remarks: The count can be any valid value of simd-size-type. If count is zero or less, an empty mask is returned. If count is greater than size, a full mask is returned.

Throws: Nothing.