P3440R0
Add n_elements named constructor to std::simd

Published Proposal,

This version:
http://wg21.link/P3440R0
Author:
(Intel)
Audience:
LEWG
Project:
ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21

Abstract

This paper proposes adding a named constructor, std::simd_mask::n_elements, which creates a mask containing an exact number of set bits. Such a function is notably useful for handling loop remainders.

1. Motivation

When iterating over a large dynamic data set in a std::simd ([P1928R8]) loop, there will inevitably be situations where the very last block of data doesn’t fill an entire std::simd object. This remainder needs to be processed using a partially filled std::simd object. For example:

void fn(float* ptr, std::size_t count)
{
  // Process complete SIMD blocks.
  auto wholeBlocks = count / simd<float>::size;
  for (std::size_t i = 0; i < wholeBlocks; ++i)
  {
    // Load one complete block starting at element i * size.
    auto block = simd<float>(ptr + i * simd<float>::size);
    process(block);  // Process an entire simd-worth of data.
  }

  // Process the remainder.
  auto remainder = count % simd<float>::size;
  if (remainder > 0)
  {
    // Activate only the elements [0..remainder).
    simd_mask<float> remainderMask([=](auto idx) { return idx < remainder; });
    auto remainderBlock =
      simd_load<simd<float>>(ptr + (count - remainder), remainder, simd_default_init_flag);
    process(remainderBlock, remainderMask); // Do the work on part of the SIMD only.
  }
}

In this example the remainder has been handled by creating a mask in which only the bits [0..remainder) are active. Note that the partial load has been handled using the load_from(range) function described in [P3299R1] which is memory-safe and likely to be efficiently implemented. However, the processing itself is taking a simd and only operating on the subset of its elements which correspond to the remainder, and for this processing a suitable remainder mask must be generated.

In the example the remainder mask has been created using a mask generator, where each bit in the mask is created using a comparison against the number of required bits. There are other ways of creating that mask, three variants of which are illustrated here:

int numRemainderBits = ...;

// Assume some sort of `iota` simd object [P3319R1]
// This is quite compact, but will have some runtime conversion to deal
// with the `float` comparison.
auto remainder1 = simd<float>::iota() < numRemainderBits;

// Like the previous, but explicitly avoid the runtime conversion to float.
auto tmp = simd<uint32_t>::iota() < numRemainderBits; // Create an n-element mask.
auto remainder2 = simd_mask<float>(tmp); // Convert to the correct type of mask.

// Use the facilities of new constructors [P2876R1] to build a mask from an
// integer bit set. This generates efficient code on compact-mask
// machines (e.g., Intel AVX-512, AVX10). It doesn’t handle masks containing
// more than 64 elements without a change in type.
auto m = (uint64_t(1) << numRemainderBits) - 1;
auto remainder3 = simd_mask<float>(m);

One serious issue with this selection of methods is that there is no single obvious style to use to generate the best code across a range of targets. For example, the last method works well on compact-mask targets (e.g., Intel AVX-512), but poorly on wide-mask targets (e.g., Intel SSE). Adding conditional code around the mask to reflect on the target and generate the mask differently just leads to a reduction in portability and an increase in verbosity.
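The conditional code that the previous paragraph warns about can be sketched as follows. This is only an illustration using plain integers and arrays rather than std::simd: the names compact_mask_target, wide_mask_target, make_compact, make_wide and make_remainder_mask are hypothetical and appear in no proposal, and a real program would also have to detect the target itself.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical target descriptions (assumptions for this sketch only).
struct compact_mask_target { static constexpr bool compact = true; };
struct wide_mask_target    { static constexpr bool compact = false; };

// Compact form: one bit per element. Only valid for 0 <= n < 64.
constexpr std::uint64_t make_compact(int n)
{
    return (std::uint64_t{1} << n) - 1;
}

// Wide form: one full-width element per lane, all-ones when active.
template <std::size_t Size>
constexpr std::array<std::uint32_t, Size> make_wide(int n)
{
    std::array<std::uint32_t, Size> lanes{};
    for (std::size_t i = 0; i < Size; ++i)
        lanes[i] = (static_cast<int>(i) < n) ? 0xFFFFFFFFu : 0u;
    return lanes;
}

// The user-side dispatch: two code paths and two result types
// for what is conceptually a single operation.
template <class Target, std::size_t Size>
constexpr auto make_remainder_mask(int n)
{
    if constexpr (Target::compact)
        return make_compact(n);
    else
        return make_wide<Size>(n);
}
```

Even in this reduced form the caller has to know which representation the target prefers, which is the portability cost described above.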

Manual mask generation can introduce subtle issues in corner cases. For example, constructing from a compact integer as in the last code snippet above works only if the integer itself is constructed properly. If the integer type is too small (e.g., uint16_t for a 64-element mask), the construction may silently fail on some targets. Likewise, a wide-mask variant which generates the mask using the comparison simd<uint8_t>::iota() < n will fail if simd<uint8_t> has more than 256 elements, because the element indexes no longer fit in a uint8_t, leading to future portability issues.
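The corner cases can be reproduced with ordinary integer arithmetic, without any std::simd implementation. The sketch below is illustrative only: low_bits and lane_index are hypothetical helpers, not part of any proposal. It also guards the shift, because shifting a 64-bit integer by 64 is undefined behaviour in C++, which is a further trap in the bit-set approach when all 64 bits are requested.

```cpp
#include <cstdint>

// Hypothetical helper: a 64-bit value with the low n bits set.
// The explicit range checks are needed because a shift by 64 is undefined.
constexpr std::uint64_t low_bits(int n)
{
    if (n <= 0)  return 0;
    if (n >= 64) return ~std::uint64_t{0};
    return (std::uint64_t{1} << n) - 1;
}

// Corner case 1: the integer type is too small for the mask.
// Asking for 20 bits but storing them in 16 silently drops the top 4.
constexpr auto narrow = static_cast<std::uint16_t>(low_bits(20));
static_assert(narrow == 0xFFFF);

// Corner case 2: element indexes stored in uint8_t wrap around.
// Lane 256 is indistinguishable from lane 0, so `index < n` would
// wrongly mark it active for any n > 0.
constexpr std::uint8_t lane_index(int i)
{
    return static_cast<std::uint8_t>(i);
}
static_assert(lane_index(255) == 255);
static_assert(lane_index(256) == 0);
```

Neither mistake produces a diagnostic, which is why they are easy to miss until the code is moved to a target with a larger native mask.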

To avoid the issues with manual mask generation we propose that a named constructor is provided which populates a simd_mask with exactly N bits active at positions [0..N). By making this function part of std::simd itself the implementation can choose the most efficient implementation for the target, and it can correctly handle all possible corner cases:

static constexpr basic_simd_mask
basic_simd_mask::n_elements(simd-size-type count);

Given a count of zero or less, an empty mask will be returned. When the count is in the range (0..size) a mask with exactly that many bits set, at positions [0..count), will be returned. When the count is equal to or larger than the size of the mask, a mask with all bits set will be returned.
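The behaviour described above can be modelled in scalar code. The following is a minimal sketch, not the proposed implementation: mask_model stands in for basic_simd_mask, int stands in for simd-size-type (assumed here to be a signed integer type), and the names n_elements_model and popcount_model are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar stand-in for a mask with Size elements.
template <std::size_t Size>
using mask_model = std::array<bool, Size>;

// Element i is set exactly when i < count, for all i in [0..Size).
template <std::size_t Size>
constexpr mask_model<Size> n_elements_model(int count)
{
    mask_model<Size> m{};
    for (std::size_t i = 0; i < Size; ++i)
        m[i] = static_cast<long long>(i) < count;
    return m;
}

// Number of set elements, used only to check the model.
template <std::size_t Size>
constexpr int popcount_model(const mask_model<Size>& m)
{
    int n = 0;
    for (bool b : m)
        n += b ? 1 : 0;
    return n;
}

static_assert(popcount_model(n_elements_model<8>(-2)) == 0);  // negative: empty
static_assert(popcount_model(n_elements_model<8>(0)) == 0);   // zero: empty
static_assert(popcount_model(n_elements_model<8>(3)) == 3);   // partial
static_assert(popcount_model(n_elements_model<8>(8)) == 8);   // exactly full
static_assert(popcount_model(n_elements_model<8>(100)) == 8); // saturates
```

The single comparison i < count covers every case, including negative and oversized counts, so the caller needs no range check of its own.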

2. Implementation experience

Intel’s implementation of std::simd has had this named constructor since very early on, and it is used throughout our example code base. It makes generating efficient remainder masks across all Intel targets easy, and it makes the code’s intent very obvious.

3. Wording

3.1. Modify [simd.mask.overview]

Add the following declaration to the basic_simd_mask synopsis in [simd.mask.overview]:

static constexpr basic_simd_mask n_elements(simd-size-type count) noexcept;

3.2. Modify basic_simd_mask constructors [simd.mask.ctor]

basic_simd_mask constructors [simd.mask.ctor]

static constexpr basic_simd_mask n_elements(simd-size-type count) noexcept;

Effects:

  • Initialises the ith element with the result of i < count for all i in the range of [0..size).

Remarks:

  • The count can be any valid value of simd-size-type.

    • If count is zero or less, an empty mask is returned.

    • If count is greater than or equal to size, a full mask is returned.

Throws:

  • Nothing.

References

Informative References

[P1928R8]
Matthias Kretz. std::simd - Merge data-parallel types from the Parallelism TS 2. 9 November 2023. URL: https://wg21.link/p1928r8