Document number: N3759
Date: 2013-08-30
Project: Programming Language C++, Library Working Group
Reply-to: Matthias Kretz <kretz@kde.org> <kretz@compeng.uni-frankfurt.de>
Introduction
Motivation
Problem
Proposal — SIMD Types
Acknowledgements
References
In Bristol N3571 (A Proposal to add Single Instruction Multiple Data Computation to the Standard Library) was discussed and the following straw poll was taken: “Should C++ include a fixed length vector type to abstract vector registers”
This poll assumed a slightly different approach than the one presented in this proposal. It did not make sense to take over the discussion at that point to talk about this approach. In/after that session I was told that I should write a new proposal.
For many years the number of SIMD instructions and the size of SIMD registers have been growing. Newer microarchitectures introduce new operations to optimize certain (common or specialized) tasks. Additionally, the size of SIMD registers has increased and may increase further in the future.
The typical minimal set of SIMD instructions for a given scalar data type comes down to the following:
loads:      movaps (%rax),%xmm0     movups 0x4(%rax),%xmm1
stores:     movaps %xmm0,(%rax)     movups %xmm1,0x4(%rax)
arithmetic: addps %xmm0,%xmm1       mulps %xmm1,%xmm1       divps %xmm1,%xmm0
compare:    cmpeqps %xmm0,%xmm1
bit-wise:   andps %xmm0,%xmm1       xorps %xmm0,%xmm0
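For illustration only (not part of this proposal): this is roughly how such a minimal operation set is exposed today through x86 SSE compiler intrinsics from <xmmintrin.h>; the function and its computation are made up for the example:

    #include <xmmintrin.h>

    void example(const float *a, const float *b, float *r) {
        __m128 x = _mm_loadu_ps(a);                 // unaligned load  (movups)
        __m128 y = _mm_load_ps(b);                  // aligned load    (movaps), b must be 16-byte aligned
        __m128 s = _mm_add_ps(_mm_mul_ps(x, y), y); // mulps + addps
        s = _mm_div_ps(s, y);                       // divps
        _mm_storeu_ps(r, s);                        // unaligned store (movups)
    }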
The set of available operations can differ considerably between different microarchitectures of the same CPU family. Furthermore there are different SIMD register sizes. Future extensions will certainly add more instructions and larger SIMD registers.
There is no need to motivate SIMD programming. It is very much needed; the only open question is: “How?”.
There have been several approaches to vectorization. I'd like to only discuss the merits of SIMD types here.
SIMD registers and operations are the low-level ingredients to SIMD programming. Higher-level abstractions can be built on top of these. If the lowest-level access to SIMD is not provided, users of C++ will be constrained to work within the limits of the provided abstraction.
In some cases the compiler might generate better code if only the intent is stated instead of an exact sequence of operations. Thus higher-level abstractions might seem preferable to low-level SIMD types. In my experience this is not the case because programming with SIMD types makes intent very clear and compilers can optimize sequences of SIMD operations just like they can for scalar operations. SIMD types do not lead to an easy and obvious answer to efficient and easily usable data structures, though.
One major benefit from SIMD types is that the programmer can gain an intuition for SIMD. This subsequently influences further design of data structures and algorithms to better suit SIMD architectures.
There are already many users of SIMD intrinsics (and thus a primitive form of SIMD types). Providing a cleaner and portable SIMD API would provide many of them with a better alternative. Thus SIMD types in C++ would capture existing practice.
The challenge remains in providing portable SIMD types and operations.
C++ has no means to use SIMD operations directly. There are indirect uses through loop vectorization or optimized algorithms (that use extensions to C/C++ or assembler for their implementation).
All compiler vendors (that I worked with) add intrinsics support to their compiler products to make these operations accessible from C. These intrinsics are inherently not portable and most of the time very directly bound to a specific instruction. (Compilers are able to statically evaluate and optimize SIMD code written via intrinsics, though.)
Solutions that encode data-parallel operations via the application of operations or algorithms on a larger set of data quickly lead to inefficient use of the CPU's caches. Consider valarray:
1 std::valarray<float> data(N);
2 // initialize it somehow
3 data *= 2.f;
4 data += 1.f;

A compiler will be able to vectorize lines 3 and 4.
The wording of § 6.2 [stmt.expr] (“All side effects from an expression statement are completed before the next statement is executed.”) does not necessarily stop the compiler from combining the operations on lines 3 and 4. Still, for a larger number of expressions and containers, it cannot reasonably be expected/required that compilers will be able to combine the statements. Thus, we must assume that line 3 will iterate over N values and only afterwards line 4 is allowed to iterate over the same memory again. But modern CPUs require small working sets to make efficient use of their caches. Therefore N should not be too large. General and exact bounds for N do not exist, though.
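For comparison, an explicit loop touches each element only once, keeping the working set in registers; a minimal (still scalar) sketch of the same computation:

    for (size_t i = 0; i < N; ++i) {
        data[i] = data[i] * 2.f + 1.f; // one pass over memory: load, two operations, store
    }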
Any solution for achieving smaller working sets requires some form of loop construct in current C++.
The following is not a solution for this example unless expression templates were used in the implementation:
...
3 data = data * 2.f + 1.f;
For the reasons sketched in the section above, Intel experimented with an approach that splits the algorithm into optimized working sets at runtime. This required special sections of code that generated, at runtime, the code which was subsequently executed to do the actual work. The semantics in different sections of the code were thus slightly different: something that can be really surprising and hard to understand for developers.
Whenever you take the approach of expressing the algorithm on the whole large data, you will either end up with inefficient cache usage or a solution resembling Intel Array Building Blocks. (I'd be happy to learn of another — better — solution, of course.)
Therefore loops still remain the state of the art for cache efficient processing of large data sets. The loop count can be reduced by executing the loop body simultaneously on SIMD vectors. (Additionally the loop can be divided onto multiple cores.)
The main portability concern stems from different SIMD register widths for different targets. This is a real problem mainly when SIMD types are used in I/O. But this is comparable to differing endianness. It simply requires software to use a portable interchange format (e.g. SoA or AoS of scalar types).
Typically the compiler is told the target microarchitecture via flags. Obviously this will create code that is not guaranteed to run on older/incompatible CPUs. An implementation might decide to compile the code for several targets, though, and select the best one at runtime. But it is also possible to achieve runtime selection of the target microarchitecture without help from the compiler (e.g. Krita uses Vc this way).
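As a sketch of the latter (not part of this proposal): the dispatch can be done with an ordinary runtime check, here using the GCC/Clang builtin __builtin_cpu_supports; transform_avx and transform_sse2 are assumed to be the same code compiled in separate translation units with the respective target flags:

    void transform_sse2(float *data, size_t n); // compiled with -msse2
    void transform_avx (float *data, size_t n); // compiled with -mavx

    void transform(float *data, size_t n) {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx"))
            transform_avx(data, n);
        else
            transform_sse2(data, n);
    }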
This is a pure library proposal. It does depend on implementation-specific language extensions, though (e.g. SIMD intrinsics/builtins). The proposal builds upon the experience from the Vc library [1, 2].
(One issue that would benefit from a core/library solution is the alignment of over-aligned types in dynamic allocation (new, std::allocator); see the operator new / operator delete overloads in the class synopsis below.)
Provide at least the following new types:
int16_v
int32_v
uint16_v
uint32_v
float_v
double_v
int8_v
uint8_v
int64_v
uint64_v
Each class has a single data member, an implementation-specific object to access the SIMD register (this could be __m256 with AVX intrinsics). If the type is not supported for SIMD on the target platform, the data member will be a single scalar value. For example, double_v has one double member for ARM NEON. The sizes of these SIMD types therefore depend on the natural size suggested by the architecture of the execution environment. This is similar to § 3.9.1 [basic.fundamental] p2: “Plain ints have the natural size suggested by the architecture of the execution environment”. These types should be instantiations of a SIMD vector class template (e.g. typedef simd_vector<float> float_v).
This makes generic code slightly easier to create, without the need for SFINAE:
template<typename T> void someFunction(simd_vector<T> v);
// vs.
template<typename V>
typename std::enable_if<is_simd_vector<V>::value, void>::type someFunction(V v);
The SIMD types all need to be inside a namespace whose name depends on the ABI/target. For instance the symbols for SSE and AVX SIMD types must be different so that incompatible object files/libraries fail to link.
The simd_vector<T> class:

inline namespace /*ABI/target-dependent*/ {
template<typename T> class simd_vector {
    typedef implementation-defined simd_type;
    simd_type data;
public:
    typedef T value_type;

    // the number of values in the SIMD vector
    static constexpr size_t size = sizeof(simd_type) / sizeof(value_type);

    // In Vc, operator new / delete are overloaded to work around the fact that new
    // ignores the alignof of the allocated type. If this does not get solved in the
    // standard, the following will be needed:
    void *operator new(size_t size);
    void *operator new(size_t, void *p) { return p; }
    void *operator new[](size_t size);
    void *operator new[](size_t, void *p) { return p; }
    void operator delete(void *ptr, size_t);
    void operator delete(void *, void *) {}
    void operator delete[](void *ptr, size_t);
    void operator delete[](void *, void *) {}

    // init to zero
    simd_vector();

    // copy
    simd_vector(const simd_vector &);
    simd_vector &operator=(const simd_vector &);

    // implicit conversion from compatible simd_vector<U>
    template<typename U>
    simd_vector(simd_vector<U>,
                typename enable_if<is_integral<value_type>::value && is_integral<U>::value &&
                                   size == simd_vector<U>::size, void *>::type = nullptr);

    // static_cast from vectors of possibly (depending on target) different size
    // (dropping values or filling with 0 if the size is not equal)
    template<typename U>
    explicit simd_vector(simd_vector<U>,
                         typename enable_if<!(is_integral<value_type>::value && is_integral<U>::value &&
                                              size == simd_vector<U>::size), void *>::type = nullptr);

    // broadcast with implicit conversions
    simd_vector(value_type);

    // load member functions
    void load(const value_type *mem);
    template<typename U> void load(const U *mem);

    // load ctors (optional)
    explicit simd_vector(const value_type *mem);
    template<typename U> explicit simd_vector(const U *mem);

    // store functions
    void store(value_type *mem) const;
    template<typename U> void store(U *mem) const;

    // unary operators
    simd_vector &operator++();
    simd_vector operator++(int);
    simd_vector &operator--();
    simd_vector operator--(int);
    simd_vector operator~() const;
    simd_vector operator+() const;
    simd_vector<typename negate_type<T>::type> operator-() const;

    // assignment operators
    simd_vector &operator+=(simd_vector<T> x);
    simd_vector &operator-=(simd_vector<T> x);
    simd_vector &operator*=(simd_vector<T> x);
    simd_vector &operator/=(simd_vector<T> x);
    simd_vector &operator%=(simd_vector<T> x);
    simd_vector &operator&=(simd_vector<T> x);
    simd_vector &operator|=(simd_vector<T> x);
    simd_vector &operator^=(simd_vector<T> x);
    simd_vector &operator<<=(simd_vector<T> x);
    simd_vector &operator>>=(simd_vector<T> x);
    simd_vector &operator<<=(int x);
    simd_vector &operator>>=(int x);

    // scalar entries access
    implementation-defined &operator[](size_t index);
    value_type operator[](size_t index) const;
};
typedef simd_vector<   float>  float_v;
typedef simd_vector<  double> double_v;
typedef simd_vector< int64_t>  int64_v;
typedef simd_vector<uint64_t> uint64_v;
typedef simd_vector< int32_t>  int32_v;
typedef simd_vector<uint32_t> uint32_v;
typedef simd_vector< int16_t>  int16_v;
typedef simd_vector<uint16_t> uint16_v;
typedef simd_vector<  int8_t>   int8_v;
typedef simd_vector< uint8_t>  uint8_v;
} // namespace
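To illustrate how generic code can be written against this interface, here is a minimal sketch (not part of the proposal) that assumes count is a multiple of simd_vector<T>::size and uses only members from the synopsis above:

    template<typename T>
    T sum(const T *mem, size_t count) {
        simd_vector<T> acc; // zero-initialized
        for (size_t i = 0; i < count; i += simd_vector<T>::size) {
            simd_vector<T> tmp;
            tmp.load(&mem[i]); // load size consecutive values
            acc += tmp;
        }
        T result = 0;
        for (size_t i = 0; i < simd_vector<T>::size; ++i) {
            result += acc[i];  // horizontal reduction via scalar access
        }
        return result;
    }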
Implicit conversion between vectors and implicit conversion from scalars to vectors follows the rules of conversion between scalar types as closely as possible. The rules are:
If U implicitly converts to T, then implicit conversion from U to simd_vector<T> also works.
If simd_vector<U>::size == simd_vector<T>::size holds portably (i.e. on all targets), then implicit conversion from simd_vector<U> to simd_vector<T> also works.
In Vc the following guarantees are made:
float_v::size == int32_v::size == uint32_v::size
int64_v::size == uint64_v::size
int32_v::size == uint32_v::size
int16_v::size == uint16_v::size
int8_v::size == uint8_v::size
No further size relations between the different simd_vector<T> types are guaranteed.
Explicit conversions between vectors of possibly different size are allowed and will either drop the last values or fill the remaining values in the destination with zeros. (Together with vector shift/rotate functions and a bit of care this allows portable code that casts between int and float vectors.)
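A small sketch of what these rules mean in practice (assuming the typedefs from the synopsis above):

    void conversions(int32_v i, float_v f) {
        uint32_v u = i;                       // implicit: both integral, sizes guaranteed equal
        int32_v  b = 1;                       // implicit broadcast from a scalar
        int32_v  c = static_cast<int32_v>(f); // explicit: sizes may differ between targets
        // float_v d = i;                     // error: no implicit conversion, use static_cast
    }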
The most important portable way to efficiently fill a complete SIMD vector is via loads.
A load requires a pointer to at least size consecutive values in memory.
Whether load constructors should be specified needs to be discussed.
Their use does not explicitly state the operation:
float *data = ...;
float_v a(data); // it is not obvious what this line of code does
float_v b;
b.load(data);    // this is a lot clearer

On the other hand there should be a way to fill a vector on construction (e.g. to ease the use of const).
As for loads, the most important portable way to efficiently store the results from a SIMD vector is via store functions.
The overloads of load/store with value_type* exist to support arguments that have an implicit conversion to value_type*.
Loads and stores can optionally do a conversion from/to memory of a different type (e.g. float_v fun(short *mem) { return float_v(mem); }). This feature is important because it allows the type used in memory to differ from the type used for computation (e.g. float for storage and double_v for calculation).
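A sketch of both directions (float in memory, double_v for computation, using only the load/store members from the synopsis):

    double_v load_square(const float *mem) {
        double_v v;
        v.load(mem);   // converting load: reads double_v::size floats, converts to double
        v *= v;
        return v;
    }

    void store_result(double_v v, float *mem) {
        v.store(mem);  // converting store: converts to float, writes double_v::size values
    }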
All arithmetic, logical, and bit-wise operators are overloaded for combinations of different simd_vector<T> and builtin scalar types.
template<typename L, typename R>
typename determine_return_type<L, R>::type operator+(L &&x, R &&y) {
    typedef typename determine_return_type<L, R>::type V;
    return internal_add_function(V(std::forward<L>(x)), V(std::forward<R>(y)));
}

With such non-member operators it is possible to get correct automatic conversion. The determine_return_type type decides which type combinations are allowed and what conversions are done.
Consider:
void f(int32_v x, unsigned int y) {
    auto z = x * y;
    ...

We expect the type of z to be uint32_v, analogous to 1 * 1u yielding an unsigned int.
On the other hand the following example must always fail to compile:
void f(int32_v x, double_v y) {
    auto z = x * y;
    ...

While 1 * 1. is well-defined and yields a double, the issue with SIMD vectors is that their number of entries is not guaranteed to match. (Consider x holding four values and y holding two values. The semantics of a multiplication of x with y are unclear. It could mean [x0 * y0, x1 * y0, x2 * y1, x3 * y1], or [x0 * y0, x1 * y1, x2, x3], or [x0 * y0, x1 * y1, x2 * y0, x3 * y1], or [x0 * y0, x1 * y1], or [x2 * y0, x3 * y1]. Or, for a different target, it could even simply mean [x0 * y0].)
Via SFINAE the operators can be defined such that only the following type combinations work:
simd_vector<T> × From, if is_convertible<From, simd_vector<T>> && !is_narrowing_float_conversion<From, T>
A follow-up proposal will define the determine_return_type class. This proposal could also discuss operator implementations for better error reporting (via static_assert) if a simd_vector is combined with an incompatible second operand.
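A possible sketch of such error reporting (the trait has_determined_return_type is hypothetical, standing for "determine_return_type<L, R> has a valid result"):

    template<typename L, typename R>
    typename std::enable_if<!has_determined_return_type<L, R>::value, void>::type
    operator+(L &&, R &&) {
        static_assert(has_determined_return_type<L, R>::value,
                      "invalid combination of operands for simd_vector operator+ "
                      "(e.g. the number of vector entries may differ between the operand types)");
    }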
All the functions in <cmath> can/should be overloaded to accept SIMD vectors as input.
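For example (a sketch of possible overload declarations; the exact form would be the subject of a follow-up proposal):

    template<typename T> simd_vector<T> sqrt (simd_vector<T> x);
    template<typename T> simd_vector<T> abs  (simd_vector<T> x);
    template<typename T> simd_vector<T> atan2(simd_vector<T> y, simd_vector<T> x);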
With the simd_vector<T> class it is possible to explicitly vectorize simple loops:
std::array<float, 1024> x_data, y_data, r_data, phi_data;
// fill x and y with random values from -1 to 1
// The following loop converts the x,y coordinates into r,phi polar coordinates.
for (int i = 0; i < 1024; i += float_v::size) {
    const float_v x(&x_data[i]);
    const float_v y(&y_data[i]);
    const float_v r = sqrt(x * x + y * y);
    const float_v phi = atan2(y, x) * 57.29578f;
    r.store(&r_data[i]);
    phi.store(&phi_data[i]);
}

This example already shows a few issues that will be solved in follow-up proposals.
If this example is compiled for an x86 target with SSE2 instructions the loop body calculates four results in parallel.
If it is compiled for an x86 target with AVX instructions the loop body calculates eight results in parallel.
If the target does not support float SIMD vectors the loop body calculates just one result.
This is a (slightly reduced) Kalman-Filter example from CERN/RHIC track reconstruction. It is used to fit the track parameters of several particles in parallel.
struct Covariance {
    float_v C00,
            C10, C11,
            C20, C21, C22,
            C30, C31, C32, C33,
            C40, C41, C42, C43, C44;
};
struct Track {
    float_v x, y, tx, ty, qp, z;
    float_v chi2;
    Covariance C;
    float_v NDF;
    ...
};
struct HitInfo {
    float_v cos_phi, sin_phi, sigma2, sigma216;
};

void Filter(Track &track, const HitInfo &info, float_v m) {
    Covariance &C = track.C;
    const float_v residual = info.cos_phi * track.x + info.sin_phi * track.y - m; // ζ = Hr - m
    const float_v F0 = info.cos_phi * C.C00 + info.sin_phi * C.C10;               // CHᵀ
    const float_v F1 = info.cos_phi * C.C10 + info.sin_phi * C.C11;
    const float_v F2 = info.cos_phi * C.C20 + info.sin_phi * C.C21;
    const float_v F3 = info.cos_phi * C.C30 + info.sin_phi * C.C31;
    const float_v F4 = info.cos_phi * C.C40 + info.sin_phi * C.C41;
    const float_v HCH = F0 * info.cos_phi + F1 * info.sin_phi;                    // HCHᵀ
    const float_v wi = 1.f / (info.sigma2 + HCH);
    const float_v zetawi = residual * wi;                                         // (V + HCHᵀ)⁻¹ ζ
    const float_v K0 = F0 * wi;
    const float_v K1 = F1 * wi;
    const float_v K2 = F2 * wi;
    const float_v K3 = F3 * wi;
    const float_v K4 = F4 * wi;
    track. x -= F0 * zetawi;                                                      // r -= CHᵀ (V + HCHᵀ)⁻¹ ζ
    track. y -= F1 * zetawi;
    track.tx -= F2 * zetawi;
    track.ty -= F3 * zetawi;
    track.qp -= F4 * zetawi;
    C.C00 -= K0 * F0;                                                             // C -= CHᵀ (V + HCHᵀ)⁻¹ HC
    C.C10 -= K1 * F0; C.C11 -= K1 * F1;
    C.C20 -= K2 * F0; C.C21 -= K2 * F1; C.C22 -= K2 * F2;
    C.C30 -= K3 * F0; C.C31 -= K3 * F1; C.C32 -= K3 * F2; C.C33 -= K3 * F3;
    C.C40 -= K4 * F0; C.C41 -= K4 * F1; C.C42 -= K4 * F2; C.C43 -= K4 * F3; C.C44 -= K4 * F4;
    track.chi2 += residual * zetawi;                                              // χ² += ζ (V + HCHᵀ)⁻¹ ζ
    track.NDF += 1;
}

In the Filter function a new measurement (m) is added to the Track state.
In the data structures used here, the objects each contain the data of float_v::size tracks.
So instead of working with one track at a time, the code explicitly states that multiple tracks can be filtered together and that their data is stored interleaved in memory.
(With tn denoting the n-th track, the memory layout of a Track object looks like this with e.g. SSE:
[ x_t0   x_t1   x_t2   x_t3
  y_t0   y_t1   y_t2   y_t3
 tx_t0   ... ]
With AVX:
[ x_t0   x_t1   x_t2   x_t3   x_t4   x_t5   x_t6   x_t7
  y_t0   y_t1   y_t2   y_t3   y_t4   ... ])
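A hypothetical usage sketch (nextHit and nextMeasurement are assumed helpers providing the hit information and measurement for each group of float_v::size tracks):

    void filterAll(Track *tracks, size_t trackCount) {
        // trackCount scalar tracks are stored as trackCount / float_v::size Track objects
        for (size_t i = 0; i < trackCount / float_v::size; ++i) {
            Filter(tracks[i], nextHit(i), nextMeasurement(i));
        }
    }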