Document number: N3759
Date: 2013-08-30
Project: Programming Language C++, Library Working Group
Reply-to: Matthias Kretz <kretz@kde.org> <kretz@compeng.uni-frankfurt.de>
Introduction
Motivation
Problem
Proposal — SIMD Types
Acknowledgements
References
In Bristol N3571 (A Proposal to add Single Instruction Multiple Data Computation to the Standard Library) was discussed and the following straw poll was taken: “Should C++ include a fixed length vector type to abstract vector registers”
This poll assumed a slightly different approach than the one presented in this proposal. It did not make sense to take over the discussion at that point to talk about this approach. In/after that session I was told that I should write a new proposal.
For many years the number of SIMD instructions and the size of SIMD registers have been growing. Newer microarchitectures introduce new operations to optimize certain (common or specialized) tasks. Additionally, the size of SIMD registers has increased and may increase further in the future.
The typical minimal set of SIMD instructions for a given scalar data type comes down to the following:
loads:      movaps (%rax),%xmm0     movups 0x4(%rax),%xmm1
stores:     movaps %xmm0,(%rax)     movups %xmm1,0x4(%rax)
arithmetic: addps %xmm0,%xmm1       mulps %xmm1,%xmm1       divps %xmm1,%xmm0
compare:    cmpeqps %xmm0,%xmm1
bit-wise:   andps %xmm0,%xmm1       xorps %xmm0,%xmm0
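For illustration only (not part of this proposal): this is roughly how such a minimal operation set is exposed today through x86 SSE compiler intrinsics from <xmmintrin.h>; the function and its computation are made up for the example:

    #include <xmmintrin.h>

    void example(const float *a, const float *b, float *r) {
        __m128 x = _mm_loadu_ps(a);                 // unaligned load  (movups)
        __m128 y = _mm_load_ps(b);                  // aligned load    (movaps), b must be 16-byte aligned
        __m128 s = _mm_add_ps(_mm_mul_ps(x, y), y); // mulps + addps
        s = _mm_div_ps(s, y);                       // divps
        _mm_storeu_ps(r, s);                        // unaligned store (movups)
    }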
The set of available operations can differ considerably between different microarchitectures of the same CPU family. Furthermore there are different SIMD register sizes. Future extensions will certainly add more instructions and larger SIMD registers.
There is no need to motivate SIMD programming. It is very much needed; the only open question is: “How?”.
There have been several approaches to vectorization. I'd like to only discuss the merits of SIMD types here.
SIMD registers and operations are the low-level ingredients to SIMD programming. Higher-level abstractions can be built on top of these. If the lowest-level access to SIMD is not provided, users of C++ will be constrained to work within the limits of the provided abstraction.
In some cases the compiler might generate better code if only the intent is stated instead of an exact sequence of operations. Thus higher-level abstractions might seem preferable to low-level SIMD types. In my experience this is not the case because programming with SIMD types makes intent very clear and compilers can optimize sequences of SIMD operations just like they can for scalar operations. SIMD types do not lead to an easy and obvious answer to efficient and easily usable data structures, though.
One major benefit from SIMD types is that the programmer can gain an intuition for SIMD. This subsequently influences further design of data structures and algorithms to better suit SIMD architectures.
There are already many users of SIMD intrinsics (and thus a primitive form of SIMD types). Providing a cleaner and portable SIMD API would provide many of them with a better alternative. Thus SIMD types in C++ would capture existing practice.
The challenge remains in providing portable SIMD types and operations.
C++ has no means to use SIMD operations directly. There are indirect uses through loop vectorization or optimized algorithms (that use extensions to C/C++ or assembler for their implementation).
All compiler vendors (that I worked with) add intrinsics support to their compiler products to make these operations accessible from C. These intrinsics are inherently not portable and most of the time very directly bound to a specific instruction. (Compilers are able to statically evaluate and optimize SIMD code written via intrinsics, though.)
Solutions that encode data-parallel operations via the application of operations or algorithms on a larger set of data quickly lead to inefficient use of the CPU's caches. Consider valarray:
1 std::valarray<float> data(N);
2 // initialize it somehow
3 data *= 2.f;
4 data += 1.f;

A compiler will be able to vectorize lines 3 and 4.
The wording of § 6.2 [stmt.expr] (“All side effects from an expression statement are completed before the next statement is executed.”) does not necessarily stop the compiler from combining the operations on lines 3 and 4. Still, for a larger number of expressions and containers, it cannot reasonably be expected/required that compilers will be able to combine the statements. Thus, we must assume that line 3 will iterate over N values and only afterwards line 4 is allowed to iterate over the same memory again. But modern CPUs require small working sets to make efficient use of their caches. Therefore N should not be too large. General and exact bounds for N do not exist, though.
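For comparison, an explicit loop touches each element only once, keeping the working set in registers; a minimal (still scalar) sketch of the same computation:

    for (size_t i = 0; i < N; ++i) {
        data[i] = data[i] * 2.f + 1.f; // one pass over memory: load, two operations, store
    }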
Any solution for achieving smaller working sets requires some form of loop construct in current C++.
The following is not a solution for this example unless expression templates were used in the implementation:
...
3 data = data * 2.f + 1.f;
For the reasons sketched in the section above, Intel experimented with an approach that splits the algorithm into optimized working sets at runtime. This required special sections of code that generated, at runtime, the code which was subsequently executed to do the actual work. The semantics in different sections of the code were thus slightly different: something that can be really surprising and hard to understand for developers.
Whenever you take the approach of expressing the algorithm on the whole large data, you will either end up with inefficient cache usage or a solution resembling Intel Array Building Blocks. (I'd be happy to learn of another — better — solution, of course.)
Therefore loops still remain the state of the art for cache efficient processing of large data sets. The loop count can be reduced by executing the loop body simultaneously on SIMD vectors. (Additionally the loop can be divided onto multiple cores.)
The main portability concern stems from different SIMD register widths for different targets. This is a real problem mainly when SIMD types are used in I/O. But this is comparable to differing endianness. It simply requires software to use a portable interchange format (e.g. SoA or AoS of scalar types).
Typically the compiler is told the target microarchitecture via flags. Obviously this will create code that is not guaranteed to run on older/incompatible CPUs. An implementation might decide to compile the code for several targets, though, and select the best one at runtime. But it is also possible to achieve runtime selection of the target microarchitecture without help from the compiler (e.g. Krita uses Vc this way).
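As a sketch of the latter (not part of this proposal): the dispatch can be done with an ordinary runtime check, here using the GCC/Clang builtin __builtin_cpu_supports; transform_avx and transform_sse2 are assumed to be the same code compiled in separate translation units with the respective target flags:

    void transform_sse2(float *data, size_t n); // compiled with -msse2
    void transform_avx (float *data, size_t n); // compiled with -mavx

    void transform(float *data, size_t n) {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx"))
            transform_avx(data, n);
        else
            transform_sse2(data, n);
    }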
This is a pure library proposal. It does depend on implementation-specific language extensions, though (e.g. SIMD intrinsics/builtins). The proposal builds upon the experience from the Vc library [1, 2].
(One issue that would benefit from a core/library solution is the alignment of over-aligned types in dynamic allocation (new, std::allocator); see the operator new / operator delete overloads in the class synopsis below.)
Provide at least the following new types:
int16_v
int32_v
uint16_v
uint32_v
float_v
double_v
int8_v
uint8_v
int64_v
uint64_v
Each class has a single data member, an implementation-specific object to access the SIMD register (this could be __m256 with AVX intrinsics). If the type is not supported for SIMD on the target platform, the data member will be a single scalar value. For example, double_v has one double member for ARM NEON. The sizes of these SIMD types therefore depend on the natural size suggested by the architecture of the execution environment. This is similar to § 3.9.1 [basic.fundamental] p2: “Plain ints have the natural size suggested by the architecture of the execution environment”. These types should be instantiations of a SIMD vector class template (e.g. typedef simd_vector<float> float_v).
This makes generic code slightly easier to create, without the need for SFINAE:
template<typename T> void someFunction(simd_vector<T> v);
// vs.
template<typename V>
typename std::enable_if<is_simd_vector<V>::value, void>::type someFunction(V v);
The SIMD types all need to be inside a namespace whose name depends on the ABI/target. For instance the symbols for SSE and AVX SIMD types must be different so that incompatible object files/libraries fail to link.
The simd_vector<T> class:

inline namespace /*ABI/target-dependent*/ {
template<typename T> class simd_vector {
    typedef implementation-defined simd_type;
    simd_type data;
public:
    typedef T value_type;

    // the number of values in the SIMD vector
    static constexpr size_t size = sizeof(simd_type) / sizeof(value_type);

    // In Vc, operator new / delete are overloaded to work around the fact that new
    // ignores the alignof of the allocated type. If this does not get solved in the
    // standard, the following will be needed:
    void *operator new(size_t size);
    void *operator new(size_t, void *p) { return p; }
    void *operator new[](size_t size);
    void *operator new[](size_t, void *p) { return p; }
    void operator delete(void *ptr, size_t);
    void operator delete(void *, void *) {}
    void operator delete[](void *ptr, size_t);
    void operator delete[](void *, void *) {}

    // init to zero
    simd_vector();

    // copy
    simd_vector(const simd_vector &);
    simd_vector &operator=(const simd_vector &);

    // implicit conversion from compatible simd_vector<U>
    template<typename U>
    simd_vector(simd_vector<U>,
                typename enable_if<is_integral<value_type>::value && is_integral<U>::value &&
                                   size == simd_vector<U>::size, void *>::type = nullptr);

    // static_cast from vectors of possibly (depending on target) different size
    // (dropping values or filling with 0 if the size is not equal)
    template<typename U>
    explicit simd_vector(simd_vector<U>,
                         typename enable_if<!(is_integral<value_type>::value && is_integral<U>::value &&
                                              size == simd_vector<U>::size), void *>::type = nullptr);

    // broadcast with implicit conversions
    simd_vector(value_type);

    // load member functions
    void load(const value_type *mem);
    template<typename U> void load(const U *mem);

    // load ctors (optional)
    explicit simd_vector(const value_type *mem);
    template<typename U> explicit simd_vector(const U *mem);

    // store functions
    void store(value_type *mem) const;
    template<typename U> void store(U *mem) const;

    // unary operators
    simd_vector &operator++();
    simd_vector operator++(int);
    simd_vector &operator--();
    simd_vector operator--(int);
    simd_vector operator~() const;
    simd_vector operator+() const;
    simd_vector<typename negate_type<T>::type> operator-() const;

    // assignment operators
    simd_vector &operator+=(simd_vector<T> x);
    simd_vector &operator-=(simd_vector<T> x);
    simd_vector &operator*=(simd_vector<T> x);
    simd_vector &operator/=(simd_vector<T> x);
    simd_vector &operator%=(simd_vector<T> x);
    simd_vector &operator&=(simd_vector<T> x);
    simd_vector &operator|=(simd_vector<T> x);
    simd_vector &operator^=(simd_vector<T> x);
    simd_vector &operator<<=(simd_vector<T> x);
    simd_vector &operator>>=(simd_vector<T> x);
    simd_vector &operator<<=(int x);
    simd_vector &operator>>=(int x);

    // scalar entries access
    implementation-defined &operator[](size_t index);
    value_type operator[](size_t index) const;
};
typedef simd_vector<   float>  float_v;
typedef simd_vector<  double> double_v;
typedef simd_vector< int64_t>  int64_v;
typedef simd_vector<uint64_t> uint64_v;
typedef simd_vector< int32_t>  int32_v;
typedef simd_vector<uint32_t> uint32_v;
typedef simd_vector< int16_t>  int16_v;
typedef simd_vector<uint16_t> uint16_v;
typedef simd_vector<  int8_t>   int8_v;
typedef simd_vector< uint8_t>  uint8_v;
} // namespace
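To illustrate how generic code can be written against this interface, here is a minimal sketch (not part of the proposal) that assumes count is a multiple of simd_vector<T>::size and uses only members from the synopsis above:

    template<typename T>
    T sum(const T *mem, size_t count) {
        simd_vector<T> acc; // zero-initialized
        for (size_t i = 0; i < count; i += simd_vector<T>::size) {
            simd_vector<T> tmp;
            tmp.load(&mem[i]); // load size consecutive values
            acc += tmp;
        }
        T result = 0;
        for (size_t i = 0; i < simd_vector<T>::size; ++i) {
            result += acc[i];  // horizontal reduction via scalar access
        }
        return result;
    }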
Implicit conversion between vectors and implicit conversion from scalars to vectors follows the rules of conversion between scalar types as closely as possible. The rules are:
If U implicitly converts to T, then implicit conversion from U to simd_vector<T> also works.
If simd_vector<U>::size == simd_vector<T>::size holds portably (i.e. on all targets), then implicit conversion from simd_vector<U> to simd_vector<T> also works.
In Vc the following guarantees are made:
float_v::size == int32_v::size == uint32_v::size
int64_v::size == uint64_v::size
int32_v::size == uint32_v::size
int16_v::size == uint16_v::size
int8_v::size == uint8_v::size
No further size relations between the different simd_vector<T> types are guaranteed.
Explicit conversions between vectors of possibly different size are allowed and will either drop the last values or fill the remaining values in the destination with zeros. (Together with vector shift/rotate functions and a bit of care this allows portable code that casts between int and float vectors.)
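A small sketch of what these rules mean in practice (assuming the typedefs from the synopsis above):

    void conversions(int32_v i, float_v f) {
        uint32_v u = i;                       // implicit: both integral, sizes guaranteed equal
        int32_v  b = 1;                       // implicit broadcast from a scalar
        int32_v  c = static_cast<int32_v>(f); // explicit: sizes may differ between targets
        // float_v d = i;                     // error: no implicit conversion, use static_cast
    }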
The most important portable way to efficiently fill a complete SIMD vector is via loads.
A load requires a pointer to at least size consecutive values in memory.
Whether load constructors should be specified needs to be discussed.
Their use does not explicitly state the operation:
float *data = ...;
float_v a(data); // it is not obvious what this line of code does
float_v b;
b.load(data);    // this is a lot clearer

On the other hand there should be a way to fill a vector on construction (e.g. to ease the use of const).
As for loads, the most important portable way to efficiently store the results from a SIMD vector is via store functions.
The overloads of load/store with value_type* exist to support arguments that have an implicit conversion to value_type*.
Loads and stores can optionally do a conversion from/to memory of a different type (e.g. float_v fun(short *mem) { return float_v(mem); }). This feature is important because it allows the type used in memory to differ from the type used for computation (e.g. float for storage and double_v for calculation).
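A sketch of both directions (float in memory, double_v for computation, using only the load/store members from the synopsis):

    double_v load_square(const float *mem) {
        double_v v;
        v.load(mem);   // converting load: reads double_v::size floats, converts to double
        v *= v;
        return v;
    }

    void store_result(double_v v, float *mem) {
        v.store(mem);  // converting store: converts to float, writes double_v::size values
    }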
All arithmetic, logical, and bit-wise operators are overloaded for combinations of different simd_vector<T> and builtin scalar types.
template<typename L, typename R>
typename determine_return_type<L, R>::type operator+(L &&x, R &&y) {
    typedef typename determine_return_type<L, R>::type V;
    return internal_add_function(V(std::forward<L>(x)), V(std::forward<R>(y)));
}

With such non-member operators it is possible to get correct automatic conversion. The determine_return_type type decides which type combinations are allowed and what conversions are done.
Consider:
void f(int32_v x, unsigned int y) {
    auto z = x * y;
    ...

We expect the type of z to be uint32_v, analogous to 1 * 1u yielding an unsigned int.
On the other hand the following example must always fail to compile:
void f(int32_v x, double_v y) {
    auto z = x * y;
    ...

While 1 * 1. is well-defined and yields a double, the issue with SIMD vectors is that their number of entries is not guaranteed to match. (Consider x holding four values and y holding two values. The semantics of a multiplication of x with y are unclear. It could mean [x0 * y0, x1 * y0, x2 * y1, x3 * y1], or [x0 * y0, x1 * y1, x2, x3], or [x0 * y0, x1 * y1, x2 * y0, x3 * y1], or [x0 * y0, x1 * y1], or [x2 * y0, x3 * y1]. Or, for a different target, it could even simply mean [x0 * y0].)
Via SFINAE the operators can be defined such that only the following type combinations work:
simd_vector<T> × From, if is_convertible<From, simd_vector<T>> && !is_narrowing_float_conversion<From, T>
A follow-up proposal will define the determine_return_type class. This proposal could also discuss operator implementations for better error reporting (via static_assert) if a simd_vector is combined with an incompatible second operand.
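A possible sketch of such error reporting (the trait has_determined_return_type is hypothetical, standing for "determine_return_type<L, R> has a valid result"):

    template<typename L, typename R>
    typename std::enable_if<!has_determined_return_type<L, R>::value, void>::type
    operator+(L &&, R &&) {
        static_assert(has_determined_return_type<L, R>::value,
                      "invalid combination of operands for simd_vector operator+ "
                      "(e.g. the number of vector entries may differ between the operand types)");
    }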
All the functions in <cmath> can/should be overloaded to accept SIMD vectors as input.
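For example (a sketch of possible overload declarations; the exact form would be the subject of a follow-up proposal):

    template<typename T> simd_vector<T> sqrt (simd_vector<T> x);
    template<typename T> simd_vector<T> abs  (simd_vector<T> x);
    template<typename T> simd_vector<T> atan2(simd_vector<T> y, simd_vector<T> x);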
With the simd_vector<T> class it is possible to explicitly vectorize simple loops:
std::array<float, 1024> x_data, y_data, r_data, phi_data;
// fill x and y with random values from -1 to 1
// The following loop converts the x,y coordinates into r,phi polar coordinates.
for (int i = 0; i < 1024; i += float_v::size) {
    const float_v x(&x_data[i]);
    const float_v y(&y_data[i]);
    const float_v r = sqrt(x * x + y * y);
    const float_v phi = atan2(y, x) * 57.29578f;
    r.store(&r_data[i]);
    phi.store(&phi_data[i]);
}

This example already shows a few issues that will be solved in follow-up proposals.
If this example is compiled for an x86 target with SSE2 instructions the loop body calculates four results in parallel.
If it is compiled for an x86 target with AVX instructions the loop body calculates eight results in parallel.
If the target does not support float SIMD vectors the loop body calculates just one result.
This is a (slightly reduced) Kalman-Filter example from CERN/RHIC track reconstruction. It is used to fit the track parameters of several particles in parallel.
struct Covariance {
    float_v C00,
            C10, C11,
            C20, C21, C22,
            C30, C31, C32, C33,
            C40, C41, C42, C43, C44;
};
struct Track {
    float_v x, y, tx, ty, qp, z;
    float_v chi2;
    Covariance C;
    float_v NDF;
    ...
};
struct HitInfo {
    float_v cos_phi, sin_phi, sigma2, sigma216;
};

void Filter(Track &track, const HitInfo &info, float_v m) {
    Covariance &C = track.C;
    const float_v residual = info.cos_phi * track.x + info.sin_phi * track.y - m; // ζ = Hr - m
    const float_v F0 = info.cos_phi * C.C00 + info.sin_phi * C.C10;               // CHᵀ
    const float_v F1 = info.cos_phi * C.C10 + info.sin_phi * C.C11;
    const float_v F2 = info.cos_phi * C.C20 + info.sin_phi * C.C21;
    const float_v F3 = info.cos_phi * C.C30 + info.sin_phi * C.C31;
    const float_v F4 = info.cos_phi * C.C40 + info.sin_phi * C.C41;
    const float_v HCH = F0 * info.cos_phi + F1 * info.sin_phi;                    // HCHᵀ
    const float_v wi = 1.f / (info.sigma2 + HCH);
    const float_v zetawi = residual * wi;                                         // (V + HCHᵀ)⁻¹ ζ
    const float_v K0 = F0 * wi;
    const float_v K1 = F1 * wi;
    const float_v K2 = F2 * wi;
    const float_v K3 = F3 * wi;
    const float_v K4 = F4 * wi;
    track. x -= F0 * zetawi;                                                      // r -= CHᵀ (V + HCHᵀ)⁻¹ ζ
    track. y -= F1 * zetawi;
    track.tx -= F2 * zetawi;
    track.ty -= F3 * zetawi;
    track.qp -= F4 * zetawi;
    C.C00 -= K0 * F0;                                                             // C -= CHᵀ (V + HCHᵀ)⁻¹ HC
    C.C10 -= K1 * F0; C.C11 -= K1 * F1;
    C.C20 -= K2 * F0; C.C21 -= K2 * F1; C.C22 -= K2 * F2;
    C.C30 -= K3 * F0; C.C31 -= K3 * F1; C.C32 -= K3 * F2; C.C33 -= K3 * F3;
    C.C40 -= K4 * F0; C.C41 -= K4 * F1; C.C42 -= K4 * F2; C.C43 -= K4 * F3; C.C44 -= K4 * F4;
    track.chi2 += residual * zetawi;                                              // χ² += ζ (V + HCHᵀ)⁻¹ ζ
    track.NDF += 1;
}

In the Filter function a new measurement (m) is added to the Track state.
In the data structures used here, the objects each contain the data of float_v::size tracks.
So instead of working with one track at a time, the code explicitly states that multiple tracks can be filtered together and that their data is stored interleaved in memory.
(With tn denoting the n-th track, the memory layout of a Track object looks like this with e.g. SSE:
[ x_t0   x_t1   x_t2   x_t3
  y_t0   y_t1   y_t2   y_t3
 tx_t0   ... ]
With AVX:
[ x_t0   x_t1   x_t2   x_t3   x_t4   x_t5   x_t6   x_t7
  y_t0   y_t1   y_t2   y_t3   y_t4   ... ])
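A hypothetical usage sketch (nextHit and nextMeasurement are assumed helpers providing the hit information and measurement for each group of float_v::size tracks):

    void filterAll(Track *tracks, size_t trackCount) {
        // trackCount scalar tracks are stored as trackCount / float_v::size Track objects
        for (size_t i = 0; i < trackCount / float_v::size; ++i) {
            Filter(tracks[i], nextHit(i), nextMeasurement(i));
        }
    }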