1. Motivation
ISO/IEC 19570:2018 (1) introduced data-parallel types to the C++ Extensions for
Parallelism TS. [P1928R8] proposes to make them part of the C++ IS. In the
current proposal individual elements must be of a type which satisfies the
constraints set out by `std::simd` (e.g., arithmetic types, complex-value
types). In this document we describe how `std::simd` could allow user-defined
element types too, with automatic generation of element-wise operators for
user-defined types which define those operators, and with the option of
dispatching operations to customization functions which allow the user to provide more
efficient implementations when automatic generation does less well.
Being able to have user-defined types in a `std::simd` value is desirable
because it allows us to build on top of the standard features of `std::simd` to
support SIMD programming in specialised problem domains, such as signal or media
processing, which might have custom data types (e.g., saturating or fixed-point
types). The idea is that `std::simd` will be used to provide generic SIMD
capabilities for performing loads, stores, masking, reductions, gathers,
scatters, and much more, and then a set of customization points is provided to
allow the fundamental arithmetic operators to be overloaded where needed.
As a concrete example, consider that we might have a fixed-point data type for
signal processing called `fixed_point_16s8`. Such a fixed-point data type allows
a fractional (non-integer) value to be stored with only a fixed number of digits
to represent its fractional part. An instruction set which targets digital
signal processing is likely to have instructions which accelerate the fundamental
arithmetic operations for these types, and our aim is to allow `std::simd` values of
this underlying fixed-point element type to be created and used. To begin with,
a `simd<fixed_point_16s8>` is represented or stored using individual blocks of bits
representing each element of the user-defined type within a vector register:
Note that a selected element, highlighted with a red rectangle, represents an individual
element value. The colours and values in each box represent
specific bit patterns. Where two elements have the same bit-pattern/colour, they
represent the same element value. An operation which moves elements within or
across `std::simd` objects, or to and from memory, will copy the bit patterns. The
elements don’t change value because they move. In the diagram below a permute
has been used to extract the even elements of our original example above.
Note that an element representing a specific value in this example (e.g., the
dark-blue value 12) continues to represent the value 12 regardless of where it
appears within the `std::simd` object. However, there are also cases where the bit
pattern of individual elements does need to be interpreted as a specific value.
For example, if we wish to add two `simd<fixed_point_16s8>` values together then
we need to define how to efficiently handle the addition of specific bit
patterns to form a new bit pattern. In the example below we are adding two
`simd<fixed_point_16s8>` values and expecting to get a result which is created
by calling some underlying target-specific function:
In this example the `std::simd` objects themselves have no knowledge of this
since the elements are of a custom type, so we need to encode that knowledge
into a user-defined customisation point. There will be a customisation point
in those places where knowledge of what the bit pattern in a
user-defined `std::simd` element actually means is required. Common examples of such points are
arithmetic operations or relational comparators. In the example above, a
customisation point for `operator+` will be provided to map onto a
suitable hardware instruction to perform a vector fixed-point addition.
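The difference between moving bit patterns and interpreting them can be sketched in a few lines of standard C++. The following is only an illustration under stated assumptions, not the proposed API: `std::array` stands in for a SIMD register, and the `extract_even` helper and the one-field layout of `fixed_point_16s8` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical element type: only its 16-bit pattern is stored.
struct fixed_point_16s8 {
    std::int16_t data;
};

// Copy-style operation: gather the even-indexed elements by copying bytes.
// Nothing here interprets the bit patterns, so it works for any
// trivially copyable element type.
template <typename T, std::size_t N>
std::array<T, N / 2> extract_even(const std::array<T, N>& v) {
    static_assert(std::is_trivially_copyable_v<T>);
    static_assert(N % 2 == 0);
    std::array<T, N / 2> out{};
    for (std::size_t i = 0; i < N / 2; ++i)
        std::memcpy(&out[i], &v[2 * i], sizeof(T));
    return out;
}
```

Because `extract_even` never looks inside an element, no customisation point is needed for it; only operations such as addition need to know what the bits mean.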
Custom types will not always support every possible operator that is exposed by
`std::simd`. For example, perhaps a `fixed_point_16s8` doesn’t allow division or
modulus, in which case `operator/` and `operator%` must be
removed from the overload set entirely.
Putting these mechanisms for defining custom element types together allows
us to write generic functions which can then be invoked with both built-in and
user-defined types equally easily, using the rich API of `std::simd`, even when
the user-defined types are using target-specific customised functions:
    // Compute the dot-product of two simd values.
    template <typename T, typename ABI>
    auto dotprod(const basic_simd<T, ABI>& lhs, const basic_simd<T, ABI>& rhs) {
        // The reduction moves are handled bit-wise, while the multiply and addition are delegated
        // to target-specific customisation points. The high-level code is unaltered.
        return reduce(lhs * rhs, std::plus<>{});
    }
    ...
    float dfp = dotprod(simdFloat0, simdFloat1);
    fixed_point_16s8 dfxp = dotprod(simdFixed0, simdFixed1);
An extended example of some custom element types is given toward the end of this paper, along with our experiences with using them.
This proposed extension of `std::simd` to support custom types applies only to
types which can be vectorised as atomic units. That is, a `simd<T>` for a custom
type `T` has storage which is the same as an array of those types (i.e., this
might be called an array-of-structures). This is in contrast to another way of
representing a parallel collection of custom element types in which the
structure of the custom element is broken down into individual pieces (i.e., a
structure-of-arrays). For example, given
the
structure-of-array style of customisation would treat it as though it were
. While that could be a valuable feature for future
consideration, it is different to what is being proposed in this paper.
In addition to allowing customized element types in `std::simd`, a side-effect
of this paper is to set out a way of thinking about the meaning of different
operations which makes the division of responsibility of different `std::simd`
APIs clear. This makes it easier to discuss the related topic of
adding new C++ builtin element types such as `std::byte`, and scoped/unscoped
enumerations. For example, as we shall describe later, there are specific basis
functions which define the fundamental arithmetic behaviour of `std::simd`
types, and knowing what those basis functions should be makes it easy to reason
about how to support `std::byte` and `enum` types. Later in this document we
shall introduce the necessary changes to allow `std::byte` and `enum` to
be easily incorporated.
Our intent with user-defined types is that the underlying type behaves like a
value type as defined in Elements of Programming (e.g., integers represented as
two’s complement values, or rational numbers represented as a concatenation of
two 32-bit sequences, interpreted as integer numerator and denominator). Only
arithmetic operations can be applied to value-types. `std::simd` does not
provide overloads for non-value-type operators, such as the member-related
operators like `operator->` and `operator->*`. We do not intend to extend `std::simd`
to include these non-value operators.
Since `std::simd` is built on top of native hardware support, there are certain
limitations in the hardware that prevent arbitrary types being representable in
a `std::simd`. It will be necessary to impose some restrictions on the types of
elements to make them implementable.
Note that a user-defined type, such as our example `fixed_point_16s8`, might
provide its own overloaded operators for handling the different types of
arithmetic operation that could occur. Ideally we want the operators from that
scalar type to be mapped over to work for `simd<fixed_point_16s8>` as well. In
effect, every `simd` operator for that type would perform the equivalent
element-wise operator on the underlying values. For example, suppose we have the
following partially-implemented scalar `fixed_point_16s8` type:
    struct fixed_point_16s8 {
        // Declared as friends: a two-operand binary operator cannot be a
        // non-static member function.
        friend fixed_point_16s8 operator+(fixed_point_16s8 lhs, fixed_point_16s8 rhs) {
            return __intrin_fixed_add(lhs, rhs);
        }
        friend fixed_point_16s8 operator-(fixed_point_16s8 lhs, fixed_point_16s8 rhs) {
            return __intrin_fixed_sub(lhs, rhs);
        }

        int do_special_op() const;

        std::int16_t data;
    };
If the user instantiated `simd<fixed_point_16s8>` and performed addition between
two values of this parallel type then this should give the same result as
invoking the operator above on each element in turn. The automatic extension of
a scalar operator to a simd operator is the first level of support that `std::simd`
should provide. Unfortunately, there may be cases where the compiler does a poor
job of auto-vectorising such code; iterating over each element in
turn does not automatically make good simd code. For these cases we also need a
mechanism which allows the automatic operator inference to be overridden and
replaced with a more efficient method provided by the user. We will use the
idea of customisation points, as used in other parts of C++, to enable this. The
user will be able to provide a special function for specific simd operators to
provide efficient implementations for those cases where the compiler doesn’t
find this for itself.
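The first level of support, lifting a scalar operator to an element-wise one, can be sketched as follows. This is a minimal illustration under assumptions: `std::array` stands in for the SIMD value, `element_wise` and `add` are hypothetical helper names, and the scalar addition is a plain integer add on the raw bits rather than a target intrinsic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

struct fixed_point_16s8 {
    std::int16_t data;  // raw bits; 8 fractional bits assumed for illustration
    friend fixed_point_16s8 operator+(fixed_point_16s8 a, fixed_point_16s8 b) {
        return {static_cast<std::int16_t>(a.data + b.data)};
    }
};

// Apply a scalar binary operation to each pair of elements in turn.
template <typename T, std::size_t N, typename Op>
std::array<T, N> element_wise(const std::array<T, N>& a,
                              const std::array<T, N>& b, Op op) {
    std::array<T, N> r{};
    for (std::size_t i = 0; i < N; ++i)
        r[i] = op(a[i], b[i]);
    return r;
}

// The inferred "simd" addition: the scalar operator+ applied element-wise.
template <std::size_t N>
std::array<fixed_point_16s8, N> add(const std::array<fixed_point_16s8, N>& a,
                                    const std::array<fixed_point_16s8, N>& b) {
    return element_wise(a, b, std::plus<>{});
}
```

Whether a loop like this is turned into good vector code is up to the compiler, which is why an explicit override is also wanted.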
In the example above we also had a class method. At the moment no mechanism exists in
C++ to allow us to automatically add the equivalent method to
`simd<fixed_point_16s8>`, so we cannot create a complete simd-generic extension of
the user-defined type yet. In future, if suitable reflection capabilities are
added, this extension may become possible, but this is currently out of scope for this
paper.
2. Understanding customization opportunities, requirements, and restrictions
The first step in understanding how `std::simd` can be extended to new element
types is to review what operations are provided by `std::simd`, and how those
operations must adapt to operating on custom user-defined elements.
2.1. Type restrictions
The current proposal for `std::simd` in [P1928R8] only allows selected types
to be stored in a `std::simd` but our proposal for user-defined types makes it
theoretically possible to store any other C++ type. It may not make sense to be
able to store some of those other types in a `std::simd`. For example, they may have
side-effects that mean they don’t work as expected when operated on in parallel,
or they rely on features that don’t translate well to a SIMD instruction set. We
should seek to restrict the valid element types to avoid the worst issues that
might arise.
Although [P1928R8] makes no mention of it, there is an implication that the
elements in a `std::simd` are trivially copyable. For operations which move
elements around (broadcast, permutation, gather, scatter, and so on) it is
assumed that when the bits move location within a `std::simd` object they will
continue to represent their original value. `std::simd` currently restricts element
types to be floating-point, integral, or complex-valued (with [P2663R4]), all
of which are trivially copyable.
Ultimately the success of `std::simd` is tied to how well the underlying
hardware can support the element type. All known hardware supports only
elements whose sizes are a power of 2, so we propose to require
element types to respect this. Also, most hardware has a limit on the size of
individual elements. In the current proposal of `std::simd`, the largest
possible element type is the 128-bit `std::complex<double>`, so we will set this
as the upper limit of user-defined elements. Therefore, any user-defined type
for `std::simd` must be 1, 2, 4, 8 or 16 bytes in size.
`std::simd` was designed to work on arithmetic value-like types, in common with the
hardware instruction sets to which it ultimately gives access.
Restricting `std::simd` to only such types is difficult as C++ has no way of querying
a type to determine if it is a product or sum type. Instead, we will only extend
user-defined operators to those operators which `std::simd` already supports
(e.g., `operator+`). We will not provide additional operators to support
member-like access (e.g., `operator->`) or dereferencing (e.g., unary
`operator*`), since the presence of these operators implies the type isn’t suitable
for storing in a `std::simd` anyway.
We propose that certain types should always be banned as
elements
including
, and any sort of pointer element. In the proposal we
will also explore the use of an opt-out mechanism which can be used to prevent
certain types from ever being contained in a `std::simd`.
2.2. Customisation point classifications
While many operations within `std::simd` will work on trivially copyable custom
types there are also places where `std::simd` does need to interpret the meaning
of the bits, and it is those that need to be customization points. Each function can be put into one of the following categories:
- Basis: A basis function is one that must be provided as a customization point to allow the underlying element type to be used. An example would be addition; if addition is not provided as a customization point for a user-defined type then `std::simd` values of that type cannot be added together.
- Custom: A customization function is one that can be implemented generically but which can also be customized to provide a more efficient implementation if one exists. An example of a Custom function would be negate (`operator-`) which could have a default implementation which subtracts from zero, or could be customized if the type provides a faster alternative (e.g., sign-bit flip for a floating-point-like type).
- Copy: A copy function uses the trivially copyable nature of the underlying type and allows bits to be moved from one place to another. The `permute` function is a good example of a Copy function since a `std::simd` of any type can move its elements around within a SIMD value without needing to know what the bits represent.
- Algorithm: An algorithm function uses other functions to implement some feature. If the algorithm relies on a Basis or Custom function which is not provided by the user-defined type then the algorithm function is removed from the overload set. An algorithm function does not provide a customization point.
The following table lists the key functions in `std::simd`, and what category
they fall into. The table allows us to reason about what functions in
`std::simd` will just work as they are currently defined, and to separate out
those functions which must have customisation points defined in order to allow
their behaviour to be changed for custom user-defined types.
Function | Type | Notes |
---|---|---|
| ||
| n/a | Virtually every function in will work on user-defined types, with the exception of those listed in the next row. The mask is only dependent on knowing the number of bits in the type, and not on the interpretation of those bits.
|
| Algorithm | Convert and broadcast a 0, 1 or -1, and perform a using the mask to choose the appropriate value. Only provided if the convert is available.
|
Constructors | ||
| Copy | Broadcast a copy of the bits that represent the scalar source object to every element. |
| Copy/custom | When U is the same as T, direct copy the bits into place. When U is different, we need a customization point to convert the user-defined elements. If no customization point is available, no conversions will be allowed but copying from other of the same type will be permitted.
|
| Copy | Each invocation of the generator builds an individual scalar value of the element type which is bitwise copied into the respective element. |
| Copy/Algorithm | When is the same, copy the bits. When is different and a customization point exists for the conversion, create the as the value iterator type first, and then copy the bits. No customization point will be allowed since it is unlikely that it brings any performance benefit, although this decision can be revisited.
|
Copy functions | ||
| Copy/Algorithm | When the destination type is the same use a direct bit copy from the into memory. When the destination type is different, convert to a of the destination type and invoke on that. A customization point is not provided since it is highly unlikely that any hardware support is available for copying to a special type. If the destination type is different and there is no conversion customization point, remove the conversion-copy from the overload set. |
| Algorithm | Equivalent to calling and performing an assignment.
|
Subscript operators | ||
| Copy | Bitwise copy from element into the scalar output value |
Unary operators | ||
| Custom | If a customization point or builtin-type support is available use that. Otherwise if is available use . Otherwise remove from the overload set. |
| Algorithm | If are available use . Otherwise remove from the overload set. |
| Custom | If a customization point or builtin-type support is available use that. Otherwise if is available return . Otherwise remove from the overload set. |
| Basis | If a customization point or builtin-type support is available use that. Otherwise remove from the overload set. |
Binary operators | ||
| Basis | If a customization point or builtin-type support is available use that. Otherwise remove the from the overload set.
|
Compound assignment operators | ||
| Algorithm | If a customization point or builtin-type support is available for the underlying operation use . Otherwise remove this from the overload set. |
Relational operators | ||
| Basis | If a customization point or builtin-type support is available call that. Otherwise remove this from the overload set. Note that although each element represents its values using a specific copyable bit pattern this doesn’t mean that the same bit pattern represents an equal value (e.g., floating point NaN bit patterns will never be equal). |
| Custom | If a customization point or builtin-type support is available call that. Otherwise if is provided, return the negation of that function. Otherwise remove from the overload set. |
| Basis | If a customization point or builtin-type support is available call that. Otherwise remove from the overload set. Note that a minimal set could be provided (e.g., and since everything else can be built from those).
|
Conditional operator | ||
| Copy | Conditionally copy element bits with no interpretation. |
Permute | ||
/permute-like
| Copy | All permutes (generated or dynamic) move bits from one location to another without interpretation. Related operations like resize, insert, extract work in the same way. |
/
| Copy | All compression and expansion operations move bits from one location to another without interpretation. |
| Copy | Gather the values as though they were a and then use to convert (at no cost) into a of the user defined type.
|
| Algorithm | If the same type, bitwise scatter individual elements to the range using direct bitwise copy. If the destination type is different construct a of the destination type and perform a scatter on that type instead.
|
Reductions | ||
| Algorithm | All reduction operations can be implemented using a sequence of permutes and arithmetic operations. If the desired operation for the reduction step is not available in the overload set then the corresponding reduction is also removed from the overload set. No customization point will be provided for reductions since it is unlikely that custom types will have hardware support for reductions. These can be added later if this is found to be untrue. |
Free functions | ||
| Custom | If the user provides their own ADL overloaded customization point for this function then that will be used. Otherwise if relational operators are available for the type, use those to synthesise this operation (i.e., for ). Otherwise remove from the overload set. |
etc. | Custom | For any other free functions an ADL overload can be provided by the user to handle that specific type. |
2.3. Required customization points
In the table of function classifications above we can discount the Copy
functions from any further thought in this proposal. By limiting `std::simd`
elements to those which are trivially copyable, we can provide any sort of
operation which moves bits around using the `std::simd` implementation itself, with no
special consideration for user-defined types.
Unsurprisingly, the table above shows us that we need customization points for all numeric operations, including:
plus, minus, negate, multiplies, divides, modulus, bit AND, bit OR, bit NOT, bit XOR, equal to, not equal to, greater, less, less or equal, greater or equal, logical AND, logical OR, logical NOT, shift_left, shift_right
Also unsurprisingly, these names are all those of the C++ transparent
template wrappers, with the exception of shift-left and shift-right. The only
other customisation point that would be needed is a conversion function. If a
UDT can be constructed or copied from a different type `U`, then it should also
be possible to construct or copy a `simd<UDT>` from a `simd<U>` by element-wise
application.
Note: As a small aside, it isn’t clear why transparent operators are not provided for shift operations, and perhaps they should be added in for completeness in the future.
For each of these customisation points we have three layers of behaviour:
- Default operator for standard C++ types (e.g., `int`, `float`, `complex<floating_point>`). No customisation is possible for any of the types available in C++.
- Explicit simd customisation. The user provides a specific function which is used to perform that operation on a `simd<UDT>`. The intent is to allow the user to make the operator as efficient as possible for cases where the compiler may not auto-vectorise efficiently.
- Implicit `simd` customisation using the scalar type’s own operator, applied element-wise. Ideally the compiler will auto-vectorise this to generate efficient code.
- Otherwise the operator is removed from the overload set.
For this last point, an example would be a user-defined complex type which might provide addition, subtraction and multiplication, but remove modulus, relational operators, and bitwise operators from the overload set, along with any other operations which depend upon those (e.g., compound modulus, compound bitwise).
Conversions will work in this customisation framework in the same manner as
arithmetic operators; if a conversion is not explicitly defined, or the scalar
type doesn’t support it, then the `simd` type will also not support that
conversion.
3. Creating a customization framework
There are several considerations in a framework for user-defined element types: opt-in/out, storage, unary/binary operators and conversions, and free-functions. In this section we shall look at each of these in turn.
3.1. Opt-in or opt-out
When a `simd<UDT>` is created the aim is for the library to do as much work as
possible to make that type behave reasonably, subject to a few restrictions
(e.g., element size). All bit-copy operations will work on that type, and any operators
defined for the scalar user-defined type will be mapped into the simd space.
In the original draft of this proposal the argument was made for an opt-in
process, whereby the user would have to explicitly arrange for the user-defined
type to be permitted as an element of a `std::simd`. Since that first proposal we
have refined the behaviour of simd to allow it to infer simd operators from
scalar operators, thereby making it possible to create a correctly behaving
`simd<UDT>` with very little effort. With this new approach it seems
unnecessarily onerous to require the user to opt in to something that
works on its own.
Our new proposal is to have an opt-out mechanism, where the user can explicitly
indicate that a specified type is not suited to being an element in a
`std::simd`, even if the type is otherwise legal (e.g., trivially copyable, power-of-2 in size,
no larger than 16 bytes). Such an opt-out mechanism can be used to disable
unsuitable user-defined types as well as other non-vectorisable types in the
standard library, such as
.
3.2. Storage
In order to perform Copy-like operations on `std::simd` elements we need to be
able to inform `std::simd` values of how to store and move the underlying
elements. In [P2964R0] we allowed the user to specify
what the underlying storage should be, but with further thought we have decided
to make the storage unspecified. The `std::simd` implementation is free to choose any
storage type. Customisation points which use that storage will then use
`std::bit_cast` to convert to and from that storage and the user-defined type
as appropriate.
3.3. Unary and binary operator customization points
There are many different ways to implement customization points, including
template specialization, CPO, or
. Which mechanism is most suitable
can be discussed further if necessary but for this proposal we only care
about whether customization should be allowed, not the exact mechanism that will
be used.
All of the operators for `std::simd` are `friend` functions in order to allow ADL.
We must leave these as `friend` function operators, but allow them to defer to a
customization point as required. One possible pseudo-implementation of an individual
operator may be this:
    constexpr friend basic_simd operator+(const basic_simd& lhs, const basic_simd& rhs)
        requires (details::simd_has_custom_binary_plus || details::element_has_plus)
    {
        if constexpr (details::is_standard_simd_type)             // int, float, complex, etc.
            return details::plus(lhs, rhs);
        else if constexpr (details::simd_has_custom_binary_plus)  // user customisation point
            return simd_binary_plus(lhs, rhs);
        else
            return details::element_wise_plus(lhs, rhs);          // Infer from scalar operator
    }
In this example `operator+` is only put in the overload set if the element
type supports addition directly or a suitable customisation point exists.
Internally, the function will then invoke the appropriate
implementation.
We need to specify what the customisation points will be called to allow them to
be discoverable. In the example above we have explicitly named the customisation
function `simd_binary_plus`. This has the advantage that it is very clear and
unambiguous in what it does, but it does introduce potential for high levels of
duplication. This is because every operator will have its own unique
customisation point name.
An alternative is to exploit the transparent templates that already exist in C++ and use them to differentiate between operations. Here is an example of the signature of a customisation point for a user defined type:
    template <typename Abi, typename CustomType, std::invocable<CustomType, CustomType> Fn>
    constexpr auto simd_binary_op(const basic_simd<CustomType, Abi>& lhs,
                                  const basic_simd<CustomType, Abi>& rhs,
                                  Fn op);
This has a unique and distinctive name to mark it as a customisation point, and as
a binary operator it takes in two `basic_simd` inputs. It also takes a
third parameter which specifies what binary operation to perform, chosen from
the list of standard template wrappers. For example, the call site in
`operator+` would look like this:
    return simd_binary_op(lhs, rhs, std::plus<>{});
Similarly, unary operators can be customized using a customization function
called `simd_unary_op` which accepts a unary transparent template wrapper.
The advantage of using this mechanism rather than named functions for every required operator is that it removes the need for many different functions, and allows related operations to be consolidated into a single function. It also allows the transparent operator itself to be invoked directly to perform an operation. For example, suppose we want to define a customisation point for a user defined type that has non-standard behaviour for multiply and divide, but everything else works like a standard arithmetic operator (examples of such types include complex numbers and fixed-point numbers). The following pseudo-implementation captures this behaviour:
    template <typename Abi, typename CustomType, std::invocable<CustomType, CustomType> Fn>
    constexpr auto simd_binary_op(const basic_simd<CustomType, Abi>& lhs,
                                  const basic_simd<CustomType, Abi>& rhs,
                                  Fn op)
    {
        // Special case for some operators
        if constexpr (std::is_same_v<Fn, std::multiplies<>>)
            return doCustomMultiply(lhs, rhs);
        else if constexpr (std::is_same_v<Fn, std::divides<>>)
            return doCustomDivides(lhs, rhs);
        // All other cases defer to an integer instead.
        else
            return op(simd<int>(lhs), simd<int>(rhs));
    }
This is not only less verbose but it also makes it obvious how and why the custom type has to be handled differently to a builtin-type.
Unfortunately shift operators don’t have transparent wrappers, so if we did use this approach we would need one of the following too:
- a specially named customization point (e.g., `simd_shift_left_op`)
- an additional transparent operator added to `std::simd` to allow the existing binary operation to be used (e.g., `std::simd_shift_left<>`)
- a standardised `shift_left` transparent operator (i.e., `std::shift_left<>`)
In the Intel example implementation we have used the second of these. Having a different name and mechanism for shifts introduces extra complexity and non-uniformity. We hope that a transparent operator wrapper for shift might be added in future, in which case it will also be easier to transition to using that if we provide a local alternative to begin with.
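A transparent wrapper for shifts can be written in the same style as `std::plus<>`. The sketch below is not taken from the Intel implementation; the names `shift_left_sketch` and `shift_right_sketch` are hypothetical and are used here only to show the shape such a wrapper might take.

```cpp
#include <cassert>
#include <utility>

// Hypothetical transparent wrapper for operator<<, modelled on std::plus<>.
struct shift_left_sketch {
    template <typename T, typename U>
    constexpr auto operator()(T&& value, U&& count) const
        -> decltype(std::forward<T>(value) << std::forward<U>(count)) {
        return std::forward<T>(value) << std::forward<U>(count);
    }
    using is_transparent = void;
};

// Hypothetical transparent wrapper for operator>>.
struct shift_right_sketch {
    template <typename T, typename U>
    constexpr auto operator()(T&& value, U&& count) const
        -> decltype(std::forward<T>(value) >> std::forward<U>(count)) {
        return std::forward<T>(value) >> std::forward<U>(count);
    }
    using is_transparent = void;
};
```

Because the return type is computed with `decltype`, the call operator drops out of overload resolution for types without a shift operator, which lets a binary customisation function treat shifts like any other tagged operation.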
3.4. Conversion customization
Conversions behave much like the customization points for arithmetic operators. Like the other customisation operators conversions try three different strategies:
- If a customisation for `simd<UDT>` conversion exists, use that
- Otherwise if scalar conversion exists, invoke that element-wise
- Otherwise remove conversion capabilities
These conversion rules will be used wherever needed with
, not just within
the main constructor. For example,
will load data from memory into a
and then convert it to the desired output type.
3.5. Overloads for free-functions
Anything outside `std::simd` itself can be freely overloaded for the custom type. For
example, `abs` could be provided as follows:
    template <typename Abi>
    constexpr auto abs(const basic_simd<fixed_point_16s8, Abi>& v) {
        return /* special-abs-impl */;
    }
No `std::simd`-specific customization points are required for any of the other functions as overloads will suffice.
4. Extending support to enum and std::byte
It is useful to be able to create a `std::simd` of enumerations and `std::byte`, and
the mechanisms defined in this proposal make this trivial to implement. However,
we would also define the following free-functions to mirror the support
available with these standard types.
    template <class Enum, typename Abi>
    constexpr std::basic_simd<std::underlying_type_t<Enum>, Abi>
    to_underlying(std::basic_simd<Enum, Abi> se) noexcept;

    template <class IntegerType, typename Abi>
    constexpr std::basic_simd<IntegerType, Abi>
    to_integer(std::basic_simd<std::byte, Abi> b) noexcept;
5. Implementation Experience
In Intel’s implementation of `std::simd`, the customization points described in
this proposal have been implemented so that we can use instantiations of
`std::simd` which use signal processing data types such as fixed-point or
saturating integrals. In the following example we show code from our
implementation which has been used to create a `std::simd` of a user defined
saturating data type. We start by showing how the compiler does a reasonable job
of inferring the required operators, but that certain operators prove too
difficult to do well, at which point we can provide customisation points to
smooth things out. We expect that a similar process will be used when other
user-defined types are implemented.
Consider what might be needed to make a saturating data type. To begin with, we might already have a 16-bit scalar saturating data type:
    struct saturating_int16 {
        saturating_int16(int v) : data(v) {}
        int16_t data;

        // Addition
        friend saturating_int16 operator+(saturating_int16 lhs, saturating_int16 rhs) {
            auto r = int32_t(lhs.data) + int32_t(rhs.data);
            return int16_t(std::min<int32_t>(std::max<int32_t>(r, -32768), 32767));
        }

        friend bool operator>(saturating_int16 lhs, saturating_int16 rhs) {
            return lhs.data > rhs.data;
        }

        // Other operators also defined, but omitted for brevity.
    };
Let us begin with simple permute operations which illustrate that storage and bit-copy operations will work as expected:
[Table omitted: C++ code alongside the generated assembly.]
Our original type defined addition, so we should be able to automatically call addition and its derivatives for the corresponding user-defined type:
[Table omitted: C++ code alongside the generated assembly.]
Note that the compiler (Intel oneAPI 2024.0) has done an excellent job, even generating a real saturation instruction. Unfortunately there is a small issue with the quality of the generated code, which we shall return to shortly.
Next, we need to be able to compare saturated values. Our original class was able to do this using its own operator, allowing us to unlock comparisons including reduction comparisons.
[Table omitted: C++ code alongside the generated assembly.]
Although the compiler has done a good job of the examples above, there were unfortunately a few places where it didn’t do so well. For example:
[Table omitted: C++ code alongside the generated assembly.]
For an unknown reason the compiler has switched to using element-by-element
application of the scalar operation here, which is considerably slower. It is
likely that we will fix the compiler to correct whatever mistake it is making
here, but for now we can use the customisation point mechanism to aid the
compiler in doing a better job. We want to change the behavior of
so
that addition is handled explicitly using the known intrinsics. We do this by
defining our customisation function. Here is a very simple implementation which
runs on an Intel AVX2 machine. It could be extended to work on all Intel
instruction sets (e.g., AVX-512) but that is outside the scope of this proposal.
    template <typename Abi>
    constexpr auto simd_binary_op(const xvec::basic_simd<saturating_int16, Abi>& lhs,
                                  const xvec::basic_simd<saturating_int16, Abi>& rhs,
                                  std::plus<>)
    {
        constexpr int numElements = xvec::basic_simd<saturating_int16, Abi>::size;
        if constexpr (numElements <= 8)
            return basic_simd<saturating_int16, Abi>(
                _mm_adds_epi16(lhs.to_register(), rhs.to_register()));
        else
            return basic_simd<saturating_int16, Abi>(
                _mm256_adds_epi16(lhs.to_register(), rhs.to_register()));
    }
This now generates code like this:
[Table omitted: C++ code alongside the generated assembly.]
This is acceptably better code.
5.1. Summary of implementation experience
In this section we have given a simple example in which we started with a scalar
user-defined type and automatically inferred the `std::simd` versions of its
operators, allowing us to quickly develop code which used `std::simd`. During
inspection of the generated assembly code we found an issue with the quality of
the code which we were able to overcome by using a customisation point to help
guide the compiler to generate the correct code.
The complete finished example could be used in real code, and allows the user to
quickly extend their own type to something that can be made to work with all of
the `std::simd` APIs, with performant code and relatively little
effort.
6. Future work
SG6 suggested that all operators should be able to take different input and
output types to facilitate adding
,
, and other
mixed-type operators. Although this work is incomplete it is still useful to get LEWG
feedback on the direction of the other features.
7. Conclusion
In this proposal we have outlined the basic mechanisms needed to allow
user-defined types to be stored and manipulated by `std::simd` values, and
crucially, to be able to do so without knowledge of the internal implementation
of the `std::simd` library.
8. Acknowledgements
We would like to thank Matthias Kretz for his feedback and his useful contributions to discussions.
9. Revision History
R0 => R1
- Incorporated SG1 and SG6 feedback from 2024 Tokyo meeting.
- Added restrictions on element types (e.g., size).
- Added inferencing as a valid method for constructing `simd` operators from scalar operators.
- Added type conversions.
- Removed opt-in and replaced with opt-out.
- Removed explicit user-defined storage.
- Provided inferencing example.