1. Revision History
R1 => R2
-
Added poll from St Louis '24 meeting.
-
Added wording
-
Added note on
.std :: ranges_to -
Added numerous extra notes, clarifications and examples.
R0 => R1
-
Added description of erroneous behaviour
-
Added examples
2. Motivation
The [P1928R12] proposal outlines a number of constructors and accessors which
can be used to move blocks of memory data in and out of a
object, and
these are supplemented in [P2664R6] with functions for permuting data to and
from memory. However, these functions are potentially unsafe since they rely on
representing memory regions using only an iterator to the start of the memory
region, and relying on the
objects own size to determine the extent. In [P3024R0] it was suggested that
should include memory safe operations
and LEWG asked for an exploration of what such features might look like, what
their cost would be, and what trade-offs might have to be made. This paper
discusses that exploration.
This paper primarily discusses moving data from memory into a
using a
constructor which accepts a range. Exploration of this problem covers all of the
main issues that might arise in safe interactions between memory and a
value. Once a direction has been set we will extend this paper to cover the
movement of data from a
back into memory.
3. Existing functions for memory accesses
Being able to load the contents of a region of memory into a
is common
practice, and essential for many types of algorithm which are written in a
data-parallel style. However, [1928R8] may have some issues related to memory
safety, and in this section we discuss some of the options to avoid or mitigate
these issues.
There are two constructors which allow a memory load, and two related member functions:
simd :: simd ( contiguous_iterator , flags ) simd :: simd ( contiguous_iterator , mask , flags ) simd :: copy_from ( contiguous_iterator , flags ) simd :: copy_from ( contiguous_iterator , mask , flags )
We will only discuss the unmasked constructor in this proposal since the other forms can be readily derived from it. There is one exception which we note in its own section below.
The issue with these memory load constructors and members is that they lack
memory safety. They will attempt to load a block of memory into a
value
without checking the legality of reading from that memory region, potentially
allowing access to data which it should not be able to use, or generating a
hardware exception (e.g,. page fault). In this section we shall discuss possible
ways to improve this behaviour, along with the costs in doing so.
The constructors are not the only place in
which needs to take care of
memory safety. The
and
functions described in [P2664R6] also suffered from similar access issues, but they have been updated
to avoid these issues by replacing single iterators with ranges, and using
runtime-checks. However, gather and scatter instructions are relatively rare
compared to normal loads, and are expensive instructions anyway, so the cost of
adding runtime checks to them has negligible cost. In contrast, a plain memory
load is a work-horse instruction, frequently used, and much more sensitive to
the imposition of the costs of runtime checks, so we have to be more careful in
what we propose for them.
4. Generic guidelines for safe memory accesses
There is one glaring problem with the basic memory access as it is currently
defined in
, which is that it provides only an iterator to the start of
the source:
template & lt ; class It , class ... Flags > constexpr basic_simd ( It first , simd_flags < Flags ... > = {});
The absence of information about the extent of the memory source makes it impossible for the implementation to (optionally) check its preconditions. Fundamentally, the fix for this is to provide information about the extent of the source which could be done in a few different ways, such as:
// Use a pair of iterators to mark the limits constexpr basic_simd ( It begin , It end , simd_flags < Flags ... > = {}); // Iterator and size constexpr basic_simd ( It begin , size_type count , simd_flags < Flags ... > = {}); // Span with fixed size constexpr basic_simd ( span < T , N > , simd_flags < Flags ... > = {}); // Range with dynamic size (possibly unbounded) constexpr basic_simd ( Range r , simd_flags < Flags ... > = {});
Once we have information about the extent of the memory region then we can add either static compile-time checks, or dynamic run-time checks to use that information to determine correctness and safety. We can then act on that information in a number of ways including throwing an exception on error, defaulting to some safe state, or even choosing to ignore the issue entirely when the programmer knows it is safe to do so or prefers speed over safety.
5. Accessing a source of compile-time extent
In the following examples the elements of the destination
will be filled
with the contents of a memory source with compile-time extent:
std :: array < float , 6 > ad ; simd < float , 6 > simdA ( ad ); std :: span < int , 8 > sd ( d . begin (), 8 ); simd < int , 8 > simdB ( sd );
This is the ideal scenario; we have the maximum amount of information possible about the data being moved, and since it is known at compile-time we can enforce whatever our desired behaviour is. In our proposal we think that under these conditions all data from the source should be copied into the destination. There should be no silent loss of data through truncation, nor should there be silent addition of new values. If the programmer desires either of those behaviour they should make that explicit:
std :: array < float , 6 > ad ; simd < float 8 > simdA ( ad ); // No! There is a mismatch in size causing extra data to be added. // Alternative making the resize explicit. simd < float , 6 > simdA ( ad ); auto simdA8 = resize < 8 > ( simdA );
Our intention would be to allow construction from anything which has
compile-time extent including C-array,
,
, and so on. We
will either spell these out in the wording, or find a way to generically
construct from anything which has compile-time constant size found using
,
, and so on. Example implementation:
template < typename Rg > constexpr inline size_t static_range_size = [] -> size_t { using T = remove_cvref_t < Rg > ; if constexpr ( requires { typename integral_constant < size_t , T :: size () > ; }) return T :: size (); else if constexpr ( requires { typename integral_constant < size_t , T :: extent > ; }) return T :: extent ; else if constexpr ( extent_v < T > > 0 ) return extent_v < T > ; else if constexpr ( requires { typename integral_constant < size_t , tuple_size < T >:: value > ; typename ranges :: range_value_t < Rg > ; }) { if constexpr ( tuple_size_v < T > >= 1 && same_as < tuple_element_t < 0 , T > , ranges :: range_value_t < Rg >> ) return tuple_size_v < T > ; } return dynamic_extent ; }();
We propose that the span constructor is explicit. Our basis for this is to note
that [P3116R0] recommends that implicit conversions should only be provided if
the types are fundamentally the same. A
and a
may share
similarities in storage (both contain a set of element values), but a
also allows an extensive range of parallel operations. It isn’t clear cut
whether the types are fundamentally the same and so we prefer to be conservative
and require the constructor to be explicit. This decision can be revisited if
compelling cases are found to make this conversion implicit.
Finally, we can provide deduction guides too, since we know what the size of the
output
should be:
template < contiguous - range - with - static - extent Rg > basic_simd ( Rg && ) -> basic_simd < ranges :: range_value_t < Rg > , deduce - t < ranges :: range_value_t < Rg > , static - range - size < Rg >>> ; template < contiguous - range - with - static - extent Rg , typename ... Flags > basic_simd ( Rg && , simd_flags < Flags ... > ) -> basic_simd < ranges :: range_value_t < Rg > , deduce - t < ranges :: range_value_t < Rg > , static - range - size < Rg >>> ;
Recall that we cannot define the necessary deduction guides for the
alias template. Thus, users that want to use CTAD need to use
:
std :: array < float , 4 > data ; std :: simd v = data ; // compiles only if simd<float>::size() happens to be 4 std :: basic_simd w = data ; // deduces Abi correctly
6. Accessing a source of bounded but unknown extent
When the source has an unknown extent there may be a mismatch between the size
of the
and the source’s size. We could handle this in one of three ways,
from least expensive to most expensive:
-
Make this undefined behaviour
-
Make this defined behaviour which is gracefully handled
-
Make this erroneous behaviour
-
Make this an exception
We look at each of these in turn. We also look at which of these should be the default behaviour.
6.1. Undefined behaviour
This option is equivalent to leaving the status quo of [P1928R12] unchanged,
and putting the onus entirely on the user to check that the range is valid. If
the source range is too small, the load will go beyond the end of the valid
data. If the source range is too big, the data will be truncated to the size of
the
.
Even if the range is not used explicitly for bounds checking, the extra information could still be used in debug modes by the compiler to help aid correctness checking, or by static analysis tools, or by mechanisms for contract checking.
6.2. Defined behaviour
With this option the behaviour of the
load will always generate valid
data and will always ensure that out-of-bounds reads (or writes for stores)
cannot occur.
When the
is too small to hold all the data in the source range, the data
will be truncated to what can fit in the
itself.
When the
is bigger than the source range then the extra elements are
filled with value initialized data. The memory beyond the end of the smaller
source range will not be accessed. This particular behaviour is useful for
handling loop remainders/epilogues.
The major costs of this option are the extra instructions to implement the runtime check, and the need to do a partial copy from memory into the simd register. The latter of these might be particular expensive if, for example, the target has to do element-by-element copy from memory to satisfy the need to avoid touching the memory entirely. Also, the introduction of a checked load could introduce a branch into critical load routines where we really don’t want branches. Even on modern Intel machines which have masked memory operations, the cost of setting up and acting on the mask for every call is still something we’d prefer to avoid.
We could explicitly opt-in to this behaviour by passing in
as a flag.
6.3. Erroneous behaviour
With a minor change in implementation, compared to "defined behaviour" above, the implementation could also support EB on out-of-bounds access. The runtime cost of checking the range size and copying only the valid elements is exactly the same. The only difference is that the default-initialized simd elements must exhibit EB on usage.
The default-initialization feature above is a valid use-case for implementing a remainder step after a loop over a larger range. So we certainly don’t want to unconditionally make default-initialized elements EB. Instead, EB is a replacement for UB, where the runtime cost is acceptable compared to the cost of an out-of-bounds access.
6.4. Exception
The final and most expensive option is to require the source range and the
size to be identical. A runtime check is inserted into the load to check
the extents match, and an exception thrown when they don’t. This is expensive
because of the runtime check, but also because the mechanics of exceptions often
disrupt optimized code.
We could explicitly opt-in to this behaviour by passing in
as a flag.
6.5. Choosing the default behaviour
We have listed three options for what the behaviour of a load from an unknown
range should be. In each case we can define a suitable flag which opts in to
that behaviour:
,
, and
. We
could require the user to always specify what behaviour they want, but this will
lead to verbose code. Ideally we should define what the default should be, and
realistically the choice is one of undefined behaviour, or defined
behaviour. Exceptional behaviour is really too expensive for what is considered
to be a performance library.
If we chose to use defined behaviour as the default, where the data will be truncated or
default initialized depending upon the source size, then we get repeatable,
understandable and sane behaviour whatever the range of data we request. The big
downside is that every use of the function will incur a runtime cost to check
the bounds and to do something appropriate if a mismatch is found. This
introduces a potential performance bottleneck directly into one of the most
performance critical parts of
. We could accept this default if we found
that the compiler is able to keep the cost of the check as low as possible
(e.g., if it can statically analyse the code and remove any redundant checks),
but our experience so far is that the check is expensive (TBD: Qualify the expense).
If we chose to use undefined behaviour then we avoid the need to insert a
run-time check and action in what is a low-level function with high performance
requirements. This most closely matches the status quo of
as it is
currently defined. By providing the extra bounds information we allow the
compiler to do static checking of its own to detect issues, or allow it to emit
extra information to allow range checking in hardware, or through debug
facilities. We also get the best possible performance from the
library in
the common case.
Given
’s primary use as a low-level performance library, we recommend
that the default should be undefined behaviour, with the user opting into one of
the other more expensive options. Higher level abstractions (such as the simd
execution policy discussed later) can use the extra knowledge they have about
context to improve on safety.
6.6. Examples
All examples use the following alias
and are compiled with
(32 byte wide SIMD). They implement the parallel part of a reduction over a
larger
. The function
would need to be a call to
for a complete reduction. Since that would complicate an
analysis we leave it at simply calling an anonymous function
.
using V = std :: simd < float > ;
This first example incorrectly creates a
of static extent at the end of
the loop, which, in general, goes beyond the size of the vector.
void iter_ub1 ( const std :: vector < float >& x ) { V acc {}; for ( auto it = x . begin (); it < x . end (); it += V :: size ) { std :: span < const float , V :: size () > chunk ( it , V :: size ()); acc += V ( chunk ); } g ( acc ); }
iter_ub1 ( std :: vector <float , std :: allocator <float > >const &): mov rax , QWORD PTR [ rdi ] mov rdx , QWORD PTR [ rdi +8 ] vxorps xmm0 , xmm0 , xmm0 cmp rax , rdx jnb .L32 .L33: vaddps ymm0 , ymm0 , YMMWORD PTR [ rax ] add rax , 32 cmp rax , rdx jb .L33 .L32: jmp void g <... >
So, if the user creates invalid
s there’s nothing we can do in
. As
expected. But lets not create
s anymore then. Instead we’re going to load
from an iterator pair, but potentially calling load with a range size less than 8.
void iter_ub2 ( const std :: vector < float >& x ) { V acc {}; for ( auto it = x . begin (); it < x . end (); it += V :: size ) acc += std :: simd_load ( it , x . end ()); g ( acc ); }
iter_ub2 ( std :: vector <float , std :: allocator <float > >const &): mov rax , QWORD PTR [ rdi ] mov rdx , QWORD PTR [ rdi +8 ] vxorps xmm0 , xmm0 , xmm0 cmp rax , rdx jnb .L37 .L38: vaddps ymm0 , ymm0 , YMMWORD PTR [ rax ] add rax , 32 cmp rax , rdx jb .L38 .L37: jmp void g <... >
Now a precondition check inside
can see the out-of-bounds access.
Contracts checking / debug builds / static analysis could easily catch this.
But if the default behavior of
is UB, then this is still an invalid
program. So, then what happens if we change
to partial-loadstore when
the range is too small?
void iter_zero ( const std :: vector < float >& x ) { V acc {}; for ( auto it = x . begin (); it < x . end (); it += V :: size ) acc += std :: simd_load ( it , x . end (), std :: simd_flag_partial_loadstore ); g ( acc ); }
iter_zero ( std :: vector <float , std :: allocator <float > >const &): mov rax , QWORD PTR [ rdi ] mov rcx , QWORD PTR [ rdi +8 ] vxorps xmm0 , xmm0 , xmm0 cmp rax , rcx jnb .L68 vmovaps ymm2 , ymm0 mov r10d , 8 .L65: mov rdx , rcx sub rdx , rax cmp rdx , 28 jle .L69 vmovups ymm1 , YMMWORD PTR [ rax ] add rax , 32 vaddps ymm0 , ymm0 , ymm1 cmp rax , rcx jb .L65 .L68: jmp void g <... >.L69: push r13 lea r13 , [ rsp +16 ] and rsp , -32 push QWORD PTR [ r13-8 ] push rbp mov rbp , rsp push r13 .L43: sar rdx , 2 mov rsi , r10 vmovaps YMMWORD PTR [ rbp-48 ], ymm2 lea r8 , [ rbp-48 ] sub rsi , rdx mov rdx , rax cmp esi , 8 jnb .L70 .L45: xor edi , edi test sil , 4 je .L48 mov edi , DWORD PTR [ rdx ] mov DWORD PTR [ r8 ], edi mov edi , 4 .L48: test sil , 2 je .L49 movzx r9d , WORD PTR [ rdx +rdi ] mov WORD PTR [ r8 +rdi ], r9w add rdi , 2 .L49: and esi , 1 je .L50 movzx edx , BYTE PTR [ rdx +rdi ] mov BYTE PTR [ r8 +rdi ], dl .L50: vmovaps ymm1 , YMMWORD PTR [ rbp-48 ] add rax , 32 vaddps ymm0 , ymm0 , ymm1 cmp rax , rcx jnb .L71 .L51: mov rdx , rcx sub rdx , rax cmp rdx , 28 jle .L43 vmovups ymm1 , YMMWORD PTR [ rax ] add rax , 32 vaddps ymm0 , ymm0 , ymm1 cmp rax , rcx jb .L51 .L71: mov r13 , QWORD PTR [ rbp-8 ] leave lea rsp , [ r13-16 ] pop r13 jmp void g <... >.L70: mov r9d , esi xor edx , edx and r9d , -8 .L46: mov edi , edx add edx , 8 mov r8 , QWORD PTR [ rax +rdi ] mov QWORD PTR [ rbp-48 +rdi ], r8 cmp edx , r9d jb .L46 lea rdi , [ rbp-48 ] lea r8 , [ rdi +rdx ] add rdx , rax jmp .L45
The program is correct now. But it’s not as efficient as it should be. There’s
no way around using UB loads inside the loop. We just need to end the loop
earlier and then handle the remainder. For the remainder we can make good use
of the
behavior.
void iter_zero_efficient ( const std :: vector < float >& x ) { V acc {}; auto it = x . begin (); for (; it + V :: size () <= x . end (); it += V :: size ) acc += std :: simd_load ( it , x . end ()); acc += std :: simd_load ( it , x . end (), std :: simd_flag_partial_loadstore ); g ( acc ); }
iter_zero_efficient ( std :: vector <float , std :: allocator <float > >const &): mov rcx , QWORD PTR [ rdi ] mov rdx , QWORD PTR [ rdi +8 ] vxorps xmm0 , xmm0 , xmm0 lea rax , [ rcx +32 ] cmp rdx , rax jb .L109 .L110: vaddps ymm0 , ymm0 , YMMWORD PTR [ rax-32 ] mov rcx , rax add rax , 32 cmp rdx , rax jnb .L110 .L109: sub rdx , rcx cmp rdx , 28 jbe .L111 vmovups ymm1 , YMMWORD PTR [ rcx ] vaddps ymm0 , ymm0 , ymm1 jmp void g <... >.L111: push r13 sar rdx , 2 mov esi , 8 vxorps xmm1 , xmm1 , xmm1 sub rsi , rdx mov rax , rcx lea r13 , [ rsp +16 ] and rsp , -32 push QWORD PTR [ r13-8 ] push rbp mov rbp , rsp push r13 lea rdi , [ rbp-48 ] vmovaps YMMWORD PTR [ rbp-48 ], ymm1 cmp esi , 8 jnb .L134 .L113: xor edx , edx test sil , 4 jne .L135 test sil , 2 jne .L136 .L117: and esi , 1 jne .L137 .L118: vmovaps ymm1 , YMMWORD PTR [ rbp-48 ] vaddps ymm0 , ymm0 , ymm1 mov r13 , QWORD PTR [ rbp-8 ] leave lea rsp , [ r13-16 ] pop r13 jmp void g <... >.L137: movzx eax , BYTE PTR [ rax +rdx ] mov BYTE PTR [ rdi +rdx ], al jmp .L118 .L136: movzx ecx , WORD PTR [ rax +rdx ] mov WORD PTR [ rdi +rdx ], cx add rdx , 2 and esi , 1 je .L118 jmp .L137 .L135: mov edx , DWORD PTR [ rax ] mov DWORD PTR [ rdi ], edx mov edx , 4 test sil , 2 je .L117 jmp .L136 .L134: mov r8d , esi xor eax , eax and r8d , -8 .L114: mov edx , eax add eax , 8 mov rdi , QWORD PTR [ rcx +rdx ] mov QWORD PTR [ rbp-48 +rdx ], rdi cmp eax , r8d jb .L114 lea rdi , [ rbp-48 ] add rdi , rax add rax , rcx jmp .L113
This program is both correct and, at least in the loop, as efficient as possible. With hardware support for masked instructions, a high quality implementation of the partial-init load turns into a single instruction.
7. Overloads for different range specifications
Providing constructors which accept a static or a dynamic extent can solve all
of the basic needs of
, but it may be convenient to have additional
overloads which simplify some of the common ways in which
values are
constructed.
already provides a variety of different styles of bound
specification, including:
span ( It first , size_type count ); span ( It first , End last ); span ( std :: type_identity_t < element_type > ( & arr )[ N ] ) noexcept ; span ( std :: array < U , N >& arr ); span ( R && range ); span ( std :: initializer_list < value_type > il ) noexcept ; span ( const std :: span < U , N >& source ) noexcept ;
Many of these styles are already handled by
, where constructors are
available for compile-time extents or dynamic ranges already, and need not be
considered for special treatment here. However, there are two common cases which could be added as utility constructors:
constexpr basic_simd ( It first , size_type count ); constexpr basic_simd ( It first , It last );
8. Contiguous or non-contiguous sources
In [P1928R12] and [P2664R6] we make a distinction between
loads and stores to contiguous memory regions, which we call copy operations
(e.g.,
,
) and non-contiguous operations, which we call
and
. We have done so to call out the potentially
large performance difference between the two and make it obvious to the user
what the expectations should be.
With ranges or spans as the source of data for
constructors we should
take similar care to avoid accidental performance traps. We recommend that the new
constructors described in this paper should work only on contiguous ranges. If the
user wishes to access data in a non-contiguous range then they should explicitly
build a suitable contiguous data source, using their expert knowledge to decide
how best to achieve that.
9. Range operations for simd_mask
In the current definition of
in [P1928R12] there are constructors,
and
/
functions which work with iterators to containers of
values.
As noted in [P2876R1], a container of
values is not an ideal way to
store mask-like values. A better alternative for storing mask element bits
outside a
is either as a
or as densely packed bits in
unsigned integers. [P2876R1] proposes that
provide implicit
constructors and accessors for these cases, giving a more efficient and safer
mechanism for storing mask bits than the current functions in [P1928R12]. We
suggest that the
iterator constructors, and the
functions are removed entirely, and that
and
should only operate on
values.
10. Constructor, member function or free function
At this point we have two constructors loading from a contiguous range: In the
static extent case the constructor requires an exact match of
width and
range size; in the dynamic extent case the range can be of any size. This
mismatch prompts the question whether we’re doing something wrong. Why does this
compile:
std :: vector data = { 1 , 2 , 3 , 4 , 5 }; std :: simd < int , 4 > v ( data );
but this doesn’t compile:
std :: array data = { 1 , 2 , 3 , 4 , 5 }; std :: simd < int , 4 > v ( data );
and this doesn’t compile:
std :: vector data = { 1 , 2 , 3 , 4 , 5 }; std :: span < const int , 5 > chunk ( data , 5 ); std :: simd < int , 4 > v ( chunk );
We need to resolve this inconsistency. To this end, note that we can
characterize these two constructors as "conversion from array" and "load from
range" constructors. Under the hood they are both just a
, but the
semantics we are defining for the
type makes them different (implicit vs
explicit, constrained size vs no size constraint). Also note that in most
cases the "conversion from array" can also be implemented via a
,
which is not trivially possible for the "load from range" constructor.
This leads us to reconsider whether we should have a load constructor at all.
If we have a "conversion from array" constructor the load we can consider
"safe" to use is available to users. And given a facility that produces a
of static extent from a given range we don’t actually need anything else
anymore:
myvector < int > data = /*...*/ ; // std::vector plus span subscript operator std :: basic_simd v = data [ i , std :: simd < int >:: size ];
The other important consideration is the relation of loads to gathers. Both
fill all
elements from values stored in a contiguous range. The load can
be considered a special case of a gather where the index vector is an iota
vector (0, 1, 2, 3, ...). Thus, if we adopt the
function from
P2664, the following two are functionally equivalent (using the P1928R12 and
P2664R8 API):
std :: vector < int > data = /*...*/ ; constexpr std :: simd < int , 4 > iota ([]( int i ) { return i ; }); auto v1 = std :: simd < int , 4 > ( data . begin ()); auto v2 = std :: gather_from ( data , iota ); assert ( all_of ( v1 == v2 ));
A high-quality implementation can even recognize the index argument (constant
propagated) and implement the above
as a vector load instruction
instead of a gather instruction.
In terms of API design we immediately notice the inconsistency. These two functions are closer related than the "conversion from array" constructor and "load from range" constructor. Consequently, either there needs to be a gather constructor or the "load from range" constructor should be a non-member function.
auto v1 = std :: load_from ( data ); auto v2 = std :: gather_from ( data , iota );
But these two still have an important difference, the former provides no
facility to choose the
width, whereas the latter infers it from the
width of the index
. This could be solved with an optional template
argument to the
function. For consistency, the
function should provide the same optional template argument:
using V = std :: simd < int , 4 > ; auto v1 = std :: load_from < V > ( data ); auto v2 = std :: gather_from < V > ( data , iota );
With this definition of the load non-member function, we retain the same
functionality the constructor provides (converting loads, choose
width).
If we replace the load constructor with a non-member function, we should at the
same time remove the
member function. The analogue non-member
function
replaces
. Stores
don’t require a template argument, since the
type is already give via
the first function argument.
Providing loads and stores as non-member functions enables us to provide them for scalar types as well, enabling more SIMD-generic programming. Consider:
template < simd_integral T > void f ( std :: vector < int >& data ) { T v = std :: load_from < T > ( data ); v += 1 ; std :: store_to ( v , data ); }
This function
can be called as
and
if
and
are overloaded for scalars. With the P1928R12 API the
implementation of
requires constexpr-if branches.
For completeness, the load and store functions need to be overloaded with a mask argument preceding the flags argument.
10.1. Design alternative replacing load_from
by std :: ranges :: to
An alternative to providing
is to use an existing C++ feature called
instead. Here are a couple of examples showing the potential
of doing this:
auto v0 = array { 1 , 2 , 3 , 4 } | ranges :: to < basic_simd > `; std :: vector < double > floats { 2.4 , 1.1 , 9.7 , 6.3 , 9.6 }; auto v1 = floats | std :: ranges :: to < basic_simd < int , 3 > ();
The first example works already using the span-constructor proposed earlier. The
second example could also be made to work if another constructor were provided
which took
as its first parameter. Both examples would also
require CTAD to deduce the correct simd template parameters.
While
does seem to be a potential replacement for
which is consistent with existing C++ facilities, there are a number of API
mis-matches which potentially make this confusing:
-
The range will be silently truncated if the output
object is too small. This alone is sufficiently undesirable behaviour to persuade us not to usebasic_simd
.std :: ranges :: to -
Other standard types which serve as the destination of
can resize at runtime to store all the available input elements. Standard types which can’t resize (e.g.,std :: ranges :: to
) are not permitted as the destination ofstd :: array
, sostd :: ranges :: to
arguably shouldn’t either.basic_simd -
The
function would not allow a mask to be passed in, sostd :: ranges :: to
would still be needed for the masked variation.load_from -
There is no obvious way for
to replacestd :: ranges :: to
, sostore_to
would still be required.store_to
For all these reasons we do not think that using
is a viable
full or partial replacement for
.
10.2. Design alternative combining load and gather / store and scatter
As noted above, an implementation could recognize a constant propagated iota
index vector passed to a gather and emit an much faster load instruction. What
if we provide a special type denoting a contiguous index range? In that case
there could be a single
function
implementing gather and load efficiently. Example:
std :: index_range < 4 > i4 = 1 ; std :: simd < int , 4 > iota ([]( int i ) { return i + 1 ; }); auto v1 = std :: simd_load ( data , i4 ); auto v2 = std :: simd_load ( data , iota ); assert ( all_of ( v1 == v2 ));
Such an
type is also interesting for subscripting into
contiguous ranges, producing a
of static extent, which is consequently
convertible to
:
std :: basic_simd v3 = data [ i4 ]; assert ( all_of ( v1 == v3 ));
11. Data parallel execution policy
Many of the arguments in this proposal concern how explicit the programmer needs
to be in what they are doing (e.g., exception or default-initialization) to
achieve safe but high-performance. For example, a very common use case for
is to iterate through a large data set in small simd-sized blocks,
processing each block in turn. The code might look something like:
void process ( Range && r ) { constexpr int N = simd < float >:: size ; // Whole blocks for ( int i = 0 ; i < r . size (); i += N ) { std :: span < float , simd < float >:: size > block ( r . begin () + i , N ); doSomething ( block ); // Standard load - no runtime checks, but its within the known range. } // Remainder if ( remainder ) { // range checked simd < float > remainder ( r . begin (), N , simd_flag_partial_loadstore ); doSomething ( block ); } }
This is efficient and correct, but is verbose common code for the user to write.
If an execution policy for
was available then this could be written much
more simply, with all the mechanics hidden away.
std :: for_each ( std :: execution :: simd , std :: begin ( ptr ), std :: end ( ptr ), [ & ]( simdable block ) { doSomething ( block ); });
In this case the
library need not worry too much about choosing what
default to use in the absence of any flags (e.g.,
,
) because the library code wil use whatever it needs explicitly.
It is not our intent to define an execution policy in this paper, but we point
out its potential as it may help guide us to a decision on what
should do by
default.
12. Wording
The wording relies on [P1928R12] being landed to the Working Draft. Below, substitute the � character with a number the editor finds appropriate for the table, paragraph, section or sub-section.
12.1. Modify [simd.flags]
Add the following:
// [simd.flags], Load and store flags �struct partial - loadstore - flag ; // exposition only �inline constexpr simd_flags simd_flag_partial_loadstore {};
// Class template simd_flags overview �Every type in Flags is one of convert - flag , aligned - flag , partial - loadstore - flag or overaligned - flag .
12.2. Modify [simd.synopsis]
Add functions to load a simd value from a range.
� [simd.range_load], basic_simd load from rangetemplate < class V , class Range , class ... Flags > constexpr V load_from ( const Range & r , simd_flags < Flags ... > f = {}); template < class It , class ... Flags > constexpr V load_from ( const Range & r , const typename V :: mask_type & mask , simd_flags < Flags ... > f = {}); template < class V , class It , class ... Flags > constexpr V load_from ( It begin , std :: size_t count , simd_flags < Flags ... > f = {}); template < class V , class It , class ... Flags > constexpr V load_from ( It begin , std :: size_t count , const typename V :: mask_type & mask , simd_flags < Flags ... > f = {}); template < class V , class It , class ... Flags > constexpr V load_from ( It begin , It end , simd_flags < Flags ... > f = {}); template < class V , class It , class ... Flags > constexpr V load_from ( It begin , It end , const typename V :: mask_type & mask , simd_flags < Flags ... > f = {});
Add functions to load a simd value from a range.
� [simd.range_store], basic_simd store into a rangetemplate < class T , class Abi , class Range , class ... Flags > constexpr void store_to ( const basic_simd < T , Abi >& v , Range && r , simd_flags < Flags ... > f = {}); template < class T , class Abi , class Range , class ... Flags > constexpr void store_to ( const basic_simd < T , Abi >& v , Range && r , const typename basic_simd < T , Abi >:: mask_type & mask , simd_flags < Flags ... > f = {}); template < class T , class Abi , class Out , class ... Flags > constexpr void store_to ( const basic_simd < T , Abi >& v , Out begin , std :: size_t count , simd_flags < Flags ... > f = {}); template < class T , class Abi , class Out , class ... Flags > constexpr void store_to ( const basic_simd < T , Abi >& v , Out begin , std :: size_t count , typename basic_simd < T , Abi >:: const mask_type & mask , simd_flags < Flags ... > f = {}); template < class T , class Abi , class Out , class ... Flags > constexpr void store_to ( const basic_simd < T , Abi >& v , Out begin , Out end , simd_flags < Flags ... > f = {}); template < class T , class Abi , class Out , class ... Flags > constexpr void store_to ( const basic_simd < T , Abi >& v , Out begin , Out end , const typename basic_simd < T , Abi >:: mask_type & mask , simd_flags < Flags ... > f = {});
12.3. Modify [simd.overview]
Replace iterator-based constructors with static-span constructors:
� [simd.ctor], basic_simd constructorstemplate < class It , class ... Flags > constexpr explicit basic_simd ( It first , simd_flags < Flags ... > = {}); template < class It , class ... Flags > constexpr explicit basic_simd ( It first , const mask_type & mask , simd_flags < Flags ... > = {}); template < class Span , class ... Flags > constexpr explicit basic_simd ( const Span & first , simd_flags < Flags ... > = {}); template < class Span , class ... Flags > constexpr explicit basic_simd ( const Span & first , const mask_type & mask , simd_flags < Flags ... > = {});
Remove
and
functions.
� [simd.copy], basic_simd copy functionstemplate < class It , class ... Flags > constexpr void copy_from ( It first , simd_flags < Flags ... > f = {}); template < class It , class ... Flags > constexpr void copy_from ( It first , const mask_type & mask , simd_flags < Flags ... > f = {}); template < class Out , class ... Flags > constexpr void copy_to ( Out first , simd_flags < Flags ... > f = {}) const ; template < class Out , class ... Flags > constexpr void copy_to ( Out first , const mask_type & mask , simd_flags < Flags ... > f = {}) const ;
12.4. Modify [simd.mask.overview]
Remove
memory interaction.
� [simd.mask.ctor],constructors
basic_simd_mask template < class It , class ... Flags > constexpr basic_simd_mask ( It first , simd_flags < Flags ... > = {}); template < class It , class ... Flags > constexpr basic_simd_mask ( It first , const basic_simd_mask & mask , simd_flags < Flags ... > = {});
Remove
and
functions.
� [simd.mask.copy],copy functions
basic_simd_mask template < class It , class ... Flags > constexpr void copy_from ( It first , simd_flags < Flags ... > = {}); template < class It , class ... Flags > constexpr void copy_from ( It first , const basic_simd_mask & mask , simd_flags < Flags ... > = {}); template < class Out , class ... Flags > constexpr void copy_to ( Out first , simd_flags < Flags ... > = {}) const ; template < class Out , class ... Flags > constexpr void copy_to ( Out first , const basic_simd_mask & mask , simd_flags < = {}) const ;
12.5. Modify [simd.ctor] basic_simd
constructors
Remove iterator-based constructors.
template < class It , class ... Flags > constexpr explicit basic_simd ( It first , simd_flags = {}); [ remove all descriptions too ] template < class It , class ... Flags > constexpr explicit basic_simd ( It first , const mask_type & mask , simd_flags = {}); [ remove all descriptions too ]
Add static-extent span constructors.
�constructors [simd.ctor]
basic_simd // Construct a basic_simd from the values contained in the given span-like object with // the same static extent. template < class StaticSpan , class ... Flags > constexpr explicit basic_simd ( const StaticSpan & s , simd_flags < Flags ... > = {}); template < class StaticSpan , class ... Flags > constexpr explicit basic_simd ( const StaticSpan & s , const mask_type & mask , simd_flags < Flags ... > = {}); Constraints:
must be a vectorizable type.
element - type < StaticSpan > The static extent of the
must be the same as
StaticSpan .
size Mandates:
If the template parameter pack
does not contain
Flags , then the conversion from
convert - flag to
element - type < StaticSpan > is value-preserving.
value_type Preconditions:
If the template parameter pack
contains
Flags , then
aligned - flag points to storage aligned by
to_address ( std :: begin ( s )) .
simd_alignment_v < basic_simd , span - value - type < StaticSpan >> If the template parameter pack
contains
Flags , then
overaligned - flag < N > points to storage aligned by
to_address ( std :: begin ( s )) .
N Effects:
For the unmasked constructor, initializes the
element with
i ’th for all
static_cast < T > ( to_address ( std :: begin ( s ))[ i ]) in the range of
i .
[ 0 , size ()) For the masked constructor, initializes the
element with
i ’th for all
mask [ i ] ? static_cast < T > ( to_address ( std :: begin ( s ))[ i ]) : T () in the range of
i .
[ 0 , size ())
12.6. Add [simd.from_range] basic_simd
load from range
Add free-functions to load a basic_simd’s elements from a range.
�load from range [simd.from_range]
basic_simd ( 1 ) template < class V , class Range , class ... Flags > constexpr V load_from ( const Range & r , simd_flags < Flags ... > f = {}); ( 2 ) template < class It , class ... Flags > constexpr V load_from ( const Range & r , const typename V :: mask_type & mask , simd_flags < Flags ... > f = {}); ( 3 ) template < class V , class Iter , class ... Flags > constexpr V load_from ( Iter first , std :: size_t count , simd_flags < Flags ... > f = {}); ( 4 ) template < class It , class ... Flags > constexpr V load_from ( Iter first , std :: size_t count , const typename V :: mask_type & mask , simd_flags < Flags ... > f = {}); ( 5 ) template < class V , class Iter , class ... Flags > constexpr V load_from ( Iter first , Iter last , simd_flags < Flags ... > f = {}); ( 6 ) template < class It , class ... Flags > constexpr V load_from ( Iter first , Iter last , const typename V :: mask_type & mask , simd_flags < Flags ... > f = {}); Constraints:
or
std :: ranges :: range_value_t < Range > is a vectorizable type.
iter_value_t < Iter >
models
Range
std :: contiguous_range
models
Range
std :: sized_range
models
Iter
std :: contiguous_iterator
is a
V
basic_simd Mandates:
If the template parameter pack
does not contain
Flags , then the conversion from
convert - flag or
std :: ranges :: range_value_t < Range > to
std :: iter_value_t < Iter > is value-preserving.
V :: value_type Preconditions:
If the template parameter pack
does not contain
Flags then the source must be a valid range of equal or greater extent to
simd_flag_partial_loadstore .
V :: size For 3 and 4,
must be a valid range.
[ first , first + count ) For 5 and 6,
must be a valid range.
[ first , last ) If the template parameter pack
contains
Flags , then the source range (e.g.,
aligned - flag ,
to_address ( std :: begin ( r )) ) points to storage aligned by
to_address ( first ) .
simd_alignment_v < basic_simd , std :: ranges :: range_value_t < Range >> If the template parameter pack
contains
Flags , then
overaligned - flag < N > points to storage aligned by
to_address ( std :: begin ( r )) .
N Returns:
A
object of type
basic_simd . When
V contains
Flags , the i’th element of the return value will be:
simd_flag_partial_loadstore
For unmasked function:
.
i < size ? static_cast < typename V :: value_type > ( to_address ( std :: begin ( s ))[ i ]) ? typename V :: value_type {} For masked function:
.
( i < size && mask [ i ]) ? static_cast < typename V :: value_type > ( to_address ( std :: begin ( s ))[ i ]) ? typename V :: value_type {} When
does not contain
Flags , the i’th element of the return value will be:
simd_flag_partial_loadstore
For unmasked function:
.
static_cast < typename V :: value_type > ( to_address ( std :: begin ( s ))[ i ]) For masked function:
.
mask [ i ] ? static_cast < typename V :: value_type > ( to_address ( std :: begin ( s ))[ i ]) ? typename V :: value_type {}
12.7. Add [simd.overview]
Add class template argument deduction to overview.
� [simd.overview]
// SIMD ctad. template < contiguous - range - with - static - extent Rg > basic_simd ( Rg && ) -> basic_simd < ranges :: range_value_t < Rg > , deduce - t < ranges :: range_value_t < Rg > , static - range - size < Rg >>> ; template < contiguous - range - with - static - extent Rg , typename ... Flags > basic_simd ( Rg && , simd_flags ) -> basic_simd < ranges :: range_value_t < Rg > , deduce - t < ranges :: range_value_t < Rg > , static - range - size < Rg >>> ;
12.8. Add [simd.ctad] basic_simd
class template argument deduction
�class template argument deduction [simd.ctad]
basic_simd template < contiguous - range - with - static - extent Rg > basic_simd ( Rg && ) -> basic_simd < ranges :: range_value_t < Rg > , deduce - t < ranges :: range_value_t < Rg > , static - range - size < Rg >>> ; template < contiguous - range - with - static - extent Rg , typename ... Flags > basic_simd ( Rg && , simd_flags ) -> basic_simd < ranges :: range_value_t < Rg > , deduce - t < ranges :: range_value_t < Rg > , static - range - size < Rg >>> ; Constraints:
represents something with a compile-time constant extent which is not larger than an implementation defined maximum.
Rg
contains elements of a vectorizable type.
Rg Returns:
A
parameterized to have exactly the same number of elements as the statically bounded span-like type, and the same element type.
basic_simd Remarks:
The span-like object can be a span with known compile-time, or something equivalent (e.g., a C-array with known bounds).
-NOTE: Add suitable note about erroneous behaviour.
13. Polls
13.1. St Louis 2024
Poll: We agree with the direction of P3299R0 regarding non-member functions for loads and stores, and a conversion operator (whether explicit or otherwise) for ranges with static extents.
SF | F | N | A | SA |
---|---|---|---|---|
6 | 10 | 1 | 0 | 0 |