aligned_accessor: An mdspan accessor expressing pointer overalignment

Document #: P2897R2
Date: 2024-07-15
Project: Programming Language C++
LEWG
Reply-to: Mark Hoemmen
<>
Damien Lebrun-Grandie
<>
Nicolas Morales
<>
Christian Trott
<>

Contents

1 Authors

2 Revision history

3 Purpose of this paper

We propose adding aligned_accessor to the C++ Standard Library. This class template is an mdspan accessor policy that uses assume_aligned to decorate pointer access. We think it belongs in the Standard Library for two reasons. First, it would serve as a common vocabulary type for interfaces that take mdspan to declare their minimum alignment requirements. Second, it extends to mdspan accesses the optimizations that compilers can perform to pointers decorated with assume_aligned.

aligned_accessor is analogous to the various atomic_accessor_* templates proposed by P2689. Both that proposal and this one start with a Standard Library feature that operates on a “raw” pointer (assume_aligned or the various atomic_ref* templates), and then propose an mdspan accessor policy that straightforwardly wraps the lower-level feature.

We had originally written aligned_accessor as an example in P2642, which proposes “padded” mdspan layouts. We realized that aligned_accessor was more generally applicable and that standardization would help the padded layouts proposed by P2642 reach their maximum value.

4 Key features

The offset_policy alias is default_accessor<ElementType>, because even if a pointer p is aligned, p + i might not be.

The data_handle_type alias is ElementType*. It needs no further adornment, because alignment is asserted at the point of access, namely in the access function. Some implementations might have an easier time optimizing if they also apply some implementation-specific attribute to data_handle_type itself. Examples of such attributes include __declspec(align_value(byte_alignment)) and __attribute__((align_value(byte_alignment))). However, these attributes should not apply to the result of offset, for the same reason that offset_policy is default_accessor and not aligned_accessor.

The converting constructor from aligned_accessor is analogous to default_accessor’s constructor, in that it exists to permit conversion from nonconst element_type to const element_type. It additionally permits implicit conversion from more overalignment to less overalignment – something that we expect users may need to do. For example, users may start with aligned_accessor<float, 128>, because their allocation function promises 128-byte alignment. However, they may then need to call a function that takes an mdspan with aligned_accessor<float, 32>, which declares the function’s intent to use 8-wide SIMD of float.

The is_sufficiently_aligned function checks whether a pointer has sufficient alignment to be used correctly with the class. This makes it easier for users to check preconditions, without needing to know how to cast a pointer to an integer of the correct size and signedness.

Per discussion below, we would like LEWG to consider the alternative design that removes is_sufficiently_aligned from aligned_accessor and adds it to the <bit> header as a separate nonmember function. We think LWG review of aligned_accessor can proceed concurrently. We present wording for this design alternative below.

5 Design discussion

5.1 The accessor is not nestable

We considered making aligned_accessor “wrap” any accessor type that meets the right requirements. For example, aligned_accessor could take the inner accessor as a template parameter, store an instance of it, and dispatch to its member functions. That would give users a way to apply multiple accessor “attributes” to their data handle, such as atomic access (see P2689) and overalignment.

We decided against this approach for three reasons. First, we would have no way to validate that the user’s accessor type has the correct behavior. We could check that their accessor’s data_handle_type is a pointer type, but we could not check that their accessor’s access function actually dereferences the pointer. For instance, access might instead interpret the pointer as a file handle or a key into a distributed data store.

Second, even if the inner accessor’s access function actually did return the result of dereferencing the pointer, the outer access function might not be able to recover the effects of the inner access function, because access computes a reference, not a pointer. In order for aligned_accessor’s access function to get back that pointer, it would need to reach past the inner accessor’s public interface. That would defeat the purpose of generic nesting.

Third, any way (not just this one) of nesting two generic accessors raises the question of order dependence. Even if it were possible to apply the effects of both the inner and outer accessors’ access functions in sequence, it might be unpleasantly surprising to users if the effects depended on the order of nesting. A similar question came up in the “properties” proposal P0900, which we quote here.

Practically speaking, it would be considered a best practice of a high-quality implementation to ensure that a property’s implementation of properties::element_type_t (and other traits) are invariant with respect to ordering with other known properties (such as those in the standard library), but with this approach it would be impossible to make that guarantee formal, particularly with respect to other vendor-defined and user-defined properties unknown to the property implementer.

For these reasons, we have made aligned_accessor stand-alone, instead of having it modify another user-provided accessor.

5.2 Explicit constructor from default_accessor

LEWG’s 2023-10-10 review of R0 pointed out that in R0, aligned_accessor lacks an explicit constructor from default_accessor. Having that constructor would make it easier for users to create an aligned mdspan from an unaligned mdspan. Making it explicit would prevent implicit conversion. Thus, we have decided to add this explicit constructor in R1.

Without the explicit constructor, users have two options for turning a nonaligned mdspan into an aligned mdspan. First, as in the following example, users could “take apart” the input nonaligned mdspan and use the pieces to construct an aligned mdspan, whose type they name completely.

void compute_with_aligned(
  std::mdspan<float, std::dims<2>, std::layout_left> matrix)
{
  const std::size_t byte_alignment = 4 * alignof(float);
  using aligned_matrix_t = std::mdspan<float, std::dims<2>,
    std::layout_left, std::aligned_accessor<float, byte_alignment>>;

  aligned_matrix_t aligned_matrix{matrix.data_handle(), matrix.mapping()};
  // ... use aligned_matrix ...
}

Second, as in the following example, users could construct an aligned_accessor explicitly and use constructor template argument deduction (CTAD) to construct the aligned mdspan from its pieces.

void compute_with_aligned(
  std::mdspan<float, std::dims<2>, std::layout_left> matrix)
{
  const std::size_t byte_alignment = 4 * alignof(float);

  std::mdspan aligned_matrix{matrix.data_handle(), matrix.mapping(),
    std::aligned_accessor<float, byte_alignment>{}};
  // ... use aligned_matrix ...
}

The first approach would likely be more common. This is because mdspan users commonly define their own type aliases for mdspan, with application-specific names that make code more self-documenting. The aligned_matrix_t definition above is an an example.

Adding an explicit constructor from default_accessor lets users get the same effect more concisely, without needing to “take apart” the input mdspan.

void compute_with_aligned(std::mdspan<float, std::dims<2, int>, std::layout_left> matrix)
{
  const std::size_t byte_alignment = 4 * alignof(float);
  using aligned_mdspan = std::mdspan<float, std::dims<2, int>,
    std::layout_left, std::aligned_accessor<float, byte_alignment>>;

  aligned_mdspan aligned_matrix{matrix};
  // ... use aligned_matrix ...
}

The explicit constructor does not decrease safety, in the sense that users were always allowed to convert from an mdspan with default_accessor to an mdspan with aligned_accessor. Before, users could perform this conversion by typing the following.

aligned_matrix_t aligned_matrix{matrix.data_handle(), matrix.mapping()};

Now, users can do the same thing with fewer characters.

aligned_matrix_t aligned_matrix{matrix};

5.3 Why no explicit constructor from less to more alignment?

As explained in the previous section, aligned_accessor has an explicit converting constructor from default_accessor so that users can assert overalignment. It also has an (implicit) converting constructor from another aligned_accessor with more alignment, to an aligned_accessor with less alignment. However, aligned_accessor does not have an explicit converting constructor from another aligned_accessor with less alignment, to an aligned_accessor with more alignment. Why not?

Consider the three typical use cases for aligned_accessor.

  1. User knows an allocation’s alignment at compile time.

  2. User knows an allocation’s alignment at run time, but not at compile time. For example, the value might depend on run-time detection of particular hardware features.

  3. User doesn’t know whether an allocation is overaligned. They might need to ask some system at run time, or check the pointer value themselves, in order to decide whether to call code that expects a particular alignment.

In Case (1), users would normally declare the maximum alignment. They would want to preserve this information at compile time as much as possible, by keeping the aligned_accessor mdspan with maximum compile-time alignment for the entire scope of its use. Users would only want implicit conversions to less alignment or default_accessor when calling functions whose parameter types encode these requirements.

Case (2) reduces to Case (3).

Case (3) reduces to Case (1). This works like any conversion from run-time type to compile-time type, with a fixed list of possible compile-time types (the alignments). As soon as a user’s mdspan enters a scope where the alignment is known at compile time, the user would want to preserve that compile-time information and maximize the alignment for as large of a scope as possible.

None of these cases involve starting with more alignment, going to less (but still some) alignment, and then going back to more alignment again. Code that does that probably does not correctly use the types of function parameters to express its overalignment requirements. It’s like code that uses dynamic_cast a lot. Users can still convert from less or more alignment by creating the result’s aligned_accessor manually. However, we don’t want to encourage this pattern, so we don’t offer an explicit conversion for it.

5.4 We do not define an alias for aligned mdspan

In LEWG’s 2023-10-10 review of R0, participants observed that this proposal’s examples define an example-specific type alias for mdspan with aligned_accessor. They asked whether our proposal should include a standard alias aligned_mdspan. We do not object to such an alias, but we do not find it very useful, for the following reasons.

  1. Users of mdspan commonly define their own type aliases whose names are meaningful for their applications.

  2. It would not save much typing.

Examples may define aliases to make them more concise. One example in this proposal defines the following alias for an mdspan of float with alignment byte_alignment.

template<size_t byte_alignment>
using aligned_mdspan = std::mdspan<float, std::dims<1, int>,
  std::layout_right, std::aligned_accessor<float, byte_alignment>>;

This lets the example use aligned_mdspan<32> and aligned_mdspan<16>.

The above alias is specific to a particular example. A general version of alias would look like this.

template<class ElementType, class Extents, class Layout,
  size_t byte_alignment>
using aligned_mdspan = std::mdspan<ElementType, Extents, Layout,
  std::aligned_accessor<ElementType, byte_alignment>>;

This alias would save some typing. However, mdspan “power users” rarely type out all the template arguments. First, they can rely on CTAD to create mdspans, and auto to return them. Second, users commonly already define their own aliases whose names have an application-specific meaning. They define these aliases once and use them throughout the application. For instance, users might define the following.

template<class ElementType>
using vector_t = std::mdspan<ElementType,
  std::dims<1>, std::layout_left>;
template<class ElementType>
using matrix_t = std::mdspan<ElementType,
  std::dims<2>, std::layout_left>;

template<class ElementType, size_t byte_alignment>
using aligned_vector_t = std::mdspan<ElementType,
  std::dims<1>, std::layout_left, 
  std::aligned_accessor<ElementType, byte_alignment>>;
template<class ElementType, size_t byte_alignment>
using aligned_matrix_t = std::mdspan<ElementType,
  std::dims<2>, std::layout_left, 
  std::aligned_accessor<ElementType, byte_alignment>>;

Such users may never type the characters “mdspan” again. For this reason, while we do not object to an aligned_mdspan alias, we do not find the proliferation of aliases particularly ergonomic.

5.5 mdspan construction safety

LEWG’s 2023-10-10 review of R0 expressed concern that mdspan’s constructor has no way to check aligned_accessor’s alignment requirements. Users can call aligned_accessor’s is_sufficiently_aligned(p) static member function with a pointer p to check this themselves, before constructing the mdspan. However, mdspan’s constructor generally has no way to check whether its accessor finds the caller’s data handle acceptable.

This is true for any accessor type, not just for aligned_accessor. It is a design feature of mdspan that accessors can be stateless. Most of them have no state. Even if they have state, they do not need to be constructed with or store the data handle. As a result, an accessor might not see a data handle until access or offset is called. Both of those member functions are performance critical, so they cannot afford an extra branch on every call. Compare to vector::operator[], which has preconditions but is not required to perform bounds checks. Using exceptions in the manner of vector::at could reduce performance and would also make mdspan unusable in a freestanding or no-exceptions context.

Note that aligned_accessor does not introduce additional preconditions beyond those of the existing C++ Standard Library feature assume_aligned. In the words of one LEWG reviewer, aligned_accessor is not any more “pointy” than assume_aligned; it just passes the point through without “blunting” it.

Before submitting R0 of this paper, we considered an approach specific to aligned_accessor, that would wrap the pointer in a special data handle type. The data handle type would have an explicit constructor that takes a raw pointer, with a precondition that the raw pointer have sufficient alignment. The constructor would be explicit, because it would have a precondition. This design would force the precondition back to mdspan construction time. Users would have to construct the mdspan like this.

element_type* raw_pointer = get_pointer_from_somewhere();
using acc_type = aligned_accessor<element_type, byte_alignment>;
mdspan x{acc_type::data_handle_type{raw_pointer}, mapping, acc_type{}};

We rejected this approach in favor of is_sufficiently_aligned for the following reasons.

  1. Wrapping the pointer in a custom data handle class would make every access or offset call need to reach through the data handle’s interface, instead of just taking the raw pointer directly. The access function, and to some extent also offset, need to be as fast as possible. Their performance depends on compilers being able to optimize through function calls. The authors of mdspan carefully balanced generality with function call depth and other code complexity factors that may hinder compilers from optimizing. Performance of aligned_accessor matters as much or even more than performance of default_accessor, because aligned_accessor exists to communicate optimization potential.

  2. The alignment precondition would still exist. Requiring the data handle type to throw an exception if the pointer is not sufficiently aligned would make mdspan unusable in a freestanding or no-exceptions context.

  3. Users should not have to pay for unneeded checks. The two examples in the wording express the two most common cases. If users get a pointer from a function like aligned_alloc, then they already know its alignment, because they asked for it. If users are computing alignment at run time to dispatch to a more optimized code path, then they know alignment before dispatch. In both cases, users already know the alignment before constructing the mdspan.

  4. The data handle is still a pointer, it’s just a pointer with a constraint on its values. Users would reasonably expect to be able to use the result of data_handle() with existing interfaces that expect a raw pointer.

An LEWG poll on 2023-10-10, “[b]lock aligned_accessor progressing until we have a way of checking alignment requirements during mdspan construction,” resulted in no consensus. Attendance was 14.

Strongly Favor Weakly Favor Neutral Weakly Against Strongly Against
0 1 1 2 2

LEWG expressed an (unpolled) interest that we explore mdspan safety in subsequent work after the fall 2023 Kona WG21 meeting. LEWG asked us to explore safety in a way that is not specific to aligned_accessor. Part of that exploration is in the section below “Generalize is_sufficiently_aligned for all accessors?”. We plan further exploration of this topic elsewhere.

5.6 is_sufficiently_aligned is not constexpr

LEWG reviewed R1 of this proposal at the June 2024 St. Louis WG21 meeting, and polled 1/10/0/0/1 (SF/F/N/A/SA) to remove constexpr from is_sufficiently_aligned. This is because it is not clear how to implement the function in a way that could ever be a constant expression. The straightforward cross-platform way to implement this would bit_cast the pointer to uintptr_t. However, bit_cast is not constexpr when converting from a pointer to an integer, per [bit.cast] 3. Any reinterpret_cast similarly could not be a core constant expression, per [expr.const] 5.15. One LEWG reviewer pointed out that some compilers have a built-in operation (e.g., Clang and GCC have __builtin_bit_cast) that might form a constant expression when bit_cast does not. On the other hand, the authors could not foresee a need for is_sufficiently_aligned to be constexpr and did not want to constrain implementations to use compiler-specific functionality.

5.7 Generalize is_sufficiently_aligned for all accessors?

5.7.1 is_sufficiently_aligned is specific to aligned_accessor

The is_sufficiently_aligned function exists so users can check the pointer’s alignment precondition before constructing an mdspan with it. This precondition check is specific to aligned_accessor. Furthermore, the function has a precondition that the pointer points to a valid element. Standard C++ offers no way for users to check that. More importantly for mdspan users, Standard C++ offers no way to check whether a pointer and a layout mapping’s required_span_size() form a valid range.

5.7.2 detectably_invalid: Generic validity check?

During the June 2024 St. Louis WG21 meeting, one LEWG reviewer (please see Acknowledgments below) pointed out that code that is generic on the accessor type currently has no way to check whether a given data handle is valid. Specifically, given a size_t size (e.g., the required_span_size() of a given layout mapping), there is no way to check whether [ 0, size ) forms an accessible range (see [mdspan.accessor.general] 2) of a given data handle and accessor. The reviewer suggested adding a new member function

bool detectably_invalid(data_handle_type handle, size_t size) const noexcept;

to all mdspan accessors. This would return true if the implementation can show that [ 0, size ) is not an accessible range for handle and the accessor, and true otherwise. The word “detectably” in the name would remind users that this is a “best effort” check. It might return false even if the handle is invalid or if [ 0, size ) is not an accessible range. Also, it might return different values on different implementations, depending on their ability to check e.g., pointer range validity. The function would have the following design features.

With such a function, users could write generic checked mdspan creation code like the following.

template<class LayoutMapping, class Accessor>
auto create_mdspan_with_check(
  typename Accessor::data_handle_type handle,
  LayoutMapping mapping,
  Accessor accessor)
{
  if (accessor.detectably_invalid(handle, mapping.required_span_size())) {
    throw std::out_of_range("Invalid data handle and/or size");
  }
  return mdspan{handle, mapping, accessor};
}

5.7.3 Arguments against and for detectably_invalid

We didn’t include this feature in the original mdspan design because most data handle types have no way to say with full accuracy whether a handle and size are valid. We didn’t want to give users the false impression that a validity check was doing anything meaningful. Standard C++ has no way to check a raw pointer T* and a size, though some implementations such as CHERI C++ ([Davis 2019] and [Watson 2020]) and run-time profiling and debugging systems such as Valgrind do have this feature. We designed mdspan accessors to be able to wrap libraries that implement a partitioned global address space (PGAS) programming model for accessing remote data over a network. (See P0009R18, Section 2.7, “Why custom accessors?”.) Such libraries include the one-sided communication interface in MPI (the Message Passing Interface for distributed-memory parallel programming) or NVSHMEM (NVIDIA’s implementation of the SHMEM standard). Those libraries define their own data handle to represent remote data. For example, MPI uses an MPI_Win “window” object. NVSHMEM uses a C++ pointer to represent a “symmetric address” that points to an allocation from the “symmetric heap” (that is accessible to all participating parallel processes). Such libraries generally do not have validity checks for their handles.

On the other hand, a detectably_invalid function would let happen any checks that could happen. For instance, a hypothetical “GPU device memory accessor” (not proposed for the C++ Standard, but existing in projects like RAPIDS RAFT) might permit access to an allocation of GPU “device” memory from only GPU “device” code, not from ordinary “host” code. A common use case for GPU allocations is to allocate device memory in host code, then pass the pointer to device code for use there. Thus, it would be reasonable to create an mdspan in host code with that accessor. The accessor could use a CUDA run-time function like cudaPointerGetAttributes to check if the pointer points to valid GPU memory. Even default_accessor could have a simple check like this.

bool detectably_invalid(data_handle_type ptr, size_t size) const {
  return ptr == nullptr && size != 0;
}

5.7.4 Nonmember customization point design

C++23 defines the generic interface of accessors through the accessor policy requirements [mdspan.accessor.reqmts]. Adding detectably_invalid to these requirements would be a breaking change to C++23. Thus, generic code that wanted to call this function would need to fill in default behavior for both Standard accessors defined in C++23, and user-defined accessors that comply with the C++23 accessor requirements. The following detectably_invalid nonmember function (not proposed in this paper) shows one way users could do that. Please see Appendix A below for the full source code of a demonstration, along with a Compiler Explorer link. This demonstration shows that breaking backwards compatibility with C++23 is unnecessary, because users can straightforwardly work around the lack of a detectably_invalid member function in C++23 - compliant accessors. Not standardizing this nonmember function work-around would also give users the freedom to fill in different default behavior. For example, some users may prefer to consider every (data handle, size) pair invalid unless proven otherwise, as a way to force use of custom accessors that have the ability to make accurate checks.

template<class Accessor>
concept has_detectably_invalid = requires(Accessor acc) {
  typename Accessor::data_handle_type;
  { std::as_const(acc).detectably_invalid(
      std::declval<typename Accessor::data_handle_type>(),
      std::declval<std::size_t>()
    ) } noexcept -> std::same_as<bool>;
};

template<class Accessor>
bool detectably_invalid(Accessor&& accessor,
  typename std::remove_cvref_t<Accessor>::data_handle_type handle,
  std::size_t size)
{
  if constexpr (has_detectably_invalid<std::remove_cvref_t<Accessor>>) {
    return std::as_const(accessor).detectably_invalid(handle, size);
  }
  else {
    return false;
  }
}

5.7.5 We need both is_sufficiently_aligned and detectably_invalid

One could argue that if aligned_accessor had detectably_invalid, that would make is_sufficiently_aligned unnecessary. We disagree; we think is_sufficiently_aligned is useful by itself, whether or not detectably_invalid exists, for the following reasons.

  1. Users will often want to check alignment separately from pointer range validity.

  2. Checking alignment may be much less expensive than checking pointer range validity.

Regarding (1), we think the most common use case for aligned_accessor’s explicit converting constructor from default_accessor would be explicit construction of an mdspan with aligned_accessor from an mdspan with default_accessor. The latter exists, so the user has already asserted that the range formed by its data handle and required_span_size() is valid. Thus, the only thing the user would need to check would be whether the data handle is sufficiently aligned.

The same LEWG reviewer who suggested detectably_invalid had originally thought it would make is_sufficiently_aligned unnecessary. However, after reviewing R2 of this paper, that reviewer changed their mind. They now agree with us that is_sufficiently_aligned is useful by itself. All their concerns would be addressed by making is_sufficiently_aligned a nonmember function, rather than a member function of aligned_accessor.

5.7.6 Alternative: Nonmember is_sufficiently_aligned

The reviewer responded to our argument above by suggesting that we remove is_sufficiently_aligned from aligned_accessor and make it a separate nonmember function. This is a reasonable alternative suggestion. We think that LWG review of aligned_accessor can proceed concurrently while LEWG considers this alternative.

Into which header should this new function go? We think it belongs in <bit>, because it is fundamentally a bit arithmetic operation. There should be no reason why that could not be freestanding, like everything else in the <bit> header. Another option would be to put it in <memory> right after assume_aligned, and to mark it as freestanding (since assume_aligned is).

5.7.7 Do accessors need to check anything else?

The only other thing an accessor’s user might want to check besides a (data handle, size) pair would be converting construction from another type of accessor. All mdspan components – extents, layout mappings, and accessors – implement conversions with preconditions via explicit constructors. (For more detail, please see the section below, “Explicit conversions as the model for precondition-asserting conversions.”) Accessors do not store their data handles, so the only reason to check whether converting construction is valid would be if the input or result accessor has separate run-time state. (Otherwise, the check could be a constraint or static_assert.) It’s rare for an accessor to need run-time state, so we don’t expect to need this feature in generic code. It would also be a separable addition from the feature of checking a data handle and size. Nevertheless, one could consider a design. We would favor just overloading detectably_invalid for accessors, as there would be no risk of ambiguity. Converting constructors only take one argument, so there would be no ambiguity between calling detectably_invalid with an accessor and calling it with a data handle and size.

5.7.8 Naming the function

  1. The function describes a property: “this (data handle, size) pair is not known to be invalid.” It’s an adjective (like “valid” or “is_valid”), not a verb (like “check” as in “check_valid”).

  2. The function does not promise perfect accuracy. In the common case, it says whether it can detect whether the handle and size are _in_valid. Whether or not they are valid might be harder to say.

  3. As discussed above, users may also want to check converting constructors from other accessor types. However, there would be no risk of ambiguity between that and checking a data handle and size. Therefore, there’s no need for the function’s name to include the type of the thing being checked (e.g., “range”).

  4. Specifically, the function should not contain the word “pointer,” because a data handle is not necessarily a pointer. Even if data_handle_type​ is a pointer type, a data handle might not necessarily be a pointer to the elements in the Standard C++ sense. For example, it might be some opaque handle that a library represents as a type alias of void*​.

These points together suggest the name detectably_invalid.

5.7.9 Conclusions

  1. Adding detectably_invalid to the accessor requirements and existing Standard accessors in C++26 would be a breaking change to C++23. Nevertheless, users could write code that fills in reasonable behavior for C++23 accessors.

  2. Few C++ implementations offer a way to check validity of a pointer range. Thus, users would experience detectably_invalid as mostly not useful for the common case of default_accessor and other accessors that access a pointer range.

  3. Item (1) reduces the urgency of this suggestion for C++26. Item (2) reduces its potential to improve the mdspan user experience in a practical way. Therefore, we do not suggest adding detectably_invalid to the accessor requirements in this proposal. However, we do not discourage further work in separate proposals.

  4. We would like LEWG to consider the alternative design that removes is_sufficiently_aligned from aligned_accessor and adds it to the C++ Standard Library as a separate nonmember function. We think LWG review of aligned_accessor can proceed concurrently.

5.8 Explicit conversions as the model for precondition-asserting conversions

During the June 2024 St. Louis WG21 meeting, one LEWG reviewer asked about the explicit constructor from default_accessor. This constructor lets users assert that a pointer has sufficient alignment to be accessed by the aligned_accessor. The reviewer argued that this was an “unsafe” conversion, and wanted these “unsafe” conversions to be even more explicit than an explicit constructor: e.g., a new *_cast function template. We do not agree with this idea; this section explains why.

5.8.1 Example: conversion to aligned_accessor

Suppose that some function that users can’t change returns an mdspan of float with default_accessor, even though users know that the mdspan is overaligned to 8 * sizeof(float) bytes. The function’s parameter(s) don’t matter for this example.

mdspan<float, dims<1>, layout_right, default_accessor<float>>
  overaligned_view(SomeParameters params);

Suppose also that users want to call some other function that they can’t change. This function takes an mdspan of float with aligned_accessor<float, 8>. Its return type doesn’t matter for this example.

SomeReturnType use_overaligned_view(
  mdspan<float, dims<1>, layout_right, aligned_accessor<float, 8>>);

5.8.2 Status quo

How do users call use_overaligned_view with the object returned from overaligned_view? The status quo offers two ways. Both of them rely on aligned_accessor<float, 8>’s explicit converting constructor from default_accessor<float>.

  1. Use mdspan’s explicit converting constructor.

  2. Construct the new mdspan explicitly from its data handle, layout mapping, and accessor. (This is the ideal use case for CTAD, as an mdspan is nothing more than its data handle, layout mapping, and accessor.)

Way (1) looks like this.

auto x = overaligned_view(params);
auto result = use_overaligned_view(
  mdspan<float, dims<1>, layout_right,
    aligned_accessor<float, 8>>(x)
);

Way (2) looks like this. Note use of CTAD.

auto x = overaligned_view(params);
auto result = use_overaligned_view(
  mdspan{x.data_handle(), x.mapping(),
    aligned_accessor<float, 8>>(x.accessor())}
);

Which way is less verbose depends on mdspan’s template arguments. Both ways, though, force the user to name the type aligned_accessor<float, 8> explicitly. Users know that they have pulled out a sharp knife from the toolbox. It’s verbose, it’s nondefault, and it’s a class with a short definition. Users can go to the specification, see assume_aligned, and know they are dealing with a low-level function that has a precondition.

5.8.3 mdspan uses explicit conversions to assert preconditions

The entire system of mdspan components was designed so that

Changing this would break backwards compatibility with C++23. For example, one can see this with converting constructors for

This is consistent with C++ Standard Library class templates, in that construction asserts any preconditions. For example, if users construct a string_view or span from a pointer ptr and a size size, this asserts that the range [ ptr, ptr + size ) is accessible.

5.8.4 Alternative: explicit cast function naughty_cast

Everything we have described above is the status quo. What did the one LEWG reviewer want to see? They wanted all conversions with preconditions to use a “cast” function with an easily searchable name, analogous to static_cast. As a placeholder, we’ll call it “naughty_cast.” For the above use_overaligned_view example, the naughty_cast analog of Way (2) would look like this.

auto x = overaligned_view(params);
auto result = use_overaligned_view(
  mdspan{x.data_handle(), x.mapping(),
    naughty_cast<aligned_accessor<float, 8>>>(x.accessor())}
);

One could imagine defining naughty_cast of mdspan by naughty_cast of its components. This would enable an analog of Way (1).

auto x = overaligned_view(params);
auto result = use_overaligned_view(naughty_cast<
  mdspan<float, dims<1>, layout_right,
    aligned_accessor<float, 8>>>(x)
);

Another argument for naughty_cast besides searchability is to make conversions with preconditions “loud,” that is, easily seen in the code by human developers. However, the original Way (1) and Way (2) both are loud already in that they require a lot of extra code that spells out the result’s accessor type explicitly. The status quo’s difference in “volume” is implicit conversion

auto result = use_overaligned_view(x);

versus explicit construction.

auto result = use_overaligned_view(
  mdspan{x.data_handle(), x.mapping(),
    aligned_accessor<float, 8>(x)});
);

Adding naughty_cast to the latter doesn’t make it much louder.

auto result = use_overaligned_view(
  mdspan{x.data_handle(), x.mapping(),
    naughty_cast<aligned_accessor<float, 8>>(x)});
);

There are other disadvantages to a naughty_cast design. The point of that design would be to remove or make non-public all the explicit constructors from mdspan’s components. That functionality would need to move somewhere. A typical implementation technique for a custom cast function is to rely on specializations of a struct with two template parameters, one for the input type and one for the output type of the cast. The naughty_caster struct example below shows how one could do that.

template<class Output, class Input>
struct naughty_caster {};

template<class Output, class Input>
Output naughty_cast(const Input& input) {
  return naughty_caster<Output, Input>::cast(input);
}

template<class OutputElementType, size_t ByteAlignment,
  class InputElementType>
  requires (is_convertible_v<InputElementType(*)[],
    OutputElementType(*)[]>) 
struct naughty_caster {
  using output_type =
    aligned_accessor<OutputElementType, ByteAlignment>;
  using input_type = default_accessor<InputElementType>;

  static output_type cast(const input_type&) {
    return {}; 
  }
};

This technique takes a lot of effort and code, when by far the common case is that cast has a trivial body. For any accessors with state, it would almost certainly call for breaks of encapsulation, like making the naughty_caster specialization a friend of the input and/or output.

We emphasize that users are meant to write custom accessors. The intended typical author of a custom accessor is a performance expert who is not necessarily a C++ expert. It takes quite a bit of C++ experience to learn how to use encapsulation-breaking techniques safely; the lazy approaches all just expose implementation details or defeat the “safety” that naughty_cast is supposed to introduce. Given that the main motivation of naughty_cast is safety, we shouldn’t make it harder for users to write safe code.

More importantly, naughty_cast would obfuscate accessors. The architects of mdspan meant accessors to have to have a small number of “moving parts” and to define all those parts in a single place. Contrast default_accessor with the contiguous iterator requirements, for instance. The naughty_cast design would force custom accessors (and custom layouts) to define their different parts in different places, rather than all in one class. WG21 has moved away from this scattered design approach. For example, P2855R1 (“Member customization points for Senders and Receivers”) changes P2300 (std::execution) to use member functions instead of tag_invoke-based customization points.

5.8.5 Conclusion: retain mdspan’s current design

For all these reasons, we do not support replacing mdspan’s current “conversions with preconditions are explicit conversions” design with a cast function design.

6 Implementation

We have tested an implementation of this proposal with the reference mdspan implementation. Appendix B below lists the source code of a full implementation.

7 Example

template<size_t byte_alignment>
using aligned_mdspan =
  std::mdspan<float, std::dims<1, int>, std::layout_right, std::aligned_accessor<float, byte_alignment>>;

// Interfaces that require 32-byte alignment,
// because they want to do 8-wide SIMD of float.
extern void vectorized_axpy(aligned_mdspan<32> y, float alpha, aligned_mdspan<32> x);
extern float vectorized_norm(aligned_mdspan<32> y);

// Interfaces that require 16-byte alignment,
// because they want to do 4-wide SIMD of float.
extern void fill_x(aligned_mdspan<16> x);
extern void fill_y(aligned_mdspan<16> y);

// Helper functions for making overaligned array allocations.

template<class ElementType>
struct delete_raw {
  void operator()(ElementType* p) const {
    std::free(p);
  }
};

template<class ElementType>
using allocation = std::unique_ptr<ElementType[], delete_raw<ElementType>>;

template<class ElementType, std::size_t byte_alignment>
allocation<ElementType> allocate_raw(const std::size_t num_elements)
{
  const std::size_t num_bytes = num_elements * sizeof(ElementType);
  void* ptr = std::aligned_alloc(byte_alignment, num_bytes);
  return {ptr, delete_raw<ElementType>{}};
}

float user_function(size_t num_elements, float alpha)
{
  // Code using the above two interfaces needs to allocate to the max alignment.
  // Users could also query aligned_accessor::byte_alignment
  // for the various interfaces and take the max.
  constexpr size_t max_byte_alignment = 32;
  auto x_alloc = allocate_raw<float, max_byte_alignment>(num_elements);
  auto y_alloc = allocate_raw<float, max_byte_alignment>(num_elements);

  aligned_mdspan<max_byte_alignment> x(x_alloc.get());
  aligned_mdspan<max_byte_alignment> y(y_alloc.get());

  fill_x(x); // automatic conversion from 32-byte aligned to 16-byte aligned
  fill_y(y); // automatic conversion again

  vectorized_axpy(y, alpha, x);
  return vectorized_norm(y);
}

8 References

9 Acknowledgments

10 Wording

Text in blockquotes is not proposed wording, but rather instructions for generating proposed wording. The � character is used to denote a placeholder section number which the editor shall determine.

In [version.syn], add

#define __cpp_lib_aligned_accessor YYYYMML // also in <mdspan>

Adjust the placeholder value as needed so as to denote this proposal’s date of adoption.

To the Header <mdspan> synopsis [mdspan.syn], after class default_accessor and before class mdspan, add the following.

// [mdspan.accessor.aligned], class template aligned_accessor
template<class ElementType, size_t byte_alignment>
  class aligned_accessor;

At the end of [mdspan.accessor.default] and before [mdspan.mdspan], add all the material that follows.

10.1 Add subsection � [mdspan.accessor.aligned] with the following

� Class template aligned_accessor [mdspan.accessor.aligned]

�.1 Overview [mdspan.accessor.aligned.overview]

template<class ElementType, size_t the_byte_alignment>
struct aligned_accessor {
  using offset_policy = default_accessor<ElementType>;
  using element_type = ElementType;
  using reference = ElementType&;
  using data_handle_type = ElementType*;

  static constexpr size_t byte_alignment = the_byte_alignment;

  constexpr aligned_accessor() noexcept = default;

  template<class OtherElementType, size_t other_byte_alignment>
    constexpr aligned_accessor(
      aligned_accessor<OtherElementType, other_byte_alignment>) noexcept;

  template<class OtherElementType>
    explicit constexpr aligned_accessor(
      default_accessor<OtherElementType>) noexcept;

  constexpr operator default_accessor<element_type>() const {
    return {};
  }

  constexpr reference access(data_handle_type p, size_t i) const noexcept;

  constexpr typename offset_policy::data_handle_type
    offset(data_handle_type p, size_t i) const noexcept;

  static bool is_sufficiently_aligned(data_handle_type p);
};

1 Mandates:

2 aligned_accessor meets the accessor policy requirements.

3 ElementType is required to be a complete object type that is neither an abstract class type nor an array type.

4 Each specialization of aligned_accessor is a trivially copyable type that models semiregular.

5 [0, n) is an accessible range for an object p of type data_handle_type and an object of type aligned_accessor if and only if [p, p + n) is a valid range.

10.2 Members [mdspan.accessor.aligned.members]

template<class OtherElementType, size_t other_byte_alignment>
  constexpr aligned_accessor(
    aligned_accessor<OtherElementType, other_byte_alignment>) noexcept {}

1 Constraints: is_convertible_v<OtherElementType(*)[], element_type(*)[]> is true.

2 Mandates: gcd(other_byte_alignment, byte_alignment) == byte_alignment is true.

template<class OtherElementType>
  explicit constexpr aligned_accessor(
    default_accessor<OtherElementType>) noexcept {};

3 Constraints: is_convertible_v<OtherElementType(*)[], element_type(*)[]> is true.

constexpr reference access(data_handle_type p, size_t i) const noexcept;

4 Preconditions: p points to an object X of a type similar ([conv.qual]) to element_type, where X has alignment byte_alignment ([basic.align]).

5 Effects: Equivalent to: return assume_aligned<byte_alignment>(p)[i];

constexpr typename offset_policy::data_handle_type
  offset(data_handle_type p, size_t i) const noexcept;

6 Preconditions: p points to an object X of a type similar ([conv.qual]) to element_type, where X has alignment byte_alignment ([basic.align]).

7 Effects: Equivalent to: return p + i;

static bool is_sufficiently_aligned(data_handle_type p);

8 Preconditions: p points to an object X of a type similar ([conv.qual]) to element_type.

9 Returns: true if X has alignment at least byte_alignment, else false.

[Example: The following function compute uses is_sufficiently_aligned to check whether a given mdspan with default_accessor has a data handle with sufficient alignment to be used with aligned_accessor<float, 4 * sizeof(float)>. If so, the function dispatches to a function compute_using_fourfold_overalignment that requires fourfold overalignment of arrays, but can therefore use hardware-specific instructions, such as four-wide SIMD (Single Instruction Multiple Data) instructions. Otherwise, compute dispatches to a possibly less optimized function compute_without_requiring_overalignment that has no overalignment requirement.

extern void
compute_using_fourfold_overalignment(
  std::mdspan<float, std::dims<1>, std::layout_right,
    std::aligned_accessor<float, 4 * alignof(float)>> x);

extern void
compute_without_requiring_overalignment(
  std::mdspan<float, std::dims<1>> x);

void compute(std::mdspan<float, std::dims<1>> x)
{
  auto accessor = aligned_accessor<float, 4 * sizeof(float)>{};
  auto x_handle = x.data_handle();

  if (accessor.is_sufficiently_aligned(x_handle)) {
    compute_using_fourfold_overalignment(mdspan{x_handle, x.mapping(), accessor});
  }
  else {
    compute_without_requiring_overalignment(x);
  }
}

end example]

[Example: The following example shows how users can fulfill the preconditions of aligned_accessor by using existing C++ Standard Library functionality to create overaligned allocations. First, the allocate_overaligned helper function uses aligned_alloc to create an overaligned allocation.

template<class ElementType>
struct delete_with_free {
  void operator()(ElementType* p) const {
    std::free(p);
  }
};

template<class ElementType>
using allocation = std::unique_ptr<ElementType[], delete_with_free<ElementType>>;

template<class ElementType, size_t byte_alignment>
allocation<ElementType> allocate_overaligned(const size_t num_elements)
{
  const size_t num_bytes = num_elements * sizeof(ElementType);
  void* ptr = std::aligned_alloc(byte_alignment, num_bytes);
  return {ptr, delete_with_free<ElementType>{}};
}

Second, this example presumes that some third-party library provides functions requiring arrays of float to have 32-byte alignment.

template<size_t byte_alignment>
using aligned_mdspan = std::mdspan<float, std::dims<1, int>,
  std::layout_right, std::aligned_accessor<float, byte_alignment>>;

extern void vectorized_axpy(aligned_mdspan<32> y, float alpha, aligned_mdspan<32> x);
extern float vectorized_norm(aligned_mdspan<32> y);

Third and finally, the user’s function user_function would begin by allocating “raw” overaligned arrays with allocate_overaligned. It would then create aligned mdspan with them, and pass the resulting mdspan into the third-party library’s functions.

float user_function(size_t num_elements, float alpha)
{
  constexpr size_t max_byte_alignment = 32;
  auto x_alloc = allocate_overaligned<float, max_byte_alignment>(num_elements);
  auto y_alloc = allocate_overaligned<float, max_byte_alignment>(num_elements);

  aligned_mdspan<max_byte_alignment> x(x_alloc.get());
  aligned_mdspan<max_byte_alignment> y(y_alloc.get());

  // ... fill the elements of x and y ...

  vectorized_axpy(y, alpha, x);
  return vectorized_norm(y);
}

end example]

11 Wording for alternative nonmember is_sufficiently_aligned design

Text in blockquotes is not proposed wording, but rather instructions for generating proposed wording. The instructions here are expressed as a “diff” atop P2897R2, that is, as if the wording proposed in P2897R2 had already been applied to the Working Draft. The � character is used to denote a placeholder section number which the editor shall determine.

From the aligned_accessor synopsis in [mdspan.accessor.aligned.overview], remove the member function static bool is_sufficiently_aligned(data_handle_type p);.

From the list of members of aligned_accessor in [mdspan.accessor.aligned.members], remove static bool is_sufficiently_aligned(data_handle_type p); from after line 7 (“Effects: Equivalent to: return p + i;”), and remove line 8 (“Preconditions: …”) and line 9 (“Returns: …”). (Retain the Example using is_sufficiently_aligned that immediately follows the deleted lines.)

To the Header <bit> synopsis [bit.syn], after enum class endian and before the } that closes namespace std, add the following.

// [bit.aligned], is_sufficiently_aligned
template<class ElementType, size_t byte_alignment>
  bool is_sufficiently_aligned(ElementType*);

After [bit.endian], add a new section [bit.aligned] and put in it all the material that follows.

template<class ElementType, size_t byte_alignment>
bool is_sufficiently_aligned(ElementType* p);

1 Preconditions: p points to an object X of a type similar ([conv.qual]) to ElementType.

2 Returns: true if X has alignment at least byte_alignment, else false.

12 Appendix A: detectably_invalid nonmember function example

This section is nonnormative. This is the full source code with tests for the detectably_invalid nonmember function example above. Please see this Compiler Explorer link for a test with five different compilers: GCC 14.1, Clang 18.1.0, MSVC v19.40 (VS17.10), and nvc++ 24.5.

#include <cassert>
#include <concepts>
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <type_traits>
#include <utility>

template<class Accessor>
concept has_detectably_invalid = requires(Accessor acc) {
  typename Accessor::data_handle_type;
  { std::as_const(acc).detectably_invalid(
      std::declval<typename Accessor::data_handle_type>(),
      std::declval<std::size_t>()
    ) } noexcept -> std::convertible_to<bool>;
};

template<class Accessor>
bool detectably_invalid(Accessor&& accessor,
  typename std::remove_cvref_t<Accessor>::data_handle_type handle,
  std::size_t size)
{
  if constexpr (has_detectably_invalid<std::remove_cvref_t<Accessor>>) {
    return std::as_const(accessor).detectably_invalid(handle, size);
  }
  else {
    return false;
  }
}

struct A {
  using data_handle_type = float*;

  static bool detectably_invalid(data_handle_type ptr, std::size_t size) noexcept {
    return ptr == nullptr && size != 0;
  }
};

struct B {
  using data_handle_type = float*;
};

struct C {
  using data_handle_type = float*;

  // This is nonconst, so it's not actually called.
  bool detectably_invalid(data_handle_type ptr, std::size_t size) {
    throw std::runtime_error("C::detectably_invalid: uh oh");
  }
};

struct D {
  using data_handle_type = float*;

  // This is const but not noexcept, so it's not actually called.
  bool detectably_invalid(data_handle_type ptr, std::size_t size) const {
    throw std::runtime_error("D::detectably_invalid: uh oh");
  }
};


int main()
{
  float* ptr = nullptr;

  assert(not detectably_invalid(A{}, ptr, 0));
  assert(detectably_invalid(A{}, ptr, 1));

  A a{};
  assert(not detectably_invalid(a, ptr, 0));
  assert(detectably_invalid(a, ptr, 1));

  const A a_c{};
  assert(not detectably_invalid(a_c, ptr, 0));
  assert(detectably_invalid(a_c, ptr, 1));

  assert(not detectably_invalid(B{}, ptr, 0));
  assert(not detectably_invalid(B{}, ptr, 1));

  // B doesn't know how to check pointer validity.

  assert(not detectably_invalid(B{}, ptr, 0));
  assert(not detectably_invalid(B{}, ptr, 1));

  B b{};
  assert(not detectably_invalid(b, ptr, 0));
  assert(not detectably_invalid(b, ptr, 1));

  const B b_c{};
  assert(not detectably_invalid(b_c, ptr, 0));
  assert(not detectably_invalid(b_c, ptr, 1));

  // If users make detectably_invalid nonconst or not noexcept,
  // the nonmember function falls back to a default implementation.

  try {
    assert(not detectably_invalid(C{}, ptr, 0));
    assert(not detectably_invalid(C{}, ptr, 1));
  }
  catch (const std::runtime_error& e) {
    std::cerr << "C{} threw runtime_error: " << e.what() << "\n";
  }

  try {
    const C c_c{};
    assert(not detectably_invalid(c_c, ptr, 0));
    assert(not detectably_invalid(c_c, ptr, 1));
  }
  catch (const std::runtime_error& e) {
    std::cerr << "const C threw runtime_error: " << e.what() << "\n";
  }

  try {
    C c{};
    assert(not detectably_invalid(c, ptr, 0));
    assert(not detectably_invalid(c, ptr, 1));
  }
  catch (const std::runtime_error& e) {
    std::cerr << "nonconst C threw runtime_error: " << e.what() << "\n";
  }

  try {
    assert(not detectably_invalid(D{}, ptr, 0));
    assert(not detectably_invalid(D{}, ptr, 1));
  }
  catch (const std::runtime_error& e) {
    std::cerr << "D{} threw runtime_error: " << e.what() << "\n";
  }

  try {
    const D d_c{};
    assert(not detectably_invalid(d_c, ptr, 0));
    assert(not detectably_invalid(d_c, ptr, 1));
  }
  catch (const std::runtime_error& e) {
    std::cerr << "const D threw runtime_error: " << e.what() << "\n";
  }

  try {
    D d{};
    assert(not detectably_invalid(d, ptr, 0));
    assert(not detectably_invalid(d, ptr, 1));
  }
  catch (const std::runtime_error& e) {
    std::cerr << "nonconst D threw runtime_error: " << e.what() << "\n";
  }

  std::cerr << "Made it to the end\n";
  return 0;
}

13 Appendix B: Implementation and demo

This Compiler Explorer link gives a full implementation of aligned_accessor and a demonstration. We show the full source code from that link here below.

#include <https://raw.githubusercontent.com/kokkos/mdspan/single-header/mdspan.hpp>
#include <bit>
#include <cassert>
#include <cmath>
#if defined(_MSC_VER)
#  include <cstdlib> // MSVC's _aligned_malloc
#endif
#include <exception>
#include <functional>
#include <memory>
#include <numeric>
#include <type_traits>

namespace stdex = std::experimental;

// P2389 (voted into C++ at June 2024 STL plenary)
namespace std {
template<size_t Rank, class IndexType = size_t>
using dims = dextents<IndexType, Rank>;
} // namespace std

namespace {

template<class ElementType, std::size_t byte_alignment>
class aligned_accessor {
public:
  static_assert(std::has_single_bit(byte_alignment),
    "byte_alignment must be a power of two.");
  static_assert(byte_alignment >= alignof(ElementType),
    "Insufficient byte alignment for ElementType");

  using offset_policy = stdex::default_accessor<ElementType>;
  using element_type = ElementType;
  using reference = ElementType&;
  using data_handle_type = ElementType*;

  constexpr aligned_accessor() noexcept = default;

  template<
    class OtherElementType,
    std::size_t other_byte_alignment>
  requires(std::is_convertible_v<
    OtherElementType(*)[], element_type(*)[]>)
  constexpr aligned_accessor(
    aligned_accessor<OtherElementType, other_byte_alignment>)
      noexcept
  {
    constexpr size_t the_gcd =
      std::gcd(other_byte_alignment, byte_alignment);
    static_assert(the_gcd == byte_alignment);
  }

  template<class OtherElementType>
  requires(std::is_convertible_v<
    OtherElementType(*)[], element_type(*)[]>)
  constexpr explicit aligned_accessor(
    stdex::default_accessor<OtherElementType>) noexcept
  {}
 
  constexpr
    operator stdex::default_accessor<element_type>() const
  {
    return {};
  }

  constexpr reference
    access(data_handle_type p, size_t i) const noexcept
  {
    return std::assume_aligned<byte_alignment>(p)[i];
  }

  constexpr typename offset_policy::data_handle_type
  offset(data_handle_type p, size_t i) const noexcept {
    return p + i;
  }

  static bool
    is_sufficiently_aligned(data_handle_type p) noexcept
  {
    return alignof(ElementType) >= byte_alignment && 
      std::bit_cast<uintptr_t>(p) % byte_alignment == 0;
  }
};

template<size_t byte_alignment>
using aligned_mdspan =
  std::mdspan<float, std::dims<1, int>, std::layout_right,
    aligned_accessor<float, byte_alignment>>;

// Interfaces that require 32-byte alignment,
// because they want to do 8-wide SIMD of float.
void
vectorized_axpby(aligned_mdspan<32> y,
  float alpha, aligned_mdspan<32> x, float beta)
{
  assert(x.extent(0) == y.extent(0));
  for (int k = 0; k < x.extent(0); ++k) {
    y[k] = beta * y[k] + alpha * x[k]; 
  }
}

// 1-norm of the vector y
float vectorized_norm(aligned_mdspan<32> y)
{
  float one_norm = 0.0f;
  for (int k = 0; k < y.extent(0); ++k) {
    one_norm += std::fabs(y[k]); 
  }
  return one_norm;
}

// Interfaces that require 16-byte alignment,
// because they want to do 4-wide SIMD of float.
void fill_x(aligned_mdspan<16> x) {
  for (int k = 0; k < x.extent(0); ++k) {
    x[k] = static_cast<float>(k + 2);
  }  
}
void fill_y(aligned_mdspan<16> y) {
  for (int k = 0; k < y.extent(0); ++k) {
    y[k] = static_cast<float>(k - 1);
  }  
}

// Helper functions for making overaligned array allocations.

template<class ElementType>
struct delete_raw {
  void operator()(ElementType* p) const {
    std::free(p);
  }
};

template<class ElementType>
using allocation =
  std::unique_ptr<ElementType[], delete_raw<ElementType>>;

template<class ElementType, std::size_t byte_alignment>
allocation<ElementType> allocate_raw(const std::size_t num_elements)
{
  const std::size_t num_bytes = num_elements * sizeof(ElementType);
  float* ptr = reinterpret_cast<float*>(
#if defined(_MSC_VER)
    _aligned_malloc(byte_alignment, num_bytes)
#else
    std::aligned_alloc(byte_alignment, num_bytes)
#endif
  );
  return {ptr, delete_raw<ElementType>{}};
}

float user_function(size_t num_elements, float alpha, float beta)
{
  constexpr size_t max_byte_alignment = 32;
  // unique_ptr<float[], deleter_something>
  auto x_alloc = allocate_raw<float, max_byte_alignment>(num_elements);
  auto y_alloc = allocate_raw<float, max_byte_alignment>(num_elements);

  aligned_mdspan<max_byte_alignment> x(x_alloc.get(), num_elements);
  aligned_mdspan<max_byte_alignment> y(y_alloc.get(), num_elements);

  // expect aligned_mdspan<16>, get aligned_mdspan<32>
  fill_x(x); // automatic conversion from 32-byte aligned to 16-byte aligned
  fill_y(y); // automatic conversion again

  // expect aligned_mdspan<32>, get aligned_mdspan<32>
  vectorized_axpby(y, alpha, x, beta);
  return vectorized_norm(y);
}

} // namespace (anonymous)

int main(int argc, char* argv[])
{
  float result = user_function(10, 1.0f, -1.0f);
  // 3 + 3 + ... + 3 = 30
  assert(result == 30.0f);
  return 0;
}