1. Introduction
There is a [P1068R7] paper in flight, which discusses a higher-level API to enable random number generation (RNG) in batches. The main purpose of the new API is to enable vectorization under the hood of the implementation. Given that some of us live in anticipation of the final std::simd [P1928R3] standardization, natural questions arise:
- Do we need a dedicated high-level API for RNG vectorization?
- Can a std::simd-based low-level API, together with some generic mechanism for making it high level, be sufficient?
2. Example
When does RNG performance matter? An algorithm should consume many random numbers, and the additional computation on top of the generation should be relatively lightweight. The "European options pricing" benchmark illustrates the point:
std::mt19937 engine(777);
std::normal_distribution distribution(0., 1.);

double v0 = 0, v1 = 0;
for (std::size_t p = 0; p < npath; ++p) { // e.g., npath = 1,000,000
    double rand = distribution(engine);
    double res = std::max(0., S * exp(sqrt(T) * vol * rand + T * mu) - X);
    v0 += res;
    v1 += res * res;
}
result = v0 / npath;
confidence = v1 / npath;
One needs millions of loop iterations to get an accurate estimate of the price: many random numbers are consumed, with only a handful of operations on top.
The loop will not be auto-vectorized by compilers because of the RNG call, so it lacks performance out of the box. Production code tends to be more complicated: there may be more than one random number per computational block, potentially with different distributions; random number paths may need to be correlated; and so on. Still, the "European option pricing" benchmark is good enough to compare different approaches.
2.1. High-level API with [P1068R7]
Let’s consider the straightforward example:
std::mt19937 engine(777);
std::normal_distribution distribution(0., 1.);
std::array<double, npath> rand; // npath = 1,000,000 -> sizeof(rand) = 8 MB
std::ranges::generate_random(rand, engine, distribution);

double v0 = 0, v1 = 0;
for (std::size_t p = 0; p < npath; ++p) {
    double res = std::max(0., S * exp(sqrt(T) * vol * rand[p] + T * mu) - X);
    v0 += res;
    v1 += res * res;
}
result = v0 / npath;
confidence = v1 / npath;
We got vectorization both in std::ranges::generate_random and in the loop. But we are forced to allocate temporary storage for the random numbers. Given the size of rand, we run into L1 or L2 cache misses and, in the end, sub-optimal performance.
Let us apply buffering:
std::mt19937 engine(777);
std::normal_distribution distribution(0., 1.);
std::array<double, nbuffer> rand; // e.g., nbuffer = 128

double v0 = 0, v1 = 0;
for (std::size_t p = 0; p < npath; p += nbuffer) {
    std::size_t local_size = (p + nbuffer <= npath) ? nbuffer : (npath - p); // dealing with the tail
    std::ranges::generate_random(std::span(rand.begin(), local_size), engine, distribution);
    for (std::size_t b = 0; b < local_size; ++b) {
        double res = std::max(0., S * exp(sqrt(T) * vol * rand[b] + T * mu) - X);
        v0 += res;
        v1 += res * res;
    }
}
result = v0 / npath;
confidence = v1 / npath;
The code is more complex, but it is likely faster and allows one to tune performance by changing the nbuffer parameter.
2.2. Low-level API with std::simd
Let’s investigate the opportunities for a low-level SIMD-based API in the same benchmark.
The API for SIMD-level RNG is not defined yet, but there are several possible approaches, each with its own pros and cons, which we may consider. The key distinguishing property of these approaches is the level at which the API gets the information about the std::simd type being requested by the user. An engine, a pseudo-random number generator producing a uniformly distributed integer sequence, holds the state for RNG generation, including extra intermediate structures for efficient calculation. With knowledge of the SIMD size, it may store its internal structures in the most efficient format, and in practice this format differs for different SIMD sizes.
In an ideal world, distribution objects would be almost stateless (except for the parameters of the distribution), but in practice they are not. For example, the Box-Muller method for Gaussian random number generation produces two variates from two independent uniform samples. Thus, in scalar use, one already-generated value can be stored in the distribution object and returned on the next call.
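For illustration only, here is a minimal sketch of that caching behavior. It is not how std::normal_distribution is specified, and real implementations differ in detail; the class name and structure are assumptions made for this sketch.

#include <cmath>
#include <numbers>
#include <random>

// Hypothetical, simplified Box-Muller distribution: it produces variates in
// pairs and caches the second one, returning it on the next call without
// touching the engine at all.
template <class RealType = double>
class boxmuller_normal {
    RealType cached_{};
    bool has_cached_ = false;
public:
    template <class Engine>
    RealType operator()(Engine& g) {
        if (has_cached_) {                  // second variate of the previous pair
            has_cached_ = false;
            return cached_;
        }
        std::uniform_real_distribution<RealType> u01(0., 1.);
        RealType u1 = RealType(1) - u01(g); // shift to (0, 1] so that log(u1) is finite
        RealType u2 = u01(g);
        RealType r = std::sqrt(RealType(-2) * std::log(u1));
        RealType phi = RealType(2) * std::numbers::pi_v<RealType> * u2;
        cached_ = r * std::sin(phi);        // keep the second variate for the next call
        has_cached_ = true;
        return r * std::cos(phi);
    }
};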
This results in an amusing corner-case:
std::mt19937 E1(777);
std::mt19937 E2(777);
std::normal_distribution D(0., 1.); // Box-Muller

for (;;) {
    double res1 = D(E1);
    double res2 = D(E2);
}
One may naively expect that res1 == res2, because the values are expected to be consumed from two engines in the same starting state. But in fact, in most existing C++ standard library implementations, all numbers will be generated using E1, while the E2 engine will never be used. The current standard wording specifies the output of a distribution object only for "successive invocations with the same [engine] objects", so the result above is unspecified and the observed behavior is valid.
2.2.1. Engine template parameter
In our first example, both engine and distribution are constructed with knowledge of the std::simd type that will be requested from them.
std::mt19937<std::fixed_size_simd<std::uint_fast32_t, 16>> E(777);
std::normal_distribution<std::fixed_size_simd<double, 16>> D(0., 1.);
auto rand = D(E);
Note that the std::simd type in the engine and the std::simd type in the distribution are different. The number of values produced by the engine per value required by the distribution is currently unspecified in the standard (even in the scalar case), but in most existing implementations two values of std::uint_fast32_t are consumed from the engine per double returned by the distribution.
Implementation-wise it may (or may not) be more optimal to define it as:
// SIMD size 32 passed to the engine
std::mt19937<std::fixed_size_simd<std::uint_fast32_t, 32>> E(777);
// SIMD size 16 passed to the distribution
std::normal_distribution<std::fixed_size_simd<double, 16>> D(0., 1.);
auto rand = D(E);
Moving towards more sophisticated distributions, acceptance-rejection methods become more common; they consume a varying number of engine values per resulting value.
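As a hedged illustration (not part of any proposed API), a Marsaglia-polar-style sampler shows why the consumption rate varies: every rejected attempt costs two more engine values.

#include <cmath>
#include <random>

// Each accepted variate may consume 2, 4, 6, ... uniform values from the
// engine, so the number of engine values per result is not fixed.
template <class Engine>
double polar_normal(Engine& g) {
    std::uniform_real_distribution<double> u(-1., 1.);
    for (;;) {
        double x = u(g), y = u(g);      // two engine-backed uniforms per attempt
        double s = x * x + y * y;
        if (s > 0. && s < 1.)           // reject points outside the unit disc
            return x * std::sqrt(-2. * std::log(s) / s);
    }
}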
Thus, there is no universally good answer for the proper std::simd size of the engine output that would best fit the std::simd output of a given distribution.
If the engine can generate values only in packs of a fixed size (defined by the type of the simd parameter) and the distribution consumes such a pack only partially, the question of what to do with the remaining numbers arises:
- The remaining part can be stored in the distribution object, either as raw engine values or as values already produced but not yet returned to the application. That brings two more oddities:
  - The distribution is unaware of the pack size returned from the engine at construction point; it only receives the engine type in operator(). The size of the potential tail is unknown, so statically allocating storage for it is problematic.
  - It extends the corner-case example above into a more complicated form, where it is unclear what the actual source of the randomness was: the current function argument or something from the past.
- The remaining part can be discarded. While not entirely thrifty, it might not be that bad a choice.
  - Note: In this case there is no solid reason for the distribution to know about the std::simd type at construction point, except to provide the corresponding member function in the API.
Having all those implementation implications in mind, let us look at our example:
constexpr std::size_t size = 16;
std::mt19937<std::fixed_size_simd<std::uint_fast32_t, size>> E(777);
std::normal_distribution<std::fixed_size_simd<double, size>> D(0., 1.);

double v0 = 0, v1 = 0;
std::size_t p = 0;
for (; p + size <= npath; p += size) {
    auto rand = D(E); // std::fixed_size_simd<double, size>
    auto res = std::max(0., S * exp(sqrt(T) * vol * rand + T * mu) - X);
    v0 += std::reduce(res);
    v1 += std::reduce(res * res);
}
// npath is a runtime parameter - it may not be divisible by 16
if (p != npath) {
    auto rand_tail = D(E);
    auto res = std::max(0., S * exp(sqrt(T) * vol * rand_tail + T * mu) - X);
    for (std::size_t i = 0; p + i < npath; ++i) {
        v0 += res[i];
        v1 += res[i] * res[i];
    }
    // discarding the unused part of res
}
result = v0 / npath;
confidence = v1 / npath;
If npath is not a multiple of the std::simd size, this example generates more random numbers than actually needed in the last iteration and throws away the unused part. That makes the example not an exact match to our initial example, but it is as close as we can get.
There is also an open question of what discard() means for such an engine: whether it discards a pack of 16 values or only one element.
Note: We do not combine the tail logic with the main loop, in order to keep the hot path as optimized as possible.
2.2.2. Engine template parameter + rebind constructor
One may overcome the main limitation of this approach by allowing rebind construction of the engine with a different std::simd width and scalar type.
constexpr std::size_t size = 16;
std::mt19937<std::fixed_size_simd<std::uint_fast32_t, size>> E(777);
std::normal_distribution<std::fixed_size_simd<double, size>> D(0., 1.);

double v0 = 0, v1 = 0;
std::size_t p = 0;
for (; p + size <= npath; p += size) {
    auto rand = D(E);
    auto res = std::max(0., S * exp(sqrt(T) * vol * rand + T * mu) - X);
    v0 += std::reduce(res);
    v1 += std::reduce(res * res);
}
if (p != npath) {
    std::mt19937 E_tail(E);                  // rebinding to the scalar type
    std::normal_distribution D_tail(0., 1.); // getting a scalar distribution
    for (; p < npath; ++p) {
        auto rand_tail = D_tail(E_tail);
        auto res = std::max(0., S * exp(sqrt(T) * vol * rand_tail + T * mu) - X);
        v0 += res;
        v1 += res * res;
    }
    E = E_tail; // rebinding back to keep the original engine in the expected state
}
result = v0 / npath;
confidence = v1 / npath;
While user-level flexibility is gained, a set of redundant copy operations is introduced into the code, which brings visible overhead, especially for engines with a larger state.
2.2.3. Algorithm-like function template parameter
An alternative solution would be to shift the knowledge of the std::simd size from the engine/distribution construction point to the point of usage. This leaves the engine unaware of its main usage mode, so its internal layout would be chosen by the implementation in some balanced form that allows both packed and scalar generation.
std::mt19937 E(777);
std::normal_distribution D(0., 1.);
auto rand = std::generate_random_simd<std::fixed_size_simd<double, 16>>(E, D);
While it limits bare-metal optimization of the engine layout, it gives the standard library significant implementation freedom: the way base random numbers are consumed from an engine may vary from platform to platform. It also brings more flexibility on the usage side.
constexpr std::size_t size = 16;
std::mt19937 E(777);
std::normal_distribution D(0., 1.);

double v0 = 0, v1 = 0;
std::size_t p = 0;
for (; p + size <= npath; p += size) {
    auto rand = std::generate_random_simd<std::fixed_size_simd<double, size>>(E, D);
    auto res = std::max(0., S * exp(sqrt(T) * vol * rand + T * mu) - X);
    v0 += std::reduce(res);
    v1 += std::reduce(res * res);
}
// the scalar tail matches the original example implementation
for (; p < npath; ++p) {
    auto rand_tail = D(E);
    auto res = std::max(0., S * exp(sqrt(T) * vol * rand_tail + T * mu) - X);
    v0 += res;
    v1 += res * res;
}
result = v0 / npath;
confidence = v1 / npath;
This example matches the initial example exactly, and the user is not forced to copy engine state to deal with tails. Moreover, if the performance gain justifies it, one may modify the example to optimize the tail computation further by generating a sequence of decreasing std::simd sizes of 8, 4, 2 and 1, as sketched below.
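A hedged sketch of that tail strategy, continuing the previous example: it relies on the same hypothetical std::generate_random_simd and the simd-enabled math overloads assumed throughout this paper, so it is illustrative rather than proposed wording.

// Process the remaining npath - p elements (< 16 of them) with packs of
// decreasing width; widths 8 + 4 + 2 + 1 cover any possible remainder.
auto tail_step = [&]<std::size_t W>() {
    if (p + W <= npath) {
        auto rand = std::generate_random_simd<std::fixed_size_simd<double, W>>(E, D);
        auto res = std::max(0., S * exp(sqrt(T) * vol * rand + T * mu) - X);
        v0 += std::reduce(res);
        v1 += std::reduce(res * res);
        p += W;
    }
};
tail_step.template operator()<8>();
tail_step.template operator()<4>();
tail_step.template operator()<2>();
tail_step.template operator()<1>();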
3. SIMD-level API vs High-level API
There is no single answer to whether the high-level or the SIMD-level API should be used to optimize a given piece of scalar code with random number generation.
The high-level API requires additional buffers and may need additional blocking to achieve good performance.
The low-level API may provide top performance, but at the same time it puts a number of burdens on the user code: more effort for dealing with tails, and more restructuring to make all the surrounding code std::simd-fashioned.
At the same time, any of the discussed low-level APIs can be used under the hood of the higher-level API.
- The engine-template-parameter approach would require more API tuning under the hood of the high-level API. Some kind of rebind operation would be needed to transform an incoming legacy engine into one with a std::simd template parameter.
- The algorithm-like approach can be used seamlessly, without additional modifications; see the sketch below.
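For example, a P1068-style bulk generation function could be layered on top of the algorithm-like low-level API roughly as follows. This is a hedged sketch: generate_random_on_simd, the fixed pack width, and the element-access details are assumptions for illustration, not proposed wording.

#include <cstddef>
#include <ranges>

// Hypothetical high-level bulk generation implemented on top of the
// algorithm-like SIMD API from section 2.2.3.
template <std::ranges::sized_range R, class Engine, class Distribution>
void generate_random_on_simd(R&& range, Engine& engine, Distribution& dist) {
    using T = std::ranges::range_value_t<R>;
    constexpr std::size_t W = 16;          // implementation-chosen pack width
    auto it = std::ranges::begin(range);
    std::size_t n = std::ranges::size(range);
    std::size_t i = 0;
    for (; i + W <= n; i += W) {
        auto pack = std::generate_random_simd<std::fixed_size_simd<T, W>>(engine, dist);
        for (std::size_t j = 0; j < W; ++j)
            *it++ = pack[j];               // store the pack into the output range
    }
    for (; i < n; ++i)
        *it++ = dist(engine);              // scalar tail
}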
With that we believe that both API levels make sense for the standard library.
4. Summary
From the analysis of RNG APIs for vectorization support, we conclude that:
- The high-level API aims to support the majority of RNG-based applications written by a regular C++ developer. By letting vendors enable hardware vectorization capabilities under the hood of the API implementation, it would further improve the performance of C++ RNG-based applications. Blocking techniques for further performance improvement are also possible on top of the same API. The implementation of this API might rely on SIMD capabilities.
- The low-level API aims at advanced developers who need to code closer to the hardware while staying within C++ conventions. Still, the definition of a SIMD-based RNG API should avoid forcing developers to understand the SIMD-related computational specifics of RNG in terms of the engine and distribution concepts. The key to that might be a single point of SIMD configuration in the API, for example a SIMD-parameterized generation function. How this approach correlates with SIMD use in other domains, for example mathematical functions, remains an open question.
Based on these observations, having both a high-level and a low-level vectorized RNG API in the C++ standard library looks reasonable.