1. Motivation
Throughout this document "malloc" refers to the implementation of ::operator new,
both as fairly standard practice for implementers and to
make clear the distinction between the interface and the implementation.
Everyone’s favorite dynamic data structure, std::vector, allocates memory with
code that looks something like this (with many details, like allocator support,
templating for element types other than char, and exception safety, elided):
void vector::reserve(size_t new_cap) {
  if (capacity_ >= new_cap) return;
  const size_t bytes = new_cap;
  void* newp = ::operator new(new_cap);
  memcpy(newp, ptr_, capacity_);
  ptr_ = newp;
  capacity_ = bytes;
}
Consider the sequence of calls:
std::vector<char> v;
v.reserve(37);
// ...
v.reserve(38);
All reasonable implementations of malloc round sizes, both for alignment requirements and improved performance. It is extremely unlikely that malloc provided us exactly 37 bytes. We do not need to invoke the allocator here...except that we don’t know that for sure, and to use the 38th byte would be undefined behavior. We would like that 38th byte to be usable without a roundtrip through the allocator.
This paper proposes an API that makes it safe to use that byte, and explores many of the design choices (not all of which are obvious without implementation experience).
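To make the rounding concrete, here is a minimal sketch of a size-class lookup. The table kSizeClasses and the helpers round_to_size_class and fits_without_reallocating are hypothetical names invented for illustration; real allocators use their own tables and policies.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical size classes, for illustration only.
constexpr std::size_t kSizeClasses[] = {8, 16, 32, 48, 64, 128};

// Returns the smallest class that can hold `request`, or 0 if the request
// is larger than every class in this toy table.
constexpr std::size_t round_to_size_class(std::size_t request) {
  for (std::size_t c : kSizeClasses) {
    if (c >= request) return c;
  }
  return 0;
}

// True when a block obtained for `old_cap` bytes is already large enough
// to hold `new_cap` bytes, i.e. no second allocation would be needed.
inline bool fits_without_reallocating(std::size_t old_cap,
                                      std::size_t new_cap) {
  const std::size_t have = round_to_size_class(old_cap);
  assert(have != 0 && "old_cap must fit in the toy table");
  return new_cap <= have;
}
```

Under this assumed table, requests for 37 and 38 bytes land in the same 48-byte class, so the second reserve call in the running example would not need to reach the allocator if the caller could learn the true size.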
1.1. nallocx: not as awesome as it looks
The simplest way to help here is to provide an informative API answering the
question "If I ask for N bytes, how many do I actually get?" [jemalloc] calls
this nallocx. We can then use that hint as a smarter parameter for operator
new:
void vector::reserve(size_t new_cap) {
  if (capacity_ >= new_cap) return;
  const size_t bytes = nallocx(new_cap, 0);
  void* newp = ::operator new(bytes);
  memcpy(newp, ptr_, capacity_);
  ptr_ = newp;
  capacity_ = bytes;
}
This is a good start, and does in fact work to allow vector and friends to use the true extent of returned objects. But there are three significant problems with this approach.
1.1.1. nallocx must give a conservative answer
While many allocators have a deterministic map from requested size to allocated
size, it is by no means guaranteed that all do. Presumably they can make a
reasonably good guess, but if two calls to ::operator new(37) might return 64
and 128 bytes, we’d definitely rather know the right answer, not a conservative
approximation.
1.1.2. nallocx duplicates work
Allocation is often a crucial limit on performance. Most allocators compute
the returned size of an object as part of fulfilling that allocation...but if
we make a second call to nallocx, we duplicate all that work, and
also pay the overhead of a second function call.
1.1.3. nallocx hides information from malloc
The biggest problem (for the authors) is that nallocx discards information
malloc finds valuable (the user’s intended allocation size). That is: in our
running example, malloc normally knows that the user wants 37 bytes (then 38),
but with nallocx, we will only ever be told that they want 40 (or 48, or
whatever nallocx(37) returns).
Google’s malloc implementation (TCMalloc) rounds requests to one of a small (<100) number of sizeclasses: we maintain local caches of appropriately sized objects, and cannot do this for every possible size of object. Originally, these sizeclasses were just reasonably evenly spaced among the range they cover. Since then, we have used extensive telemetry on allocator use in the wild to tune these choices. In particular, as we know (approximately) how many objects of any given size are requested, we can solve a fairly simple optimization problem to minimize the total internal fragmentation for any choice of N sizeclasses.
Widespread use of nallocx breaks this. By the time TCMalloc’s telemetry sees
a request that was hinted by nallocx, to the best of our knowledge the user wants exactly as many bytes as we currently provide them. If a huge number
of callers wanted 40 bytes but were currently getting 48, we’d lose the ability
to know that and optimize for it.
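The effect on telemetry can be sketched with a toy model. Everything below is an assumption made for illustration: the histogram data is invented, and SizeHistogram, round_up, total_waste, and hinted are hypothetical helpers, not TCMalloc interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Requested size in bytes -> number of requests (invented toy data).
using SizeHistogram = std::map<std::size_t, std::size_t>;

// Smallest class that holds `request`; `classes` must be sorted ascending.
inline std::size_t round_up(const std::vector<std::size_t>& classes,
                            std::size_t request) {
  for (std::size_t c : classes) {
    if (c >= request) return c;
  }
  return request;  // toy fallback: oversized requests are served exactly
}

// Internal fragmentation: bytes handed out minus bytes actually requested.
inline std::size_t total_waste(const std::vector<std::size_t>& classes,
                               const SizeHistogram& requests) {
  std::size_t waste = 0;
  for (const auto& [size, count] : requests) {
    waste += (round_up(classes, size) - size) * count;
  }
  return waste;
}

// The histogram telemetry records when every caller pre-rounds its request
// with an nallocx-style hint computed from `classes`.
inline SizeHistogram hinted(const std::vector<std::size_t>& classes,
                            const SizeHistogram& requests) {
  SizeHistogram seen;
  for (const auto& [size, count] : requests) {
    seen[round_up(classes, size)] += count;
  }
  return seen;
}

inline void telemetry_example() {
  const std::vector<std::size_t> current = {32, 48, 64};
  const std::vector<std::size_t> candidate = {32, 40, 48, 64};
  const SizeHistogram truth = {{40, 1000}};

  // With true request sizes, adding a 40-byte class is a clear win.
  assert(total_waste(current, truth) == 8000);
  assert(total_waste(candidate, truth) == 0);

  // With hinted sizes, every request looks like a perfect 48-byte fit, so
  // the candidate table shows no measurable benefit.
  const SizeHistogram seen = hinted(current, truth);
  assert(total_waste(current, seen) == 0);
  assert(total_waste(candidate, seen) == 0);
}
```

In this model the 8000 wasted bytes are visible only when the allocator sees the 40-byte requests directly; once callers round through the hint, the same workload appears to waste nothing.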
Note that we can’t take the same telemetry from nallocx calls: we have no
idea how many times the resulting hint will be used (we might not allocate at
all, or we might cache the result and make a million allocations guided by it).
We would also lose important information in the stack traces we collect from
allocation sites.
Optimization guided by malloc telemetry has been one of our most effective
tools in improving allocator performance. It is important that we fix this
issue without losing the ground truth of what a caller of ::operator new
wants.
These three issues explain why we don’t believe nallocx is a sufficient
solution here.
1.2. after allocation is too late
Another obvious suggestion is to add a way to inspect the size of an object
returned by ::operator new. Most mallocs provide a way to do this; [jemalloc] calls it
sallocx. Vector would look like:
void vector::reserve(size_t new_cap) {
  if (capacity_ >= new_cap) return;
  void* newp = ::operator new(new_cap);
  const size_t bytes = sallocx(newp, 0);  // jemalloc's sallocx takes (ptr, flags)
  memcpy(newp, ptr_, capacity_);
  ptr_ = newp;
  capacity_ = bytes;
}
This is worse than nallocx. It fixes the non-constant size problem and avoids
a feedback loop, but the performance cost is higher (this is the major issue fixed by [SizedDelete]!). More seriously, the above code invokes UB as
soon as we touch byte new_cap + 1. We could in principle change the standard,
but this would be an implementation nightmare.
1.3. realloc’s day has passed
We should also quickly examine why the classic C API realloc is insufficient.
void vector::reserve(size_t new_cap) {
  if (capacity_ >= new_cap) return;
  ptr_ = realloc(ptr_, new_cap);
  capacity_ = new_cap;
}
In principle a realloc from 37 to 38 bytes wouldn’t carry the full cost of allocation. But it’s dramatically more expensive than making no call at all. What’s more, there are a number of more complicated dynamic data structures that store variable-sized chunks of data but are never actually resized. These data structures still deserve the right to use all the memory they’re paying for.
Furthermore, realloc’s original purpose was not to allow the use of more bytes
the caller already had, but to (hopefully) extend an allocation in place into
adjacent free space. In a classic malloc implementation this would actually be
possible...but most modern allocators use variants of slab allocation. Even if
the 65th byte in a 64-byte allocation isn’t in use, the two cannot be combined into
a single object; that byte is almost certainly reserved for the next 64-byte
allocation. In the modern world, realloc serves little purpose.
2. Proposal
We propose adding new overloads of ::operator new that directly inform the
user of the size available to them. C++ makes ::operator new replaceable
(15.5.4.6), allowing a program to provide its own version different from the
implementation.
struct std::return_size_t {};
struct std::sized_ptr_t {
  void* p;
  size_t n;
};

std::sized_ptr_t ::operator new(size_t size, std::return_size_t);
std::sized_ptr_t ::operator new(size_t size, std::align_val_t al, std::return_size_t);
std::sized_ptr_t ::operator new(size_t size, const std::nothrow_t&, std::return_size_t);
std::sized_ptr_t ::operator new(size_t size, std::align_val_t al, const std::nothrow_t&, std::return_size_t);

std::sized_ptr_t ::operator new[](size_t size, std::return_size_t);
std::sized_ptr_t ::operator new[](size_t size, std::align_val_t al, std::return_size_t);
std::sized_ptr_t ::operator new[](size_t size, const std::nothrow_t&, std::return_size_t);
std::sized_ptr_t ::operator new[](size_t size, std::align_val_t al, const std::nothrow_t&, std::return_size_t);
Additionally, we amend 15.5.4.6 (Replacement functions), wording relative to [N4762]:
operator new(std::size_t)
operator new(std::size_t, std::align_val_t)
operator new(std::size_t, const std::nothrow_t&)
operator new(std::size_t, std::align_val_t, const std::nothrow_t&)
std::sized_ptr_t ::operator new(size_t size, std::return_size_t);
std::sized_ptr_t ::operator new(size_t size, std::align_val_t al, std::return_size_t);
std::sized_ptr_t ::operator new(size_t size, const std::nothrow_t&, std::return_size_t);
std::sized_ptr_t ::operator new(size_t size, std::align_val_t al, const std::nothrow_t&, std::return_size_t);
operator delete(void*)
operator delete(void*, std::size_t)
operator delete(void*, std::align_val_t)
operator delete(void*, std::size_t, std::align_val_t)
operator delete(void*, const std::nothrow_t&)
operator delete(void*, std::align_val_t, const std::nothrow_t&)
operator new[](std::size_t)
operator new[](std::size_t, std::align_val_t)
operator new[](std::size_t, const std::nothrow_t&)
operator new[](std::size_t, std::align_val_t, const std::nothrow_t&)
std::sized_ptr_t ::operator new[](size_t size, std::return_size_t);
std::sized_ptr_t ::operator new[](size_t size, std::align_val_t al, std::return_size_t);
std::sized_ptr_t ::operator new[](size_t size, const std::nothrow_t&, std::return_size_t);
std::sized_ptr_t ::operator new[](size_t size, std::align_val_t al, const std::nothrow_t&, std::return_size_t);
operator delete[](void*)
operator delete[](void*, std::size_t)
operator delete[](void*, std::align_val_t)
operator delete[](void*, std::size_t, std::align_val_t)
operator delete[](void*, const std::nothrow_t&)
operator delete[](void*, std::align_val_t, const std::nothrow_t&)
Another signature we could use would be:
enum class return_size_t : std::size_t {};
void* ::operator new(size_t size, std::return_size_t&);
(and so on.) This is slightly simpler to read as a signature, but arguably worse in usage:
std::tie(obj.ptr, obj.size) = ::operator new(37, std::return_size_t{});
// ...vs...
// Presumably the object implementation wants to contain a size_t,
// not a return_size_t.
std::return_size_t rs;
obj.ptr = ::operator new(37, rs);
obj.size = static_cast<std::size_t>(rs);  // scoped enums need an explicit cast
More importantly, this form is less efficient. In practice, underlying malloc
implementations provide actual definitions of ::operator new symbols which
are called like any other function. Passing a reference parameter requires us
to actually return the size via memory.
- Linux ABIs support returning at least two scalar values in registers (even if they’re members of a trivially copyable struct), which can be dramatically more efficient.
- The [MicrosoftABI] returns large types by pointer, but this is no worse than making the reference parameter an inherent part of the API.
Whether we use a reference parameter or a second returned value, the interpretation is the same. Candidate (rough) language for the first overload would be:
[[nodiscard]] std::sized_ptr_t ::operator new(size_t size, const std::return_size_t);
Effects: returns a pair (p, n) with n >= size. Behaves as if p was the return value of a call to ::operator new(n).
The intention is quite simple: we return the "actual" size of the allocation,
and rely on "as if" to do the heavy lifting that lets us use more than size
bytes of the resulting allocation. In particular, this means at no point do
we risk undefined behavior from using more bytes than ::operator new was
called with.
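As a rough illustration of how a container might consume such an overload, the sketch below rewrites the running reserve example. C++ currently requires allocation functions to return void*, so sketch::sized_new is a hypothetical free-function stand-in for the proposed overload, sketch::sized_ptr_t mirrors the proposed type, and the rounding to multiples of 16 is an assumed allocator policy rather than a property of any real malloc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

namespace sketch {

// Stand-in for the proposed std::sized_ptr_t.
struct sized_ptr_t {
  void* p;
  std::size_t n;
};

// Stand-in for the proposed ::operator new(size, std::return_size_t).
// Assumption: the allocator rounds to multiples of 16. Asking
// ::operator new for the rounded size makes all n bytes usable, which is
// one way to satisfy the "as if" wording.
inline sized_ptr_t sized_new(std::size_t size) {
  const std::size_t n = (size + 15) / 16 * 16;
  return {::operator new(n), n};
}

class byte_buffer {
 public:
  byte_buffer() = default;
  byte_buffer(const byte_buffer&) = delete;
  byte_buffer& operator=(const byte_buffer&) = delete;
  ~byte_buffer() { ::operator delete(ptr_); }

  // Returns true if a new allocation was needed.
  bool reserve(std::size_t new_cap) {
    if (capacity_ >= new_cap) return false;
    const sized_ptr_t r = sized_new(new_cap);
    assert(r.n >= new_cap);  // the guarantee n >= size
    if (ptr_ != nullptr) std::memcpy(r.p, ptr_, capacity_);
    ::operator delete(ptr_);
    ptr_ = r.p;
    capacity_ = r.n;  // record the true extent, not the request
    return true;
  }

  std::size_t capacity() const { return capacity_; }

 private:
  void* ptr_ = nullptr;
  std::size_t capacity_ = 0;
};

}  // namespace sketch
```

With the assumed rounding, reserve(37) records a capacity of 48, so a later reserve(38) returns early without touching the allocator, and the allocator still saw the original request for 37 bytes.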
2.1. How many ::operator new’s?
It is unfortunate that we have so many permutations of ::operator new: eight
seems like far more than we should really need! But there really isn’t any
significant runtime cost for having them. Use of raw calls to ::operator new
is relatively rare: it’s a building block for low-level libraries, allocators
([P0401]), and so on, so the cognitive burden on C++ users is low.
The authors have considered other alternatives to the additional overloads. At the Jacksonville meeting, EWG suggested looking at parameter packs.
- Parameter packs do not reduce the number of symbols introduced. Implementers still need to provide implementations of each of the n overloads.
- Retrofitting parameter packs leaves us with more mangled variants. Implementers need to provide both the legacy symbols and the parameter pack-mangled symbols.
2.2. Implementation difficulty
It’s worth reiterating that there’s a perfectly good trivial implementation of these functions:
std::sized_ptr_t ::operator new(size_t n, std::return_size_t) {
  return {::operator new(n), n};
}
Malloc implementations are free to properly override this with a more impactful definition, but this paper poses no significant difficulty for toolchain implementers.
Implementation Experience:
- TCMalloc has developed a (currently internal) implementation. While this requires mapping from an integer size class to the true number of bytes, combining this lookup with the allocation is more efficient as we avoid recomputing the sizeclass itself (given a request) or deriving it from the object’s address.
- jemalloc is prototyping a smallocx function providing a C API for this functionality [smallocx].
2.3. Interaction with Sized Delete
For allocations made with std::return_size_t-returning ::operator new, we need to
relax ::operator delete’s size argument (16.6.2.1 and 16.6.2.2). For
allocations of T, the size quantum used by the allocator may not be a multiple
of sizeof(T), leading to both the original and returned sizes being
unrecoverable at the time of deletion.
Consider the memory allocated by:
using T = std::aligned_storage<16, 8>::type;
std::vector<T> v(4);
The underlying heap allocation is made with ::operator new(64, std::return_size_t{}).
- The memory allocator may return a 72 byte object: Since there is no k such that sizeof(T) * k = 72, we can’t provide that value to ::operator delete(void*, size_t). The only option would be storing 72 explicitly, which would be wasteful.
- The memory allocator may instead return an 80 byte object (5 T’s): We now cannot represent the original request when deallocating without additional storage.
For allocations made with
std::tie(p, m) = ::operator new(n, std::return_size_t{});
we permit ::operator delete(p, s) where n <= s <= m.
This behavior is consistent with [jemalloc]’s sdallocx, where the
deallocation size must fall between the request (n) and the actual allocated
size (m) inclusive.
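The relaxed rule can be checked against the two cases from the example above. This is only a sketch: valid_dealloc_size and container_dealloc_size are hypothetical helpers, and the assumption that a container passes the bytes covered by whole elements to sized delete is ours, not wording from the proposal.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative predicate for the relaxed rule: a deallocation size s is
// acceptable when n <= s <= m, where n was requested and m was returned.
constexpr bool valid_dealloc_size(std::size_t n, std::size_t s,
                                  std::size_t m) {
  return n <= s && s <= m;
}

// Assumed container behavior: pass the bytes covered by whole elements in
// the returned block, since that is all the container can recompute.
constexpr std::size_t container_dealloc_size(std::size_t elem_size,
                                             std::size_t m) {
  return (m / elem_size) * elem_size;
}

inline void sized_delete_example() {
  constexpr std::size_t kElem = 16;     // sizeof(T) in the example
  constexpr std::size_t kRequest = 64;  // 4 elements requested

  // Allocator returned 72 bytes: capacity is 4 elements, so 64 is passed.
  assert(container_dealloc_size(kElem, 72) == 64);
  assert(valid_dealloc_size(kRequest, 64, 72));

  // Allocator returned 80 bytes: capacity is 5 elements, so 80 is passed.
  assert(container_dealloc_size(kElem, 80) == 80);
  assert(valid_dealloc_size(kRequest, 80, 80));

  // A size below the original request is still rejected.
  assert(!valid_dealloc_size(kRequest, 48, 80));
}
```

In both cases the value the container can reconstruct falls inside the permitted range, so neither the original request nor the exact returned size has to be stored.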
2.4. Advantages
It’s easy to see that this approach nicely solves the problems with nallocx
or the like. We pay almost nothing in speed to return an actual-size
parameter; allocator telemetry knows actual request sizes exactly; and we are
told exactly the size we have, without risk of UB.
3. New Expressions
Additionally, we propose expanding this functionality to new expressions by returning:
- For new, pointers to the object created and the end of the allocation.
- For new[], pointers to the initial element of the array and one past the last element of the array.

auto [start, end] = new (std::return_size_t) T[5];
for (T* p = start + 5; p != end; p++) {
  new (p) T;
}
for (T* p = start; p != end; p++) {
  p->DoStuff();
}
for (T* p = start + 5; p != end; p++) {
  p->~T();
}
delete[] start;

The pair of pointers provides convenience for use with iterator-oriented algorithms.
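The proposed new-expression syntax cannot be compiled today, but its shape can be approximated with a helper. In the sketch below, sketch::new_array_with_end is a hypothetical stand-in for new (std::return_size_t) T[count], Widget is an invented element type, and rounding to multiples of 64 bytes is an assumed allocator policy. Unlike the proposal, the emulation destroys every element by hand and releases the raw storage with ::operator delete, because no array cookie exists for delete[] to consult. Exception safety is elided.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

namespace sketch {

struct Widget {
  int value = 7;
};

// Hypothetical stand-in for `new (std::return_size_t) T[count]`.
// Constructs the first `count` elements and returns {start, end}, where
// `end` points one past the last whole T that fits in the returned block.
template <typename T>
std::pair<T*, T*> new_array_with_end(std::size_t count) {
  const std::size_t requested = count * sizeof(T);
  const std::size_t returned = (requested + 63) / 64 * 64;  // assumed policy
  T* start = static_cast<T*>(::operator new(returned));
  for (std::size_t i = 0; i < count; ++i) new (start + i) T;
  return {start, start + returned / sizeof(T)};
}

// Uses the whole allocation and returns the sum of all element values.
inline int use_whole_allocation() {
  auto [start, end] = new_array_with_end<Widget>(5);
  assert(end >= start + 5);

  // Construct elements in the extra space beyond the 5 requested.
  for (Widget* p = start + 5; p != end; ++p) new (p) Widget;

  int sum = 0;
  for (Widget* p = start; p != end; ++p) sum += p->value;

  // Destroy every element, then release the raw storage.
  for (Widget* p = start; p != end; ++p) p->~Widget();
  ::operator delete(start);
  return sum;
}

}  // namespace sketch
```

The [start, end) pair plugs directly into the pointer loops, which is the convenience the iterator-style return is meant to provide.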
We considered alternatives for returning the size.
- We could return the size in units of bytes (minus the array allocation overhead).

  auto [p, sz] = new (std::return_size_t) T[5];
  for (int i = 5; i < sz / sizeof(T); i++) {
    new (&p[i]) T;
  }
  for (int i = 0; i < sz / sizeof(T); i++) {
    p[i].DoStuff();
  }
  for (int i = 5; i < sz / sizeof(T); i++) {
    p[i].~T();
  }
  delete[] p;

- We could return the size in units of T. This leads to an inconsistency between the expected usage for new and new[]:
  - For new, we may only end up fitting a single T into an allocator size quantum, so the extra space remains unusable. If we can fit multiple T into a single allocator size quantum, we now have an array from what was a scalar allocation site. This cannot be foreseen by the compiler as ::operator new is a replaceable function.
  - For new[], the size in units of T can easily be derived from the returned size in bytes.
- We could pass the size in units of T or bytes to the constructor of T:
  - For new, this is especially useful for tail-padded arrays, but neglects default-initialized T.
  - For new[], a common use case is expected to be the allocation of arrays of char, int, etc. The size of the overall array is irrelevant for the individual elements.
- We could return the size via a reference parameter:

  std::return_end<T> end;
  T* start = new (end) T[5];
  for (T* p = start + 5; p != end; p++) {
    new (p) T;
  }
  for (T* p = start; p != end; p++) {
    p->DoStuff();
  }
  for (T* p = start + 5; p != end; p++) {
    p->~T();
  }

  or, demonstrated with bytes:

  std::return_size_t size;
  T* p = new (size) T[5];
  for (int i = 5; i < size / sizeof(T); i++) {
    new (&p[i]) T;
  }
  for (int i = 0; i < size / sizeof(T); i++) {
    p[i].DoStuff();
  }
  for (int i = 5; i < size / sizeof(T); i++) {
    p[i].~T();
  }
  delete[] p;

  (Casts omitted for clarity.)
As discussed for ::operator new in §2 Proposal, a reference parameter poses difficulties for optimizers and involves returning the size via memory (depending on ABI).
For new[] expressions, we considered alternatively initializing the returned
(sz / sizeof(T)) number of elements.
- This would avoid the need to explicitly construct / destruct the elements in the additional returned space (if any). The new-initializer is invoked for the returned number of elements, rather than the requested number of elements. This allows delete[] to destroy the correct number of elements (by storing sz / sizeof(T) in the array allocation overhead).
- The presented proposal (leaving this space uninitialized) was chosen for consistency with new.
4. Related work
[AllocatorExt] considered this problem at the level of the Allocator
concept. Ironically, the lack of the above API was one significant problem: how
could an implementation of std::allocator provide the requested feedback in a
way that would work with any underlying malloc implementation?
If this proposal is accepted, it’s likely that [AllocatorExt] should be taken up again.
5. History
5.1. R1 → R2
Applied feedback from San Diego Mailing
- Moved from passing the std::return_size_t parameter by reference to by value. For many ABIs, this is more optimizable and, to the authors’ knowledge, no worse on any other.
- Added rationale for not using parameter packs for this functionality.
5.2. R0 → R1
Applied feedback from [JacksonvilleMinutes].
- Clarified in §2 Proposal the desire to leverage the existing "replacement functions" wording of the IS, particularly given the close interoperation with the existing ::operator new / ::operator delete implementations.
- Added a discussion of the Microsoft ABI in §2 Proposal.
- Noted in §2.1 How many ::operator new’s? the possibility of using a parameter pack.
- Added a proposal for §3 New Expressions, as requested by EWG.
Additionally, a discussion of §2.3 Interaction with Sized Delete has been added.