Changes in this revision:

- Renamed atomic_swap and atomic_compare_swap to atomic_exchange and atomic_compare_exchange, respectively.
- Added atomic_is_lock_free.
- Added the swap to exchange edit to the proposed text, at the committee's request.
The level of thread safety offered by shared_ptr
is the default for
the standard library: const operations on the same instance can be performed concurrently
by multiple threads, but updates require exclusive access. Some scenarios require
stronger guarantees; there is considerable demand (on the Boost mailing lists and
otherwise) for an atomic shared_ptr
, one that can withstand
concurrent updates from two threads without synchronization. This is often motivated
by the desire to port an existing scalable pattern that relies on atomic operations
on raw pointers in combination with garbage collection.
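For reference, the raw-pointer pattern being ported typically looks like the following sketch (the State type, the publish function name, and the use of std::atomic are illustrative assumptions; reclamation of the displaced object is left to the garbage collector that such environments provide):

#include <atomic>

struct State { /* application data */ };

std::atomic<State*> ps{ new State };

State* reader()
{
    // readers take a snapshot with a single atomic load
    return ps.load( std::memory_order_acquire );
}

void publish( State* next )
{
    // the writer installs a new snapshot; the displaced object is not freed
    // here, but reclaimed later by the garbage collector once no reader can
    // still hold a pointer to it
    ps.store( next, std::memory_order_release );
}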
We propose the addition of the following functions to <memory>
to address this use case:
template<class T> bool atomic_is_lock_free( shared_ptr<T> const * p );

template<class T> shared_ptr<T> atomic_load( shared_ptr<T> const * p );
template<class T> shared_ptr<T> atomic_load_explicit( shared_ptr<T> const * p, memory_order mo );

template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r );
template<class T> void atomic_store_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r );
template<class T> shared_ptr<T> atomic_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

template<class T> bool atomic_compare_exchange( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w );
template<class T> bool atomic_compare_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w, memory_order success, memory_order failure );
A typical example scenario — a "lock-free" reader/writer pattern — that takes advantage of this functionality is outlined below:
shared_ptr<State> ps;

void reader()
{
    shared_ptr<State const> p = atomic_load( &ps );
    use *p;
}

// single writer case
void writer()
{
    shared_ptr<State> p( new State( *ps ) );
    update *p reflecting new information;
    atomic_store( &ps, move( p ) );
}

// or, multiple writers case
void writer()
{
    shared_ptr<State> p = atomic_load( &ps );
    shared_ptr<State> q;

    do
    {
        q.reset( new State( *p ) );
        update *q reflecting new information;
    }
    while( !atomic_compare_exchange( &ps, &p, move( q ) ) );
}
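For concreteness, here is a hedged, compilable rendering of the multiple-writers case, assuming the proposed <memory> additions are available; the State type and its value member are illustrative and not part of the proposal:

#include <memory>
#include <utility>

struct State
{
    int value = 0;   // illustrative payload
};

std::shared_ptr<State> ps( new State );

void writer( int new_value )
{
    std::shared_ptr<State> p = atomic_load( &ps );
    std::shared_ptr<State> q;

    do
    {
        q.reset( new State( *p ) );   // copy the current snapshot
        q->value = new_value;         // apply the update to the copy
    }
    while( !atomic_compare_exchange( &ps, &p, std::move( q ) ) );
    // on failure, p receives the current value of ps and the loop retries with a fresh copy
}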
This feature extends the interface of shared_ptr
in a backward-compatible
way. We believe that it is a strong candidate for addition to the C++0x standard.
It introduces no source- or binary compatibility issues.
The proposed additions have been implemented in Boost (with a slightly different syntax) and will be part of Boost 1.36. Performance tests have identified reader/writer scenarios where the above pattern achieves close to linear scalability, outperforming a read/write lock.
There are several possible approaches that lead to a lock-free implementation: hazard pointers, k-CAS, software transactional memory. The challenge is to consistently outperform the spinlock baseline implementation, and this part is not easy; that is why it makes sense to provide the functionality as part of the standard library.
The spinlock-based implementation is still worth having: for certain tasks and workloads it scales better than mutexes or read/write locks. See the performance section.
A spinlock-based implementation can be done non-intrusively in a TR, but a lock-free
implementation cannot, as it inherently requires access to shared_ptr
internals. We don't want to disallow lock-free implementations because they would
scale better. The area is a subject of active research and having a stable interface
as a target will, hopefully, lead to the emergence of such implementations.
We do not know whether a lock-free implementation will need to change shared_ptr
's
internal details, but the conservative assumption is that it might, which could
potentially require an ABI change unsuitable for a technical report.
Should shared_ptr maintain a version counter?

We believe that the interface is sufficient.
The need for a version counter is motivated by the read-upgrade-write pattern:
void maybe_writer()
{
    obtain a read lock;
    use object;

    if( object needs updating )
    {
        upgrade read lock to write lock;
        update object;
    }
}
which in shared_ptr
terms looks like:
void maybe_writer()
{
    shared_ptr<State> p = atomic_load( &ps );
    shared_ptr<State> q;

    do
    {
        use *p;
        if( object doesn't need updating ) break;

        q.reset( new State( *p ) );
        update *q;
    }
    while( !atomic_compare_exchange( &ps, &p, move( q ) ) );
}
The other possible use for a version counter is to avoid the
ABA problem that is common for CAS-based algorithms. ABA cannot occur in
our case because the storage for the object referenced by p
cannot
be reused while p
is still alive.
Why not an std::atomic specialization?
An std::atomic< shared_ptr<> >
specialization is not an
appropriate way to provide the functionality. The template parameter of std::atomic<T>
is required to have a trivial copy constructor and assignment operator, for good
reasons; calling user code from within the atomic operations is a recipe for disaster.
Since the copy constructor and the assignment operator of shared_ptr
aren't trivial, it is not acceptable to instantiate std::atomic
on
it.
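As an aside, the non-triviality can be checked directly with the type traits facilities; the static_assert below uses the later-standardized std::is_trivially_copyable trait purely as an illustration and is not part of this proposal:

#include <memory>
#include <type_traits>

// std::atomic<T> requires a trivially copyable T; shared_ptr does not qualify,
// because its copy constructor and assignment operator update the reference count.
static_assert( !std::is_trivially_copyable< std::shared_ptr<int> >::value,
               "shared_ptr is not trivially copyable" );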
Providing a separate type atomic_shared_ptr<T>
is a legitimate
alternative. We have instead chosen to propose additions to the existing shared_ptr
interface for the following reasons:
- By adding the operations to the existing shared_ptr interface, we make it clear that the support for atomic access and manipulation does not alter the type or the layout of shared_ptr.
- It remains possible to manipulate existing shared_ptr instances atomically without sacrificing the ability to keep them in containers. In particular, we could have an std::map<Key, shared_ptr<State>> and use the reader-writer pattern on the elements of this map.
- We do not have experience with a dedicated atomic_shared_ptr type and do not feel confident that we can propose the right interface for it.
- It is possible to implement atomic_shared_ptr on top of shared_ptr, but not vice versa.

Why do the functions take their shared_ptr arguments by value rather than by const reference?

The implementation requires that the argument value is private to the function, so that it can manipulate its internal state without some other thread observing inconsistent results; a const reference implies that other threads are free to read the value.

If the argument were passed by const reference, the functions would need to make their own private copies anyway. This is inefficient when the argument is a temporary, as it introduces an extra copy that isn't needed. Pass by value automatically elides the copy when the argument is an rvalue.
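A brief illustration of the by-value argument, again assuming the proposed functions; when the argument is a temporary it is moved into the parameter, so no extra reference count update occurs:

#include <memory>

struct State { /* ... */ };

std::shared_ptr<State> ps;

void example()
{
    // rvalue argument: moved into atomic_store's by-value parameter, no extra copy
    atomic_store( &ps, std::shared_ptr<State>( new State ) );

    // lvalue argument: copied, since the caller keeps its own reference
    std::shared_ptr<State> keep( new State );
    atomic_store( &ps, keep );
}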
Add to the synopsis of <memory>
in [memory]
the
following:
// shared_ptr atomic access:

template<class T> bool atomic_is_lock_free( shared_ptr<T> const * p );

template<class T> shared_ptr<T> atomic_load( shared_ptr<T> const * p );
template<class T> shared_ptr<T> atomic_load_explicit( shared_ptr<T> const * p, memory_order mo );

template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r );
template<class T> void atomic_store_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r );
template<class T> shared_ptr<T> atomic_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

template<class T> bool atomic_compare_exchange( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w );
template<class T> bool atomic_compare_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w, memory_order success, memory_order failure );
Add a new section [util.smartptr.shared.atomic]:

Concurrent access to a shared_ptr instance from multiple threads does not introduce a data race if the access is done exclusively via the functions in this section and the instance is passed as their first argument.

The meaning of the arguments of type memory_order is explained in 29.1/1 [atomics.order] and 29.1/2.

template<class T> bool atomic_is_lock_free( shared_ptr<T> const * p );

    Returns: true if atomic access to *p is lock-free, false otherwise.
    Throws: nothing.

template<class T> shared_ptr<T> atomic_load( shared_ptr<T> const * p );

    Returns: atomic_load_explicit( p, memory_order_seq_cst ).

template<class T> shared_ptr<T> atomic_load_explicit( shared_ptr<T> const * p, memory_order mo );

    Requires: mo shall not be memory_order_release or memory_order_acq_rel.
    Returns: *p.
    Throws: nothing.

template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r );

    Effects: atomic_store_explicit( p, r, memory_order_seq_cst ).

template<class T> void atomic_store_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

    Requires: mo shall not be memory_order_acquire or memory_order_acq_rel.
    Effects: p->swap( r ).
    Throws: nothing.

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r );

    Returns: atomic_exchange_explicit( p, r, memory_order_seq_cst ).

template<class T> shared_ptr<T> atomic_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

    Effects: p->swap( r ).
    Returns: the previous value of *p.
    Throws: nothing.

template<class T> bool atomic_compare_exchange( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w );

    Returns: atomic_compare_exchange_explicit( p, v, w, memory_order_seq_cst, memory_order_seq_cst ).

template<class T> bool atomic_compare_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w, memory_order success, memory_order failure );

    Requires: failure shall not be memory_order_release, memory_order_acq_rel, or stronger than success.
    Effects: If *p is equivalent to *v, assigns w to *p and has synchronization semantics corresponding to the value of success; otherwise, assigns *p to *v and has synchronization semantics corresponding to the value of failure.
    Returns: true if *p was equivalent to *v, false otherwise.
    Throws: nothing.
    Remarks: Two shared_ptr instances are equivalent if they store the same pointer value and share ownership.
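As a non-normative illustration of the equivalence used by atomic_compare_exchange, two shared_ptr instances that point to the same object but do not share ownership (obtainable through the aliasing constructor) are not equivalent:

#include <cassert>
#include <memory>

int main()
{
    std::shared_ptr<int> a( new int( 42 ) );
    std::shared_ptr<int> b = a;                  // same pointer value, shares ownership: equivalent to a

    std::shared_ptr<int> other( new int( 0 ) );
    std::shared_ptr<int> c( other, a.get() );    // aliasing constructor: same pointer value as a,
                                                 // but shares ownership with 'other': not equivalent to a

    assert( a.get() == b.get() && a.get() == c.get() );
    // an atomic_compare_exchange with *v == b would succeed against *p == a,
    // while one with *v == c would fail, even though c == a compares true
}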
Rename the swap
member function to exchange
throughout
chapter 29.
Rename the compare_swap
member function to compare_exchange
throughout chapter 29.
Rename the atomic_swap
function to atomic_exchange
throughout
chapter 29.
Rename the atomic_swap_explicit
function to atomic_exchange_explicit
throughout chapter 29.
Rename the atomic_compare_swap
function to atomic_compare_exchange
throughout chapter 29.
Rename the atomic_compare_swap_explicit
function to atomic_compare_exchange_explicit
throughout chapter 29.
A straightforward implementation would use a per-instance spinlock, obtained from a spinlock pool keyed on a hash of the address of the instance. In pseudocode, the functions will be implemented as follows:
template<class T> shared_ptr<T> atomic_load( shared_ptr<T> const * p )
{
    lock the spinlock for *p;
    shared_ptr<T> r( *p );
    unlock the spinlock for *p;
    return r;
}

template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r )
{
    lock the spinlock for *p;
    p->swap( r );
    unlock the spinlock for *p;
}

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r )
{
    lock the spinlock for *p;
    p->swap( r );
    unlock the spinlock for *p;
    return r;
}

template<class T> bool atomic_compare_exchange( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w )
{
    lock the spinlock for *p;

    if( *p is equivalent to *v )
    {
        p->swap( w );
        unlock the spinlock for *p;
        return true;
    }
    else
    {
        shared_ptr<T> tmp( *p );
        unlock the spinlock for *p;
        tmp.swap( *v );
        return false;
    }
}
Note that the code carefully avoids destroying a non-empty shared_ptr
instance while holding the spinlock, as this may lead to user code being called
and present an opportunity for a deadlock. It also attempts to limit the spinlock
scope to the minimum necessary in order to reduce contention.
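A minimal sketch of the spinlock pool approach, assuming a fixed-size table of flags indexed by a hash of the instance address; the pool size, the hash, and the _sketch suffixes are illustrative and do not reflect the actual Boost code:

#include <atomic>
#include <cstdint>
#include <memory>

// Illustrative pool of spinlocks; static zero-initialization leaves the flags
// clear on common implementations (ATOMIC_FLAG_INIT is the portable per-object spelling).
static std::atomic_flag spinlock_pool[ 41 ] = {};

static std::atomic_flag & spinlock_for( void const * p )
{
    // key the lock on a hash of the instance address
    return spinlock_pool[ reinterpret_cast<std::uintptr_t>( p ) % 41 ];
}

static void lock( std::atomic_flag & f )
{
    while( f.test_and_set( std::memory_order_acquire ) ) { /* spin */ }
}

static void unlock( std::atomic_flag & f )
{
    f.clear( std::memory_order_release );
}

template<class T> std::shared_ptr<T> atomic_load_sketch( std::shared_ptr<T> const * p )
{
    std::atomic_flag & f = spinlock_for( p );

    lock( f );
    std::shared_ptr<T> r( *p );   // copy *p while the lock is held
    unlock( f );

    return r;                     // nothing is destroyed while the lock is held
}

template<class T> void atomic_store_sketch( std::shared_ptr<T> * p, std::shared_ptr<T> r )
{
    std::atomic_flag & f = spinlock_for( p );

    lock( f );
    p->swap( r );                 // r now holds the displaced value
    unlock( f );
}                                 // the displaced value is destroyed here, outside the lock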
An implementation that closely follows this outline has already been committed to the Boost SVN and will be part of Boost release 1.36. The relevant files can be browsed online at:
We used
sp_atomic_mt2_test.cpp to evaluate the performance of the reader-writer
pattern from the overview. For comparison purposes we used
synchronization with a boost::detail::lightweight_mutex
, equivalent
to CRITICAL_SECTION
under Windows, and boost::shared_mutex
,
a reader/writer lock that allows multiple concurrent read locks and a single exclusive
write lock.
The test is intended to emulate a server process that fulfills client read or write requests from a number of worker threads.
We picked the following test parameters:
The results, obtained on a non-hyperthreaded dual core Pentium D (two hardware threads) under Windows XP64, are shown in the following tables.
Time in seconds:
Primitive | 1 thread | 2 threads | 4 threads |
mutex | 8.312 | 20.921 | 42.812 |
rwlock | 8.437 | 23.390 | 46.468 |
atomic access | 8.515 | 9.421 | 18.781 |
Operations per millisecond:
Primitive | 1 thread | 2 threads | 4 threads |
mutex | 120 | 96 | 93 |
rwlock | 119 | 86 | 86 |
atomic access | 117 | 212 | 213 |
It is clear that for this combination of parameters, the atomic access reader/writer pattern offers the best scalability.
Note that we make no claim that the atomic access approach unconditionally outperforms the others, merely that it may be best for certain scenarios that can be encountered in practice.
Thanks to Hans Boehm and Lawrence Crowl for reviewing this paper.
--end