Document number: N2632=08-0142
Programming Language C++, Library Subgroup
 
Peter Dimov, <pdimov@pdimov.com>
Beman Dawes, <bdawes@acm.org>
 
2008-05-16

Shared_ptr atomic access

I. Overview

The level of thread safety offered by shared_ptr is the default for the standard library: const operations on the same instance can be performed concurrently by multiple threads, but updates require exclusive access. Some scenarios require stronger guarantees; there is considerable demand (on the Boost mailing lists and otherwise) for an atomic shared_ptr, one that can withstand concurrent updates from two threads without synchronization. This is often motivated by the desire to port an existing scalable pattern that relies on atomic operations on raw pointers in combination with garbage collection.

We propose the addition of the following members to shared_ptr's interface to address this use case:

shared_ptr<T> atomic_load() const;
void atomic_store( shared_ptr<T> r );
shared_ptr<T> atomic_swap( shared_ptr<T> r );
bool atomic_compare_swap( shared_ptr<T> & v, shared_ptr<T> w );

A typical example scenario — a "lock-free" reader/writer pattern — that takes advantage of this functionality is outlined below:

shared_ptr<State> ps;

void reader()
{
    shared_ptr<State const> p = ps.atomic_load();
    use *p;
}

// single writer case
void writer()
{
    shared_ptr<State> p( new State( *ps ) );
    update *p reflecting new information;
    ps.atomic_store( move( p ) );
}

// or, multiple writers case
void writer()
{
    shared_ptr<State> p = ps.atomic_load();

    do
    {
        shared_ptr<State> q( new State( *p ) );
        update *q reflecting new information;
    }
    while( !ps.atomic_compare_swap( p, move( q ) ) );
}

This feature extends the interface of shared_ptr in a backward-compatible way. We believe that it is a strong candidate for addition to the C++0x standard. It introduces no source- or binary compatibility issues.

The proposed additions have been implemented in Boost and will be part of Boost 1.36. Performance tests have identified reader/writer scenarios where the above pattern achieves close to linear scalability, outperforming a read/write lock.

II. Rationale

Can the implementation be made lock-free? If not, is it still worth having?

There are several possible approaches that lead to a lock-free implementation: hazard pointers, k-CAS, software transactional memory. The challenge is to consistently outperform the spinlock baseline implementation, and this part is not easy; that is why it makes sense to provide the functionality as part of the standard library.

Yes, the spinlock implementation is still worth having. For certain tasks and workloads, it still scales better than mutexes or read/write locks. See the performance section.

Can this be moved to TR2? Is it essential to have it in the standard?

A spinlock-based implementation can be done non-intrusively in a TR, but a lock-free implementation cannot, as it inherently requires access to shared_ptr internals. We don't want to disallow lock-free implementations because they would scale better. The area is a subject of active research and having a stable interface as a target will, hopefully, lead to the emergence of such implementations.

We do not know whether a lock-free implementation will need to change shared_ptr's internal details, but the conservative assumption is that it might, which could potentially require an ABI change unsuitable for a technical report.

Is the interface sufficient? Should shared_ptr maintain a version counter?

We believe that the interface is sufficient.

The need for a version counter is motivated by the read-upgrade-write pattern:

void maybe_writer()
{
    obtain a read lock;
    use object;

    if( object needs updating )
    {
        upgrade read lock to write lock;
        update object;
    }
}

which in shared_ptr terms looks like:

void maybe_writer()
{
    shared_ptr<State> p = ps.atomic_load();

    do
    {
        use *p;

        if( object doesn't need updating ) break;

        shared_ptr<State> q( new State( *p ) );
        update *q;
    }
    while( !ps.atomic_compare_swap( p, move( q ) ) );
}

The other possible use for a version counter is to avoid the ABA problem that is common for CAS-based algorithms. ABA cannot occur in our case because the storage for the object referenced by p cannot be reused while p is still alive.

Should an atomic pointer be provided as a separate type? As an std::atomic specialization?

An std::atomic< shared_ptr<> > specialization is not an appropriate way to provide the functionality. The template parameter of std::atomic<T> is required to have a trivial copy constructor and an assignment operator, for good reasons; calling user code from within the atomic operations is a recipe for disaster. Since the copy constructor and the assignment operator of shared_ptr aren't trivial, it is not acceptable to instantiate std::atomic on it.

Providing a separate type atomic_shared_ptr<T> is a legitimate alternative. We have instead chosen to propose additions to the existing shared_ptr interface for the following reasons:

III. Proposed Text

Add to shared_ptr [util.smartptr.shared] the following:

// [util.smartptr.shared.atomic], atomic access:

shared_ptr atomic_load( memory_order mo = memory_order_seq_cst ) const;
void atomic_store( shared_ptr r, memory_order mo = memory_order_seq_cst );
shared_ptr atomic_swap( shared_ptr r, memory_order mo = memory_order_seq_cst );
bool atomic_compare_swap( shared_ptr & v, shared_ptr w );
bool atomic_compare_swap( shared_ptr & v, shared_ptr w, memory_order success, memory_order failure );

Add a new section [util.smartptr.shared.atomic]:

Concurrent access to a shared_ptr instance from multiple threads does not introduce a data race if the access is done exclusively via the member functions in this section.

The meaning of the arguments of type memory_order is explained in 29.1/1 [atomics.order] and 29.1/2.

shared_ptr atomic_load( memory_order mo = memory_order_seq_cst ) const;

Requires: mo shall not be memory_order_release or memory_order_acq_rel.

Returns: *this.

Throws: nothing.

void atomic_store( shared_ptr r, memory_order mo = memory_order_seq_cst );

Requires: mo shall not be memory_order_acquire or memory_order_acq_rel.

Effects: swap( r ).

Throws: nothing.

shared_ptr atomic_swap( shared_ptr r, memory_order mo = memory_order_seq_cst );

Effects: swap( r ).

Returns: the previous value of *this.

Throws: nothing.

bool atomic_compare_swap( shared_ptr & v, shared_ptr w, memory_order success, memory_order failure );

Requires: failure shall not be memory_order_release, memory_order_acq_rel, or stronger than success.

Effects: If *this is equivalent to v, assigns w to *this and has synchronization semantics corresponding to to the value of success, otherwise assigns *this to v and has synchronization semantics corresponding to to the value of failure.

Returns: true if *this was equivalent to v, false otherwise.

Throws: nothing.

Remarks: two shared_ptr instances are equivalent if they store the same pointer value and share ownership.

bool atomic_compare_swap( shared_ptr & v, shared_ptr w );

Returns: atomic_compare_swap(v, w, memory_order_seq_cst, memory_order_seq_cst).

IV. Implementability

A straightforward implementation would use a per-instance spinlock, obtained from a spinlock pool keyed on a hash of the address of the instance. In pseudocode, the functions will be implemented as follows:

shared_ptr atomic_load() const
{
    lock the spinlock for *this;
    shared_ptr r( *this );
    unlock the spinlock for *this;
    return r;
}

void atomic_store( shared_ptr r )
{
    lock the spinlock for *this;
    swap( r );
    unlock the spinlock for *this;
}

shared_ptr atomic_swap( shared_ptr r )
{
    lock the spinlock for *this;
    swap( r );
    unlock the spinlock for *this;
    return r;
}

bool atomic_compare_swap( shared_ptr & v, shared_ptr w )
{
    lock the spinlock for *this;

    if( *this is equivalent to v )
    {
        swap( w );
        unlock the spinlock for *this;
        return true;
    }
    else
    {
        shared_ptr tmp( *this );
        unlock the spinlock for *this;
        tmp.swap( v );
        return false;
    }
}

Note that the code carefully avoids destroying a non-empty shared_ptr instance while holding the spinlock, as this may lead to user code being called and present an opportunity for a deadlock. It also attempts to limit the spinlock scope to the minimum necessary to prevent contention.

An implementation that closely follows this outline has already been committed to the Boost SVN and will be part of Boost release 1.36. The relevant files can be browsed online at:

V. Performance

We used sp_atomic_mt2_test.cpp to evaluate the performance of the reader-writer pattern from the overview. For comparison purposes we used synchronization with a boost::detail::lightweight_mutex, equivalent to CRITICAL_SECTION under Windows, and boost::shared_mutex, a reader/writer lock that allows multiple concurrent read locks and a single exclusive write lock.

The test is intended to emulate a server process that fulfills client read or write requests from a number of worker threads.

We picked the following test parameters:

The results, obtained on a non-hyperthreaded dual core Pentium D (two hardware threads) under Windows XP64, are shown in the following tables.

Time in seconds:

Primitive 1 thread 2 threads 4 threads
mutex 8.312 20.921 42.812
rwlock 8.437 23.390 46.468
atomic access 8.515 9.421 18.781

Operations per millisecond:

Primitive 1 thread 2 threads 4 threads
mutex 120 96 93
rwlock 119 86 86
atomic access 117 212 213

It is clear that for this combination of parameters, the atomic access reader/writer pattern offers the best scalability.

Note that we make no claim that the atomic access approach unconditionally outperforms the others, merely that it may be best for certain scenarios that can be encountered in practice.


Thanks to Hans Boehm and Lawrence Crowl for reviewing this paper.

--end