ISO/IEC JTC1 SC22 WG21 N3425 = 12-0115 - 2012-09-20
Arch D. Robison, Intel Corp., (arch.robison@intel.com)
Anton Malakhov, Intel Corp., (anton.malakhov@intel.com)
Artur Laksberg, Microsoft Corp., (arturl@microsoft.com)
Introduction
Interfaces
Why Concurrent Erasure Is Not Supported
Deferred Destruction
Alternative: GC-Based Deferred Destruction
Alternative: Copy-Based Interface
Bucket Interface
Alternative: Concurrency-Safe Bucket Interface
Synopsis
Header <concurrent_unordered_map> synopsis
Header <concurrent_unordered_set> synopsis
Concurrency Guarantees
Concurrency-Safe Operations
Non-Blocking
Partial Linearizability
References
We propose adding concurrent variants of the unordered associative containers that guarantee concurrency safety of insertion operations and lookup operations. By concurrency safety, we mean that concurrent execution of the operations does not introduce a data race. Though the existing specifications of STL unordered associative containers permit concurrency safety, they do not require it. We also propose allowing concurrency-safe erasure as an option.
Currently a programmer must serialize insertions with respect to themselves and other operations, even though often there is no logical conflict. For example, consider multiple threads that need to insert keys into a set. Logically, the resulting set should be independent of whether the insertions are concurrent or not. The proposed containers enable the insertions to proceed concurrently.
The concurrent variants can be efficiently implemented using split-ordered lists [SplitOrder], an established technique for implementing lock-free hash tables. Versions of these concurrent variants have been shipping with Intel Threading Building Blocks and the Microsoft Parallel Patterns Library.
The significant syntactic difference is that the prefix unsafe_ is added to the names of methods that are unsafe to invoke concurrently with mutating operations. These methods are:
There are a few unprefixed operations that are not concurrency safe:
There are several solutions to this problem, each with inherent costs. If the committee wishes to support concurrent erasure, we recommend the ``deferred destruction'' approach outlined in the next section, since it is non-blocking and offers the fewest surprises. However, it has a high performance cost, and therefore we recommend that it be an optional feature enabled by a template parameter to the container, to preserve ``pay as you go'' for cases not needing concurrent erasure.
Deferred destruction enables safe concurrent erasure with non-blocking semantics. When an item is erased, it is unlinked from the container so that subsequent lookups cannot find it, but destruction of the item is deferred until no container iterators point to it.
A minor drawback of deferred destruction is that it permits a thread to continue operating on an item that is no longer in the table. However, the alternative of blocking erasure seems worse, since it can lead to deadlock. Clients wanting blocking erasure can add the necessary synchronization themselves, such as by embedding a mutex in each item. Note that the converse, providing a blocking container and making clients who need non-blocking semantics subtract synchronization, is not viable.
Any support for concurrency-safe erasure is not cheap, because it requires thread synchronization to avoid races. Deferred destruction requires, at a minimum, two atomic operations for the acquisition and release of an item. These operations can generate much coherence traffic since they require modifying a cache line. The cost is not pay-as-you-go since these modifications are required if concurrent erasure might happen, even if it never happens. Hence our recommendation is that the feature be enabled or disabled on a per-instantiation basis.
If implementations relying on garbage collection are to be permitted, the specification of erase and consequent deferred destruction need to permit an item to remain alive after the last reference to it is removed, until the garbage collector can reclaim it.
The big advantage of a copy-access interface is that there is no need for expensive atomic operations to track acquisition and release of an item, as in the deferred-destruction approach. However, since the underlying data structure still involves links to things that might be concurrently erased, fully avoiding mutating atomic operations in the internal implementation would have to rely on advanced software techniques for safe memory reclamation (e.g. hazard pointers) or hardware support such as transactional memory.
Copy-based interfaces can also have significant performance advantages because the underlying implementation is free to repack storage on the fly and pack items densely, which is particularly helpful for single-word items.
The big drawback of a copy-based interface is that it is a significant departure from existing practice in the standard library. Thus we are proposing a non-copying interface.
template<typename Set>
void DumpSet(const Set& s) {
using namespace std;
for(typename Set::size_type i=0; i!=s.unsafe_bucket_count(); ++i) {
cout << "bucket " << i << endl;
for( auto j=s.unsafe_begin(i); j!=s.unsafe_end(i); ++j)
cout << *j << endl;
}
}
If a concurrent insert happens while the dump is running, the dump may list some of the elements twice.
To see how this can happen, suppose that the inner loop has dumped each item in bucket 0 and is about to evaluate the loop condition j != s.unsafe_end(0). If an insert happens that causes the number of buckets to double, from some value N to 2N, then bucket 0 splits into two buckets: bucket 0 and bucket N. The dumping routine will eventually dump bucket N, re-outputting items that were in bucket 0.
If the outer loop takes a snapshot of unsafe_bucket_count at the start and does not re-evaluate it, then the problem of duplication is replaced by the problem of omission, since a concurrent insertion might move items to buckets beyond those known at the point of the snapshot.
Here is what the alternative interface would look like:
// bucket interface
size_type bucket_count() const noexcept;
size_type max_bucket_count() const noexcept;
size_type bucket_size(size_type n, size_type c);
size_type bucket(const key_type& key, size_type c) const;
local_iterator begin(size_type n, size_type c);
const_local_iterator begin(size_type n, size_type c) const;
local_iterator end(size_type n, size_type c);
const_local_iterator end(size_type n, size_type c) const;
const_local_iterator cbegin(size_type n, size_type c) const;
const_local_iterator cend(size_type n, size_type c) const;
With this kind of interface, the following code does a reasonable dump
of a concurrent unordered set in the presence of concurrent insert operations.
template<typename Set>
void DumpSet(const Set& s) {
using namespace std;
auto c = s.bucket_count();
for(typename Set::size_type i=0; i!=c; ++i) {
cout << "bucket " << i << endl;
for( auto j=s.begin(i,c); j!=s.end(i,c); ++j)
cout << *j << endl;
}
}
It is reasonable in the sense that all items already in the table are output exactly once.
Any concurrently inserted items might or might not be output.
We seek the committee's opinion on whether the concurrency-safety gains of this alternative bucket interface over our baseline proposal are worth the cost, particularly since we believe the interface is used primarily for debugging. The costs are:
The Intel implementations have some additional methods for recursively divisible ranges, which are useful for parallel traversal of the containers. These are omitted from the proposal since they make sense only within a larger framework for recursively divisible ranges.
namespace std {
template <class Key,
class T,
class Hash = hash<Key>,
class Pred = std::equal_to<Key>,
class Allocator = std::allocator<std::pair<const Key, T> > >
class concurrent_unordered_map
{
public:
// types
typedef Key key_type;
typedef std::pair<const Key, T> value_type;
typedef T mapped_type;
typedef Hash hasher;
typedef Pred key_equal;
typedef Allocator allocator_type;
typedef typename allocator_type::pointer pointer;
typedef typename allocator_type::const_pointer const_pointer;
typedef typename allocator_type::reference reference;
typedef typename allocator_type::const_reference const_reference;
typedef implementation-defined size_type;
typedef implementation-defined difference_type;
typedef implementation-defined iterator;
typedef implementation-defined const_iterator;
typedef implementation-defined local_iterator;
typedef implementation-defined const_local_iterator;
// construct/destroy/copy
explicit concurrent_unordered_map(size_type n = see below,
const hasher& hf = hasher(),
const key_equal& eql = key_equal(),
const allocator_type& a = allocator_type());
template <typename Iterator>
concurrent_unordered_map(Iterator first, Iterator last,
size_type n = see below,
const hasher& hf = hasher(),
const key_equal& eql = key_equal(),
const allocator_type& a = allocator_type());
concurrent_unordered_map(const concurrent_unordered_map&);
concurrent_unordered_map(concurrent_unordered_map&&);
explicit concurrent_unordered_map(const Allocator&);
concurrent_unordered_map(const concurrent_unordered_map&, const Allocator&);
concurrent_unordered_map(concurrent_unordered_map&& table, const Allocator&);
concurrent_unordered_map(initializer_list<value_type>,
size_type = see below,
const hasher& hf = hasher(),
const key_equal& eql = key_equal(),
const allocator_type& a = allocator_type());
~concurrent_unordered_map();
concurrent_unordered_map& operator=(const concurrent_unordered_map&);
concurrent_unordered_map& operator=(concurrent_unordered_map&&);
concurrent_unordered_map& operator=(initializer_list<value_type>);
allocator_type get_allocator() const noexcept;
// size and capacity
bool empty() const noexcept;
size_type size() const noexcept;
size_type max_size() const noexcept;
// iterators
iterator begin() noexcept;
const_iterator begin() const noexcept;
iterator end() noexcept;
const_iterator end() const noexcept;
const_iterator cbegin() const noexcept;
const_iterator cend() const noexcept;
// modifiers
template <class... Args> pair<iterator, bool> emplace(Args&&... args);
template <class... Args> iterator emplace_hint(const_iterator position, Args&&... args);
pair<iterator, bool> insert(const value_type& obj);
template <class P> pair<iterator, bool> insert(P&& obj);
iterator insert(const_iterator hint, const value_type& obj);
template <class P> iterator insert(const_iterator hint, P&& obj);
template <class InputIterator> void insert(InputIterator first, InputIterator last);
void insert(initializer_list<value_type>);
iterator unsafe_erase(const_iterator position);
size_type unsafe_erase(const key_type& key);
iterator unsafe_erase(const_iterator first, const_iterator last);
void clear() noexcept;
void swap(concurrent_unordered_map&);
// Observers
hasher hash_function() const;
key_equal key_eq() const;
// lookup
iterator find(const key_type& key);
const_iterator find(const key_type& key) const;
size_type count(const key_type& key) const;
std::pair<iterator, iterator> equal_range(const key_type& key);
std::pair<const_iterator, const_iterator> equal_range(const key_type& key) const;
mapped_type& operator[](const key_type& key);
mapped_type& operator[](key_type&& key);
mapped_type& at(const key_type& key);
const mapped_type& at(const key_type& key) const;
// bucket interface
size_type unsafe_bucket_count() const noexcept;
size_type unsafe_max_bucket_count() const noexcept;
size_type unsafe_bucket_size(size_type n);
size_type unsafe_bucket(const key_type& key) const;
local_iterator unsafe_begin(size_type n);
const_local_iterator unsafe_begin(size_type n) const;
local_iterator unsafe_end(size_type n);
const_local_iterator unsafe_end(size_type n) const;
const_local_iterator unsafe_cbegin(size_type n) const;
const_local_iterator unsafe_cend(size_type n) const;
// hash policy
float load_factor() const noexcept;
float max_load_factor() const noexcept;
void max_load_factor(float newmax);
void rehash(size_type buckets);
void reserve(size_type n);
};
template <class Key, class T, class Hash, class Pred, class Alloc>
void swap(concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& x,
concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& y);
template <class Key, class T, class Hash, class Pred, class Alloc>
bool operator==(const concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& a,
const concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& b);
template <class Key, class T, class Hash, class Pred, class Alloc>
bool operator!=(const concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& a,
const concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& b);
}
The synopsis for concurrent_unordered_multimap is analogous, and omitted for brevity.
namespace std {
template <class Key,
class Hash = hash<Key>,
class Pred = std::equal_to<Key>,
class Allocator = std::allocator<Key> >
class concurrent_unordered_set
{
public:
// types
typedef Key key_type;
typedef Key value_type;
typedef Hash hasher;
typedef Pred key_equal;
typedef Allocator allocator_type;
typedef typename allocator_type::pointer pointer;
typedef typename allocator_type::const_pointer const_pointer;
typedef typename allocator_type::reference reference;
typedef typename allocator_type::const_reference const_reference;
typedef implementation-defined size_type;
typedef implementation-defined difference_type;
typedef implementation-defined iterator;
typedef implementation-defined const_iterator;
typedef implementation-defined local_iterator;
typedef implementation-defined const_local_iterator;
// construct/destroy/copy
explicit concurrent_unordered_set(size_type n = see below,
const hasher& hf = hasher(),
const key_equal& eql = key_equal(),
const allocator_type& a = allocator_type());
template <typename Iterator>
concurrent_unordered_set(Iterator first, Iterator last,
size_type n = see below,
const hasher& hf = hasher(),
const key_equal& eql = key_equal(),
const allocator_type& a = allocator_type());
concurrent_unordered_set(const concurrent_unordered_set&);
concurrent_unordered_set(concurrent_unordered_set&&);
explicit concurrent_unordered_set(const Allocator&);
concurrent_unordered_set(const concurrent_unordered_set&, const Allocator&);
concurrent_unordered_set(concurrent_unordered_set&& table, const Allocator&);
concurrent_unordered_set(initializer_list<value_type>,
size_type = see below,
const hasher& hf = hasher(),
const key_equal& eql = key_equal(),
const allocator_type& a = allocator_type());
~concurrent_unordered_set();
concurrent_unordered_set& operator=(const concurrent_unordered_set&);
concurrent_unordered_set& operator=(concurrent_unordered_set&&);
concurrent_unordered_set& operator=(initializer_list<value_type>);
allocator_type get_allocator() const noexcept;
// size and capacity
bool empty() const noexcept;
size_type size() const noexcept;
size_type max_size() const noexcept;
// iterators
iterator begin() noexcept;
const_iterator begin() const noexcept;
iterator end() noexcept;
const_iterator end() const noexcept;
const_iterator cbegin() const noexcept;
const_iterator cend() const noexcept;
// modifiers
template <class... Args> pair<iterator, bool> emplace(Args&&... args);
template <class... Args> iterator emplace_hint(const_iterator position, Args&&... args);
pair<iterator, bool> insert(const value_type& obj);
template <class P> pair<iterator, bool> insert(P&& obj);
iterator insert(const_iterator hint, const value_type& obj);
template <class P> iterator insert(const_iterator hint, P&& obj);
template <class InputIterator> void insert(InputIterator first, InputIterator last);
void insert(initializer_list<value_type>);
iterator unsafe_erase(const_iterator position);
size_type unsafe_erase(const key_type& key);
iterator unsafe_erase(const_iterator first, const_iterator last);
void clear() noexcept;
void swap(concurrent_unordered_set&);
// Observers
hasher hash_function() const;
key_equal key_eq() const;
// lookup
iterator find(const key_type& key);
const_iterator find(const key_type& key) const;
size_type count(const key_type& key) const;
std::pair<iterator, iterator> equal_range(const key_type& key);
std::pair<const_iterator, const_iterator> equal_range(const key_type& key) const;
// bucket interface
size_type unsafe_bucket_count() const noexcept;
size_type unsafe_max_bucket_count() const noexcept;
size_type unsafe_bucket_size(size_type n);
size_type unsafe_bucket(const key_type& key) const;
local_iterator unsafe_begin(size_type n);
const_local_iterator unsafe_begin(size_type n) const;
local_iterator unsafe_end(size_type n);
const_local_iterator unsafe_end(size_type n) const;
const_local_iterator unsafe_cbegin(size_type n) const;
const_local_iterator unsafe_cend(size_type n) const;
// hash policy
float load_factor() const noexcept;
float max_load_factor() const noexcept;
void max_load_factor(float newmax);
void rehash(size_type buckets);
void reserve(size_type n);
};
template <class Key, class Hash, class Pred, class Alloc>
void swap(concurrent_unordered_set<Key, Hash, Pred, Alloc>& x,
concurrent_unordered_set<Key, Hash, Pred, Alloc>& y);
template <class Key, class Hash, class Pred, class Alloc>
bool operator==(const concurrent_unordered_set<Key, Hash, Pred, Alloc>& a,
const concurrent_unordered_set<Key, Hash, Pred, Alloc>& b);
template <class Key, class Hash, class Pred, class Alloc>
bool operator!=(const concurrent_unordered_set<Key, Hash, Pred, Alloc>& a,
const concurrent_unordered_set<Key, Hash, Pred, Alloc>& b);
}
The synopsis for concurrent_unordered_multiset is analogous, and omitted for brevity.
The following operations are concurrency safe:

- get_allocator
- empty, size, max_size
- begin, end, cbegin, cend
- insert
- find, count, equal_range, operator[], at
- load_factor
- max_load_factor()
- operator==, operator!=

assuming that the requisite operations on the key type (and mapped_type if applicable) are concurrency safe.
Hence we recommend a weaker guarantee where the container is split into two abstract parts, and require that each part be a linearizable object. The two parts are:
The order of operations on the parts guarantees that, in the absence of concurrent erasure, if a thread inspects the size of the container, then that size is a lower bound on the number of items that it can find in the container. For example, let a be a freshly constructed concurrent associative container, and consider the following execution:
Thread 1 | Thread 2
---|---
a.insert(t); // modify "content"; modify "size" | s = a.size(); // inspect "size"
&nbsp; | c = a.count(t); // inspect "content"

Disallowed outcome: s==1 && c==0
Our requirement for partial linearizability minimizes surprises and delineates where they might occur. Each abstract part behaves in an intuitive manner, and surprises can exist only for properties relating both parts as a whole.
Because linearizability of the entire container is not guaranteed, it is possible for a thread to find an item and subsequently retrieve a size that does not yet include that item. For example, the following execution has four possible outcomes.
Thread 1 | Thread 2
---|---
a.insert(t); // modify "content"; modify "size" | c = a.count(t); // inspect "content"
&nbsp; | s = a.size(); // inspect "size"

Allowed outcomes: c∈{0,1}, s∈{0,1}
In the absence of concurrent erasure, the comparison operations operator== or operator!= on concurrent containers are linearizable. Though these comparison operations were omitted from the TBB/PPL products, they should be straightforward to implement by walking the underlying split-ordered lists. Indeed these walks are probably faster than the means necessary for comparing their non-concurrent counterparts.