ISO/IEC JTC1 SC22 WG14 N1525 – 2010-10-11
Paul E. McKenney, paulmck@linux.vnet.ibm.com
Blaine Garst, blaine@apple.com
The C-language Committee Draft (n1494) contains no fewer than six memory-order options:
memory_order_relaxed
memory_order_consume
memory_order_acquire
memory_order_release
memory_order_acq_rel
memory_order_seq_cst
The purpose of this document is to demonstrate the purpose of each of these options by way of “toy” example code fragments, and also show how some varieties of the memory fences may be used. This paper is based on n2153.
memory_order_relaxedWith memory_order_relaxed, the compiler is required to
load and store atomically, so that any load sees the results of exactly
one store (or the initial value), but there are no ordering constraints.
This example uses memory_order_relaxed in conjunction with
an explicit fence to process in-shared-memory mailboxes more
efficiently:
const int num_mailboxes = 32;
_Atomic volatile int mailbox[num_mailboxes];
void MailboxInspection(int my_id, void (*do_work)(int)) {
    for (int i = 0; i < num_mailboxes; i++) {
        if (atomic_load_explicit(&mailbox[i],  memory_order_relaxed) == my_id) {
            atomic_thread_fence(memory_order_acquire); // prevent speculative reads in do_work [7.17.4.1]
            do_work(i);
        }
    }
}
Here we scan an array of mailboxes, and process only those whose ID
indicates that they are intended for us.
Although it would be possible to combine the atomic_thread_fence()
into the atomic_load_explicit() by using
memory_order_acquire(), but doing so would execute needless
memory fences in cases where the mailbox is for someone else.
In contrast, the above code only executes a memory fence when a
message needs to be processed, improving performance on weakly ordered
systems (including x86 in the case where certain SSE instructions are used).
Note that sequential consistency would require a heavy-weight fence on each iteration, whether or not the corresponding mailbox was processed.
memory_order_consumeWith memory_order_consume, the compiler and CPU are required
to order the load in question against only those subsequent loads and stores
whose address or value are computed from the value loaded.
All mainstream CPUs (other than DEC Alpha) provide the required ordering
without the need for any explicit memory-fence instructions, which
means that code using memory_order_consume can attain
extremely high levels of performance and scalability.
These constraints allow memory_order_consume to be used to
implement
RCU,
which in turn enables
high performance and highly scalable access to read-mostly data.
RCU is implemented in the Linux kernel and also
as a user-space library.
In addition, a garbage collector may be used to emulate RCU
(in which case rcu_read_lock() and rcu_read_unlock()
can be no-ops).
The locking primitives can be implemented using any convenient method,
including pthread_mutex_t.
The following example shows how RCU can be used to support lockless
read-side accesses to a hash table:
enum { NUM_BUCKETS = 128; };
struct foo {
    _Atomic volatile struct foo *next;
    int key;
};
_Atomic volatile struct foo *hashtable[NUM_BUCKETS];
DEFINE_LOCK(foo_lock);
int find_foo(int key) {
    struct foo *p;
    int retval;
    
    rcu_read_lock();
    // Linux kernel: p = rcu_dereference(hashtable[foo_hash(key)]);
    p = atomic_load_explicit(&hashtable[foo_hash(key)], memory_order_consume);
    while (p != NULL && p-<key < key) {
        // Linux kernel: p = rcu_dereference(p-<next);
        p = atomic_load_explicit(&p-<next, memory_order_consume);
    }
    retval = p != NULL && p-<key == key;
    rcu_read_unlock();
    return retval;
}
int insert_foo(int key) {
    struct foo *newp, *p;
    _Atomic volatile struct foo **plast;
    int retval = 0;
    spin_lock(&foo_lock);   // assume acquire-release at minimum
    plast = &hashtable[foo_hash(key)];
    p = *plast;
    while (p != NULL && p-<key < key) {
        p = p-<next;
        plast = &p-<next;
    }
    if (p == NULL || p-<key != key) {
        newp = malloc(sizeof(*newp));
        if (newp != NULL) {
            newp-<key = key;
            newp-<next = p;
            // Linux kernel: rcu_assign_pointer(*plast, newp);
            atomic_store_explicit(*plast, newp, memory_order_release);
            retval = 1;
        }
    }
    lock_release(&foo_lock);
    return retval;
}
The key point is that the atomic_load_explicit() using
memory_order_consume guarantees that subsequent accesses
will see any initialization carried out by insert_foo(),
even if they are executing concurrently, and without the overhead of
explicit memory-fence instructions.
In constrast, memory_order_acquire would require explicit
memory barriers on weakly ordered systems and would overconstrain
compiler optimizations on all systems.
memory_order_acquireWith memory_order_acquire, you can construct a
lock-acquisition primitive.
A non-production-quality “toy” implementation follows:
typedef atomic_atomic_int lock_t;
enum { LOCK_UNLOCKED, LOCK_LOCKED };
void acquire_lock(lock_t &lp)
{
	while (atomic_exchange_explicit(lp, 1, memory_order_acquire) == LOCK_LOCKED)
		continue
}
In principle, a sequentially consistent access could be used, but this
would require a full memory fence on (relatively) strongly ordered
systems such as x86 or the IBM mainframe.
In contrast, use of memory_order_acquire permits such
systems to omit explicit memory fences.
memory_order_releaseWith memory_order_release, you can construct a lock-release
primitive, which is a non-production-quality “toy” counterpart
to the acquire_lock() function shown above.
void release_lock(lock_t &lp)
{
	atomic_store_explicit(lp, LOCK_UNLOCKED, memory_order_release);
}
This approach has the same advantages over sequential consistency
noted above for acquire_lock().
Note further that an explicit memory fence can permit several unlock operations to be protected by a single memory-fence instruction:
void MultipleLockRelease() {
    extern void do_work_locking_A_and_B(void);
    
    do_work_locking_A_and_B();
    atomic_thread_fence(memory_order_release);  // ensure memory stores are complete
    atomic_store_explicit(&lock_A, LOCK_UNLOCKED, memory_order_relaxed);
    atomic_store_explicit(&lock_B, LOCK_UNLOCKED, memory_order_relaxed);
}
memory_order_acq_relThe memory_order_acq_rel permits a single atomic operation
and fence to serve for a combined unlock/lock operation.
Consider a set of locks implemented as bits in an integral type.
A single release_acquire_lock() primitive can
release any subset of the locks and acquire any non-held subset.
Of course, a failure return is required in order to avoid deadlock.
In this (naive) implementation, failure results in the locking state being
unchanged.
typedef atomic_int multilock_t;
int release_acquire_lock(multilock_t *mlp, multilock_t relmask, multilock_t acqmask)
{
	multilock_t old, new;
	old = atomic_load_explicit(mlp, memory_order_relaxed);
	do {
		new = old & ~relmask;
		if ((~new & acqmask) != acqmask)
			return 0; // Failure
		new = new | acqmask;
	} while (!atomic_compare_exchange_weak_explicit(&mlp, &old, new, memory_order_acq_rel, memory_order_relaxed));
	return 1; // Success
}
Note that strongly ordered systems such as x86, SPARC, and the IBM mainframe do not need explicit memory-fence instructions in this case. In contrast, fence instructions would be required for sequential consistency.
memory_order_seq_cstThe quintessential litmus test for sequential consistency is
independent reads of independent writes, also known as IRIW.
This is shown below, where
T1,
T2,
T3, and
T4 run as separate threads.
atomic_int x = ATOMIC_VAR_INIT(0);
atomic_int y = ATOMIC_VAR_INIT(0);
int r1, r2, r3, r4;
void T1(void)
{
	atomic_store(&x, 1);
}
void T2(void)
{
	atomic_store(&x, 1);
}
void T3(void)
{
	r1 = atomic_load(&x);
	r2 = atomic_load(&y);
}
void T4(void)
{
	r3 = atomic_load(&y);
	r4 = atomic_load(&x);
}
Sequential consistency precludes the outcome
r1==1&&r2==0&&r3==1&&r4==0,
which would be permitted by the other memory orders.
These examples show that the five weaker memory-order specifiers have an essential role to play in expert-level performance/scalability-critical low-level code. Usage patterns analogous to each of these five weaker memory-order specifiers really do appear in large shared-memory parallel software artifacts, as exemplified by the Linux kernel. Therefore, each of the five has an important role to play in the upcoming C standard.