Document number: N2633=08-0143
Programming Language C++, Library/Concurrency Subgroup
 
Peter Dimov, <pdimov@pdimov.com>
 
2008-05-16

Improved support for bidirectional fences

I. Background

Relaxed (non-sequentially consistent) memory models are typically specified in terms of either acquire/release semantics that introduce ordering between two synchronizing operations from separate threads, or bidirectional fences that introduce ordering between two operations from the same thread. The C++ memory model chooses the first approach, for good reasons. Nevertheless, many existing hardware platforms are actually based on the second and rely on fences, also called memory barriers. The canonical fence-based architecture is Sun SPARC in RMO mode; the widely available IBM POWER architecture is another example. Even the poster child for the acquire/release model, the Intel IA-64 architecture, includes a bidirectional full fence instruction, mf, and so does the x86/x64 architecture (mfence).

A fence is a primitive that enforces ordering between preceding loads or stores and subsequent loads or stores. In SPARC RMO terms the various combinations are called #LoadLoad, #LoadStore, #StoreLoad and #StoreStore and can be combined, leading to a total of 16 combinations (one of which is a no-op, and another is a full fence). In practice, the #StoreLoad barrier is the most expensive and its cost is essentially the same as a full fence; the others are relatively cheap even in combination. These costs have allowed the designers of the POWER architecture to simplify the full set into two instructions, LWSYNC (#LoadLoad | #LoadStore | #StoreStore) and SYNC (a full fence). We'll use lwsync from now on as a generic way to refer to the relatively cheap fence.

(Another possibility is to provide an acquire fence, #LoadLoad | #LoadStore, a release fence, #LoadStore | #StoreStore, and an acquire+release fence, #LoadLoad | #LoadStore | #StoreStore. This corresponds better to the existing terminology in the C++ memory model and we'll return to it later on.)
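
To make the combination arithmetic concrete, here is a sketch of the four barrier categories as bit flags; the numeric values follow our recollection of the SPARC V9 membar mask encoding and are illustrative only:

enum membar_flags
{
    membar_LoadLoad   = 0x01, // #LoadLoad
    membar_StoreLoad  = 0x02, // #StoreLoad
    membar_LoadStore  = 0x04, // #LoadStore
    membar_StoreStore = 0x08  // #StoreStore
};

// the combinations discussed above, expressed as masks
int const acquire_barrier = membar_LoadLoad | membar_LoadStore;
int const release_barrier = membar_LoadStore | membar_StoreStore;
int const lwsync_barrier  = membar_LoadLoad | membar_LoadStore | membar_StoreStore; // POWER LWSYNC
int const full_barrier    = lwsync_barrier | membar_StoreLoad;                      // POWER SYNC, x86 mfence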

While it's trivial to map the acquire/release model to a bidirectional fence model, albeit with a slight theoretical loss in performance, the reverse is not true. This has led to pressure to include fence primitives in the C++ memory model; see for example N2153, which makes a fine case for them. This pressure has resulted in the addition of a fence member function to all atomic types and in the definition of a global atomic object called atomic_global_fence_compatibility.

The motivation for having fence primitives in C++ is fourfold:

  1. It's hard to port existing code that is written for a specific fence-based architecture to the C++ acquire/release memory model. We want C++ to provide a way for rewriting such code in a manner that increases portability without sacrificing efficiency.
  2. Some programmers may be familiar with a fence-based memory model and, ideally, we want to help them be immediately productive in C++.
  3. In many real-world cases, the programmer knows exactly what sequence of instructions he wants to produce on the target hardware. We want to enable these programmers to achieve their goals while still writing portable code. (Stated simply, we want the programmer to be able to emit an lwsync instruction without sacrificing portability and theoretical correctness.)
  4. There are situations in which a fence-based formulation on the source level leads to more efficient code.

The rest of this paper explains why the existing mechanism in the working draft does not meet these challenges, and proposes an alternative that does.

II. Current Approach

The current working draft, N2588, has adopted a mechanism put forward in N2324 that consists of atomic types supporting a dedicated "fence" operation. On the memory model level, this is a read-modify-write operation that leaves the value unchanged. This is an elegant attempt to provide a solution to the conditional synchronization problem described in N2153/N2237 without introducing a separate primitive (which corresponds to our point 4 above). It also aims to address points 1-3, but as a secondary goal.
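
To a first approximation (our gloss, not the draft's wording), a fence operation on an atomic variable z can be modeled as a read-modify-write that stores back the value it read:

// z.fence( mo ) behaves, for the purposes of the memory model, much like
z.fetch_add( 0, mo ); // an RMW that leaves the value of z unchanged

This RMW reading is what drives the analysis that follows.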

At first glance, a programmer and an implementer would reasonably expect these fence operations to be implemented as follows:

Table 1:

                                   POWER     x86, x64
x.fence(memory_order_acquire)      LWSYNC    no-op
x.fence(memory_order_release)      LWSYNC    no-op
x.fence(memory_order_acq_rel)      LWSYNC    no-op
x.fence(memory_order_seq_cst)      SYNC      mfence, or a locked instruction

This implementation turns out not to match the specification. Consider the following:

Example 1:

Thread 1:

x.store( 1, memory_order_relaxed );
z.fence( memory_order_acq_rel );
r1 = y.load( memory_order_relaxed );

Thread 2:

y.store( 1, memory_order_relaxed );
z.fence( memory_order_acq_rel );
r2 = x.load( memory_order_relaxed );

Since read-modify-write operations on a single location z form a total order (1.10/5), and since z.fence is an RMW operation, we must conclude that one of the two z.fence calls precedes the other in this order. As the example is symmetrical, we can pick an order without sacrificing generality; say that the fence operation in thread 1 precedes the fence operation in thread 2. Since the fences in our example have both acquire and release semantics due to the memory_order_acq_rel constraint, they synchronize-with each other, and hence the fence in thread 1 happens before the fence in thread 2. This, in turn, implies that the x.store in thread 1 happens before the x.load in thread 2, and that r2 must be 1.

In other words, non-sequentially consistent outcomes are prohibited by the fences.

If we now look at our table above, we see that the intended LWSYNC or no-op implementation of z.fence(memory_order_acq_rel) is not enough to prohibit the non-sequentially consistent outcome of r1 == r2 == 0. We must conclude that z.fence(memory_order_acq_rel) must be implemented as SYNC or mfence:

Table 2:

                                   POWER     x86, x64
x.fence(memory_order_acquire)      LWSYNC    no-op
x.fence(memory_order_release)      LWSYNC    no-op
x.fence(memory_order_acq_rel)      SYNC      mfence, or a locked instruction
x.fence(memory_order_seq_cst)      SYNC      mfence, or a locked instruction

No problem, we'll just use acquire or release when we need lwsync, right?

Example 2:

Thread 1:

x.store( 1, memory_order_relaxed );
z.fence( memory_order_release );
z.fence( memory_order_acquire );
r1 = y.load( memory_order_relaxed );

Thread 2:

y.store( 1, memory_order_relaxed );
z.fence( memory_order_release );
z.fence( memory_order_acquire );
r2 = x.load( memory_order_relaxed );

Again, we have a total order over the operations on z. No matter how we arrange the four fences in it, one of the release fences will precede the acquire fence in the other thread; the two will synchronize-with each other and order one of the stores before its corresponding load, exactly as in Example 1.

Therefore, the acquire and release fences can't both be LWSYNC or a no-op, as this would again allow the r1 == r2 == 0 outcome that the memory model tells us cannot occur. We are forced to update our implementation table once again:

Table 3:

                                   POWER             x86, x64
x.fence(memory_order_acquire)      SYNC or LWSYNC    mfence or no-op
x.fence(memory_order_release)      LWSYNC or SYNC    no-op or mfence
x.fence(memory_order_acq_rel)      SYNC              mfence
x.fence(memory_order_seq_cst)      SYNC              mfence

This looks disturbing enough, but we aren't done yet. Since fences are read-modify-write operations, they can also synchronize with acquire loads and release stores:

Example 3:

Thread 1:

x.store( 1, memory_order_relaxed );
z.fetch_add( 0, memory_order_release );
z.fence( memory_order_acquire );
r1 = y.load( memory_order_relaxed );

Thread 2:

y.store( 1, memory_order_relaxed );
z.fetch_add( 0, memory_order_release );
z.fence( memory_order_acquire );
r2 = x.load( memory_order_relaxed );

Example 4:

Thread 1:

x.store( 1, memory_order_relaxed );
z.fence( memory_order_release );
z.load( memory_order_acquire );
r1 = y.load( memory_order_relaxed );

Thread 2:

y.store( 1, memory_order_relaxed );
z.fence( memory_order_release );
z.load( memory_order_acquire );
r2 = x.load( memory_order_relaxed );

Using reasoning similar to that for Example 2, and observing the release sequence rules in 1.10/6 and 1.10/7, we can reasonably conclude that both acquire and release fences need to map to SYNC. To be fair, an implementer on the POWER platform still has the option to use LWSYNC for one of the fences, at the cost of moving the SYNC to the load or to the fetch_add, respectively. This is unlikely, however, as it would decrease the performance of a primary primitive in favor of a secondary one. Our final implementation table now looks like this:

Table 4:

                                   POWER    x86, x64
x.fence(memory_order_acquire)      SYNC     mfence
x.fence(memory_order_release)      SYNC     mfence
x.fence(memory_order_acq_rel)      SYNC     mfence
x.fence(memory_order_seq_cst)      SYNC     mfence

In short, on the currently popular platforms the fence primitive is an overengineered way to insert an expensive full fence, and its argument plays no part. This makes it extremely unattractive, especially on the x86/x64 platform. It is quite likely that implementations will deliberately ignore the specification and just provide the obvious implementation in Table 1, which is what programmers expect and can make use of.

Again, to be fair, we must acknowledge the possibility of new hardware being able to take advantage of the current specification and somehow implement the primitive more efficiently. This, however, doesn't help. Any program that uses fences will suffer severe performance penalties when ported to, for example, x86/x64; in practice this renders the fence primitive non-portable and platform-dependent.

Other surprising properties of the specification concern the atomic_global_fence_compatibility atomic object. It is provided as a migration path for programs using traditional address-free fences, as in the following typical example.

Example 5:

Thread 1:

x = 1; // non-atomic

atomic_global_fence_compatibility
    .fence( memory_order_acq_rel );

y.store( 1, memory_order_relaxed );

Thread 2:

if( y.load( memory_order_relaxed ) == 1 )
{
    atomic_global_fence_compatibility
        .fence( memory_order_acq_rel );

    assert( x == 1 );
}

(We can assume that the original version of the code contained a #StoreStore or stronger fence in thread 1, and a #LoadLoad or stronger fence in thread 2.)

This port works, but it's important to understand why. As before, there is a total order on the operations on atomic_global_fence_compatibility, and we must examine both possibilities.

If the fence in thread 1 precedes the fence in thread 2, we conclude from the release semantics of the former and the acquire semantics of the latter that the store to x in thread 1 happens before the load from x in thread 2, the assert passes, and all is well.

If, on the other hand, the fence in thread 2 precedes the fence in thread 1, we conclude in a similar manner that the load from y happens before the store to y, which means that the branch isn't taken and the assert is not executed. Note, however, that this time we used the acquire semantics of the fence in thread 1 and the release semantics of the fence in thread 2.

In short, we needed both fences to have both acquire and release semantics; otherwise we wouldn't have been able to conclude our analysis and prove that the example is correct and the assertion never fails. Even though the original code contained only #StoreStore and #LoadLoad, two of the cheapest fences, we were forced to employ memory_order_acq_rel to achieve equivalent semantics.

It would be reasonable to assume that programmers may be tempted to try memory_order_acquire or memory_order_release instead; they are provided as options, so they must be useful for something! But this is a trap that leads to incorrect code, at least according to the specification. (The code may well work in practice, as suggested by Table 4, but it isn't guaranteed to.)
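
To make the trap concrete, here is our sketch of Example 5 weakened in this way; the specification does not guarantee that the assertion holds:

Thread 1:

x = 1; // non-atomic

atomic_global_fence_compatibility
    .fence( memory_order_release );

y.store( 1, memory_order_relaxed );

Thread 2:

if( y.load( memory_order_relaxed ) == 1 )
{
    atomic_global_fence_compatibility
        .fence( memory_order_acquire );

    // if the fence in thread 2 precedes the fence in thread 1 in the
    // total order, neither fence synchronizes-with the other, and
    // nothing rules out reading x == 0 here
    assert( x == 1 );
}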

The obvious alternative formulation:

void atomic_global_fence();

provides equivalent functionality to the current

extern const atomic_flag atomic_global_fence_compatibility;

while at the same time being better named and much less error prone. The aim of the explicit global variable atomic_global_fence_compatibility is to render the dependency on a single location explicit, highlighting the potential performance problem and encouraging users to migrate their algorithms away from a global fence formulation and into a form that is more friendly to the C++ memory model. It is questionable whether the benefits of this approach outweigh its obvious downsides, and the long and inconvenient name that has been chosen is certain to lead to its shortening or wrapping, hiding it from view and subverting the intent.
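
Under the alternative formulation, Example 5 would be written as follows (our sketch, assuming atomic_global_fence is specified as a full acquire-and-release fence):

Thread 1:

x = 1; // non-atomic

atomic_global_fence(); // no memory_order argument to get wrong

y.store( 1, memory_order_relaxed );

Thread 2:

if( y.load( memory_order_relaxed ) == 1 )
{
    atomic_global_fence();

    assert( x == 1 ); // passes, by the same analysis as before
}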

Having concluded our analysis, we can now revisit our criteria 1-4:

  1. It is possible to port existing code to use the currently supplied fence primitive, but this generally comes at a significant performance cost, because all fences are heavyweight full fences. This will discourage porting efforts and potentially lead to less availability of standard, portable code.
  2. Programmers who are familiar with fence-based architectures will find the innovative concept of per-variable fences odd. The atomic_global_fence_compatibility alternative is error prone, suboptimal and, in fact, designed to discourage people from using it. Some of these programmers will be slow to migrate to the C++ memory model. (Others will of course bite the bullet and learn the new acquire/release style; but the point is that they could've been writing code using their old skills while learning.)
  3. The programmer has no way of emitting an lwsync, that is, a fence weaker than a full fence. This is especially problematic. Programmers are very good at solving problems; when they need to emit a fence instruction, they will find a way to do it, for example by using a dummy atomic load on a randomly chosen variable (see the sketch after this list). Ideally, we want to let them express their intent directly, without abusing an unrelated primitive and producing code that looks portable on the surface, compiles, runs, and sometimes fails in subtle and mysterious ways.
  4. The fact that all fences are mapped to SYNC/mfence often negates all performance gains that could theoretically be achieved by conditional or separate synchronization. Consider for example the multiple lock release example from N2153 (shown here in its original form):

do_work(A,B);
release_fence();
lock_A.store_raw(LOCK_UNLOCKED);
lock_B.store_raw(LOCK_UNLOCKED);

and its formulation in terms of the current working draft primitives:

do_work(A,B);

lock_A.fence( memory_order_release );
lock_B.fence( memory_order_release );

lock_A.store( LOCK_UNLOCKED, memory_order_relaxed );
lock_B.store( LOCK_UNLOCKED, memory_order_relaxed );

The ideal outcome on an x86 platform, where all stores already have release semantics, would be for the fences to disappear completely. Alas, this is not what would happen; as a look at Table 4 demonstrates, a typical implementation would emit two mfence instructions (one of which may be eliminated by the peephole optimizer), with predictably undesirable performance consequences.
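
Here is the sketch promised in point 3 above: abusing an acquire load on an unrelated atomic variable in the hope of obtaining a fence instruction. The names dummy, data and more_data are hypothetical, introduced only for this illustration:

// dummy plays no part in the algorithm; the load below exists only in
// the hope that the compiler will emit an acquire fence for it
extern atomic_int dummy;

r1 = data.load( memory_order_relaxed );
dummy.load( memory_order_acquire ); // hoped-for "#LoadLoad | #LoadStore"
r2 = more_data.load( memory_order_relaxed );

// This compiles, runs, and may even emit the desired instruction, but
// the acquire semantics attach to the dummy load only: the preceding
// relaxed load may still be reordered past it, and the memory model
// establishes no synchronization on data or more_data.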

III. Suggested Alternative

The alternative proposed in this paper isn't new. It has been suggested, with a slightly different syntax, in N2153 (N2237), N2195 and N2262. It consists of a single function, atomic_memory_fence:

void atomic_memory_fence( memory_order mo );

and its intended implementation on some platforms of interest is shown in the following table:

mo                      SPARC RMO                                            POWER     IA-64    x86, x64
memory_order_relaxed    no-op                                                no-op     no-op    no-op
memory_order_acquire    #LoadLoad | #LoadStore                               LWSYNC    mf       no-op
memory_order_release    #LoadStore | #StoreStore                             LWSYNC    mf       no-op
memory_order_acq_rel    #LoadLoad | #LoadStore | #StoreStore                 LWSYNC    mf       no-op
memory_order_seq_cst    #LoadLoad | #LoadStore | #StoreLoad | #StoreStore    SYNC      mf       mfence, or a locked instruction
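
As an illustration, the port in Example 5 can now be expressed directly, and the weaker acquire and release orderings suffice (our sketch, relying on the fence semantics proposed below):

Thread 1:

x = 1; // non-atomic

atomic_memory_fence( memory_order_release ); // LWSYNC on POWER, no-op on x86

y.store( 1, memory_order_relaxed );

Thread 2:

if( y.load( memory_order_relaxed ) == 1 )
{
    atomic_memory_fence( memory_order_acquire ); // LWSYNC on POWER, no-op on x86

    // the load of y is sequenced before the acquire fence and observes a
    // store sequenced after the release fence, so the two fences
    // synchronize-with each other and the assertion holds
    assert( x == 1 );
}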

What is new is that this paper proposes a specification of atomic_memory_fence that describes the intended semantics within the framework of the C++ memory model.

This primitive meets our criteria 1-4 above, for the following reasons:

  1. Existing code written in terms of SPARC or POWER barriers can easily be ported to the C++ model by consulting the table above and choosing a row appropriately. In the SPARC case, one chooses the first row that encompasses the barrier used in the ported code; in the POWER case, one can either use memory_order_acq_rel for all LWSYNCs, or use knowledge of the algorithm being implemented to replace some LWSYNCs with a theoretically weaker memory_order_acquire or memory_order_release fence.
  2. For the same reason, programmers fluent in SPARCese or POWERian can immediately put their knowledge into use.
  3. The programmer can easily emit an LWSYNC instruction, or, stated equivalently, insert an lwsync fence, by using atomic_memory_fence(memory_order_acq_rel).
  4. Since the primitive is based on N2153/N2237/N2262, it addresses all performance-related use cases presented therein.
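
For example, the multiple lock release example from Section II becomes (our sketch):

do_work( A, B );

atomic_memory_fence( memory_order_release ); // a single LWSYNC on POWER; nothing on x86

lock_A.store( LOCK_UNLOCKED, memory_order_relaxed );
lock_B.store( LOCK_UNLOCKED, memory_order_relaxed );

On x86, where all stores already have release semantics, the fence compiles to nothing, which is exactly the ideal outcome that the working draft formulation could not deliver.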

In addition,

void atomic_compiler_fence( memory_order mo );

is proposed as a companion primitive that restricts compiler optimizations and reorderings in exactly the same way as atomic_memory_fence, without actually emitting a hardware fence instruction. This primitive supports our use case 3 above; it is intended to allow a programmer who already knows what instructions are to be emitted to communicate this knowledge to the compiler.

The theoretical justification of atomic_compiler_fence is that it allows the programmer to specify the order in which relaxed writes to atomic variables become visible to a signal handler executed in the same thread.
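
A sketch of this use case, with hypothetical atomic variables data and ready; because the handler runs on the same CPU as the thread it interrupts, a compiler ordering guarantee suffices and no hardware fence is needed:

// thread code
data.store( 42, memory_order_relaxed );
atomic_compiler_fence( memory_order_release ); // no instruction emitted
ready.store( 1, memory_order_relaxed );

// signal handler, executed in the same thread
if( ready.load( memory_order_relaxed ) == 1 )
{
    atomic_compiler_fence( memory_order_acquire );
    int r = data.load( memory_order_relaxed ); // guaranteed to read 42
}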

IV. Proposed text

Remove the

void fence(memory_order) const volatile;

members from all types in [atomics].

Remove the

void atomic_flag_fence(const volatile atomic_flag *object, memory_order order);

function.

Remove the

void atomic_memory_fence(const volatile atomic_type*, memory_order);

functions.

Remove the definition

extern const atomic_flag atomic_global_fence_compatibility;

Add

// 29.5, fences
void atomic_memory_fence(memory_order);
void atomic_compiler_fence(memory_order);

to the synopsis of <cstdatomic>.

Add a new section, [atomic.fences], with the following contents:

29.5 Fences

This section introduces synchronization primitives called fences. Fences can have acquire semantics, release semantics, or both. A fence with acquire semantics is called an acquire fence. A fence with release semantics is called a release fence.

A release fence R synchronizes with an acquire fence A if a load that is sequenced before A observes the value written by a store sequenced after R, or a value written by any side effect in the release sequence (1.10/6) that the store would produce had it been a release operation.

A release fence R synchronizes with an acquire operation A if A observes the value written by a store sequenced after R, or a value written by any side effect in the release sequence that the store would produce had it been a release operation.

A release operation R synchronizes with an acquire fence A if a load, sequenced before A, observes the value written by R, or a value written by any side effect in the release sequence headed by R.

Fences can be sequentially consistent. A sequentially consistent fence has both acquire and release semantics. Sequentially consistent fences participate in the single total order S over all memory_order_seq_cst operations (29.1/2). Each sequentially consistent fence synchronizes with the sequentially consistent fence that follows it in S.

void atomic_memory_fence(memory_order mo);

Effects: Depending on the value of mo, this operation:

- has no effects, if mo == memory_order_relaxed;
- is an acquire fence, if mo == memory_order_acquire;
- is a release fence, if mo == memory_order_release;
- is both an acquire fence and a release fence, if mo == memory_order_acq_rel;
- is a sequentially consistent fence, if mo == memory_order_seq_cst.

void atomic_compiler_fence(memory_order mo);

Effects: equivalent to atomic_memory_fence(mo), except that synchronizes-with relationships are established only between a thread and a signal handler executed in the same thread.

[Note: atomic_compiler_fence can be used to specify the order in which actions performed by the thread become visible to the signal handler. — end note]

[Note: Compiler optimizations or reorderings of loads and stores are inhibited in the same way as with atomic_memory_fence, but the hardware fence instructions that atomic_memory_fence would have inserted are not emitted. — end note]

Thanks to Hans Boehm, Lawrence Crowl, Paul McKenney and Raul Silvera for reviewing this paper.

--end