| Doc. No.: | WG21/N2392 J16/07-252 |
| --- | --- |
| Date: | 2007-09-09 |
| Reply to: | Hans-J. Boehm |
| Phone: | +1-650-857-3406 |
| Email: | Hans.Boehm@hp.com |
A program has a sequentially consistent execution if there is an interleaving of the actions performed by each thread such that each read access to a scalar object sees the last write to the object that precedes it in the interleaving. Slightly more formally, we must be able to arrange the actions performed by all threads in a single total order T, such that:

- the actions of each thread occur in T in the order in which that thread performs them, i.e. consistently with sequenced-before, and
- each read access to a scalar object sees the value stored by the last write to the same object that precedes it in T.
Here we argue that programs that are data-race-free, by either the definitions in N2334 or some more intuitive formulations, and that use only locks and sequentially consistent atomics for synchronization, exhibit only sequentially consistent executions.
This is somewhat analogous to the corresponding theorem for Java.
Main Claims: If a program uses no synchronization operations other than

- locks (manipulated, for now, only via lock() and unlock()), and
- sequentially consistent atomics,

then

1. every execution of the program that is data-race-free under the rules of the N2334 memory model is sequentially consistent;
2. if the program has an execution with a data race (under the N2334 rules), then it also has a sequentially consistent execution exhibiting a data race; and
3. the program has an N2334 data race if and only if it has a sequentially consistent execution in which two conflicting operations performed by different threads are adjacent in the interleaving.

Note that the third claim implies that data races can be understood entirely in terms of sequentially consistent executions. For example, the following program (where x and y are ordinary variables) is data-race-free, since in no sequentially consistent execution does either conditional assignment execute:
| Thread1 | Thread2 |
| --- | --- |
| x = 0; | y = 0; |
| if (x) y = 1; | if (y) x = 1; |
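Spelled out as C++ (a sketch; x and y are ordinary, non-atomic globals), the circular reasoning runs as follows:

```cpp
int x, y;   // ordinary, non-atomic globals

void thread1() {
    x = 0;
    // In every sequentially consistent execution the read of x below
    // sees 0: Thread2 stores x = 1 only after its read of y sees 1,
    // which would require the "y = 1" below to have executed first.
    if (x)
        y = 1;
}

void thread2() {
    y = 0;
    if (y)       // symmetrically, this read sees 0 in every
        x = 1;   // sequentially consistent execution
}

// Since neither conditional store executes in any sequentially
// consistent execution, no conflicting accesses by different threads
// ever occur, and the program is data-race-free; by the claims above
// it therefore has only sequentially consistent executions. In
// particular, an implementation must not perform either store
// speculatively.
```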
From a pure correctness perspective, condition variable notification can be modelled as a no-op, and a condition variable wait as an unlock() followed by a lock() operation. Hence the results here also apply to programs with condition variables.
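For instance, in C++11 terms (a sketch only; the names std::mutex, unlock(), and lock() postdate this paper), the model treats a wait as nothing more than its lock manipulations:

```cpp
#include <mutex>

// A modeling sketch, not an implementation: for the happens-before
// reasoning in this paper, a condition-variable wait on mutex m is
// equivalent to an unlock() followed by a lock(), and a notification
// contributes no synchronization of its own.
void modeled_wait(std::mutex& m) {
    m.unlock();   // wait() releases the lock on entry
    // ... the thread blocks here; the wakeup itself adds no ordering ...
    m.lock();     // wait() reacquires the lock before returning
}

void modeled_notify() {
    // a no-op for correctness purposes
}
```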
Assumptions about Synchronization Operations:
First note that although N2334 no longer explicitly requires the happens-before relation to be irreflexive (equivalently, since it is transitive, acyclic), this is in fact still an implicit requirement. If there were a cycle, so that some action A happened before itself, the cycle would have to involve at least one inter-thread synchronizes-with relationship whose store S and load L both appear in the cycle. But this would prevent S from being in the visible sequence (1.10p10) of L, since S would also "happen after" L.
We now restrict our attention to programs whose only conflicting accesses between threads are to locks (lock() and unlock() only for now) and sequentially consistent atomics.
We assume that there exists a single strict (irreflexive) total "atomic synchronization order" SO on atomic operations, such that:

- SO is consistent with the happens-before relation, and
- each atomic load sees the value stored by the last preceding store to the same object in SO.
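For illustration, in C++11 terms (std::atomic operations default to sequentially consistent ordering; these names postdate this paper), the single order SO is exactly what validates the following Dekker-style reasoning:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x(0), y(0);
int r1, r2;

void thread1() {
    y.store(1);      // sequentially consistent store
    r1 = x.load();   // sequentially consistent load
}

void thread2() {
    x.store(1);
    r2 = y.load();
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // All four atomic operations occur in the single total order SO,
    // and each load sees the last preceding store to its object in SO.
    // If r1 == 0, then x.load() precedes x.store(1) in SO, so
    //     y.store(1) < x.load() < x.store(1) < y.load()   (in SO),
    // and y.load() must see 1. Hence r1 == r2 == 0 is impossible,
    // unlike in the acquire/release variant at the end of this paper.
}
```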
We further know that lock acquisitions and releases of a lock l occur in a single total order, namely the modification order M_l for l. Since every lock release synchronizes with the next lock acquisition in M_l, and we assume that for every lock acquisition the next operation in M_l is a release of the lock performed by the same thread, it follows that M_l is a subset of the happens-before relation.

Since SO is consistent with happens-before, and the M_l for the various locks are subsets of happens-before, we can extend SO to a total order that includes all the M_l. (We obtain such an order by topologically sorting the transitive closure of the union of all the M_l and SO.) From here on, we will assume, without loss of generality, that SO includes lock operations, and refer to it simply as the "synchronization order".
The peculiar role of try_lock():
So far we have limited ourselves to lock() and unlock() operations on locks. Boehm, "Reordering constraints for pthread-style locks", PPoPP '07, points out that try_lock() introduces a different set of issues. Herb Sutter has pointed out that timed_lock() shares the same issues.
The fundamental issue here is that, since lock() does not have release semantics, a failed try_lock() sees the value of a modification that did not necessarily happen before it. There is not necessarily a single irreflexive order that is consistent with both SO and the apparent ordering of lock operations. For example, the following is consistent with our currently proposed semantics, since there are no synchronizes-with relationships between the two threads:
| Thread1 | Thread2 |
| --- | --- |
| store(&x, 1); | r2 = try_lock(&l1); // fails |
| lock(&l1); | r3 = load(&x); // yields 0 |
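In C++11 terms (again a sketch, using std::mutex names that postdate this paper), the example amounts to:

```cpp
#include <mutex>

int x = 0;          // ordinary, non-atomic variable
std::mutex l1;

void thread1() {
    x = 1;          // ordinary store
    l1.lock();      // lock() has only acquire semantics, so nothing
                    // forces the store to x to become visible first
    // ...
}

void thread2() {
    if (!l1.try_lock()) {   // fails: thread1 holds l1
        // The "abuse": inferring from the failure that x = 1 has
        // already happened. Without release semantics on lock()
        // (or acquire semantics on the failed try_lock()), r3 may
        // still be 0.
        int r3 = x;
        (void)r3;
    }
}
```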
If we wanted the first claim to hold with the customary interpretation of try_lock(), we would need to preclude the above outcome by ensuring that the two statements executed by thread 2 in the above example become visible in order. This would certainly require that failed try_lock() operations have acquire semantics, which has non-negligible cost on some architectures. If we want all our claims to hold in the presence of a standard try_lock(), we would also need the lock() operation to have release semantics (in addition to its normal acquire semantics), since it writes the value read by a failed try_lock(). This often has a substantial performance cost, even in the absence of lock contention.
It was generally agreed that we do not want to incur either of the above costs solely in order to support abuses of try_lock(), such as the one in the above example. We thus proceed on a different path.
We will assume that if try_lock() is present at all, then it can fail spuriously, i.e. fail to acquire the lock and return failure, even if the lock is available. Similarly, if timed_lock() is available, it may fail to acquire the lock, even if the lock was available during the entire time window in which we attempted to acquire it.
These have the effect of ensuring that neither a failed try_lock() nor a failed timed_lock() provides useful information about the state of the lock. Hence they no longer act as read operations, and we can no longer "read" the value of a lock unless the corresponding "write" operation happened before the read.
The example above is no longer a counter-example to our first claim. The outcome is possible in a sequentially consistent execution in which all of thread 2 is executed before thread 1, since the try_lock() in thread 2 can fail spuriously.
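The practical consequence is that portable code may interpret a failed try_lock() only as "the lock was not acquired", never as evidence about another thread's progress. A minimal sketch:

```cpp
#include <mutex>

std::mutex m;

// Correct use of try_lock(): since a failure may be spurious, it
// carries no information about the state of the lock or about other
// threads. We simply report that no work was done and let the caller
// retry or do something else.
bool try_do_work() {
    if (!m.try_lock())
        return false;        // possibly spurious failure
    // ... critical section ...
    m.unlock();
    return true;
}
```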
Proof Of Main Claim 1:
Consider a particular race-free execution of the program on the given input, following the rules of N2334.

The corresponding happens-before relation (hb) and synchronization order are irreflexive and mutually consistent, and hence their union can be extended to a strict total order T.

Clearly the actions of each thread appear in T in thread order, i.e. in sequenced-before order.
It remains to be shown that each load sees the last preceding store in T that stores to the same location. (For present purposes we view bit-field stores as loading from and then storing into the entire "location", i.e. the contiguous sequence of non-zero-length bit-fields containing the field.)
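For example (a sketch of the intended notion of "location"):

```cpp
struct S {
    int a : 4;   // a and b are a contiguous sequence of non-zero-length
    int b : 4;   // bit-fields, so they form a single "location": a store
                 // to b is treated as a load and store of the whole
                 // location, and thus conflicts with accesses to a
    int   : 0;   // a zero-length bit-field ends the sequence
    int c : 4;   // c belongs to a separate location
};
```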
Clearly this is true for operations on atomic objects, since all such operations appear in T in the same order as in SO, and each load sees the last preceding store to the same object in SO. A similar argument applies to operations on locks: lock operations on a single lock, other than failed try_locks, are also totally ordered by SO, and thus each such operation sees the result of the last preceding one in T.
We can say little about where a failed try_lock() operation on a lock l appears in T. But, since we assume that try_lock() may fail spuriously, it does not matter. A failure outcome is acceptable no matter what state the lock was left in by the last preceding operation (in T) on l. No matter where the failed try_lock() appears in T, the operations on l could have been executed in that order, and produced the original results.
From here on, we consider only ordinary, non-atomic memory operations.
Consider a store operation S_visible seen by a load L.

By the rules of N2334, S_visible must happen before L. Hence S_visible precedes L in the total order T.

Now assume that another store S_middle to the same location appears between S_visible and L in T.

We know, from the fact that T is an extension of hb, that we cannot have either of

L hb S_middle

S_middle hb S_visible

since either would be inconsistent with the reverse ordering in T.

However, all three operations conflict, and we have no data races. Hence they must all be pairwise ordered by hb, and S_middle must be hb-ordered between the other two. But this violates the second clause of the visibility definition in 1.10p9, so no such S_middle can exist, completing the proof.
Proof Of Main Claim 2:
We show that any data race by our definition corresponds to a data race in a sequentially consistent execution.
Consider an execution with a data race. Let T be the total extension of the happens-before and synchronization orders, as constructed above.

Consider the longest prefix TP of T that contains no race. Note that each load in TP must see a store that precedes it in either the synchronization or happens-before order. Hence each load in TP must see a store that is also in TP. Similarly, each lock operation must see the state produced by another lock operation also in TP, or it must be a failed try_lock() whose outcome could have occurred had it seen such a state.
By the arguments of the preceding section, the original execution restricted to TP is equivalent to a sequentially consistent execution.
The next element N of T following TP must be an ordinary memory access that introduces a race. If N is a write operation, consider the original execution restricted to TP ∪ {N}. Otherwise consider the same execution, except that N sees the value written by the last write to the same variable in TP.

In either case, the resulting execution (of TP plus the operation introducing the race) is sequentially consistent: if we extend the interleaving TP with the (variant of) N, the resulting sequence is still an interleaving of the evaluation steps of the threads in which each read sees the last preceding write. (If N was a write, it could not have been seen by any of the reads in TP, since those reads were ordered before N in T. If N was a read, it was modified by construction to make this claim true.) Thus we have a sequentially consistent execution of the program that exhibits the data race.
Proof Of Main Claim 3:
If A synchronizes with B, or is sequenced before B, then clearly A must precede B in the interleaving corresponding to a sequentially consistent execution: in the first case B sees the value stored by A, and in the second the order within a thread must be preserved. Thus if A happens before B, A must precede B in the interleaving.

It follows that if two conflicting operations executed by different threads are adjacent in the interleaving, neither can happen before the other; the synchronization operations introducing the happens-before ordering would otherwise have to occur between them. Thus if two such adjacent operations exist in the interleaving, we have an N2334 race.
It remains to show the converse: If we have an N2334 race, there must be an interleaving corresponding to a sequentially consistent execution in which the racing operations are adjacent.
Start with the execution restricted to TP ∪ {N}, as above, with any value read by N adjusted as needed, also as above. We know that nothing in this partial execution depends on the value read by N. Let M be the other memory reference involved in the race.
We can further restrict the execution to those operations that happen before either M or N. This set still consists of prefixes of the sequences of operations performed by each thread. Since each load sees a store that happens before it, the omitted operations cannot impact the remaining execution. (This would of course not be true for a try_lock() whose failure were meaningful, since a failed try_lock() can be affected by lock operations that do not happen before it; with spurious failures allowed, its outcome remains acceptable.)
Define a partial order race-order on { x : x happens before M or N } ∪ {M} ∪ {N} as follows. First divide this set into three subsets:

- the operations that happen before M or N,
- {M}, and
- {N}.

Race-order places every element of the first subset before both M and N, and places M before N.
Race-order is consistent with happens-before and the synchronization order. It imposes no additional order on the initial subset. Neither M nor N is ordered by the synchronization order, and neither is race-ordered or happens before any of the elements of the first subset. Hence if we had a cycle A_0, A_1, ..., A_n = A_0, in which each element of the sequence happens before, or is race-ordered or synchronization-ordered before, the next, then neither M nor N could appear in the cycle; the cycle would lie entirely within the first subset and involve only happens-before and the synchronization order. This is impossible, since those two orders are required to be consistent.
Construct the total order T' as a total extension of the reflexive transitive closure of the union of

- the happens-before relation,
- the synchronization order, and
- race-order,

each restricted to this set of operations.
By the same arguments as in the proof of claim 1, every memory read must see the last preceding write to the same location in this sequence, except possibly N, since N is the only operation that may see a value stored by a racing operation. But we can again simply adjust the value seen by N to obtain the property we desire, without affecting the rest of the execution. Thus T' is the desired interleaving of thread actions in which the racing actions are adjacent.
Concluding Observation:
None of the above applies if we allow load_acquire and store_release operations, since the synchronization operations themselves may not behave in a sequentially consistent manner. In particular, consider the following standard ("Dekker's") example:
| Thread1 | Thread2 |
| --- | --- |
| store_release(&y, 1); | store_release(&x, 1); |
| r1 = load_acquire(&x); | r2 = load_acquire(&y); |
This allows r1 = r2 = 0, an outcome that sequential consistency (and Java) do not permit.
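In C++11 terms (memory_order_release and memory_order_acquire; a sketch using names that postdate this paper), the example reads:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x(0), y(0);
int r1, r2;

void thread1() {
    y.store(1, std::memory_order_release);
    r1 = x.load(std::memory_order_acquire);
}

void thread2() {
    x.store(1, std::memory_order_release);
    r2 = y.load(std::memory_order_acquire);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // r1 == r2 == 0 is a permitted outcome: if neither load reads from
    // the other thread's store, no synchronizes-with edge is created,
    // so nothing places the four operations in a single total order,
    // and each load may see the initial value 0.
}
```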