| Doc. No.: | WG21/N2392 J16/07-252 |
| --- | --- |
| Date: | 2007-09-09 |
| Reply to: | Hans-J. Boehm |
| Phone: | +1-650-857-3406 |
| Email: | Hans.Boehm@hp.com |
A program has a sequentially consistent execution if there is an interleaving of the actions performed by each thread such that each read access to a scalar object sees the last write to the object that precedes it in the interleaving. Slightly more formally, we must be able to arrange the actions performed by all threads in a single total order T, such that:

- the actions of each thread occur in T in the order in which that thread performs them, i.e. consistently with sequenced-before, and
- each read access to a scalar object sees the value stored by the last write to the same object that precedes it in T.
Here we argue that programs that are data-race-free, by either the definitions in N2334 or some more intuitive formulations, and that use only locks and sequentially consistent atomics for synchronization, exhibit only sequentially consistent executions.
This is somewhat analogous to the corresponding theorem for Java.
Main Claims: If a program uses no synchronization operations other than

- locks (manipulated, for now, only via lock() and unlock()), and
- sequentially consistent atomics,

then

1. every execution of the program that is data-race-free under the rules of the N2334 memory model is sequentially consistent;
2. if the program has an execution with a data race (under the N2334 rules), then it also has a sequentially consistent execution exhibiting a data race; and
3. the program has an N2334 data race if and only if it has a sequentially consistent execution in which two conflicting operations performed by different threads are adjacent in the interleaving.

Note that the third claim implies that data races can be understood entirely in terms of sequentially consistent executions. For example, the following program (where x and y are ordinary variables) is data-race-free, since in no sequentially consistent execution does either conditional assignment execute:
| Thread1 | Thread2 |
| --- | --- |
| x = 0; | y = 0; |
| if (x) y = 1; | if (y) x = 1; |
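Spelled out as C++ (a sketch; x and y are ordinary, non-atomic globals), the circular reasoning runs as follows:

```cpp
int x, y;   // ordinary, non-atomic globals

void thread1() {
    x = 0;
    // In every sequentially consistent execution the read of x below
    // sees 0: Thread2 stores x = 1 only after its read of y sees 1,
    // which would require the "y = 1" below to have executed first.
    if (x)
        y = 1;
}

void thread2() {
    y = 0;
    if (y)       // symmetrically, this read sees 0 in every
        x = 1;   // sequentially consistent execution
}

// Since neither conditional store executes in any sequentially
// consistent execution, no conflicting accesses by different threads
// ever occur, and the program is data-race-free; by the claims above
// it therefore has only sequentially consistent executions. In
// particular, an implementation must not perform either store
// speculatively.
```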
From a pure correctness perspective, condition variable notification can be modelled as a no-op, and a condition variable wait as an unlock() followed by a lock() operation. Hence the results here also apply to programs with condition variables.
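For instance, in C++11 terms (a sketch only; the names std::mutex, unlock(), and lock() postdate this paper), the model treats a wait as nothing more than its lock manipulations:

```cpp
#include <mutex>

// A modeling sketch, not an implementation: for the happens-before
// reasoning in this paper, a condition-variable wait on mutex m is
// equivalent to an unlock() followed by a lock(), and a notification
// contributes no synchronization of its own.
void modeled_wait(std::mutex& m) {
    m.unlock();   // wait() releases the lock on entry
    // ... the thread blocks here; the wakeup itself adds no ordering ...
    m.lock();     // wait() reacquires the lock before returning
}

void modeled_notify() {
    // a no-op for correctness purposes
}
```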
Assumptions about Synchronization Operations:
First note that although N2334 no longer explicitly requires the happens-before relation to be irreflexive (equivalently, since it is transitive, acyclic), this is in fact still an implicit requirement. If there were a cycle, so that some action A happened before itself, the cycle would have to involve at least one inter-thread synchronizes-with relationship whose store S and load L both appear in the cycle. But this would prevent S from being in the visible sequence (1.10p10) of L, since S would also "happen after" L.
We now restrict our attention to programs whose only conflicting accesses between threads are to locks (lock() and unlock() only for now) and sequentially consistent atomics.
We assume that there exists a single strict (irreflexive) total "atomic synchronization order" SO on atomic operations, such that:

- SO is consistent with the happens-before relation, and
- each atomic load sees the value stored by the last preceding store to the same object in SO.
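For illustration, in C++11 terms (std::atomic operations default to sequentially consistent ordering; these names postdate this paper), the single order SO is exactly what validates the following Dekker-style reasoning:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x(0), y(0);
int r1, r2;

void thread1() {
    y.store(1);      // sequentially consistent store
    r1 = x.load();   // sequentially consistent load
}

void thread2() {
    x.store(1);
    r2 = y.load();
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // All four atomic operations occur in the single total order SO,
    // and each load sees the last preceding store to its object in SO.
    // If r1 == 0, then x.load() precedes x.store(1) in SO, so
    //     y.store(1) < x.load() < x.store(1) < y.load()   (in SO),
    // and y.load() must see 1. Hence r1 == r2 == 0 is impossible,
    // unlike in the acquire/release variant at the end of this paper.
}
```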
We further know that lock acquisitions and releases of a lock l occur in a single total order, namely the modification order M_l for l. Since every lock release synchronizes with the next lock acquisition in M_l, and we assume that for every lock acquisition the next operation in M_l is a release of the lock performed by the same thread, it follows that M_l is a subset of the happens-before relation.

Since SO is consistent with happens-before, and the M_l for the various locks are subsets of happens-before, we can extend SO to a total order that includes all the M_l. (We obtain such an order by topologically sorting the transitive closure of the union of all the M_l and SO.) From here on, we will assume, without loss of generality, that SO includes lock operations, and refer to it simply as the "synchronization order".
The peculiar role of try_lock():
So far we have limited ourselves to lock() and unlock() operations on locks. Boehm, "Reordering constraints for pthread-style locks", PPoPP '07, points out that try_lock() introduces a different set of issues. Herb Sutter has pointed out that timed_lock() shares the same issues.
The fundamental issue here is that, since lock() does not have release semantics, a failed try_lock() sees the value of a modification that did not necessarily happen before it. There is not necessarily a single irreflexive order that is consistent with both SO and the apparent ordering of lock operations. For example, the following is consistent with our currently proposed semantics, since there are no synchronizes-with relationships between the two threads:
| Thread1 | Thread2 |
| --- | --- |
| store(&x, 1); | r2 = try_lock(&l1); // fails |
| lock(&l1); | r3 = load(&x); // yields 0 |
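In C++11 terms (again a sketch, using std::mutex names that postdate this paper), the example amounts to:

```cpp
#include <mutex>

int x = 0;          // ordinary, non-atomic variable
std::mutex l1;

void thread1() {
    x = 1;          // ordinary store
    l1.lock();      // lock() has only acquire semantics, so nothing
                    // forces the store to x to become visible first
    // ...
}

void thread2() {
    if (!l1.try_lock()) {   // fails: thread1 holds l1
        // The "abuse": inferring from the failure that x = 1 has
        // already happened. Without release semantics on lock()
        // (or acquire semantics on the failed try_lock()), r3 may
        // still be 0.
        int r3 = x;
        (void)r3;
    }
}
```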
If we wanted the first claim to hold with the customary interpretation of try_lock(), we would need to preclude the above outcome by ensuring that the two statements executed by thread 2 in the above example become visible in order. This would certainly require that failed try_lock() operations have acquire semantics, which has non-negligible cost on some architectures. If we want all our claims to hold in the presence of a standard try_lock(), we would also need the lock() operation to have release semantics (in addition to its normal acquire semantics), since it writes the value read by a failed try_lock(). This often has a substantial performance cost, even in the absence of lock contention.
It was generally agreed that we do not want to incur either of the above costs solely in order to support abuses of try_lock(), such as the one in the above example. We thus proceed on a different path.
We will assume that if try_lock() is present at all, then it can fail spuriously, i.e. fail to acquire the lock and return failure, even if the lock is available. Similarly, if timed_lock() is available, it may fail to acquire the lock, even if the lock was available during the entire time window in which we attempted to acquire it.
These have the effect of ensuring that neither a failed try_lock() nor a failed timed_lock() provides useful information about the state of the lock. Hence they no longer act as read operations, and we can no longer "read" the value of a lock unless the corresponding "write" operation happened before the read.
The example above is no longer a counter-example to our first claim. The outcome is possible in a sequentially consistent execution in which all of thread 2 is executed before thread 1, since the try_lock() in thread 2 can fail spuriously.
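The practical consequence is that portable code may interpret a failed try_lock() only as "the lock was not acquired", never as evidence about another thread's progress. A minimal sketch:

```cpp
#include <mutex>

std::mutex m;

// Correct use of try_lock(): since a failure may be spurious, it
// carries no information about the state of the lock or about other
// threads. We simply report that no work was done and let the caller
// retry or do something else.
bool try_do_work() {
    if (!m.try_lock())
        return false;        // possibly spurious failure
    // ... critical section ...
    m.unlock();
    return true;
}
```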
Proof Of Main Claim 1:
Consider a particular race-free execution of the program on the given input, following the rules of N2334.

The corresponding happens-before relation (hb) and synchronization order are irreflexive and mutually consistent, and hence their union can be extended to a strict total order T.

Clearly the actions of each thread appear in T in thread order, i.e. in sequenced-before order.
It remains to be shown that each load sees the last preceding store in T that stores to the same location. (For present purposes we view bit-field stores as loading from and then storing into the entire "location", i.e. the contiguous sequence of non-zero-length bit-fields containing the field.)
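For example (a sketch of the intended notion of "location"):

```cpp
struct S {
    int a : 4;   // a and b are a contiguous sequence of non-zero-length
    int b : 4;   // bit-fields, so they form a single "location": a store
                 // to b is treated as a load and store of the whole
                 // location, and thus conflicts with accesses to a
    int   : 0;   // a zero-length bit-field ends the sequence
    int c : 4;   // c belongs to a separate location
};
```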
Clearly this is true for operations on atomic objects, since all such operations appear in T in the same order as in SO, and each load sees the last preceding store to the same object in SO. A similar argument applies to operations on locks: lock operations on a single lock, other than failed try_locks, are also totally ordered by SO, and thus each such operation sees the result of the last preceding one in T.
We can say little about where a failed try_lock() operation on a lock l appears in T. But, since we assume that try_lock() may fail spuriously, it does not matter. A failure outcome is acceptable no matter what state the lock was left in by the last preceding operation (in T) on l. No matter where the failed try_lock() appears in T, the operations on l could have been executed in that order, and produced the original results.
From here on, we consider only ordinary, non-atomic memory operations.
Consider a store operation S_visible seen by a load L.

By the rules of N2334, S_visible must happen before L. Hence S_visible precedes L in the total order T.

Now assume that another store S_middle to the same location appears between S_visible and L in T.

We know, from the fact that T is an extension of hb, that we cannot have either of

L hb S_middle

S_middle hb S_visible

since either would be inconsistent with the reverse ordering in T.

However, all three operations conflict, and we have no data races. Hence they must all be pairwise ordered by hb, and S_middle must be hb-ordered between the other two. But this violates the second clause of the visibility definition in 1.10p9, so no such S_middle can exist, completing the proof.
Proof Of Main Claim 2:
We show that any data race by our definition corresponds to a data race in a sequentially consistent execution.
Consider an execution with a data race. Let T be the total extension of the happens-before and synchronization orders, as constructed above.

Consider the longest prefix TP of T that contains no race. Note that each load in TP must see a store that precedes it in either the synchronization or happens-before order. Hence each load in TP must see a store that is also in TP. Similarly, each lock operation must see the state produced by another lock operation also in TP, or it must be a failed try_lock() whose outcome could have occurred had it seen such a state.
By the arguments of the preceding section, the original execution restricted to TP is equivalent to a sequentially consistent execution.
The next element N of T following TP must be an ordinary memory access that introduces a race. If N is a write operation, consider the original execution restricted to TP ∪ {N}. Otherwise consider the same execution, except that N sees the value written by the last write to the same variable in TP.

In either case, the resulting execution (of TP plus the operation introducing the race) is sequentially consistent: if we extend the interleaving TP with the (variant of) N, the resulting sequence is still an interleaving of the evaluation steps of the threads in which each read sees the last preceding write. (If N was a write, it could not have been seen by any of the reads in TP, since those reads were ordered before N in T. If N was a read, it was modified by construction to make this claim true.) Thus we have a sequentially consistent execution of the program that exhibits the data race.
Proof Of Main Claim 3:
If A synchronizes with B, or is sequenced before B, then clearly A must precede B in the interleaving corresponding to a sequentially consistent execution: in the first case B sees the value stored by A, and in the second the order within a thread must be preserved. Thus if A happens before B, A must precede B in the interleaving.

It follows that if two conflicting operations executed by different threads are adjacent in the interleaving, neither can happen before the other; the synchronization operations introducing the happens-before ordering would otherwise have to occur between them. Thus if two such adjacent operations exist in the interleaving, we have an N2334 race.
It remains to show the converse: If we have an N2334 race, there must be an interleaving corresponding to a sequentially consistent execution in which the racing operations are adjacent.
Start with the execution restricted to TP ∪ {N}, as above, with any value read by N adjusted as needed, also as above. We know that nothing in this partial execution depends on the value read by N. Let M be the other memory reference involved in the race.
We can further restrict the execution to those operations that happen before either M or N. This set still consists of prefixes of the sequences of operations performed by each thread. Since each load sees a store that happens before it, the omitted operations cannot impact the remaining execution. (This would of course not be true for a try_lock() whose failure were meaningful, since a failed try_lock() can be affected by lock operations that do not happen before it; with spurious failures allowed, its outcome remains acceptable.)
Define a partial order race-order on { x : x happens before M or N } ∪ {M} ∪ {N} as follows. First divide this set into three subsets:

- the operations that happen before M or N,
- {M}, and
- {N}.

Race-order places every element of the first subset before both M and N, and places M before N.
Race-order is consistent with happens-before and the synchronization order. It imposes no additional order on the initial subset. Neither M nor N is ordered by the synchronization order, and neither is race-ordered or happens before any of the elements of the first subset. Hence if we had a cycle A_0, A_1, ..., A_n = A_0, in which each element of the sequence happens before, or is race-ordered or synchronization-ordered before, the next, then neither M nor N could appear in the cycle; the cycle would lie entirely within the first subset and involve only happens-before and the synchronization order. This is impossible, since those two orders are required to be consistent.
Construct the total order T' as a total extension of the reflexive transitive closure of the union of

- the happens-before relation,
- the synchronization order, and
- race-order,

each restricted to this set of operations.
By the same arguments as in the proof of claim 1, every memory read must see the last preceding write to the same location in this sequence, except possibly N, since N is the only operation that may see a value stored by a racing operation. But we can again simply adjust the value seen by N to obtain the property we desire, without affecting the rest of the execution. Thus T' is the desired interleaving of thread actions in which the racing actions are adjacent.
Concluding Observation:
None of the above applies if we allow load_acquire and store_release operations, since the synchronization operations themselves may not behave in a sequentially consistent manner. In particular, consider the following standard ("Dekker's") example:
| Thread1 | Thread2 |
| --- | --- |
| store_release(&y, 1); | store_release(&x, 1); |
| r1 = load_acquire(&x); | r2 = load_acquire(&y); |
This allows r1 = r2 = 0, an outcome that sequential consistency (and Java) do not permit.
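In C++11 terms (memory_order_release and memory_order_acquire; a sketch using names that postdate this paper), the example reads:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x(0), y(0);
int r1, r2;

void thread1() {
    y.store(1, std::memory_order_release);
    r1 = x.load(std::memory_order_acquire);
}

void thread2() {
    x.store(1, std::memory_order_release);
    r2 = y.load(std::memory_order_acquire);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // r1 == r2 == 0 is a permitted outcome: if neither load reads from
    // the other thread's store, no synchronizes-with edge is created,
    // so nothing places the four operations in a single total order,
    // and each load may see the initial value 0.
}
```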