| Doc. No.: | WG21/N2176 J16/07-0036 |
|---|---|
| Date: | 2007-03-09 |
| Reply to: | Hans-J. Boehm |
| Phone: | +1-650-857-3406 |
| Email: | Hans.Boehm@hp.com |
The contents benefitted significantly from discussions with many other people, including Lawrence Crowl, Peter Dimov, Paul McKenney, Doug Lea, Sarita Adve, Herb Sutter, and Jeremy Manson. Not all of the participants in those discussions agree with all of my conclusions.
This document is essentially the concatenation of five shorter ones that previously appeared on the author's C++ memory model web site:
Some arguments are already given in the introductory "Rationale" section of committee paper N1942. This is a more detailed exploration.
The basic arguments for undefined data race semantics in C++ are:
There appears to be a consensus that we do not want to incur the fence overhead in such cases. This means that we need to allow wild branches in response to some kinds of data races. Allowing this in some cases, but not allowing it in all cases, would undoubtedly complicate the specification.
```
unsigned i = x;
if (i < 2) {
 foo:
  ...
  switch (i) {
    case 0: ...; break;
    case 1: ...; break;
    default: ...;
  }
}
```

Assume the code at label foo is fairly complex, forcing i to be spilled, and that the switch implementation uses a branch table. (If you don't believe the latter is a reasonable assumption here, assume that "2" is replaced by a larger number, and that there are more explicit cases in the switch statement.)
The compiler now performs the following reasonable (and common?) optimizations:
Some such redesign is necessary anyway to avoid speculative writes. But I think that is usually limited to a few transformations which can currently introduce such writes. I suspect the redesign required here would be significantly more pervasive. And I do not see how to justify the cost.
The one benefit of providing better defined race semantics seems to be somewhat easier debuggability of production code that encounters a race. But I'm not sure that outweighs the negative impact on race detection tools.
The interfaces for atomic objects that we have been considering provide ordering constraints as part of the atomic operations themselves. This is consistent with the Java and C# volatile-based approach, and with our atomic_ops package, but inconsistent with some other atomic operation implementations, such as that in the Linux kernel, which often require the use of explicit memory fences.
Some rationale for this choice is provided as part of N2145 and N2047. Here we provide a somewhat expanded and updated version of that rationale.
Note that N2153 argues that explicit fences are still needed for maximum performance on certain applications and architectures. The arguments here do not preclude providing them as well.
Here we list our reasons for explicitly associating ordering semantics with atomic operations, and correspondingly providing different variants with different ordering constraints:
On X86 processors, the fence is redundant only if precisely the right kind of fence (for a store, one that prevents "LoadStore" and "StoreStore" reordering) is used. ( N2153 does suggest such a fence.)
On Itanium, the fence, once requested, can generally not be optimized back to an st.rel instruction. To see this, consider the hypothetical lock release sequence:
```
x = 1;                            // lock-protected assignment
LoadStore_and_StoreStore_fence(); // hypothetical optimal fence
lock.store_relaxed(0);            // atomically clear spinlock
y.store_relaxed(42);              // unrelated atomic assignment
```

Note that this does not allow the assignments to x and y to be reordered. However, on Itanium, we really want to transform this to the equivalent of
```
x = 1;                 // lock-protected assignment
lock.store_release(0); // atomically clear spinlock
y.store_relaxed(42);   // unrelated atomic assignment
```

This is not safe, since this version does allow the assignments to x and y to be reordered. In most realistic contexts, it would be hard to determine that there is no subsequent assignment like the store to y. Hence I would not expect this transformation to be generally feasible.
This issue is admittedly largely Itanium specific. But note that the above reordering should be allowed; the fence-based implementation adds a useless constraint. If we only have fences, we can't express the abstractly correct ordering constraint for even a spin-lock release.
```
r1 = x.load_relaxed();
r2 = *r1;
```
Most processors guarantee that, for a naive translation of this code, the second load becomes visible after the first, since there is a data dependency between the two. Indeed Java almost requires such a guarantee in order to allow efficient implementation of final fields.
Such a guarantee is occasionally useful. It would guarantee that, for the above example, if another thread initialized an object, and then stored a pointer to it into a previously null x in a way that ensured the ordering of the two stores, then we could not see a non-null r1 with an uninitialized value for r2.
Note however that all such guarantees become vacuous if only load_acquire and store_release are used. In the most interesting cases, dependency-based ordering allows us to use load_relaxed instead of load_acquire. The resulting performance impact varies from none on X86, SPARC, or PA-RISC to that of a full memory fence on Alpha. PowerPC and Itanium fall in between.
Our most recent C++ atomics (N2145) and memory model (N2171) proposals provide no dependency-based ordering guarantees. Java provides no dependence-based guarantees for non-volatile (unordered) accesses, except for final field guarantees, which rely on dependence-based reasoning at most in the implementation. In contrast, N2153 does assume dependency-based ordering.
The point of this note is to discuss the issues that bear on the C++ decision. Much of this relies on an earlier discussion led by Bill Pugh and Jeremy Manson in the context of the Java memory model, some of which is reproduced in the POPL 2005 paper by Manson, Pugh, and Adve. However, the Java discussion is in the context of ordinary variable accesses, and the motivation there relies heavily on not impeding compiler optimizations. Since the C++ rules apply only to "relaxed" (unordered) accesses to atomic variables, which are expected to be infrequent, these arguments are less convincing for C++.
```
r1 = x.load_relaxed();
r3 = &a + r1 - r1;
r2 = *r3;
```

Statically, the value of r3 appears to depend on r1. However, many compilers will optimize this to at least
```
r1 = x.load_relaxed();
r3 = &a;
r2 = *r3;
```

removing the dependency in the process. As a result, much common hardware (e.g. PowerPC, ARM, Itanium) will no longer guarantee that the first and last load appear to be executed in order. And it is unclear that even later stages of the compiler should be required to preserve the order; one of the reasons for introducing load_relaxed was to allow compiler reordering around the operation.
As we stated earlier, we expect load_relaxed to be infrequent in practice. Thus it might not be unreasonable to disallow optimizations involving it. But note that the above optimization is performed entirely on the middle statement, which does not mention load_relaxed, and in fact might occur in a function call, whose implementation is in a separate translation unit.
Thus requiring memory ordering based on static dependencies would impede optimization in other translation units, which might not even mention load_relaxed. This is clearly unacceptable.
For example,
```
r1 = x.load_relaxed();
if (r1 == 0)
  r2 = *r1;
else
  r2 = *(r1 + 1);
```

might be transformed to
```
r1 = x.load_relaxed();
if (r1 == 0)
  r3 = r1;
else
  r3 = r1 + 1;
r2 = *r3;
```

Again, the transformations that potentially break the dependency may happen in a separate compilation unit. Thus restricting them appears impractical.
```
r1 = x.load_relaxed();
if (r1) {
  r2 = y.a;
} else {
  r2 = y.a;
}
```

Either of the loads of the ordinary variable y is clearly conditionally executed.
If we identify the two loads of y because they both perform the same action, we end up with what is almost certainly an unusable programming model. This would mean that these loads are not dependent on the initial load_relaxed, and hence are not ordered with respect to it.
To see why this is unusable, consider instead the variant that involves function calls:
```
r1 = x.load_relaxed();
if (r1) {
  f(&y);
} else {
  g(&y);
}
```

Assume that the implementations of f() and g() are defined elsewhere and not visible to the author of this code. If we identified the above loads in the original example, and both f() and g() happen to perform the same loads, then presumably we would have to say that the accesses to y by f() and g() are also unordered with respect to the load_relaxed. On the other hand, if they access different fields, the accesses would be ordered.
This leaves us in a position in which the author of the client code cannot tell whether memory accesses are ordered without inspecting the implementation of called functions. Even worse, ordering may be only partially guaranteed because the two functions happen to perform one common memory access. We do not believe this is viable.
By a similar argument, we need to treat two memory operations as distinct if they correspond to the same source statement, but are reached via different call chains. For example, if the functions f() and g() both call h(y), which is implemented as
```
int h(int *y) { return *y; }
```

then the load is always performed by the same statement, but nothing substantive has changed.
If we return to the earlier example

```
r1 = x.load_relaxed();
if (r1) {
  r2 = y.a;
} else {
  r2 = y.a;
}
```

and treat the two loads of y.a as distinct (as we must), then they depend on, and are ordered with respect to, the load_relaxed. And for a naively compiled version, most hardware would implicitly enforce that.
However, if the above were compiled to
```
r1 = x.load_relaxed();
r2 = y.a;
```

as many compilers would, then the dependence has been removed, and the two loads are no longer ordered. Hence such optimizations would have to be disallowed.
It quickly becomes clear that this is intolerable, when we consider that this code may again be distributed among several compilation units. Consider:
```
int h(int r1) {
  if (r1) {
    return y.a;
  } else {
    return y.a;
  }
}

int f() {
  ...
  r1 = x.load_relaxed();
  r3 = h(r1);
  ...
}
```

The dependency, and hence the ordering, between the load_relaxed and the y.a loads is broken if h is optimized in the obvious way to just unconditionally return y.a. But h() could easily reside in a separate compilation unit, possibly one that hasn't been recompiled since the addition of atomics.
Our conclusion is that this approach is also not viable.
It may be feasible to restrict dependency-based ordering to the important special case in which the dependent operation is a load or store to a field of an object pointed to by the result of an initial load_relaxed. But this seems to still run afoul of some, admittedly much more obscure, optimizations, which would otherwise be legal, and could be carried out in separate compilation units. Consider:
```
r2 = x.load_relaxed();
r3 = r2 -> a;
```

If we have some reason to believe, e.g. based on profiling, that the resulting value in r2 is usually the same as a previously computed value in r1, it may well be profitable to transform this to
```
r2 = x.load_relaxed();
r3 = r1 -> a;
if (r1 != r2) r3 = r2 -> a;
```

since it usually breaks the dependency between the two loads. But by doing so, it also breaks the hardware-based ordering guarantee.
Thus even this case doesn't appear very clear-cut.
If we tried to extend this to dependencies through array subscripts, things become more dubious still. If we have
```
r1 = x.load_relaxed();
r2 = a[r1 -> index % a_size];
```

and assume that some other thread t' initializes an object q, sets a[q -> index % a_size], and then performs a store_release to set x to q, are we guaranteed that r2 will contain the value assigned by t' to the array element, instead of an earlier value?
Usually the answer would be "yes". However if the compiler happens to be able to prove that a only has a single element, e.g. because the program is built for a particularly small configuration, the answer unexpectedly becomes "no".
We could easily add a templatized library function that loads and dereferences an atomic pointer, guaranteeing only that the dereferenced value is read after the pointer itself. But that is of somewhat limited utility, and error-prone.
Many modern architectures provide fence instructions that are specific to only loads or stores. For example, SPARC provides an assortment of such fence instruction variants. Alpha's "write memory barrier" is another well-known example. We have not proposed to provide similar facilities in the C++ memory model or atomic operations library.
Although there are cases in which load-only or store-only ordering are useful, we believe that the correctness constraints are exceedingly subtle, even relative to the other constructs we have been discussing. Hence we believe it makes sense to consider support for these only if there is evidence that such restricted ordering is likely to be much cheaper on a reasonable selection of processors. (This is apparently not the case on PowerPC, ARM, X86, MIPS, or Itanium. I have not seen measurements for SPARC, where it might matter. I think wmb is a bit cheaper than mb on most Alphas, but that's of limited interest.)
To see why read-only and write-only fences are tricky to use correctly, consider the canonical use case for acquire/release ordering constraints. We use an atomic flag variable x_init to hand off data (an ordinary variable x) from one thread to another:
```
Thread 1:
<Initialize x>
x_init.store_release(true);

Thread 2:
if (x_init.load_acquire())
  <use x>
```

At first glance, this should be the canonical use case that requires only store ordering in thread 1. However, that is actually only safe under very limited conditions. To explain that, we consider two cases, depending on whether the uses of x are entirely read-only or not.
This case is highly problematic.
Clearly, and unsurprisingly, it is unsafe to replace the load_acquire with a version that restricts only load ordering in this case. That would allow the store to x in thread 2 to become visible before the initialization of x by thread 1 is complete, possibly losing the update, or corrupting the state of x during initialization.
More interestingly, it is also generally unsafe to restrict the release ordering constraint in thread 1 to only stores. To see this, consider what happens if the initialization of x also reads x, as in
```
x.a = 0;
x.a++;
x_init.store_write_release(true);
```

and the code that uses x in thread 2 updates it, e.g. with
```
if (x_init.load_acquire()) x.a = 42;
```

If the release store in thread 1 were restricted to only ensure completion of prior stores (writes), the load of x.a in thread 1 (part of x.a++) could effectively be reordered with the assignment to x_init, and could hence see a value of 42, causing x.a to be initialized to 43.
This admittedly appears rather strange, since the assignment of 43 to x.a would have to become visible before the load, on which it depends, is completed. However, we've argued separately that it doesn't make sense to enforce ordering based on dependencies, since they cannot be reliably defined at this level. And machine architectures do not even consistently guarantee ordering based on dependencies through memory. Thus this is an unlikely and very surprising result, but not one we could easily (or apparently even reasonably) disallow in the presence of a store-only fence. And it is clearly surprising enough that it would be important to disallow it.
One might argue that there are nonetheless interesting cases in which the initial operations are known to involve only stores, and hence none of this applies. There are three problems with this argument:
In this case, it appears to be safe to limit the release and acquire constraints to loads and stores respectively.
However, I believe the argument here requires more than that the operation be logically read-only; it must truly not write to the object. Since I believe we generally allow library classes (e.g. containers) to avoid synchronization during construction, and then lock "under the covers" for read-only operations (e.g. for caching), this appears to be difficult to guarantee.
Simple formulations of programming language memory models often allow "out-of-thin-air" values to be produced. To see how this might happen, consider the following canonical example (Example 1), where variables x and y are initially zero and atomic, variables starting with "r" are local ("register") variables, and assignment denotes unordered (raw, relaxed) loads or stores:
```
Thread 1:
r1 = x;
y = r1;

Thread 2:
r2 = y;
x = r2;
```

The code executed by each thread simply copies one global to another; we've just made the intermediate temporaries explicit to emphasize that both a load and a store are involved.
The question is whether each load can "see" the value written by the other thread's store. If this is possible, then we can get an apparently consistent execution in which each store stores a value of 42, which is then read by the load in the other thread, justifying the stored value.
At some level this appears "obviously absurd". But here are two arguments that, although we probably want to disallow the behavior, it is not as absurd as might appear at first glance:
```
Thread 1:
r1 = x;
y = r1;

Thread 2:
r2 = y;
x = 42;
```

Since the assignments represent unordered operations, we would like to allow either the compiler or hardware to reorder the two statements in the second thread. Once that happens, we get the questionable behavior if the first thread executes entirely between the two statements. This corresponds to exactly the same visibility of stores as in the original example. The only difference is the "data dependency" between the two statements in the second thread.
(Adding the dependency based ordering to only the "precedes" relation, as suggested earlier, doesn't help, as becomes apparent below.)
To see why things are so bad, consider another variant (example 3) of the above example:
```
Thread 1:
r1 = x;
y = r1;

Thread 2:
r2 = y;
if (r2)
  x = 42;
else
  x = 42;
```

This differs from the preceding version in that each assignment to x is fairly clearly dependent on y, for any reasonable definition of what that means. (Recall that considering only data dependencies doesn't work, since a data dependent assignment could be transformed to a switch statement with many constant assignments. Somehow looking at "the next assignment to x" no matter where it occurs syntactically also doesn't work, for the reasons given in the dependency discussion, and since we could easily replace x = 42; by x = 41; x = 42;.)
The problem of course is that example 3 can be easily optimized to example 2, and most people would expect a good compiler to do so. Allowing r1 = r2 = 42 in example 2, but not example 3, seems untenable and, based on the previous dependency discussion, probably infeasible.
As a result, we now have a second example, which has basically the same structure as example 1 and the same dependency pattern. The difference is really just that in one case the dependency can be "optimized away", and in the other case it cannot.
An unordered atomic store shall only store a value that has been computed from constants and program input values by a finite sequence of program evaluations, such that each evaluation observes the values of variables as computed by the last prior assignment in the sequence. The ordering of evaluations in this sequence must be such that
[Note: This requirement disallows "out-of-thin-air", or "speculative" stores of atomics. Since unordered operations are involved, evaluations may appear in this sequence out of thread order. For example, with x and y initially zero,
```
Thread 1:
r1 = y.load_relaxed();
x.store_relaxed(r1);

Thread 2:
r2 = x.load_relaxed();
y.store_relaxed(42);
```

is allowed to produce r1 = r2 = 42. The sequence of evaluations justifying this consists of:
```
y.store_relaxed(42);
r1 = y.load_relaxed();
x.store_relaxed(r1);
r2 = x.load_relaxed();
```

On the other hand,
```
Thread 1:
r1 = y.load_relaxed();
x.store_relaxed(r1);

Thread 2:
r2 = x.load_relaxed();
y.store_relaxed(r2);
```

may not produce r1 = r2 = 42, since there is no sequence of evaluations that results in the computation of 42.]
This has the large advantage that we simplify the core memory model and move the complexity to the section on unordered atomics, where it belongs.
It has the potential disadvantage that my characterization above of what constitutes an "out-of-thin-air" result may well again be wrong. On the other hand, even if we discover that it is, and there is no adequate replacement, I think this rule is sufficiently esoteric that it may be acceptable just to state it as "no speculative unordered atomic stores", and then leave the note.
Jeremy Manson has already pointed out that the above rule fails to preclude
```
Thread 1:
r1 = x;
if (r1 == 42) y = r1;

Thread 2:
r2 = y;
if (r2 == 42) x = 42;
```

from assigning 42 to r1 and r2, in spite of the fact that the initial 42 still comes "out of thin air". This is clearly unacceptable if x and y are ordinary variables, since this code is data-race free, and should thus be sequentially consistent. And we guarantee that r1 = r2 = 0 for ordinary variables.
Whether this is acceptable for relaxed atomics, or whether we should specifically state that relaxed atomics will give sequentially consistent results if the relaxed atomics do not "race" is unclear to me. My impression is that users will not care because there is no reason to use relaxed atomics unless there is a race, and implementors will not care, since no reasonable implementation of atomics will exhibit weaker properties than for ordinary variables.
N2171 incorporates the first half of the change we proposed here, i.e. the last trace of dependency-based ordering was dropped from N2052.