Doc. No.: WG21/N2338 J16/07-0198
Date: 2007-08-04
Reply to: Hans-J. Boehm
Phone: +1-650-857-3406
Email: Hans.Boehm@hp.com
This paper describes the compiler consequences of adopting the N2334 memory model proposal (a revision of N2300).
The concurrency memory model imposes constraints on the order in which memory operations on shared variables may become visible to other threads. Operations on locations known to be thread-local (e.g. operations on function scope variables whose address is not taken) are unaffected, and may be optimized as in the single-threaded case.
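As an illustrative example (not from the paper), consider:

    int f() {
        int sum = 0;                  // address never taken: provably thread-local
        for (int i = 0; i < 100; ++i)
            sum += i;                 // may be kept entirely in a register
        return sum;
    }

Here sum is known to be thread-local, so the compiler may promote it to a register and reorder or coalesce its updates freely, exactly as in the single-threaded case.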
Thus the effect of the proposal is limited to three kinds of memory operations:

1. Ordinary memory operations on potentially shared locations.
2. Synchronization operations, such as lock acquisition and release.
3. Atomic operations.
Clearly the first is the most interesting, since those operations are likely to be the most frequent. The frequency of the last in existing code is zero, and in purely lock-based code it will remain there. Performance of the second is probably determined much more by the runtime implementation than by compiler transformations. Nonetheless, we discuss them all in turn, at least briefly.
Thus source-to-source transformations on C++ programs may not introduce data races. Doing so would give the transformed program undefined semantics.
Indeed, a weak converse of this statement is generally also true. A transformation that preserves sequential correctness, preserves the state of memory at synchronization operations, and does not introduce data races, preserves correctness. If it failed to preserve correctness, some thread would have to observe the changed sequence of state changes between synchronization operations. It cannot do so without performing a concurrent memory operation on a location altered between those synchronization points. This involves a data race, either present in the original program or introduced by the compiler. We excluded the latter case; in the former, the original program already has undefined semantics, so correctness is trivially preserved.
Thus the crucial question is to identify transformations that may introduce data races. We examine what we expect to be some of the most common examples.
Consider, for example, an update to field b of a struct declared as

    struct { char a; int b:9; int c:7; char d; }

Many existing compilers on conventional 32-bit machines will implement the update by loading the entire one-word structure, replacing the bits corresponding to b, and then rewriting the entire structure. By overwriting other fields, including for example a, in the process, this will introduce a race with another thread concurrently updating field a. According to the proposed standard, field a is in a separate memory location; thus this introduces a race not present in the original source code.
Indeed, it can unexpectedly overwrite such a concurrent update to a, producing incorrect results.
Generally updates to one memory location, as defined in the standard, may not result in implicit updates to others, even if the original value is written back.
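To make the constraint concrete, here is a minimal sketch, not taken from the paper, of a conforming update confined to the bit-field memory location. The unit size, its offset, and the bit layout are all implementation-defined; the particular values below are purely illustrative assumptions.

    #include <cstdint>
    #include <cstring>

    struct S { char a; int b:9; int c:7; char d; };

    // Hypothetical conforming update of s->b: read and rewrite only the
    // two bytes assumed to hold the adjacent bit-fields b and c (which
    // share a single memory location), never the bytes holding a or d.
    void store_b(S* s, int value) {
        std::uint16_t unit;
        std::memcpy(&unit, reinterpret_cast<char*>(s) + 1, sizeof unit);   // assumed offset
        unit = static_cast<std::uint16_t>(
            (unit & ~0x01FFu) | (static_cast<unsigned>(value) & 0x01FFu)); // replace b's 9 bits
        std::memcpy(reinterpret_cast<char*>(s) + 1, &unit, sizeof unit);
    }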
(This constraint can be weakened a bit, since acquire and release operations generally allow one-way motion of ordinary memory operations across them. Thus an update to x immediately after a lock release allows a bogus write to x to be inserted just before the lock release. A race with the latter would imply a race with the former. But current compilers rarely perform this sort of transformation, and we expect it to rarely be substantially profitable.)
Thus if we have the code:

    switch (y) {
    case 0:  x = 17; w = 1;  break;
    case 1:  x = 17; w = 3;  break;
    case 2:  w = 9;  break;
    case 3:  x = 17; w = 1;  break;
    case 4:  x = 17; w = 3;  break;
    case 5:  x = 17; w = 9;  break;
    default: x = 17; w = 42; break;
    }

and x is potentially shared, it is not acceptable to reduce code size by transforming this to

    tmp = x;
    x = 17;
    switch (y) {
    case 0:  w = 1;  break;
    case 1:  w = 3;  break;
    case 2:  x = tmp; w = 9; break;
    case 3:  w = 1;  break;
    case 4:  w = 3;  break;
    case 5:  w = 9;  break;
    default: w = 42; break;
    }

Doing so would introduce a race if y were 2 and another thread were concurrently updating x. Again, the concurrent update might be lost as a result of the spurious store.
A second common class of problematic transformations promotes a potentially shared variable to a register for a region of code. A typical example is

    for (p = q; p != 0; p = p->next) {
        if (p->data > 0) ++count;
    }

where the variable count is potentially shared. Many compilers currently translate this to something along the lines of

    register int r1 = count;
    for (p = q; p != 0; p = p->next) {
        if (p->data > 0) ++r1;
    }
    count = r1;

This is unsafe if the list q is known to the programmer to contain no positive data, and there is a concurrent thread updating count. In this rather esoteric case, the transformation introduces a race. Since it can potentially hide an update that did not race in the original source, it is disallowed by the proposal.
(This example is designed to explain the compiler consequences of the proposed rules, not to motivate them. For a more motivational example, see Boehm, "Threads Cannot be Implemented as a Library", PLDI 2005, particularly section 4.3.)
There are usually alternative ways to optimize such code. The following transformed version is safe, and approximately equally efficient:
    register int r1 = 0;
    for (p = q; p != 0; p = p->next) {
        if (p->data > 0) ++r1;
    }
    if (r1) count += r1;

This relies on a fairly special-purpose transformation. By the arguments of the next section, the following more general variant is usually also safe in an optimizing compiler context, in spite of the fact that it may introduce a race:
    register int r1 = count;
    register bool count_modified = false;
    for (p = q; p != 0; p = p->next) {
        if (p->data > 0) {
            ++r1;
            count_modified = true;
        }
    }
    if (count_modified) count = r1;

This potentially involves a race, since it introduces a load of count that was not requested by the original source.
However, there is a significant difference between the two: A newly introduced store always has the potential to hide the value of a concurrent, now racing, store. Thus such an introduction not only produces a C++ program with formally undefined semantics; the resulting program is also virtually guaranteed to compile to machine code that is an incorrect translation of the original code.
On the other hand, a compiler that introduces a racing load at the machine code level, and then discards the result of the load, generally will not have changed the semantics of the program.
This is a subtle distinction: C++-source-to-C++-source transformations that introduce a racing load are invalid. On the other hand compiler transformations that do so on the way to generating object code for conventional machines may be completely correct, since races in the object code typically do not have undefined semantics. At this level, the semantics are defined by the architecture specification, not the C++ standard.
This may seem like a spurious distinction, but it is not. The author of C++ source code provides an implicit promise to the compiler: there are no data races on ordinary variables. This allows the compiler to safely perform some common optimizations that would otherwise be invalid. (See the relevant section of N2176 for details.) If the compiler itself introduces potentially racing loads in intermediate code, it can, and must, remember that these loads may race, and that they must be treated as such by subsequent optimizations.
This distinction also allows the relatively easy use of race detection tools at various levels. If we had hypothetical hardware (or a virtual machine) that aborted the program when a race was detected, compilers would not be allowed to introduce potentially racing loads. In this environment, adding such a race, even if the result of the load were unused, would change the semantics of the program. Although such environments are not commonplace, they have certainly been proposed, and may well turn out to be very useful.
This again argues that whether or not it is safe to introduce racing loads should depend on the target architecture and its semantics. If the target is C++ source code, it is disallowed. If it is a standard processor, for example an X86, it usually is allowed. If the target provides race detection for debugging purposes, it is again disallowed.
Thus if our source program reads

    if (x) z = w + 1;

and our target is a conventional hardware platform, it is acceptable to compile this as

    r1 = w;
    if (x) z = r1 + 1;

to better overlap the load latencies for x and w. This may introduce a new race if x is false. But the result of the racing load is not used in that case, and the semantics of the resulting object program on a standard architecture are unaffected.
The final transformation of the previous section has essentially the same characteristics.
The one significant change over some past standards is that the memory model makes it explicit that lock acquisition only has "acquire" semantics, and movement of memory operations past a lock acquisition into the critical section is allowed. This allows the compiler some flexibility that is technically disallowed under the current Posix thread specification. More importantly, it often means that lock acquisition requires one, not two, memory fence instructions. This may speed it up significantly. (This is what motivated the slightly more complicated N2334 definition of a data race.)
However, we believe that this optimization is already performed in many cases, in spite of its dubious standing with respect to existing standards. Thus even this does not appear to imply a real change to existing practice.
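As a concrete sketch of the fence savings, consider a spin lock written in the later-standardized C++11 atomics syntax, which postdates this paper and is used here only for illustration. Acquisition needs only acquire ordering, and release only release ordering, so on most machines each maps to at most one fence:

    #include <atomic>

    std::atomic_flag l = ATOMIC_FLAG_INIT;

    // Acquire semantics: later memory operations may not move above the
    // lock acquisition, but earlier ones may move below it.
    void lock()   { while (l.test_and_set(std::memory_order_acquire)) { /* spin */ } }

    // Release semantics: earlier memory operations may not move below
    // the lock release.
    void unlock() { l.clear(std::memory_order_release); }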
We use variable v to represent an atomic in the examples. Other variables are not atomic.
We expect that atomic operations will initially be treated as opaque by the compiler, just like the other synchronization operations from the last section, and thus prevent movement of other memory operations across atomic operations. The only substantial difference from locking primitives is that atomic operations would usually be expanded in-line.
This is safe, and for most applications we should expect that only a small amount of performance is left on the table as a result of this simplification. In most cases, the presence or absence of an additional memory fence is likely to have far more impact than compiler optimizations involving reordering of atomics.
However, some more aggressive compiler optimizations on atomics are allowed by the model:
    x = 1;
    r1 = v.load_acquire();

can be safely transformed to

    r1 = v.load_acquire();
    x = 1;

and

    v.store_release(1);
    x = 1;

can safely be transformed to

    x = 1;
    v.store_release(1);

Similarly, an atomic release store followed by an atomic acquire load may be reordered.
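To illustrate that last point, assuming a second atomic variable w (a name introduced here only for the example):

    v.store_release(1);
    r1 = w.load_acquire();

may be reordered to

    r1 = w.load_acquire();
    v.store_release(1);

since neither operation constrains this direction of motion: the release orders only earlier operations, and the acquire orders only later ones.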
Currently there is no explicit prohibition against translating

    while (v.load_acquire()) {}

to

    r1 = v.load_acquire();
    while (r1) {}

though I think most of us would agree that this should be avoided in a high quality implementation. However, such transformations may also be difficult to disallow without disallowing implementations that simply never schedule this thread. And I think it would be a mistake to try to dictate fairness at this level, since that would disallow some special-purpose cooperative threading implementations.
Similarly, consecutive atomic stores may in principle be combined. Once again, combining an unbounded number of consecutive stores in this way is strongly discouraged, but currently not formally disallowed.
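As a sketch of what such discouraged, but not disallowed, store combining would look like:

    v.store_release(1);
    v.store_release(2);

could be collapsed to

    v.store_release(2);

since an execution in which no other thread happens to observe the intermediate value 1 is a legal sequentially consistent execution of the original code.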
In this way atomics are very different from volatiles. Volatiles guarantee that the right number of memory operations are performed, but do not guarantee atomicity. Atomics do guarantee atomicity, but allow consecutive operations to be combined.
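For example (an illustration, not from the paper), two adjacent loads

    r1 = v.load_acquire();
    r2 = v.load_acquire();

may be combined by an implementation into

    r1 = v.load_acquire();
    r2 = r1;

whereas two reads of a volatile variable must both actually be performed.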
On the other hand, it is not acceptable to transform

    r1 = x + 1;
    r2 = v.load_acquire();
    r3 = x + 1;

to

    r1 = x + 1;
    r2 = v.load_acquire();
    r3 = r1;

This would effectively move the second load of x to before the load_acquire(), i.e. in the wrong direction. This optimization would clearly not be performed by the more naive implementation that treats load_acquire() as an opaque function call.
On most fence-based architectures, we expect v.store_ordered(x) to be implemented as something like

    fence_preventing_store_load_and_store_store_reordering;
    v = x;
    full_fence;

The full fence is typically needed to prevent reordering with a later atomic load. (Really we only need a fence that prevents store-load reordering. But typically this requires the most expensive fence.) It can be removed if the compiler can see that the next atomic load is preceded by a fence for another reason. Depending on the architecture, this may happen, for example, in the double-checked locking case.
On X86 processors, we expect that the standard implementation of an ordered store will consist of a single xchg instruction, whose result is ignored. This is expected to provide both the necessary fencing and total ordering guarantees.
The store_release operation can usually be implemented without the trailing expensive fence. We expect that on X86 hardware it will be implemented as a plain store instruction, which implicitly provides the required ordering guarantees on that architecture.
On most architectures, both a sequentially consistent load and a load_acquire operation will be implemented with a load instruction, followed by a fence that provides both load-store and load-load ordering. In a few cases, a more expensive fence may be required. On X86, a simple load instruction is expected to suffice.
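A sketch of these expected X86 mappings, written with the later-standardized C++11 atomics purely for concreteness (the paper predates that syntax; the free-function names mirror the paper's operations):

    #include <atomic>

    std::atomic<int> v;

    // Sequentially consistent ("ordered") store: on X86 this is expected
    // to compile to a single implicitly locked xchg whose result is ignored.
    void store_ordered(int x) { v.exchange(x, std::memory_order_seq_cst); }

    // Release store: a plain store suffices on X86, which does not reorder
    // a store with earlier loads or stores.
    void store_release(int x) { v.store(x, std::memory_order_release); }

    // Acquire load: a plain load suffices on X86; no trailing fence needed.
    int load_acquire() { return v.load(std::memory_order_acquire); }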
The proposed C++ sequentially consistent atomics are essentially equivalent to Java volatiles. Thus Doug Lea's JSR133 Cookbook also provides some useful implementation advice.