Doc. no. | N1942=06-0012 |
Date: | 2006-02-26 |
Reply to: | Hans Boehm <Hans.Boehm@hp.com> |
This is an attempt to outline a "memory model" for C++. It addresses the multi-threaded semantics of C and C++, particularly with respect to memory visibility. We concentrate on the question of what values a load of an object (i.e. an l-value to r-value conversion) may observe.
Much of this proposal is still rather informal. It has benefitted from input from many people, including Andrei Alexandrescu, Kevlin Henney, Ben Hutchings, Doug Lea, Jeremy Manson, Bill Pugh, Alexander Terekhov, Nick Maclaren, and others. It is mostly an attempt by Hans Boehm to turn the discussion results into a semi-coherent document. It builds on the earlier papers we have submitted to the committee: N1680=04-0120, N1777=05-0037, and N1876=05-0136, as well as the work cited there.
We use a style in which the core document is presented in the default font and color. We occasionally insert additional more detailed discussion, motivation, and status as bracketed text in a smaller green font. This additional text is not fundamental to understanding the proposal.
A more dynamic, evolving, and probably less consistent, version of this document, along with further background material, can be found here.
The reasons we are not currently pursuing that route, in spite of the overlap among the participants are:
This is complicated somewhat by the desire to support "atomic operations" which are effectively synchronization operations that can legitimately participate in data races.
[ This is at least our third attempt at a description of the memory model. The first attempt was to define the semantics as sequential consistency for programs that had no data races, where a data race was defined as consecutive execution of conflicting operations in a sequentially consistent execution. That was nice and simple. But it makes it very difficult to define synchronization primitives that allow some reordering around them. These include even a pthread-like locking primitive that has only "acquire" semantics, i.e. allows preceding memory operations to be moved into the critical section. Surprisingly, at least for one of us, pthread_mutex_lock does not have that property, due to the presence of pthread_mutex_trylock. ]
[ There is an argument for introducing alternate syntax for atomic operations, e.g. by describing variables as __async volatile. (It was concluded that recycling the plain volatile keyword introduces a conflict with existing usage, a relatively small, but nonzero, fraction of which is demonstrably legitimate. Whether or not that constitutes sufficient reason to leave the current largely platform-dependent semantics of volatile in place remains controversial.) Introducing some such syntax may make it easier to code idioms like double-checked locking, and to use consistent idioms across programming languages. Without a better understanding of the form the atomic operations library will take, it is unclear whether this argument is valid. The argument against it is simplicity, and elimination of an ugly or reused keyword. ]
In the absence of atomic operations, which may concurrently operate on the same data for different threads, this notion is equivalent to Lamport's definition of sequential consistency.
We now define two kinds of order on memory operations (loads, stores, atomic updates) performed by a single thread. The first one is intended to replace the notion of sequence point that is currently used in parts of the standard:
We will say that a subexpression A of the source program is-sequenced-before another subexpression B of the same source program to indicate that all side-effects and memory operations performed by an execution of A occur-before those performed by the corresponding execution of B, i.e. as part of the same execution of the smallest expression that includes them both.
We propose roughly that wherever the current standard states that there is a sequence point between A and B, we instead state that A is-sequenced-before B. This will constitute the precise definition of is-sequenced-before on subexpressions, and hence on memory actions and side effects.
A very similar change in the specification of intra-thread sequencing of operations is being simultaneously proposed by Clark Nelson in N1944=06-0014, which explores the issue in more detail. Our hope is that a proposal along these lines will be accepted, and can serve as the precise definition of is-sequenced-before.
[ Based on recent email discussions, at this point there appears to be some uncertainty in the interpretation of the current standard, which is complicating matters. The main issue seems to be the precise meaning of the restriction on interleaving of function calls in arguments. It appears important to resolve this even in the single-threaded case. ]
An ordinary memory operation is never inter-thread-ordered-before another ordinary memory operation.
Most atomic operations will specify some combination of acquire and release ordering constraints, which enforce ordering with respect to subsequent and prior memory actions, respectively. These constraints are reflected in the is-inter-thread-ordered-before relation.
Lock acquisition imposes at least an acquire constraint, and lock release will normally impose a release constraint. Whenever an action A has an acquire constraint, and A is-sequenced-before B, then A is-inter-thread-ordered-before B. Whenever A has a release constraint, and B is-sequenced-before A, then B is-inter-thread-ordered-before A.
Note that the definition of depends-on is relative to a particular execution, and always involves a dependence of an atomic store on an atomic load. Ordinary memory operation do not depend-on each other. We need the depends-on relation only to outlaw certain anomalous executions of atomic operations that informally violate causality, i.e. in which an atomic operation causes itself to be executed in a particular way.
We next discuss relations on actions between threads.
Informally, a lock release communicates-with the next acquisition of the same lock. A barrier (in the pthread_barrier_wait sense or OpenMP sense) communicates-with all corresponding executions of the barrier in other threads. A memory fence (or barrier in the other sense) communicates-with the next execution of the fence, usually by another thread. An atomic store communicates-with all atomic loads that read the value saved by the store, i.e. for this purpose they behave like ordinary loads and stores.
We now have enough machinery to describe the ordering we really use to describe memory visibility.
[ Note that this definition of happens-before is a bit different from that used in Lamport's original 1978 paper ("Time, Clocks, and the Ordering of Events in Distributed Systems", CACM 21,7), and eventually in the Java model, but it is essentially just an adaptation of Lamport's definition to a system in which actions within a thread are also not totally ordered. The detailed style of definition grew out of a discussion with Bill Pugh, though we're not yet sure he approves of the result. ]
Given is-sequenced-before, is-inter-thread-ordered-before, depends-on and communicates-with relations on a set of actions, we define the happens-before relation to be the smallest transitively closed relation satisfying the following constraints:
[ This means we view the relation as normally irreflexive. If we normally want it to be reflexive, we can tweak this slightly. ]
If lock/unlock enforce only acquire/release ordering, and there is no other form of synchronization, then it is less apparent that our definition is equivalent to sequential consistency. However, this can be proven if there is no trylock primitive.
[ At least some of us believe that the most plausible interpretation of the existing pthread semantics can be closely approximated by defining the various lock() primitives such that they have both acquire and release semantics. This still leaves issues related to failing pthread calls, etc. We believe that these introduce no fundamental technical challenges, but the details are not currently completely clear. ]
[ The fact that we insist that we require much stronger ordering for ordinary memory accesses than for atomic accesses initially seems out of place here. But, as Bill Pugh points out, simple Java-like happens-before consistency is otherwise insufficient. (For an example, see below.) And the ordering constraints on ordinary memory actions really only affect the definition of a data race; the meaning of data-race-free program is not affected, since this ordering is invisible. ]
We define an execution to contain an intra-thread race if a thread performs two conflicting actions, and neither is-sequenced-before the other.
We define an execution to contain an inter-thread race if two threads perform two conflicting actions, and neither happens-before the other.
[ I'm not sure this is quite the best way to state this, since the "communicates-with" relation on ordinary memory accesses contributes to happens-before. Thus "conflicting" actions may "communicate-with" each other, and thus "happen-before" each other, and thus no longer conflict. I think this technically doesn't matter because non-conflicting actions on ordinary memory normally only imply happens-before relationships that must exist anyway. And if there is an execution with an initial conflict that is eliminated by the "communicates-with" edge generated by the conflict, then there is an alternate consistent execution in which that edge doesn't exist, and there is a real conflict. But I don't like the fact that a subtle argument is required to demonstrate that this definition is sane.
We do need to include the extra "communicates-with" edges in the requirement that happens-before is acyclic. Otherwise we get nothing like sequential consistency in the data-race-free case. ]
If, for a given program and input, there are consistent executions containing either kind of race, then the program has undefined semantics on that input. Otherwise the program behaves according to one of its consistent executions.
[ Bill Pugh points out that that the notion of "input" here isn't well defined for a program that interacts with its environment. And we don't want to give undefined semantics to a program just because there is some other sequence of interactions with the environment that results in a data race. We probably want something more along the lines of stating that every program behavior either
I think the notion of "before" in the second clause is easily definable, since we can insist that IO operations be included in the effectively total order of ordinary variable accesses.
It is unclear to me whether this is something that needs to be addressed with great precision, since the current standard doesn't appear to, and I think the danger of confusion is minimal. ]
For purposes of the above definitions, object destruction is viewed as a store to all its sub-objects, as is assignment to an object through a char * pointer if the target object was constructed as a different type. Assignment to a component of a particular union member is treated as a store into all components of the other union members. Different threads may not concurrently access different union members.
lock(a); x = 1; unlock(a); lock(b); y = 1; unlock(b);may appear to another thread to be executed as
lock(a); lock(b); y = 1; x = 1; unlock(a); unlock(b);Unlike in Java, a hypothetical observer thread might see the assignments occur out of order. However, so long as x and y are ordinary variables, and the assignments do not reflect atomic operations, this is not observable, since such an observer thread would introduce a race.
Note that although our semantics allows the above reordering in a particular execution, compilers may not in general perform such rearrangements, since that might introduce deadlocks.
We claim that for data-race-free programs using only simple locks and no atomic operations, our memory model is identical to the Java one.
In the case of simple locks, we effectively insist on a total order among synchronization operations, happens-before is an irreflexive partial order, and everything behaves as in the Java memory model.
Next consider load_acquire and store_release primitives, where there must be an action that communicates-with every load_acquire: either an initialization or a store_release on the same variable. But there are no other restrictions on communicates-with.
Consider the following example:
Thread1 | Thread2 |
x = 1; store_release(&flag, 1); |
r1 = load_acquire(&flag); r2 = x; |
Due to the acquire and release ordering constraints on the references to flag, the individual pairs of assignments in each thread are ordered by is-inter-thread-ordered-before.
Only two possible actions can communicate-with the load_acquire in this example: The initialization of the flag variable, or the store_release action. It follows that if we get r1 = 1, then the store_release must have communicated-with the load-acquire. Given the ordering constraints,this implies that the assignment to x happens-before the assignment to r2. Hence r1 = 1 and r2 = 0 is an impossible outcome, as desired.
Next consider the following example, under the same ground rules:
Thread1 | Thread2 |
store_release(&y, 1); r1 = load_acquire(&x); |
store_release(&x, 1); r2 = load_acquire(&y); |
There is no is-inter-thread-ordered-before ordering between the statements in each thread. Initialization operations may synchronize with both load_acquire operations. Hence we can get r1 = r2 = 0, as expected.
We can model memory fences as synchronization operations with both acquire and release semantics, which must be totally ordered, and the communicates-with relation must respect the total order, i.e. each fence instance communicates-with the next fence instance in the total order. Consider:
Thread1 | Thread2 |
x = 1; r1 = z; fence(); y = 1; r2 = w; |
w = 1; r3 = y; fence(); z = 1; r4 = x; |
In any given execution either the thread 1 fence communicates-with the thread 2 fence, i.e. the thread 1 fence executes first (a), or vice-versa (b). If r3 = 1, then we must be in the first case. (If thread 2's fence came first, the load of y happens-before the store, and hence cannot see the store.) Hence r4 must also be one. Similarly r1 = 1 implies r2 = 1.
Next consider the following suggestive example, where initially self_destruct_left and self_destruct_right are false:
Thread1 | Thread2 |
while (!self_destruct_left); self_destruct_right = true; blow_up_port_side_of_ship(); |
while (!self_destruct_right); self_destruct_left = true; blow_up_starboard_side_of_ship(); |
Assume there is no other thread which sets either of the self_destruct... variables. Assume the thread 1 loop nonetheless terminates. This is only possible if it saw the store in thread 2. For such an execution to be intra-thread consistent, thread 2's loop must have seen the store from thread1. Thus in each case, the store depends-on the load in the while loop in the same thread, and the store communicates-with the loop in the other thread. The happens-before relation must be consistent with all of these and transitively closed. Hence all actions mentioned are in a cycle, i.e. happen-before themselves. Thus such an execution is not allowed.
If we use ordinary memory operations instead of atomic operations in the above example, then the loads in the while loops are sequenced-before the stores in the same thread. We need the same communicates-with relationships as before, thus happens-before will again be cyclic, and this version of the program is also safe.
[ Unlike earlier versions of this proposal, the presence of the depends-on relation seems to introduce enough of a causality requirement here to prevent the anomalous outcome if we use unordered atomic operations. With ordinary memory operations there is nothing surprising. By adding communicates-with relationships for all matching cross-thread store-load pairs, and insisting that the result not contain cycles, we are essentially insisting on sequential consistency. Needs more examples. ]
struct s { char a; int b:9; int c:5; char d; int e:1; };
An assignment to the bit-field b conflicts with an access to c, but no accesses to a different pair of fields conflicts.
Note that in some existing ABIs, fields a,b, c, and d are allocated to the same 32-bit word. With such ABIs, compilers may not implement an assignment to b as a 32-bit load, followed by an in-register bit replacement, followed by a 32-bit store. Such implementations do not produce the correct result if a or d are updated concurrently with b.
[ Note that the above example illustrates the most controversial aspect of this rule that we have found. It will take work to make existing compilers conform. For example, gcc on X86 currently does not. However, the resulting slowdown in object code appears to be very minor. To our knowledge, cases like the above exist, but are rare, in production code. And the necessary overhead amounts to some additional in-register computation, and a second store instruction to the same cache line as the first.
The alternatives would add complexity to the programming task, which some of us believe we want to avoid unless the benefits are much larger than this. Consider the slight variant of the above example:
struct s { something_declared_third_party_header a; int b:9; int c:5; char d; int e:1; };
Whether or not accesses to a and b conflict is now hard to predict without implementation knowledge of a's type, even if we understand the ABI conventions.
Probably the only plausible alternative is to allow bit-field accesses to conflict with accesses to any other data member in the same struct or class, and to thus encourage bit-fields to only appear in a nested struct. But this would add a rather obscure rule to the already complicated set that must be understood by threads programmers. ]
As far as the programmer is concerned, padding bytes are not written as part of a bit-field or data member update, i.e. such writes do not need to be considered in determining whether there is a data race. But we are not aware of cases in which a compiler-generated store that includes a padding byte would adversely impact correctness.
If proper synchronization is used, there is no need for additional guarantees, since the synchronization will ensure visibility as needed.
This is consistent with current practice, which does not guarantee even vtable visibility in the absence of synchronization, and allows insufficiently synchronized programs to crash, jump to the wrong member function, etc.
[There appeared to be consensus among those attending the Lillehammer C++ standards meeting that both options should be provided to the programmer. Subsequent discussion pointed out that it is more reasonable than some of us had thought to always require thread-safety. In particular, there seem to be no practical cases in which a compiler decision to implement an initialization statically breaks ordering guarantees that would reasonably be expected. The down side is that this imposes some overhead on uses that do not require synchronization. On X86, this overhead can be significant for the initialization, but probably not for later uses. On some other architectures significant overhead may be introduced even for later references. Currently I think this issue is unresolved.
The protected keyword was chosen arbitrarily, and should be considered more carefully. ]
If the atomic operations library turns out to be insufficiently convenient to provide for lock-free inter-thread communication, we propose that accesses to __async volatile variables and data members are viewed as synchronization operations.
Loads of such variables would have an acquire ordering constraint, and stores would have a release constraint.
[ It seems to make sense to put this on hold until we have a better handle on the atomic operations library, so that we can tell whether that would be a major inconvenience.
The possible reasons to retain this are (1) possibly improved convenience, and (2) possibly better consistency in programming idioms across languages (in this case Java and C#). The argument for discarding it is simplicity.
If we want to retain it, we now have to ask whether there is a total order among volatile accesses.
Current implementations of volatile generally use weaker semantics, which do not prevent hardware reordering of volatiles. This appears to have no use in portable code for threads, since such code cannot take advantage of the fact that operations are reordered "only by the hardware". It is occasionally useful for variables that are either modified after a setjmp, that may be accessible through multiple memory mappings, or the like. ]
There are no atomicity guarantees for accesses to volatile variables. Accesses to __async volatile variables of pointer types, integral types (other than long long variants), bool type, enumeration types, and type bool are atomic. The same applies to the individual data member accesses in e.g. struct assignments, but not to the assignment as a whole. There is no defined order between these individual atomic operations.
[We can't talk about async-signal-safety here. We might suggest that __async volatile int and __async volatile pointers be async-signal-safe where that's possible and meaningful. My concern here is with uniprocessor embedded platforms, which might have to use restartable critical sections to implement atomicity, and might misalign things. }
[ We need to better understand current conventions here. Presumably a file read is logically a "write" operation, since it updates the current position in the file. But I don't know how much locking is done by current C++ I/O implementations. I suspect it's more than we would like.]
[ Paragraph 21.3(5), which deals with basic_string copy-on-write support, will be difficult to support here in any reasonable fashion. I've long advocated stripping it to prohibit copy-on-write since I'm not convinced it makes sense without threads either. Unfortunately, removing it will drastically change the performance characteristics of existing implementations, often for the better, but occasionally for the worse. ]
All of these are discussed elsewhere.
[ It is unclear to what extent this needs to or should be addressed here. I think there is agreement that thread cancellation (though not PTHREAD_CANCEL_ASYNCHRONOUS-style) and exceptions should be unified. But the details are controversial, and that seems to be more of a threads API issue.
Nick Maclaren argues that we need to say something about the state that is visible to an exception handler that was thrown to reflect a synchronous error, such as an arithmetic overflow. Since we are effectively respecifying intra-thread memory visibility, there are strong interactions with threads issues, and the presence of synchronization primitives gives us an opportunity for a meaningful specification that is at least somewhat useful to a programmer, I'm inclined to agree. What follows is an approximate restatement of one of the options he proposed.
This essentially requires that compilers treat operations that may generate exceptions as memory operations, and not move them out of critical sections etc. I would be surprised if existing implementations did so.
This may need further work, even if we go with substantially this statement. In particular, the handler kind of needs to be modelled as a new thread replacing the old one, since it can have an inconsistent view of updates performed by the original thread. But on the other hand, it potentially has access to local variables whose address was not taken, and hence can see otherwise private state of the original thread. ]
If an action A throws an intra-thread out-of-band exception, then all actions that happen-before a synchronization action that happens-before A are visible to the exception handler. Conversely, if A happens-before another synchronization action B, then no action C such that B happens-before C is visible to the exception handler.
For this purpose, there are implicit synchronization actions with both acquire and release semantics (effectively memory fences) at the beginning and end of each thread execution.
[ I'm not sure whether the preceding paragraph really buys us anything. ]