Doc. No.:  WG21/N2177, J16/07-0037
Date:      2007-03-09
Reply to:  Hans-J. Boehm
Phone:     +1-650-857-3406
Email:     Hans.Boehm@hp.com
Sequential Consistency for Atomics
Very informally, a sequentially consistent execution of a multi-threaded
program is one that behaves as if actions from all the threads were
simply interleaved into a single composite execution. For most
people, this is the natural view of threads before they learn
about the complexities of weak memory models.
There has been an ongoing discussion about the semantics of "high level"
C++ atomics. ("High-level" here refers to that part of the atomics
library that is meant to be understood by programmers other than the
most expert library implementors.)
Here we explain why a number of us believe that these "high level"
atomic operations should present sequentially consistent semantics, in
spite of substantial, though currently highly platform-specific,
performance arguments to the contrary.
This is a somewhat hurried attempt to outline the arguments; it is not
a thorough study of the trade-offs. In fact, we know that some pieces
of essential data are still missing. In particular, we have much too little
understanding of the inherent long-term cost of sequentially consistent
atomics, with huge disagreements among the experts. Sarita Adve and others
are continuing to pursue this.
Although I have attempted to outline the major issues on both sides,
this represents primarily my own biased view. I favor providing an easy-to-use
interface to sequentially consistent atomics. Others with large
amounts of experience in the area have arrived at different conclusions.
Simple rules for programmers
As it stands, the proposed C++ memory model is designed to allow us
to provide simple programming guidelines to the programmer as
follows:
- Define a memory location as in the proposed standard. Quoting
N2052:
"A memory location is either an object of scalar type, or a maximal
sequence of adjacent bit-fields all having non-zero width."
- Define a data race to occur if a memory location not holding
a lock, atomic, or other synchronization object is accessed
simultaneously by at least two threads, and one of the accesses
is a write. Ensure that your program cannot encounter a data race.
- If the above rules are satisfied, your program will execute
according to sequentially consistent semantics, i.e. as though the
actions and memory accesses of the different threads were interleaved, with the
accesses from each thread occurring in order, and each access seeing
the last prior update to the variable that occurs in this sequence.
For the typical programmer, who uses only locks, condition variables,
and "high-level" atomics, we would hope to make this the whole
story.
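To make these rules concrete, here is a minimal sketch of a race-free
program. It is written with thread and mutex interfaces along the lines of
those being proposed for the standard library; the exact names (std::thread,
std::mutex, std::lock_guard) are an assumption used purely for illustration.

    #include <mutex>
    #include <thread>

    int shared_count = 0;   // an ordinary (non-atomic) memory location
    std::mutex m;

    void worker() {
        for (int i = 0; i < 1000; ++i) {
            std::lock_guard<std::mutex> guard(m);   // both threads touch
            ++shared_count;                         // shared_count only while
        }                                           // holding the lock
    }

    int main() {
        std::thread t1(worker), t2(worker);
        t1.join();
        t2.join();
        // By the rules above this program is race-free, so it behaves
        // sequentially consistently: shared_count is exactly 2000 here.
        // Deleting the lock_guard line would create a data race (two
        // threads writing the same memory location simultaneously) and
        // hence undefined behavior.
        return 0;
    }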
This basic approach to a programming language level memory model
was previously explored by Sarita Adve in her thesis. See for example: S. Adve
and Mark Hill, "Weak Ordering - A New Definition", ISCA '90.
It is also currently the basis of the Java memory model.
For more advanced programmers, we do need to attach a couple
of disclaimers to this for C++:
- A trylock primitive can be abused to perform memory
operations in response to a lock acquisition rather than a lock
release. Since the lock acquisition is likely to have only acquire
semantics, this causes the above
definition of a data race to diverge from that in the standard.
There is no legitimate reason to write a program that behaves in this
manner. (For details, see Boehm, "Reordering Constraints for
Pthread-Style Locks", PPoPP '07.)
- "Low-level" atomics, especially but not only those with
the relaxed ordering constraints, clearly fall outside this
simple programming model, and require more sophisticated programmer
understanding of memory visibility. In particular, a program
using relaxed atomic variables for all shared memory
accesses has no races by our definition, but clearly does not
exhibit sequentially consistent semantics. (An alternative
would be to technically continue to satisfy the above property
by defining such programs to contain a data race.)
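For example, the following message-passing sketch has no data race by our
definition, but admits a non-sequentially-consistent outcome. The
memory_order_relaxed spelling is borrowed from the eventually-standardized
atomics interface purely as a stand-in for the relaxed ordering constraints
of N2145.

    #include <atomic>
    #include <thread>

    std::atomic<int> data(0), flag(0);
    int r1 = -1;

    void producer() {
        data.store(1, std::memory_order_relaxed);
        flag.store(1, std::memory_order_relaxed);
    }

    void consumer() {
        if (flag.load(std::memory_order_relaxed) == 1)
            r1 = data.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
        // Every shared access is atomic, so there is no data race by the
        // definition above. Yet the consumer may observe flag == 1 and
        // still read r1 == 0, an outcome no interleaving of the four
        // statements can produce; the program is therefore not
        // sequentially consistent.
        return 0;
    }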
There is no question or controversy that we want the preceding rules to hold
for programs that use only locks for synchronization. (Condition
variables are irrelevant here, and don't impact the property.)
The question here is really whether we want this property to hold
for programs using both locks and high-level atomics.
It turns out that the most difficult part of ensuring sequential
consistency for race free programs containing high-level atomics is
to ensure that the high-level atomics themselves behave sequentially
consistently, i.e. that a program whose only shared variables are
high-level atomics behaves sequentially consistently.
In order to make the case for such sequentially consistent atomics,
we argue that:
- Atomic variables are, or should be, used by a significant number of programmers.
- Under reasonable performance assumptions, sequentially consistent
atomics can often satisfy this need.
- It is very unclear that there are viable alternatives.
We address these in turn.
Use of atomics
There is a strong argument that the large majority of multi-threaded
code should use locks to prevent simultaneous non-read-only access to
shared variables. And this is fine most of the time. However,
in reality we do seem to see fairly frequent cases of potentially
simultaneous access to shared variables that call for the use of atomics.
Notable examples include:
- Double-checked-locking, and similar lazy-initialization idioms.
I believe Bill Pugh's group counted more than 90 uses of double-checked
locking in the Java standard libraries long after it became known
that such usage was incorrect. There are similar anecdotes about some
other vendors. (A sketch of this idiom using sequentially consistent
atomics appears after this list.)
- Simple counters, e.g. of calls to a particular function. (These are
often erroneously implemented without any kind of synchronization if
exact answers are not required. This is incorrect, both because it
leads to completely undefined behavior, and because it can in practice
lead to completely incorrect results. Atomic integers provide the
natural solution, which usually incurs substantially less cost than locks.
A sketch of such a counter also follows this list.)
- Schemes that rely on atomic pointer updates to make read-mostly
data structures much more efficient, such as Paul McKenney's RCU in the
Linux kernel. These become significantly simpler for user-level code
in the presence of garbage collection.
- Variables shared with signal or interrupt handlers or other processes.
Although the C++ standard is unlikely to fully address such applications,
they are important in practice, and we should move towards better support.
Locks are often not usable in such contexts.
- More sophisticated lock-free data structures. We expect that unlike
the others, these will be implemented by experts.
We expect that only the last of these can be strictly confined to library
writers.
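As an illustration of the first item above, here is a minimal sketch of
lazy initialization in the double-checked-locking style, written with
sequentially consistent atomics. The syntax follows the
eventually-standardized std::atomic interface purely for concreteness, and
the Widget type is invented for the example; the interface actually proposed
in N2145 differs in detail.

    #include <atomic>
    #include <mutex>

    struct Widget { /* expensive to construct */ };

    std::atomic<Widget*> instance(nullptr);   // sequentially consistent by default
    std::mutex init_mutex;

    Widget* get_instance() {
        Widget* p = instance.load();           // first (unlocked) check
        if (p == nullptr) {
            std::lock_guard<std::mutex> guard(init_mutex);
            p = instance.load();               // second check, under the lock
            if (p == nullptr) {
                p = new Widget();
                instance.store(p);             // publish the constructed object
            }
        }
        return p;
    }

With an ordinary (non-atomic) pointer this idiom contains a data race on
instance; with a sequentially consistent atomic pointer, a reader that sees
a non-null instance also sees the completed construction of the Widget.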
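Similarly, for the second item, an event counter can simply be kept as an
atomic integer. The fetch_add spelling below is again an assumption about
the eventual interface, used only for illustration.

    #include <atomic>

    std::atomic<long> calls(0);

    void traced_function() {
        calls.fetch_add(1);    // atomic read-modify-write; no lock needed
        // ... real work ...
    }

An unsynchronized long here would be a data race: simultaneous increments
could be lost or, worse, yield undefined behavior, while the atomic version
is correct and usually much cheaper than acquiring a lock.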
Performance of Sequentially Consistent Atomics
We can very roughly break down the overhead of providing sequential
consistency for atomics into that of satisfying three separate
constraints:
- Atomics must have acquire/release semantics. This
is implicitly true for SPARC TSO, PA-RISC, and MIPS, and generally
also believed
to be automatic on X86. It requires an explicit but relatively low cost
ordering constraint or fence on Itanium, PowerPC, and ARM. We believe
that this part of the overhead is not interesting, since atomics without
such constraints are extremely difficult to use, and are not correct
for use with double-checked-locking, for example. I believe nobody is
arguing that this guarantee be dropped for high-level atomics.
- We must enforce load-after-store ordering for high-level atomics.
This is needed for implementing Dekker's critical section algorithm and
similar algorithms. It is not needed in many cases, such as
double-checked-locking. (The extra cost can also often be optimized
out in that particular case.) Doug Lea's experience with the Java
concurrency library was that it is needed more often than expected.
(He does not agree with some of our other conclusions.)
The cost typically involves a heavy-weight fence following atomic stores,
which can usually be optimized away only if it is followed by another
atomic store.
(This applies to X86, ARM, SPARC, PowerPC, Alpha, Itanium, etc., though
not PA-RISC.) There appears to be at least a near-consensus that
high-level atomics should enforce this property. (A litmus test
illustrating this load-after-store constraint appears after the IRIW
discussion below.)
- We need to ensure that independent stores by different threads
are seen in a consistent order by different observing threads.
Thus if all variables in the following example are high-level
atomics and initially zero, then that example, customarily
referred to as "IRIW" (Independent Reads of Independent Writes),
must not result in r1 = r3 = 1 and r2 = r4 = 0, i.e. in Threads 3
and 4 seeing the updates to x and y in opposite orders:

    Thread 1   | Thread 2   | Thread 3     | Thread 4
    a: x = 1;  | b: y = 1;  | c: r1 = x;   | e: r3 = y;
               |            | d: r2 = y;   | f: r4 = x;
The cost of enforcing this guarantee is much less clear. Currently it
appears that the only architecture supporting C++ on which the guarantee
is not implicitly provided is PowerPC, but the question is not yet fully
settled for several other important architectures. On PowerPC, the cost
appears to be a heavyweight memory fence on every atomic load, which is
clearly very substantial. (See
http://x10.sourceforge.net/wiki/index.php/Memory_Model_Table_of_Contents
for the details.)
Enforcement of this ("IRIW") property is by far the
most controversial, since, at least in one case, it is the most expensive.
It is also the one property that is really motivated by the desire
for an understandable and teachable programming model, rather than
by specific practical applications.
It is possible to construct somewhat realistic code that relies on the
IRIW guarantee. But there is widespread agreement that such code is
difficult to find "in the wild".
Far more problematic is that I do not know of a way to teach students
about atomics without the IRIW guarantee, and that its absence invalidates
some fairly standard reasoning. If we postulate that there is some
global visibility order, then since f above sees the earlier value of x,
the read at f must precede the assignment at a in that order; similarly
d must precede b. Since c and e see the later values, a must precede c
and b must precede e. Together with the program order within Threads 3
and 4, this yields the cycle a, c, d, b, e, f, a, a contradiction. Yet all
of the steps in that reasoning chain seem rather natural.
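The load-after-store constraint from the second item above can be made
similarly concrete with the classic store-buffering litmus test, sketched
here with sequentially consistent atomics (again using the
eventually-standardized syntax purely as an assumption):

    #include <atomic>
    #include <thread>

    std::atomic<int> x(0), y(0);    // sequentially consistent by default
    int r1, r2;

    void thread1() {
        x.store(1);
        r1 = y.load();
    }

    void thread2() {
        y.store(1);
        r2 = x.load();
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join();
        t2.join();
        // Under sequential consistency r1 == 0 && r2 == 0 is impossible:
        // whichever store comes first in the interleaving is seen by the
        // other thread's load. Without load-after-store ordering, both
        // loads can be satisfied before either store becomes visible, and
        // both registers can be zero. Dekker's mutual exclusion algorithm
        // depends on excluding exactly this outcome.
        return 0;
    }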
On architectures on which the IRIW guarantee is cheaply enforceable,
the overall cost of sequential consistency for atomics appears to be
within reason. The added overhead for loads is typically on the order
of at most a dozen cycles, which is usually trivial compared to the cost of
the alternative of a lock acquisition. The heavier weight fence required
on a write for load-after-store ordering may be nearly as expensive
as a lock acquisition. But since it is typically used in contexts
in which loads greatly outnumber stores, and a store is often
associated with a subsequent coherence cache miss, we expect that this is
rarely a major factor. And indeed the java.util.concurrent implementations
appear to have been quite successful in spite of this kind of overhead.
(In cases in which this overhead is unacceptable, and expert programmers
are involved, we still provide low-level atomics that can avoid the
overhead. But those are not the subject of this paper.)
On the other hand, there is no question that the kind of fences currently
required on PowerPC reduce the utility of the facility. In most
cases, the costs will be comparable to lock-protecting the accesses.
Nonetheless there are some cases in which I believe it remains useful:
- If atomics are used in situations in which locks are not a viable
alternative (inter-process communication, signal handlers, or interrupts),
performance is likely to be a secondary issue.
- I suspect that many developers of more sophisticated lock-free
algorithms will initially want to code the algorithm using sequentially
consistent atomics, and then address tuning of memory ordering in
a second pass. In this case, the final product is likely to use
primarily low-level atomics. But the high-level sequentially consistent
atomics are a very useful intermediate step. I know that I have used
such a process, and most papers in the literature imply a similar
approach, typically by only presenting the sequential-consistency
version of the algorithm.
- It seems likely that the expensive fences can be automatically
optimized out in some common cases (simple counters?).
This needs more investigation.
- Even in performance sensitive cases, a solution based on sequentially
consistent atomics may outperform a lock-based solution, since it avoids
modifying any shared cache lines in response to read-only accesses.
There are ongoing discussions about whether the IRIW guarantee is
inherently difficult to provide, and whether this would impact
scalability to high processor counts. It is clear that an inherent
cost is involved. But it is unclear whether this cost can otherwise
be safely avoided, or whether essential properties such as causal
ordering for memory visibility (effectively transitivity of the
happens-before relation) would also be impacted by dropping such
a guarantee.
The current practical situation appears to be that we have some
architectures providing the guarantee cheaply and others not, with no
obvious correlation with available processor counts. In particular,
some of the largest machines, measured in processor counts,
provide a documented memory model that provides the IRIW guarantee
for atomics at no additional cost.
Alternative Models
Doug Lea and Vijay Saraswat have developed precise mathematical models
for some alternative consistency models for "high-level" atomics.
Details of Vijay's models can again be found at
http://x10.sourceforge.net/wiki/index.php/Memory_Model_Table_of_Contents
and possibly in paper N2226.
These models have the substantial advantage that they allow a cheaper
implementation, at least on PowerPC, since they don't support the IRIW
guarantee.
They are also mathematically quite elegant.
However, in my opinion, and without advance access to N2226,
these models currently lack
the kind of easily presentable and teachable description we attempted
to outline above. Furthermore, it seems difficult to motivate
one particular model. There have been months of discussion
about the details of these models. The acceptable outcomes of
one particular 5-line, 3-thread program were uncertain during
much of that time.
To me, it appears that there are
several possible similar models, with the choice between them largely driven
by existing hardware characteristics. This again makes it likely that the
model will be hard to motivate in an educational context.
Currently, the most comprehensible definition of such a model seems
to be as "sequential consistency without the IRIW guarantee". This
unfortunately is very imprecise, leaves the legality of several borderline
test cases unclear, and does not give the programmer
sufficient information for reasoning about the resulting atomics.
My Conclusions
There is ample empirical evidence that atomics with weaker or explicit
ordering constraints, or requiring explicit fences,
can be extremely difficult to use. For example,
even the experts designing such APIs generally got them at least
somewhat wrong, which is why we're designing one here.
Even the most basic of ordering issues, namely the constraints
enforced by lock acquisition and release, has generated much confusion
and some clearly incorrect code.
(Some details are provided in Boehm, "Reordering Constraints for
Pthread-Style Locks", PPoPP '07.)
Given that difficulty, it seems critical to include an
easier-to-use alternative, even if it does not meet all performance constraints.
To date, only sequentially consistent atomics really seem to accomplish this.
Nothing else seems sufficiently easily explainable to the intended
audience.
This does not argue that sequential consistency should be the only
model supported for C++ atomics. The current plan is to support
lower-level atomics with explicit ordering constraints (see
N2145)
and possibly
fences (see
N2153). But I am arguing that sequential
consistency should be the default model for C++ "high-level" atomics,
with provisions for tuning memory visibility where required.
Note that even in this model, it is possible to build efficient
lock-free data structures for machines that cannot implement sequentially
consistent atomics with the highest performance; but it does require
the difficult additional step of specifying explicit ordering constraints
for atomic operations.
If we can generate a standards-appropriate definition of a
strong memory model without the IRIW guarantee,
a possible compromise may be to add a parameter
to the atomics template to make it easier to remove the IRIW
guarantee in the large majority of cases in which it is not needed,
while keeping the far-easier-to-teach sequential consistency
alternative as the default.
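Purely to illustrate the shape such a compromise might take, here is a
hypothetical sketch. Neither the enumeration nor the extra template
parameter below appears in any current proposal; the names are invented
for this illustration only.

    // Hypothetical illustration: the enumeration, the extra template
    // parameter, and this skeletal class are not part of any proposal.
    enum ordering_model { seq_cst, seq_cst_minus_iriw };

    template<typename T, ordering_model M = seq_cst>
    class atomic {
    public:
        // With M == seq_cst, load() and store() would provide full
        // sequential consistency; with M == seq_cst_minus_iriw, an
        // implementation would be permitted to drop only the IRIW
        // guarantee (and hence the expensive fence on loads on machines
        // such as PowerPC).
        T load() const;
        void store(T value);
    };

    // The default remains the easy-to-teach variant:
    //     atomic<int> counter;
    // Experts could opt out where IRIW is known not to matter:
    //     atomic<int, seq_cst_minus_iriw> flag;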