C++ Data-Dependency Ordering: Atomics and Memory Model

ISO/IEC JTC1 SC22 WG21 N2556 = 08-0066 - 2008-02-29

Paul E. McKenney, paulmck@linux.vnet.ibm.com
Hans-J. Boehm, Hans.Boehm@hp.com, boehm@acm.org
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org


Introduction
    Problem
    Prior Work
    Alternatives Considered
Solution
    Dependency Root
    Dependency Tree
    Informal Specification
Examples
    Indirection
    Code Removal
    Control-Senstive Indirection
    Constant Propogation
    Control Elimination
    Control Dependence
    Conditional Subexpression Elimination
    Constant Results
    Selective Dependency Ordering
Implementation
    Promote to Acquire
    Avoid Some Optimizations
    Track Optimizations
    Truncate Data-Dependency Trees
    Annotate Functions
Proposed Wording
    1.10 Multi-threaded executions and data races [intro.multithread]
    29 Atomic operations library [atomics]
    29.1 Order and Consistency [atomics.order]
    29.4.4 (tentative) Operations [atomics.types.operations (tentative)]

Introduction

The efficiency of data structures that are read frequently and written rarely can substantially affect the scalability of some applications. Based on experience in making the Linux operating system scalable, we propose addenda to the memory model (N2429) and atomics library (N2427) for inter-thread data-dependency ordering.

This proposal admits a trivial implementation, limiting significant implementation investment to those compilers and platforms where that investment will be recovered.

Problem

There are two significant use cases where the current working draft (N2461) does not support scalability near that possible on some existing hardware.

read access to rarely written concurrent data structures
Rarely written concurrent data structures are quite common, both in operating-system kernels and in server-style applications. Examples include data structures representing outside state (such as routing tables), software configuration (modules currently loaded), hardware configuration (storage device currently in use), and security policies (access control permissions, firewall rules). Read-to-write ratios well in excess of a billion to one are quite common.
publish-subscribe semantics for pointer-mediated publication
Much communication between threads is pointer-mediated, in which the producer publishes a pointer through which the consumer can access information. Access to that data is possible without full acquire semantics.

In such cases, use of inter-thread data-dependency ordering has resulted in order-of-magnitude speedups and similar improvements in scalability on machines that support inter-thread data-dependency ordering. Such speedups are possible because such machines can avoid the expensive lock acquisitions, atomic instructions, or memory fences that are otherwise required.

A simplified example use of inter-thread data-dependency ordering, found within the Linux kernel, looks something like the following:


struct foo {
    int a;
    struct foo *next;
};
struct foo *head;

void insert(int a) {
   struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL); /* cannot fail */
   spin_lock(&mylock);
   p->a = 1;
   p->next = head;
   smp_wmb();  /* Can be thought of as a store-release fence. */
   head = p;
   spin_unlock(&mylock);
}

int getfirstval(void) { /* requires head not NULL */
   struct foo *q = rcu_dereference(head);  /* see discussion below. */
   int retval = q->a;
   return retval;
}

The effect of getfirstval is to return the value at the head of the list with little more (or even no more) overhead than would be required if the list were immutable, but while still allowing updates. The rcu_dereference() API used in getfirstval() can be fully implemented in different ways, optimized for different classes of machines:

strong memory ordering (e.g., TSO)
rcu_dereference() simply prevents the compiler from performing optimizations that would order operations with data dependencies on q before the load from head. In this case, the code relies on the strong ordering to prevent the assignment to retval from seeing the pre-initialized version of the a field because the store to a must precede the store to head->next.
weak memory ordering with enforced data-dependency ordering
rcu_dereference() again prevents the compiler from performing optimizations that would order operations with data dependencies on q before the load from head. However, in this case, the code relies on the the machine's enforcement of data-dependency ordering to prevent the assignment to retval from seeing the pre-initialized version of the a field, because q->a depends on q.
weak memory ordering without data-dependency ordering
rcu_dereference() is promoted to a load-acquire operation. Because the acquire prevents all subsequent memory references from being reordered with the load from head, it must prevent any subsequent operations depending on h from being reordered with the load from head.

The current working draft (N2461) would require that these machines implement rcu_dereference() using either an acquire fence or a load-acquire. In both cases, this prohibits useful classes of compiler optimizations that involve code motion that does not break dependencies on the load from head. Worse yet, this requires emitting expensive memory fences for the second class of machines, which can result in unacceptable performance degradation.

More elaborate examples are described in a presentation at the Oxford 2007 meeting, describing use cases from the Linux kernel. These uses cases begin on slide 37 and include traversal of multiple levels of pointers, indexing arrays, and casts.

Prior Work

N2171 and N2176 are the basis for the current memory model. These proposals support a wide range of memory-ordering use cases, but do not support dependency ordering.

N2153 by Silvera, Wong, McKenney, and Blainey was the first proposal to explicitly address weakly ordered architectures and the issues surrounding dependency ordering. It was succeded by N2237. These papers also presented a number of use cases motivating non-SC memory ordering, including dependency ordering.

N2195 by Peter Dimov proposes an atomic_load_address() template function that protects a single level of indirection. Although this suffices for the very simple example above, it does not handle other examples given in a presentation at the Oxford 2007 meeting describing use cases from the Linux kernel (beginning on slide 37). In particular, N2195, does not support data dependencies that traverse multiple levels of indirection nor that traverse array accesses.

An alternative proposal in N2195, introduces the notion of dynamic dependencies. Use of dynamic dependencies would permit the data-dependency trees to be scanned after performing those optimizations that do not break dynamic data-dependency trees. However, this proposal was rejected due to software-engineering concerns, which loom especially large in cases where the compiler is able to perform optimizations that the programmer cannot anticipate. For example, the programmer might be forgiven for assuming that an argument to a given function was variable, but a compiler doing inter-procedural analysis might discover that it was in fact constant, or, worse yet, zero. The compiler is therefore required to propagate dependency trees regardless of optimization.

N2492, by Paul E. McKenney, Hans-J. Boehm, and Lawrence Crowl, and N2493, and by Paul E. McKenney and Lawrence Crowl, present an approach combining elaborations to the memory model, atomics API, and annotations. This proposal adapts that work in light of discussions at the 2008 Bellevue meeting.

Alternatives Considered

Although control dependencies are extremely intuitive, there are comparatively few known control-dependency use cases, and ARM CPUs only partially support control dependencies. Furthermore, some of the more troublesome optimization issues with switch statements involve control rather than data dependencies. Therefore, there is no support for control dependencies. If experience indicates that control dependencies also need to be enforced, a separate proposal will be put forward for them.

Prohibiting dependency-breaking optimizations would remove the need for annotations. This faced severe resistance, as a number of people felt that this would prohibit valuable optimizations. Therefore, this proposal requires annotations for function arguments and return values through which data dependencies are required to flow. As inter-compilation-unit analysis becomes more common, it is hoped that tools will appear that check annotations or perhaps even produce them automatically. However, individual implementations are free to avoid the dependency issue entirely by simply refraining from breaking data dependencies, or by emitting compensating memory fences when breaking data dependencies. (Full disclosure: this was in fact the original proposal.)

Simply relying on acquire fences would remove the need for dependency ordering. Although this is a reasonable strategy for many machines, it is inappropriate for weakly ordered machines that support data-dependency ordering.

Solution

We propose explicit program support for inter-thread data-dependency ordering. Programmers will explicitly mark the root of tree of data-dependent operations, and implementations will respect that ordering.

Dependency Root

To mark the root of a inter-thread data-dependency tree, programmers will use new variant of the atomic load defined in N2427. Specifically, this proposal augments N2427 by adding a memory_order_consume option that may be supplied to operations for which data-dependency semantics are permitted. The memory_order enumeration in N2427 would then read as follows, keeping the enumeration in rough order of increasing memory-ordering strength:


typedef enum memory_order {
    memory_order_relaxed, memory_order_consume, memory_order_acquire,
    memory_order_release, memory_order_acq_rel, memory_order_seq_cst
} memory_order;

Dependency Tree

Given the load value as the root of a data-dependency tree, the tree is loosely defined as any operation or function (within the same thread) that has a data-dependent argument within the tree or reads a variable stored by a data-dependent assignment within the tree.

Note that it is possible for a given value to be part of multiple dependency trees. One way that this might happen would be to add a value in one dependency tree to another value in a different dependency tree. The sum would then be in both dependency trees.

The compiler must preserve the dependency tree through all optimizations. In particular, if the compiler is able to optimize a member of a dependency tree to a constant, then the compiler must either produce code that preserves the dependency tree or emit a memory barrier appropriate to the target architecture.

A data-dependency tree ends by the death of values within the tree. Since trees extend into called functions and out through return values, these trees may extend until the end of program execution. The section on implementation describes strategies for dealing this unbounded extent in the normal compilation process.

When normal compilation of an unbounded extent proves too inefficient, the programmer may explicitly prune a data-dependency tree by passing a value through the identity function std::kill_dependency. The result is, by definition, not inter-thread data-dependent on the argument, even though the values are identical.

Informal Specification

Informally, we define inter-thread data-dependency ordering in terms of

a 'consume' operation
that is a weaker form of the 'acquire' operation,
a 'carries dependency to' relationship
which is a strict subset of the 'sequenced before' relationship, a subset describing how data dependencies propagate,
a 'dependency-ordered before' relationship
that captures the operations depending on the 'consume' operation, and
a 'inter-thread happens before' relationship
that is a term of the general 'happens before' relationship.

The full details are within the formal wording.

Examples

In N2176, Hans Boehm lists a number of example optimizations that can break dependency trees, which are discussed in the following subsections.

Indirection

N2176 example code:


r1 = x.load(memory_order_relaxed);
r2 = *r1;

Recoding to this proposal's API:


r1 = x.load(memory_order_consume);
r2 = *r1;

Assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r1, so the dependency is ordered.

Code Removal

N2176 example code:


r1 = x.load(memory_order_relaxed);
r3 = &a + r1 - r1;
r2 = *r3;

This could legitimately be optimized to the following, breaking the dependency tree:


r1 = x.load(memory_order_relaxed);
r3 = &a;
r2 = *r3;

However, recoding to this proposal's API:


r1 = x.load(memory_order_consume);
r3 = &a + r1 - r1;
r2 = *r3;

Again assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r1, so the dependency is ordered. Because the dependency trees must be traced prior to optimization, if the optimization is performed, a countervailing memory fence or artificial data dependency must be inserted.

Control-Senstive Indirection

N2176 example code, recoding to this proposal's API:


r1 = x.load(memory_order_consume);
if (r1 == 0)
        r2 = *r1;
else
	r2 = *(r1 + 1);

Assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r1, so the dependency is ordered.

Constant Propogation

N2176 example code, as modified during email discussions, where x is known to be either 0 or 1:


if (x.load(memory_order_consume))
	...
else
	...
y = 42 * x / 13;

This might be optimized to the following:


if (x.load(memory_order_consume)) {
	...
	y = 3;
} else {
	...
	y = 0;
}

assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the assignment to y, so the dependency is ordered. If the underlying machine preserves control-dependency ordering for writes, this optimization is perfectly legal. If the underlying machine does not preserve control-dependency ordering, then either this optimization must be avoided, a memory fence must be emitted after the load of x, or an artificial data dependency must be manufactured. An example artificial data dependency might be as follows:


if (r1 = x.load(memory_order_consume)) {
	...
	y = 3;
} else {
	...
	y = 0;
}
y = y + r1 - r1;

The compiler would need to decide whether the add and subtract was better than the multiply and divide.

Control Elimination

N2176 example code:


r1 = x.load(memory_order_consume);
if (r1)
        r2 = y.a;
else
	r2 = y.a;

This might be optimized to the following in order to break dependency trees:


r1 = x.load(memory_order_relaxed);
r2 = y.a;

This is a control dependency, so falls outside the scope of this proposal.

Control Dependence

N2176 example code:


r1 = x.load(memory_order_consume);
if (r1)
	f(&y);
else
	g(&y);

Assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. However, there is no data dependency between the load and either of the function calls. There is instead a control dependency, which does not force ordering in this proposal.

If this example were to be modified so that the variable r1 were passed to f() and g() (rather than y as shown above), then the functions would have a data dependency on the load.

Conditional Subexpression Elimination

N2176 example code:


r2 = x.load(memory_order_consume);
r3 = r2->a;

There might be at temptation to optimize the code as follows:


r2 = x.load(memory_order_consume);
r3 = r1->a;
if (r1 != r2) r3 = r2->a;

However, assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r2, so the dependency is ordered and the optimization prohibited, at least in absence of a compensating fence or artificially generated data dependency.

Constant Results

N2176 example code:


r1 = x.load(memory_order_consume);
r2 = a[r1->index % a_size];

If the variable a_size is known to the compiler to have the value one, then there might be a temptation to optimize as follows:


r1 = x.load(memory_order_consume);
r2 = a[0];

However, again assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r1, so the dependency is ordered. Therefore, this optimization is prohibited unless accompanied by a compensating memory barrier or artificial data dependency.

Selective Dependency Ordering

In some cases, dependency ordering is important only for some fields of a structure. For example:


r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[r2]);

Indexing a[] with an uninitialized field could be fatal, but once the corresponding array element has been fetched, we might not care about subsequent dependencies. The std::kill_dependency primitive enables the programmer to tell the compiler that specific dependencies may be broken, for example, as follows:


r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[std::kill_dependency(r2)]);

This allows the compiler to reorder the call to do_something_with, for example, by performing speculative optimizations that predict the value of a[r2].

Implementation

There are several implementation strategies. The first strategy is acceptable on all machines and compilers. The subsequent strategies are appropriate to subsets thereof.

This proposal is expected to have minimal effect on strongly ordered machines (e.g., x86) and on weakly ordered machines that do not support data-dependency ordering (e.g., Alpha). The major burden of this proposal would fall on weakly ordered machines and their compilers that reorder data-dependent operations, such as ARM and PowerPC (and possibly also Itanium). Even for these architectures, a fully conforming compiler could use the same approach as weakly ordered machines that do not support data-dependency ordering, albeit at a performance penalty.

Promote to Acquire

Simply promoting all memory_order_consume operations to memory_order_acquire will meet the requirements of this proposal.

For weakly ordered machines without data-dependency ordering, this implementation is also necessary. For other machines, it also serves as trivial first implementation.

Avoid Some Optimizations

Compilers can implement memory_order_consume loads as regular loads, so long as the compiler attempts no optimizations that break data dependencies. This strategy will be particularly useful for non-optimizing compilers.

This strategy does not apply to weakly ordered machines without data-dependency ordering, but only to strongly ordered machines or weakly ordered machines with data-dependency ordering.

Track Optimizations

For implementations on strongly ordered machines or weakly ordered machines with data-dependency ordering, compilers can implement memory_order_consume loads as regular loads, so long as the compiler tracks operations within a data-dependency tree and avoids optimizations that break data dependencies of those operations. Note, however, the caveat in the next subsection.

In terms of the implementation burden on compilers, some of the compiler work to implement this strategy is also required to respect the existing memory_order_acquire loads.

This strategy applies primarily to weakly ordered machines with data-dependency ordering, secondarily to strongly ordered machines, and does not apply to weakly ordered machines without data-dependency ordering.

Truncate Data-Dependency Trees

The above strategy implies that the compiler is avoiding optimizations in all functions dynamically called on a data-dependency tree. This implication is unacceptable for compilers that see only a portion of those functions.

However, the compiler does not need to see all functions; it can simply emit an acquire fence on the tree root (which is atomic) before a tree extends into a function call or out of a function return. Given such a convention, the compiler can assume that there are no optimization restrictions at the start of a function. This strategy enables fully-optimized per-function compilation, with run-time performance no worse than, and often much better than, the first strategy.

This strategy becomes more effective when performed after inlining or when considered in inter-procedural optimization.

Annotate Functions

Many uses of data-dependency operations will be in the implementation of data structures. If their (presumably non-inline) access functions must truncate the data-dependency tree on return, much of the potential performance of data-dependency ordering may be lost.

To address this performance opportunity, we propose to annotate function parameters and results to indicate that the compiler should assume that code on the other side of the function will handle depencencies correctly.

As these annotations are not essential to data-dependency ordering, they are covered in a separate proposal, N2493.

Proposed Wording

This section proposes wording changes to working draft N2461.

1.10 Multi-threaded executions and data races [intro.multithread]

Edit paragraph 4 as follows:

The library defines a number of atomic operations (clause 29) and operations on locks (clause 30) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation is either a consume operation, an acquire operation, or a release operation, or both an acquire and release operation, on one or more memory locations; the semantics of these are described below. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics, also described below. [Note: For example, a call that acquires a lock will perform an acquire operation on the locations comprising the lock. Correspondingly, a call that releases the same lock will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. We do not include "relaxed" atomic operations as synchronization operations although, like synchronization operations, they cannot contribute to data races. —end note]

Paragraph 6 (defining release sequence) is unchanged:

A release sequence on an atomic object M is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first operation is a release, and every subsequent operation

Paragraph 7 (defining synchronizes with) is unchanged:

An evaluation A that performs a release operation on an object M synchronizes with an evaluation B that performs an acquire operation on M and reads a value written by any side effect in the release sequence headed by A. [Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note] [Note: The specifications of the synchronization operations define when one reads the value written by another. For atomic variables, the definition is clear. All operations on a given lock occur in a single total order. Each lock acquisition "reads the value written" by the last lock release. —end note]

After paragraph 7, add the following paragraphs.

An evaluation A carries a dependency to an evaluation B if

[Note: 'Carries a dependency to' is a subset of 'is sequenced before', and is similarly strictly intra-thread. —end note]

An evaluation A is dependency-ordered before an evaluation B if,

[Note: The relation 'is dependency-ordered before' is analogous to 'synchronizes with', but uses release/consume in place of release/acquire. —end note]

An evaluation A inter-thread happens before an evaluation B if,

[Note: The 'inter-thread happens before' relation describes arbitrary concatenations of 'sequenced before', 'synchronizes with' and 'dependency-ordered before' relationships, with two exceptions. The first exception is that a concatenation is not permitted to end with 'dependency-ordered before' followed by 'sequenced before'. The reason for this limitation is that a consume operation participating in a 'dependency-ordered before' relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency. The reason that this limitation applies only to the end of such a concatenation is that any subsequent release operation will provide the required ordering for a prior consume operation. The second exception is that a concatenation is not permitted to consist entirely of 'sequenced before'. The reasons for this limitation are (1) to permit 'inter-thread happens before' to be transitively closed and (2) the 'happens before' relation, defined below, provides for relationships consisting entirely of 'sequenced before'. —end note]

Edit paragraph 8 as follows:

An evaluation A happens before an evaluation B if:

29 Atomic operations library [atomics]

Edit the atomic library synopsis, as follows:

// 29.1, order and consistency


enum memory_order;
template< typename T >
T kill_dependency( T y );

29.1 Order and Consistency [atomics.order]

Edit the enum memory_order synopsis as follows:


typedef enum memory_order {
  memory_order_relaxed, memory_order_consume, memory_order_acquire,
  memory_order_release, memory_order_acq_rel, memory_order_seq_cst
} memory_order;

Edit the memory_order effects as follows:

Element Meaning
memory_order_relaxed the operation does not order memory
memory_order_release the operation performs a release operation on the affected memory location, thus making regular memory writes visible to other threads through the atomic variable to which it is applied
memory_order_acquire the operation performs an acquire operation on the affected memory location, thus making regular memory writes in other threads released through the atomic variable to which is it is applied visible to the current thread
memory_order_consume the operation performs a consume operation on the affected memory location, thus making regular memory writes in other threads released through the atomic variable to which it is applied visible to the regular memory reads that are dependencies of this consume operation.
memory_order_acq_rel the operation has both acquire and release semantics
memory_order_seq_cst the operation has both acquire and release semanticsw, and, in addition, has sequentially-consistent operation ordering

After paragraph 6, append as follows:

template< typename T >
T kill_dependency( T y );

Effects: The argument does not carry a dependency to the return value ([intro.multithread]).

Returns: y

29.3 (tentative) Flag Type and Operations [atomics.flag (tentative)]

Edit paragraph 5 (referring to test and set operations) as follows:

Effects: Atomically sets the value pointed to by object or by this to true. Memory is affected according to the value of order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value. These operations are atomic read-modify-write operations (1.10 [intro.multithread]).

Edit paragraph 10 (referring to fence operations) as follows:

Effects: Memory is affected according to the value of order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value. These operations are atomic read-modify-write operations (1.10 [intro.multithread]).

29.4.4 (tentative) Operations [atomics.types.operations (tentative)]

Edit paragraph 5 (referring to store operations) as follows:

Requires: The order argument shall not be memory_order_consume, memory_order_acquire, nor memory_order_acq_rel.

Edit paragraph 12 (referring to swap operations) as follows:

Effects: Atomically replaces the value pointed to by object or by this with desired. Memory is affected according to the value of order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value. These operations are atomic read-modify-write operations (1.10 [intro.multithread]).

Edit paragraph 15 (referring to compare and swap operations) as follows:

Effects: Atomically, compares the value pointed to by object or by this for equality with that in expected, and if true, replaces the value pointed to by object or by this with desired, and if false, updates the value in expected with the value pointed to by object or by this. Further, if the comparison is true, memory is affected according to the value of success, and if the comparison is false, memory is affected according to the value of failure. When only one memory_order argument is supplied, the value of success is order, and the value of failure is order except that a value of memory_order_acq_rel shall be replaced by the value memory_order_require and a value of memory_order_release shall be replaced by the value memory_order_relaxed. These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value. These operations are atomic read-modify-write operations (1.10 [intro.multithread]).

Edit paragraph 19 (referring to fence operations) as follows:

Requires: The order argument shall not be neither memory_order_relaxed nor memory_order_consume.

Edit paragraph 20 (referring to fence operations) as follows:

Effects: Memory is affected according to the value of order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value. These operations are atomic read-modify-write operations (1.10 [intro.multithread]).

Edit paragraph 22 (referring to fetch and op operations) as follows:

Effects: Atomically replaces the value pointed to by object or by this with the result of the computation applied to the value pointed to by object or by this and the given operand. Memory is affected according to the value of order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value. These operations are atomic read-modify-write operations (1.10 [intro.multithread]).