ISO/IEC JTC1 SC22 WG21 N2260 = 07-0120 - 2007-05-06
Paul E. McKenney, paulmck@linux.vnet.ibm.com
This document presents an interface and minimal implementation for preservation of data dependency ordering to expedite access to dynamic linked data structures that are read frequently and seldom modified.
This proposal is in two parts:
This proposal does not affect library functions. Changes to library functions (for example, annotating the Vector templates) were considered, but rejected because current uses of data dependency ordering are restricted to basic operations such as indirection, field selection, array access, and casting. Longer term experience might indicate that a future proposal affecting library classes is warranted, however, there is insufficient motivation for such a proposal at this time.
This proposal is expected to have minimal affect to strongly ordered machines (e.g., x86) and on weakly ordered machines that do not support data dependency ordering (e.g., Alpha). The major burden of this proposal would fall on weakly ordered machines that order data-dependent operations, such as ARM, Itanium, and PowerPC. Even for these architectures, a fully conforming compiler could use the same approach as weakly ordered machines that do not support data dependency ordering, albeit at a performance penalty.
This proposal enforces only data dependencies, not control dependencies. If experience indicates that control dependencies also need to be enforced, a separate proposal will be put forward for them.
This proposal is based on N2153 by Silvera, Wong, McKenney, and Blainey, on N2176 by Hans Boehm, on N2195, by Peter Dimov, discussions on the cpp-threads list, and on discussions in the concurrency workgroup at the 2007 Oxford meeting.
In such cases, use of data dependency ordering has resulted in order-of-magnitude speedups and similar improvements in scalability.
A simplified example use of data dependency ordering found within the Linux kernel looks something like the following:
struct foo { int a; struct foo *next; }; struct foo *head = NULL; void insert(int a) { struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL); /* cannot fail */ spin_lock(&mylock); p->a = 1; p->next = head->next; smp_wmb(); /* Can be thought of as a store-release fence. */ head->next = p; spin_unlock(&mylock); } int getfirstval(void) { int retval; q = rcu_dereference(head); /* see discussion below. */ assert(q != NULL); retval = q->a; return retval; }
More elaborate examples are described in a presentation at the Oxford 2007 meeting describing use cases from the Linux kernel beginning on slide 37, including traversal of multiple levels of pointers, indexing arrays, and casts. The effect of the above code is to return the value at the head of the list with little more (or even no more) overhead than would be required if the list were immutable, but while still allowing updates. The rcu_dereference() API used in getfirstval() can be implemented in different ways, optimized for different classes of machines:
These machines are not well-supported by any of the prior proposals (N2153 presents the need, but not a concrete proposal to satisfy this need), and is the subject of the remainder of this document.
N2145 would require that these machines implement rcu_dereference() using either an acquire fence or a load-acquire. In both cases, this prohibits useful classes of compiler optimizations that involve code motion that does not break dependencies on the load from head. Worse yet, this requires emitting a heavyweight memory barrier for the second class of machines, which can result in unacceptable performance degradation.
In N2195, Peter Dimov proposes an atomic_load_address() template function that protects a single level of indirection. Although this suffices for the very simple example above, it does not handle other examples given in a presentation at the Oxford 2007 meeting describing use cases from the Linux kernel (beginning on slide 37). In particular, N2195, does not support data dependencies that traverse multiple levels of indirection nor that traverse array accesses.
This proposal requires the programmer to explicitly mark the heads of data dependency chains, so that the head of a data dependency-chain is an explicitly marked load of a pointer or an integer from a shared variable. The value loaded is within the data-dependency chain. Any value produced by a computation that takes as input a value within the data-dependency chain is itself within the data-dependency chain, but only if the computation does not cross an unannotated function-call argument or function-return boundary.
Given any subsequent load, store, or read-modify-write operation by that same thread whose address is taken from the data-dependency chain, that operation is said to have a data dependency on the head of the data-dependency chain. In the case of load and read-modify-write operations, the value returned by the operation is within the data-dependency chain. In the case of store and read-modify-write operations, the value returned by subsequent access to the location stored by this same thread is also within the data-dependency chain, but only if there no intervening unannotated function-call arguments or function-return boundaries have been encountered in the meantime.
The compiler is required to build data-dependency chains before doing any optimizations. An alternative proposal in N2195, introduces the notion of dynamic dependencies. Use of dynamic dependencies would permit the data-dependency chains to be scanned after performing those optimizations that do not break dynamic data-dependency chains.
A dependency chain is thus ended by the death of the register or variable containing a value within the data-dependency chain, or when the value flows through an unannotated function argument or is passed back as an unannotated function return value.
Compilers can avoid tracing dependency chains by emitting a load-acquire for the head of the dependency chain. As noted earlier, this can be a reasonable solution for strongly ordered machines in which a load-acquire operation emits no code, but merely suppresses code-motion operations that would reorder subsequent code before the head of the dependency chain. It is also appropriate for weakly ordered machines that do not order data dependencies. Compilers can also avoid tracing dependency chains by avoiding those optimizations that break these chains.
The pointer or integer at the head of the dependency chain must be such that loads and stores to it are atomic. Some implementations may provide such atomicity given proper alignment. Other implementations may require that the pointer or integer at the head of the dependency chain be declared to be atomic as described in N2145. This document assumes that the load at the head of the dependency chain is an atomic as described in N2145.
As noted earlier, this proposal is in two parts:
These are addressed in the following sections.
The head of each ordered data dependency chain must be marked by the programmer. The API suggested in the presentation at the Oxford 2007 meeting is as follows:
dependence_propagate T *load_dependent_acquire(T **p); void store_dependent_release(T **p, T *newval);
These are described below, and the dependence_propagate keyword
is described in the following section describing propagating dependencies
across function boundaries.
Again, this document assumes that T
is of atomic type
as defined by
N2145.
dependence_propagate T *load_dependent_acquire(T **p);
Returns: *p
.
Constraint: Acquire semantics only with respect to subsequent dependent operations. Note that this is similar to the atomic_load_address() operation in N2195. Note also that this function might well be implemented in conjunction with an attribute, so that compilers for architectures for which dependency ordering is not relevant can simply ignore the attribute.
void store_dependent_release(T **p, T *newval);
Effects: *p = newval
.
Constraint: Release only with respect to earlier accesses dependent
on newval
.
Many implementations might choose to implement this as a
store-release operation.
As noted in the previous section, a dependency chain terminates when a value is passed into or returned from an unannotated function. If a given function's return value is annotated, then dependency chains survive being returned from that function. If a particular argument of a given function is annotated, then dependency chains survive being passed in via that argument. If the function has a prototype, then the annotation must be applied to the prototype as well as to the function definition itself.
This annotation is expected to be implemented as an attribute value for the attribute facility being proposed for standardization, however, for expository purposes, it is abbreviated as "dependence_propagate".
For example, the following propagates dependencies through argument
y
to the return value:
dependence_propagate int *f(dependence_propagate int *y) { return y; }
The following example propagates dependency chains in, but not out:
int f(dependence_propagate int *y) { return *y; }
The following propagates dependency chains out, but not in:
dependence_propagate struct foo *f(int i) { return &foo_head[i]; }
Finally, the following does not propagate dependency chains at all:
template<T> T *kill_dependency_chain(T *y) { return y; }
One possible expansion of dependence_propagate might be:
[[dependence_propagate]]
This is to be the subject of N2236, on which Michael Wong is working.
Minimal implementations of load_dependent_acquire()
and
store_dependent_release()
are as follows:
dependency_propagate T *load_dependent_acquire(T **p) { return atomic_load_acquire(p); } void store_dependent_release(T **p, T *newval) { atomic_store_release(p, newval); }
The dependency_propagate
annotations may be completely
ignored in this minimal implementation.
This implementation of load_dependent_acquire()
is likely "good and sufficient" for strongly ordered machines
(for example, IBM mainframe, MIPS, and SPARC TSO) and for weakly
ordered machines that do not enforce data dependency ordering
(for example, Alpha).
A very similar implementation has been in use in the Linux kernel
for Alpha for several years.
Weakly ordered machines that enforce data dependency ordering
(ARM, Itanium, and PowerPC) will experence memory-fence overhead
that will be intolerable in many cases.
In N2176, Hans Boehm lists a number of example optimizations that can break dependency chains, which are discussed in the following sections.
N2176 example code:
r1 = x.load_relaxed(); r2 = *r1;
Recoding to this proposal's API:
r1 = load_dependent_acquire(&x); r2 = *r1;
Assuming that x
is an atomic, the
load_dependent_acquire
will form the head of a dependency
chain.
Because there are no function calls, the dependency chain extends to the
indirection through r1, so the dependency is ordered.
N2176 example code:
r1 = x.load_relaxed(); r3 = &a + r1 - r1; r2 = *r3;
This could legitimately be optimized to the following, breaking the dependency chain:
r1 = x.load_relaxed(); r3 = &a; r2 = *r3;
However, recoding to this proposal's API:
r1 = load_dependent_acquire(&x); r3 = &a + r1 - r1; r2 = *r3;
Again assuming that x
is an atomic, the
load_dependent_acquire
will form the head of a dependency
chain.
Because there are no function calls, the dependency chain extends to the
indirection through r1, so the dependency is ordered.
Because the dependency chains must be traced prior to optimization,
if the optimization is performed, a countervailing memory fence
or artificial data dependency must be inserted.
N2176 example code:
r1 = x.load_relaxed(); if (r1 == 0) r2 = *r1; else r2 = *(r1 + 1);
Recoding to this proposal's API:
r1 = load_dependent_acquire(&x); if (r1 == 0) r2 = *r1; else r2 = *(r1 + 1);
Assuming that x
is an atomic, the
load_dependent_acquire
will form the head of a dependency
chain.
Because there are no function calls, the dependency chain extends to the
indirection through r1, so the dependency is ordered.
N2176
example code, where x
is known to be either 0 or 1:
if (load_dependent_acquire(&x)) ... else ... y = 42 * x / 13;
This might be optimized to the following:
if (load_dependent_acquire(&x)) { ... y = 3; } else { ... y = 0; }
Assuming that x
is an atomic, the
load_dependent_acquire
will form the head of a dependency
chain.
Because there are no function calls, the dependency chain extends to the
assignment to y, so the dependency is ordered.
If the underlying machine preserves control-dependency ordering
for writes, this optimization is perfectly legal.
If the underlying machine does not preserve control-dependency
ordering, then either this optimization must be avoided,
a memory fence must be emitted after the load of x
,
or an artificial data dependency must be manufactured.
An example artificial data dependency might be as follows:
if (r1 = load_dependent_acquire(&x)) { ... y = 3; } else { ... y = 0; } y = y + r1 - r1;
The compiler would need to decide whether the add and subtract was better than the multiply and divide.
N2176 example code:
r1 = x.load_relaxed(); if (r1) r2 = y.a; else r2 = y.a;
This might be optimized to the following in order to break dependency chains:
r1 = x.load_relaxed(); r2 = y.a;
This is a control dependency, so falls outside the scope of this proposal.
N2176 example code:
r1 = load_dependent_acquire(&x); if (r1) f(&y); else g(&y)
Assuming that x
is an atomic, the
load_dependent_acquire
will form the head of a dependency
chain.
The question is then whether the prototypes and definitions of
functions f
and g
have their arguments
annotated with dependency_propagate
. If they
are so annotated, then the dependency chains propagate into
f
and g
.
N2176 example code:
r2 = load_dependent_acquire(&x); r3 = r2->a;
Without the load_dependent_acquire()
, the following
data-dependency-breaking optimization would be legal:
r2 = load_dependent_acquire(&x); r3 = r1->a; if (r1 != r2) r3 = r2->a;
However, assuming that x
is an atomic, the
load_dependent_acquire
will form the head of a dependency
chain.
Because there are no function calls, the dependency chain extends to the
indirection through r2, so the dependency is ordered and the optimization
prohibited, at least in absence of a compensating fence or artificially
generated data dependency.
N2176 example code:
r1 = load_dependent_acquire(&x); r2 = a[r1->index % a_size];
If the variable a_size
is known to the compiler to have
the value zero, then there might be a temptation to optimize as follows:
r1 = load_dependent_acquire(&x); r2 = a[0];
However, again assuming that x
is an atomic, the
load_dependent_acquire
will form the head of a dependency
chain.
Because there are no function calls, the dependency chain extends to the
indirection through r1, so the dependency is ordered.
Therefore, this optimization is prohibited unless accompanied by
a compensating memory barrier or artificial data dependency.