| Doc. No.: | WG21/P1217R0 |
|---|---|
| Date: | 2018-10-07 |
| Reply-to: | Hans-J. Boehm |
| Email: | hboehm@google.com |
| Authors: | Hans-J. Boehm, with input from many others |
| Audience: | SG1 |
This is a status update attempting to summarize an external discussion of so-called out-of-thin-air results with `memory_order_relaxed`. It was known that allowing such results often makes it impossible to reason about code using `memory_order_relaxed`, and that our current wording prohibiting them is excessively vague. Many of us believed that this was a stopgap until we determined a better way to word this restriction without invalidating current implementations.
It has become much clearer that this cannot happen: current implementations in fact allow results that, unless we redefine the semantics of basic sequential constructs like if-statements, can only be understood as out-of-thin-air results.
There is a well-known and simple solution to this problem that amounts to prohibiting the compiler and hardware from reordering a relaxed load followed by a relaxed store. This remains controversial, mostly because it adds a small amount of overhead to `memory_order_relaxed` operations on architectures like ARM, and an unknown amount of overhead on at least some GPUs. It also affects the legality of compiler optimizations on `memory_order_relaxed` accesses.
This paper benefited greatly from discussions and debates with many others, including Sarita Adve, Will Deacon, Brian Demsky, Doug Lea, Daniel Lustig, Paul McKenney, and Matt Sinclair. I believe we all agree that there is a problem that should be addressed, but not on a detailed solution.
This topic was briefly discussed by SG1 at the Rapperswil meeting. This paper is in part an attempt to respond to the question, raised at that meeting, as to whether we can just live with the results allowed by current implementations.
The out-of-thin-air problem has been repeatedly discussed, e.g. in N3710 and in http://plv.mpi-sws.org/scfix/paper.pdf.
The fundamental problem is that, without an explicit prohibition of such results, the C++ memory model allows `memory_order_relaxed` operations to execute in a way that introduces causal cycles, making it appear that values appeared out of thin air, commonly abbreviated as OOTA. The canonical example is the following, where `x` and `y` are atomic integers initialized to zero:
**OOTA Example 1**

| Thread 1 | Thread 2 |
|---|---|
| `r1 = x.load(memory_order_relaxed);` | `r1 = y.load(memory_order_relaxed);` |
| `y.store(r1, memory_order_relaxed);` | `x.store(r1, memory_order_relaxed);` |
Without specific requirements to the contrary, the loads are allowed to see the effects of the racing stores. Thus an execution in which the stores write some arbitrary value, say 42, and the loads read those values, is valid. This can be interpreted as an execution in which both loads speculatively return 42, then perform the stores (e.g. to memory only visible to these two threads, where it could be rolled back if we guessed wrong), and then check that the guesses for the loads were correct, which they now are.
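For concreteness, here is a self-contained rendition of OOTA Example 1 (my reconstruction; each `r1` is a thread-local variable):

```
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};

void thread1() {
  int r1 = x.load(std::memory_order_relaxed);
  y.store(r1, std::memory_order_relaxed);
}

void thread2() {
  int r1 = y.load(std::memory_order_relaxed);
  x.store(r1, std::memory_order_relaxed);
}

int main() {
  std::thread t1(thread1), t2(thread2);
  t1.join();
  t2.join();
  // On every known implementation, x and y are 0 here; the point is
  // that the standard's words alone do not rule out 42.
}
```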
Thus, with OOTA execution allowed, the above code can set `x` and `y` to a value that is computed nowhere in the code, and can only be justified by a circular argument, or one involving explicit reasoning about speculation.
Similarly, consider the following program in a world with out-of-thin-air results allowed:
**OOTA Example 2**

| Thread 1 | Thread 2 |
|---|---|
| `if (x.load(memory_order_relaxed))` | `if (y.load(memory_order_relaxed))` |
| `  y.store(1, memory_order_relaxed);` | `  x.store(1, memory_order_relaxed);` |
Again, both stores can be speculated to occur, and be read by the loads, thus validating the speculation and causing both variables to be set to 1. This is annoying because the corresponding program written without atomics is data-race-free and always does nothing, as expected.
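For reference, a sketch of that non-atomic version (my rendition):

```
// The same program with plain ints: since neither store can execute
// unless the other thread's store already did, no execution contains
// a store, so the program is data-race-free and always does nothing.
int x = 0, y = 0;

void thread1() {
  if (x)
    y = 1;
}

void thread2() {
  if (y)
    x = 1;
}
```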
However, the worst scenario in this OOTA world is the following. Assume that `x` and `y` are declared as atomic integers, but are only ever assigned 0, 1, or 2, and that functions `f()` and `g()` each take one such 0-to-2 argument. The preconditions of `f()` and `g()` thus effectively include the restriction that the argument is between 0 and 2.
**OOTA Example 3**

| Thread 1 | Thread 2 |
|---|---|
| `r1 = x.load(memory_order_relaxed);` | `r2 = y.load(memory_order_relaxed);` |
| `f(r1);` | `g(r2);` |
Again, we have to consider an execution in which each load guesses that the value seen is, e.g., 42. This then gets passed to `f()` and `g()`, which are allowed to do anything, since their precondition is not satisfied. "Anything" includes assigning 42 to `y` and `x`, respectively, e.g. as a result of an out-of-bounds array reference. This action would again validate the speculation, making it legal.
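As a purely hypothetical illustration of the last step, `f()` might contain an array access written under the assumption that its argument is in range:

```
// Hypothetical: f() is written assuming 0 <= i && i <= 2.
int table[3];

void f(int i) {
  table[i] = i;  // For i == 42 this is undefined behavior; depending
                 // on memory layout, the out-of-bounds write could in
                 // principle store 42 into the memory occupied by y,
                 // "justifying" the speculated 42.
}
```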
This last example is particularly disconcerting, since it seems essentially impossible for the programmer to avoid in real code. It only requires that we perform relaxed loads from two variables in two different threads, and pass the results to functions that make some assumption about their inputs. Any significant code performing relaxed loads on pointers is essentially certain to contain a variant of this pattern.
Fortunately, nobody believes that any of the above three examples can actually produce these results in real implementations. All common hardware disallows such results by prohibiting a store from speculatively executing before a load on which it "depends". Unfortunately, as we will see below, the hardware notion of "depends" is routinely invalidated by many important compiler transformations, and hence this definition does not carry over to compiled programming languages.
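A classic illustration of this mismatch (my example, not from the discussion above): hardware dependence is syntactic, so it can be erased by constant folding.

```
#include <atomic>

std::atomic<int> x, y;

void thread1() {
  int r1 = x.load(std::memory_order_relaxed);
  // At the hardware level, this store "depends" on the load, since
  // the stored value is computed from r1.
  y.store(r1 * 0 + 42, std::memory_order_relaxed);
  // But any reasonable compiler folds r1 * 0 + 42 to the constant 42,
  // leaving a store with no dependence on the load, which the hardware
  // may then freely reorder before it.
}
```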
Although such results are not believed to occur in practice, the inability to precisely preclude them has serious negative effects. We have no usable and precise rules for reasoning about programs that use `memory_order_relaxed`. This means we have no hope of formally verifying most code that uses `memory_order_relaxed` (or `memory_order_consume`). "Formal verification" includes partial verification to show the absence of certain exploits. Since we don't have a precise semantics to base it on, informal reasoning about `memory_order_relaxed` also becomes harder. Compiler optimization and hardware mapping rules are likewise unclear.
It is also worth noting that so far we have constrained ourselves to examples in which each thread is essentially limited to two lines of code. The effect on more realistic code is not completely clear. Most of this paper focuses on a new class of examples that we hadn't previously explored.
However, we have been unable to provide a meaningful C++-level definition of "out-of-thin-air" results, and thus the standard has not been able to meaningfully prohibit them. The current standard states:

> “Implementations should ensure that no "out-of-thin-air" values are computed that circularly depend on their own computation.”

while offering only examples as to what that means. (See N3786 for a bit more on how we got here. This did not go over well in WG14.)
So far, our hope has been that this problem could be addressed with more ingenuity on our part, by finally developing a proper specification for "out-of-thin-air" results. Here I explain why I no longer believe this is a reasonable expectation.
It was previously known (see, e.g., Boehm and Demsky, and Bill Pugh et al.'s much earlier Java memory model litmus tests) that there were borderline cases that bear a lot of similarity to out-of-thin-air results but can be generated by existing implementations. It wasn't fully clear to us, however, how similar these can get to true out-of-thin-air results, nor how catastrophic the damage they can already do is. I no longer believe that it is possible to draw a meaningful line between these and "true" out-of-thin-air results. And even if we could draw such a line, it would be meaningless: we would still have no good way to reason about code on the "right" side of the line since, as we show below, such code can still produce completely unacceptable results.
Note that we still do not have actual code failures that have been traced to this problem; we do now have small sections of contrived code that can result in the problems discussed here. And we do not have any way to convince ourselves that real code is immune from such problems. The most problematic cases are likely to involve atomics managed by different modules, with the atomic accesses potentially quite far apart. Thus failure rates are likely to be very low, even if the problem does occur in real code.
The central problem continues to be our inability to reason about code using `memory_order_relaxed`.
Unfortunately, our initial canonical out-of-thin-air example above can be turned into a slightly more complicated example that can generate the same questionable result. Abstractly, as defined by the standard, the only difference between the two examples is an unexecuted conditional, something that should not make any difference. And I believe it cannot make any difference with anything like the specification we use in the current standard.
In external discussion, such examples have been dubbed "RFUB" ("read from unexecuted branch") rather than OOTA. Our initial example, where the additions over OOTA Example 1 are the `assigned_42` bookkeeping and the conditional in Thread 2, is:

**RFUB Example 1**

| Thread 1 | Thread 2 |
|---|---|
| `r1 = x.load(memory_order_relaxed);` | `bool assigned_42(false);` |
| `y.store(r1, memory_order_relaxed);` | `r1 = y.load(memory_order_relaxed);` |
| | `if (r1 != 42) {` |
| | `  assigned_42 = true;` |
| | `  r1 = 42;` |
| | `}` |
| | `x.store(r1, memory_order_relaxed);` |
| | `assert_not(assigned_42);` |
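For reference, a compilable rendition (my reconstruction) of the Thread 2 code, approximately as fed to the compiler below:

```
#include <atomic>

std::atomic<int> x, y;
void assert_not(bool);  // separately compiled; see below

void thread2() {
  bool assigned_42 = false;
  int r1 = y.load(std::memory_order_relaxed);
  if (r1 != 42) {
    assigned_42 = true;
    r1 = 42;
  }
  x.store(r1, std::memory_order_relaxed);
  assert_not(assigned_42);
}
```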
We argue that entirely conventional optimizations can result in an execution in which the assertion succeeds, but `x` and `y` have been assigned 42. In reality, this is achieved with a combination of hardware and compiler transformations. (We assume that `assert_not` is a separately compiled function, with the implied semantics. In our environment, we get different code if we use the actual `assert` macro.)
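One possible `assert_not`, in its own translation unit so the compiler cannot see its definition at the call site (my sketch):

```
#include <cassert>

// Asserts that its argument is false.
void assert_not(bool b) {
  assert(!b);
}
```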
The compiler transformations proceed roughly as follows:

1. The store to `x` must always assign 42. Update it to just store 42.
2. The `r1 = 42` assignment in the conditional is now dead. Remove it.
3. The conditional now does nothing but set `assigned_42`; replace it with essentially `assigned_42 = (r1 != 42)`.

This gives us essentially:
```
bool assigned_42;
r1 = y.load(memory_order_relaxed);
assigned_42 = (r1 != 42);
x.store(42, memory_order_relaxed);
assert_not(assigned_42);
```
GCC 7.2 with -O2 on AArch64 (ARMv8) generates (from the source in the table):
```
ldr  w1, [x0]         // w1 = y.load(...)
add  x0, x0, 8        // x0 = &x
mov  w2, 42
str  w2, [x0]         // x = 42
cmp  w1, w2           // w0 = (y != 42)
cset w0, ne
b    _Z10assert_notb  // call assert_not(w0)
```
Since ARMv8 allows an independent load followed by an independent store to be reordered, the store (`str`) instruction may be executed before any of the other instructions, effectively transforming the code to:
```
x.store(42, memory_order_relaxed);
bool assigned_42;
r1 = y.load(memory_order_relaxed);
assigned_42 = (r1 != 42);
assert_not(assigned_42);
```
If Thread 1 executes immediately after the store instruction, before any other Thread 2 instructions, the load of `y` will read 42, `assigned_42` will be false, and we will end up with exactly the problematic execution.
Note that there is no real obstacle to this occurring on a strongly ordered architecture like x86 as well; the hardware would not reorder a load with a later store, but there is no rule preventing the compiler from doing so. It is just difficult to construct examples in which that would be a profitable compiler transformation. On the other hand, the hardware may find this to be profitable if e.g. the load misses the cache.
This behavior makes it look as though the then-clause in the condition was partially executed. That is clearly not allowed by the current spec. But, aside from the vague "normative encouragement" to avoid OOTA results, this can be explained as the loads at the beginning of both threads speculatively observing the stores at the end, i.e. as an OOTA result.
Unfortunately, this means that current mainstream implementations are not actually following the "normative encouragement" to avoid OOTA results.
The separate external discussion has considered a number of RFUB examples. They don't seem quite as disastrous as OOTA Example 3 above. But the following section argues that reasoning about programs with RFUB results remains essentially intractable, since it invalidates arguments about properly used sequential code.
[This section is not essential for the rest of this discussion.]
Here we focus on the fact that RFUB executions appear to be inconsistent, in that the only way to understand them, other than as a general OOTA result, is as partially executing a conditional clause. This example illustrates that observation more directly by pointing out that expected semantics of correctly called single-threaded code may be violated when combined with concurrent code residing in a separate module. To do so, we look at a larger example, consisting of two different files, which should be thought of as modules written by different authors. This is a thought experiment; I did not compile this code:
The first module encapsulates a pointer to a piece of data, along with a binary property of that data. The data is automatically deallocated if it is never handed out to a client:
```
// Concurrent calls on a single object not allowed.
class data_wrapper {
 private:
  Foo *my_data;
  int flags;
  // flag values:
  static constexpr int ESCAPED = 1;
  static constexpr int PROPERTY_SET = 2;
 public:
  data_wrapper() : my_data(new Foo(...)), flags(0) {}
  ...
  bool get_data_property(Foo *x) {
    return my_data == x && (flags & PROPERTY_SET) != 0;
  }
  void set_data_property() {
    flags |= PROPERTY_SET;
  }
  Foo* get_the_data() {
    flags |= ESCAPED;
    return my_data;
  }
  ~data_wrapper() {
    if (!(flags & ESCAPED)) {
      delete my_data;
    }
    // otherwise it will go away when its arena is deallocated.
  }
  ...
};
```
Reasonable programmers would expect the delete call in the destructor to be invoked only when `my_data` did not escape, i.e. was never returned by `get_the_data()`. That safety argument is based purely on sequential reasoning; there is no concurrency involved up to this point. It should hold for any caller that does not invoke undefined behavior.
Nonetheless, the following concurrent program, which invokes `data_wrapper` member functions from only a single thread, and is thus clearly data-race-free, violates this sequential reasoning and allows `my_data` to be prematurely deleted.
#include "data_wrapper.h" std::atomic<Foo *> data1; std::atomic<Foo *> data2;
Thread 1 sometimes executes the following. It is the only thread that accesses a `data_wrapper`:
```
{
  data_wrapper dw ...;
  ...
  r1 = data1.load(memory_order_relaxed);
  if (!dw.get_data_property(r1)) {
    r1 = dw.get_the_data();
  }
  data2.store(r1, memory_order_relaxed);
  dw.set_data_property();
  ...
}
```
Thread 2 may concurrently copy `data2` to `data1`, also using relaxed operations, as in e.g.:
```
r2 = data2.load(memory_order_relaxed);
data1.store(r2, memory_order_relaxed);
```
**Transforming the `data_wrapper` code**

If the `data_wrapper` calls are inlined into Thread 1, we get, for the Thread 1 code between the ellipses:
```
r1 = data1.load(memory_order_relaxed);
if (!(dw.my_data == r1 && ...)) {
  dw.flags |= ESCAPED;
  r1 = dw.my_data;
}
data2.store(r1, memory_order_relaxed);
dw.flags |= PROPERTY_SET;
```
The compiler is motivated to temporarily promote the `flags` field to a register. Since the atomic store is relaxed, this is allowed, and we get:
```
r1 = data1.load(memory_order_relaxed);
rflags = dw.flags;
if (!(dw.my_data == r1 && ...)) {
  rflags |= ESCAPED;
  r1 = dw.my_data;
}  // else dw.my_data == r1 && ...
data2.store(r1, memory_order_relaxed);
rflags |= PROPERTY_SET;
dw.flags = rflags;
```
The compiler can now again conclude that `r1` will always be `dw.my_data` after the conditional, and transform this to (after also observing that the `r1` assignment in the conditional is dead once the store to `data2` is transformed):
```
r1 = data1.load(memory_order_relaxed);
rflags = dw.flags;
if (!(dw.my_data == r1 && ...)) {
  rflags |= ESCAPED;
}
data2.store(dw.my_data, memory_order_relaxed);
rflags |= PROPERTY_SET;
dw.flags = rflags;
```
which, since `ESCAPED` happens to be 1, can be transformed to:
```
r1 = data1.load(memory_order_relaxed);
rflags = dw.flags;
rflags |= !(dw.my_data == r1 && ...);
data2.store(dw.my_data, memory_order_relaxed);
rflags |= PROPERTY_SET;
dw.flags = rflags;
```
This once again allows the atomic store to be advanced before the atomic load. ARM or Power hardware is clearly allowed to do so if the `&&` expression is compiled without branches, which is entirely plausible. That in turn allows Thread 2 to copy `data2` to `data1` between the store and the load, causing the `data1` load to see a value of `dw.my_data`.
Thus `rflags` is not altered by the `rflags |= !(dw.my_data == r1 && ...)` line, and `ESCAPED` is never set, in spite of the fact that both `data1` and `data2` now contain `dw.my_data`. Exiting the scope in which `dw` is declared will therefore delete `dw.my_data`, leaving `data1` and `data2` as dangling references.
It is completely unclear what the programmer might have done wrong here. The `data_wrapper` implementation is clearly correct. And the author of the client couldn't really avoid this without carefully inspecting the `data_wrapper` code, violating modularity, and fully understanding potential compiler transformations.
The kind of problems we have seen here cannot occur at the assembly language level, or with a naive compiler. Weak hardware memory models like ARM and Power allow the crucial reordering of a load followed by a store, but only if the store does not "depend" on the load. The architectural specification includes a definition of "depend". For our examples, when naively compiled, the final store "depends" on the initial load, so the reordering would be prohibited, preventing the problematic executions.
The core problem is that the architectural definitions of dependence are not preserved by any reasonable compiler. In order to preserve the intent of the hardware rules, and to prevent OOTA behavior, we need to strengthen this notion of dependence to something that can reasonably be preserved by compilation. Many attempts to do so in clever ways have failed. The RFUB examples argue that this is not possible in the context of the current specification, since the only difference between full OOTA behavior, and behavior allowed by current implementations, is an unexecuted if-branch. And even if we could make this distinction, it wouldn't be useful, since implementations currently allow behavior that we don't know how to reason about.
It seems increasingly clear that the best general solution is to strengthen this notion of "dependence" to simply the "sequenced before" relation, and thus require that a relaxed load followed by a relaxed store cannot be reordered. This clearly has the disadvantage of disallowing current implementations. We previously proposed a variant of this solution in N3710. Any change in this direction may need to be phased in in some way, in order to allow compiler, and possibly hardware, vendors time to adjust.
This is also the solution that was formalized in http://plv.mpi-sws.org/scfix/paper.pdf. The formal treatment is quite simple; we just insist that the "reads from" relation is consistent with "sequenced before". We also expect the standards wording to be quite simple.
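In the usual axiomatic-model notation, the requirement can be written as (my paraphrase of that paper's condition):

```
% No OOTA: sequenced-before (sb) together with reads-from (rf)
% must not form a cycle.
\mathrm{acyclic}(\mathit{sb} \cup \mathit{rf})
```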
The performance of this approach was recently studied in Peizhao Ou and Brian Demsky, "Towards Understanding the Costs of Avoiding Out-of-Thin-Air Results" (OOPSLA'18; it should be available by the San Diego meeting). The overhead of preventing load;store reordering on ARM or Power is much less than that of enforcing acquire/release ordering. There are some uncertainties about machine memory models in this area, but the expectation is that the required ordering can be enforced by inserting a never-taken branch, or the like, after `memory_order_relaxed` loads. In many cases even this will not be needed, because an existing branch, load, or store instruction either already fulfills that role or can be made to do so. No CPU architecture appears to require an actual expensive fence instruction to enforce load;store ordering.
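To make the division of labor concrete, here is a rough sketch (my construction, not from the paper) of the compiler-visible half in portable C++; the never-taken branch itself can only be emitted by the compiler backend:

```
#include <atomic>

// Sketch only: a relaxed load that is ordered before later relaxed stores.
int load_relaxed_lso(const std::atomic<int>& a) {
  int r = a.load(std::memory_order_relaxed);
  // Restrains the compiler from moving subsequent stores above the load;
  // on ARM or Power, a conforming implementation would additionally emit
  // a never-taken conditional branch (or the like) depending on r, so
  // that the hardware also preserves the ordering.
  std::atomic_signal_fence(std::memory_order_seq_cst);
  return r;
}
```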
I am not aware of any cost measurements for enforcing load;store ordering on a GPU. Such measurements would be extremely useful.
Pursuing this path will require some compiler changes around the compilation of `memory_order_relaxed`. And I expect it would eventually result in the introduction of better hardware support than the current "bogus branch" technique.
Participants in the external discussion largely agreed that we need a replacement or replacements for `memory_order_relaxed`. But there is no clear consensus on many more detailed questions:

- Should we strengthen `memory_order_relaxed` along the above lines, or leave it as is and add a stronger version amenable to precise reasoning?
- If we replace the current `memory_order_relaxed` semantics with the stronger semantics, we will want to offer one or more specialized weaker versions, since some specific usage idioms do not require the stronger treatment. For example, a relaxed atomic increment of a counter often does not require the strengthening, because the result of the implied load is not otherwise used (see the sketch after this list). Should there be one such weaker order, or one for each idiom?
- Can we precisely reason about `memory_order_relaxed` replacements that are specified in this way?
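As an illustration of such an idiom, a sketch of the counter case mentioned above:

```
#include <atomic>

std::atomic<long> event_count{0};

// A statistics counter: the value of the implied load feeds only the
// store of the incremented value and is not otherwise used, so this
// idiom would not need the stronger load;store ordering.
void record_event() {
  event_count.fetch_add(1, std::memory_order_relaxed);
}
```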
My opinion is that, since current implementations effectively do not follow the OOTA guidance in the standard, and since we want to preserve the correctness of current code while actually specifying its behavior precisely, we should change the existing semantics of `memory_order_relaxed` to guarantee load;store ordering.
We should then try to develop weaker memory orders tailored to specific idioms, and
specified in a DRFrlx style, to regain any performance lost by the preceding step.
I currently do not believe that we will be able to find a single weaker specification that avoids our current OOTA issues. Thus I expect the specifications to be tailored to specific idioms. It currently seems useful to expose these as different `memory_order`s, to make it clear what part of the specification the user is relying on. It also seems likely that each of these will impose somewhat different optimization constraints. And many of us are of the opinion that `memory_order_relaxed` is already usable primarily in cases that match one of a small number of reasonably well-known idioms.

I haven't been convinced that the DRFrlx approach by itself is currently viable as a replacement for `memory_order_relaxed`, both due to backwards compatibility issues, and because some of the use cases we see in Android appear to be hard to cover.
The end result here should be a specification that provides performance similar to what we have now, but that is well-defined for all the `memory_order`s, rather than the current hand-waving for `memory_order_relaxed`. It would also address OOTA issues for `memory_order_consume`, but not touch the other `memory_order_consume` issues we have been discussing separately.
Full decisions here will have to wait for some of the missing information, particularly in regard to GPU performance implications. But a preliminary sense of the committee on these issues would be useful.