ISO/IEC JTC1 SC22 WG21 N2745 = 08-0255 - 2008-08-22
Paul E. McKenney, paulmck@linux.vnet.ibm.com
Raul Silvera, rauls@ca.ibm.com
This document presents a PowerPC implementation of the proposed C/C++ memory-order model (including the modifications for dependency ordering proposed in N2556), and analyzes some representative non-SC (sequentially consistent) code sequences. Sequences involving both acquire and dependency-ordering operations are analyzed.
The authors owe a debt of gratitude to Derek Williams, Cathy May, and Brad Frey for their careful and patient explanations of the PowerPC memory model. However, all mistakes within are the sole property of the authors.
There are any number of possible mappings from the C/C++ memory-ordering primitives onto the PowerPC instruction set. This document assumes the following mapping, derived by Raul:
Operation | PowerPC Implementation |
---|---|
Load Relaxed | ld |
Load Consume | ld |
Load Acquire | ld; cmp; bc; isync |
Load Seq Cst | hwsync; ld; cmp; bc; isync |
Store Relaxed | st |
Store Release | lwsync; st |
Store Seq Cst | hwsync; st |
Cmpxchg Relaxed,Relaxed | ldarx; cmp; bc _exit; stdcx.; bc _loop |
Cmpxchg Acquire,Relaxed | ldarx; cmp; bc _exit; stdcx.; bc _loop; isync |
Cmpxchg Release,Relaxed | lwsync; ldarx; cmp; bc _exit; stdcx.; bc _loop |
Cmpxchg AcqRel,Relaxed | lwsync; ldarx; cmp; bc _exit; stdcx.; bc _loop; isync |
Cmpxchg SeqCst,Relaxed | hwsync; ldarx; cmp; bc _exit; stdcx.; bc _loop; isync |
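For concreteness, the following C++ fragment (illustrative only; the variable names and function are not part of the proposal) exercises several of the operations above, with comments giving the PowerPC sequence that the assumed mapping would produce for each:

```
#include <atomic>

std::atomic<long> x(0), y(0);

// Illustrative only: each statement corresponds to a row of the mapping
// table above, assuming a compiler that uses exactly that mapping.
long mapping_sketch()
{
    long r1 = x.load(std::memory_order_relaxed);   // ld
    long r2 = x.load(std::memory_order_consume);   // ld
    long r3 = x.load(std::memory_order_acquire);   // ld; cmp; bc; isync
    long r4 = x.load(std::memory_order_seq_cst);   // hwsync; ld; cmp; bc; isync

    y.store(1, std::memory_order_relaxed);         // st
    y.store(2, std::memory_order_release);         // lwsync; st
    y.store(3, std::memory_order_seq_cst);         // hwsync; st

    return r1 + r2 + r3 + r4;
}
```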
The C/C++ definition of “modification order” maps to the PowerPC notion of “memory coherence”. Section 1.6.3 of PowerPC Book 2 describes memory coherence as follows:
An access to a Memory Coherence Required storage location is performed coherently, as follows.
Memory coherence refers to the ordering of stores to a single location. Atomic stores to a given location are coherent if they are serialized in some order, and no processor or mechanism is able to observe any subset of those stores as occurring in a conflicting order. This serialization order is an abstract sequence of values; the physical storage location need not assume each of the values written to it. For example, a processor may update a location several times before the value is written to physical storage. The result of a store operation is not available to every processor or mechanism at the same instant, and it may be that a processor or mechanism observes only some of the values that are written to a location. However, when a location is accessed atomically and coherently by all processors and mechanisms, the sequence of values loaded from the location by any processor or mechanism during any interval of time forms a subsequence of the sequence of values that the location logically held during that interval. That is, a processor or mechanism can never load a “newer” value first and then, later, load an “older” value.
Memory coherence is managed in blocks called coherence blocks. Their size is implementation-dependent (see the Book IV, PowerPC Implementation Features document for the implementation), but is larger than a word and is usually the size of a cache block.
Release sequences include two cases: subsequent stores by the same thread that performed the release, and non-relaxed atomic read-modify-write operations. The subsequent-stores case is covered by the following sentence of Section 1.6.3 of PowerPC Book 2, quoted above:
That is, a processor or mechanism can never load a “newer” value first and then, later, load an “older” value.
In addition, we shall see that the operation of PowerPC memory barriers causes the subsequent-stores case to trivially follow from the case where the acquire operation loads the head of the release chain.
The non-relaxed-atomic case has not been carefully studied, but the authors are confident that use of either the lwsync or the hwsync instruction will suffice to enforce this ordering. A particular concern is the relationship between atomic operations and release sequences, which is explored in a later section.
The relevant wording from PowerPC Book 2 describing PowerPC's atomic read-modify-write sequences is as follows:
The store caused by a successful “stwcx.” is ordered, by a dependence on the reservation, with respect to the load caused by the “lwarx” that established the reservation, such that the two storage accesses are performed in program order with respect to any processor or mechanism.
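As a sketch of how this reservation-based sequence surfaces at the C++ level (the counter variable and function name are illustrative, and the exact generated code depends on the compiler), a compare-exchange loop of the Release,Relaxed flavor from the mapping table might be written as:

```
#include <atomic>

std::atomic<long> counter(0);

// Sketch: per the mapping table, each attempt of the Release,Relaxed
// compare-exchange would correspond to
//   lwsync; ldarx; cmp; bc _exit; stdcx.; bc _loop
// with the stdcx. tied to the ldarx by the reservation, as described
// in the passage quoted above.
void increment_release()
{
    long old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_release,
                                          std::memory_order_relaxed)) {
        // On failure, old is reloaded with the current value; retry.
    }
}
```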
The “synchronizes with” relation is handled by the placement of the memory barriers in the code-sequences table above: the stores to object M follow an lwsync or hwsync instruction, and the corresponding loads precede one of a number of instruction sequences called out in the table.
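For example (a minimal sketch, with data and flag as illustrative names, assuming the mapping table above), a release store paired with an acquire load places the barriers as follows:

```
#include <atomic>

std::atomic<int> flag(0);
int data;  // ordinary, non-atomic storage

// Writer: the release store is preceded by lwsync, so the store to data
// is in the lwsync's A-set and the store to flag is in its B-set.
void writer()
{
    data = 42;
    flag.store(1, std::memory_order_release);       // lwsync; st
}

// Reader: the acquire load is followed by cmp; bc; isync, so the access
// to data cannot be performed before the load from flag.
int reader()
{
    if (flag.load(std::memory_order_acquire) == 1)  // ld; cmp; bc; isync
        return data;                                // sees 42
    return -1;
}
```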
These instructions are described in Section 1.7.1 of
PowerPC Book 2, as follows:
When a processor (P1) executes a sync, lwsync, or eieio instruction, a memory barrier is created, which orders applicable storage accesses pairwise, as follows. Let A be a set of storage accesses that includes all storage accesses associated with instructions preceding the barrier-creating instruction, and let B be a set of storage accesses that includes all storage accesses associated with instructions following the barrier-creating instruction. For each applicable pair ai,bj of storage accesses such that ai is in A and bj is in B, the memory barrier ensures that ai will be performed with respect to any processor or mechanism, to the extent required by the associated Memory Coherence Required attributes, before bj is performed with respect to that processor or mechanism.
The word “performed” is defined roughly as follows: a load by a processor P1 is performed with respect to a processor or mechanism P2 when the value to be returned by the load can no longer be changed by a store by P2, and a store by P1 is performed with respect to P2 when a load by P2 from the accessed location will return the value stored (or a subsequently stored value).
Section 1.7.1 of PowerPC Book 2 goes on to discuss cumulativity, which is somewhat similar to a causal ordering:
The ordering done by a memory barrier is said to be “cumulative” if it also orders storage accesses that are performed by processors and mechanisms other than P1, as follows.
- A includes all applicable storage accesses by any such processor or mechanism that have been performed with respect to P1 before the memory barrier is created.
- B includes all applicable storage accesses by any such processor or mechanism that are performed after a Load instruction executed by that processor or mechanism has returned the value stored by a store that is in B.
Note that B-cumulativity recurses on stores, as illustrated by the following sequence of operations:
- Thread 0 executes a memory fence and then stores to a. This store will be in the memory fence's B-set.
- Thread 1 executes a load from a that returns the value stored by thread 0. Thread 1's load and all of thread 1's operations that are ordered after that load are in thread 0's memory fence's B-set.
- Thread 1 executes a store to b that is ordered after the load from a (for example, by either code or data dependency). This store is therefore in thread 0's memory fence's B-set.
- Thread 2 executes a load from b that returns the value stored by thread 1. Thread 2's load and all of thread 2's operations that are ordered after that load are in thread 0's memory fence's B-set.
The recursive nature of B-cumulativity allows this sequence to be extended indefinitely.
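The following C++ sketch mirrors this chain (the names a, b, and payload are illustrative; each function is intended to run in its own thread). Note that, as written, the relaxed store to b means the final assertion is justified here only by the PowerPC B-cumulativity argument above, not by the C++ memory model itself:

```
#include <atomic>
#include <cassert>

std::atomic<int> a(0), b(0);
int payload;

void thread0()
{
    payload = 1;
    a.store(1, std::memory_order_release);      // lwsync; st -- the store to a
                                                // is in the lwsync's B-set
}

void thread1()
{
    if (a.load(std::memory_order_acquire) == 1) // returned thread 0's value, so
        b.store(1, std::memory_order_relaxed);  // this store joins thread 0's
                                                // fence's B-set (B-cumulativity)
}

void thread2()
{
    if (b.load(std::memory_order_acquire) == 1) // returned thread 1's value, so
        assert(payload == 1);                   // expected on PowerPC by the
                                                // recursive B-cumulativity above
}
```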
In contrast, A-cumulativity has no such recursion. Only those operations performed with respect to the specific thread containing the memory fence (termed “memory barrier” in PowerPC documentation) will be in that memory fence's A-set. There is no A-cumulativity recursion through other threads' loads; such recursion is confined to B-cumulativity.
In both of the above passages from Section 1.7.1, the importance of the word “applicable” cannot be overstated. This word's meaning depends on the type of memory-fence instruction as follows:
- eieio: only stores are “applicable”.
- hwsync: both loads and stores are “applicable”. This instruction is often simply called sync, and provides full causal ordering (in fact, full sequential consistency).
- lwsync: applicability is limited to the following cases: prior loads against subsequent loads, prior loads against subsequent stores, and prior stores against subsequent stores. In other words, prior stores and subsequent loads are not “applicable” to lwsync.
- bc;isync: this is a very low-overhead and very weak form of memory fence. A specific set of preceding loads on which the bc (branch conditional) instruction depends is guaranteed to have completed before any subsequent instruction begins execution. However, store-buffer and cache-state effects can nevertheless make it appear that subsequent loads occur before the preceding loads upon which the bc depends. That said, the PowerPC architecture does not permit stores to be executed speculatively, so any store following the bc;isync sequence is guaranteed to happen after any of the loads on which the bc depends. Note that the bc;isync instruction sequence does not provide cumulativity.
This permits the following counter-intuitive sequence of
events, with all variables initially zero, and results of
loads in square brackets following the load:
CPU 0 | CPU 1 | CPU 2 |
---|---|---|
x=1; | r1=x; [1] | r2=y; [1] |
 | lwsync; | bc;isync; |
 | y=1; | r3=x; [0] |
This sequence of events is more likely to occur on systems where CPUs 0 and 1 are closely related, for example, when CPUs 0 and 1 are hardware threads in one core and CPU 2 is a hardware thread in another core.
The effect of the isync instruction is described in the program note in Section 1.7.1 of PowerPC Book 2:
Because an isync instruction prevents the execution of instructions following the isync until instructions preceding the isync have completed, if an isync follows a conditional Branch instruction that depends on the value returned by a preceding Load instruction, the load on which the Branch depends is performed before any loads caused by instructions following the isync. This applies even if the effects of the “dependency” are independent of the value loaded (e.g., the value is compared to itself and the Branch tests the EQ bit in the selected CR field), and even if the branch target is the sequentially next instruction.
This might seem at first glance to prohibit the above code sequence; however, please keep in mind that CPU 1's lwsync does not affect CPU 0's store to x. Cumulativity does not come into play here because prior stores (CPU 0's store to x) and subsequent loads (CPU 2's load from x) are not applicable to the lwsync instruction. However, if the lwsync were replaced with a hwsync, the outcome shown above would be impossible.
N2556 defines carrying a dependency as follows:
An evaluation A carries a dependency to an evaluation B if
- the value of A is used as an operand of B, and:
  - B is not an invocation of any specialization of std::kill_dependency, and
  - A is not the left operand of the comma (',') operator,
or
- A writes a scalar object or bit-field M, B reads the value written by A from M, and A is sequenced before B, or
- for some evaluation X, A carries a dependency to X, and X carries a dependency to B.
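As a small illustration of this definition (the node type and variable names are assumptions, not part of N2556), the first load below carries a dependency into the access to p->a, while std::kill_dependency deliberately breaks the chain:

```
#include <atomic>

struct node { int a; int b; };
std::atomic<node*> head(nullptr);

int consumer()
{
    node* p = head.load(std::memory_order_consume);
    if (p == nullptr)
        return -1;

    // The value of the consume load is used as an operand of this access,
    // so it carries a dependency and is dependency-ordered after the
    // store that published the pointer.
    int r1 = p->a;

    // kill_dependency breaks the chain: the access to q->b is no longer
    // ordered by the consume load (an acquire load would be needed).
    node* q = std::kill_dependency(p);
    int r2 = q->b;

    return r1 + r2;
}
```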
In cases where evaluation B is a load, we refer to Section 1.7.1 of PowerPC Book 2:
If a Load instruction depends on the value returned by a preceding Load instruction (because the value is used to compute the effective address specified by the second Load), the corresponding storage accesses are performed in program order with respect to any processor or mechanism to the extent required by the associated Memory Coherence Required attributes. This applies even if the dependency has no effect on program logic (e.g., the value returned by the first Load is ANDed with zero and then added to the effective address specified by the second Load).
Where evaluation B is a store, we refer to Section 4.2.4 of PowerPC Book 3:
Stores are not performed out-of-order (even if the Store instructions that caused them were executed out-of-order). Moreover, address translations associated with instructions preceding the corresponding Store instruction are not performed again after the store has been performed.
The synchronizes-with examples involve one thread performing an evaluation A sequenced before a release operation on some atomic object M, concurrently with another thread performing an acquire operation on this same atomic object M sequenced before an evaluation B. These examples are the four cases where A and B are all combinations of relaxed loads and stores.
This example shows how the C++ synchronization primitives can be used to order loads. Consider the following C++ code, with all variables initially zero, and with the assertion executing after all threads have completed:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1 = x.load(memory_order_relaxed); | r2 = y.load(memory_order_acquire); | x.store(1, memory_order_relaxed); |
y.store(1, memory_order_release); | r3 = x.load(memory_order_relaxed); | |

assert(r2==0||r1==0||r3==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1=x; | r2=y; | x=1; |
lwsync; | if (r2==r2) | |
y=1; | isync; | |
 | r3=x; | |
Discussion:
- Thread 0's load from x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set.
- If r1==1, we know that, by cumulativity, thread 2's store to x is also in the lwsync's A-set.
- If r2==1, then we know that thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is in the lwsync's B-set.
- Therefore, if r1==1 and r2==1, thread 2's store to x is performed before thread 1's load from y with respect to all processors.
- Because of thread 1's conditional branch and isync, thread 1's load from y is performed before thread 1's load from x with respect to all processors.
- Therefore, if r1==1 and r2==1, we know that r3==1, so that the assert is satisfied.
- (Note: we need not rely on thread 1's load from x being in the lwsync's B-set. This is fortunate given that prior stores and subsequent loads are not applicable for lwsync.)
Note that the C++ memory model does not actually require the loads to be ordered in this case, since thread 2's relaxed store is not guaranteed to be seen in any particular order by thread 0's and thread 1's relaxed loads. The fact that POWER provides ordering in this case is coincidental. The following modified code guarantees ordering according to the C++ memory model:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1 = x.load(memory_order_acquire); | r2 = y.load(memory_order_acquire); | x.store(1, memory_order_release); |
y.store(1, memory_order_release); | r3 = x.load(memory_order_acquire); | |

assert(r2==0||r1==0||r3==1);
Since this code sequence adds memory barriers and does not remove any, the assert is never violated on Power.
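A minimal sketch of the modified example as a complete program (the use of std::thread and the surrounding main are assumptions added for the sake of a runnable harness):

```
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0), y(0);
int r1, r2, r3;

int main()
{
    std::thread t0([] { r1 = x.load(std::memory_order_acquire);
                        y.store(1, std::memory_order_release); });
    std::thread t1([] { r2 = y.load(std::memory_order_acquire);
                        r3 = x.load(std::memory_order_acquire); });
    std::thread t2([] { x.store(1, std::memory_order_release); });
    t0.join();
    t1.join();
    t2.join();
    assert(r2 == 0 || r1 == 0 || r3 == 1);  // guaranteed by the C++ model
    return 0;
}
```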
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:
Thread 0 | Thread 1 |
---|---|
r1 = x.load(memory_order_relaxed); | r2 = y.load(memory_order_acquire); |
y.store(1, memory_order_release); | if (r2 != 0) x.store(1, memory_order_relaxed); |

assert(r1==0);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 |
---|---|
r1=x; | r2=y; |
lwsync; | if (r2!=0) |
y=1; | isync; |
 | x=1; |
Discussion:
- Thread 0's load from x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the load is performed before the store with respect to any given thread.
- If r2==1:
  - Thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y and store to x are in the lwsync's B-set.
  - This means that thread 0's load from x is performed before thread 1's load from y with respect to each processor.
  - Thread 1's load from y is performed before its store to x with respect to each processor.
  - Therefore, thread 0's load from x is performed before thread 1's store to x with respect to any given processor.
  - Therefore, in this case, r1==0.
- Otherwise, if r2==0, then thread 1's store to x is not executed, so x's initial value of zero is retained and r1==0.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:
Thread 0 | Thread 1 |
---|---|
x.store(1, memory_order_relaxed); | r1 = y.load(memory_order_acquire); |
y.store(1, memory_order_release); | if (r1 == 1) r2 = x.load(memory_order_relaxed); |

assert(r1==0||r2==1);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 |
---|---|
x=1; | r1=y; |
lwsync; | if (r1==1) |
y=1; | isync; |
 | r2=x; |
Discussion:
- Thread 0's store to x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the store to x is performed before the store to y with respect to any given thread.
- If r1==1:
  - Thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is in the lwsync's B-set.
  - Therefore, thread 0's store to x is performed before thread 1's load from y with respect to any given thread.
  - Thread 1's conditional branch and isync mean that thread 1's load from y is performed before its load from x with respect to each processor and mechanism.
  - Therefore, thread 0's store to x is performed before thread 1's load from x.
  - Therefore, r2==1, satisfying the assert.
- Otherwise, r1==0 (and r2 retains its initial value of zero), so the assert is directly satisfied.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing at any time:
Thread 0 | Thread 1 |
---|---|
x.store(1, memory_order_relaxed); | r1 = y.load(memory_order_acquire); |
y.store(1, memory_order_release); | if (r1 == 1) z.store(1, memory_order_relaxed); |

assert(z==0||x==1);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 |
---|---|
x=1; | r1=y; |
lwsync; | if (r1==1) |
y=1; | isync; |
 | z=1; |
Discussion:
- Thread 0's store to x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This means that the store to x is performed before the store to y with respect to any given thread.
- If r1==1:
  - Thread 0's store to y synchronizes with thread 1's load from y, which means that thread 1's load from y is in the lwsync's B-set.
  - Therefore, thread 0's store to x is performed before thread 1's load from y with respect to any given thread.
  - Thread 1's load from y is performed before its store to z with respect to any processor.
  - Therefore, thread 0's store to x is performed before thread 1's store to z with respect to any processor.
  - Therefore, if the assertion observes z==1, it also observes x==1, satisfying the assert.
- Otherwise, if r1==0, then z is never assigned to, so its value remains zero, which also satisfies the assert.
The dependency-ordered-before examples involve one thread performing an evaluation A sequenced before a release operation on some atomic object M, concurrently with another thread performing a consume operation on this same atomic object M sequenced before an evaluation B that depends on the consume operation. These examples are the four cases where A and B are all combinations of relaxed loads and stores. Some of the examples require a third thread; for example, the load/load case cannot detect ordering without an independent store.
Like Example 1, this example orders a pair of loads, but using dependency ordering rather than load-acquire. Consider the following C++ code, with all variables initially zero, and with the assertion executing after all threads have completed:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1 = p.a.load(memory_order_acquire); | r2 = y.load(memory_order_consume); | p.a.store(1, memory_order_release); |
y.store(&p, memory_order_release); | if (r2 == &p) r3 = r2->a.load(memory_order_acquire); | |

assert(r2!=&p||r1==0||r3==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1=p.a; | r2=y; | p.a=1; |
lwsync; | if (r2==&p) | lwsync; |
y=&p; | r3=r2->a; |
Note that the ordering instructions corresponding to thread 0's
load-acquire from p.a
are folded into the following
lwsync
instruction.
Note also that the ordering instructions corresponding to thread 1's
second load have no effect, and hence have been omitted.
Discussion:
- Thread 0's load from p.a is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set.
- If r1==1, then we know that, by cumulativity, thread 2's store to p.a is also in the lwsync's A-set.
- If r2==&p, then we know that thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is in the lwsync's B-set.
- Therefore, thread 2's store to p.a is performed before thread 1's load from y with respect to any given thread.
- Because thread 1's load from r2->a depends on the value returned by its load from y, the load from y is performed before the load from r2->a with respect to all processors.
- Therefore, if r1==1 and r2==&p, we know that r3==1, so that the assert must always be satisfied.
- (Note: we need not rely on thread 1's load from r2->a being in the lwsync's B-set, which is good given that prior stores and subsequent loads are not applicable for lwsync.)
The C++ code for this example is as follows, with all variables initially zero:
Thread 0 | Thread 1 |
---|---|
r1 = p.a.load(memory_order_relaxed); | r2 = y.load(memory_order_consume); |
y.store(&p, memory_order_release); | if (r2 == &p) r2->a.store(1, memory_order_relaxed); |

assert(r1==0);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 |
---|---|
r1=p.a; | r2=y; |
lwsync; | if (r2==&p) |
y=&p; | r2->a=1; |
Discussion:
- Thread 0's load from p.a is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that thread 0's load is performed before its store with respect to any given processor.
- If r2==&p, then we know that thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is in the lwsync's B-set.
- Because thread 1's store to r2->a depends on the value returned by its load from y, thread 0's load from p.a is performed before thread 1's store to r2->a with respect to any given thread.
- Therefore, in this case, we have r1==0.
- Otherwise, if r2!=&p, then thread 1's store to p.a will not be executed, so that r1==0.
The C++ code for this example is as follows, with all variables initially zero:
Thread 0 | Thread 1 |
---|---|
p.a.store(1, memory_order_relaxed); | r1 = y.load(memory_order_consume); |
y.store(&p, memory_order_release); | if (r1 == &p) r2 = r1->a.load(memory_order_relaxed); |

assert(r1!=&p||r2==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 |
---|---|
p.a=1; | r1=y; |
lwsync; | if (r1==&p) |
y=&p; | r2=r1->a; |
Discussion:
- Thread 0's store to p.a is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the store to p.a is performed before the store to y with respect to any given thread.
- If r1==&p:
  - Thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is in the lwsync's B-set.
  - Because thread 1's load from r1->a depends on the value returned by its load from y, the load from y will be performed before the load from r1->a with respect to all processors.
  - Therefore, thread 0's store to p.a is performed before thread 1's load from r1->a with respect to any given thread.
  - Therefore, r2==1, satisfying the assert.
- Otherwise, if r1!=&p, then r2 retains its initial value of zero and the assert is directly satisfied.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing at any time:
Thread 0 | Thread 1 |
---|---|
p.a.store(1, memory_order_relaxed); | r1 = y.load(memory_order_consume); |
y.store(&p, memory_order_release); | if (r1 == &p) r1->b.store(1, memory_order_relaxed); |

assert(p.b==0||p.a==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 |
---|---|
p.a=1; | r1=y; |
lwsync; | if (r1==&p) |
y=&p; | r1->b=1; |
Discussion:
- Thread 0's store to p.a is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the store to p.a is performed before the store to y with respect to any given thread.
- If r1==&p:
  - Thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is in the lwsync's B-set.
  - Thread 1's load from y is performed before its store to r1->b, because the store depends on the value returned by the load.
  - Therefore, thread 0's store to p.a is performed before thread 1's store to r1->b with respect to each processor.
  - Therefore, if the assertion observes p.b==1, it also observes p.a==1, satisfying the assert.
- Otherwise, if r1==0, then thread 1's assignment to r1->b is never executed, so that p.b is zero, again satisfying the assert.
The C++ memory model also provides for “release sequences”, which comprise either (1) subsequent stores to the variable that was the subject of the release operation by the same thread that performed the release operation, or (2) non-relaxed atomic read-modify-write operations on the variable that was the subject of the release operation by any thread. We consider these two release-sequence components separately in the following sections.
Subsequent same-thread stores are trivially accommodated: simply add an additional (possibly relaxed) atomic store after the release operation, and modify the checks in the corresponding acquire operation to test for this subsequent store.
The same reasoning that worked for the original analyses applies straightforwardly to the updated examples.
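For instance (a sketch based on example 3, with the value 2 marking the illustrative subsequent store; the two functions are intended to run in separate threads):

```
#include <atomic>
#include <cassert>

std::atomic<int> x(0), y(0);

void thread0()
{
    x.store(1, std::memory_order_relaxed);
    y.store(1, std::memory_order_release);   // head of the release sequence
    y.store(2, std::memory_order_relaxed);   // same-thread subsequent store:
                                             // still part of the sequence
}

void thread1()
{
    int r1 = y.load(std::memory_order_acquire);
    if (r1 == 2) {                           // read the subsequent store, so the
        int r2 = x.load(std::memory_order_relaxed);  // acquire still synchronizes
        assert(r2 == 1);                     // with the release store of 1
    }
}
```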
This section reprises the examples in the “Synchronizes-With” section, introducing an atomic read-modify-write operation into each.
This example shows how the C++ synchronization primitives can be used to order loads. Consider the following C++ code, with all variables initially zero, and with the assertion executing after all threads have completed:
Thread 0 | Thread 1 | Thread 2 | Thread 3 |
---|---|---|---|
r1 = x.load(memory_order_acquire); | r2 = y.load(memory_order_acquire); | x.store(1, memory_order_release); | y.fetch_add(1, memory_order_acq_rel); |
y.store(2, memory_order_release); | r3 = x.load(memory_order_acquire); | | |

assert(r2<=1||r1==0||r3==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 | Thread 2 | Thread 3 |
---|---|---|---|
r1=x; | r2=y; | x=1; | ldarx r4,y |
lwsync; | if (r2==r2) | | r5=r4+1 |
y=2; | isync; | | stdcx. y,r5 |
 | r3=x; | | |
Discussion:
- Thread 0's load from x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set.
- If r4==2, then thread 3's ldarx read the value stored by thread 0, so thread 3's stdcx. is also in the lwsync's B-set.
- The r2==2 case was handled in example 1, and is not repeated here.
- If r2==3, we have loaded the value stored by an operation in the lwsync's B-set (thread 3's stdcx.), and therefore all operations following this load are also in the lwsync's B-set.
- Therefore, if r1==1 and r2==3, thread 2's store to x is performed before thread 1's load from y with respect to all processors.
- Because of thread 1's conditional branch and isync, thread 1's load from y is performed before thread 1's load from x with respect to all processors.
- Therefore, if r1==1 and r2==3, we know that r3==1, so that the assert is satisfied.
- (Note: we need not rely on thread 1's load from x being in the lwsync's B-set. This is fortunate given that prior stores and subsequent loads are not applicable for lwsync.)
Note that this line of reasoning does not depend on any memory barriers in the atomic read-modify-write operation, which means that this code sequence would maintain the synchronizes-with relationship even when using memory_order_relaxed. However, the standard does not require this behavior, so portable code must use a non-relaxed memory-order specifier in this case.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1 = x.load(memory_order_relaxed); | r2 = y.load(memory_order_acquire); | y.fetch_add(1, memory_order_acq_rel); |
y.store(2, memory_order_release); | if (r2 >= 2) x.store(1, memory_order_relaxed); | |

assert(r1==0);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1=x; | r2=y; | ldarx r3,y |
lwsync; | if (r2>=2) | r4=r3+1 |
y=2; | isync; | stdcx. y,r4 |
 | x=1; | |
Discussion:
- Thread 0's load from x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the load is performed before the store with respect to any given thread.
- The case of r2==2 was handled in example 2, and is not repeated here.
- If r2==3, we know that thread 2's ldarx read the value stored by thread 0, and thread 2's stdcx. is therefore in the lwsync's B-set. In addition, thread 1's load into r2 read the value stored by thread 2's stdcx., so thread 1's operations ordered after that load are therefore also in the lwsync's B-set, meaning that thread 1's load into r2 is performed after thread 0's load into r1 with respect to all threads.
- Thread 1's conditional branch and isync ensure that the load into r2 is performed before the store to x with respect to all other threads.
- Therefore, r1 will always be zero, because thread 1's store to x is never performed until after thread 0's load into r1 has been performed.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
x.store(1, memory_order_relaxed); | r1 = y.load(memory_order_acquire); | y.fetch_add(1, memory_order_acq_rel); |
y.store(1, memory_order_release); | if (r1 >= 2) r2 = x.load(memory_order_relaxed); | |

assert(r1<=1||r2==1);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
x=1; | r1=y; | ldarx r3,y |
lwsync; | if (r1>=2) | r4=r3+1 |
y=1; | isync; | stdcx. y,r4 |
 | r2=x; | |
Discussion:
- Thread 0's store to x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the store to x is performed before the store to y with respect to any given thread.
- If r1==1, we have the case covered in example 3, which will not be further discussed here.
- If r1>=2, we know that thread 2's atomic increment intervened, so that thread 2's stdcx. is in the lwsync's B-set.
- In addition, thread 1's load into r1 will have loaded the value stored by thread 2's stdcx., so that all memory operations ordered after thread 1's load into r1 will also be in the lwsync's B-set.
- Thread 1's conditional branch and isync ensure that thread 1's load into r1 is performed before its load into r2.
- Therefore, if r1 is equal to two, r2 is guaranteed to be the value stored to x by thread 0, namely, the value 1.
- The assertion is therefore always satisfied.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing at any time:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
x.store(1, memory_order_relaxed); | r1 = y.load(memory_order_acquire); | y.fetch_add(1, memory_order_acq_rel); |
y.store(1, memory_order_release); | if (r1 >= 2) z.store(1, memory_order_relaxed); | |

assert(z==0||x==1);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
x=1; | r1=y; | ldarx r2,y |
lwsync; | if (r1>=2) | r3=r2+1 |
y=1; | isync; | stdcx. y,r3 |
 | z=1; | |
Discussion:
- Thread 0's store to x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This means that the store to x is performed before the store to y with respect to any given thread.
- If r1==1, we have the case dealt with in example 4, which will not be discussed further.
- If r1==2, we know that thread 2's atomic operation intervened, so that thread 2's stdcx. is in the lwsync's B-set.
- Furthermore, because thread 1's load into r1 sees the value stored by thread 2's stdcx., all of thread 1's operations ordered after that load are also in the lwsync's B-set.
- Thread 1's conditional branch and isync guarantee that the load into r1 is performed before its store to z with respect to all threads.
- This means that if the store to z is performed, the value of x must already be one, satisfying the assert.
The key point in examples 9-12 was the recursive nature of B-cumulativity. This applies straightforwardly to the dependency ordering examples as well, so that Power allows a chain of atomic operations to form a release sequence, regardless of the memory_order argument.
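A sketch of this release-sequence pattern at the C++ level (the function names are illustrative; each is intended to run in its own thread, and, following the discussion above, the read-modify-write uses a non-relaxed memory order for portability):

```
#include <atomic>
#include <cassert>

std::atomic<int> x(0), y(0);

void releaser()       // like thread 0 in examples 9-12
{
    x.store(1, std::memory_order_relaxed);
    y.store(1, std::memory_order_release);      // head of the release sequence
}

void incrementer()    // like the ldarx/stdcx. thread
{
    y.fetch_add(1, std::memory_order_acq_rel);  // RMW extends the sequence
}

void observer()       // like the acquire-side thread
{
    if (y.load(std::memory_order_acquire) == 2) {
        // Seeing 2 means the RMW read the released value 1, so the RMW is
        // part of the release sequence and the acquire load synchronizes
        // with the release store; the store to x is therefore visible.
        assert(x.load(std::memory_order_relaxed) == 1);
    }
}
```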
The PowerPC code sequences shown at the beginning of this document suffice to implement the synchronizes-with and dependency-carrying portions of the proposed C++ memory model. These code sequences are able to take advantage of the lightweight memory barriers provided by the PowerPC architecture.