<div dir="ltr"><div dir="ltr">On Mon, Apr 15, 2019 at 2:29 PM Peter Sewell <<a href="mailto:Peter.Sewell@cl.cam.ac.uk">Peter.Sewell@cl.cam.ac.uk</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 15/04/2019, Richard Smith <<a href="mailto:richardsmith@googlers.com" target="_blank">richardsmith@googlers.com</a>> wrote:<br>
> On Mon, Apr 15, 2019 at 7:16 AM Peter Sewell <<a href="mailto:Peter.Sewell@cl.cam.ac.uk" target="_blank">Peter.Sewell@cl.cam.ac.uk</a>><br>
[...]<br>
>> We've also heard suggestions that compilers do things here, but in C<br>
>> the container-of idiom seems pervasive, and WG14 at the last meeting<br>
>> expressed a large majority that it has to be permitted by the<br>
>> standard.<br>
>><br>
><br>
> This is likely one of the areas where C++ wants to have stricter rules than<br>
> C, at least for some categories of types. (We already have limitations on<br>
> the types for which offsetof can be used, and don't require offsets to be<br>
> constant in all cases, and even permit some flavours of structs to not<br>
> store their members within the memory associated with the struct object at<br>
> all.) We probably want to follow C in this area at least for<br>
> standard-layout types, though.<br>
<br>
y<br>
<br>
>> As a consequence of the above, the specification for std::launder says:<br>
>> ><br>
>> > """<br>
>> > Expects: [...] All bytes of storage that would be reachable through the<br>
>> result are reachable through p (see below).<br>
>> > [...]<br>
>> > Remarks: [...] A byte of storage is reachable through a pointer value<br>
>> that points to an object Y if it is within the storage occupied by Y, an<br>
>> object that is pointer-interconvertible with Y, or the<br>
>> immediately-enclosing array object if Y is an array element.<br>
>> > """<br>
>><br>
>> (small thing: you might or might not want that to support movement<br>
>> through nested arrays)<br>
>><br>
>> > Can this be accommodated by the "recreate the provenance on<br>
>> integer-to-pointer cast" model? I think it's not accommodated by the<br>
>> approach you describe above -- the above description would seem to<br>
>> suggest<br>
>> that we need to allow access to the entirety of the largest live object<br>
>> whose address was taken and which encloses the address represented by the<br>
>> integer. So casting an A* to an integer could allow navigation to the<br>
>> enclosing B, if the B object has also had its address cast to an integer.<br>
>><br>
>> Yes. The current proposal allows that. We've not nailed down<br>
>> subobject issues, but one approach that several of us favour would be<br>
>> to enforce subobject boundaries except for void* or character-type*<br>
>> pointer arithmetic - that would let the offsetof container-of idioms<br>
>> still work, while permitting error detection in cases where (eg)<br>
>> someone does int* pointer arithmetic to move between struct members.<br>
>><br>
><br>
> I think something like that would make sense.<br>
><br>
> C++ has a notion of an array of byte-like type providing storage for<br>
> another object (with no formal subobject relationship). I think it would<br>
> make sense to say that all mechanisms by which storage is allocated (of all<br>
> storage durations) implicitly create an array of such a byte-like type<br>
> covering the entire allocation, so every object either is, or is nested<br>
> within, some array of byte-like type representing a storage allocation.<br>
> Then pointer arithmetic on pointers to elements of that enclosing array<br>
> would permit arbitrary navigation, and otherwise navigation would be<br>
> constrained to subobject relationships.<br>
<br>
y (Jens G has a proposal for C along those lines, though as I say<br>
we've not actually done the subobject bits yet)<br>
<br>
>> Personally I'm not sure how far we should go to defend this "irreversible<br>
>> subobject navigation" property, but my implementation isn't one that<br>
>> takes<br>
>> advantage of it.<br>
>> ><br>
>> >> A storage instance is deemed exposed by a cast of a pointer to it to<br>
>> >> an integer type, by a read (at non-pointer type) of the representation<br>
>> >> of the pointer, or by an output of the pointer using %p.<br>
>> ><br>
>> ><br>
>> > Hmm. Does that mean that evaluation of a pointer-to-integer cast has a<br>
>> side-effect, and cannot in general be optimized away even if its result<br>
>> is<br>
>> unused?<br>
>><br>
>> In the source language, yes. An intermediate language might perhaps<br>
>> use a more liberal semantics that doesn't rely on that side effect.<br>
>><br>
>> >Or is there an assumption being built in here that the only way to form<br>
>> an equal integer value to cast back to a pointer will necessarily involve<br>
>> a<br>
>> computation that actually depends on the pointer-to-integer cast, even if<br>
>> we don't explicitly require that?<br>
>><br>
>> That will typically be true - that's what allocation-address<br>
>> nondeterminism buys you.<br>
>><br>
>> > How careful do we need to be in future to avoid breaking that<br>
>> assumption? Consider a case such as:<br>
>> ><br>
>> > int f(int mode, int offset = 0) {<br>
>> > int a, b;<br>
>> > if (mode == 1) { return (intptr_t)&a - (intptr_t)&b; }<br>
>> > intptr_t a_int = (intptr_t)&a;<br>
>> > intptr_t evil_a_int = (intptr_t)&b + offset;<br>
>> > int *evil_a = (int*)evil_a_int;<br>
>> > printf("%" PRIdPTR " %" PRIdPTR " %p %p\n", a_int, evil_a_int, &a,<br>
>> evil_a);<br>
>> > if (getchar() == 'x') std::terminate(); // #1, allow user to abort if<br>
>> integer/pointer values differ<br>
>> > a = 1;<br>
>> > return *evil_a;<br>
>> > }<br>
>> > int main() { return f(2, f(1)); }<br>
>><br>
>> The precise interaction of UB with user input is another can of worms<br>
>> that we've not yet really opened :-) But, assuming that programs<br>
>> can rely on facts about user input (perhaps based on whatever has been<br>
>> output), which seems reasonable, and that here the user is "required"<br>
>> to press x if the values differ, then<br>
>><br>
>> > Is my implementation conforming if it prints out equal integer and<br>
>> pointer values here, and yet (after the program is resumed by the user)<br>
>> main doesn't return 1? Based on your description, I think the answer is<br>
>> no,<br>
>> which means that my weird pointer comparison machine (including a human<br>
>> component) is effectively enough to keep the pointer exposed.<br>
>><br>
>> As Martin said, the mode==1 execution of f isn't very interesting, as<br>
>> the second allocations of a and b could be arbitrarily spaced. But<br>
>> then in the mode!=1 execution, a and b are both exposed, so the cast<br>
>> (int*)evil_a_int will give usable provenance iff the passed-in offset<br>
>> is correct, so if the user is required to check that, this should be<br>
>> well-defined to return 1.<br>
>><br>
>> If line #1 is deleted, then in some executions a_int will be distinct<br>
>> to evil_a_int (by allocation nondeterminism), so indeed it will be UB.<br>
>><br>
>> >However, if line #1 is deleted, I believe an implementation that prints<br>
>> out equal values and then does not return 1 would be correct: the<br>
>> implementation can claim that a_int != evil_a_int, so that the 'return<br>
>> *evil_a;' had undefined behavior, and therefore it was permitted to do<br>
>> whatever it liked, including printing out values that appeared to be the<br>
>> same.<br>
>><br>
>> >I /think/ the upshot is that you *can* optimize away a<br>
>> > pointer-to-integer<br>
>> cast if its value is unused, but you cannot optimize one away in the<br>
>> presence of potential data- or control-dependencies on it (including<br>
>> indirect ones such as escaping the pointers through the IO system and<br>
>> comparing them somewhere else, then reading back the result) that might<br>
>> in<br>
>> any way influence a later integer-to-pointer cast. Does that match your<br>
>> intent?<br>
>><br>
>> yes<br>
>><br>
>> ><br>
>> > Integer-to-pointer casts seem to have surprising evaluation semantics<br>
>> > as<br>
>> a result of this approach. It seems reasonable to me to give cases like<br>
>> this defined behavior:<br>
>> ><br>
>> > char buffer[32];<br>
>> > struct A { int n; };<br>
>> > struct B : A {};<br>
>> > struct C : A {};<br>
>> > A *p = new (buffer) B;<br>
>> > intptr_t x = (intptr_t)p;<br>
>> > A *q = new (buffer) C;<br>
>> > A *r = (A*)x; // ok, r points to the A base of the C object<br>
>> > r->n = 123; // OK, r points to the new A object not the old one<br>
>><br>
>> If we think of an analogous example in which the B and C objects are<br>
>> heap or stack allocated (with non-overlapping lifetimes) that happen<br>
>> to get the same address, with the code checking that, then:<br>
>><br>
>> > ... but it matters when the cast from integer type to pointer type is<br>
>> performed.<br>
>><br>
>> that does indeed make a difference.<br>
>><br>
>> >If we move the cast to pointer type earlier, the resulting example would<br>
>> not be defined under the "points within a live object" / launder model:<br>
>> ><br>
>> > A *p = new (buffer) B;<br>
>> > intptr_t x = (intptr_t)p;<br>
>> > A *r = (A*)x; // ok, r points to the A base of the B object<br>
>> > A *q = new (buffer) C;<br>
>> > r->n = 123; // undefined: r points to the old A object that is not<br>
>> within its lifetime<br>
>> ><br>
>> > Giving casts between pointer and integer types side-effects, and<br>
>> > effects<br>
>> that depend on when they're evaluated, makes me nervous.<br>
>><br>
>> As (I think) Jens said, the pointer-to-integer cast side-effect, of<br>
>> marking the storage instance as exposed, seems like it more-or-less<br>
>> has to be a temporal thing.<br>
>><br>
><br>
> If I'm understanding the model correctly, I think that depends on your<br>
> perspective.<br>
><br>
> If I understand correctly, a pointer-to-integer cast is only temporal to<br>
> the extent that there must be something "forcing" it to happen before an<br>
> integer-to-pointer cast that depends upon it exposing an object. (Ideally,<br>
> I think we'd like to specify that the integer-to-pointer cast has a<br>
> dependency on the pointer-to-integer cast, but that's a major can of worms<br>
> comparable to the consume memory order.)<br>
<br>
(quite. let's not :-)<br>
<br>
> If we model a program execution as<br>
> a set of possible executions (with defined behavior only if all possible<br>
> executions have defined behavior), and model pointers as having<br>
> nondeterministic corresponding integer values, then a pointer-to-integer<br>
> cast (along with the other mechanisms that expose pointers as integers) can<br>
> be viewed as a pure mathematical function (and in particular, it has no<br>
> temporal dependence nor side-effects), but it's an oracle that exposes<br>
> information that is not observable in any other way -- and we can see that<br>
> a program that never uses the oracle cannot possibly correctly "guess" the<br>
> integer corresponding to a pointer[1][2], so any integer-to-pointer<br>
> conversion must necessarily introduce a possible execution with undefined<br>
> behavior.<br>
> [1]: The oracle is the only way to determine the relevant information<br>
> about which execution in the set of possible executions is currently<br>
> occurring.<br>
<br>
All that's a pretty exact description of the PNVI-plain variant.<br>
PNVI-ae-udi keeps the result of the pointer-to-integer cast pure, but<br>
adds the "make this storage instance exposed" side effect. And that<br>
makes a difference only to the integer-to-pointer casts one can do.<br>
<br>
(treating reads of representation bytes and suchlike as similar to p-to-i casts,<br>
and accesses via pointers that have, for example, had some of their<br>
bytes written via char* pointers, as similar to i-to-p casts)<br>
<br>
> [2]: In principle, a program could keep allocating memory and casting<br>
> pointers to integers until it's seen all integer values except one, and<br>
> then correctly guess the address of the remaining object. But an<br>
> implementation can prevent that by refusing to allocate all addressable<br>
> memory.<br>
<br>
y. This is the concern that pushed Juneyoung Lee et al. into their twin<br>
allocation model. But instead, for a source language, we can just<br>
limit attention to programs that never almost-exhaust memory<br>
(ie that leave space for a copy of their biggest (and suitably<br>
aligned) allocation).<br>
<br>
<br>
<br>
><br>
>> The integer-to-pointer cast *could* be left ambiguous until it's<br>
>> resolved - but that allows other strange things, eg it can be<br>
>> (eventually) resolved to an object that didn't even exist earlier.<br>
>> The temporal view seems (I hope :-) quite easy for programmers to<br>
>> understand.<br>
>><br>
><br>
> I think it's surprising either way -- either you find that a cast from<br>
> integer to pointer has an execution side-effect, and it matters where you<br>
> write it, or you find that you can cast an integer into a pointer to an<br>
> object that doesn't yet exist at the point of the cast.<br>
<br>
true<br>
<br>
> I think the former is surprising for practicing programmers, whereas the<br>
> latter is likely only surprising for language lawyers.<br>
<br>
hmm - the latter would be annoyingly complex to specify (like the "udi"<br>
part of PNVI-ae-udi but worse, as one gradually accumulates constraints<br>
on what the result of such a cast might be pointing to, based on<br>
what arithmetic is done to the pointer). It's hard to imagine that becoming<br>
widely understood...<br></blockquote><div><br></div><div>Under <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0593r3.html">p0593r3</a> (recently approved by WG21's Evolution Working Group and on its way towards C++20), a similar "pick the answer that gives the program defined behavior" rule will already be in use in C++ for other cases. That doesn't guarantee the rule will be understood, but we will at least have precedent.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> Moreover, I'd expect<br>
> the latter to be the model that implementations actually use,<br>
<br>
Can you expand on that? Not sure I understand.<br></blockquote><div><br></div><div>Here's an example in C++ where implementations might care about which object a pointer points to:</div><div><br></div><div>struct A { virtual void f() = 0; };</div><div>struct B : A { void f() override; };</div><div>struct C : A { void f() override; };</div><div>alignas(B) alignas(C) std::byte storage[std::max(sizeof(B), sizeof(C))];</div><div>A *p = new (storage) B;</div><div>intptr_t n = (intptr_t)p;</div><div>A *q = (A*)n; // #0</div><div>new (storage) C; // ends the B object's lifetime by reusing its storage</div><div>q->f(); // #1</div><div>q->f(); // #2</div><div><br></div><div>The C++ rules allow us to assume that the two q->f() calls call the same function: the same pointer value is used, so the pointer must point to an object with the same dynamic type. This permits us to remove a redundant vptr load in line #2.</div><div><br></div><div>Under the temporal model, an implementation could go further and prove that there's a B object within its lifetime at the point of the integer-to-pointer cast in line #0, and thereby decide that #1 and #2 both call B::f instead of C::f. I'm suggesting that implementations aren't going to do that, and that instead they'll only make the more conservative assumption that #1 and #2 load the same vptr value without assuming what that value is.</div><div><br></div><div>More broadly, I would expect the implementation model actually used for pointer-to-int conversions and int-to-pointer conversions to be based on (a conservative approximation to) determining which pointer-to-integer conversions a given integer-to-pointer conversion is value- or control-dependent on, rather than giving either conversion side-effects in the intermediate representation.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> so the<br>
> difference between the two will become a theoretical gotcha and we'll end<br>
> up with theoretically non-portable code relying on such casts being<br>
> timeless.<br>
><br>
> Regarding the pointer_from_integer_1pg.c example, I think we could address<br>
> that case in a somewhat different way. Currently in C++ we have a rule that<br>
> says:<br>
><br>
> """<br>
> When the end of the duration of a region of storage is reached, the values<br>
> of all pointers representing the address of any part of that region of<br>
> storage become invalid pointer values (6.7.2).<br>
> """<br>
><br>
> and similarly in C:<br>
><br>
> """<br>
> The value of a pointer becomes indeterminate when the object it points to<br>
> (or just past) reaches the end of its lifetime.<br>
> """<br>
<br>
Yes, although there is currently a move for C to at least partially remove that,<br>
as there are a bunch of longstanding concurrent algorithms that depend on<br>
== comparison with pointers to lifetime-ended objects. Not sure what will<br>
happen there.<br>
<br>
> (I'm ignoring the memory model problems with the use of "become[s]" here.)<br>
> It would seem natural to extend this so it applies at both the point of<br>
> allocation and the point of deallocation. Then it's not the cast to pointer<br>
> that has temporal behavior; rather, it's the allocation of the<br>
> automatic-storage-duration variable that causes there to be no valid<br>
> pointers into that region of storage, and changes the value of 'p' in the<br>
> example to an invalid/indeterminate pointer value. (It still ends up<br>
> mattering whether you perform the cast inside or outside the function, but<br>
> for a different reason.)<br>
<br>
That's exotic - never thought of that possibility. But I would like to have a<br>
semantics that doesn't rely on lifetime end-zap if we can.<br>
<br>
> This might be equivalent to integer-to-pointer casts being temporal in C<br>
> (because objects are by definition the same as the storage regions they<br>
> occupy), but not in C++.<br>
><br>
>>I think it'd be preferable to give them a single-but-unknown provenance,<br>
>> following the "pick whichever single value makes the rest of the program<br>
>> work" model of P0593R3 (notionally pretty similar to C's effective type<br>
>> rule -- you resolve to the first provenance that you use with the<br>
>> pointer).<br>
<br>
If an access is first, that's not too complex, but if pointer arithmetic happens<br>
first, it gets messy to record the resulting constraints. And we get a lot more<br>
instantaneous action-at-a-distance as these get resolved.<br>
<br>
>> That'd mean casts to pointer type are timeless and freely reorderable,<br>
>> and<br>
>> both the above examples are defined, not only the first one. However,<br>
>> N2363<br>
>> has a scary example involving guessing the address of a function<br>
>> parameter<br>
>> that is apparently defanged by the time-dependence of integer-to-pointer<br>
>> casts. To what extent is that essential? Is there a different way that<br>
>> guessing a pointer value could be disallowed?<br>
>><br>
>> Simple allocation-address nondeterminism disallows pointer value<br>
>> guessing.<br>
>><br>
>> thanks,<br>
>> Peter<br>
>><br>
>><br>
>> >> The user-disambiguation refinement adds some complexity but supports<br>
>> >> roundtrip casts, from pointer to integer and back, of pointers that<br>
>> >> are one-past a storage instance.<br>
>> >><br>
>> >><br>
>> >><br>
>><br>
><br>
<br>
best,<br>
Peter<br>
</blockquote></div></div>