Document number | P1434R0 |
Date | 2019-01-21 |
Project | Programming Language C++, SG12 (Undefined and Unspecified Behavior) |
Reply-to | Hal Finkel <hfinkel@anl.gov> |
Authors | Hal Finkel <hfinkel@anl.gov>, Jens Gustedt <jens.gustedt@inria.fr>, Martin Uecker <Martin.Uecker@med.uni-goettingen.de> |
There is ongoing work on a proposal for WG14 based on this POPL 2019 paper: Exploring C Semantics and Pointer Provenance. The authors of that paper, along with significant work by Jens Gustedt, are working on proposed wording changes to the C specification. Of the options discussed in that paper, the model variant currently receiving this attention is the provenance-not-via-integer, tainting all, user-disambiguation model (PNVI-taint-all-udis).
See also the storage-instance paper by Jens (WG14 N2328) and the closely related formal model by Kang et al.
What follows is a summary of this model by Jens and Martin. This represents work still under development and active revision; early feedback from WG21 is requested.
A "storage instance" is the "byte array" that is created when either an object starts its lifetime (for static, automatic, and thread storage duration) or an allocation function is called (malloc, calloc, etc.). A storage instance is more than just an address: it has a unique ID throughout the whole execution. Once its lifetime ends, another storage instance may receive the same address, but never the same ID.
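As a minimal sketch of the distinction between address and ID, the following C fragment records the address of one allocation as an integer, ends that storage instance, and then creates another one. The function name address_was_reused is hypothetical and only serves the illustration; whether the address is actually reused depends on the allocator.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if a second allocation happened to reuse
   the address of an earlier, already freed allocation, 0 if it did not,
   and -1 if an allocation failed. */
int address_was_reused(void)
{
    int *first = malloc(sizeof *first);              /* storage instance #1 begins */
    if (first == NULL)
        return -1;
    uintptr_t first_addr = (uintptr_t)(void *)first; /* keep the address as an integer */
    free(first);                                     /* storage instance #1 ends */

    int *second = malloc(sizeof *second);            /* storage instance #2, with a new ID */
    if (second == NULL)
        return -1;
    uintptr_t second_addr = (uintptr_t)(void *)second;
    free(second);

    /* The two integers may compare equal, yet the instances are distinct:
       'first' must never be used to reach the second allocation. */
    return first_addr == second_addr;
}
```

Either result is consistent with the model: equal addresses do not imply equal storage instances.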
The provenance of a valid pointer is the "storage instance" to which the pointer refers (or one past). This is part of the "abstract state" in C's abstract machine, not necessarily part of the object representation of the pointer itself.
Valid pointers keep the provenance of the encapsulating storage instance of the object they refer to. When the storage instance dies (it falls out of scope, its thread ends, or it is passed to free), the pointer becomes indeterminate.
Ordered comparisons (<, >, >=, <=) between pointers are only defined when the two pointers have the same provenance. They then can be defined by the relative byte position in the byte array of the common storage instance.
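A short sketch of a comparison that this rule covers: both pointers below refer into (or one past) the same array, so they share one provenance and the result of < follows from their byte positions. The names count_between and demo_ordered are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: counts the elements in [first, last). Both pointers
   must point into (or one past) the same array, so they share provenance. */
size_t count_between(const int *first, const int *last)
{
    size_t n = 0;
    while (first < last) {   /* defined: same storage instance */
        ++first;
        ++n;
    }
    return n;
}

size_t demo_ordered(void)
{
    int a[4] = {1, 2, 3, 4};
    return count_between(a, a + 4);   /* a + 4 is the one-past pointer of a */
}
```

Calling count_between with pointers into two unrelated objects would compare pointers of different provenance, which the model leaves undefined.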
Equality of pointers is handled by a case analysis:
Pointer arithmetic (addition or subtraction of integers) preserves provenance. The pointer becomes indeterminate if the result lies outside the storage instance, or beyond the array to which the pointer refers (where the "one past" address still counts as within bounds).
Pointer difference is only defined for pointers with the same provenance and within the same array.
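The two rules on arithmetic and difference can be sketched as follows. The function name demo_arithmetic is hypothetical, and the out-of-bounds case is left in a comment because executing it would be undefined.

```c
#include <stddef.h>

ptrdiff_t demo_arithmetic(void)
{
    int a[8] = {0};
    int *p = a + 2;      /* keeps the provenance of a's storage instance */
    int *end = a + 8;    /* one past the end: valid, but must not be dereferenced */
    /* int *bad = a + 9;    beyond the array: the result would be indeterminate */
    *p = 1;
    return end - p;      /* same provenance, same array: defined */
}
```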
Pointer values can be copied by the usual means, that is: assignment, memcpy, and byte-wise copy. These copy over provenance in addition to the representation and the effective type. (There is certainly more work to do here to say exactly what that means. For the moment, let's go with "any copy operation that would propagate the effective type".)
No other manipulation of the representation of a pointer will lead to a valid pointer value, because neither the effective type nor the provenance can be reconstructed from such manipulations. Thus the value of such pointers is indeterminate.
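A sketch of the two non-assignment copies named above, assuming the byte-wise copy is done through unsigned char as is usual in C. The function name demo_copy is hypothetical. Both copies carry the provenance of x's storage instance, so both may be dereferenced.

```c
#include <string.h>

int demo_copy(void)
{
    int x = 42;
    int *src = &x;

    int *via_memcpy;
    memcpy(&via_memcpy, &src, sizeof src);    /* copies representation and provenance */

    int *via_bytes;
    const unsigned char *from = (const unsigned char *)&src;
    unsigned char *to = (unsigned char *)&via_bytes;
    for (size_t i = 0; i < sizeof src; ++i)   /* byte-wise copy of the whole pointer */
        to[i] = from[i];

    return *via_memcpy + *via_bytes;          /* both refer to x */
}
```

By contrast, a pointer whose bytes were altered along the way (for example by XOR-ing them with a mask and never restoring them) would fall under the preceding paragraph and be indeterminate.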
A storage instance is "tainted" once any valid pointer with this provenance is converted to an integer (cast) or written to IO (printf with "%p"). For the sake of the "happens before" relation, "tainting" constitutes a side effect, even though the taint is not observable.
This "tainting" also happens for the end address of a storage instance. A pointer-to-integer cast has to result in the same integer value regardless of whether the pointer has its provenance as the end address of one storage instance A or as the start address of another storage instance B, where B happens to immediately follow A in the address space.
The idea behind "tainting" is that once a pointer has escaped to an integer or to IO, all aliasing analysis is jeopardized. On the other hand, pointers to a storage instance that a compiler can prove to be untainted (e.g., because it is a stack variable and no address has been taken) can never alias unexpectedly.
An integer-to-pointer conversion (cast) or IO (scanf with "%p") is only defined if the corresponding storage instance has been tainted, and if the result is a pointer to a byte (or one past the end) of the storage instance.
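A minimal sketch of a round trip through an integer, assuming the implementation provides uintptr_t. The function name demo_round_trip is hypothetical. The first cast taints the storage instance of x, which is what makes the second cast defined under this model.

```c
#include <stdint.h>

int demo_round_trip(void)
{
    int x = 1;
    uintptr_t bits = (uintptr_t)(void *)&x;   /* pointer-to-integer: taints x's storage instance */
    int *p = (int *)(void *)bits;             /* integer-to-pointer: defined, the instance is tainted */
    *p = 2;                                   /* p points to a byte of x's storage instance */
    return x;
}
```

Had the address of x never been converted to an integer or written to IO, no integer-to-pointer conversion could yield a valid pointer to x.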
Ambiguous Provenance:
With the above, there is one special case where a back-converted pointer (let's just assume integer-to-pointer) could have two different provenances. This can happen when:
a having provenance A, and b having provenance B.
In such a situation, both A and B could be valid choices for the provenance.
Our trick is to leave the choice between A and B to the programmer. It is their responsibility to be consistent, and to disambiguate such situations when necessary:
If p is the result of an integer-to-pointer cast with two possible provenances and p is used with both provenances, the behavior is undefined.
Note: If the result p of an integer-to-pointer conversion is the end address of a tainted storage instance A and the start address of another tainted storage instance B that happens to follow immediately in the address space, a conforming program must only use one of these provenances in any expression that is derived from p.
The following three cases determine if p is used with one of A or B and must hence not be used otherwise:
Operations that use p with either A or B and do not prohibit a use with the other:
  an ordered comparison or pointer difference in which the other operand q may have both provenances, that is, where q is also the result of a similar conversion and where p == q;
  q == p and q != p, regardless of the provenance of q.
Operations that use p with A and prohibit any use with B:
  an ordered comparison or pointer difference in which the other operand q has provenance A and cannot have provenance B;
  p + n and p[n], where n is an integer strictly less than 0;
  p - n, where n is an integer strictly greater than 0.
Operations that use p with B and prohibit any use with A:
  an ordered comparison or pointer difference in which the other operand q has provenance B and cannot have provenance A;
  p + n and p[n], where n is an integer strictly greater than 0;
  p - n, where n is an integer strictly less than 0;
  indirection (*p, or p[n] for n == 0) and member access (p->member).
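The disambiguation rules can be sketched with two objects that may or may not be adjacent in memory. Everything here is an illustration under stated assumptions: the names a, b and demo_disambiguate are hypothetical, adjacency is not guaranteed and is therefore tested at run time, and the volatile qualifier is only a practical precaution for current compilers that do not implement this model; it is not part of the proposal.

```c
#include <stdint.h>

static int a[2];   /* storage instance A */
static int b[2];   /* storage instance B */

/* Returns b[0] after storing 7 into it. */
int demo_disambiguate(void)
{
    uintptr_t end_of_a = (uintptr_t)(void *)(a + 2);   /* taints A */
    uintptr_t start_of_b = (uintptr_t)(void *)b;       /* taints B */

    if (end_of_a == start_of_b) {
        /* Assumption: volatile hides from the compiler where i came from. */
        volatile uintptr_t i = end_of_a;
        int *p = (int *)(void *)i;   /* ambiguous: end of A or start of B */
        *p = 7;                      /* access through p: uses p with B */
        p[1] = 8;                    /* p[n] with n > 0: uses p with B, consistent */
        /* p[-1] = 9;                   n < 0 would use p with A: undefined here */
    } else {
        b[0] = 7;                    /* objects not adjacent: nothing is ambiguous */
        b[1] = 8;
    }
    return b[0];
}
```

In the adjacent case the program commits p to provenance B with its first access and stays with that choice; mixing in the commented-out p[-1] would use the same p with both provenances.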