Pointers and pointer provenance
Surprisingly, C++ pointers are not just addresses; here we explain what they are in the C++ standard and in BRiCk.
Two pointers might differ despite representing the same address, depending on how they’re constructed, and C/C++ optimizers are allowed to treat them differently.
For instance, in the following snippet px1
and py
are always different
pointers, even when they often have the same address. In particular, since px1
is
created from a pointer to x
, it cannot be used to read or write to y
; this
is similar to the restriction we show in Undefined behavior and optimizations.
int foo() {
int x, y;
int *px = &x;
int *py = &y;
int *px1 = px + 1;
//...
}
More generally, a pointer identifies the “object” to which it points, and pointer arithmetic is only allowed to produce pointers to unrelated objects in limited circumstances.
Objects in the C++ standard
Unlike in traditional object-oriented programming jargon,
in the C++ and C standards the word “object” means an instance of some
type (not necessarily a class type). Objects include variables and what
is created by new
expressions. These and other cases are introduced by
[intro.object].
Objects form a hierarchy of complete objects and subobjects (intro.object#2):
Objects can contain other objects, called subobjects. A subobject can be a member subobject ([class.mem]), a base class subobject ([class.derived]), or an array element. An object that is not a subobject of any other object is called a complete object.
Moreover, to support custom memory allocators, C++ allows allocating a complete object inside a character array (intro.object#3); the underlying array object is said to provide storage for the new object:
If a complete object is created ([expr.new]) in storage associated with another object e of type “array of N unsigned char” or of type “array of N std::byte” ([cstddef.syn]), that array provides storage for the created object if: […]
Pointers, pointer values and objects
BRiCk introduces an abstract type ptr
of pointer values, which models
the analogous concept in the standard2. BRiCk uses pointer
values to identify objects: each object has an associated pointer value, but a
pointer value might not point to an object. Specifically:
Whenever an object
o
is created, the BRiCk logic creates a pointer valuep : ptr
that describes whereo
lives. This pointer value need not exist in the C++ program, but it always exists in the logic.All C++ pointers have an associated pointer value in
ptr
, but the latter might not point to an actual object.
For instance, in the following program, at line 3, we can say that the pointer
value x_ptr
points to an integer object with value 42
. Note that the name
of the variable used for the pointer is not important.
1 void foo() {
2 int x = 42;
3 // `x_ptr |-> intR 42 1`
4 return x;
5 }
BRiCk pointers and optional addresses
Pointers always contain an address, but BRiCk pointer values need not. In our
last example, x
might live in a register or be removed altogether by
optimizations. But since BRiCk pointer values need not have an address, we can
reason about x_ptr
uniformly, irrespective of optimizations.
We use the function ptr_vaddr to compute the virtual address of a pointer.
Parameter ptr_vaddr : ptr -> option vaddr.
Pointer provenance in BRiCk
Each (valid) pointer value must contain an allocation ID. This ID identifies the complete object that the pointer refers to; similar concepts are common in modern formalizations of pointers, from CompCert onwards.
Notably, a single call to malloc
might allocate storage for multiple objects:
each such object will have a distinct allocation ID [#invalid-ptr-no-alloc-id].
Parameter ptr_alloc_id : ptr -> option alloc_id.
Importantly, the ID of a complete object differs from the ID of any character array that provides storage to the object.
Pointers and subobjects in BRiCk
A pointer identifies a “path” inside the complete object, where each step goes to a subobject; this is less common, but follows both Krebbers (2015) for C and Ramananandro for C++. For instance:
if pointer
p
points to astruct
instance, then pointerp ., _field field
points to the field identified byfield
.if pointer
p
points to an array of 10 integers (hence, also to its first element), then pointerp ., _sub Tint 1
points to the second element.
Above, p ,. o
represents the pointer resulting from applying the pointer offset o
to the pointer p
, and is a notation for _offset_ptr p o
.
To simplify reasoning about pointers, we provide an equational theory of pointer
equality, which helps us show that C++ snippets such as p + 2
and p + 1 + 1
produce the same pointer.
Pointer offsets form a monoid under concatenation, and _offset_ptr represent
their monoid action over pointers. That is, we can compose offsets (via
o_dot, also written .,
), this composition has an identity (o_id) and is
associative, and compositions with pointers is well-behaved. Moreover, specific
axioms allow us to collapse adjacent offsets, such as consecutive o_sub offsets.
Here are a few of the algebraic equations that apply to pointers and offsets.
offset_ptr_id : p ., o_id = p
offset_ptr_dot : p ., o1 ., o2 = p ., (o1 ., o2)
id_dot : o ., o_id = o
dot_id : o_id ., o = o
dot_assoc : o1 ., (o2 ., o3) = (o1 ., o2) ., o3
o_sub_0 : _sub ty 0 = o_id (* Under side conditions on [ty] *)
o_dot_sub : _sub T n1 ., _sub T n2 = _sub T (n1 + n2)
This is formalized in Coq in bedrock.lang.cpp.semantics.ptrs.
Integer-pointer casts
Beyond what is provided by the C++ standard, we assume useful semantics for integer-to-pointer casts, in particular, the PNVI-ae-udi model by the Cerberus project (as in the N2577 paper from the C standard committee). As in Cerberus:
Casting a pointer to an integer marks the allocation ID of the pointer as exposed.
Casting an integer to a pointer can produce any pointer with the same address so long as that pointer’s allocation ID has already been exposed.
However, some twists are required to account for the more complex memory model
from the C++ semantics. Unlike in Cerberus, more than two allocation IDs can cover
the same address. In C complete objects are generally disjoint, except that a past-the-end-pointer
can overlap with a pointer to another object. However, in C++ a complete object
with pointer p
(with provenance aid1 : alloc_id
) can be nested within a
character array that provides storage to it (with provenance aid2
), which can
be nested inside another character array providing storage to it (with
provenance aid3
), and so on. We assume that each of those provenances can be exposed
indipendently; casting the integer address of p
to a pointer follows the same
rules as above, so it can produce a pointer with any exposed allocation IDs.
In all cases, we assume the C++ abstract machine follows an extension of the PNVI-ae-udi model; in particular, the provenance remains ambiguous until such a point that all provenances except for one can be shown to produce undefined behavior.
Assumptions beyond the standard
As our goal is verifying low-level systems software, we make assumptions on our compilers, here and elsewhere:
We assume compilers do not zap pointers to deallocated objects, but might restrict operations on them (in particular equality comparisons). See Pointer lifetime-end zap (N2369), C memory object and value semantics: the space of de facto and ISO standards.
Support for effective types is also incomplete; similarly to Cerberus, we still assume users use options such as GCC/Clang’s
-fno-strict-aliasing
.
Further readings
For a crash course on formal models of pointers, consider also this blog post by Ralf Jung.
Footnotes
- 1
This appears to follow from intro.object#1 <https://eel.is/c++draft/intro.object#1>_, `basic.life#4 and basic.types.general#def:value. In particular, basic.life#4 licenses compilers to discard object contents outside their lifetime even in surprising scenarios; e.g. placement new over initialized memory is allowed to discard the initialization, even when the constructor is a no-op.
- 2
“Values of pointer type” are discussed in basic.compound#3.
- 3
The reason that this function is partial is because invalid pointers do not contain allocation IDs.