Pointers and pointer provenance

Surprisingly, C++ pointers are not just addresses; here we explain what they are in the C++ standard and in BRiCk.

Two pointers might differ despite representing the same address, depending on how they’re constructed, and C/C++ optimizers are allowed to treat them differently.

For instance, in the following snippet px1 and py are always different pointers, even when they often have the same address. In particular, since px1 is created from a pointer to x, it cannot be used to read or write to y; this is similar to the restriction we show in Undefined behavior and optimizations.

int foo() {
  int x, y;
  int *px = &x;
  int *py = &y;
  int *px1 = px + 1;
  //...
}

More generally, a pointer identifies the “object” to which it points, and pointer arithmetic is only allowed to produce pointers to unrelated objects in limited circumstances.

Objects in the C++ standard

Unlike in traditional object-oriented programming jargon, in the C++ and C standards the word “object” means an instance of some type (not necessarily a class type). Objects include variables and what is created by new expressions. These and other cases are introduced by [intro.object].

Objects form a hierarchy of complete objects and subobjects (intro.object#2):

Objects can contain other objects, called subobjects. A subobject can be a member subobject ([class.mem]), a base class subobject ([class.derived]), or an array element. An object that is not a subobject of any other object is called a complete object.

Moreover, to support custom memory allocators, C++ allows allocating a complete object inside a character array (intro.object#3); the underlying array object is said to provide storage for the new object:

If a complete object is created ([expr.new]) in storage associated with another object e of type “array of N unsigned char” or of type “array of N std​::​byte” ([cstddef.syn]), that array provides storage for the created object if: […]

Objects and values

The C++ standard also talks about values, at least for primitive objects; during its lifetime, an object of primitive type stores a primitive value1. The BRiCk semantics contains a type val of primitive values.

Pointers, pointer values and objects

BRiCk introduces an abstract type ptr of pointer values, which models the analogous concept in the standard2. BRiCk uses pointer values to identify objects: each object has an associated pointer value, but a pointer value might not point to an object. Specifically:

  • Whenever an object o is created, the BRiCk logic creates a pointer value p : ptr that describes where o lives. This pointer value need not exist in the C++ program, but it always exists in the logic.

  • All C++ pointers have an associated pointer value in ptr, but the latter might not point to an actual object.

For instance, in the following program, at line 3, we can say that the pointer value x_ptr points to an integer object with value 42. Note that the name of the variable used for the pointer is not important.

1 void foo() {
2   int x = 42;
3   // `x_ptr |-> intR 42 1`
4   return x;
5 }

BRiCk pointers and optional addresses

Pointers always contain an address, but BRiCk pointer values need not. In our last example, x might live in a register or be removed altogether by optimizations. But since BRiCk pointer values need not have an address, we can reason about x_ptr uniformly, irrespective of optimizations.

We use the function ptr_vaddr to compute the virtual address of a pointer.

Parameter ptr_vaddr : ptr -> option vaddr.

Pointer provenance in BRiCk

Each (valid) pointer value must contain an allocation ID. This ID identifies the complete object that the pointer refers to; similar concepts are common in modern formalizations of pointers, from CompCert onwards.

Notably, a single call to malloc might allocate storage for multiple objects: each such object will have a distinct allocation ID [#invalid-ptr-no-alloc-id].

Parameter ptr_alloc_id : ptr -> option alloc_id.

Importantly, the ID of a complete object differs from the ID of any character array that provides storage to the object.

Pointers and subobjects in BRiCk

A pointer identifies a “path” inside the complete object, where each step goes to a subobject; this is less common, but follows both Krebbers (2015) for C and Ramananandro for C++. For instance:

  • if pointer p points to a struct instance, then pointer p ., _field field points to the field identified by field.

  • if pointer p points to an array of 10 integers (hence, also to its first element), then pointer p ., _sub Tint 1 points to the second element.

Above, p ,. o represents the pointer resulting from applying the pointer offset o to the pointer p, and is a notation for _offset_ptr p o. To simplify reasoning about pointers, we provide an equational theory of pointer equality, which helps us show that C++ snippets such as p + 2 and p + 1 + 1 produce the same pointer.

Pointer offsets form a monoid under concatenation, and _offset_ptr represent their monoid action over pointers. That is, we can compose offsets (via o_dot, also written .,), this composition has an identity (o_id) and is associative, and compositions with pointers is well-behaved. Moreover, specific axioms allow us to collapse adjacent offsets, such as consecutive o_sub offsets.

Here are a few of the algebraic equations that apply to pointers and offsets.

offset_ptr_id : p ., o_id = p
offset_ptr_dot : p ., o1 ., o2 = p ., (o1 ., o2)

id_dot : o ., o_id = o
dot_id : o_id ., o = o
dot_assoc : o1 ., (o2 ., o3) = (o1 ., o2) ., o3

o_sub_0 : _sub ty 0 = o_id (* Under side conditions on [ty] *)
o_dot_sub : _sub T n1 ., _sub T n2 = _sub T (n1 + n2)

This is formalized in Coq in bedrock.lang.cpp.semantics.ptrs.

Integer-pointer casts

Beyond what is provided by the C++ standard, we assume useful semantics for integer-to-pointer casts, in particular, the PNVI-ae-udi model by the Cerberus project (as in the N2577 paper from the C standard committee). As in Cerberus:

  • Casting a pointer to an integer marks the allocation ID of the pointer as exposed.

  • Casting an integer to a pointer can produce any pointer with the same address so long as that pointer’s allocation ID has already been exposed.

However, some twists are required to account for the more complex memory model from the C++ semantics. Unlike in Cerberus, more than two allocation IDs can cover the same address. In C complete objects are generally disjoint, except that a past-the-end-pointer can overlap with a pointer to another object. However, in C++ a complete object with pointer p (with provenance aid1 : alloc_id) can be nested within a character array that provides storage to it (with provenance aid2), which can be nested inside another character array providing storage to it (with provenance aid3), and so on. We assume that each of those provenances can be exposed indipendently; casting the integer address of p to a pointer follows the same rules as above, so it can produce a pointer with any exposed allocation IDs.

In all cases, we assume the C++ abstract machine follows an extension of the PNVI-ae-udi model; in particular, the provenance remains ambiguous until such a point that all provenances except for one can be shown to produce undefined behavior.

Assumptions beyond the standard

As our goal is verifying low-level systems software, we make assumptions on our compilers, here and elsewhere:

Further readings

For a crash course on formal models of pointers, consider also this blog post by Ralf Jung.

Footnotes

1

This appears to follow from intro.object#1 <https://eel.is/c++draft/intro.object#1>_, `basic.life#4 and basic.types.general#def:value. In particular, basic.life#4 licenses compilers to discard object contents outside their lifetime even in surprising scenarios; e.g. placement new over initialized memory is allowed to discard the initialization, even when the constructor is a no-op.

2

“Values of pointer type” are discussed in basic.compound#3.

3

The reason that this function is partial is because invalid pointers do not contain allocation IDs.