Capability Revocation

Fine Details of Memory's Path Through free()

When an application calls free(), the contents of the region are best viewed as toxic junk of which we need to dispose before we can cycle it back to the application. There are two basic aspects to this:

before we can slate memory for reuse, there must be no capabilities to it outside the allocator.
when we give memory back to the application, there must be no tags within it.

Our promise is intended to be quite weak:

No live capability aliasing; if the application still has a tagged capability to memory, that memory has not yet been reused even if it has been free()'d. Contrapositively, memory that has been (slated for) reuse will have had all stale authorizing caps' tags cleared.

The basic flow is this:

On free(), memory is put into a quarantine.
At some point thereafter, the status of some/all quarantined allocations is translated into a bitmap for the kernel's consumption. This bitmap is currently chosen to have one bit per 16 bytes of memory, thereby defining the granularity of revoke-ability.

See Staging Revocation of an Allocation for the mechanism.

For correctness, entire allocations have to go in to the bitmap (see What is Revocation, Exactly?).

This step may be deferred arbitrarily, and there are perhaps good reasons to do so that we'll talk about in a minute (see Coalescing in Quarantine and Segmented Quarantines).
At yet some later point, the allocator requests that the kernel revoke access according to the bitmap generated earlier. This step itself may happen in passes that we'll talk about in another minute (see Multi-pass Revocation).
Any allocations that were translated to bitmaps prior to the previous step are now guaranteed to have no inward authorizing capabilities other than those held by the allocator or internal to the kernel (but not caps passed by the application to the kernel!). These regions can be made available for reuse.
Prior to handing out a reused region to the application, we must ensure that it has no tags. We should perhaps be kind and in fact zero the whole thing.

Despite the seeming simplicity, there're a number of options for an implementation!

When revoking, it would be good to have as few capabilities, and especially pages with capabilities, within dead regions of memory as possible, so it may make sense to zero upon free(), even though the application has access and could unzero it (by mistake or maliciously). Otherwise, the revoker will be visiting pages and cachelines that are certain to be zeroed anyway by the time addresses are reissued to the application.
Despite that we only promise to catch use-after-reallocation, we might want to catch UAF, too. While we could munmap(), we're hoping to have a dedicated MPROT_QUARANTINE implementation that doesn't need to be mated with a mmap() or mprotect() later. This would additionally let us avoid zeroing entire pages on free() and also post-revocation.
Revocation unto itself is expensive: it's at least one visit to every capability-bearing page in the system, and possibly many, depending on the concurrency choices made. Thus, we should attempt to do it as little as possible (and so, quarantine) as well as maximize the benefit of revocations done (see Segmented Quarantines).
But even requesting revocation can be expensive. It may make sense to defer generating the bitmaps rather than setting them on every free(); this is another exciting use of MPROT_QUARANTINE.

What Is Revocation, Exactly?

The revoker is a very simple thing, in rough sketch:

Given: the contents of memory &c and a bitmap of quarantine status

For each capability anywhere the application can access,
  If that capability's *base* is marked as quarantined,
    Replace it with its de-tagged, 0-permission form.

Specifically, "anywhere the application can access" means:

User-visible pages of memory
Processor register files
Kernel callback sources, specifically aio and kqueue

The base of the capability is chosen for two reasons:

It cannot be out of bounds, unlike the cursor.
If the allocator properly sets CHERI bounds, the base must, in fact, be within the original allocation. This depends on the fact that two adjacent capabilities cannot be fused.

We leave base, offset, and length of the capability alone, so as to be compatible with some (probably actually undefined behavior) C idioms.

And thus we see the need to set all bits corresponding to the "shadow" of each quarantined allocation: the application might have derived a capability to a sub-object such that the resulting base is not the base of the allocation. Dually, the top may not be the top of the allocation. Despite both possibilities, the cursor may be pointing not only out of bounds but into another allocation.

At present, we believe that managing the need to clear tags prior to reuse is best left to userland. While the kernel will absolutely test any capability found in dead memory (as well as all the live ones), forcing it to ask about the quarantined status of the memory containing every capability as well as the memory to which each capability points sounds more expensive than asking the allocators to manage this aspect.

Staging Revocation of an Allocation

In General

There are a few pieces to the "translated into a bitmap" step above. First, we're going to need access to the bitmap for a given piece of the virtual address space. To ensure that only the allocator can make (or clear) staging, this access is only given if software can demonstrate that it has a capability bearing VMMAP permission to the region in question, as returned by, for example, mmap().

Note

We expect that the allocator clears VMMAP permission on all objects it gives out; this was already true of the in-tree jemalloc prior to any design for revocation, and we have applied it to the allocators under study here.

So armed with such a capability, called "arena", a system call

void * __capability arena_shadow;

int res = caprevoke_shadow(CAPREVOKE_SHADOW_NOVMMAP, arena, &arena_shadow);

will return a suitable capability to the bitmap. Internally to the revoker, there are in fact other bitmaps (one, for sealing types, accessible via CAPREVOKE_SHADOW_OTYPE, and the other, internal, at page granularity, deliberately not exposed to userland, for the use of MPROT_QUARANTINE). Rather than go through all the tedium of finding the correct bit(s) for an allocation, this logic has been library-fied in libcheri_caprevoke. Before detailing the operation thereof, we must make one final note.

CHERIfied allocators (almost?) universally must be able to re-derive a VMMAP-bearing capability to an allocation given only their globals and the non-VMMAP-bearing capability held by the application. For somewhat tricky reasons (see Concurrent Revocation Safety of Bitmap Marking), libcheri_caprevoke's method to set bits takes both the application capability and this re-derived capability, upon which the spatial bounds must be appropriately set for the allocation in question, as the library sets bits from the base to top of this capability. So armed, the call

int res = caprev_shadow_nomap_set(arena_shadow, rederived_cap, app_cap);

will set the correct bits. Internally, it strives to write 64-bit words (see Are Wide Bitmap Stores Always Safe?) and uses up to two LL/SC loops to ensure that partial updates of words shared with other allocations are not clobbered. The caller must always check the result; a non-zero value indicates an attempted double-free() and should cause free() to return in a mostly-idempotent way (statistical counters are fine to adjust; changes to the heap are almost surely not fine). (See Concurrent Revocation Safety of Bitmap Marking for more details.)

After revocation, the same library can be used to clear the bits prior to reissue:

caprev_shadow_nomap_clear(arena_shadow, rederived_cap);

Assuming We Have MPROT_QUARANTINE

A feature still in the works is to have the kernel manage staging entire pages internally. The kernel knows about revocation and so should be able to do the right thing here. Once this is implemented, this section will be updated.

A 4096-byte page of 1-bit-per-16-byte flags covers 512 KiB of the address space. We believe that this, or a near power-of-two multiple, is a good threshold for distinguishing between "large" and "small" allocations when trading off between caprev_shadow_nomap_set and MPROT_QUARANTINE. The exact cross-over point has yet to be measured.

A little about the mechanism is probably in order. We intend that the allocator can, rather than calling caprev_shadow_nomap_set (or the _raw variant) for large regions, instead call mprotect(MPROT_QUARANTINE) on the region in question. This will remove all pages within the indicated region from the process's address space and (atomically) convert it to the equivalent of a MAP_GUARD region: impossible to back with pages and not reusable as address space for later (unhinted) mmap() calls. The kernel, internally, has a coarse-grained revocation bitmap, at one bit per page, that the revoker also tests during revocation scans; these pages will have their shadow set.

After revocation (so that the coarse bitmap is accurate), but before returning to userland, the kernel can quickly iterate all such regions and clear their bitmaps and revert the address space regions to anonymous memory.

Coalescing in Quarantine

Staging revocation requests early is advantageous: the next revocation allows them to be putback into circulation. However, small allocations incur overheads during staging: only a few bits get changed and the full cost of atomics is borne. Large allocations amortize atomics across potentially multiple full-word writes and, if sufficiently large, may (eventually) be better off being handled by MADV_QUARANTINE.

It is therefore a policy and allocator mechanism question: should free() regions be coalesced in quarantine prior to staging? If so, under what policy? Tentatively, we suggest that a dual-queue design may be the right one:

on free, small allocations that do not coalesce with existing unstaged allocations are placed at the end of the first LRU queue; large allocations that do not coalesce are placed at the end of the second LRU queue.
whenever two or three allocations coalesce in the first queue, if the result is still small, they are replaced with their merge at the end of the first queue. If the result is large, they are removed and their merge is added to the end of the second queue.
whenever an allocation coalesces with one or two objects in the second queue, they are removed and the merger is put on the end of the second queue.

When the quarantine revocation policy demands that memory be revoked for reuse, the second queue (of large objects) is drained in order (i.e., least-recently-coalesced large allocations are staged first) until the policy believes that quarantine will be below threshold. If that fails to satiate the policy, the first queue is drained in order.

We assume that whatever interlocking protects the quarantine coalescing metadata structures can be extended to cover staging to the bitmap. That is, only one thread will attempt to stage a coalesced collection of allocations to the bitmap. Allocators should avail themselves of the "raw" interface in libcheri_caprevoke. To eliminate concerns of representability of coalesced regions as capabilities, this raw interface acts directly on raw vaddr_t cursors into the heap; heap_start points at the first byte of the region and heap_end the last:

caprev_shadow_nomap_set_raw(arena_shadow, heap_start, heap_end);
caprev_shadow_nomap_clear_raw(arena_shadow, heap_start, heap_end);

caprev_shadow_nomap_set_raw returns void as it cannot fail (except to crash if improperly invoked).

These raw bitmap-accessing functions do not use atomics internally, as we presume that allocators obtain their backing stores in units of whole pages, and a 4K page has a 32-byte shadow in the bitmap, which is more than sufficient to ensure that each 64-bit word in the bitmap is associated with only one allocator.

Single-pass Revocation

Looking Up The Time

Once, at allocator startup, a revocation-aware allocator should request a capability to some kernel-maintained counters:

volatile const struct caprevoke_info *cri;

int res = caprevoke_shadow(CAPREVOKE_SHADOW_INFO_STRUCT, NULL, (void **)&cri);
assert(res == 0);

Performing Revocation

Having filled out the bitmap "form" of requests for revocation, revocation itself may be performed by the following sequence:

/* Ensure all cores see the ones in bitmap */
atomic_thread_fence(memory_order_acq_rel);
/* And ensure that this read definitely happens after the release */
caprevoke_epoch start_epoch = cri->epoch_enqueue;

while(!caprevoke_epoch_clears(cri->epoch_dequeue, start_epoch)) {
  struct caprevoke_stats crst;
  caprevoke(CAPREVOKE_LAST_PASS, start_epoch, &crst);
}

/* Zero the bitmap requests for our quarantined allocations */

/* Ensure all cores see the zeros; it's fine for our loads to move earlier */
atomic_thread_fence(memory_order_release);

caprevoke() is a synchronous call: it will not return until revocation is finished. Upon return, crst contains useful statistics about the work done (if the kernel is compiled with CHERI_CAPREVOKE_STATS). The names of the fields within struct caprevoke_info may be mysterious, but will hopefully become clear in the next section.

If it is possible that multiple threads are engaged in revocation, we need to defend against the possibility that one was doing so while we were busy filling out the bitmap, which is why we needed the memory fence: we had to ensure that our stores were visible to all CPU cores.

In almost all cases, the first caprevoke is sufficient and the while loop does not run its body more than once. However, if a revocation is now in progress or, as we will say later, "an epoch is open", the loop will walk the full caprevoke state machine to the desired point, where we can be sure that everything we have just written to the bitmap form has been revoked.

Segmented Quarantines

Suppose we have several allocators in a program; snmalloc is a good example as it has a dedicated allocator per-thread, but one could just have several truly different allocators encapsulating everything from mmap() to managing their objects' revocation. In such a case, we would like to share the work done by revocation: because revocation necessarily is a global (well, per-address-space) operation, if allocator A request revocation, any requests staged to the bitmap by allocator B will also be revoked! The challenge is for B to know that this has taken place, asynchronously.

Towards this end, revocation divides time into "epochs." There are two, related questions one might ask about epochs:

by which epoch had staging a revocation certainly finished?
given the current epoch, which staged revocations are certainly done?

In the earlier story, epochs did not matter so much: certainly all staging had been done prior to the invocation of caprevoke() and by the end of the call, all revocation had been done. Epoch counters are a kind of lazy synchronization with other threads.

To take advantage of epochs, we divide our quarantine into segments, filled sequentially. If we have just filled a filling segment of our quarantine, we may, rather than revoking now, merely label the segment structure with the current epoch and add it to the quarantine:

atomic_thread_fence(memory_order_acq_rel);
filling->epoch_enq = cri->epoch_enqueue;

/* append filling as the youngest quarantine node */

Because time has, presumably, elapsed whilest this segment was being filled, it makes seense to probe at the oldest quarantine segment(s) and assess their revocation status:

while(caprevoke_clears(cri->epoch_dequeue, quarantine->oldest->epoch_enq)) {
  done = quarantine->oldest;
  quarantine->oldest = quarantine->oldest->next;

  /* de-quarantine all allocations in done; free done itself */

}

/*
 * Ensure that all cores see the zeros written to the bitmap by
 * dequarantining.  This may be fused into existing allocator
 * serialization, if present.
 */
atomic_thread_fence(memory_order_release);

Pay careful attention to the use of epoch_enqueue to label the now-filled segment and epoch_dequeue to do the comparison for removing segments. You are welcome to treat caprevoke_epoch_clears() as a black box (and perhaps should; but see So What Are Epochs, Anyway? if you're curious.)

At this point, the allocator should re-evaluate its quarntine policy: it is possible that the while loop did not run and so the quarantine has merely grown by the addition of filling. If it finds that it is over quota, then it should revoke the oldest segment it holds:

caprevoke_epoch start_epoch = quarantine->oldest->epoch_enq;
while(!caprevoke_epoch_clears(cri->epoch_dequeue, start_epoch)) {
  struct caprevoke_stats crst;
  caprevoke(CAPREVOKE_LAST_PASS, start_epoch, &crst);
}

If invoked, this code fragment will, by its exit, ensure that some number, but not necessarily all, of segments of the quarantine are ready to be dequeued. The dequeing loop above will certainly then run its body at least once.

We now have a policy question! It is advantageous to frequently know the label we should use for the contents of our quarantine, as that lets us maximize the work we can piggy-back off other revocations. It's also advantageous to not ask too often, as there is overhead in the queue of segments.

Epochs and MADV_QUARANTINE

Upon marking a region of the address space as MADV_QUARANTINE'd, the kernel should obtain the (inerlocked!) epoch counter at the end of its operation (i.e., just before returning to userland) and mark the region with this value.

When the revoker encounters a MADV_QUARANTINE'd region whose epoch is sufficiently far in the past that this pass clears it, it may, after finishing this pass and before advancing the epoch clock, then reset the region to being ordinary, anonymous address space.

Multi-pass Revocation

The rough sketch of revocation makes sense as an atomic action, but that requires that we stop the world for the duration of a scan through all of memory, a rude proposal at best. We therefore divide revocation into a series of passes:

an "opening" pass, done by a single thread without the world stopped. This visits all capability-bearing pages.
zero or more "middle" passes, which visit "recently capability-dirty" pages (i.e., pages to which capabilities have been stored since they were visited by the opening pass).
a "closing" or "last" pass, which runs with the world stopped, but, like the middle passes, visits only the "recently capability-dirty" pages (but also visits some kernel state while the world is stopped).

If no flags are specified, caprevoke() does an opening or middle pass. CAPREVOKE_LAST_PASS indicates our desire to (open and) close revocation. If caprevoke() must open, as well as close, it will do the opening pass without the world stopped unless CAPREVOKE_LAST_NO_EARLY is also given. If CAPREVOKE_LAST_PASS encounters an open epoch, it will first do a "middle" pass without the world stopped, unless CAPREVOKE_LAST_NO_EARLY is also given. This CAPREVOKE_LAST_NO_EARLY flag is almost surely only useful for experimentation.

It is a policy question as to when an allocator should open and close a multi-pass revocation or run a middle pass. We believe the default policy, of opening and immediately closing, is a sensible one, but experimentation is merited. In all cases, the loop exhibited above suffices to gate segments' being de-quarantined.

We can, if we like, replace the earlier example of queueing of a full segment into quarantine with just about any operation we like. The flags CAPREVOKE_NO_WAIT_OK | CAPREVOKE_IGNORE_START | CAPREVOKE_LAST_NO_EARLY will be essentially equivalent to the shared-state access above: no revocation will take place. We can also "hurry along" an open revocation, but not open a new one, with CAPREVOKE_NO_WAIT_OK | CAPREVOKE_ONLY_IF_OPEN | CAPREVOKE_LAST_PASS, which will close an open revocation unless there is already a revoker doing work towards that end (though it may just be doing a "middle" pass).

Multi-pass and Epochs

Epochs are divided in twain to capture the semantics of multi-pass revocation: we refer to "open" epochs and "closed" ones. Their ordering is as you might expect, closed becomes open becomes the next closed becomes the next open, &c. caprevoke_epoch_clears() encapsulates the logic necessary to make sure that staging in an open epoch waits for the closing pass after the next opening pass.

Multi-pass and MADV_QUARANTINE

Resetting MADV_QUARANTINE regions must happen in a second pass over the address space after the "closing" pass: the bitmap must remain intact for the duration of the "closing" pass. Fortunately, this is merely iterating the map entries, and so should be quick; if necessary, the map entries can be specificially chained together for faster iteration.

So What Are Epochs, Anyway?

Epochs, like time in general in multi-threaded systems, are complicated things. An epoch identifier refers to a point in time, or, more properly, in the partial order of memory transactions, prior to which some thing(s) certainly had taken place. The intended use of struct caprevoke_info epoch_enqueue and epoch_dequeue are:

epoch_enqueue is the latest epoch guaranteed to be before all reads of the current or next revocation.
epoch_dequeue is the earliest epoch guaranteed to be after all past revocations.

The sole comparator we care about is caprevoke_epoch_clears(now, then). This is true if the epoch value now releases all staged revocation requests completed prior to then. This function encapsulates the following reasoning:

a staged revocation request is certainly done if revocation has both begun and then ended after its staging.

(Importantly, it does not suffice for one revocation to end and another to begin.) The implementation details are interesting, but this document is already too long.

Concurrent Revocation Safety of Bitmap Marking

All of this is encapsulated in libcheri_caprevoke and can be taken as black box.

There are several worrying possibilities that might arise when we are staging allocations to the kernel's bitmap:

Multiple allocations' shadows may occur within the same byte or word.
This is a double-free() and the allocation is already staged.
This is a free() using a capability whose allocation is staged for revocation.
A revocation pass might happen while we are in the middle of filling out the bitmap for a given structure. If we have filled in the bottom bit, then from the kernel's perspective, any capability we hold to the object whose base has not advanced is worthy of revocation.

caprev_shadow_nomap_set takes both a re-derived capability to the allocation to be staged, which should bear VMMAP permission so as to not be subject to revocation, and the capability provided by the application to free(). This re-derived capability is used to determine which bits to set, and so addresses the last point, regardless of how the revoker ultimately mangle capabilities it finds worthy of revocation.

The first point is addressed by using LL/SC to implement atomic ORs for the first and last 64-bit word occupied by the shadow of the allocation being staged. All intervening words are assumed to be property of this allocation and are simply written to with all bits set; this will be true of an allocator without placement bugs that generate overlapping objects.

The second point, of double-``free()``s is addressed by extending the LL/SC for the first word test, between LL and SC, that none of the bits about to be set are already set. If any are, the call signals a failure.

The third point is the most tricky. Just because the capability passed to free() has been staged does not mean that the revocation has already happened at the time of free() and so there is an inherent risk of a TOCTTOU race. Moreover, the revocation of this allocation and clearing of its shadow in the bitmap may take place between free() and the check inside the LL/SC mentioned above. In order to properly interlock with the revoker, the LL/SC for the first word is again extended to verify the tag of the application-sourced capability between the LL and SC instructions, thereby closing the TOCTTOU window and ensuring that double-``free()``s which straddle epochs remain idempotent.

Finally, because caprevoke() as we have used it here always performs interlocking against other caprevoke() calls, we can be certain that our stores to the bitmap are visible to other threads and CPUs engaged in revocation in time for the test in caprevoke_epoch_clears() to be correct. Closing ("last") passes additionally single-thread the world during their operation and so ensure that all threads on all cores see the tag-clearing as having taken place.

Are Wide Bitmap Stores Always Safe?

The libcheri_caprevoke helpers always write 64-bit words to the kernel's bitmap. A 64-bit word corresponds to an aligned, 1024-byte segment of the address space. As all mmap() results are at least 4096-byte aligned and at least 4096 bytes long, it is impossible that an aligned 64-bit bitmap word cross a permission boundary. While, in principle, one could request bitmap access capabilities for sub-page regions, we assume that our allocators do not.

Unsafe Experimental Options

The revoker tries very hard to not give callers sharp edges. However, for experimentation's sake, we expose some interesting behaviours. In particular, there are a series of CAPREVOKE_JUST_* flags that will cause caprevoke() to forego its usual operation, including interlocking with other callers of caprevoke(). While most of these are useful only to test various aspects of the implementation, one has proven tempting to use with general revocation and deserves special attention.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly