Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Metaissue] Memory allocators in Gramine, their performance and possible alternatives #1767

Open
13 tasks
dimakuv opened this issue Feb 9, 2024 · 1 comment
Open
13 tasks
Labels
enhancement New feature or request needs discussion Needs to be discussed on the maintainers call P: 3

Comments

@dimakuv
Copy link

dimakuv commented Feb 9, 2024

Description of the feature

Memory allocators in Gramine were written ~15 years ago. It's time to re-evaluate them and propose alternatives.

I want to start a discussion on this topic. In particular, the next posts will contain information on:

  • Description of current memory allocators:
    • MEMMGR fixed-size allocator: for specific objects in specific subsystems (currently only in LibOS)
    • SLAB variable-size allocator: generic backend for malloc and free in all other subsystems (both in LibOS and in PAL)
  • Users of current memory allocators:
    • Users of MEMMGR
    • Users of SLAB
  • Statistics using a set of common workloads:
    • Stats for MEMMGR
    • Stats for SLAB
  • Discussion on the need of two separate allocators
  • Conclusions: Open issues, unimplemented functionality, perf bottlenecks
  • Bonus: need for object caching? (i.e., keeping objects not just pre-allocated but also in initialized state)
  • Possible alternatives

Why Gramine should implement it?

We see more and more performance bottlenecks in memory/object allocation inside Gramine itself. They all stem from the fact that our memory allocators are old, ad-hoc, not scalable, and not CPU- and cache-friendly; also our allocators do not perform object caching.

Just a couple of recent examples:

For possible alternatives (in Rust), see this comment: #1723 (review)

@dimakuv
Copy link
Author

dimakuv commented Feb 9, 2024

Description of current memory allocators

There are two memory allocators in Gramine: MEMMGR and SLAB.

Both allocators rely on the following shared logic:

  • system_malloc() and system_free() are macros that are defined inside LibOS and in PAL

    • these macros/functions are "back-end" allocations of page-sized contiguous memory regions; i.e. these are actual host memory allocations
    • these macros/functions are used to dynamically grow (enlarge) and shrink (not really used!) the slabs/caches in MEMMGR and SLAB
    • defined in LibOS like this:
    • defined in PAL like this:
      static inline void* system_mem_alloc(size_t size) {
  • SYSTEM_LOCK() and SYSTEM_UNLOCK() are macros that are defined inside LibOS and in PAL

    • protect the specific MEMMGR "memory managers", each MEMMGR manager object has its own lock; this allows more parallelism as each LibOS subsystem that uses MEMMGR introduces its own lock
    • defined in LibOS like this:
      • for SLAB:
        static struct libos_lock slab_mgr_lock;
        #define SYSTEM_LOCK() lock(&slab_mgr_lock)
        #define SYSTEM_UNLOCK() unlock(&slab_mgr_lock)
        #define SYSTEM_LOCKED() locked(&slab_mgr_lock)
      • for MEMMGR, each LibOS subsystem has its own lock, e.g. for FS operations:
        static struct libos_lock dcache_mgr_lock;
        #define SYSTEM_LOCK() lock(&dcache_mgr_lock)
        #define SYSTEM_UNLOCK() unlock(&dcache_mgr_lock)
        #define SYSTEM_LOCKED() locked(&dcache_mgr_lock)
    • defined in PAL like this (only SLAB):

      gramine/pal/src/slab.c

      Lines 15 to 19 in e740728

      static spinlock_t g_slab_mgr_lock = INIT_SPINLOCK_UNLOCKED;
      #define SYSTEM_LOCK() spinlock_lock(&g_slab_mgr_lock)
      #define SYSTEM_UNLOCK() spinlock_unlock(&g_slab_mgr_lock)
      #define SYSTEM_LOCKED() spinlock_is_locked(&g_slab_mgr_lock)

MEMMGR fixed-size allocator

Used to allocate specific objects in specific subsystems. Currently used only in LibOS.

Each subsystem of LibOS that uses MEMMGR specifies its own (global to the subsystem) lock. Thus, MEMMGR object allocs/frees in the same subsystem are synchronized on this lock, but object allocs/frees in different subsystems can run in parallel.

The current users of MEMMGR:

Every managed object is wrapped into a MEM_OBJ_TYPE struct. This struct doesn't have additional fields (if not built with ASan), so it's the most compact representation possible. When the object is "freed" and its underlying memory is moved to the free list, MEM_OBJ_TYPE's list field is used instead.

Design and implementation are very simple:

  1. There is a single "memory manager" object created for each LibOS subsystem.
  2. This memory manager keeps two lists:
    • available areas (disjoint allocated memory regions)
    • free list of re-usable objects
  3. Each subsystem chooses the initial size (number of objects) this manager allocates at startup. The first area is allocated which can host this initial number of objects. This first area is also marked as active (when the first area becomes full, the second area will be allocated and will be marked as active, and first area becomes inactive).
  4. When the subsystem needs a new object (alloc), the manager first checks if its currently active area has any untouched memory left, and if yes, object is "allocated" in this memory. If no memory in active area, then the manager checks the free list, and if there is one free slot, object is "allocated" in this slot. If no free slots, then (if subsystem allowed it), the manager creates a new area, makes it active, and "allocates" the object at the base address of this new area.
  5. When the subsystem frees the object, there are two possibilities:
    • The to-be-freed object is inside the checkpoint memory region (i.e., the object was copied from the parent process). This corner case is not handled (checkpoint memory is never freed or reused), so the manager returns and the object is leaked.
    • Otherwise, the object is added to the free list.

The MEMMGR memory managers are never "reset", or shrunk, or deleted. Thus, if LibOS allocated a lot of MEMMGR objects initially, and then freed them all, then this MEMMGR memory is leaked. This should be a very rare and unimportant case though.

Backend-memory (page-sized) allocation happens via __system_malloc() declared here: https://github.com/gramineproject/gramine/blob/master/libos/src/libos_malloc.c

Backend-memory (page-sized) deallocation, as mentioned above, doesn't really happen. But if it would, then it would be via __system_free() also declared here: https://github.com/gramineproject/gramine/blob/master/libos/src/libos_malloc.c

Support for ASan

  • Allocated objects have an 8-byte redzone (padding)
  • When object is freed, ASan poisons the "freed" memory

Open issues

  • Objects in migrated memory (in the child) leak because their "slots" are never re-used:

    static inline void free_mem_obj_to_mgr(MEM_MGR mgr, OBJ_TYPE* obj) {
    if (memory_migrated(obj)) {
    return;
    }

    • TODO: create a separate GitHub issue about this.
  • Names are bad, in particular get_mem_obj_from_mgr_enlarge(MEM_MGR mgr, size_t size) -- the size here is actually "by how many bytes to increase the pool of available memory if there is not free memory in areas and no free slots in the free list".

  • Something is fishy with size arguments in functions. These args are in bytes (at least that's what the callers assume), but function implementations seem to treat the argument as count. TODO: verify and fix this; we may have a memory leak here.

SLAB variable-size allocator

Generic backend for malloc and free in all other subsystems. Used both in LibOS and in PAL.

When any (random size) object needs to be allocated/freed in LibOS or in PAL, the traditional malloc() and free() calls are used. They are wrappers around slab_alloc() and slab_free(). See:

Backend-memory (page-sized) allocation and deallocation is implemented via:

There is a single global slab manager and corresponding lock for LibOS and similarly a single global slab manager and corresponding lock for PAL. See these:

NOTE We have a struct libos_lock in LibOS. This lock is implemented via PalEventWait() and PalEventSet() which do have a fast path, but the slow path results in ocall_futex(), which is super-expensive in SGX. Maybe we could replace it with a spinlock()? Technically most of the time we'll hit the slab-allocator's cache, which is a fast operation; so it seems like no real need to ocall_futex() in this case.

Design and implementation is based on the MEMMGR allocator for the common case, and has a trivial fallback for the large-object case:

  • If the requested object's size is greater than the max allowed in the SLAB allocator's levels (currently ~2KB, but [common] Add encrypted-file-node-size slab into slab allocator #1763 changes it to ~8KB), then SLAB falls back to backend-memory allocation:
    if (level == (size_t)-1) {
    size = ALIGN_UP_POW2(size, MIN_MALLOC_ALIGNMENT);
    LARGE_MEM_OBJ mem = (LARGE_MEM_OBJ)system_malloc(sizeof(LARGE_MEM_OBJ_TYPE) + size);
    if (!mem)
    return NULL;
    mem->size = size;
    OBJ_LEVEL(mem) = (unsigned char)-1;
    #ifdef ASAN
    asan_unpoison_region((uintptr_t)OBJ_RAW(mem), size);
    #endif
    return OBJ_RAW(mem);
    }
  • If the requested object's size is less than one of the SLAB allocator's levels, then SLAB chooses the highest level this object fits in (to pack objects as dense as possible) and then proceeds in the same way as MEMMGR.
    • In other words, each SLAB level has the same rules as MEMMGR (search free space in active area, then fallback to free list, then fallback to enlarging the level by allocating a new area).

Deallocation happens similarly to the allocation description above:

  1. Get the SLAB allocator's metadata for the to-be-freed object. Part of this metadata is the level value.
  2. If level == -1, it means that the object's size is greater than the max allowed in SLAB, so it was allocated via backend-memory allocation, thus it must be deallocated via backend-memory free:
    if (level == (unsigned char)-1) {
    LARGE_MEM_OBJ mem = RAW_TO_OBJ(obj, LARGE_MEM_OBJ_TYPE);
    #ifdef DEBUG
    _real_memset(obj, 0xCC, mem->size);
    #endif
    #ifdef ASAN
    asan_unpoison_region((uintptr_t)mem, mem->size + sizeof(LARGE_MEM_OBJ_TYPE));
    #endif
    system_free(mem, mem->size + sizeof(LARGE_MEM_OBJ_TYPE));
    return;
    }
  3. Otherwise this object's memory is returned back to the SLAB allocator's level's free list:
    SYSTEM_LOCK();
    INIT_LIST_HEAD(mobj, __list);
    LISTP_ADD_TAIL(mobj, &mgr->free_list[level], __list);
    SYSTEM_UNLOCK();

Open issues

  • No realloc() implementation.

    • I'm not sure what's the technical reason for not having it.
    • If we provide realloc(), then we have many places in LibOS and PAL that could benefit from this function (currently they implement such realloc in ad-hoc ways combining malloc memcpy and free).
  • SLAB_CANARY is defined for LibOS, but not for PAL.

    • This leads to the SLAB object metadata (header) being 32B in LibOS,
    • and only 16B in PAL.

@dimakuv dimakuv added enhancement New feature or request needs discussion Needs to be discussed on the maintainers call P: 3 labels Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request needs discussion Needs to be discussed on the maintainers call P: 3
Projects
None yet
Development

No branches or pull requests

1 participant