Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshots overhaul #176

Merged
merged 53 commits into from
Dec 20, 2021
Merged

Snapshots overhaul #176

merged 53 commits into from
Dec 20, 2021

Conversation

Shillaker
Copy link
Collaborator

@Shillaker Shillaker commented Nov 18, 2021

Apologies, this PR has turned into a beast, but I think it's worth spending time on, as it's one of the more fiddly parts of the system.

The logic for creating, updating, diffing and merging snapshots is currently spread over several classes, quite difficult to understand because of the unclear ownership of data, and sprinkled with data races. In this PR I refactor all the snapshot-related code into something (hopefully) easier to understand, and safer to use.

The snapshot.h and snapshot.cpp files have gotten way too big and should be split up by class, however, that would make the diff for this PR even larger than it already is, so can be done in a separate PR.

Functionality changes:

  • Merge regions are cleared after pushing snapshot diffs, as these regions only apply to individual batches of threads.
  • SnapshotData now has a max size, and will allocate a contiguous portion of virtual memory into which it may expand in future (in the case of wasm this will be of size 4GB). If data is being copied into the snapshot that is larger than the current size, it will allocate new portions of memory.
  • SnapshotData and SnapshotDiff take ownership of their own data, and a new MemoryView class is used to represent views of memory (which was previously rolled into these two classes).
  • Threads on the main host now diff memory with their original snapshot and write back changes. Without this, changes from these threads are not visible in the snapshot as the mapping of memory is private, hence one-way.

Refactoring:

  • The logic for merging diffs into a snapshot was held in the SnapshotServer class, but is now in the SnapshotData class.
  • SnapshotData uses proper RAII to manage its contents.
  • SnapshotDiffs are now queued and applied only at explicit synchronisation points. Without this, changes to the snapshot on the main host may be interleaved and mess up the diffing process on that host. This has an additional benefit of making the process simpler to reason about.
  • Everything deals with std::shared_ptr<SnapshotData> rather than references. This reduces risk of confusion over copying and shared ownership of data (and we can delete the copy and assignment constructors).
  • All logic around memory provisioning and file descriptors is now in the SnapshotData class itself rather than spread between it and the SnapshotRegistry.
  • Low-level handling of memory allocation and mapping is now contained in utility functions to avoid having mmap and mprotect calls inline.
  • Snapshot getDirtyPages is now getDirtyRegions to support a minimal set of diffs (rather than one per page even if the pages are contiguous).

@Shillaker Shillaker marked this pull request as ready for review November 18, 2021 17:44
src/util/memory.cpp Outdated Show resolved Hide resolved

faabric::util::SnapshotData snap;
snap.size = initialSnapSize;
snap.data = faabric::util::allocateVirtualMemory(expandedSnapSize);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you introduce a std::unique_ptr-like (or even just std::unique_ptr with a custom deleter) wrapper for these allocations? I've been purging all uses of new/delete and using RAII memory management where possible, this feels like a step backwards. I know this could be impossible for the general case of applying snapshots onto faaslets, but at least in testing code the lifetime of these mappings is limited to a function call scope.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right re. test code, we can definitely do a better job there. As you say, we might not be able to avoid a raw pointer version of this everywhere, but I'll see what I can do.

Copy link
Collaborator Author

@Shillaker Shillaker Dec 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two observations after a bit of fiddling:

  • To interface properly with Faasm, the snapshot API (SnapshotRegistry, SnapshotData) still needs to support raw pointers (as both WAMR and WAVM still use them for wasm memory), hence we can't go full RAII on a lot of the underlying snapshot logic. We could introduce alternative versions of the API/ objects that support smart pointers alongside raw pointers, but this will result in a fair bit of extra code.

  • I agree that using smart pointers in tests for memory provisioned on the fly is a good idea, and I can almost get something with a unique_ptr working, however, the custom deleter needs to call munmap and needs to know the size of the memory that was orinally mmaped, which makes defining the deleter a little more tricky. I got this far:

std::unique_ptr<uint8_t[]> doAlloc(size_t size, int prot, int flags)
{
    auto deleter = [size](uint8_t* u) { munmap(u, size); };
    std::unique_ptr<uint8_t[], decltype(deleter)> mem(
        (uint8_t*)::mmap(nullptr, size, prot, flags, -1, 0), deleter);
    return mem;
}

However, this won't compile as the return type of the function also needs to specify the type of the deleter, but I can't work out the syntax. Any ideas?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use std::function<void(uint8_t*)> as the deleter type to store that lambda

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative is to use a shared_ptr, which I think should be more flexible re. the custom deleter.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a minimal working example with unique_ptr:

#include <memory>
#include <functional>
#include <sys/mman.h>

using OwnedMmapRegion = std::unique_ptr<uint8_t[], std::function<void(uint8_t*)>>;

OwnedMmapRegion doAlloc(size_t size, int prot, int flags)
{
    auto deleter = [size](uint8_t* u) { munmap(u, size); };
    OwnedMmapRegion mem(
        (uint8_t*)::mmap(nullptr, size, prot, flags, -1, 0), deleter);
    return mem;
}

https://godbolt.org/z/TdTzaz9rW

I'm not sure what you mean by shared_ptr being more flexible with the custom deleter, the only difference would be that you could use reference counting to keep track when the last user is gone.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, cheers @KubaSz. What I meant re. shared_ptr being more flexible is that I don't think the deleter type needs to be included in the pointer type, unlike a unique_ptr, so the return type of the function could still be std::shared_ptr<uint8_t[]> even with the deleter (and I could dodge the original problem I was facing). Anywho, your version is great so I'll see how things turn out with that.

Comment on lines 100 to 104
// TODO - somehow avoid this copy?
std::memcpy(snap.data, r->contents()->Data(), snap.size);

reg.takeSnapshot(r->key()->str(), data, true);
// TODO - avoid a further copy in here when setting up fd
reg.takeSnapshot(r->key()->str(), snap, true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least one copy could be avoided here by adding an API to create a snapshot (including its memfd) of a given size without data, then making the caller responsible for filling in the data. The call would have to create the memfd, ftruncate it and mmap it, saving the mmap location in the data field

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that might work, I'll see what I can do.

src/util/snapshot.cpp Outdated Show resolved Hide resolved
@Shillaker Shillaker marked this pull request as draft December 9, 2021 14:45
@@ -99,7 +99,7 @@ jobs:
${{ runner.os }}-
# --- Build with thread sanitising
- name: "Build tests with address sanitising"
run: inv dev.sanitise Address --noclean
run: inv dev.sanitise Address faabric_tests --noclean
Copy link
Collaborator Author

@Shillaker Shillaker Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a target to the dev.sanitise task rather than assuming it's always the tests.

@Shillaker Shillaker marked this pull request as ready for review December 10, 2021 17:33
@Shillaker Shillaker marked this pull request as draft December 15, 2021 18:11
@Shillaker Shillaker self-assigned this Dec 16, 2021
bin/workon.sh Outdated
# Check these match the CI build
export ASAN_OPTIONS = "verbosity=1:halt_on_error=1"
export TSAN_OPTIONS="halt_on_error=1:suppressions=${PROJ_ROOT}/thread-sanitizer-ignorelist.txt:history_size=7:second_deadlock_stack=1"
export UBSAN_OPTIONS="print_stacktrace=1:halt_on_error=1"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important for replicating CI problems locally (without halt_on_error=1 it's hard to tell if anything's breaking).

@Shillaker Shillaker marked this pull request as ready for review December 16, 2021 17:12
@Shillaker Shillaker marked this pull request as draft December 17, 2021 11:46
@Shillaker Shillaker marked this pull request as ready for review December 17, 2021 17:49
Copy link
Collaborator

@eigenraven eigenraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks very good now, just a few minor comments/questions left from me

src/snapshot/SnapshotRegistry.cpp Outdated Show resolved Hide resolved
src/snapshot/SnapshotRegistry.cpp Outdated Show resolved Hide resolved
src/snapshot/SnapshotServer.cpp Show resolved Hide resolved
src/util/snapshot.cpp Outdated Show resolved Hide resolved
src/util/snapshot.cpp Outdated Show resolved Hide resolved
tests/dist/DistTestExecutor.h Show resolved Hide resolved
tests/test/scheduler/test_scheduler.cpp Show resolved Hide resolved
include/faabric/scheduler/Scheduler.h Outdated Show resolved Hide resolved
include/faabric/util/snapshot.h Outdated Show resolved Hide resolved
include/faabric/util/snapshot.h Outdated Show resolved Hide resolved
@Shillaker Shillaker merged commit d25c14b into master Dec 20, 2021
@Shillaker Shillaker deleted the snap-sizes branch December 20, 2021 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants