Re-upload files if they are missing in storage #716
Conversation
Force-pushed from aca2fa2 to 354d904
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #716 +/- ##
==========================================
- Coverage 80.17% 80.02% -0.15%
==========================================
Files 52 52
Lines 4726 4751 +25
Branches 961 970 +9
==========================================
+ Hits 3789 3802 +13
- Misses 908 920 +12
Partials 29 29
Force-pushed from 9c50993 to 09874d7
medusa/backup_node.py
Outdated
manifest_paths = map(lambda o: o.path, manifest_objects)
missing_in_storage = list()
# iterate through manifest paths and their corresponding local file names
for manifest_path, src in zip(manifest_paths, srcs):
Thought: This can get a little expensive if we have a lot of manifests.
Instead of checking if the files we want to upload are present in the data directory, we're cycling through all past manifests to list files that would be missing.
The design I had in mind was to stop checking the past manifests at all, and replace our _cached_objects variable with the list of existing files in the data directory. The rest of the code would be left untouched then, for the most part.
wdyt?
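For illustration, a minimal sketch of that alternative as described: seed the cache from the files actually on disk instead of from past manifests. The helper name and the rglob-based listing are assumptions, not Medusa's actual code:

from pathlib import Path

def list_local_files(data_dir: str) -> set:
    # every regular file currently under the node's data directory
    return {str(p) for p in Path(data_dir).rglob('*') if p.is_file()}

# hypothetical: _cached_objects becomes a plain set of on-disk paths
_cached_objects = list_local_files('/var/lib/cassandra/data')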
I have to admit I have not fully understood how `replace_or_remove_if_cached` actually caches things.
I just worked with the understanding that it takes the list of files to back up (`srcs`) and splits them into two groups:
- the ones that `needs_backup`, which is just a list of strings, just like `srcs`. These are the paths on the local node.
- the ones that are `already_backed_up`, which is a list of `ManifestObject`. These are the paths in the storage, together with size, digest and a timestamp.
I've added a third group, which is a list of local paths, so strings like `srcs`, but their order in the list corresponds to `already_backed_up`. I've done this because to re-upload them, I'd have to somehow convert the path in the storage back to the path on the node, which I couldn't easily figure out.
When we `check_missing_files`, we give it the last two groups - a list of `ManifestObject`s and their corresponding local paths - and a dict of files in the storage keyed by keyspace and table (into a set for quick lookup).
So we are dealing with
- all files in the data folder for the given node
- manifest objects for a given keyspace and table.
We don't deal with any manifests as such; we only `_make_manifest_object()` in the `replace_or_remove_if_cached`.
I would really like to avoid changing the caching code. However, I'm open to the idea of dropping it because it simplifies things.
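For illustration, a minimal, self-contained sketch of the three-way split described above; the ManifestObject fields and the cache lookup are assumptions, not the actual replace_or_remove_if_cached code:

from collections import namedtuple

# assumed shape; Medusa's real ManifestObject may differ
ManifestObject = namedtuple('ManifestObject', ['path', 'size', 'MD5'])

srcs = ['/data/ks1/t1/nb-1-big-Data.db', '/data/ks1/t1/nb-2-big-Data.db']
# hypothetical cache: local path -> ManifestObject already in storage
cached = {'/data/ks1/t1/nb-1-big-Data.db': ManifestObject('ks1/t1/nb-1-big-Data.db', 1024, 'abc')}

needs_backup = []        # local paths (strings), just like srcs
already_backed_up = []   # ManifestObject: storage path plus size/digest
backed_up_srcs = []      # the third group: local paths, index-aligned with already_backed_up

for src in srcs:
    mo = cached.get(src)
    if mo is None:
        needs_backup.append(src)
    else:
        already_backed_up.append(mo)
        backed_up_srcs.append(src)

# check_missing_files then takes the last two groups plus the storage lookup
# and returns the local paths whose storage counterparts have gone missing.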
Force-pushed from 6436c73 to 2e2b4b7
Force-pushed from 3ee38fb to f390cf4
Finalised the reimplementation to not even look at the manifests.
A few suggestions that you could tackle if you want, but things work perfectly fine here!
That's making Medusa much more resilient!
if str(path).startswith('/'):
    chunks = str(path).split('/')
    if path.parent.name.startswith('.') or path.parent.name.endswith('nodes'):
        k, t, index_name = chunks[-6], chunks[-5], chunks[-2]
suggestion: it would be nice to add a check for the number of items in the array and return a proper error message (or plain skip the item) if we don't have enough chunks.
Ack, will push a fix shortly.
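One possible shape for that guard, for illustration only (the function name and the skip behaviour are assumptions, not the committed fix):

import logging
from pathlib import Path

def parse_index_path(path: Path):
    chunks = str(path).split('/')
    if len(chunks) < 6:
        # not enough components to address chunks[-6]; skip with a clear message
        logging.warning('Skipping %s: expected at least 6 path components', path)
        return None
    return chunks[-6], chunks[-5], chunks[-2]   # keyspace, table, index name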
medusa/backup_node.py
Outdated
enable_md5_checks: bool,
files_in_storage: t.Dict[str, t.Dict[str, t.Dict[str, ManifestObject]]],
keyspace: str,
columnfamily: str,
nit: columnfamily seems to be unused
good catch, fixing.
medusa/backup_node.py
Outdated
if len(needs_reupload) > 0:
    logging.info(
        f"Re-uploading {len(needs_reupload)} files in {fqtn}"
    )
    manifest_objects += storage.storage_driver.upload_blobs(needs_reupload, dst_path)

# Reintroducing already backed up objects in the manifest in differential
if len(already_backed_up) > 0 and node_backup.is_differential:
    logging.info(
        f"Skipping upload of {len(already_backed_up)} files in {fqtn} because they are already in storage"
    )
    for obj in already_backed_up:
        manifest_objects.append(obj)
suggestion: Why separate the uploads of newly uploaded and re-uploaded files? I assume you can slightly improve the code here by concatenating the two lists and invoking upload_blobs() just once.
Merged the calls to upload_blobs() but left a log line about fixing the backups.
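For illustration, the merged call could look roughly like this (a sketch assuming the names from the diff above, not the exact committed code):

import logging

def upload_all(storage, needs_backup, needs_reupload, dst_path, fqtn):
    if len(needs_reupload) > 0:
        # keep a dedicated log line so fixed-up backups stay visible
        logging.info(f"Re-uploading {len(needs_reupload)} files in {fqtn}")
    # one upload_blobs() invocation covering both new and missing files
    return storage.storage_driver.upload_blobs(needs_backup + needs_reupload, dst_path)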
Pushed a commit with the fixes.
Force-pushed from 68087c8 to 9a813dc
Force-pushed from 9a813dc to 64f9eb6
Quality Gate passed
Fixes #709
Fixes #368
In short, the way we do this is to list all files in storage under the data folder for the given prefix and node, after taking a snapshot but before we start backing up individual tables. We then group the listed files into a dict->dict->set keyed by keyspace->table, so that we can look up any given file efficiently.
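An illustrative reconstruction of that lookup structure (the path layout here is an assumption; only the dict->dict->set shape comes from the description above):

files_in_storage = {}
storage_paths = [
    'prefix/node1/data/ks1/t1-abc123/nb-1-big-Data.db',
    'prefix/node1/data/ks1/t1-abc123/nb-1-big-Index.db',
]
for p in storage_paths:
    chunks = p.split('/')
    keyspace, table, file_name = chunks[-3], chunks[-2], chunks[-1]
    files_in_storage.setdefault(keyspace, {}).setdefault(table, set()).add(file_name)

# membership check is then two dict hops and an O(1) set lookup
present = 'nb-1-big-Data.db' in files_in_storage.get('ks1', {}).get('t1-abc123', set())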
Tested manually on a node with ~11k LCS SSTables. Did not observe a noticeable increase in (differential) backup duration. There is probably some increased memory consumption, but I'm struggling a bit to quantify how much.