Re-upload files if they are missing in storage #716
Conversation
Force-pushed from aca2fa2 to 354d904
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #716 +/- ##
==========================================
- Coverage 80.17% 80.02% -0.15%
==========================================
Files 52 52
Lines 4726 4751 +25
Branches 961 970 +9
==========================================
+ Hits 3789 3802 +13
- Misses 908 920 +12
Partials 29 29
Force-pushed from 9c50993 to 09874d7
medusa/backup_node.py
Outdated
manifest_paths = map(lambda o: o.path, manifest_objects)
missing_in_storage = list()
# iterate through manifest paths and their corresponding local file names
for manifest_path, src in zip(manifest_paths, srcs):
Thought: This can get a little expensive if we have a lot of manifests.
Instead of checking if the files we want to upload are present in the data directory, we're cycling through all past manifests to list files that would be missing.
The design I had in mind was to stop checking the past manifests at all, and replace our _cached_objects variable with the list of existing files in the data directory. The rest of the code would be left untouched then, for the most part.
wdyt?
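For illustration, a minimal sketch of that alternative as described: seed the cache from the files actually on disk instead of from past manifests. The helper name and the rglob-based listing are assumptions, not Medusa's actual code:

from pathlib import Path

def list_local_files(data_dir: str) -> set:
    # every regular file currently under the node's data directory
    return {str(p) for p in Path(data_dir).rglob('*') if p.is_file()}

# hypothetical: _cached_objects becomes a plain set of on-disk paths
_cached_objects = list_local_files('/var/lib/cassandra/data')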
I have to admit I have not fully understood how `replace_or_remove_if_cached` actually caches things.
I just worked with the understanding that it takes the list of files to back up (`srcs`) and splits them into two groups:
- the ones that `needs_backup`, which is just a list of strings, just like `srcs`. These are the paths on the local node.
- the ones that are `already_backed_up`, which is a list of `ManifestObject`. These are the paths in the storage, together with size, digest and a timestamp.
I've added a third group, which is a list of local paths, so strings like `srcs`, but their order in the list corresponds to `already_backed_up`. I've done this because to re-upload them, I'd have to somehow convert the path in the storage back to the path on the node, which I couldn't easily figure out.
When we `check_missing_files`, we give it the last two groups - a list of `ManifestObject`s and their corresponding local paths - and a dict of files in the storage keyed by keyspace and table (into a set for quick lookup).
So we are dealing with
- all files in the data folder for the given node
- manifest objects for a given keyspace and table.
We don't deal with any manifests as such; we only `_make_manifest_object()` in the `replace_or_remove_if_cached`.
I would really like to avoid changing the caching code. However, I'm open to the idea of dropping it because it simplifies things.
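For illustration, a minimal, self-contained sketch of the three-way split described above; the ManifestObject fields and the cache lookup are assumptions, not the actual replace_or_remove_if_cached code:

from collections import namedtuple

# assumed shape; Medusa's real ManifestObject may differ
ManifestObject = namedtuple('ManifestObject', ['path', 'size', 'MD5'])

srcs = ['/data/ks1/t1/nb-1-big-Data.db', '/data/ks1/t1/nb-2-big-Data.db']
# hypothetical cache: local path -> ManifestObject already in storage
cached = {'/data/ks1/t1/nb-1-big-Data.db': ManifestObject('ks1/t1/nb-1-big-Data.db', 1024, 'abc')}

needs_backup = []        # local paths (strings), just like srcs
already_backed_up = []   # ManifestObject: storage path plus size/digest
backed_up_srcs = []      # the third group: local paths, index-aligned with already_backed_up

for src in srcs:
    mo = cached.get(src)
    if mo is None:
        needs_backup.append(src)
    else:
        already_backed_up.append(mo)
        backed_up_srcs.append(src)

# check_missing_files then takes the last two groups plus the storage lookup
# and returns the local paths whose storage counterparts have gone missing.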
Force-pushed from 6436c73 to 2e2b4b7
Force-pushed from 3ee38fb to f390cf4
Finalised the reimplementation to not even look at the manifests.
A few suggestions that you could tackle if you want, but things work perfectly fine here!
That's making Medusa much more resilient!
if str(path).startswith('/'):
    chunks = str(path).split('/')
    if path.parent.name.startswith('.') or path.parent.name.endswith('nodes'):
        k, t, index_name = chunks[-6], chunks[-5], chunks[-2]
suggestion: it would be nice to add a check for the number of items in the array and return a proper error message (or plain skip the item) if we don't have enough chunks.
Ack, will push a fix shortly.
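One possible shape for that guard, for illustration only (the function name and the skip behaviour are assumptions, not the committed fix):

import logging
from pathlib import Path

def parse_index_path(path: Path):
    chunks = str(path).split('/')
    if len(chunks) < 6:
        # not enough components to address chunks[-6]; skip with a clear message
        logging.warning('Skipping %s: expected at least 6 path components', path)
        return None
    return chunks[-6], chunks[-5], chunks[-2]   # keyspace, table, index name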
medusa/backup_node.py
Outdated
enable_md5_checks: bool,
files_in_storage: t.Dict[str, t.Dict[str, t.Dict[str, ManifestObject]]],
keyspace: str,
columnfamily: str,
nit: columnfamily seems to be unused
good catch, fixing.
medusa/backup_node.py
Outdated
if len(needs_reupload) > 0:
    logging.info(
        f"Re-uploading {len(needs_reupload)} files in {fqtn}"
    )
    manifest_objects += storage.storage_driver.upload_blobs(needs_reupload, dst_path)

# Reintroducing already backed up objects in the manifest in differential
if len(already_backed_up) > 0 and node_backup.is_differential:
    logging.info(
        f"Skipping upload of {len(already_backed_up)} files in {fqtn} because they are already in storage"
    )
    for obj in already_backed_up:
        manifest_objects.append(obj)
suggestion: Why separate the uploads of newly uploaded and re-uploaded files? I assume you can slightly improve the code here by concatenating the two lists and invoking upload_blobs() just once.
Merged the calls to upload_blobs() but left a log line about fixing the backups.
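For illustration, the merged call could look roughly like this (a sketch assuming the names from the diff above, not the exact committed code):

import logging

def upload_all(storage, needs_backup, needs_reupload, dst_path, fqtn):
    if len(needs_reupload) > 0:
        # keep a dedicated log line so fixed-up backups stay visible
        logging.info(f"Re-uploading {len(needs_reupload)} files in {fqtn}")
    # one upload_blobs() invocation covering both new and missing files
    return storage.storage_driver.upload_blobs(needs_backup + needs_reupload, dst_path)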
Pushed a commit with the fixes.
Force-pushed from 68087c8 to 9a813dc
Force-pushed from 9a813dc to 64f9eb6
Quality Gate passed
Fixes #709
Fixes #368
In short, the way we do this is to list all files in storage under the data folder for the given prefix and node, after taking a snapshot but before we start backing up individual tables. We then group the listed files into a dict->dict->set keyed by keyspace->table, so that we can look up any given file efficiently.
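An illustrative reconstruction of that lookup structure (the path layout here is an assumption; only the dict->dict->set shape comes from the description above):

files_in_storage = {}
storage_paths = [
    'prefix/node1/data/ks1/t1-abc123/nb-1-big-Data.db',
    'prefix/node1/data/ks1/t1-abc123/nb-1-big-Index.db',
]
for p in storage_paths:
    chunks = p.split('/')
    keyspace, table, file_name = chunks[-3], chunks[-2], chunks[-1]
    files_in_storage.setdefault(keyspace, {}).setdefault(table, set()).add(file_name)

# membership check is then two dict hops and an O(1) set lookup
present = 'nb-1-big-Data.db' in files_in_storage.get('ks1', {}).get('t1-abc123', set())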
Tested manually on a node with ~11k LCS SSTables. Did not observe a noticeable increase in (differential) backup duration. There is probably some increased memory consumption, but I'm struggling a bit to quantify how much.