Native support for incremental restore #13239

Open
wants to merge 1 commit into
base: main

mszeszko-meta
Contributor

@mszeszko-meta mszeszko-meta commented Dec 20, 2024

Summary

This change adds native library support for incremental restores. The design follows a 'tiered' approach: users pick one of three predefined and, for now, mutually exclusive restore modes (kKeepLatestDbSessionIdFiles, kVerifyChecksum, and kPurgeAllFiles [default]), trading CPU / operational velocity for the degree of certainty that their existing destination db files match the contents of the selected backup files. The new mode option is exposed through the existing RestoreOptions configuration, which is already well established in our APIs. The restore engine consumes this configuration and infers which of the existing destination db files are 'in policy' to be retained during restore.
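
To make the tiering concrete, here is a minimal, self-contained sketch of how a mode carried on a restore-options struct could gate file retention. Everything inside namespace `sketch` (`RestoreMode`, `RestoreOptionsSketch`, `MayRetainExistingFiles`, `ModeName`) is a hypothetical stand-in written for illustration; only the three mode names and the fact that the mode travels on RestoreOptions come from this description, and the real declarations in the RocksDB headers may differ.

```cpp
#include <string>

namespace sketch {

// Hypothetical mirror of the three mutually exclusive restore modes.
enum class RestoreMode {
  kKeepLatestDbSessionIdFiles,  // retain files whose metadata matches the backup
  kVerifyChecksum,              // retain files whose checksum matches the backup
  kPurgeAllFiles,               // retain nothing; restore every file (default)
};

// Hypothetical stand-in for the options struct that carries the mode.
struct RestoreOptionsSketch {
  RestoreMode mode = RestoreMode::kPurgeAllFiles;
};

// Returns true when the selected mode allows any existing destination
// file to be kept instead of being rewritten from the backup.
inline bool MayRetainExistingFiles(const RestoreOptionsSketch& options) {
  return options.mode != RestoreMode::kPurgeAllFiles;
}

// Human-readable mode name, useful for logging which tier was selected.
inline std::string ModeName(RestoreMode mode) {
  switch (mode) {
    case RestoreMode::kKeepLatestDbSessionIdFiles:
      return "kKeepLatestDbSessionIdFiles";
    case RestoreMode::kVerifyChecksum:
      return "kVerifyChecksum";
    case RestoreMode::kPurgeAllFiles:
      return "kPurgeAllFiles";
  }
  return "unknown";
}

}  // namespace sketch
```

With the default-constructed options the sketch behaves like a full restore; a caller opts into an incremental tier by setting `mode` explicitly.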

Motivation

This work is motivated by an internal customer running a write-heavy, 1M+ QPS service that uses RocksDB restore functionality to scale up their fleet. Given the already high organic QPS on their end, the additional traffic from restores as implemented today causes prolonged spikes that push the service into its BLOB storage quotas, which in turn slows the pace of scaling. Please see T206217267 for more.

Impact

The expected impact of this work is twofold: 1) reducing the library's footprint on BLOB storage quotas, and 2) speeding up the scale-up procedure. After landing this, we should follow up with the customers on observed gains to attach numbers to the above predictions.

Technical nuances

  1. According to prior investigations, the risk of collisions on [file #, db session id, file size] metadata triplets is low enough that we can confidently use the triplet to uniquely describe a file and its perceived contents, which is the rationale behind the kKeepLatestDbSessionIdFiles mode. To learn more about the risks / tradeoffs of using this mode, please check the related comment in backup_engine.cc.
  2. kVerifyChecksum mode requires a full blob / SST file scan (assuming the backup file has its checksum_hex metadata set appropriately; if not, an additional scan of the backup file is needed). While it saves us write IOs (if checksums match), it is still a fairly complex and potentially CPU-intensive operation.
  3. We're extending the WorkItemType enum introduced in "Generalize work item definition in BackupEngineImpl" #13228 to accommodate a new simple request to ComputeChecksum, which will enable us to run 2) concurrently.
  4. Note that it is necessary to compute the checksum of the restored file if the checksums of the corresponding backup file and the existing destination db file did not match.
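
The retention rules in points 1, 2 and 4 can be sketched as a small decision function. This is an illustration under stated assumptions, not the code in backup_engine.cc: `FileInfo`, `SameTriplet`, `CanRetain`, `RestoreOne` and `ToyChecksum` are hypothetical names, file contents are modelled as in-memory strings, and `ToyChecksum` is an FNV-1a hash standing in for whatever checksum the backup engine actually records in checksum_hex.

```cpp
#include <cstdint>
#include <string>
#include <tuple>

namespace sketch {

enum class Mode { kKeepLatestDbSessionIdFiles, kVerifyChecksum, kPurgeAllFiles };

// Hypothetical file descriptor: the metadata triplet plus the file bytes.
struct FileInfo {
  uint64_t file_number;
  std::string db_session_id;
  uint64_t size;
  std::string bytes;  // stand-in for on-disk contents
};

// Toy FNV-1a hash; an assumption standing in for the real file checksum.
inline uint32_t ToyChecksum(const std::string& bytes) {
  uint32_t hash = 2166136261u;
  for (unsigned char c : bytes) {
    hash ^= c;
    hash *= 16777619u;
  }
  return hash;
}

// Point 1: compare only [file #, db session id, file size]; no file scan.
inline bool SameTriplet(const FileInfo& a, const FileInfo& b) {
  return std::tie(a.file_number, a.db_session_id, a.size) ==
         std::tie(b.file_number, b.db_session_id, b.size);
}

// Decide whether the existing destination file may be kept as-is.
inline bool CanRetain(Mode mode, const FileInfo& existing,
                      const FileInfo& backup) {
  switch (mode) {
    case Mode::kKeepLatestDbSessionIdFiles:
      return SameTriplet(existing, backup);
    case Mode::kVerifyChecksum:
      // Point 2: requires scanning the existing file (and here, for
      // simplicity, the backup file too).
      return ToyChecksum(existing.bytes) == ToyChecksum(backup.bytes);
    case Mode::kPurgeAllFiles:
      return false;
  }
  return false;
}

// Restore one file. Returns true if bytes were copied from the backup.
// Point 4: after a copy, the restored file is checksummed again and
// compared with the backup, reported through *verified.
inline bool RestoreOne(Mode mode, FileInfo* destination,
                       const FileInfo& backup, bool* verified) {
  if (CanRetain(mode, *destination, backup)) {
    *verified = true;  // retained under the selected policy
    return false;
  }
  *destination = backup;  // simulate copying the backup file into place
  *verified = ToyChecksum(destination->bytes) == ToyChecksum(backup.bytes);
  return true;
}

}  // namespace sketch
```

The triplet check costs no file IO, which is why it is the cheapest tier, while the checksum tier trades read IO and CPU for certainty about contents.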

Test plan

  1. Manual testing using a debugger: ✅
  2. Automated tests: WIP 🚧

@facebook-github-bot
Contributor

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.
