Native support for incremental restore #13239

mszeszko-meta · 2024-12-20T09:40:01Z

Summary

As advertised, with this change we are adding native library support for incremental restores. When designing the solution, we decided to follow a 'tiered' approach where users can pick one of three predefined and, for now, mutually exclusive restore modes (kKeepLatestDbSessionIdFiles, kVerifyChecksum and kPurgeAllFiles [default]), trading CPU / operational velocity for the degree of certainty that their existing destination db files indeed match the contents of the selected backup files. The new mode option is exposed via the existing RestoreOptions configuration, which by now is well established in our APIs. The restore engine will consume this configuration and infer which of the existing destination db files are 'in policy' to be retained during restore.
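To make the tiered option concrete, here is a minimal, self-contained sketch of how such a mode field could be modeled. Only the three mode names come from this description; `RestoreModeSketch`, `RestoreOptionsSketch`, `MayRetainExistingFiles` and `ModeName` are hypothetical stand-ins for illustration and are not the declarations added by this PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the three restore modes named in the summary.
// These are illustration-only types, not the real RocksDB declarations.
enum class RestoreModeSketch {
  kKeepLatestDbSessionIdFiles,  // cheapest: trust file metadata
  kVerifyChecksum,              // costlier: compare file checksums
  kPurgeAllFiles                // default: discard existing files, copy all
};

// Hypothetical options struct carrying the mode, defaulting to the
// most conservative behavior as the summary describes.
struct RestoreOptionsSketch {
  RestoreModeSketch mode = RestoreModeSketch::kPurgeAllFiles;
};

// True when the selected mode allows existing destination db files
// to be retained instead of being rewritten from the backup.
inline bool MayRetainExistingFiles(const RestoreOptionsSketch& opts) {
  return opts.mode != RestoreModeSketch::kPurgeAllFiles;
}

// Human-readable name, handy for logging which tier was chosen.
inline std::string ModeName(RestoreModeSketch mode) {
  switch (mode) {
    case RestoreModeSketch::kKeepLatestDbSessionIdFiles:
      return "kKeepLatestDbSessionIdFiles";
    case RestoreModeSketch::kVerifyChecksum:
      return "kVerifyChecksum";
    case RestoreModeSketch::kPurgeAllFiles:
      return "kPurgeAllFiles";
  }
  return "unknown";
}

// Usage example: the default never retains files; opting in is explicit.
inline void ModeExample() {
  RestoreOptionsSketch opts;
  assert(!MayRetainExistingFiles(opts));
  opts.mode = RestoreModeSketch::kKeepLatestDbSessionIdFiles;
  assert(MayRetainExistingFiles(opts));
}
```

Keeping kPurgeAllFiles as the default in the sketch mirrors the description: callers who do nothing get a full restore, and retention of existing files is strictly opt-in.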

Motivation

This work is motivated by an internal customer who is running a write-heavy, 1M+ QPS service and is using RocksDB restore functionality to scale up their fleet. Given the already organically high QPS on their end, the additional traffic from restores as they work today causes prolonged spikes that lead the service to hit BLOB storage quotas, which in turn slows down the pace of scaling. Please see T206217267 for more.

Impact

The expected impact of this work is twofold: 1) reducing the library footprint on BLOB storage quotas, 2) speeding up the scale-up procedure. After landing this, we should follow up with the customers on observed gains to attach numbers to the above predictions.

Technical nuances

1) According to prior investigations, the risk of collisions on [file #, db session id, file size] metadata triplets is low enough that we can confidently use the triplet to uniquely describe a file and its perceived contents, which is the rationale behind the kKeepLatestDbSessionIdFiles mode. To find out more about the risks / tradeoffs of using this mode, please check the related comment in backup_engine.cc.
2) kVerifyChecksum mode requires a full blob / SST file scan (assuming the backup file has its checksum_hex metadata set appropriately; if not, an additional file scan is needed for the backup file). While it saves us write IOs (if the checksums match), it is still a fairly complex and potentially CPU-intensive operation.
3) We're extending the WorkItemType enum introduced in "Generalize work item definition in BackupEngineImpl" (#13228) to accommodate a new simple request, ComputeChecksum, which will enable us to run 2) concurrently.
4) Note that it's necessary to compute the checksum of the restored file if the checksums of the corresponding backup file and the existing destination db file didn't match.
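The per-file decision described in these notes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in backup_engine.cc: `SketchMode`, `FileMeta`, `Action`, `SameTriplet`, `Decide` and `NeedsPostRestoreChecksum` are hypothetical names, and the sketch assumes both checksums have already been computed before the comparison.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical illustration types; not the real BackupEngineImpl internals.
enum class SketchMode { kKeepLatestDbSessionIdFiles, kVerifyChecksum, kPurgeAllFiles };
enum class Action { kKeepExisting, kCopyFromBackup };

struct FileMeta {
  uint64_t file_number;
  std::string db_session_id;
  uint64_t file_size;
  std::string checksum_hex;  // empty means "not available" in this sketch
};

// Compare the [file #, db session id, file size] triplet from the notes above.
inline bool SameTriplet(const FileMeta& a, const FileMeta& b) {
  return std::tie(a.file_number, a.db_session_id, a.file_size) ==
         std::tie(b.file_number, b.db_session_id, b.file_size);
}

// Decide whether an existing destination file can be retained.
inline Action Decide(SketchMode mode, const FileMeta& backup,
                     const FileMeta& existing) {
  switch (mode) {
    case SketchMode::kKeepLatestDbSessionIdFiles:
      // Metadata-only check: no file contents are read.
      return SameTriplet(backup, existing) ? Action::kKeepExisting
                                           : Action::kCopyFromBackup;
    case SketchMode::kVerifyChecksum:
      // Assumes both checksums were computed earlier (a full file scan each).
      return (!backup.checksum_hex.empty() &&
              backup.checksum_hex == existing.checksum_hex)
                 ? Action::kKeepExisting
                 : Action::kCopyFromBackup;
    case SketchMode::kPurgeAllFiles:
      return Action::kCopyFromBackup;
  }
  return Action::kCopyFromBackup;
}

// Per the last note: after a checksum mismatch the freshly restored file
// still has to be checksummed.
inline bool NeedsPostRestoreChecksum(SketchMode mode, Action action) {
  return mode == SketchMode::kVerifyChecksum &&
         action == Action::kCopyFromBackup;
}

// Usage example with made-up metadata values.
inline void DecideExample() {
  FileMeta backup{12, "ABC123", 4096, "deadbeef"};
  FileMeta existing{12, "ABC123", 4096, "0badf00d"};
  // Triplet matches, so the metadata-only mode keeps the file...
  assert(Decide(SketchMode::kKeepLatestDbSessionIdFiles, backup, existing) ==
         Action::kKeepExisting);
  // ...while the checksum mode detects the mismatch and copies it.
  assert(Decide(SketchMode::kVerifyChecksum, backup, existing) ==
         Action::kCopyFromBackup);
}
```

The example shows the tradeoff between the tiers: the same pair of files is retained under the metadata-only mode but rewritten under the checksum mode, which is the extra certainty the additional scan pays for.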

Test plan

Manually testing using debugger: ✅
Automated tests: WIP 🚧

facebook-github-bot · 2024-12-20T09:42:25Z

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-12-20T09:43:22Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2024-12-20T09:44:34Z

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-12-20T10:03:01Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2024-12-20T10:06:53Z

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot added the CLA Signed label Dec 20, 2024

mszeszko-meta force-pushed the incremental_restore branch from b9cc02d to f9d0de2 Compare December 20, 2024 09:43

First logic draft

aab8ef8

mszeszko-meta force-pushed the incremental_restore branch from f9d0de2 to aab8ef8 Compare December 20, 2024 10:02
