Native support for incremental restore #13239
Open
Summary
As advertised, this change adds native library support for incremental restores. When designing the solution, we decided to follow a 'tiered' approach in which users pick one of three predefined and, for now, mutually exclusive restore modes (kKeepLatestDbSessionIdFiles, kVerifyChecksum and kPurgeAllFiles [default]), trading CPU / operational velocity for the degree of certainty that their existing destination db files indeed match the contents of the selected backup files. The new mode option is exposed via the existing RestoreOptions configuration, which by now is well baked into our APIs. The restore engine will consume this configuration and infer which of the existing destination db files are 'in policy' to be retained during restore.
Motivation
This work is motivated by an internal customer who runs a write-heavy, 1M+ QPS service and uses RocksDB restore functionality to scale up their fleet. Given the already high organic QPS on their end, the additional traffic from restores as they work today causes prolonged spikes that lead the service to hit BLOB storage quotas, which in turn slows down the pace of scaling. Please see T206217267 for more.
Impact
The expected impact of this work is twofold: 1) reducing the library's footprint on BLOB storage quotas, 2) speeding up the scale-up procedure. After landing this, we should follow up with the customers on observed gains to attach numbers to the above predictions.
Technical nuances
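To make the tradeoff between the three modes concrete, here is a minimal, self-contained sketch of the retention decision. It is not the code in backup_engine.cc: the RestoreMode enum, the FileInfo struct and the IsInPolicy helper are hypothetical names used only for illustration. It also assumes that kKeepLatestDbSessionIdFiles relies on a cheap metadata comparison (file number, db session id, file size) and that checksums have already been computed by the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the three restore modes named in this PR.
// This is NOT the RocksDB declaration.
enum class RestoreMode {
  kKeepLatestDbSessionIdFiles,
  kVerifyChecksum,
  kPurgeAllFiles,  // default: nothing is retained, everything is re-copied
};

// Hypothetical per-file metadata, used only for this illustration.
struct FileInfo {
  uint64_t number;
  std::string db_session_id;
  uint64_t size;
  std::string checksum_hex;  // empty if unknown
};

// Returns true if the existing destination file may be kept instead of
// being copied again from the backup.
bool IsInPolicy(RestoreMode mode, const FileInfo& existing,
                const FileInfo& backup) {
  switch (mode) {
    case RestoreMode::kPurgeAllFiles:
      return false;
    case RestoreMode::kKeepLatestDbSessionIdFiles:
      // Assumption: metadata-only match, no file contents are read.
      return existing.number == backup.number &&
             !existing.db_session_id.empty() &&
             existing.db_session_id == backup.db_session_id &&
             existing.size == backup.size;
    case RestoreMode::kVerifyChecksum:
      // Assumption: both checksums were obtained by scanning the files
      // (or from metadata) before this call.
      return !existing.checksum_hex.empty() &&
             existing.checksum_hex == backup.checksum_hex;
  }
  return false;
}
```

In this sketch, a destination file whose contents changed but whose metadata still matches would be retained under kKeepLatestDbSessionIdFiles and rejected under kVerifyChecksum, which is the certainty-versus-cost tradeoff in miniature.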
1) kKeepLatestDbSessionIdFiles mode. To learn more about the risks / tradeoffs of using this mode, please check the related comment in backup_engine.cc.
2) kVerifyChecksum mode requires a full blob / SST file scan (plus an additional scan of the backup file if its checksum_hex metadata is not set appropriately). While it saves us write IOs (if checksums match), it is still a fairly complex and potentially CPU-intensive operation.
3) The WorkItemType enum introduced in "Generalize work item definition in BackupEngineImpl" (#13228) is extended to accommodate a new simple request to ComputeChecksum, which will enable us to run 2) concurrently.
Test plan