S3/GCS/Azure Source: Enhanced Data Reload Strategy for Specific Timeframes #1187
Overview
This pull request addresses the challenge of efficiently reloading data from a specific point in time (for example, last week or yesterday). Currently, the naive approach is to load data from the beginning and filter out unwanted records, which is impractical and costly with large data volumes.
Proposed Solution
To enhance data reloads for specific timeframes, this change introduces a more refined strategy. Instead of loading all stored data and filtering records, the solution leverages timestamp metadata associated with each data file. By utilizing this metadata, the source can selectively load relevant data files, minimizing unnecessary data transfer and processing overhead.
This PR prepares the key (filename) values to carry the earliest and latest offset for the records within each entry. This allows the source to discard files that should not be processed, so the initial seek time is proportional to the cost of listing the keys and applying the filter for each topic-partition. Records read from the remaining files are then filtered again, so that only data within the specified timeframe is processed.
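The two-stage filtering described above can be sketched roughly as follows. This is an illustration only, not the connector's implementation: the key layout `<topic>/<partition>/<padded offset>_<earliest>_<latest>.<ext>`, the names `KeyRange`, `parse_key`, `select_keys` and `filter_records`, and the choice to treat the bounds as plain integers (for instance epoch milliseconds) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class KeyRange:
    """Bounds carried in a key (hypothetical representation)."""
    key: str
    earliest: int
    latest: int


def parse_key(key: str) -> Optional[KeyRange]:
    """Parse the assumed layout '<topic>/<partition>/<padded>_<earliest>_<latest>.<ext>'.

    Returns None when the file name does not carry the two bounds.
    """
    name = key.rsplit("/", 1)[-1]
    stem = name.split(".", 1)[0]
    parts = stem.split("_")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    return KeyRange(key=key, earliest=int(parts[1]), latest=int(parts[2]))


def select_keys(keys: Iterable[str], start: int, end: int) -> List[str]:
    """Stage 1: keep only keys whose [earliest, latest] range overlaps [start, end].

    Keys without bounds cannot be pruned, so this sketch keeps them and
    relies on the record-level filter (an assumption, not confirmed behaviour).
    """
    selected = []
    for key in keys:
        parsed = parse_key(key)
        if parsed is None or (parsed.latest >= start and parsed.earliest <= end):
            selected.append(key)
    return selected


def filter_records(records: Iterable[Tuple[int, str]], start: int, end: int) -> List[Tuple[int, str]]:
    """Stage 2: drop records from a selected file that fall outside [start, end]."""
    return [(ts, value) for ts, value in records if start <= ts <= end]
```

With this shape, a file whose range ends before the requested window starts is skipped from the listing alone, without being downloaded, while a file that only partially overlaps the window is read and then trimmed record by record.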
Changes
This change exclusively affects the envelope storage mode. In non-envelope storage, there is no inherent storage mechanism for the record timestamp. However, to maintain consistency across storage modes, we've introduced the TopicPartitionOffsetFileNamerV1.
Note that the lexicographic order of keys is unaffected for both envelope and non-envelope data, because the padded offset continues to determine the order in both cases.
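A small sketch of why the padded offset keeps the ordering stable. The pad width of 12, the `.json` extension, and the helper names `padded`, `envelope_key`, `plain_key` and `offset_of` are hypothetical and chosen only for the example; the actual layout produced by TopicPartitionOffsetFileNamerV1 may differ.

```python
PAD_WIDTH = 12  # hypothetical width, chosen for illustration


def padded(offset: int) -> str:
    """Left-pad the offset with zeros so string order matches numeric order."""
    return str(offset).zfill(PAD_WIDTH)


def envelope_key(topic: str, partition: int, offset: int, earliest: int, latest: int) -> str:
    """Assumed envelope-mode key: the padded offset comes first, bounds follow."""
    return f"{topic}/{partition}/{padded(offset)}_{earliest}_{latest}.json"


def plain_key(topic: str, partition: int, offset: int) -> str:
    """Assumed non-envelope key: padded offset only."""
    return f"{topic}/{partition}/{padded(offset)}.json"


def offset_of(key: str) -> int:
    """Recover the offset from the fixed-width prefix of the file name."""
    return int(key.rsplit("/", 1)[-1][:PAD_WIDTH])


# Mixed envelope and non-envelope keys still sort by offset, because the
# fixed-width prefix differs before any suffix is compared.
mixed = [
    plain_key("orders", 0, 9),
    envelope_key("orders", 0, 10, 1000, 1999),
    plain_key("orders", 0, 100),
    envelope_key("orders", 0, 2, 500, 999),
]
ordered_offsets = [offset_of(k) for k in sorted(mixed)]
```

Because every key within a topic-partition starts with a fixed-width offset, two distinct offsets always differ somewhere inside that prefix, so whatever is appended after it never influences the comparison.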