
S3/GCS/Azure Source: Enhanced Data Reload Strategy for Specific Timeframes #1187

Merged: 7 commits into master from feat/datalakes_key_namer, Apr 30, 2024

Conversation

@stheppi (Contributor) commented Apr 29, 2024

Overview

This pull request addresses the challenge of efficiently reloading data from a specific point in time (e.g., last week, yesterday). Currently, the naive approach involves loading data from the beginning and filtering out records, which proves impractical and costly, especially with large data volumes.

Proposed Solution

To enhance data reloads for specific timeframes, this change introduces a more refined strategy. Instead of loading all stored data and filtering records, the solution leverages timestamp metadata associated with each data file. By utilizing this metadata, the source can selectively load relevant data files, minimizing unnecessary data transfer and processing overhead.

This PR changes the key (filename) values to carry the earliest and latest record timestamps for the records within each file. This allows the source to discard files that fall outside the requested window, making the initial seek cost proportional to listing the keys and applying the timestamp filters for each topic-partition. Records read from the remaining files undergo further filtering, ensuring only data within the specified timeframe is processed.
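To make the file-skipping idea concrete, here is a minimal sketch of timestamp-range filtering on key names. The key layout, the regex, and all names below are assumptions for illustration only, not the connector's actual key format or API:

```scala
// Illustrative sketch only: the key layout, names, and file extension are
// assumptions for this example, not the connector's actual implementation.
object TimestampKeyFilter {

  // Hypothetical key layout: <topic>/<partition>/<paddedOffset>_<earliestTs>_<latestTs>.<ext>
  private val KeyPattern = raw".*/\d+_(\d+)_(\d+)\.\w+".r

  /** Returns true if the file's [earliest, latest] timestamp range intersects
    * the requested reload window [fromTs, toTs]; only such files need reading. */
  def overlapsWindow(key: String, fromTs: Long, toTs: Long): Boolean =
    key match {
      case KeyPattern(earliest, latest) =>
        earliest.toLong <= toTs && latest.toLong >= fromTs
      case _ =>
        true // unknown key format: fall back to reading the file
    }
}
```

Under this assumed layout, a key such as `orders/3/00000000000000001234_1714000000000_1714003600000.avro` could be skipped entirely, without opening the file, whenever the requested window ends before 1714000000000 or starts after 1714003600000.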

Changes

This change exclusively affects the envelope storage mode. In non-envelope storage, there is no inherent storage mechanism for the record timestamp. However, to maintain consistency across storage modes, we've introduced the TopicPartitionOffsetFileNamerV1.

It's important to note that the lexicographic order remains unaffected when storing both envelope and non-envelope data. This is because the padded-offset continues to determine the order in both scenarios.

stheppi and others added 6 commits on April 25, 2024 at 16:39:

- …me including the earliest record timestamp.
- Refactored a few parameter/field names around the object key name.
- …sed. It is the only storage mode that guarantees the record timestamp is preserved.
- Sink tests have been updated to reflect the object key values.
- Avoids Avro invalid sync as a result of concurrent tests writing the same file.
- …amp within the file. This change reduces the complexity of the initial seek when a load from a specific point in time is requested.
@stheppi stheppi marked this pull request as ready for review April 29, 2024 14:34
@stheppi stheppi changed the title S3/GCS/Azure Source: Support data reloads from a point in time S3/GCS/Azure Source: Enhanced Data Reload Strategy for Specific Timeframes Apr 29, 2024
For temp files/folders, call deleteOnExit.
@stheppi stheppi merged commit 9b1722f into master Apr 30, 2024
154 checks passed
@stheppi stheppi deleted the feat/datalakes_key_namer branch April 30, 2024 15:04