
S3/GCS/Azure Source: Enhanced Data Reload Strategy for Specific Timeframes #1187

Merged: 7 commits into master from feat/datalakes_key_namer, Apr 30, 2024

Conversation

@stheppi (Contributor) commented Apr 29, 2024

Overview

This pull request addresses the challenge of efficiently reloading data from a specific point in time (e.g., last week, yesterday). Currently, the naive approach involves loading data from the beginning and filtering out records, which proves impractical and costly, especially with large data volumes.

Proposed Solution

To enhance data reloads for specific timeframes, this change introduces a more refined strategy. Instead of loading all stored data and filtering records, the solution leverages timestamp metadata associated with each data file. By utilizing this metadata, the source can selectively load relevant data files, minimizing unnecessary data transfer and processing overhead.

This PR changes the key (filename) values to carry the earliest and latest record timestamps for the records within each file. This allows the source to discard files that fall outside the requested window, making the initial seek cost proportional to listing the keys and applying the timestamp filters for each topic-partition. Records read from the remaining files undergo further filtering, ensuring only data within the specified timeframe is processed.
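To make the file-skipping idea concrete, here is a minimal sketch of timestamp-range filtering on key names. The key layout, the regex, and all names below are assumptions for illustration only, not the connector's actual key format or API:

```scala
// Illustrative sketch only: the key layout, names, and file extension are
// assumptions for this example, not the connector's actual implementation.
object TimestampKeyFilter {

  // Hypothetical key layout: <topic>/<partition>/<paddedOffset>_<earliestTs>_<latestTs>.<ext>
  private val KeyPattern = raw".*/\d+_(\d+)_(\d+)\.\w+".r

  /** Returns true if the file's [earliest, latest] timestamp range intersects
    * the requested reload window [fromTs, toTs]; only such files need reading. */
  def overlapsWindow(key: String, fromTs: Long, toTs: Long): Boolean =
    key match {
      case KeyPattern(earliest, latest) =>
        earliest.toLong <= toTs && latest.toLong >= fromTs
      case _ =>
        true // unknown key format: fall back to reading the file
    }
}
```

Under this assumed layout, a key such as `orders/3/00000000000000001234_1714000000000_1714003600000.avro` could be skipped entirely, without opening the file, whenever the requested window ends before 1714000000000 or starts after 1714003600000.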

Changes

This change exclusively affects the envelope storage mode. In non-envelope storage, there is no inherent storage mechanism for the record timestamp. However, to maintain consistency across storage modes, we've introduced the TopicPartitionOffsetFileNamerV1.

It's important to note that the lexicographic order remains unaffected when storing both envelope and non-envelope data. This is because the padded-offset continues to determine the order in both scenarios.

stheppi and others added 6 commits on April 25, 2024 at 16:39:

- …me including the earliest record timestamp.
- Refactored a few parameter/field names around the object key name.
- …sed. It is the only storage mode that guarantees the record timestamp is preserved.
- Sink tests have been updated to reflect the object key values.
- Avoids Avro invalid sync as a result of concurrent tests writing the same file.
- …amp within the file. This change reduces the complexity of the initial seek when a load from a specific point in time is requested.
@stheppi stheppi marked this pull request as ready for review April 29, 2024 14:34
@stheppi stheppi changed the title S3/GCS/Azure Source: Support data reloads from a point in time S3/GCS/Azure Source: Enhanced Data Reload Strategy for Specific Timeframes Apr 29, 2024
For temp files/folders, call deleteOnExit.
@stheppi stheppi merged commit 9b1722f into master Apr 30, 2024
154 checks passed
@stheppi stheppi deleted the feat/datalakes_key_namer branch April 30, 2024 15:04