
fix: fix potential data loss for shared source #19443

Merged
1 commit merged into main on Nov 20, 2024

Conversation

@xxchan (Member) commented Nov 19, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Setup: create a shared Kafka source and one MV on the source.

Data loss happens when:

  1. alter the source to set its rate limit to 0
  2. push some new data to the Kafka topic
  3. resume the source. The MV doesn't get the data; if we push more data to the topic, only the new data comes through.

Reason:
In #16626 we introduced an optimization to let the shared SourceExecutor start from the latest offsets, but the implementation is problematic. Specifically, hack_seek_to_latest not only takes effect at the beginning, but also when rebuilding the source reader (which happens when a rate limit is applied).

The new implementation in this PR:

  • Remove the implicit hack_seek_to_latest flag, which is error-prone.
  • Replace it with an explicit seek_to_latest call, which also returns the latest offsets so that SplitImpl and SourceReader stay consistent (a rough sketch of this flow follows the list).
  • Make sure the splits are written to the state store even if no messages come.
  • Do not pause the SourceExecutor (feat: pause shared SourceExecutor until a downstream actor is created #16348). It is not very useful and only adds confusion; seeking to latest should be good enough.
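
A rough sketch of the intended flow, using simplified stand-in types rather than the actual RisingWave API (the `SplitImpl`, `SourceReader`, and offsets below are illustrative only): seek to latest exactly once when the executor starts, persist the returned offsets, and let later reader rebuilds resume from the stored splits.

```rust
#[derive(Clone)]
struct SplitImpl {
    split_id: String,
    offset: u64,
}

// Stand-in for a broker metadata query (e.g. Kafka end offsets).
fn fetch_latest_offset(_split_id: &str) -> u64 {
    100
}

struct SourceReader;

impl SourceReader {
    /// Build a reader. If `seek_to_latest` is true, also return the splits
    /// updated to the latest offsets so the caller can write them to state.
    fn build(
        &self,
        mut splits: Vec<SplitImpl>,
        seek_to_latest: bool,
    ) -> (String /* pretend stream */, Option<Vec<SplitImpl>>) {
        if seek_to_latest {
            for s in &mut splits {
                s.offset = fetch_latest_offset(&s.split_id);
            }
            ("stream@latest".into(), Some(splits))
        } else {
            // Rebuilds resume from the offsets carried by the splits.
            ("stream@stored-offsets".into(), None)
        }
    }
}

fn main() {
    let reader = SourceReader;
    let initial = vec![SplitImpl { split_id: "partition-0".into(), offset: 0 }];

    // First build at executor start: seek to latest and persist the returned splits.
    let (_stream, latest) = reader.build(initial, true);
    let persisted = latest.expect("latest splits are returned when seeking");
    assert_eq!(persisted[0].offset, 100);

    // Later rebuild (e.g. after a rate-limit change): no implicit seek, no skipped data.
    let (_stream, none) = reader.build(persisted, false);
    assert!(none.is_none());
}
```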

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@github-actions github-actions bot added the type/fix Bug fix label Nov 19, 2024

@xxchan xxchan marked this pull request as ready for review November 19, 2024 06:55
@xxchan xxchan force-pushed the 11-18-fix_fix_potential_data_loss_for_shared_source branch from 53349f8 to a29b177 on November 19, 2024 06:56
@xxchan xxchan force-pushed the 11-18-fix_fix_potential_data_loss_for_shared_source branch 5 times, most recently from 39887aa to 4469944 on November 19, 2024 07:59
@tabVersion (Contributor) commented, quoting the PR description:

Specifically, hack_seek_to_latest not only takes effect at the beginning, but also when rebuilding the source reader (which happens when a rate limit is applied).

So when receiving a new mutation on rate_limit, the source exec refreshes the high watermark to hw_1 but the source backfill exec keeps the original high watermark hw_0 as the end position of backfill. Is my understanding correct?

@xxchan (Member, Author) commented Nov 19, 2024

It's not related to the backfill's position. We can assume the backfill has already finished and is just forwarding messages now.

Rebuilding will make the source executor jump from hw_0 to hw_1.
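
To make the failure mode concrete, here is a minimal sketch of the old behavior with made-up offsets (this is not the actual executor code; hw_0 and hw_1 stand for the broker's high watermarks before and after the rate-limit change):

```rust
// The implicit flag is honored on *every* reader (re)build, so a rebuild
// triggered by a rate-limit change silently skips data produced in between.
struct Reader {
    hack_seek_to_latest: bool,
}

impl Reader {
    fn resolve_start_offset(&self, latest_broker_offset: u64, stored_offset: u64) -> u64 {
        if self.hack_seek_to_latest {
            // Correct at the very first build, wrong on rebuilds:
            // everything in (stored_offset, latest_broker_offset] is never read.
            latest_broker_offset
        } else {
            stored_offset
        }
    }
}

fn main() {
    let reader = Reader { hack_seek_to_latest: true };

    // First build: the broker's high watermark is hw_0 = 10.
    // Starting there is the intended optimization.
    assert_eq!(reader.resolve_start_offset(10, 0), 10);

    // Rate limit changes -> reader is rebuilt. The broker has advanced to hw_1 = 25,
    // but the flag is still set, so we resume at hw_1 and the messages
    // between hw_0 and hw_1 are never delivered to the MV.
    assert_eq!(reader.resolve_start_offset(25, 10), 25);
}
```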

@xxchan xxchan force-pushed the 11-18-fix_fix_potential_data_loss_for_shared_source branch from 4469944 to cf75bff on November 19, 2024 08:41
@graphite-app graphite-app bot requested a review from a team November 19, 2024 09:14
@lmatz lmatz added this pull request to the merge queue Nov 19, 2024
@lmatz lmatz removed this pull request from the merge queue due to a manual request Nov 19, 2024
@xxchan xxchan force-pushed the 11-18-fix_fix_potential_data_loss_for_shared_source branch from cf75bff to 2ff1259 on November 20, 2024 02:09
@xxchan xxchan requested a review from chenzl25 November 20, 2024 06:21
@xxchan (Member, Author) commented Nov 20, 2024

Want to wait a while for more reviews. Just in case.

@@ -232,7 +232,7 @@ impl ExecutorBuilder for SourceExecutorBuilder {
             barrier_receiver,
             system_params,
             source.rate_limit,
-            is_shared,
+            is_shared && !source.with_properties.is_cdc_connector(),
A reviewer (Member) commented on this change:
Why change this?

@xxchan (Member, Author) replied:
I didn't implement seek-to-latest for CDC, and it would hit an error.

@@ -211,14 +211,17 @@ impl SourceReader {
     }

     /// Build `SplitReader`s and then `BoxChunkSourceStream` from the given `ConnectorState` (`SplitImpl`s).
+    ///
+    /// If `seek_to_latest` is true, will also return the latest splits after seek.
A reviewer (Member) commented on the added doc line:
What if we always return the splits?

@xxchan (Member, Author) replied:
Sounds OK to me. But ConnectorState is also Option<Vec<SplitImpl>> (which is also a little unnecessary to me), so perhaps we should refactor that together. NTFS
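
For illustration, a hypothetical contrast of the two shapes being discussed (the real SourceReader signatures differ; the types below are stubs):

```rust
type Stream = (); // stand-in for BoxChunkSourceStream
struct SplitImpl; // stand-in for the real SplitImpl

// Shape in this PR: the (latest) splits are returned only when seeking.
fn build_stream(
    _state: Option<Vec<SplitImpl>>, // ConnectorState is currently Option<Vec<SplitImpl>>
    seek_to_latest: bool,
) -> (Stream, Option<Vec<SplitImpl>>) {
    let latest = if seek_to_latest { Some(vec![SplitImpl]) } else { None };
    ((), latest)
}

// Alternative raised in review: always return the (possibly updated) splits,
// which would also let ConnectorState drop its Option wrapper.
fn build_stream_always(splits: Vec<SplitImpl>, _seek_to_latest: bool) -> (Stream, Vec<SplitImpl>) {
    ((), splits)
}

fn main() {
    let (_s, maybe_latest) = build_stream(Some(vec![SplitImpl]), true);
    assert!(maybe_latest.is_some());

    let (_s, always) = build_stream_always(vec![SplitImpl], false);
    assert_eq!(always.len(), 1);
}
```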

@xxchan (Member, Author) commented Nov 20, 2024

Merge activity

  • Nov 20, 3:57 PM GMT+8: A user started a stack merge that includes this pull request via Graphite.
  • Nov 20, 3:58 PM GMT+8: Graphite couldn't merge this PR because it failed for an unknown reason (This repository has GitHub's merge queue enabled, which is currently incompatible with Graphite).

@xxchan (Member, Author) commented Nov 20, 2024

Will cherry-pick the whole stack together.

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 20, 2024
@xxchan xxchan added this pull request to the merge queue Nov 20, 2024
@xxchan xxchan removed this pull request from the merge queue due to a manual request Nov 20, 2024
@xxchan xxchan added this pull request to the merge queue Nov 20, 2024
Merged via the queue into main with commit b9c3f70 Nov 20, 2024
30 of 31 checks passed
@xxchan xxchan deleted the 11-18-fix_fix_potential_data_loss_for_shared_source branch November 20, 2024 10:39
github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2024
Labels
type/fix Bug fix

4 participants