Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1700][FOLLOWUP] Fix flaky test RemoteShuffleMasterSuiteJ - testRegisterPartitionWithProducer #3025

Closed

Conversation

turboFei
Copy link
Member

@turboFei turboFei commented Dec 23, 2024

What changes were proposed in this pull request?

Increase celeborn.client.application.heartbeatInterval from default 10s to 30s to fix flaky test RemoteShuffleMasterSuiteJ.

Why are the changes needed?

Many flaky test failure for RemoteShuffleMasterSuiteJ when assert the lifecycleManager().shuffleCount() == 3.

Error:  Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 52.186 s <<< FAILURE! - in org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ
Error:  org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ.testRegisterPartitionWithProducer  Time elapsed: 10.05 s  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<0>
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:633)
	at org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ.testRegisterPartitionWithProducer(RemoteShuffleMasterSuiteJ.java:146)

Assert.assertEquals(3, remoteShuffleMaster.lifecycleManager().shuffleCount().sum());

The lifecycleManager().shuffleCount() would reset when reporting application heartbeat, so the test would fail if its duration is more than default application heartbeat interval, 10s.

new ApplicationHeartbeater(
appUniqueId,
conf,
masterClient,
() => {
commitManager.commitMetrics() ->
(shuffleCount.sumThenReset(), resetShuffleFallbackCounts())
},
workerStatusTracker,
registeredShuffle,
reason => cancelAllActiveStages(reason))

So, in this PR, we increase the application heartbeat interval from defaults 10s to 30s to reduce the flaky test.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

GA.

@turboFei turboFei force-pushed the fix_RemoteShuffleMasterSuiteJ_failure branch from fc89363 to 16ffb39 Compare December 23, 2024 18:22
@turboFei turboFei changed the title [CELEBORN-1700][FOLLOWUP] Fix flaky test RemoteShuffleMasterSuiteJ: testRegisterPartitionWithProducer [CELEBORN-1700][FOLLOWUP] Fix flaky test RemoteShuffleMasterSuiteJ - testRegisterPartitionWithProducer Dec 23, 2024
Copy link
Contributor

@zaynt4606 zaynt4606 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@turboFei turboFei requested a review from SteNicholas December 24, 2024 03:17
@SteNicholas
Copy link
Member

Merged to main(v0.6.0).

@turboFei turboFei deleted the fix_RemoteShuffleMasterSuiteJ_failure branch December 24, 2024 03:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants