[CELEBORN-1700][FOLLOWUP] Fix flaky test RemoteShuffleMasterSuiteJ - …
…testRegisterPartitionWithProducer

### What changes were proposed in this pull request?
Increase `celeborn.client.application.heartbeatInterval` from the default `10s` to `30s` to fix the flaky test `RemoteShuffleMasterSuiteJ`.

### Why are the changes needed?
`RemoteShuffleMasterSuiteJ` frequently fails when asserting `lifecycleManager().shuffleCount() == 3`:

```
Error:  Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 52.186 s <<< FAILURE! - in org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ
Error:  org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ.testRegisterPartitionWithProducer  Time elapsed: 10.05 s  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<0>
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:633)
	at org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ.testRegisterPartitionWithProducer(RemoteShuffleMasterSuiteJ.java:146)
```

https://github.com/apache/celeborn/blob/680b072b5bea852e8cf7733f0ec4d8aea127c51f/client-flink/flink-1.15/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleMasterSuiteJ.java#L146

`lifecycleManager().shuffleCount()` is reset whenever the application heartbeat is reported, so the test fails if it runs for longer than the default application heartbeat interval of `10s`.

https://github.com/apache/celeborn/blob/680b072b5bea852e8cf7733f0ec4d8aea127c51f/client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala#L210-L220
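The timing issue can be illustrated with a small, simplified model. This is only a sketch, not Celeborn's code: the class `HeartbeatResetSketch`, its methods, and the simulated clock are hypothetical, and the only assumption taken from the description above is that the heartbeat reads the counter and resets it. The `10_050` ms duration mirrors the `10.05 s` elapsed time in the failure log.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical model of a counter that a periodic heartbeat reports and resets. */
public class HeartbeatResetSketch {
    private final LongAdder shuffleCount = new LongAdder();
    private final long heartbeatIntervalMs;
    private long nextHeartbeatAtMs;

    public HeartbeatResetSketch(long heartbeatIntervalMs) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.nextHeartbeatAtMs = heartbeatIntervalMs;
    }

    public void registerShuffle() {
        shuffleCount.increment();
    }

    /** Advances a simulated clock to nowMs and fires every heartbeat due by then. */
    public void advanceTo(long nowMs) {
        while (nextHeartbeatAtMs <= nowMs) {
            // The heartbeat reports the current value and zeroes the counter.
            shuffleCount.sumThenReset();
            nextHeartbeatAtMs += heartbeatIntervalMs;
        }
    }

    public long shuffleCount() {
        return shuffleCount.sum();
    }

    /** Registers three shuffles, then reads the count once testDurationMs has elapsed. */
    public static long countSeenAfter(long heartbeatIntervalMs, long testDurationMs) {
        HeartbeatResetSketch model = new HeartbeatResetSketch(heartbeatIntervalMs);
        model.registerShuffle();
        model.registerShuffle();
        model.registerShuffle();
        model.advanceTo(testDurationMs);
        return model.shuffleCount();
    }

    public static void main(String[] args) {
        // 10s interval: one heartbeat fires before the read, so the count is 0.
        System.out.println(countSeenAfter(10_000, 10_050));
        // 30s interval: no heartbeat fires before the read, so the count is 3.
        System.out.println(countSeenAfter(30_000, 10_050));
    }
}
```

In this model a test that takes slightly longer than the heartbeat interval observes `0` instead of `3`, which matches the `expected:<3> but was:<0>` failure. A longer interval widens the window in which the assertion can run safely, but does not remove the dependency on timing.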

So, in this PR, we increase the application heartbeat interval from the default `10s` to `30s` to make the test less flaky.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #3025 from turboFei/fix_RemoteShuffleMasterSuiteJ_failure.

Authored-by: Wang, Fei <[email protected]>
Signed-off-by: SteNicholas <[email protected]>
turboFei authored and SteNicholas committed Dec 24, 2024
1 parent 27e34ec commit 6028a04
Showing 7 changed files with 7 additions and 0 deletions.
@@ -64,6 +64,7 @@ public void setUp() {
     int startPort = Utils$.MODULE$.selectRandomInt(1024, 65535);
     configuration.setInteger("celeborn.master.port", startPort);
     configuration.setString("celeborn.master.endpoints", "localhost:" + startPort);
+    configuration.setString("celeborn.client.application.heartbeatInterval", "30s");
     remoteShuffleMaster = createShuffleMaster(configuration);
   }

@@ -64,6 +64,7 @@ public void setUp() {
     int startPort = Utils$.MODULE$.selectRandomInt(1024, 65535);
     configuration.setInteger("celeborn.master.port", startPort);
     configuration.setString("celeborn.master.endpoints", "localhost:" + startPort);
+    configuration.setString("celeborn.client.application.heartbeatInterval", "30s");
     remoteShuffleMaster = createShuffleMaster(configuration);
   }

@@ -73,6 +73,7 @@ public void setUp() {
     int startPort = Utils$.MODULE$.selectRandomInt(1024, 65535);
     configuration.setInteger("celeborn.master.port", startPort);
     configuration.setString("celeborn.master.endpoints", "localhost:" + startPort);
+    configuration.setString("celeborn.client.application.heartbeatInterval", "30s");
     remoteShuffleMaster = createShuffleMaster(configuration);
   }

@@ -73,6 +73,7 @@ public void setUp() {
     int startPort = Utils$.MODULE$.selectRandomInt(1024, 65535);
     configuration.setInteger("celeborn.master.port", startPort);
     configuration.setString("celeborn.master.endpoints", "localhost:" + startPort);
+    configuration.setString("celeborn.client.application.heartbeatInterval", "30s");
     remoteShuffleMaster = createShuffleMaster(configuration);
   }

@@ -73,6 +73,7 @@ public void setUp() {
     int startPort = Utils$.MODULE$.selectRandomInt(1024, 65535);
     configuration.setInteger("celeborn.master.port", startPort);
     configuration.setString("celeborn.master.endpoints", "localhost:" + startPort);
+    configuration.setString("celeborn.client.application.heartbeatInterval", "30s");
     remoteShuffleMaster = createShuffleMaster(configuration);
   }

@@ -73,6 +73,7 @@ public void setUp() {
     int startPort = Utils$.MODULE$.selectRandomInt(1024, 65535);
     configuration.setInteger("celeborn.master.port", startPort);
     configuration.setString("celeborn.master.endpoints", "localhost:" + startPort);
+    configuration.setString("celeborn.client.application.heartbeatInterval", "30s");
     remoteShuffleMaster = createShuffleMaster(configuration);
   }

@@ -76,6 +76,7 @@ public void setUp() {
     int startPort = Utils$.MODULE$.selectRandomInt(1024, 65535);
     configuration.setInteger("celeborn.master.port", startPort);
     configuration.setString("celeborn.master.endpoints", "localhost:" + startPort);
+    configuration.setString("celeborn.client.application.heartbeatInterval", "30s");
     remoteShuffleMaster = createShuffleMaster(configuration);
   }
