[BUG] org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testDropPrimaryDuringReplication is flaky #8059

Closed
psychbot opened this issue Jun 14, 2023 · 11 comments · Fixed by #8431, #8715 or #8889
Labels: bug, distributed framework, flaky-test, Indexing:Replication

Comments

@psychbot
Member

psychbot commented Jun 14, 2023

Describe the bug
org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testDropPrimaryDuringReplication is flaky

https://build.ci.opensearch.org/job/gradle-check/17543/testReport/junit/org.opensearch.remotestore/SegmentReplicationUsingRemoteStoreIT/testDropPrimaryDuringReplication/

#8057 (comment)

Assertion Failure

java.lang.AssertionError: Expected search hits on node: node_t3 to be at least 111 but was: 110
	at __randomizedtesting.SeedInfo.seed([5E122859418B05B6:F3568B11F61BBC1C]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.opensearch.indices.replication.SegmentReplicationBaseIT.lambda$waitForSearchableDocs$0(SegmentReplicationBaseIT.java:122)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1084)
	at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:117)
	at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:112)
	at org.opensearch.indices.replication.SegmentReplicationIT.testDropPrimaryDuringReplication(SegmentReplicationIT.java:739)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
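
The assertion above comes from a helper that uses assertBusy to poll a node until the expected document count is searchable. A minimal sketch of that pattern, assuming an OpenSearch integration-test base class (the class name, index name, and timeout below are placeholders, not the actual SegmentReplicationBaseIT code):

import java.util.concurrent.TimeUnit;
import org.opensearch.test.OpenSearchIntegTestCase;

// Illustrative sketch only; not the real SegmentReplicationBaseIT implementation.
public class SearchableDocsSketchIT extends OpenSearchIntegTestCase {

    protected void waitForSearchableDocs(long docCount, String node) throws Exception {
        // assertBusy retries the assertion until it passes or the timeout expires,
        // so a replica whose reader never refreshes keeps reporting the stale hit
        // count and eventually fails with "to be at least N but was: N-1".
        assertBusy(() -> {
            long hits = client(node).prepareSearch("test-index")
                .setSize(0)
                .setTrackTotalHits(true)
                .get()
                .getHits()
                .getTotalHits()
                .value;
            assertTrue("Expected search hits on node: " + node + " to be at least "
                + docCount + " but was: " + hits, hits >= docCount);
        }, 1, TimeUnit.MINUTES);
    }
}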

To Reproduce

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testDropPrimaryDuringReplication" -Dtests.seed=5E122859418B05B6 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-CO -Dtests.timezone=Asia/Bishkek -Druntime.java=20
psychbot added the bug and untriaged labels on Jun 14, 2023
@psychbot
Member Author

@ankitkala
Tagging the IT author to take an initial look.

@ankitkala
Member

I triggered more than 100 runs of this test locally and was not able to reproduce the failure. Let's revisit this if we see another such occurrence.

I also went through the gradle check logs shared above but couldn't find anything useful.

@sachinpkale
Member

One more failure: https://build.ci.opensearch.org/job/gradle-check/18866/

Can we please mute these tests?

@ankitkala
Member

I've added a minor change to hopefully avoid the flakiness: #8431

If the failure still persists after the change, I will mute the test until we fix it.

@andrross
Member

Another flaky failure here: #8667 (comment)

@mch2
Member

mch2 commented Jul 15, 2023

This test is failing because this maybeRefresh call fails to acquire the refresh lock and returns without updating the reader reference with the updated SegmentInfos. The refresh that holds the lock is the one triggered via the API in the test here. The primary must be initiating copy internally on a scheduled refresh before the refresh that starts the replication cycle is triggered. When the refresh request hits the replica, it forces a refresh on NRTReplicationEngine directly here, which acquires the lock, so the maybeRefresh during the segment update does nothing. We are left in a state where updateSegments completes and the reader holds the updated segment reference, but has not internally refreshed.
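
For context, Lucene's ReferenceManager.maybeRefresh() is best-effort: if another thread already holds the refresh lock it returns false immediately, without refreshing on behalf of the caller, whereas maybeRefreshBlocking() waits for the lock. A minimal sketch of that distinction (illustrative only; SearcherManager is used here as a stand-in for the engine's reader manager):

import java.io.IOException;
import org.apache.lucene.search.SearcherManager;

// Illustrative only: shows the Lucene ReferenceManager semantics described above.
class RefreshSemanticsSketch {

    static void refreshExamples(SearcherManager readerManager) throws IOException {
        // Best-effort: returns false right away if another thread is already
        // refreshing, giving this caller no guarantee that the reader now
        // reflects the latest segments.
        boolean refreshed = readerManager.maybeRefresh();
        if (!refreshed) {
            // Another thread held the refresh lock; our view may be stale.
        }

        // Blocking: waits for the refresh lock, so a refresh is guaranteed to
        // have completed by the time this call returns.
        readerManager.maybeRefreshBlocking();
    }
}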

There are a few things I think we should do here.

  1. The refresh from updateSegments should be blocking to ensure it is executed (see the sketch after this list).
  2. No-op the internal refresh/maybeRefresh methods in NRTReplicationEngine. Refresh on NRT replicas should only be triggered from updateSegments; any call to refresh outside of this method was intended (with this change) to manually release wait listeners so that they weren't held indefinitely during segment copy.
  3. Rather than triggering a refresh simply to release listeners, prevent NRT replicas from creating the listeners in the first place.
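
A rough sketch of what item 1 could look like, assuming a Lucene-style reader manager like the one above (the class, field, and method names are placeholders, not the actual NRTReplicationEngine code):

import java.io.IOException;
import org.apache.lucene.search.SearcherManager;

// Illustrative sketch of proposal (1) only.
class BlockingUpdateSegmentsSketch {

    private final SearcherManager readerManager;

    BlockingUpdateSegmentsSketch(SearcherManager readerManager) {
        this.readerManager = readerManager;
    }

    // Stand-in for the updateSegments path described in the comment above.
    void updateSegments() throws IOException {
        // ... the real engine swaps in the newly copied SegmentInfos here ...

        // Blocking variant: waits for the refresh lock instead of returning
        // silently when an API-triggered refresh holds it, so the reader is
        // guaranteed to have refreshed over the new segments before the
        // replication event completes.
        readerManager.maybeRefreshBlocking();
    }
}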

@mch2
Member

mch2 commented Jul 17, 2023

I've opened a PR to fix the immediate problem of not acquiring the refresh lock, and will make a separate change to skip registering listeners.

@shwetathareja
Member

shwetathareja commented Aug 9, 2023

This test org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testDropPrimaryDuringReplication failed again here - https://build.ci.opensearch.org/job/gradle-check/22223/console

@dreamer-89
Member

Doc count mismatch assertion failure:

java.lang.AssertionError: Expected search hits on node: node_t3 to be at least 101 but was: 100

@mch2
Member

mch2 commented Aug 31, 2023

Resolved with #9471. Please re-open if this test pops up again.

mch2 closed this as completed on Aug 31, 2023