[BUG] testPrimaryStopped_ReplicaPromoted is flaky #8762

Closed
dblock opened this issue Jul 18, 2023 · 3 comments · Fixed by #10655
Labels: bug (Something isn't working); flaky-test (Random test failure that succeeds on second run); Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep)

Comments

@dblock
Member

dblock commented Jul 18, 2023

Describe the bug

Failed in https://build.ci.opensearch.org/job/gradle-check/20433/consoleFull, 2.9 RC.

org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT > testPrimaryStopped_ReplicaPromoted FAILED
    java.lang.AssertionError: Unexpected ShardFailures: [[test-idx-1][0] failed, reason [BroadcastShardOperationFailedException[]; nested: RemoteTransportException[[node_t3][127.0.0.1:45855][indices:admin/refresh[s]]]; nested: RemoteTransportException[[node_t3][127.0.0.1:45855][indices:admin/refresh[s][p]]]; nested: RetryOnPrimaryException[shard is not in primary mode]; nested: ShardNotInPrimaryModeException[CurrentState[STARTED] shard is not in primary mode]; ], [test-idx-1][0] failed, reason [BroadcastShardOperationFailedException[]; nested: RemoteTransportException[[node_t3][127.0.0.1:45855][indices:admin/refresh[s]]]; nested: RemoteTransportException[[node_t3][127.0.0.1:45855][indices:admin/refresh[s][p]]]; nested: RetryOnPrimaryException[shard is not in primary mode]; nested: ShardNotInPrimaryModeException[CurrentState[STARTED] shard is not in primary mode]; ]]
    Expected: <0>
         but: was <2>
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.opensearch.test.hamcrest.OpenSearchAssertions.assertNoFailures(OpenSearchAssertions.java:377)
        at org.opensearch.test.OpenSearchIntegTestCase.refresh(OpenSearchIntegTestCase.java:1455)
        at org.opensearch.indices.replication.SegmentReplicationIT.testPrimaryStopped_ReplicaPromoted(SegmentReplicationIT.java:127)
        at java.****/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.****/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.****/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.****/java.lang.reflect.Method.invoke(Method.java:568)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.****/java.lang.Thread.run(Thread.java:833)

    java.lang.AssertionError: AcknowledgedResponse failed - not acked
    Expected: <true>
         but: was <false>
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.opensearch.test.hamcrest.OpenSearchAssertions.assertAcked(OpenSearchAssertions.java:125)
        at org.opensearch.test.hamcrest.OpenSearchAssertions.assertAcked(OpenSearchAssertions.java:113)
        at org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.teardown(SegmentReplicationUsingRemoteStoreIT.java:64)

Expected behavior

Tests to pass.

@dblock added the bug, untriaged, and flaky-test labels on Jul 18, 2023
@Poojita-Raj
Contributor

Unable to reproduce locally with the seed after 500+ iterations.

@mch2
Member

mch2 commented Aug 23, 2023

From this trace, the test failed right after index creation, without passing ensureYellowAndNoInitializingShards. The exception thrown is "a started primary with non-pending operation term must be in primary mode", which, from a search, looks like a remote store bug that has since been fixed; see the similar issue #9036.

Closing this as not reproducible, please re-open if this happens again.

@mch2 closed this as completed on Aug 23, 2023
@github-project-automation (bot) moved this from Todo to Done in Segment Replication on Aug 23, 2023
@dreamer-89
Member

Coming from #8279 (comment), I observed some recent gradle check failures due to this issue. Reopening.

4 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testPrimaryStopped_ReplicaPromoted (24139,24530,24751,24937)
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testPrimaryStopped_ReplicaPromoted" -Dtests.seed=F5F9AE786895E628 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nb -Dtests.timezone=Chile/EasterIsland -Druntime.java=20
...

java.lang.AssertionError: Expected search hits on node: node_t3 to be at least 1 but was: 0
	at __randomizedtesting.SeedInfo.seed([F5F9AE786895E628:E4F1AF1F77E71110]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.opensearch.indices.replication.SegmentReplicationBaseIT.lambda$waitForSearchableDocs$0(SegmentReplicationBaseIT.java:125)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1086)
	at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:120)
	at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:115)
	at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:132)
	at org.opensearch.indices.replication.SegmentReplicationIT.testPrimaryStopped_ReplicaPromoted(SegmentReplicationIT.java:134)
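The trace above fails inside assertBusy, called from waitForSearchableDocs: the test polls until the replica reports at least one search hit and gives up when the wait expires. For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a polling assertion. It is not the OpenSearchTestCase implementation; the class name, the backoff values, and the simulated replica are assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BusyAssertSketch {

    /**
     * Hypothetical helper: re-runs the check with a growing sleep between
     * attempts, and rethrows the last AssertionError once the deadline passes.
     */
    static void assertBusy(Runnable check, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // the assertion held, stop polling
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // capped backoff
        }
    }

    public static void main(String[] args) {
        // Simulated replica (an assumption): docs become searchable after ~50 ms.
        AtomicInteger searchableDocs = new AtomicInteger(0);
        Thread replica = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            searchableDocs.set(1);
        });
        replica.start();

        assertBusy(() -> {
            if (searchableDocs.get() < 1) {
                throw new AssertionError("Expected search hits to be at least 1 but was: " + searchableDocs.get());
            }
        }, 10, TimeUnit.SECONDS);
        System.out.println("docs searchable: " + searchableDocs.get());
    }
}
```

With a helper like this, a failure such as "at least 1 but was: 0" means the condition never became true within the whole wait, rather than on a single unlucky check.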

Observed an AlreadyClosedException on engine flush, which suggests the engine was closed before the flush operation; this is probably the cause of the doc count mismatch errors.

[2023-09-07T10:15:41,608][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t2] [test-idx-1][0] Exception in runAfterRefreshExactlyOnce() method
org.apache.lucene.store.AlreadyClosedException: engine is closed
	at org.opensearch.index.shard.IndexShard.getEngine(IndexShard.java:3452) ~[main/:?]
	at org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:4945) ~[main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.runAfterRefreshExactlyOnce(RemoteStoreRefreshListener.java:133) [main/:?]
	at org.opensearch.index.shard.CloseableRetryableRefreshListener.afterRefresh(CloseableRetryableRefreshListener.java:62) [main/:?]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) [lucene-core-9.8.0-snapshot-4373c3b.jar:9.8.0-snapshot-4373c3b 4373c3b2612e54bc0c5b992d9423e83e6340fdd5 - 2023-07-24 17:45:44]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) [lucene-core-9.8.0-snapshot-4373c3b.jar:9.8.0-snapshot-4373c3b 4373c3b2612e54bc0c5b992d9423e83e6340fdd5 - 2023-07-24 17:45:44]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) [lucene-core-9.8.0-snapshot-4373c3b.jar:9.8.0-snapshot-4373c3b 4373c3b2612e54bc0c5b992d9423e83e6340fdd5 - 2023-07-24 17:45:44]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1769) [main/:?]
	at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1884) [main/:?]
	at org.opensearch.index.engine.Engine.flush(Engine.java:1198) [main/:?]
	at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:1973) [main/:?]
	at org.opensearch.index.shard.IndexShard.close(IndexShard.java:1938) [main/:?]
	at org.opensearch.index.IndexService.closeShard(IndexService.java:630) [main/:?]
	at org.opensearch.index.IndexService.removeShard(IndexService.java:606) [main/:?]
	at org.opensearch.index.IndexService.close(IndexService.java:380) [main/:?]
	at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1023) [main/:?]
	at org.opensearch.indices.IndicesService.lambda$doStop$3(IndicesService.java:520) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1623) [?:?]
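Reading the log bottom-up, IndexShard.close calls Engine.flushAndClose, the flush triggers a refresh, and the refresh listener calls back into IndexShard.getEngine, which throws. A toy model of that call order is sketched below. It is only an illustration under one assumption: that the shard drops its engine reference before the final flush runs. All class and method names are hypothetical stand-ins, and IllegalStateException stands in for Lucene's AlreadyClosedException to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class CloseDuringRefreshSketch {

    /** Hypothetical stand-in for an engine that notifies listeners after each refresh. */
    static class ToyEngine {
        private final List<Runnable> refreshListeners = new ArrayList<>();

        void addRefreshListener(Runnable listener) {
            refreshListeners.add(listener);
        }

        void refresh() {
            refreshListeners.forEach(Runnable::run);
        }

        void flushAndClose() {
            refresh(); // in this model the final flush forces one last refresh
        }
    }

    /** Hypothetical stand-in for a shard that owns the engine reference. */
    static class ToyShard {
        private final AtomicReference<ToyEngine> engineRef = new AtomicReference<>(new ToyEngine());
        final List<String> listenerErrors = new ArrayList<>();

        ToyShard() {
            // The listener goes back through the shard, as the trace's listener does.
            engineRef.get().addRefreshListener(() -> {
                try {
                    getEngine();
                } catch (IllegalStateException e) {
                    listenerErrors.add(e.getMessage());
                }
            });
        }

        ToyEngine getEngine() {
            ToyEngine engine = engineRef.get();
            if (engine == null) {
                throw new IllegalStateException("engine is closed");
            }
            return engine;
        }

        void close() {
            // Assumption: the reference is cleared first, then the engine is flushed.
            ToyEngine engine = engineRef.getAndSet(null);
            if (engine != null) {
                engine.flushAndClose();
            }
        }
    }

    public static void main(String[] args) {
        ToyShard shard = new ToyShard();
        shard.getEngine().refresh();
        System.out.println("errors before close: " + shard.listenerErrors);
        shard.close();
        System.out.println("errors after close: " + shard.listenerErrors);
    }
}
```

In this model the listener succeeds on an ordinary refresh and fails only on the refresh issued during close, so whatever the listener was meant to do after that last flush is skipped. Whether that matches the real cause of the missing search hits is not established by the log alone.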
