
[BUG] Index management can hang cluster on master failure if history is enabled #68

Closed · adityaj1107 opened this issue Jun 3, 2021 · 0 comments
Labels: bug (Something isn't working)

@adityaj1107 (Contributor):
Issue by mpoindexter
Saturday May 22, 2021 at 03:19 GMT
Originally opened as opendistro-for-elasticsearch/index-management#447


Describe the bug

When the active master terminates and index management history is enabled, the cluster can hang: all cluster tasks become blocked behind an election-to-master task that never completes. I believe https://discuss.opendistrocommunity.dev/t/killed-active-master-not-being-removed-from-the-cluster-state/5011 describes the same issue.

Other plugins installed
Using the standard ODFE install, including the security plugin.

To Reproduce
Difficult to reproduce. We've had ISM (index state management) enabled for a while and this did not happen until recently. I'm not clear on the exact conditions that trigger it, but it seems to happen when network traffic is high.

Expected behavior
When a master node dies, it is cleanly replaced.

Additional context
When index state management history is enabled and the master node dies, the new master appears to hang while processing a task, leaving the cluster in a state where no further cluster operations can occur: anything that requires a cluster state update is blocked, so the cluster is practically unusable. On the master node, the following stack traces appear:

"elasticsearch[us-west-2a-0][clusterApplierService#updateTask][T#1]" #28 daemon prio=5 os_prio=0 cpu=22914.96ms elapsed=2858.68s tid=0x00007f365728a0d0 nid=0x1247 waiting on condition  [0x00007f35b26f6000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@…/Native Method)
	- parking to wait for  <0x000000063c000020> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
	at java.util.concurrent.locks.LockSupport.park(java.base@…/LockSupport.java:211)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@…/AbstractQueuedSynchronizer.java:714)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@…/AbstractQueuedSynchronizer.java:1046)
	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:259)
	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
	at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
	at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:37)
	at com.amazon.opendistroforelasticsearch.indexmanagement.indexstatemanagement.IndexStateManagementHistory.rolloverHistoryIndex(IndexStateManagementHistory.kt:125)
	at com.amazon.opendistroforelasticsearch.indexmanagement.indexstatemanagement.IndexStateManagementHistory.onMaster(IndexStateManagementHistory.kt:86)
	at org.elasticsearch.cluster.LocalNodeMasterListener.clusterChanged(LocalNodeMasterListener.java:42)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:526)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:516)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:484)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418)
	at org.elasticsearch.cluster.service.ClusterApplierService.access$000(ClusterApplierService.java:68)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:162)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@…/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@…/ThreadPoolExecutor.java:630)
	at java.lang.Thread.run(java.base@…/Thread.java:832)
	
"elasticsearch[us-west-2a-0][masterService#updateTask][T#1]" #584 daemon prio=5 os_prio=0 cpu=111.37ms elapsed=2364.78s tid=0x00007f3604085390 nid=0x1513 waiting on condition  [0x00007f21bec69000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@…/Native Method)
	- parking to wait for  <0x000000063b800350> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
	at java.util.concurrent.locks.LockSupport.park(java.base@…/LockSupport.java:211)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@…/AbstractQueuedSynchronizer.java:714)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@…/AbstractQueuedSynchronizer.java:1046)
	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:259)
	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
	at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
	at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:272)
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250)
	at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@…/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@…/ThreadPoolExecutor.java:630)
	at java.lang.Thread.run(java.base@…/Thread.java:832)
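
Reading the two traces together, this looks like a deadlock between the cluster applier thread and the master service thread: IndexStateManagementHistory.onMaster runs on the applier thread and parks in rolloverHistoryIndex on a blocking actionGet() (presumably waiting for a rollover of the history write index), while MasterService.publish is parked waiting for that same applier thread to finish, so the cluster state update the rollover needs can never be applied. A minimal Kotlin sketch of the suspected call shape is below; the function names and the writeAlias parameter are illustrative, and only the rolloverIndex/actionGet client calls are assumed to match the ES 7.x Java API used by the plugin.

```kotlin
import org.elasticsearch.action.ActionListener
import org.elasticsearch.action.admin.indices.rollover.RolloverRequest
import org.elasticsearch.action.admin.indices.rollover.RolloverResponse
import org.elasticsearch.client.Client

// Shape seen in the first trace: invoked from LocalNodeMasterListener.onMaster(),
// i.e. on the clusterApplierService#updateTask thread.
fun rolloverBlocking(client: Client, writeAlias: String) {
    // actionGet() parks the applier thread until the rollover finishes, but the
    // rollover needs a cluster state update that this thread must apply, and
    // MasterService.publish (second trace) is itself waiting on the applier.
    client.admin().indices()
        .rolloverIndex(RolloverRequest(writeAlias, null))
        .actionGet()
}

// Non-blocking alternative: complete the rollover via a listener so the applier
// thread returns immediately and the pending election/publish tasks can proceed.
fun rolloverAsync(client: Client, writeAlias: String) {
    client.admin().indices().rolloverIndex(
        RolloverRequest(writeAlias, null),
        object : ActionListener<RolloverResponse> {
            override fun onResponse(response: RolloverResponse) {
                // log the rollover result
            }

            override fun onFailure(e: Exception) {
                // log and retry on a later cluster state update
            }
        }
    )
}
```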

If I disable Index State Management history and delete the history write alias, the problem goes away and masters are re-elected normally. A sketch of that workaround follows.
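
For reference, a sketch of the workaround using the Java/Kotlin client; the setting key and the history alias/index names are assumptions based on the default ODFE ISM naming and may differ per version:

```kotlin
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest.AliasActions
import org.elasticsearch.client.Client
import org.elasticsearch.common.settings.Settings

fun disableIsmHistory(client: Client) {
    // Assumed setting key for ISM history in ODFE; verify against your version's docs.
    val disableHistory = ClusterUpdateSettingsRequest().persistentSettings(
        Settings.builder().put("opendistro.index_state_management.history.enabled", false)
    )
    client.admin().cluster().updateSettings(disableHistory).actionGet()

    // Assumed default names of the ISM history indices and their write alias.
    val removeWriteAlias = IndicesAliasesRequest().addAliasAction(
        AliasActions.remove()
            .index(".opendistro-ism-managed-index-history-*")
            .alias(".opendistro-ism-managed-index-history-write")
    )
    client.admin().indices().aliases(removeWriteAlias).actionGet()
}
```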
