Elasticsearch Version
8.14.1
Installed Plugins
n/a
Java Version
17
OS Version
Linux 5.15.167-112.165.amzn2.x86_64 #1 SMP Mon Sep 23 21:53:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Problem Description
I have a 3-node cluster in which Node-1 has a hardware failure; the other two nodes detect the disconnect but do not hold an election, and no failover occurs. All three nodes are master-eligible, so the two surviving nodes still form a voting majority and should be able to elect a new master on their own. Prometheus metrics on the other two nodes indicate that Node-2 and Node-3 still think cluster health is GREEN and that there are 3 nodes in the cluster.
Timeline:
11:44:00 Node-1 suffers a hardware failure and goes down. Up until this point it had been encountering "ProcessClusterEventTimeoutException: failed to process cluster event" errors; see [1] for examples. Node-2 and Node-3 also have ProcessClusterEventTimeoutException in their logs up to this point, and they are already seeing ReceiveTimeoutTransportException when talking to Node-1.
11:57:47 Node-2 and Node-3 logs start to show "NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected". See [2] for the log.
12:59 Node-1 recovers, but Node-2 and Node-3 still do not hold an election.
13:01 Node-2 and Node-3 ES instances are restarted; Node-2 becomes master; cluster health goes from RED -> YELLOW -> GREEN. See [3] for the log.
Node-2 and Node-3 ES instances only hold an election AFTER they are MANUALLY restarted, and only after we had waited more than 1 hour for them to fail over automatically (which they did not).
Even though no failover happened, the Prometheus exporter was still showing that Node-2 and Node-3 reported a cluster status of GREEN and a node count of 3 while Node-1 was down.
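To cross-check what the exporter is scraping, each node can be asked directly for its own view of the cluster. A minimal diagnostic sketch (assuming plain HTTP on port 9200 and no authentication; the node addresses are placeholders):

import json
import urllib.request

# Placeholder node addresses; fill in the real hosts before running.
NODES = [
    "http://<NODE-1_IP>:9200",
    "http://<NODE-2_IP>:9200",
    "http://<NODE-3_IP>:9200",
]

def get_json(url):
    # Fetch a URL and decode the JSON body.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read())

for node in NODES:
    try:
        health = get_json(node + "/_cluster/health")
        master = get_json(node + "/_cat/master?format=json")
        print(node,
              "status:", health["status"],
              "nodes:", health["number_of_nodes"],
              "master:", master[0]["node"] if master else "none")
    except Exception as exc:  # node unreachable or not responding
        print(node, "unreachable:", exc)

Running this against Node-2 and Node-3 while Node-1 is down would show whether the GREEN status and node count of 3 come from Elasticsearch itself or from stale exporter data.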
The settings cluster.fault_detection.leader_check.interval, cluster.fault_detection.leader_check.timeout, and cluster.fault_detection.leader_check.retry_count have not been changed, so they should be at their defaults of 1s, 10s, and 3, respectively.
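To confirm those values really are at their defaults, the cluster settings API can be queried with include_defaults; a minimal sketch under the same assumptions (plain HTTP, placeholder address):

import json
import urllib.request

# Placeholder address; assumes plain HTTP on port 9200 and no authentication.
NODE = "http://<NODE-2_IP>:9200"

url = NODE + "/_cluster/settings?include_defaults=true&flat_settings=true"
with urllib.request.urlopen(url, timeout=10) as resp:
    settings = json.loads(resp.read())

# Explicit persistent/transient values (if any) take precedence over the defaults section.
for section in ("persistent", "transient", "defaults"):
    for key, value in settings.get(section, {}).items():
        if key.startswith("cluster.fault_detection."):
            print(section, key, "=", value)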
See also the community forum thread for this issue: https://discuss.elastic.co/t/node-fails-but-cluster-holds-no-election-and-no-failover-occurs/369391
Steps to Reproduce
n/a
Logs (if relevant)
1:
[2024-10-15T11:43:00,998][WARN ][rest.suppressed ] [node-1] path: /news/_settings, params: {master_timeout=30s, index=news, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[news-8-0/JXXtmZtqSRm2GKNx1-2Gkw]]) within 30s
at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2024-10-15T11:43:56,870][WARN ][rest.suppressed ] [node-1] path: /designer-objects-ia/_settings, params: {master_timeout=30s, index=designer-objects-ia, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[designer-objects-ia-8-6/jZvLf-kxTGSigmHZrFQI7A]]) within 30s
at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
2:
[2024-10-15T11:57:47,914][DEBUG][org.elasticsearch.action.support.nodes.TransportNodesAction] [node-2] failed to execute [cluster:monitor/nodes/stats] on node [{node-1}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]
...
...
[2024-10-15T11:57:47,917][WARN ][org.elasticsearch.cluster.InternalClusterInfoService] [node-2] failed to retrieve stats for node [UtQcAppJSkO-4BQc4b4avA]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]
3:
Log from Node-2, which becomes the new master:
[2024-10-15T13:01:33,271][INFO ][org.elasticsearch.cluster.service.MasterService] [node-2] elected-as-master ([3] nodes joined in term 5)[_FINISH_ELECTION_, {node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election], term: 5, version: 68136, delta: master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}
[2024-10-15T13:01:33,468][INFO ][org.elasticsearch.cluster.service.ClusterApplierService] [node-2] master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}, term: 5, version: 68136, reason: Publication{term=5, version=68136}
...
...
[2024-10-15T13:01:41,829][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="YELLOW" message="Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="RED" reason="shards started [[user-activity-8-0][0]]"
[2024-10-15T13:01:41,864][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [MsnZMCyIRnmjkxcnhEtO-A]
[2024-10-15T13:01:42,186][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [9aeL3u_KQYOsqSSdgI6b1g]
[2024-10-15T13:01:44,994][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="YELLOW" reason="shards started [[user-activity-8-0][0]]"
I replied on the forum, but there's nothing in the information shared that indicates what (if anything) might need to be changed in ES, so we'll leave this issue closed.