
Cluster health and number of nodes unchanged after a node failure #115875

Closed
jubui opened this issue Oct 29, 2024 · 3 comments
Labels: >bug, needs:triage (Requires assignment of a team area label)

Comments


jubui commented Oct 29, 2024

Elasticsearch Version

8.14.1

Installed Plugins

n/a

Java Version

17

OS Version

Linux 5.15.167-112.165.amzn2.x86_64 #1 SMP Mon Sep 23 21:53:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

I have a 3-node cluster in which Node-1 has a hardware failure; the other two nodes detect the disconnect, but they do not hold an election and no failover occurs. Prometheus metrics on the other two nodes indicate that Node-2 and Node-3 still think cluster health is GREEN and that there are 3 nodes in the cluster.

Timeline:

  • 11:44:00 Node-1 has a hardware failure and goes down. Up until this point it has been encountering "ProcessClusterEventTimeoutException: failed to process cluster event" (see [1] for examples). Node-2 and Node-3 also have ProcessClusterEventTimeoutException in their logs up to this point, and they are already seeing ReceiveTimeoutTransportException when talking to Node-1.
  • 11:57:47 The Node-2 and Node-3 logs start to show "NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected". See [2] for the log.
  • 12:59 Node-1 recovers, but Node-2 and Node-3 still do not hold an election.
  • 13:01 The Node-2 and Node-3 ES instances are restarted; Node-2 becomes master; cluster health goes from RED -> YELLOW -> GREEN. See [3] for the log.
  • The Node-2 and Node-3 ES instances only hold an election AFTER they are MANUALLY restarted, and only after we had waited more than 1 hour for them to fail over automatically (which they did not).

Even though no failover happened, the Prometheus exporter shows that Node-2 and Node-3 were still reporting a cluster status of GREEN and a node count of 3 while Node-1 was down.
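
(The same two values can also be read directly from each node's _cluster/health endpoint, independent of the exporter. Below is a minimal sketch using the Python client; the node hostnames and port 9200 are assumptions for illustration and are not taken from this report.)

# Sketch only: hostnames and port are assumed, not from the original report.
from elasticsearch import Elasticsearch

for host in ["http://node-2:9200", "http://node-3:9200"]:
    es = Elasticsearch(host, request_timeout=10)
    # local=True asks the node for its own applied cluster state
    # instead of routing the request to the elected master.
    health = es.cluster.health(local=True)
    print(host, health["status"], health["number_of_nodes"])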

The settings cluster.fault_detection.leader_check.interval, cluster.fault_detection.leader_check.timeout, and cluster.fault_detection.leader_check.retry_count have not been changed, so they should be at their defaults of 1s, 10s, and 3, respectively.
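
One way to confirm they are still at their defaults is to ask the cluster for its settings with defaults included. This is a minimal sketch using the Python client; the endpoint is an assumption for illustration.

# Sketch only: the endpoint is assumed, not from the original report.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://node-2:9200")
settings = es.cluster.get_settings(include_defaults=True, flat_settings=True)
for key in (
    "cluster.fault_detection.leader_check.interval",
    "cluster.fault_detection.leader_check.timeout",
    "cluster.fault_detection.leader_check.retry_count",
):
    # Explicitly set values appear under "persistent"/"transient";
    # untouched settings only appear under "defaults".
    print(key, settings["defaults"].get(key))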

See also the community forum thread for this issue: https://discuss.elastic.co/t/node-fails-but-cluster-holds-no-election-and-no-failover-occurs/369391

Steps to Reproduce

n/a

Logs (if relevant)

1:

[2024-10-15T11:43:00,998][WARN ][rest.suppressed          ] [node-1] path: /news/_settings, params: {master_timeout=30s, index=news, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[news-8-0/JXXtmZtqSRm2GKNx1-2Gkw]]) within 30s
        at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2024-10-15T11:43:56,870][WARN ][rest.suppressed          ] [node-1] path: /designer-objects-ia/_settings, params: {master_timeout=30s, index=designer-objects-ia, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[designer-objects-ia-8-6/jZvLf-kxTGSigmHZrFQI7A]]) within 30s
        at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]

2:

[2024-10-15T11:57:47,914][DEBUG][org.elasticsearch.action.support.nodes.TransportNodesAction] [node-2] failed to execute [cluster:monitor/nodes/stats] on node [{node-1}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
        at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]
...
...
[2024-10-15T11:57:47,917][WARN ][org.elasticsearch.cluster.InternalClusterInfoService] [node-2] failed to retrieve stats for node [UtQcAppJSkO-4BQc4b4avA]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
        at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]

3:
Log from Node-2, which becomes the new master:

[2024-10-15T13:01:33,271][INFO ][org.elasticsearch.cluster.service.MasterService] [node-2] elected-as-master ([3] nodes joined in term 5)[_FINISH_ELECTION_, {node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election], term: 5, version: 68136, delta: master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}
[2024-10-15T13:01:33,468][INFO ][org.elasticsearch.cluster.service.ClusterApplierService] [node-2] master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}, term: 5, version: 68136, reason: Publication{term=5, version=68136}
...
...
[2024-10-15T13:01:41,829][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="YELLOW" message="Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="RED" reason="shards started [[user-activity-8-0][0]]"
[2024-10-15T13:01:41,864][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [MsnZMCyIRnmjkxcnhEtO-A]
[2024-10-15T13:01:42,186][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [9aeL3u_KQYOsqSSdgI6b1g]
[2024-10-15T13:01:44,994][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="YELLOW" reason="shards started [[user-activity-8-0][0]]"

jubui added the >bug and needs:triage labels on Oct 29, 2024
pxsalehi (Member) commented:

I will close this, as the community forum thread seems to be active and someone is already helping out.

jubui (Author) commented Nov 9, 2024

To be clear, the community thread did not result in a resolution of this issue.

DaveCTurner (Contributor) commented:

I replied on the forum, but there's nothing in the information shared that indicates what (if anything) might need to be changed in ES, so we'll leave this issue closed.
