
Shards did not rebalance immediately after hardware failure on data node #116203

Closed
Harif-Rahman opened this issue Nov 4, 2024 · 1 comment
Labels
>bug needs:triage Requires assignment of a team area label

Comments

@Harif-Rahman

Elasticsearch Version

5.6.16

Installed Plugins

No response

Java Version

bundled

OS Version

Linux cluster-1-master-2 4.14.326-245.539.amzn2.x86_64 #1 SMP Tue Sep 26 09:59:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

We encountered an issue where Elasticsearch did not automatically rebalance shards after a hardware failure on one of our data nodes. This caused an extended period of degraded performance until manual intervention was performed.

EC2 instance went unhealthy at 2024-10-28 20:40 UTC.
EC2 instance became healthy again at 2024-10-28 21:10 UTC.

Elasticsearch rebalanced the shards only at 2024-10-28 21:10 UTC, the same time the instance recovered.

We would like to understand why rebalancing the shards took so long. The master node logs are attached:

es_oct_28.log

ES cluster settings

Cluster-level settings.

cluster.name: fc-use1-00-conversation-cluster-1

Discovery settings.

discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: [XXX]

discovery.zen.ping.timeout: 15s

discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 6
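As a rough, unofficial way to see what the discovery.zen.fd.* values above imply, the sketch below estimates how long master-side fault detection might take to declare a completely silent node failed. The model is an assumption, not taken from the Elasticsearch source: it assumes the next ping goes out up to one ping_interval after the last success, and that each of the ping_retries attempts then waits the full ping_timeout before counting as a failure. The helper names are hypothetical.

```python
# Hypothetical helpers. ASSUMED model of Zen fault detection timing:
#   worst case ~= ping_interval + ping_retries * ping_timeout
# This is an approximation for illustration, not Elasticsearch's exact logic.

def parse_seconds(value: str) -> float:
    """Parse simple Elasticsearch-style time values such as '10s' or '500ms'."""
    units = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
    # Check longer suffixes first so '500ms' is not mistaken for minutes.
    for suffix in sorted(units, key=len, reverse=True):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unsupported time value: {value!r}")


def worst_case_detection_seconds(ping_interval: str, ping_timeout: str,
                                 ping_retries: int) -> float:
    """Approximate upper bound on time until the master drops a silent node."""
    return parse_seconds(ping_interval) + ping_retries * parse_seconds(ping_timeout)


# Values from the discovery.zen.fd.* settings in this report.
print(worst_case_detection_seconds("10s", "10s", 6))  # 70.0
```

Under this assumed model the estimate is about 70 seconds, far shorter than the 30 minutes between 20:40 and 21:10 UTC, provided the node stopped answering pings entirely rather than responding slowly.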

action.auto_create_index: true

indices.memory.index_buffer_size: 30%

indices.store.throttle.max_bytes_per_sec: 100mb

Node-level settings.

node.data: false
node.master: true
node.name: cluster-1-master-1

cluster.routing.allocation.awareness.force.zone.values: zone-a,zone-b,zone-c,zone-d,zone-e,zone-f
cluster.routing.allocation.awareness.attributes: zone

http.enabled: true

Loopback interface

Steps to Reproduce


  • Simulate a hardware failure on one of the data nodes by stopping or disconnecting the instance.
  • Observe the state of shard allocation and rebalancing.
  • Notice that Elasticsearch does not immediately initiate shard rebalancing across available nodes.
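To observe shard allocation during the reproduction steps above, one option is to poll the `_cat/shards` API and log how many shard copies are in each state (for example STARTED, INITIALIZING, RELOCATING, UNASSIGNED). This is a minimal sketch that assumes the HTTP endpoint is reachable at a hypothetical `http://localhost:9200`; the function names are illustrative only.

```python
import time
import urllib.request
from collections import Counter

# ASSUMPTION: hypothetical address of a node's HTTP endpoint.
ES_URL = "http://localhost:9200"


def count_shard_states(cat_shards_text: str) -> Counter:
    """Count shard copies per state from `_cat/shards?h=state` output (one state per line)."""
    return Counter(line.strip() for line in cat_shards_text.splitlines() if line.strip())


def fetch_shard_states(base_url: str = ES_URL) -> Counter:
    # h=state limits the _cat output to the single `state` column.
    with urllib.request.urlopen(f"{base_url}/_cat/shards?h=state", timeout=10) as resp:
        return count_shard_states(resp.read().decode("utf-8"))


def watch(interval_seconds: float = 30, base_url: str = ES_URL) -> None:
    """Print a UTC-timestamped summary of shard states until interrupted with Ctrl-C."""
    while True:
        states = fetch_shard_states(base_url)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
        print(stamp, "UTC", dict(states))
        time.sleep(interval_seconds)
```

Calling `watch()` while the data node is stopped gives a timeline of when copies become UNASSIGNED and when they start INITIALIZING elsewhere, which can be lined up against the master node logs.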

Logs (if relevant)

No response

@Harif-Rahman Harif-Rahman added >bug needs:triage Requires assignment of a team area label labels Nov 4, 2024
@DaveCTurner
Contributor

Thank you very much for your interest in Elasticsearch. Unfortunately the issue you have reported relates to Elasticsearch version 5.6.16 which is very old and has passed end-of-life. We will not investigate issues related to unsupported versions here on GitHub, so I am closing this to indicate that no action is needed from the Elasticsearch development team. It's possible that you will find a volunteer to help you with this issue on the community forums, but our strong recommendation would be to upgrade to a supported version of Elasticsearch as a matter of some urgency. If you can reproduce your issue on a supported version then please open a fresh bug report.

Quoting the bug report form:

Please also check your OS is supported, and that the version of Elasticsearch has not passed end-of-life. If you are using an unsupported OS or an unsupported version then the issue is likely to be closed.

@DaveCTurner DaveCTurner closed this as not planned Nov 5, 2024