Do not try to exclude a master node that never existed #1986
Comments
This is an interesting one. Some ideas:
In both cases, there's a race condition:
The current code retries over and over again, hitting the same error until the node finally joins the cluster. But this might never happen if the node stays Pending or keeps bootlooping forever. Note that we remove only one master node at a time, which mitigates the risk introduced by the race condition above.
I think we have the same sort of problem when setting allocation excludes in cluster settings, to migrate shards away from a data node before removing it. We have an easy way out there, though: it is possible to exclude a node that is not part of the cluster. The corresponding HTTP call does not fail.
@ywelsch @DaveCTurner I would appreciate your thoughts on this.
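To illustrate the allocation-exclude call mentioned above, here is a minimal sketch that only builds the request for the cluster settings API. It assumes the exclusion is done by node name through `cluster.routing.allocation.exclude._name` in a `transient` block; the function name and the node name `es-data-2` are hypothetical, and this is not taken from the operator's code.

```python
import json


def allocation_exclude_request(node_names):
    """Build (method, path, body) for excluding nodes from shard allocation.

    Assumption: exclusion by node name via a transient cluster setting.
    The names do not have to belong to nodes currently in the cluster.
    """
    body = {
        "transient": {
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }
    return "PUT", "/_cluster/settings", json.dumps(body)


if __name__ == "__main__":
    # Hypothetical node name, for illustration only.
    method, path, body = allocation_exclude_request(["es-data-2"])
    print(method, path, body)
```

Because the setting is just a list of names matched against nodes, sending a name that no node currently has is accepted, which is why this path does not hit the same error as the voting exclusion.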
Ugh yes this is tricky. Unfortunately it's necessary to know the node ID (not just its name) before we can exclude it from the voting configuration. If it's not in the cluster we don't know its node ID so we cannot exclude it, hence the exception.
Naively, if a node is not running then you don't need to play with the voting configuration to get rid of it safely. If the cluster is alive then the node in question wasn't needed for its votes, and if the cluster is dead then it's already too late. The main thing that worries me is that this node is still showing as
Unfortunately, "will not run in future" isn't quite enough. Nodes that are not running cannot join a cluster, but they could remain in a cluster for a short while after their deaths. I think that after stopping the node from running we need to ensure it is certainly out of the cluster. I don't think we provide an API to do this today. I wonder if we should strengthen the voting config exclusions API to accept an unknown node name.
I opened elastic/elasticsearch#47990
Ok the change to Elasticsearch is now merged to It will shortly be removed in
Thanks for the heads up @DaveCTurner! I suggest we keep this issue open for pre-8.0 clusters (we may decide to do nothing about it though).
I just realized that thanks to elastic/elasticsearch#50836 we could already fix this for Elasticsearch 7.8+, by changing our call from
Raising priority on this issue.
We have #2951 for the more focused fix of using the new query parameter
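As a rough sketch of the version-dependent call discussed above: the thread does not preserve the exact parameter, so this assumes the `node_names` query parameter of the voting configuration exclusions API on 7.8+, and the older path-based form on earlier 7.x versions. The function names are hypothetical and this is not the operator's implementation.

```python
from urllib.parse import quote, urlencode


def major_minor(version):
    """Parse '7.8.0' or '7.8.0-SNAPSHOT' into (7, 8)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def voting_exclusion_requests(es_version, node_names):
    """Return a list of (method, path) pairs to exclude master nodes by name.

    Assumption: 7.8+ takes a comma-separated `node_names` query parameter,
    which also accepts names of nodes that are not in the cluster.
    Earlier versions take the node name in the path and reject names that
    do not match a node in the cluster.
    """
    if major_minor(es_version) >= (7, 8):
        query = urlencode({"node_names": ",".join(node_names)}, safe=",")
        return [("POST", "/_cluster/voting_config_exclusions?" + query)]
    return [
        ("POST", "/_cluster/voting_config_exclusions/" + quote(name, safe=""))
        for name in node_names
    ]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    print(voting_exclusion_requests("7.9.1", ["es-master-2", "es-master-3"]))
    print(voting_exclusion_requests("7.6.2", ["es-master-2"]))
```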
To work around this situation when running Elasticsearch < 7.8, it is possible to edit the StatefulSet and manually scale down the number of replicas.
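The original command for this workaround is not preserved in the thread. As a hedged sketch, the manual scale-down could be done with `kubectl scale`; the helper below only assembles the command line, and the StatefulSet name, namespace, and replica count are hypothetical placeholders to be replaced with the values from your own cluster.

```python
def scale_statefulset_command(name, replicas, namespace="default"):
    """Build the kubectl argv that sets a StatefulSet's replica count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return [
        "kubectl", "scale", "statefulset", name,
        f"--replicas={replicas}",
        "--namespace", namespace,
    ]


if __name__ == "__main__":
    # Hypothetical StatefulSet name and target count, for illustration only.
    cmd = scale_statefulset_command("elasticsearch-sample-es-masters", 3)
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

The target count would presumably be the number of master Pods that actually joined the cluster, so that the Pod that never started is dropped from the StatefulSet.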
So are we going to add this workaround to our troubleshooting docs for <7.8 and close this issue?
I ran out of resources on a K8S cluster while doing an upscale of a set of MDI nodes.
I'm now in a situation where the nodeSet can't be downscaled because the operator is trying to exclude a master node which has never existed: