
NotMasterException with duplicate node ids and minimum_master_nodes not met #32904

Closed
PhaedrusTheGreek opened this issue Aug 16, 2018 · 4 comments
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement

Comments

@PhaedrusTheGreek
Contributor

Elasticsearch version 6.3.1

When instances with copied data directories cluster with each other, we should see IllegalArgumentException .. found existing node .. with the same id but is a different node instance

However, when minimum_master_nodes is not met, we see this instead:

NotMasterException .. not master for join request

Steps to reproduce (a rough shell sketch of these steps appears after the logs below):

  1. start ES from tar.gz, then stop ES
  2. cp -rp $ES_HOME $ES_HOME2
  3. set minimum_master_nodes: 2 in each config
  4. start ES on both nodes
[2018-08-16T07:29:43,468][INFO ][o.e.d.z.ZenDiscovery     ] [gETA6zC] failed to send join request to master [{gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{vYiBS2AVQhSK0VITLJJ5tQ}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[gETA6zC][127.0.0.1:9301][internal:discovery/zen/join]]; nested: NotMasterException[Node [{gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{vYiBS2AVQhSK0VITLJJ5tQ}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}] not master for join request]; ], tried [3] times
  5. Remove the minimum_master_nodes setting, then the more accurate error appears:
[2018-08-16T07:27:47,066][INFO ][o.e.d.z.ZenDiscovery     ] [gETA6zC] failed to send join request to master [{gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{iCnfSG48RTeWox4XqIN2cw}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[gETA6zC][127.0.0.1:9300][internal:discovery/zen/join]]; nested: IllegalArgumentException[can't add node {gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{mtW3yNGyQCq5znP5b-ZOZA}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, found existing node {gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{iCnfSG48RTeWox4XqIN2cw}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} with the same id but is a different node instance]; ]
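
For convenience, here is a rough shell sketch of the steps above. It assumes a 6.3.1 tar.gz install unpacked at $ES_HOME with the copy at $ES_HOME2, both on the same host; the pid-file paths are arbitrary placeholders.

```sh
# 1. Start once so the data directory (and the node id inside it) is created,
#    then stop the node again.
"$ES_HOME/bin/elasticsearch"        # Ctrl-C once startup has completed

# 2. Clone the whole installation, data directory and node id included.
cp -rp "$ES_HOME" "$ES_HOME2"

# 3. Require two master-eligible nodes in both configs.
echo 'discovery.zen.minimum_master_nodes: 2' >> "$ES_HOME/config/elasticsearch.yml"
echo 'discovery.zen.minimum_master_nodes: 2' >> "$ES_HOME2/config/elasticsearch.yml"

# 4. Start both copies; they find each other on localhost (the second one just
#    takes the next transport port) and the NotMasterException above shows up.
"$ES_HOME/bin/elasticsearch" -d -p "$ES_HOME/es.pid"
"$ES_HOME2/bin/elasticsearch" -d -p "$ES_HOME2/es.pid"

# 5. To get the IllegalArgumentException instead, delete the
#    minimum_master_nodes lines from both configs and restart both nodes.
```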
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@javanna added :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement labels Aug 16, 2018
@DaveCTurner
Contributor

First up, for the record, copying a data directory from one node to another is always a Very Bad Idea(:tm:). It's not supported, and Elasticsearch's behaviour is undefined if you do this.

You get a NotMasterException .. not master for join request here when each node decides that the other one should be master: they have equal cluster state versions and equal IDs, so the order of preference is formally undefined (see the sketch at the end of this comment), and a series of coincidences means that each node picks the other. Other things that can happen include:

  • each node thinks it's the best master candidate, and tries to join itself. This results in repeated election timeouts.

  • the nodes do agree on the best candidate between them, and the election repeatedly succeeds, but the subsequent cluster state publication fails:

[2018-08-17T12:18:15,704][INFO ][o.e.c.s.MasterService    ] [IMu-REN] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: master {new {IMu-REN}{IMu-RENTSSu3csqN4ObMaQ}{fK3mPtoKQhyWUBTqq_azZA}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}}, removed {{NGS64nm}{NGS64nmTSvKgC3c6mx7R_A}{cJWyql-aSECYsOpRSSfZfQ}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{IMu-REN}{IMu-RENTSSu3csqN4ObMaQ}{OrLYCx-DQFKjqaTfN2ZbMA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true},}
[2018-08-17T12:18:15,704][WARN ][o.e.d.z.ZenDiscovery     ] [IMu-REN] zen-disco-failed-to-publish, current nodes: nodes:
[2018-08-17T12:18:15,704][WARN ][o.e.c.s.MasterService    ] [IMu-REN] failing [zen-disco-elected-as-master ([1] nodes joined)[, ]]: failed to commit cluster state version [5]

I don't know that we want to try to handle all these cases in the existing discovery layer, but we should definitely do something better in #32006.
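
To make the "formally undefined" part concrete, here is a toy Java sketch. It is not the real ZenDiscovery/ElectMasterService code; the Candidate record, its field names, and the class name are invented for illustration. It shows that an ordering keyed only on cluster state version and node id cannot tell two node instances with the same id apart, so which one a node prefers comes down to incidental things like input order.

```java
import java.util.Comparator;
import java.util.List;

// Toy illustration only, NOT the real ZenDiscovery/ElectMasterService code.
// In 6.x the preferred master candidate is chosen roughly by comparing cluster
// state versions and then node ids. Two distinct node instances that share the
// same id (because the data directory was copied) and the same state version
// are indistinguishable to that ordering, so the preference is undefined and
// each node can end up picking the other.
public class DuplicateNodeIdElection {

    // Hypothetical stand-in for a master candidate; the field names are made up.
    record Candidate(String nodeId, long clusterStateVersion, String ephemeralId) {}

    // Higher cluster state version wins, ties broken by node id. The ephemeral
    // id (which does differ between the two copies) is not part of the ordering.
    static final Comparator<Candidate> PREFERENCE =
            Comparator.comparingLong(Candidate::clusterStateVersion).reversed()
                      .thenComparing(Candidate::nodeId);

    public static void main(String[] args) {
        Candidate copyA = new Candidate("gETA6zC4T1ipiuSfOrpN_w", 5, "vYiBS2AVQhSK0VITLJJ5tQ");
        Candidate copyB = new Candidate("gETA6zC4T1ipiuSfOrpN_w", 5, "iCnfSG48RTeWox4XqIN2cw");

        // Prints 0: the comparator cannot distinguish the two instances.
        System.out.println(PREFERENCE.compare(copyA, copyB));

        // Each node ranks its own view of the candidates, and with a tie the
        // winner depends only on input order, so the two views can disagree.
        System.out.println(List.of(copyA, copyB).stream().min(PREFERENCE).get());
        System.out.println(List.of(copyB, copyA).stream().min(PREFERENCE).get());
    }
}
```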

@bleskes
Contributor

bleskes commented Sep 9, 2018

I agree that we shouldn't invest time here. Closing. @PhaedrusTheGreek, are you happy with the explanation?

@DaveCTurner
Contributor

I added a corresponding item to the list in #32006 so as not to lose this.
