
NotMasterException with duplicate node ids and minimum_master_nodes not met #32904

Closed
PhaedrusTheGreek opened this issue Aug 16, 2018 · 4 comments
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement

Comments

@PhaedrusTheGreek
Contributor

Elasticsearch version 6.3.1

When instances with copied data directories cluster with each other, we should see IllegalArgumentException .. found existing node .. with the same id but is a different node instance

However, when minimum_master_nodes is not met, we see this instead:

NotMasterException .. not master for join request

Steps to reproduce (a rough shell sketch of these steps appears after the logs below):

  1. start ES from tar.gz, then stop ES
  2. cp -rp $ES_HOME $ES_HOME2
  3. set minimum_master_nodes: 2 in each config
  4. start ES on both nodes
[2018-08-16T07:29:43,468][INFO ][o.e.d.z.ZenDiscovery     ] [gETA6zC] failed to send join request to master [{gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{vYiBS2AVQhSK0VITLJJ5tQ}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[gETA6zC][127.0.0.1:9301][internal:discovery/zen/join]]; nested: NotMasterException[Node [{gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{vYiBS2AVQhSK0VITLJJ5tQ}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}] not master for join request]; ], tried [3] times
  5. Remove the minimum_master_nodes setting, then the more accurate error appears:
[2018-08-16T07:27:47,066][INFO ][o.e.d.z.ZenDiscovery     ] [gETA6zC] failed to send join request to master [{gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{iCnfSG48RTeWox4XqIN2cw}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[gETA6zC][127.0.0.1:9300][internal:discovery/zen/join]]; nested: IllegalArgumentException[can't add node {gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{mtW3yNGyQCq5znP5b-ZOZA}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, found existing node {gETA6zC}{gETA6zC4T1ipiuSfOrpN_w}{iCnfSG48RTeWox4XqIN2cw}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} with the same id but is a different node instance]; ]
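
For convenience, here is a rough shell sketch of the steps above. It assumes a 6.3.1 tar.gz install unpacked at $ES_HOME with the copy at $ES_HOME2, both on the same host; the pid-file paths are arbitrary placeholders.

```sh
# 1. Start once so the data directory (and the node id inside it) is created,
#    then stop the node again.
"$ES_HOME/bin/elasticsearch"        # Ctrl-C once startup has completed

# 2. Clone the whole installation, data directory and node id included.
cp -rp "$ES_HOME" "$ES_HOME2"

# 3. Require two master-eligible nodes in both configs.
echo 'discovery.zen.minimum_master_nodes: 2' >> "$ES_HOME/config/elasticsearch.yml"
echo 'discovery.zen.minimum_master_nodes: 2' >> "$ES_HOME2/config/elasticsearch.yml"

# 4. Start both copies; they find each other on localhost (the second one just
#    takes the next transport port) and the NotMasterException above shows up.
"$ES_HOME/bin/elasticsearch" -d -p "$ES_HOME/es.pid"
"$ES_HOME2/bin/elasticsearch" -d -p "$ES_HOME2/es.pid"

# 5. To get the IllegalArgumentException instead, delete the
#    minimum_master_nodes lines from both configs and restart both nodes.
```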
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@javanna added :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement labels Aug 16, 2018
@DaveCTurner
Contributor

First up, for the record, copying a data directory from one node to another is always a Very Bad Idea(:tm:). It's not supported, and Elasticsearch's behaviour is undefined if you do this.

You get a NotMasterException .. not master for join request here when each node decides that the other one should be master: they have equal cluster state versions and equal IDs, so the order of preference is formally undefined (see the sketch at the end of this comment), and a series of coincidences means that each node picks the other. Other things that can happen include:

  • each node thinks it's the best master candidate, and tries to join itself. This results in repeated election timeouts.

  • the nodes do agree on the best candidate between them, and the election repeatedly succeeds, but the subsequent cluster state publication fails:

[2018-08-17T12:18:15,704][INFO ][o.e.c.s.MasterService    ] [IMu-REN] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: master {new {IMu-REN}{IMu-RENTSSu3csqN4ObMaQ}{fK3mPtoKQhyWUBTqq_azZA}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}}, removed {{NGS64nm}{NGS64nmTSvKgC3c6mx7R_A}{cJWyql-aSECYsOpRSSfZfQ}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{IMu-REN}{IMu-RENTSSu3csqN4ObMaQ}{OrLYCx-DQFKjqaTfN2ZbMA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true},}
[2018-08-17T12:18:15,704][WARN ][o.e.d.z.ZenDiscovery     ] [IMu-REN] zen-disco-failed-to-publish, current nodes: nodes:
[2018-08-17T12:18:15,704][WARN ][o.e.c.s.MasterService    ] [IMu-REN] failing [zen-disco-elected-as-master ([1] nodes joined)[, ]]: failed to commit cluster state version [5]

I don't know that we want to try to handle all these cases in the existing discovery layer, but we should definitely do something better in #32006.
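
To make the "formally undefined" part concrete, here is a toy Java sketch. It is not the real ZenDiscovery/ElectMasterService code; the Candidate record, its field names, and the class name are invented for illustration. It shows that an ordering keyed only on cluster state version and node id cannot tell two node instances with the same id apart, so which one a node prefers comes down to incidental things like input order.

```java
import java.util.Comparator;
import java.util.List;

// Toy illustration only, NOT the real ZenDiscovery/ElectMasterService code.
// In 6.x the preferred master candidate is chosen roughly by comparing cluster
// state versions and then node ids. Two distinct node instances that share the
// same id (because the data directory was copied) and the same state version
// are indistinguishable to that ordering, so the preference is undefined and
// each node can end up picking the other.
public class DuplicateNodeIdElection {

    // Hypothetical stand-in for a master candidate; the field names are made up.
    record Candidate(String nodeId, long clusterStateVersion, String ephemeralId) {}

    // Higher cluster state version wins, ties broken by node id. The ephemeral
    // id (which does differ between the two copies) is not part of the ordering.
    static final Comparator<Candidate> PREFERENCE =
            Comparator.comparingLong(Candidate::clusterStateVersion).reversed()
                      .thenComparing(Candidate::nodeId);

    public static void main(String[] args) {
        Candidate copyA = new Candidate("gETA6zC4T1ipiuSfOrpN_w", 5, "vYiBS2AVQhSK0VITLJJ5tQ");
        Candidate copyB = new Candidate("gETA6zC4T1ipiuSfOrpN_w", 5, "iCnfSG48RTeWox4XqIN2cw");

        // Prints 0: the comparator cannot distinguish the two instances.
        System.out.println(PREFERENCE.compare(copyA, copyB));

        // Each node ranks its own view of the candidates, and with a tie the
        // winner depends only on input order, so the two views can disagree.
        System.out.println(List.of(copyA, copyB).stream().min(PREFERENCE).get());
        System.out.println(List.of(copyB, copyA).stream().min(PREFERENCE).get());
    }
}
```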

@bleskes
Contributor

bleskes commented Sep 9, 2018

I agree that we shouldn't invest time here. Closing. @PhaedrusTheGreek, are you happy with the explanation?

@DaveCTurner
Contributor

I added a corresponding item to the list in #32006 so as not to lose this.
