Nodes in process of shutting down should not respond to discovery pings #27328

jakommo · 2017-11-09T08:58:22Z

It looks like a node that is being shut down will still reply to a discovery ping.

Master node-108 was shut down gracefully but is still listed in the current nodes list. I guess this is because the log entry is only a few milliseconds after the shutdown started.

[2017-11-08T15:01:00,779][INFO ][o.e.d.z.ZenDiscovery     ] [node-106] master_left [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}], reason [shut_down]
[2017-11-08T15:01:00,781][WARN ][o.e.d.z.ZenDiscovery     ] [node-106] master left (reason = shut_down), current nodes: nodes: 
...
   {node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}, master

Then, 3 seconds later, the ping responses come back and node-108 is still listed. I checked the log on node-108, but there is nothing later than 2017-11-08T15:01:00,843, which makes me wonder why it is still listed in the ping responses.

[2017-11-08T15:01:03,785][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] full ping responses:
...
        --> ping_response{node [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}], id[44], master [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}],cluster_state_version [420], cluster_name[my-cluster]}
...
[2017-11-08T15:01:03,789][WARN ][o.e.d.z.ZenDiscovery     ] [node-106] failed to connect to master [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}], retrying...

The only explanation I have is that node-108 had already replied to the ping before 2017-11-08T15:01:00,843, which would match up with the pings starting at :00,782, i.e. before node-108's last log entry at :00,843.

[2017-11-08T15:01:00,782][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] starting to ping
[2017-11-08T15:01:00,782][TRACE][o.e.d.z.ZenDiscovery     ] [node-107] starting to ping

Then another ping is sent, and 3 seconds later the responses no longer list node-108 and a new master is elected.

[2017-11-08T15:01:03,798][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] starting to ping
[2017-11-08T15:01:06,799][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] full ping responses:
...
        --> ping_response{node [{node-106}{668jPO1YSweDaqRyQxMRwA}{LwYDV2eBQAC3ialrV4qEbQ}{node106.home.lan}{10.10.10.13:9300}], id[39], master [null],cluster_state_version [420], cluster_name[my-cluster]}
[2017-11-08T15:01:06,800][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] candidate Candidate{node={node-106}{668jPO1YSweDaqRyQxMRwA}{LwYDV2eBQAC3ialrV4qEbQ}{node106.home.lan}{10.10.10.13:9300}, clusterStateVersion=420} won election
[2017-11-08T15:01:06,801][DEBUG][o.e.d.z.ZenDiscovery     ] [node-106] elected as master, waiting for incoming joins ([1] needed)
[2017-11-08T15:01:06,822][INFO ][o.e.c.s.ClusterService   ] [node-106] new_master {node-106}{668jPO1YSweDaqRyQxMRwA}{LwYDV2eBQAC3ialrV4qEbQ}{node106.home.lan}{10.10.10.13:9300}, reason: zen-disco-elected-as-master ([5] nodes joined)[{node-113}{1QzSsxO8ToqgUtfYycgmlw}{NUnBCQ_eSeSZjQV5PE1Jtg}{node113.home.lan}{10.10.10.20:9300}, {node-104}{59gFtGGvQGGy0LLVMIAgig}{Eg5gYe77SYuXB6vZfKKicg}{node104.home.lan}{10.10.10.214:9300}, {node-107}{_-P9koqwQAaBDpPyoflXbw}{j6BbufsTQq6kU-TFbAZW5Q}{node107.home.lan}{10.10.10.14:9300}, {node-105}{kt2uxSxNTDe9JhRuuWORYA}{VFf4A4bfTw6WoagiH6lpTw}{node105.home.lan}{10.10.10.215:9300}, {node-103}{TPSttSagTZq2VCDUhX5cGw}{puBEOLAARtmigK1y0lH5yw}{node103.home.lan}{10.10.10.213:9300}]

Had a quick chat with @ywelsch, and it looks like a node that is shutting down will still reply to a discovery ping request if the shutdown has not yet finished.
Since such a node will be down shortly, it should not reply to discovery pings.
This could also speed up master election: in the above example, an extra 3-second ping cycle was added because the old master was still listed.
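
To illustrate why a stale response costs a full ping round, here is a deliberately simplified sketch. It assumes that a node which sees an active master in its ping responses tries to join that master instead of starting an election, which matches the "failed to connect to master ... retrying" line in the log above. `PingRoundSketch`, `PingResponse`, and `activeMaster` are hypothetical names, the response lists are illustrative sample data, and none of this is the actual ZenDiscovery code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Simplified model of one ping round; NOT the real ZenDiscovery logic.
public class PingRoundSketch {

    /** One ping response: the responding node and the master it reports (null if none). */
    public static final class PingResponse {
        final String node;
        final String reportedMaster;

        public PingResponse(String node, String reportedMaster) {
            this.node = node;
            this.reportedMaster = reportedMaster;
        }
    }

    /** Returns the first master reported by any response, or empty if none reports one. */
    public static Optional<String> activeMaster(List<PingResponse> responses) {
        return responses.stream()
                .map(r -> r.reportedMaster)
                .filter(Objects::nonNull)
                .findFirst();
    }

    public static void main(String[] args) {
        // Round 1 (illustrative data): node-108 is shutting down but still answers as master.
        List<PingResponse> firstRound = Arrays.asList(
                new PingResponse("node-108", "node-108"),
                new PingResponse("node-106", null));
        // Round 2 (illustrative data): node-108 no longer answers, nobody reports a master.
        List<PingResponse> secondRound = Arrays.asList(
                new PingResponse("node-106", null),
                new PingResponse("node-107", null));

        System.out.println(activeMaster(firstRound));  // Optional[node-108]: join attempt, no election
        System.out.println(activeMaster(secondRound)); // Optional.empty: election can proceed
    }
}
```

In this model the first round ends with a failed join attempt against node-108, so the election only happens after the second round, roughly 3 seconds later.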


When the current master node is shutting down, it sends a leave request to the other nodes so that they can eagerly start a fresh master election. Unfortunately, it was still possible for the master node that was shutting down to respond to ping requests, possibly influencing the election decision as it still appeared as an active master in the ping responses. This commit ensures that UnicastZenPing does not respond to ping requests once it's been closed. ZenDiscovery.doStop() continues to ensure that the pinging component is first closed before it triggers a master election. Closes #27328
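
The idea in the commit message above can be sketched as a ping handler guarded by a closed flag. This is a minimal illustration, not the actual UnicastZenPing change: `ClosablePingResponder`, `handlePing`, and the choice to return an empty `Optional` once closed are all assumptions made for the example.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a ping component; NOT the real UnicastZenPing.
public class ClosablePingResponder {
    private final String nodeName;
    // Set once when shutdown begins; checked on every incoming ping.
    private final AtomicBoolean closed = new AtomicBoolean(false);

    public ClosablePingResponder(String nodeName) {
        this.nodeName = nodeName;
    }

    /** Returns a response while the component is open, and empty once it has been closed. */
    public Optional<String> handlePing(String fromNode) {
        if (closed.get()) {
            // Assumption for this sketch: a closed component simply gives no answer.
            return Optional.empty();
        }
        return Optional.of("ping_response{node [" + nodeName + "], to [" + fromNode + "]}");
    }

    /** Called first during shutdown, before anything that triggers a new election. */
    public void close() {
        closed.set(true);
    }

    public static void main(String[] args) {
        ClosablePingResponder responder = new ClosablePingResponder("node-108");
        System.out.println(responder.handlePing("node-106").isPresent()); // true
        responder.close();
        System.out.println(responder.handlePing("node-106").isPresent()); // false
    }
}
```

The ordering matters: closing the responder before the leave request goes out means that pings started by the other nodes in reaction to that request cannot be answered by the departing master.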

jakommo added the :Distributed Coordination/Discovery-Plugins label Nov 9, 2017

ywelsch self-assigned this Nov 9, 2017

ywelsch mentioned this issue Nov 9, 2017

Stop responding to ping requests before master abdication #27329

Merged

colings86 added the >enhancement label Nov 9, 2017

ywelsch closed this as completed in #27329 Nov 13, 2017
