Nodes in process of shutting down should not respond to discovery pings #27328

Closed
jakommo opened this issue Nov 9, 2017 · 0 comments · Fixed by #27329
Labels
:Distributed Coordination/Discovery-Plugins, >enhancement


jakommo commented Nov 9, 2017

It looks like a node that is being shut down still replies to a discovery ping.

Master node-108 was shut down gracefully but is still listed in the current nodes list. I guess this is because it's just a few milliseconds after the shutdown started.

[2017-11-08T15:01:00,779][INFO ][o.e.d.z.ZenDiscovery     ] [node-106] master_left [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}], reason [shut_down]
[2017-11-08T15:01:00,781][WARN ][o.e.d.z.ZenDiscovery     ] [node-106] master left (reason = shut_down), current nodes: nodes: 
...
   {node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}, master

Then, 3 seconds later, the ping responses come back and node-108 is still listed. I checked the log on node-108, but there is nothing later than 2017-11-08T15:01:00,843, which makes me wonder why it is still listed in the ping responses.

[2017-11-08T15:01:03,785][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] full ping responses:
...
        --> ping_response{node [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}], id[44], master [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}],cluster_state_version [420], cluster_name[my-cluster]}
...
[2017-11-08T15:01:03,789][WARN ][o.e.d.z.ZenDiscovery     ] [node-106] failed to connect to master [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}], retrying...

The only explanation I have is that node-108 replied to the ping before 2017-11-08T15:01:00,843, which would match the time the ping started (:00,782, before node-108's last log entry at :00,843):

[2017-11-08T15:01:00,782][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] starting to ping
[2017-11-08T15:01:00,782][TRACE][o.e.d.z.ZenDiscovery     ] [node-107] starting to ping

Then another ping is sent; 3 seconds later the responses no longer list node-108, and a new master is elected.

[2017-11-08T15:01:03,798][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] starting to ping
[2017-11-08T15:01:06,799][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] full ping responses:
...
        --> ping_response{node [{node-106}{668jPO1YSweDaqRyQxMRwA}{LwYDV2eBQAC3ialrV4qEbQ}{node106.home.lan}{10.10.10.13:9300}], id[39], master [null],cluster_state_version [420], cluster_name[my-cluster]}
[2017-11-08T15:01:06,800][TRACE][o.e.d.z.ZenDiscovery     ] [node-106] candidate Candidate{node={node-106}{668jPO1YSweDaqRyQxMRwA}{LwYDV2eBQAC3ialrV4qEbQ}{node106.home.lan}{10.10.10.13:9300}, clusterStateVersion=420} won election
[2017-11-08T15:01:06,801][DEBUG][o.e.d.z.ZenDiscovery     ] [node-106] elected as master, waiting for incoming joins ([1] needed)
[2017-11-08T15:01:06,822][INFO ][o.e.c.s.ClusterService   ] [node-106] new_master {node-106}{668jPO1YSweDaqRyQxMRwA}{LwYDV2eBQAC3ialrV4qEbQ}{node106.home.lan}{10.10.10.13:9300}, reason: zen-disco-elected-as-master ([5] nodes joined)[{node-113}{1QzSsxO8ToqgUtfYycgmlw}{NUnBCQ_eSeSZjQV5PE1Jtg}{node113.home.lan}{10.10.10.20:9300}, {node-104}{59gFtGGvQGGy0LLVMIAgig}{Eg5gYe77SYuXB6vZfKKicg}{node104.home.lan}{10.10.10.214:9300}, {node-107}{_-P9koqwQAaBDpPyoflXbw}{j6BbufsTQq6kU-TFbAZW5Q}{node107.home.lan}{10.10.10.14:9300}, {node-105}{kt2uxSxNTDe9JhRuuWORYA}{VFf4A4bfTw6WoagiH6lpTw}{node105.home.lan}{10.10.10.215:9300}, {node-103}{TPSttSagTZq2VCDUhX5cGw}{puBEOLAARtmigK1y0lH5yw}{node103.home.lan}{10.10.10.213:9300}]

Had a quick chat with @ywelsch, and it looks like a node that is shutting down still replies to a discovery ping request if the shutdown has not yet finished.
Since such a node will be down shortly, it should not reply to discovery pings.
This could also speed up master election: in the example above, an extra ping cycle of 3 seconds was added because the old master was still listed.
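
To make the effect of the stale reply concrete, here is a minimal sketch of the behaviour visible in the logs above. It is a simplified model, not ZenDiscovery's actual code: `PingReply`, `StalePingSketch`, and `reportedActiveMaster` are hypothetical names, and the rule "try to join any master that another node still reports" is an assumption inferred from the `failed to connect to master ... retrying` log line.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical, simplified ping reply: the responding node and the master it reports (null if none). */
final class PingReply {
    final String node;
    final String reportedMaster;

    PingReply(String node, String reportedMaster) {
        this.node = node;
        this.reportedMaster = reportedMaster;
    }
}

final class StalePingSketch {
    /**
     * Returns the first master that a reply reports as active, ignoring the local node.
     * An empty result means no reply names a master, so an election can proceed.
     */
    static Optional<String> reportedActiveMaster(List<PingReply> replies, String localNode) {
        for (PingReply reply : replies) {
            String master = reply.reportedMaster;
            if (master != null && !master.equals(localNode)) {
                return Optional.of(master);
            }
        }
        return Optional.empty();
    }
}
```

With the first round of replies (node-108 still answering and naming itself as master), this model returns node-108, so node-106 would try to join a master that is already going away. With the second round (only `master [null]` replies), the result is empty and the election can go ahead.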

@jakommo jakommo added the :Distributed Coordination/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure label Nov 9, 2017
@ywelsch ywelsch self-assigned this Nov 9, 2017
ywelsch added a commit that referenced this issue Nov 13, 2017
When the current master node is shutting down, it sends a leave request to the other nodes so that they can eagerly start a fresh master election. Unfortunately, it was still possible for the master node that was shutting down to respond to ping requests, possibly influencing the election decision as it still appeared as an active master in the ping responses. This commit ensures that UnicastZenPing does not respond to ping requests once it's been closed. ZenDiscovery.doStop() continues to ensure that the pinging component is first closed before it triggers a master election.

Closes #27328
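
The idea in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the actual change: `ClosablePingResponder` and `ShutdownSketch` are hypothetical stand-ins for `UnicastZenPing` and `ZenDiscovery.doStop()`, and rejecting the ping with an `IllegalStateException` is an assumed mechanism, since the commit message only says that no response is sent once the component is closed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a pinging component; not the real UnicastZenPing. */
final class ClosablePingResponder {
    private final String nodeName;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    ClosablePingResponder(String nodeName) {
        this.nodeName = nodeName;
    }

    /** Marks the responder as closed; calling it more than once is harmless. */
    void close() {
        closed.set(true);
    }

    /** Answers a ping with this node's name, or refuses once the responder is closed. */
    String handlePing() {
        if (closed.get()) {
            // Assumed rejection mechanism: the caller never gets a ping response.
            throw new IllegalStateException("node is shutting down");
        }
        return nodeName;
    }
}

final class ShutdownSketch {
    /** Closes the responder first, then notifies peers, following the ordering described above. */
    static void stop(ClosablePingResponder responder, Runnable sendLeaveRequest) {
        responder.close();
        sendLeaveRequest.run();
    }
}
```

Because `close()` runs before the leave request is sent, any ping that arrives after the other nodes have started their election is refused, so the departing master cannot appear in their ping responses.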