Ping a node that tries to join the cluster before letting it join #45

jsternberg · 2015-07-21T21:32:11Z

This problem happened in the context of Consul, but the issue is with memberlist inside of Consul.

When a single member of the cluster is unreachable but still able to send messages, it can pollute the gossip cluster with inaccurate and irrelevant information. This can easily happen by accident, for example by forgetting to open a port on an EC2 security group. The problem gets worse as the number of uncooperative nodes in the cluster grows.

When the appropriate node doesn't refute the message fast enough, it gets kicked from the cluster. For our high availability components this isn't a big problem, albeit annoying, but for our single instance nodes this causes a big problem for DNS.

This gets even worse when this uncooperative node just starts telling the cluster that one of the servers is dead. This can end up causing a leader election if the server doesn't refute the message fast enough.

It would be nice if, on join, memberlist attempted to detect a misbehaving node and kicked it from the cluster before it can cause a cluster outage. This could be as simple as sending a single ping and waiting for a response. Since this would be done on joining the cluster, it's probably better to avoid probeNode (which sends other messages) and only send the ping message.

armon · 2015-07-22T19:52:24Z

Yes, a critical assumption of SWIM is a fully connected cluster. Any partially connected node like this can cause issues, and they do compound quite seriously. Fixing this or at least making it less painful is something we'd like to tackle but is fundamentally hard.

jsternberg · 2015-07-24T23:31:17Z

Our problem is more about preventing an accidental misconfiguration than about guarding against a misbehaving node in general. While we would also like protection against partially connected nodes, the key point here is that the node was never a participating member of the cluster.

Given node A and B with B being the misbehaving node:

1. Node B attempts to join A.
2. Node A, as it was contacted by B, assumes it can contact Node B at the advertised address and confirms that B has joined the cluster.
3. Node A later attempts to contact Node B and can't.
4. Node B acts like the town fool.

I think the proposed solution of sending a ping to any member that attempts to join the cluster could reduce possible misconfigurations. I think I know the part in the code where this would go. Does this plan sound reasonable and should I start working on a PR?

armon added the enhancement label Jul 22, 2015
