Ping a node that tries to join the cluster before letting it join #45

Open
jsternberg opened this issue Jul 21, 2015 · 2 comments

Comments

@jsternberg
Contributor

This problem happened in the context of Consul, but the issue is with memberlist inside of Consul.

When a single member of the cluster is unreachable but still able to send messages, it can pollute the gossip cluster with inaccurate and irrelevant information. This happens accidentally pretty easily, for example by forgetting to open a port on an EC2 security group. The problem gets worse as more nodes in the cluster fail to cooperate.

When the node being accused doesn't refute the message fast enough, it gets kicked from the cluster. For our high availability components this isn't a big problem, albeit annoying, but for our single instance nodes this causes a big problem for DNS.

This gets even worse when this uncooperative node just starts telling the cluster that one of the servers is dead. This can end up causing a leader election if the server doesn't refute the message fast enough.

It would be nice if, on join, memberlist attempted to detect a misbehaving node and kicked it from the cluster before it could cause a cluster outage. This could be as simple as sending a ping and waiting for a response. Since this would be done on joining the cluster, it's probably better to avoid probeNode (which sends other messages) and only send the ping message.

@armon
Member

armon commented Jul 22, 2015

Yes, a critical assumption of SWIM is a fully connected cluster. Any partially connected node like this can cause issues, and they do compound quite seriously. Fixing this or at least making it less painful is something we'd like to tackle but is fundamentally hard.

@jsternberg
Contributor Author

Our problem is more about preventing an accidental misconfiguration than merely guarding against a misbehaving node. While we would also like protection against a partially connected node, the biggest issue here is that the node was never a participating member of the cluster.

Given node A and B with B being the misbehaving node:

  • Node B attempts to join A
  • Node A, as it was contacted by B, assumes it can contact Node B at the advertised address and confirms that it joined the cluster
  • Node A later attempts to contact Node B and can't
  • Node B acts like the town fool
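
The sequence above can be traced with a toy in-memory model. This is a sketch under stated assumptions, not memberlist's data structures: the `cluster` type, `newNodeA`, the `verify` flag, and the address `10.0.0.2:7946` are all hypothetical. With `verify` off, the advertised address is trusted and B ends up as a member that A cannot contact; with it on, the join is refused up front.

```go
package main

import "fmt"

// cluster is a toy model of node A's view, not memberlist's real state.
type cluster struct {
	members map[string]string // node name -> advertised address
	// reachable reports whether node A can contact addr.
	reachable func(addr string) bool
}

// newNodeA builds node A's view, where every address in blocked is
// unreachable from A (for example, a closed security group port).
func newNodeA(blocked map[string]bool) *cluster {
	return &cluster{
		members:   map[string]string{},
		reachable: func(addr string) bool { return !blocked[addr] },
	}
}

// join admits a node. With verify == false the advertised address is
// trusted simply because the node made contact. With verify == true the
// hypothetical join-time reachability check is applied first.
func (c *cluster) join(name, addr string, verify bool) bool {
	if verify && !c.reachable(addr) {
		return false
	}
	c.members[name] = addr
	return true
}

// unreachableMembers lists admitted nodes that A cannot contact.
func (c *cluster) unreachableMembers() []string {
	var out []string
	for name, addr := range c.members {
		if !c.reachable(addr) {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	blocked := map[string]bool{"10.0.0.2:7946": true} // hypothetical address for B

	a := newNodeA(blocked)
	fmt.Println(a.join("B", "10.0.0.2:7946", false), a.unreachableMembers()) // true [B]

	a = newNodeA(blocked)
	fmt.Println(a.join("B", "10.0.0.2:7946", true), a.unreachableMembers()) // false []
}
```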

I think the proposed solution of sending a ping to any member that attempts to join the cluster could reduce possible misconfigurations. I think I know the part in the code where this would go. Does this plan sound reasonable and should I start working on a PR?
