This problem happened in the context of Consul, but the issue is with memberlist inside of Consul.
When a single member of the cluster is unreachable but still able to send messages, it can pollute the gossip cluster with inaccurate and irrelevant information. This can easily happen by accident, for example by forgetting to open a port on an EC2 security group. The problem gets worse as more nodes in the cluster fail to cooperate.
When the suspected node doesn't refute the message fast enough, it gets kicked from the cluster. For our high availability components this is annoying but not a big problem; for our single instance nodes, though, it causes a big problem for DNS.
This gets even worse when this uncooperative node just starts telling the cluster that one of the servers is dead. This can end up causing a leader election if the server doesn't refute the message fast enough.
It would be nice if, on join, memberlist attempted to detect a misbehaving node and kicked it from the cluster before it could cause a cluster outage. This could be as simple as sending a ping and waiting for a response. Since this would be done when a node joins the cluster, it's probably better to avoid probeNode (which sends other messages) and only send the ping message.
Yes, a critical assumption of SWIM is a fully connected cluster. Any partially connected node like this can cause issues, and they do compound quite seriously. Fixing this or at least making it less painful is something we'd like to tackle but is fundamentally hard.
Our concern is more about guarding against an accidental misconfiguration than about handling a misbehaving node in general. Protecting against partially connected nodes is something we would like as well, but the key point here is that the node was never a properly participating member of the cluster.
Given node A and B with B being the misbehaving node:
Node B attempts to join A
Node A, as it was contacted by B, assumes it can contact Node B at the advertised address and confirms that it joined the cluster
Node A later attempts to contact Node B and can't
Node B acts like the town fool
I think the proposed solution of sending a ping to any member that attempts to join the cluster could reduce possible misconfigurations. I think I know the part in the code where this would go. Does this plan sound reasonable and should I start working on a PR?