Permanent cluster member #222

Open
champtar opened this issue Apr 30, 2020 · 3 comments
Comments

@champtar

Hi All,

I'm using memberlist to provide fast dead-node detection in MetalLB, and I feel that some of the features I'm writing around memberlist should be included in the library itself:

  • have a list of members that we should try to reconnect to forever
  • use this list to detect split brain

If I have 4 members on 4 nodes and a network outage lasts 1 or 2 minutes, memberlist communication will time out and memberlist will not recover: each node ends up considering the local member to be the only one alive.

Would it make sense to you to:

  • make Join() a no-op for already joined and healthy members
  • add a PermanentJoin(hostlist) function that makes memberlist retry forever for the members in hostlist

The idea is for external code to simply call PermanentJoin(hostlist) whenever it sees a change in the K8s API.

@mayuresh82

Is there any workaround for this? Can the client simply attempt to rejoin periodically?

@champtar
Author

That is what we now do in MetalLB: a periodic rejoin.

@stilldavid

I noticed this in a fairly simple implementation. If there's a network outage to a single node, the other nodes correctly kick it out of their lists, but the single node also kicks everyone else out of its own list, becoming isolated and never rejoining, even after the network comes back.

The current workaround is to periodically check the list for a member count of 1 and rejoin if so. I'd love for Join() to be cheaper to call (a no-op for existing members, as @champtar recommended) so we can call it periodically without side effects such as syncing state, which might be expensive and unnecessary. Alternatively, a better internal mechanism to detect a solo split brain would help, as this seems like it might be a pretty common case.
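For reference, the check looks roughly like the sketch below. `rejoinIfAlone` and `seeds` are illustrative names of my own, not memberlist APIs. The two function parameters are meant to be the method values `list.NumMembers` and `list.Join` of a `*memberlist.Memberlist`, and the sketch assumes the local node counts as one member.

```go
package main

// rejoinIfAlone calls join only when the local node is the sole live
// member, i.e. the "solo split brain" case. It reports whether a
// rejoin was attempted, plus any error returned by join.
//
// numMembers and join are plain function values so that the method
// values list.NumMembers and list.Join can be passed in directly.
func rejoinIfAlone(
	numMembers func() int,
	join func(hosts []string) (int, error),
	seeds []string, // addresses of the peers to try, e.g. "10.0.0.2:7946"
) (bool, error) {
	// More than one member means we still see at least one peer,
	// so skip the potentially expensive Join and its state sync.
	if len(seeds) == 0 || numMembers() > 1 {
		return false, nil
	}
	_, err := join(seeds)
	return true, err
}
```

Called from a ticker, this keeps Join off the hot path while the cluster is healthy, but it only covers the case where a node is completely alone. A partition that leaves two nodes on each side would go unnoticed, which is why a mechanism inside the library seems preferable.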
