Race condition preventing Vault from joining quorum #16919

Closed
weichuliu opened this issue Aug 29, 2022 · 3 comments
Labels
bug, storage/raft

Comments

@weichuliu

Describe the bug

We deploy Vault on AWS:

  1. We use AWSKMS for auto unseal.
  2. A new Vault node joins with no Raft data to replace an old node (immutable infrastructure).
  3. Autopilot is configured with Cleanup Dead Servers=true, Last Contact Threshold=10s and Dead Server Last Contact Threshold=10s.
  4. Our Raft data is about 500MB.

About 20% to 30% of the time, a new Vault instance fails to join the Raft quorum when it starts up. Even restarting the process does not make it rejoin.

Checking our server log, we found that:

  1. Auto unseal runs every 5 seconds.
  2. The brand-new Vault server managed to join the Raft cluster, then installed a snapshot from the leader during initialization. When the snapshot is large, transferring it takes more than 5 seconds, so the node was only unsealed by the second auto-unseal run, at 10s.
  3. During initialization, the new Vault node does not send any heartbeats, so at exactly 10s the leader considered it dead and removed it from the Raft cluster.
  4. Depending on whether step 2 or step 3 completes first, the new Vault node either joins the quorum or becomes a dead node.
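
The timing described in these steps can be sketched as a small model. This is only an illustration of the reasoning in the report, not Vault's actual logic: the function names are hypothetical, and it assumes the node is unsealed by the first auto-unseal run at or after the snapshot install finishes, and that the leader removes the node once the dead-server threshold passes without contact.

```python
import math


def first_unseal_time(snapshot_install_s, unseal_interval_s=5.0):
    """Time of the first auto-unseal run at or after the snapshot install finishes.

    Assumption: auto unseal runs at t = interval, 2 * interval, ...
    """
    runs = max(1, math.ceil(snapshot_install_s / unseal_interval_s))
    return runs * unseal_interval_s


def join_outcome(snapshot_install_s, dead_threshold_s, unseal_interval_s=5.0):
    """Classify a join attempt under the simplified model described above."""
    unseal_at = first_unseal_time(snapshot_install_s, unseal_interval_s)
    if unseal_at < dead_threshold_s:
        return "joins"      # unsealed and heartbeating before the leader gives up
    if unseal_at > dead_threshold_s:
        return "removed"    # leader drops the node before it is unsealed
    return "race"           # both events land on the same tick; either may win


if __name__ == "__main__":
    # A snapshot that takes 7s to install, with the 10s threshold from the report.
    print(join_outcome(7, 10))
    # The 4s threshold from the reproduction steps.
    print(join_outcome(3, 4))
```

With a 5s unseal interval and a 10s threshold, any install that takes between 5s and 10s puts the unseal and the removal on the same tick, which matches the intermittent failures described above.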

To Reproduce

  1. Configure Vault with AWSKMS auto unseal
  2. Configure autopilot with Cleanup Dead Servers=true, Last Contact Threshold=4s and Dead Server Last Contact Threshold=4s, so that the leader removes the new node before its first auto-unseal run.
  3. Join a new node to the quorum.
  4. The new node fails to join the quorum.
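
For step 2, the autopilot settings could be expressed as a request body for Vault's autopilot configuration API. This is a sketch under assumptions: the endpoint path and field names below are taken from memory of the Vault API documentation and should be verified against your Vault version, and `autopilot_payload` is a hypothetical helper, not part of any Vault client.

```python
import json

# Assumed endpoint for the Raft autopilot configuration; verify for your Vault version.
AUTOPILOT_CONFIG_PATH = "/v1/sys/storage/raft/autopilot/configuration"


def autopilot_payload(last_contact="4s", dead_server_last_contact="4s",
                      cleanup_dead_servers=True):
    """Build the JSON body for the aggressive thresholds used in the reproduction."""
    return {
        "cleanup_dead_servers": cleanup_dead_servers,
        "last_contact_threshold": last_contact,
        "dead_server_last_contact_threshold": dead_server_last_contact,
    }


if __name__ == "__main__":
    # Print the body that would be POSTed to AUTOPILOT_CONFIG_PATH.
    print(json.dumps(autopilot_payload(), indent=2))
```

The same helper with `"10s"` for both thresholds gives the production settings listed under "Describe the bug".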

Expected behavior

The leader should not consider a new follower that is installing snapshot as a dead node.

This can be fixed in either of two ways:

  • The follower sends heartbeats during snapshot installation.
  • The follower is first removed from the quorum, installs the snapshot and unseals, then rejoins the quorum.
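
The effect of the first option can be illustrated with a toy model of the leader's dead-server check. This is a sketch only: `removed_as_dead` is a hypothetical function, and it assumes the leader removes a node once its longest period without contact reaches the threshold.

```python
def removed_as_dead(silent_s, dead_threshold_s, heartbeat_interval_s=None):
    """Return True if the leader would remove the follower in this model.

    silent_s: time until the follower is unsealed and heartbeating normally.
    heartbeat_interval_s: None models the reported behavior (no contact at
    all during snapshot installation); a number models the proposed fix,
    where the follower keeps sending heartbeats at that interval.
    """
    if heartbeat_interval_s is None:
        longest_silence = silent_s
    else:
        longest_silence = min(silent_s, heartbeat_interval_s)
    # Treat hitting the threshold exactly as a removal (the conservative case).
    return longest_silence >= dead_threshold_s


if __name__ == "__main__":
    # Reported behavior: 10s of silence against a 10s threshold.
    print(removed_as_dead(10, 10))
    # Proposed fix: heartbeats every second while the snapshot installs.
    print(removed_as_dead(10, 10, heartbeat_interval_s=1))
```

In this model the snapshot size no longer matters once heartbeats continue during installation, because the leader's view of last contact is bounded by the heartbeat interval rather than by the install time.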

Environment:

  • Vault Server Version (retrieve with vault status): 1.10.3+ent
  • Vault CLI Version (retrieve with vault version):
  • Server Operating System/Architecture:
@ncabatoff
Collaborator

Hi @weichuliu,

Yes, we've encountered this elsewhere as well when the dead server cleanup threshold is set too aggressively. The fix we have planned is hashicorp/raft-autopilot#17, though on re-reading it I'm not sure it's sufficient in the non-voter case.

I suggest you increase the dead server cleanup threshold.

@weichuliu
Author

@ncabatoff

Yes, I have already updated our autopilot config and we don't have the issue anymore.
Still, I think this is something that should be fixed properly in the code.

Having read through hashicorp/raft-autopilot#17, I am not sure that it is enough to fix what is described here.

@heatherezell added the bug label and removed the waiting-for-response label on Aug 30, 2022
@heatherezell
Contributor

@weichuliu I'm going to go ahead and close this issue now as you were able to get better behavior by changing the dead server cleanup threshold. The engineering team is discussing ways to fix this programmatically in the future. Thanks!
