Describe the bug
We deploy Vault on AWS:
A new Vault joins with no raft data to replace the old Vault (immutable infrastructure).
We set the autopilot config to Cleanup Dead Servers=true, Last Contact Threshold=10s and Dead Server Last Contact Threshold=10s.
Our raft data is about 500MB.
About 20%~30% of the time, when a new Vault instance starts up, it fails to join the Raft quorum. Even restarting the process won't make it rejoin.
Checking our server log, we found that:
1. The brand new Vault server managed to join the raft, then installed a snapshot from the leader during initialization.
2. When the raft snapshot is big, it takes more than 5 seconds to transfer the data, so the Vault is only unsealed by the second run of auto unseal, which happens at 10s.
3. During the initialization, the new Vault does not send any heartbeat, so at exactly 10s the leader considers the new Vault dead and removes it from raft.
Depending on whether step 2 or step 3 happens first, the new Vault will either join the quorum or become a dead node.
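The race between the unseal and the removal can be illustrated with a small timing model. This is a simplified sketch, not Vault code: the function names are hypothetical, the 5s auto-unseal retry interval is inferred from the report above (second attempt at 10s), and the model assumes the leader removes the node exactly when the dead-server threshold elapses.

```python
import math


def first_unseal_time(snapshot_install_s, unseal_retry_s=5.0):
    """Time of the first auto-unseal attempt at or after the snapshot install ends.

    Assumes unseal attempts fire at fixed multiples of unseal_retry_s
    (5s is inferred from the report, not taken from Vault documentation).
    """
    attempts = max(1, math.ceil(snapshot_install_s / unseal_retry_s))
    return attempts * unseal_retry_s


def outcome(snapshot_install_s, dead_server_threshold_s, unseal_retry_s=5.0):
    """Classify what happens to a new node that sends no heartbeat until unsealed."""
    unseal_at = first_unseal_time(snapshot_install_s, unseal_retry_s)
    if unseal_at < dead_server_threshold_s:
        return "joins"      # unsealed before the leader gives up on it
    if unseal_at > dead_server_threshold_s:
        return "removed"    # leader removes it before it can unseal
    return "race"           # both happen at the same instant; either can win


# A 7s snapshot install with a 10s threshold lands exactly on the race.
print(outcome(7.0, 10.0))
# With a 4s threshold (as in the reproduction steps) the node is always removed.
print(outcome(7.0, 4.0))
```

Under these assumptions, a snapshot install between 5s and 10s combined with a 10s threshold puts the unseal and the removal at the same instant, which matches the intermittent failures described above.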
To Reproduce
Configure Vault with AWS KMS auto unseal.
Configure autopilot with Cleanup Dead Servers=true, Last Contact Threshold=4s and Dead Server Last Contact Threshold=4s, so that the leader kicks the new node out before the first auto unseal.
Join a new node to the quorum.
The new node fails to join the quorum.
Expected behavior
The leader should not consider a new follower that is installing snapshot as a dead node.
This can be fixed by either:
The follower sending heartbeats during the snapshot installation.
The follower being removed from the quorum first, installing the snapshot and unsealing, then rejoining the quorum.
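The first option could look roughly like the following from the leader's point of view. This is a hypothetical sketch only: `FollowerState`, `installing_snapshot` and `is_dead` are illustrative names, not part of Vault or raft-autopilot, and it models the effect of the fix (an in-progress snapshot install counts as a sign of life) rather than the heartbeat mechanism itself.

```python
from dataclasses import dataclass


@dataclass
class FollowerState:
    """Hypothetical view of a follower as tracked by the leader."""
    seconds_since_last_contact: float
    installing_snapshot: bool = False


def is_dead(follower, dead_server_threshold_s):
    """Return True if the leader should treat the follower as dead.

    A follower that is still installing a snapshot is never considered dead,
    regardless of how long ago it last made contact.
    """
    if follower.installing_snapshot:
        return False
    return follower.seconds_since_last_contact >= dead_server_threshold_s


# Silent for 12s but still installing a snapshot: kept in the quorum.
print(is_dead(FollowerState(12.0, installing_snapshot=True), 10.0))
# Silent for 12s with no snapshot install in progress: cleaned up.
print(is_dead(FollowerState(12.0), 10.0))
```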
Environment:
Vault Server Version (retrieve with vault status): 1.10.3+ent
Vault CLI Version (retrieve with vault version):
Server Operating System/Architecture:
Yes, we've encountered this elsewhere as well when dead server cleanup threshold is set too aggressively. The fix we have planned is hashicorp/raft-autopilot#17, though re-reading that I'm not sure it's sufficient in the non-voter case.
I suggest you increase the dead server cleanup threshold.
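As a rough way to pick a value, the threshold needs to exceed the time until the first auto-unseal attempt that follows the snapshot install. A back-of-the-envelope sketch, where the throughput, the 5s unseal retry interval and the safety margin are all assumptions rather than measured or documented values:

```python
import math


def min_dead_server_threshold(snapshot_mb, throughput_mb_per_s,
                              unseal_retry_s=5.0, margin_s=5.0):
    """Estimate a lower bound (in seconds) for the dead server threshold.

    All defaults are illustrative assumptions: unseal attempts are modeled
    as firing at fixed multiples of unseal_retry_s, and margin_s is an
    arbitrary safety buffer on top of the first successful attempt.
    """
    install_s = snapshot_mb / throughput_mb_per_s
    attempts = max(1, math.ceil(install_s / unseal_retry_s))
    return attempts * unseal_retry_s + margin_s


# 500MB of raft data at a hypothetical 80 MB/s takes 6.25s to transfer,
# so the second unseal attempt (10s) is the first that can succeed.
print(min_dead_server_threshold(500, 80))
```

The estimate grows with the snapshot size, so it is worth revisiting the threshold as the raft data grows.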
Yes, I have already updated our autopilot config and we don't have the issue anymore.
Still, I think this is something that should be fixed properly in the code.
Having read through hashicorp/raft-autopilot#17, I am not sure it is enough to fix what is described here.
@weichuliu I'm going to go ahead and close this issue now as you were able to get better behavior by changing the dead server cleanup threshold. The engineering team is discussing ways to fix this programmatically in the future. Thanks!