Race condition preventing Vault from joining quorum #16919

weichuliu · 2022-08-29T06:33:25Z

Describe the bug

We deploy Vault on AWS:

We use AWSKMS to auto unseal
A new Vault joins with no raft data to replace old Vault (immutable infrastructure)
We set autopilot config as Cleanup Dead Servers=true, Last Contact Threshold=10s and Dead Server Last Contact Threshold=10s
Our raft data is about 500MB

About 20% to 30% of the time, a new Vault instance fails to join the Raft quorum when it starts up. Even restarting the process does not make it rejoin.

Checking our server log, we found that:

1. Auto unseal runs every 5 seconds.
2. The brand-new Vault server managed to join the raft cluster, then installed a snapshot from the leader during initialization. When the raft snapshot is big, transferring the data takes more than 5 seconds, so Vault was only unsealed by the second run of auto unseal, at 10s.
3. During initialization, the new Vault does not send any heartbeats, so at exactly 10s the leader considered the new Vault dead and removed it from raft.
4. Depending on whether step 2 or step 3 happens first, the new Vault either joins the quorum or becomes a dead node.
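
To make the timing concrete, here is a rough model of the race in Python. It is only a sketch under simplifying assumptions that are mine, not Vault's actual logic: the join happens at t=0, auto unseal is attempted at every multiple of the interval, the first attempt at or after the snapshot install finishes succeeds, and the node only starts heartbeating once unsealed. The function names are hypothetical.

```python
import math


def first_unseal_time(snapshot_install_s, unseal_interval_s=5.0):
    """Time of the first auto-unseal attempt at or after the snapshot install ends.

    Assumption: attempts happen at interval, 2*interval, 3*interval, ...
    """
    attempts = max(1, math.ceil(snapshot_install_s / unseal_interval_s))
    return attempts * unseal_interval_s


def join_outcome(snapshot_install_s, dead_threshold_s, unseal_interval_s=5.0):
    """Classify the startup as 'joins', 'removed', or 'race' (simplified model)."""
    unsealed_at = first_unseal_time(snapshot_install_s, unseal_interval_s)
    if unsealed_at < dead_threshold_s:
        return "joins"      # heartbeats resume before the leader gives up
    if unsealed_at > dead_threshold_s:
        return "removed"    # leader prunes the node while it is still silent
    return "race"           # both events land on the same instant


if __name__ == "__main__":
    # A 7s snapshot install with a 10s dead-server threshold: unseal also lands at 10s.
    print(join_outcome(snapshot_install_s=7, dead_threshold_s=10))
```

In this model, a snapshot install that takes between 5 and 10 seconds with a 10s threshold puts the unseal and the removal at the same instant, which matches the intermittent failures described above.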

To Reproduce

Configure Vault with AWSKMS auto unseal
Configure autopilot with Cleanup Dead Servers=true, Last Contact Threshold=4s and Dead Server Last Contact Threshold=4s, so that the leader kicks the new node out before the first auto unseal.
Join a new node to the quorum.
The new node fails to join the quorum.
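
For anyone scripting the reproduction, the autopilot settings above could be expressed as a request body like the one built below. This is a sketch: the field names reflect my understanding of Vault's `sys/storage/raft/autopilot/configuration` endpoint and should be checked against the API docs for your version. The helper name is hypothetical, and `min_quorum` is optional here only because no value for it is given in this report (Vault may expect one when dead-server cleanup is enabled).

```python
import json


def autopilot_config_payload(threshold="4s", min_quorum=None):
    """Build a JSON-serializable body for the autopilot settings in the repro steps.

    Field names are assumed from the Vault HTTP API; verify before use.
    """
    payload = {
        "cleanup_dead_servers": True,
        "last_contact_threshold": threshold,
        "dead_server_last_contact_threshold": threshold,
    }
    if min_quorum is not None:
        # Not stated in the report; included only when the caller supplies it.
        payload["min_quorum"] = min_quorum
    return payload


if __name__ == "__main__":
    print(json.dumps(autopilot_config_payload(), indent=2))
```

The script only prints the body; sending it to a cluster is left out on purpose.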

Expected behavior

The leader should not consider a new follower that is installing snapshot as a dead node.

This can be fixed in either of two ways:

The follower sends heartbeats during the snapshot installation.
The follower is removed from the quorum first, installs the snapshot and unseals, then rejoins the quorum.
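
The effect of the first option can be illustrated with a small tick-based simulation. It is a sketch under my own assumptions, not how Vault or raft-autopilot implement the check: time advances in whole seconds, the join at t=0 counts as the first contact, the leader treats a node as dead once the silence reaches the threshold, and ties at the exact moment the node becomes ready are not modeled. The function and parameter names are hypothetical.

```python
def pruned_before_ready(ready_at_s, dead_threshold_s, heartbeat_every_s=None):
    """Return True if the leader would prune the node before it is ready.

    ready_at_s: second at which the node is unsealed and heartbeats normally.
    heartbeat_every_s: optional heartbeat period during snapshot install
                       (None models the behavior reported in this issue).
    """
    last_contact_s = 0  # assumption: the join itself counts as a contact
    for now_s in range(1, ready_at_s):
        if heartbeat_every_s and now_s % heartbeat_every_s == 0:
            last_contact_s = now_s  # follower checks in during the install
        if now_s - last_contact_s >= dead_threshold_s:
            return True  # leader considers the node dead and removes it
    return False


if __name__ == "__main__":
    # Node needs 15s to become ready; the dead-server threshold is 10s.
    print(pruned_before_ready(15, 10))                       # silent install
    print(pruned_before_ready(15, 10, heartbeat_every_s=2))  # heartbeats during install
```

With no heartbeats the node is pruned once the silence reaches the threshold; with periodic heartbeats the silence never gets close to it, regardless of how long the snapshot install takes.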

Environment:

Vault Server Version (retrieve with vault status): 1.10.3+ent
Vault CLI Version (retrieve with vault version):
Server Operating System/Architecture:


ncabatoff · 2022-08-29T11:58:56Z

Hi @weichuliu,

Yes, we've encountered this elsewhere as well when the dead server cleanup threshold is set too aggressively. The fix we have planned is hashicorp/raft-autopilot#17, though on re-reading it I'm not sure it's sufficient in the non-voter case.

I suggest you increase the dead server cleanup threshold.

weichuliu · 2022-08-30T08:00:39Z

@ncabatoff

Yes, I have already updated our autopilot config and we don't have the issue anymore.
Still, I think this is something that should be fixed properly in the code.

Having read through hashicorp/raft-autopilot#17, I am not sure it is enough to fix what is described here.

heatherezell · 2022-08-30T16:45:15Z

@weichuliu I'm going to go ahead and close this issue now as you were able to get better behavior by changing the dead server cleanup threshold. The engineering team is discussing ways to fix this programmatically in the future. Thanks!

heatherezell added storage/raft waiting-for-response labels Aug 29, 2022

heatherezell added the bug label and removed the waiting-for-response label Aug 30, 2022

heatherezell closed this as completed Aug 30, 2022

peteski22 mentioned this issue Oct 24, 2022

Vault 6815/dont prune servers that are joining hashicorp/raft-autopilot#22

Closed

peteski22 mentioned this issue Nov 1, 2022

Vault 6815/respect min quorum hashicorp/raft-autopilot#23

Merged
