-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Vault sealing itself when a storage error occur while unsealing #19459
Open
remilapeyre
wants to merge
1
commit into
hashicorp:main
Choose a base branch
from
remilapeyre:unseal-shutdown
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
When running in HA mode Vault will perform some operations right after getting leadership: 1. save the upgrades of the keyring to the storage backend so that they can be loaded by the standby nodes with no disruption 2. reload the root key 3. reload the entire keyring 4. reload the shaming keys 5. start a background task to cleanup the updates written in 1. to the storage backend after they have been loaded by the standby nodes If any error happens during those tasks Vault will suppose that it is because the keys it has loaded where not correct, it then relinquish the leadership lock so that another server can start active operation and finally it seals itself. This is a correct behavior in some cases but not always, for example if Vault loses the leadership while doing those operations they will be cancelled. The node should then go in standby mode but not get sealed because the error is not related to an issue with the keys. This case has been specifically handled by 8e93f59. There is other issues that can trigger Vault sealing itself like an issue with the storage backend. Most storage backends like Consul will fail if there is some network issues. This mean that Vault can seal itself when there is a network disruption and not unseal automatically when the disruption is fixed. This defeats the purpose of high-availability backends and auto-unseal. This patch does two things: first it improves the sys/health endpoint to return the specific status code 523 when an error has happened during the post-unseal operations with a new "post_unseal_failed" attribute in the body. It also improves the resiliency of the post-unseal process by not sealing the Vault when an error happens when accessing the storage backend. Those errors will just be logged, the leadership will be returned and sys/health will return 523 but the node will try again to take the leadership and resume operations if the underlying issue with the storage backend has been resolved. This is a compromise between keeping Vault alive despite small network issues, but still reporting clearly an error to the operators in case it requires manual intervention. Other errors like failing to decrypt or decode the keyring will continue to seal the Vault like before. Related to hashicorp#10552 Related to hashicorp#3896
remilapeyre
force-pushed
the
unseal-shutdown
branch
from
March 6, 2023 00:33
554b260
to
a417b3d
Compare
Tagging @schavis for docs review <3 |
Hi, @remilapeyre! Thanks for the PR. Would you be able to fix the conflicting files so we can review this? Thanks! |
Quick ping here! @remilapeyre if you can resolve the merge conflicts, I'll push this in front of an engineer. Thanks! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When running in HA mode Vault will perform some operations right after getting leadership:
If any error happens during those tasks Vault will suppose that it is because the keys it has loaded where not correct, it then relinquish the leadership lock so that another server can start active operation and finally it seals itself.
This is a correct behavior in some cases but not always, for example if Vault loses the leadership while doing those operations they will be cancelled. The node should then go in standby mode but not get sealed because the error is not related to an issue with the keys. This case has been specifically handled by 8e93f59.
There is other issues that can trigger Vault sealing itself like an issue with the storage backend. Most storage backends like Consul will fail if there is some network issues. This mean that Vault can seal itself when there is a network disruption and not unseal automatically when the disruption is fixed. This defeats the purpose of high-availability backends and auto-unseal.
This patch does two things: first it improves the sys/health endpoint to return the specific status code 523 when an error has happened during the post-unseal operations with a new "post_unseal_failed" attribute in the body.
It also improves the resiliency of the post-unseal process by not sealing the Vault when an error happens when accessing the storage backend. Those errors will just be logged, the leadership will be returned and sys/health will return 523 but the node will try again to take the leadership and resume operations if the underlying issue with the storage backend has been resolved.
This is a compromise between keeping Vault alive despite small network issues, but still reporting clearly an error to the operators in case it requires manual intervention.
Other errors like failing to decrypt or decode the keyring will continue to seal the Vault like before.
Related to #10552
Related to #3896