vault: revert #18998 to fix potential deadlock #19963
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
#19805 reports an issue where tasks get stuck in
pending
status. Logs shared by the user point that the problem is likely thatvault_hook.Prestart()
never completes.Following the code path along with the logs, the hook is likely getting stuck waiting for a Vault token:
nomad/client/allocrunner/taskrunner/vault_hook.go
Lines 216 to 221 in 1bde7a8
h.updater.updatedVaultToken(h.future.Get())
only contains one blocking call insetVaultToken
, but the lock is unlikely to be blocked since the task is still being created.nomad/client/allocrunner/taskrunner/task_runner_getters.go
Lines 65 to 67 in 1bde7a8
h.updater.updatedVaultToken(h.future.Get())
also triggers the the update hooks, so we would expect log entries that readrunning update hooks
, but none is emitted for the task.nomad/client/allocrunner/taskrunner/vault_hook.go
Lines 54 to 55 in 1bde7a8
In addition to deriving a Vault token for the task,
vault_took.Prestart()
also immediately renews it, blocking until the operation completes.nomad/client/allocrunner/taskrunner/vault_hook.go
Lines 296 to 306 in 1bde7a8
#18998 refactored the Vault token renewal loop in order to prevent a possible race condition where expired tokens are re-added to the renewal heap, but I suspect that implementation may have introduced new race conditions.
While trying to reproduce #19805 I bumped into another issue where the new implementation of
vaultclient.RenewToken()
can get stuck when renewing the same token concurrently. 43bec05 introduces a test for this bug.This scenario is unlikely to be triggered in a real Nomad cluster because each Vault token is owned by a specific entity (task runners), but it's still a bug and the new implementation may contain others that are responsible for the behaviour observed in #19805.
Since the main issue with Vault renewals was fixed in #18985 I think it's better to rollback this refactored code until we are able to guarantee its correctness.