vault: revert #18998 to fix potential deadlock #19963

Merged · 2 commits into main on Feb 13, 2024

Conversation

@lgfa29 lgfa29 (Contributor) commented Feb 13, 2024

#19805 reports an issue where tasks get stuck in the pending status. Logs shared by the user indicate that the problem is likely that vault_hook.Prestart() never completes.

{"log":"    2024-02-09T15:16:40.868Z [TRACE] client.alloc_runner.task_runner: running prestart hook: alloc_id=95eddd7d-*** task=*** name=vault start=\"2024-02-09 15:16:40.868679902 +0000 UTC m=+27.567861561\"\n","stream":"stdout","time":"2024-02-09T15:16:40.868767694Z"}
... alloc watcher healthy deadline reached ...
{"log":"    2024-02-09T15:21:40.746Z [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: deadline reached; setting unhealthy: alloc_id=95eddd7d-*** deadline=\"2024-02-09 15:21:40.746133303 +0000 UTC m=+327.445314950\"\n","stream":"stdout","time":"2024-02-09T15:21:40.746292101Z"}
{"log":"    2024-02-12T09:05:52.690Z [TRACE] client.alloc_runner.task_runner: running prestart hook: alloc_id=96dade39-*** task=*** name=vault start=\"2024-02-12 09:05:52.69095185 +0000 UTC m=+24.113560832\"\n","stream":"stdout","time":"2024-02-12T09:05:52.692175834Z"}
... user stops faulty alloc after it has been stuck for 5min...
{"log":"    2024-02-12T09:10:12.183Z [TRACE] client.alloc_runner.task_runner: finished prestart hook: alloc_id=96dade39-*** task=*** name=vault end=\"2024-02-12 09:10:12.183310806 +0000 UTC m=+283.605919788\" duration=4m19.492358956s\n","stream":"stdout","time":"2024-02-12T09:10:12.183382225Z"}

Following the code path along with the logs, the hook is likely getting stuck waiting for a Vault token:

// Block until we get a token
select {
case <-h.future.Wait():
case <-ctx.Done():
return nil
}
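
For illustration, here is a minimal, self-contained sketch (not Nomad code; the tokenFuture type and its methods are assumptions) of how this select pattern hangs indefinitely when nothing ever resolves the future and the context is never cancelled:

package main

import (
	"context"
	"fmt"
	"time"
)

// tokenFuture is a hypothetical stand-in for the hook's h.future: Wait returns
// a channel that is closed once Set delivers a token.
type tokenFuture struct {
	done  chan struct{}
	token string
}

func newTokenFuture() *tokenFuture {
	return &tokenFuture{done: make(chan struct{})}
}

func (f *tokenFuture) Wait() <-chan struct{} { return f.done }

func (f *tokenFuture) Set(token string) {
	f.token = token
	close(f.done)
}

func main() {
	future := newTokenFuture()
	ctx := context.Background() // never cancelled, like a task that is not being stopped

	// Nothing ever calls future.Set(...), so this select would block forever;
	// the timeout case exists only so the demo terminates.
	select {
	case <-future.Wait():
		fmt.Println("got token:", future.token)
	case <-ctx.Done():
		fmt.Println("context cancelled")
	case <-time.After(2 * time.Second):
		fmt.Println("still blocked: the future was never resolved")
	}
}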

h.updater.updatedVaultToken(h.future.Get()) contains only one blocking call, the lock in setVaultToken, but that lock is unlikely to be contended since the task is still being created.

func (tr *TaskRunner) setVaultToken(token string) {
tr.vaultTokenLock.Lock()
defer tr.vaultTokenLock.Unlock()

h.updater.updatedVaultToken(h.future.Get()) also triggers the update hooks, so we would expect log entries that read running update hooks, but none are emitted for the task.

// Trigger update hooks with the new Vault token
tr.triggerUpdateHooks()
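
As a rough illustration of why the missing log line matters (assumed names below; this is not the actual TaskRunner code): the trigger typically just signals a worker goroutine, and it is that worker which would emit the "running update hooks" trace entry, so if that entry never appears, the trigger call was most likely never reached.

package main

import (
	"fmt"
	"time"
)

// hookRunner is a simplified, hypothetical model of the trigger/worker split:
// triggerUpdateHooks only signals; the worker loop logs and runs the hooks.
type hookRunner struct {
	triggerUpdateCh chan struct{}
}

func newHookRunner() *hookRunner {
	return &hookRunner{triggerUpdateCh: make(chan struct{}, 1)}
}

// triggerUpdateHooks performs a non-blocking send so callers never stall here.
func (r *hookRunner) triggerUpdateHooks() {
	select {
	case r.triggerUpdateCh <- struct{}{}:
	default: // an update is already pending
	}
}

// run is the worker loop that would produce the "running update hooks" entry.
func (r *hookRunner) run(stop <-chan struct{}) {
	for {
		select {
		case <-r.triggerUpdateCh:
			fmt.Println("[TRACE] running update hooks")
		case <-stop:
			return
		}
	}
}

func main() {
	r := newHookRunner()
	stop := make(chan struct{})
	go r.run(stop)

	r.triggerUpdateHooks() // only after this call can the trace line appear
	time.Sleep(100 * time.Millisecond)
	close(stop)
}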

In addition to deriving a Vault token for the task, vault_hook.Prestart() also immediately renews it, blocking until the operation completes.

// Start the renewal process.
//
// This is the initial renew of the token which we derived from the
// server. The client does not know how long it took for the token to
// be generated and derived and also wants to gain control of the
// process quickly, but not too quickly. We therefore use a hardcoded
// increment value of 30; this value without a suffix is in seconds.
//
// If Vault is having availability issues or is overloaded, a large
// number of initial token renews can exacerbate the problem.
renewCh, err := h.client.RenewToken(token, 30)

#18998 refactored the Vault token renewal loop in order to prevent a possible race condition where expired tokens are re-added to the renewal heap, but I suspect that implementation may have introduced new race conditions.

While trying to reproduce #19805, I bumped into another issue where the new implementation of vaultclient.RenewToken() can get stuck when renewing the same token concurrently. 43bec05 introduces a test for this bug.
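
A hedged sketch of the shape of such a test (not the actual test from 43bec05; the tokenRenewer interface and the fake client below are assumptions standing in for vaultclient):

package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenRenewer is an assumed, minimal view of the vaultclient surface used by
// the hook: RenewToken performs the initial renewal and returns an error
// channel for later renewals, matching the call shape shown above.
type tokenRenewer interface {
	RenewToken(token string, increment int) (<-chan error, error)
}

// checkConcurrentRenew renews the same token from two goroutines and reports
// an error if either call fails to return within the deadline, which is the
// deadlock symptom described above.
func checkConcurrentRenew(client tokenRenewer, token string, deadline time.Duration) error {
	var wg sync.WaitGroup
	errCh := make(chan error, 2)
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := client.RenewToken(token, 30); err != nil {
				errCh <- err
			}
		}()
	}

	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()

	select {
	case <-done:
		close(errCh)
		return <-errCh // nil if both renewals succeeded
	case <-time.After(deadline):
		return fmt.Errorf("concurrent RenewToken calls did not return within %s: possible deadlock", deadline)
	}
}

// wellBehavedClient is a trivial fake used only so this sketch runs end to end.
type wellBehavedClient struct{}

func (wellBehavedClient) RenewToken(string, int) (<-chan error, error) {
	return make(chan error), nil
}

func main() {
	if err := checkConcurrentRenew(wellBehavedClient{}, "s.sameToken", 5*time.Second); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok: both concurrent renewals returned")
}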

This scenario is unlikely to be triggered in a real Nomad cluster because each Vault token is owned by a single entity (a task runner), but it's still a bug, and the new implementation may contain others that are responsible for the behaviour observed in #19805.

Since the main issue with Vault renewals was fixed in #18985, I think it's better to roll back this refactored code until we are able to guarantee its correctness.

@lgfa29 lgfa29 added the backport/1.7.x backport to 1.7.x release line label Feb 13, 2024
@lgfa29 lgfa29 requested review from shoenig and tgross February 13, 2024 02:01
@lgfa29 lgfa29 added this to the 1.7.x milestone Feb 13, 2024
@tgross tgross (Member) left a comment

LGTM. I think we should probably take a second look at the overall design of the renewal loop -- the notion that we immediately renew when we derive the token seems bogus and it introduces extra places for us to block.

@tgross tgross merged commit 62b7d6f into main Feb 13, 2024
20 of 21 checks passed
@tgross tgross deleted the b-fix-vault-renew-deadlock branch February 13, 2024 14:50

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 19, 2025