vault: avoid continual renewal of invalid token #18985
Conversation
LGTM! Nice work on this @lgfa29!
client/vaultclient/vaultclient.go (Outdated)

// Add renewal request to the heap to start tracking the token.
c.lock.Lock()
c.heap.Push(renewalReq, time.Now())
c.lock.Unlock()
Should we remove the original `heap.Push`, or at least change this comment?

// If the identifier is not already tracked, this is a first
// renewal request. In this case, add an entry into the heap
// with the next renewal time.

Actually, is that whole branch now unreachable, because the `if c.isTracked(req.id)` conditional will always be true?
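To make the branch in question concrete, here is a minimal, hypothetical sketch of the renewal path with simplified stand-in types; it is not the real contents of client/vaultclient/vaultclient.go, only an illustration of where the "not tracked" branch sits:

```go
package vaultclient

import (
	"sync"
	"time"
)

// renewalRequest and renewalHeap are simplified stand-ins for the real
// types in client/vaultclient/vaultclient.go; names are assumptions.
type renewalRequest struct{ id string }

type renewalHeap struct{ entries map[string]time.Time }

func (h *renewalHeap) push(id string, next time.Time)   { h.entries[id] = next }
func (h *renewalHeap) update(id string, next time.Time) { h.entries[id] = next }

type vaultClient struct {
	lock sync.Mutex
	heap *renewalHeap
}

func (c *vaultClient) isTracked(id string) bool {
	_, ok := c.heap.entries[id]
	return ok
}

// renew illustrates the branch being discussed: once RenewToken also pushes
// the request onto the heap before calling renew, the "not tracked" branch
// below can no longer be reached for token renewals.
func (c *vaultClient) renew(req *renewalRequest, nextRenewal time.Time) {
	c.lock.Lock()
	defer c.lock.Unlock()

	if c.isTracked(req.id) {
		// Already tracked: only update the next renewal time.
		c.heap.update(req.id, nextRenewal)
		return
	}

	// If the identifier is not already tracked, this is a first
	// renewal request. In this case, add an entry into the heap
	// with the next renewal time.
	c.heap.push(req.id, nextRenewal)
}
```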
Hmm... good point. Looking at the code again, I think this change can actually add a new race condition where the token is renewed twice!

When `RenewToken()` pushes the token to the heap, the `run()` loop may pick it up for renewal right away, but then `RenewToken()` calls `c.renew()` for an immediate renewal, resulting in a double renew. I don't think that's a problem, but it's certainly unnecessary.

I think we could make it so only the `run()` loop actually renews tokens and `RenewToken()` blocks until the renewal request completes, but that could add some latency to task start.

Another option is to revert the new `!c.isTracked` check and just adjust the error so it's considered fatal, and we only get one error at most. The `select` race condition that it tries to address seems rather unlikely to happen, so a single sporadic error may not be too bad?

Another option would be to have different behaviours if `c.renew()` is called from `RenewToken()` or `run()`, like a new `renewUntracked()` function that doesn't check the heap 🤔
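As a rough sketch of the first option, assuming hypothetical `errCh`, `pending`, and `updateCh` fields that are not in the real client: only the `run()` loop performs renewals, and `RenewToken()` blocks on a result channel until the first renewal completes.

```go
package vaultclient

import (
	"sync"
	"time"
)

// renewalRequest carries a result channel so RenewToken can wait for the
// run() loop; these field names are assumptions for illustration only.
type renewalRequest struct {
	id    string
	errCh chan error
}

type vaultClient struct {
	lock     sync.Mutex
	pending  map[string]*renewalRequest // stand-in for the renewal heap
	updateCh chan struct{}              // nudges the run() loop
}

// RenewToken only enqueues the request and blocks until run() has renewed
// the token once, so the renewal logic lives in a single place.
func (c *vaultClient) RenewToken(id string) error {
	req := &renewalRequest{id: id, errCh: make(chan error, 1)}

	c.lock.Lock()
	c.pending[id] = req
	c.lock.Unlock()

	// Wake the run() loop so it notices the new entry right away.
	select {
	case c.updateCh <- struct{}{}:
	default:
	}

	// Blocking here is where the extra task-start latency could come from
	// if run() is busy renewing another token or lease.
	return <-req.errCh
}

// run is the only place that actually calls Vault to renew.
func (c *vaultClient) run(stopCh <-chan struct{}) {
	for {
		select {
		case <-stopCh:
			return
		case <-c.updateCh:
		case <-time.After(time.Second):
		}

		c.lock.Lock()
		for id, req := range c.pending {
			req.errCh <- c.renewOnce(id) // report the result back to the caller
			delete(c.pending, id)
		}
		c.lock.Unlock()
	}
}

// renewOnce is a placeholder for the real renewal call to Vault.
func (c *vaultClient) renewOnce(id string) error { return nil }
```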
> I think we could make it so only the `run()` loop actually renews tokens and `RenewToken()` blocks until the renewal request completes, but that could add some latency to task start.

I like this idea, because trying to call `renew` from two different places seems to be the root of the race. The only potential added latency would come from another token/lease being renewed at the same time as the task starts, so that it hits the `run` loop while that loop is otherwise occupied. But I'm pretty sure we already have that same latency today, because only one token/lease can be renewed at a time, since `renew` takes the mutex.
Cool! Implemented this in #18998.
Since that's a fair amount of change, I think it may be better to avoid backporting it. I've updated this PR to only include the extra error check, which prevents the multiple renewals and is much safer to backport.
Force-pushed from 902093a to 0a6593f:
A series of errors may happen when a token is invalidated while the Vault client is waiting to renew it. The token may have been invalidated for several reasons, such as the alloc having finished running and now being terminal, or the token having been changed directly in Vault out-of-band.

Most of the errors are caused by retries that will never succeed until Vault fully removes the token from its state.

This commit prevents the retries by making the error `invalid lease ID` a fatal error. In earlier versions of Vault, this case was covered by the error `lease not found or lease is not renewable`, which is already considered to be a fatal error by Nomad:
https://github.com/hashicorp/vault/blob/2d0cde4ccc0323591d9414342cb15f5cb70271d7/vault/expiration.go#L636-L639

But hashicorp/vault#5346 introduced an earlier `nil` check that generates a different error message:
https://github.com/hashicorp/vault/blob/750ab337eaa0b049d9cf1535c00e860129e5e9a0/vault/expiration.go#L1362-L1364

Both errors happen for the same reason (`le == nil`) and so should be considered fatal on renewal.

Closes #12465
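As a rough sketch of the idea in this commit, not the exact Nomad implementation, a helper like the hypothetical `isFatalRenewalError` below could classify both Vault error messages as fatal so the renewal loop stops retrying:

```go
package vaultclient

import "strings"

// fatalRenewalErrors lists renewal errors that will never succeed on retry;
// the helper name and structure here are assumptions for illustration.
var fatalRenewalErrors = []string{
	// Older Vault versions return this when the lease entry is nil.
	"lease not found or lease is not renewable",
	// Newer Vault versions hit an earlier nil check and return this instead.
	"invalid lease ID",
}

// isFatalRenewalError reports whether a renewal error is fatal, meaning the
// client should stop retrying the renewal for this token.
func isFatalRenewalError(err error) bool {
	if err == nil {
		return false
	}
	for _, msg := range fatalRenewalErrors {
		if strings.Contains(err.Error(), msg) {
			return true
		}
	}
	return false
}
```

A caller would then check `isFatalRenewalError(err)` after a failed renewal and stop scheduling further attempts for that token instead of backing off and retrying.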