vault: avoid continual renewal of invalid token #18985
Conversation
LGTM! Nice work on this @lgfa29!
client/vaultclient/vaultclient.go (Outdated)

// Add renewal request to the heap to start tracking the token.
c.lock.Lock()
c.heap.Push(renewalReq, time.Now())
c.lock.Unlock()
Should we remove the original `heap.Push`, or at least change this comment?

// If the identifier is not already tracked, this is a first
// renewal request. In this case, add an entry into the heap
// with the next renewal time.

Actually, is that whole branch now unreachable, because the `if c.isTracked(req.id)` conditional will always be true?
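To make the branch in question concrete, here is a minimal, hypothetical sketch of the renewal path with simplified stand-in types; it is not the real contents of client/vaultclient/vaultclient.go, only an illustration of where the "not tracked" branch sits:

```go
package vaultclient

import (
	"sync"
	"time"
)

// renewalRequest and renewalHeap are simplified stand-ins for the real
// types in client/vaultclient/vaultclient.go; names are assumptions.
type renewalRequest struct{ id string }

type renewalHeap struct{ entries map[string]time.Time }

func (h *renewalHeap) push(id string, next time.Time)   { h.entries[id] = next }
func (h *renewalHeap) update(id string, next time.Time) { h.entries[id] = next }

type vaultClient struct {
	lock sync.Mutex
	heap *renewalHeap
}

func (c *vaultClient) isTracked(id string) bool {
	_, ok := c.heap.entries[id]
	return ok
}

// renew illustrates the branch being discussed: once RenewToken also pushes
// the request onto the heap before calling renew, the "not tracked" branch
// below can no longer be reached for token renewals.
func (c *vaultClient) renew(req *renewalRequest, nextRenewal time.Time) {
	c.lock.Lock()
	defer c.lock.Unlock()

	if c.isTracked(req.id) {
		// Already tracked: only update the next renewal time.
		c.heap.update(req.id, nextRenewal)
		return
	}

	// If the identifier is not already tracked, this is a first
	// renewal request. In this case, add an entry into the heap
	// with the next renewal time.
	c.heap.push(req.id, nextRenewal)
}
```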
Hmm... good point. Looking at the code again, I think this change can actually add a new race condition where the token is renewed twice!

When `RenewToken()` pushes the token to the heap, the `run()` loop may pick it up for renewal right away, but then `RenewToken()` calls `c.renew()` for an immediate renewal, resulting in a double renew. I don't think that's a problem, but it's certainly unnecessary.

I think we could make it so only the `run()` loop actually renews tokens and `RenewToken()` blocks until the renewal request completes, but that could add some latency to task start.

Another option is to revert the new `!c.isTracked` check and just adjust the error so it's considered fatal, and we only get one error at most. The `select` race condition that it tries to address seems rather unlikely to happen, so a single sporadic error may not be too bad?

Another option would be to have different behaviours if `c.renew()` is called from `RenewToken()` or `run()`, like a new `renewUntracked()` function that doesn't check the heap 🤔
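As a rough sketch of the first option, assuming hypothetical `errCh`, `pending`, and `updateCh` fields that are not in the real client: only the `run()` loop performs renewals, and `RenewToken()` blocks on a result channel until the first renewal completes.

```go
package vaultclient

import (
	"sync"
	"time"
)

// renewalRequest carries a result channel so RenewToken can wait for the
// run() loop; these field names are assumptions for illustration only.
type renewalRequest struct {
	id    string
	errCh chan error
}

type vaultClient struct {
	lock     sync.Mutex
	pending  map[string]*renewalRequest // stand-in for the renewal heap
	updateCh chan struct{}              // nudges the run() loop
}

// RenewToken only enqueues the request and blocks until run() has renewed
// the token once, so the renewal logic lives in a single place.
func (c *vaultClient) RenewToken(id string) error {
	req := &renewalRequest{id: id, errCh: make(chan error, 1)}

	c.lock.Lock()
	c.pending[id] = req
	c.lock.Unlock()

	// Wake the run() loop so it notices the new entry right away.
	select {
	case c.updateCh <- struct{}{}:
	default:
	}

	// Blocking here is where the extra task-start latency could come from
	// if run() is busy renewing another token or lease.
	return <-req.errCh
}

// run is the only place that actually calls Vault to renew.
func (c *vaultClient) run(stopCh <-chan struct{}) {
	for {
		select {
		case <-stopCh:
			return
		case <-c.updateCh:
		case <-time.After(time.Second):
		}

		c.lock.Lock()
		for id, req := range c.pending {
			req.errCh <- c.renewOnce(id) // report the result back to the caller
			delete(c.pending, id)
		}
		c.lock.Unlock()
	}
}

// renewOnce is a placeholder for the real renewal call to Vault.
func (c *vaultClient) renewOnce(id string) error { return nil }
```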
> I think we could make it so only the `run()` loop actually renews tokens and `RenewToken()` blocks until the renewal request completes, but that could add some latency to task start.

I like this idea, because trying to call `renew` from two different places seems to be the root of the race. The only potential added latency would come from another token/lease being renewed at the same time as the task starts, so that it hits the `run` loop while that loop is otherwise occupied. But I'm pretty sure we already have that same latency today, because only one token/lease can be renewed at a time, since `renew` takes the mutex.
Cool! Implemented this in #18998.
Since that's a fair amount of change, I think it may be better to avoid backporting it. I've updated this PR to only include the extra error check, which prevents the multiple renewals and is much safer to backport.
Force-pushed from 902093a to 0a6593f:
A series of errors may happen when a token is invalidated while the Vault client is waiting to renew it. The token may have been invalidated for several reasons, such as the alloc having finished running and now being terminal, or the token having been changed directly in Vault out-of-band.

Most of the errors are caused by retries that will never succeed until Vault fully removes the token from its state.

This commit prevents the retries by making the error `invalid lease ID` a fatal error. In earlier versions of Vault, this case was covered by the error `lease not found or lease is not renewable`, which is already considered to be a fatal error by Nomad:
https://github.com/hashicorp/vault/blob/2d0cde4ccc0323591d9414342cb15f5cb70271d7/vault/expiration.go#L636-L639

But hashicorp/vault#5346 introduced an earlier `nil` check that generates a different error message:
https://github.com/hashicorp/vault/blob/750ab337eaa0b049d9cf1535c00e860129e5e9a0/vault/expiration.go#L1362-L1364

Both errors happen for the same reason (`le == nil`) and so should be considered fatal on renewal.

Closes #12465
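As a rough sketch of the idea in this commit, not the exact Nomad implementation, a helper like the hypothetical `isFatalRenewalError` below could classify both Vault error messages as fatal so the renewal loop stops retrying:

```go
package vaultclient

import "strings"

// fatalRenewalErrors lists renewal errors that will never succeed on retry;
// the helper name and structure here are assumptions for illustration.
var fatalRenewalErrors = []string{
	// Older Vault versions return this when the lease entry is nil.
	"lease not found or lease is not renewable",
	// Newer Vault versions hit an earlier nil check and return this instead.
	"invalid lease ID",
}

// isFatalRenewalError reports whether a renewal error is fatal, meaning the
// client should stop retrying the renewal for this token.
func isFatalRenewalError(err error) bool {
	if err == nil {
		return false
	}
	for _, msg := range fatalRenewalErrors {
		if strings.Contains(err.Error(), msg) {
			return true
		}
	}
	return false
}
```

A caller would then check `isFatalRenewalError(err)` after a failed renewal and stop scheduling further attempts for that token instead of backing off and retrying.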