Don't treat a failed recover + successful destroy as a successful recover #10849
Conversation
This code just seems incorrect. As it stands today it reports a successful restore if RecoverTask fails and then DestroyTask succeeds. This can result in a really annoying bug where it then calls RecoverTask again, whereby it will probably get ErrTaskNotFound and call DestroyTask once more. I think the only reason this has not been noticed so far is because most drivers like Docker will return success, then Nomad will call RecoverTask, get an error (not found), call DestroyTask again, and get an ErrTaskNotFound error.
There is a comment in this file that makes me hesitate and think this is by design, but I do not understand how this can possibly work. It's just going to keep calling RecoverTask, which will fail, then DestroyTask, until it gets an error from DestroyTask. Why would you try to recover a task that you have successfully destroyed?
@benbuzbee I'm not sure what you mean by "It's just going to keep calling RecoverTask". Can I ask what behavior you're seeing that you're trying to fix here?
Here's a simple example. Imagine an external plugin that always successfully destroys a task but cannot recover one:

    func (d *Driver) RecoverTask(handle *drivers.TaskHandle) (err error) {
        return errors.New("Test - Recover Task Fails")
    }

    func (d *Driver) DestroyTask(taskID string, force bool) error {
        return nil
    }

This emulates a real-world scenario where recovery cannot happen and destroying succeeds. Run a single job with that plugin, then emulate an external plugin crash/panic.
As a result, Nomad will try to recover/destroy in a loop, seemingly forever.

Two interesting things from the logs:

"driver plugin has shutdown; attempting to recover task"

This isn't true: the driver does not shut down after the first kill. I know this also because our own logs include the process ID, and I can see that each time it calls RecoverTask and DestroyTask the process ID remains the same.

"task successfully recovered on driver"

This makes no sense: recovery did not succeed; it is hard-coded to fail.
I think you might misunderstand the PR: the original code does NOT return false twice, it returns TRUE where (I believe) it should not. In pseudocode, the current code looks like this:

    func RestoreHandle() bool {
        if RecoverTask fails {
            if DestroyTask fails {
                return err != drivers.ErrTaskNotFound
            }
            return true
        }
        return true
    }

What I think is the bug here is the "return true" that happens INSIDE "if RecoverTask fails". I'm guessing that's the root of the log line that didn't make sense to me, "task successfully recovered on driver".
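To make the proposed change concrete, here is a minimal runnable sketch of the post-fix shape of that logic. The names (restoreHandle, errTaskNotFound, the function parameters) are stand-ins for illustration, not the actual Nomad code or diff:

```go
package main

import (
	"errors"
	"fmt"
)

// errTaskNotFound stands in for drivers.ErrTaskNotFound from Nomad's
// plugins/drivers package; it is redefined here only to keep the sketch
// self-contained.
var errTaskNotFound = errors.New("task not found")

// restoreHandle sketches the behavior the PR argues for: a restore only
// counts as successful when RecoverTask itself succeeds. recoverTask and
// destroyTask are stand-ins for the driver's RecoverTask/DestroyTask calls.
func restoreHandle(recoverTask, destroyTask func() error) bool {
	if err := recoverTask(); err != nil {
		// Best-effort cleanup. ErrTaskNotFound just means there is nothing
		// left to destroy, which is fine.
		if derr := destroyTask(); derr != nil && !errors.Is(derr, errTaskNotFound) {
			return false
		}
		// The key change: a failed RecoverTask is never reported as a
		// successful restore, even when DestroyTask succeeds.
		return false
	}
	return true
}

func main() {
	recoverFails := func() error { return errors.New("recover failed: test") }
	destroySucceeds := func() error { return nil }
	// Before this PR the equivalent logic reported true here; after it,
	// the failed recover is reported as such.
	fmt.Println(restoreHandle(recoverFails, destroySucceeds)) // false
}
```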
Oops, no I got that... just a typo from rewriting my reply too many times 😀 Thank you for that detailed example! I did a little exploring of the original code that was introduced in 740ca8e, and what's interesting there is that we have the comment "Ignore ErrTaskNotFound errors" but no means to actually do so, because the function had no return value at that time. The return value was introduced a few weeks later. @schmichael I want to ping you on this too just because you were involved in the original, but this PR LGTM.
The change LGTM as well, but I'll defer to @schmichael in case of missing context.
"driver plugin has shutdown; attempting to recover task"

"This isn't true, the driver does not shut down after the first kill."

This is troubling and we should fix it. I wonder if it's related to #10494. What version are you running?
As a meta comment for the ongoing discussion, the success boolean in restoreHandler is either ill-defined or at least doesn't capture the information well. There are multiple outcomes: "successfully reattached", "failed to reattach because the process died and no longer exists", "failed to reattach but succeeded in destroying the process", "failed to reattach, and the process got leaked and is no longer managed". These failures should be handled and messaged to users differently, and a boolean indicator isn't sufficient.
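Purely as an illustration of that point (a sketch, not a proposal for Nomad's actual API), the outcomes listed above map more naturally onto an enumerated result than onto a bool:

```go
package main

import "fmt"

// restoreOutcome is an illustrative enumeration of the restore results
// described above, which restoreHandler currently collapses into a bool.
type restoreOutcome int

const (
	reattached  restoreOutcome = iota // successfully reattached
	processGone                       // failed to reattach; process no longer exists
	destroyed                         // failed to reattach; DestroyTask cleaned up the task
	leaked                            // failed to reattach; process leaked and is unmanaged
)

func (o restoreOutcome) String() string {
	switch o {
	case reattached:
		return "successfully reattached"
	case processGone:
		return "failed to reattach: process no longer exists"
	case destroyed:
		return "failed to reattach: task was destroyed"
	case leaked:
		return "failed to reattach: process leaked and is no longer managed"
	default:
		return "unknown restore outcome"
	}
}

func main() {
	// Each outcome can be logged and handled differently, unlike true/false.
	for _, o := range []restoreOutcome{reattached, processGone, destroyed, leaked} {
		fmt.Println(o)
	}
}
```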
This is a similarly weird line of code. I'm unsure if it's never hit and therefore doesn't matter, or if it's actually doing the right thing, or if it's doing the wrong thing but in a way that is rare or hard to notice.

Regardless, @notnoop is absolutely right that (success bool) is at best a misleadingly named return value. It seems more likely there are a few states (restored, nothing to restore, something to restore but it's invalid) represented here, collapsed into 2 (true|false).
On agent startup this change will cause waitOnServers=true if RecoverTask fails for a non-terminal allocation, which I think is fine. I'm not sure it's necessary, but it feels like the conservative approach, which I'm all for given the complexity of this logic.

Blah blah blah, all that to say I'm +1 on this PR. Thanks for your patience and determination @benbuzbee! If we find further complications around this logic, it will be because this func's results were nonsensical and just happened to be working. I'd rather keep moving the code forward than reject the PR and force plugin authors to live with this incoherent behavior.

I pushed a changelog entry and will merge on Monday if there are no complaints. I tried to keep it succinct, which unfortunately makes it awfully vague, but the audience (task driver plugin authors) is relatively small and probably feels comfortable hopping over to this PR and joining in our shared 🤔
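As a footnote on the waitOnServers point above, here is a hypothetical, self-contained sketch of the kind of startup check being described; the type and function names are invented for illustration and are not Nomad's client code:

```go
package main

import "fmt"

// allocRestoreState is an illustrative stand-in for what the client knows
// about each allocation at startup; it is not a Nomad type.
type allocRestoreState struct {
	restored bool // did RecoverTask succeed for the allocation's tasks?
	terminal bool // has the allocation already reached a terminal status?
}

// shouldWaitOnServers mirrors the consequence described above: wait for the
// servers if any non-terminal allocation failed to restore.
func shouldWaitOnServers(allocs []allocRestoreState) bool {
	for _, a := range allocs {
		if !a.restored && !a.terminal {
			return true
		}
	}
	return false
}

func main() {
	allocs := []allocRestoreState{{restored: false, terminal: false}}
	fmt.Println("waitOnServers =", shouldWaitOnServers(allocs)) // true
}
```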
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.