task runner to avoid running task if terminal #5890

Merged
notnoop merged 2 commits into master from b-dont-start-completed-allocs-2
Jul 2, 2019

Conversation

notnoop
Contributor

@notnoop notnoop commented Jun 26, 2019

This change fixes a bug where Nomad would re-run an alloc's tasks on
client restart if the alloc is client terminal but the server copy on
the client isn't marked as terminal.

Here, we fix the case by having the task runner use
allocRunner.shouldRun() instead of only checking the server-updated
alloc.

We preserve most of the invariants, such that tr.Run() is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes #5883
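
A minimal sketch of the idea described above, not Nomad's actual code: the decision to run consults both the server's copy of the alloc and the state the client recorded locally. `Status`, `allocView`, and this `shouldRun` method are hypothetical stand-ins introduced only for illustration.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for an allocation status.
type Status string

const (
	StatusPending  Status = "pending"
	StatusRunning  Status = "running"
	StatusComplete Status = "complete"
	StatusFailed   Status = "failed"
)

// terminal reports whether a status means the alloc has finished.
func terminal(s Status) bool {
	return s == StatusComplete || s == StatusFailed
}

// allocView holds the two views of an alloc that exist on the client:
// the copy last received from the server and the locally recorded state.
type allocView struct {
	serverStatus Status
	clientStatus Status
}

// shouldRun returns false if either view is terminal. Checking only
// serverStatus would restart an alloc that finished locally before the
// server copy on the client was updated.
func (a allocView) shouldRun() bool {
	return !terminal(a.serverStatus) && !terminal(a.clientStatus)
}

func main() {
	// A restored alloc that completed locally while the server copy
	// still says "running".
	restored := allocView{serverStatus: StatusRunning, clientStatus: StatusComplete}
	fmt.Println(restored.shouldRun())
}
```

In this sketch a server-only check would report the restored alloc as runnable, while the combined check does not.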

@notnoop notnoop force-pushed the b-dont-start-completed-allocs-2 branch from 8603028 to f42aa1e Compare June 26, 2019 15:51
dead := tr.state.State == structs.TaskStateDead
tr.stateLock.RUnlock()

if dead {
Contributor Author

Here I only check whether the task itself is dead. I suspect we should be checking whether the restored alloc had a terminal alloc state. I suspect that an alloc whose tasks have mixed statuses causes some complications?

Contributor Author

Actually, this is the right behavior. An alloc is still considered running if one task completes, and all tasks will be killed if the leader task dies or a task has failed enough times. Until that happens, we should treat the other tasks as running.

@notnoop notnoop force-pushed the b-dont-start-completed-allocs-2 branch from f42aa1e to f3c944a Compare June 27, 2019 03:27
Member

@schmichael schmichael left a comment

Great catch!


// TestAllocRunner_Restore_Completed asserts that restoring a completed
// batch alloc doesn't run it again
func TestAllocRunner_Restore_CompletedBatch(t *testing.T) {
Member

Name/comment mismatch

Member

Test looks good, but just to verify:

  • Does it fail without your fixes?
  • Does it pass with -race?

Contributor Author

@notnoop notnoop Jul 2, 2019

Yes, it passes with -race and was failing before. Here is a sample build failure [1] from adding the test alone [2]. The failure snippet is:

goroutine 87 [chan receive, 14 minutes]:
github.com/hashicorp/nomad/client/allocrunner.destroy(0xc000342780)
	/home/travis/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner_test.go:27 +0x54
runtime.Goexit()
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/runtime/panic.go:406 +0xed
testing.(*common).FailNow(0xc000449b00)
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/testing/testing.go:609 +0x39
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/require.Fail(0x18348e0, 0xc000449b00, 0x15fc0e0, 0x1a, 0x0, 0x0, 0x0)
	/home/travis/gopath/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/require/require.go:285 +0xf0
github.com/hashicorp/nomad/client/allocrunner.TestAllocRunner_Restore_CompletedBatch(0xc000449b00)
	/home/travis/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner_unix_test.go:204 +0xb22
testing.tRunner(0xc000449b00, 0x1639ae0)
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/testing/testing.go:865 +0xc0
created by testing.(*T).Run
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/testing/testing.go:916 +0x35a

As seen in the stack trace, we fail at line 204 [3] because AR.wait() times out, and then we time out again in the deferred destroy call.

I'll follow up in another PR to change the deferred destroy call so that it errors rather than blocking indefinitely on failure, which should make these errors easier to track.

[1] https://travis-ci.org/hashicorp/nomad/jobs/553113545
[2] https://github.com/hashicorp/nomad/compare/b-dont-start-completed-allocs-2-test-only?expand=1
[3] https://github.com/hashicorp/nomad/compare/b-dont-start-completed-allocs-2-test-only?expand=1#diff-41decefd2f35059b5c0b95166e275653R204

if err := tr.stop(); err != nil {
tr.logger.Error("stop failed on terminal task", "error", err)
}
return
Member

I think we may also want to call tr.TaskStateUpdated() since task states are persisted before the AR is notified. Therefore I think the following could happen:

  1. 2 tasks in an alloc start: a leader service, and a sidecar
  2. Leader task exits, persists TaskStateDead
  3. agent crashes before TaskStateUpdated is called
  4. agent restarts, returns here due to TaskStateDead

At this point I do not think anything will have told the sidecar service to exit despite its leader dying. If you call TaskStateUpdated here, then all of the leader died detection logic in AR will be run: https://github.com/hashicorp/nomad/blob/master/client/allocrunner/alloc_runner.go#L415-L438

This could be done in a followup PR as well since I think your changes improve the situation.
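
The crash scenario above can be modeled with a small sketch. All types and names here (`task`, `allocRunner`, `taskStateUpdated`, `restoreTask`) are hypothetical simplifications, not Nomad's real structures; the point is only that notifying the alloc runner on the early-return path lets the leader-died logic run after a restart.

```go
package main

import "fmt"

// task is a hypothetical, simplified view of a task's persisted state.
type task struct {
	name   string
	leader bool
	dead   bool
}

// allocRunner is a hypothetical coordinator that owns all tasks of an alloc.
type allocRunner struct {
	tasks  []*task
	killed []string // names of tasks killed because the leader died
}

// taskStateUpdated re-evaluates the alloc: if the leader task is dead,
// every task that is still alive is killed.
func (ar *allocRunner) taskStateUpdated() {
	leaderDead := false
	for _, t := range ar.tasks {
		if t.leader && t.dead {
			leaderDead = true
		}
	}
	if !leaderDead {
		return
	}
	for _, t := range ar.tasks {
		if !t.dead {
			t.dead = true
			ar.killed = append(ar.killed, t.name)
		}
	}
}

// restoreTask models a task runner starting after an agent restart.
// A task persisted as dead is not run again, but the alloc runner is
// still notified so that leader-died detection gets a chance to run.
func restoreTask(ar *allocRunner, t *task) (ran bool) {
	if t.dead {
		ar.taskStateUpdated()
		return false
	}
	return true
}

func main() {
	leader := &task{name: "leader", leader: true, dead: true} // persisted before the crash
	sidecar := &task{name: "sidecar"}
	ar := &allocRunner{tasks: []*task{leader, sidecar}}

	fmt.Println(restoreTask(ar, leader), ar.killed)
}
```

Without the notification in `restoreTask`, nothing in this model would ever mark the sidecar as dead.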

notnoop pushed a commit that referenced this pull request Jul 2, 2019
@notnoop notnoop merged commit bd7d60e into master Jul 2, 2019
@notnoop notnoop deleted the b-dont-start-completed-allocs-2 branch July 2, 2019 07:31
notnoop pushed a commit that referenced this pull request Aug 25, 2019
This fixes a bug where allocs that have been GCed get re-run after the client
is restarted.  A heavily-used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in the
client's alloc runner set.  Such runners are periodically persisted until the
alloc is GCed by the server.  During that time, the client DB will contain the
alloc but neither its individual task statuses nor its completed state.  On
client restart, the client assumes the alloc is in pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

This is a short-term fix, as we should consider revamping client state
management.  Storing alloc and task information non-transactionally and
non-atomically, while the alloc runner is concurrently running and potentially
changing state, is a recipe for bugs.

Fixes #5984
Related to #5890
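
As a rough illustration of the fix described in this commit message, the sketch below guards persistence with a destroyed flag. `stateDB` and `runner` are hypothetical in-memory stand-ins and do not reflect Nomad's actual state store or alloc runner API.

```go
package main

import (
	"fmt"
	"sync"
)

// stateDB is a hypothetical in-memory stand-in for the client state store.
type stateDB struct {
	mu     sync.Mutex
	allocs map[string]string // alloc ID -> last persisted status
}

func (db *stateDB) put(id, status string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.allocs[id] = status
}

func (db *stateDB) remove(id string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	delete(db.allocs, id)
}

func (db *stateDB) has(id string) bool {
	db.mu.Lock()
	defer db.mu.Unlock()
	_, ok := db.allocs[id]
	return ok
}

// runner is a hypothetical alloc runner that persists its alloc.
type runner struct {
	id        string
	db        *stateDB
	mu        sync.Mutex
	destroyed bool
}

// destroy marks the runner destroyed and removes its alloc from the DB.
func (r *runner) destroy() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.destroyed = true
	r.db.remove(r.id)
}

// persist writes the alloc unless the runner was destroyed. It holds the
// same lock as destroy, so a late periodic persist cannot re-add the alloc.
func (r *runner) persist(status string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.destroyed {
		return false
	}
	r.db.put(r.id, status)
	return true
}

func main() {
	db := &stateDB{allocs: map[string]string{}}
	r := &runner{id: "alloc-1", db: db}
	r.persist("complete")
	r.destroy()
	r.persist("pending") // a late periodic persist after GC
	fmt.Println(db.has("alloc-1"))
}
```

In this model the late write is dropped, so a restart would not find a bare alloc record to treat as pending.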

github-actions bot commented Feb 7, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 7, 2023
Development

Successfully merging this pull request may close these issues.

Restarting nomad agent restarts successfully completed allocs
2 participants