Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart unhealthy tasks #3105

Merged
merged 33 commits into from
Sep 18, 2017
Merged

Restart unhealthy tasks #3105

merged 33 commits into from
Sep 18, 2017

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Aug 26, 2017

Fixes #876

Unhealthy checks can now restart tasks. See docs in PR for intended behavior.

  • Code
  • Docs
  • Changelog
  • Tests

@schmichael schmichael requested a review from dadgar August 26, 2017 05:51
@samart
Copy link

samart commented Sep 4, 2017

I would vote to make grace period just grace period, not including interval. more straightforward. How long you want to wait before checking is very different from what interval you want once you are ready.

@@ -1674,6 +1693,25 @@ func (r *TaskRunner) Restart(source, reason string) {
}
}

// RestartBy deadline. Restarts a task iff the last time it was started was
// before the deadline. Returns true if restart occurs; false if skipped.
func (r *TaskRunner) RestartBy(deadline time.Time, source, reason string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't return anything?

@alxark
Copy link

alxark commented Sep 5, 2017

should we wait for this PR to be merged before next release? it's really very important feature and i'm sure it's blocking some users to start using docker.

@schmichael
Copy link
Member Author

schmichael commented Sep 6, 2017

should we wait for this PR to be merged before next release? it's really very important feature and i'm sure it's blocking some users to start using docker.
-- @alxark

We decided kind of the opposite. :) Because this is such an important feature we didn't want to rush it into the 0.6.1 point release. The delay since was caused by me being on vacation, but rest assured this feature is my highest priority.

@schmichael schmichael force-pushed the f-876-restart-unhealthy branch 2 times, most recently from 7d90933 to 7c85c05 Compare September 10, 2017 17:05
schmichael added a commit that referenced this pull request Sep 11, 2017
@schmichael schmichael force-pushed the f-876-restart-unhealthy branch from 43f5f1a to d43c3da Compare September 12, 2017 17:35
schmichael added a commit that referenced this pull request Sep 12, 2017
@@ -439,7 +439,8 @@ func (tm *TaskTemplateManager) handleTemplateRerenders(allRenderedTime time.Time
}

if restart {
tm.config.Hooks.Restart(consulTemplateSourceName, "template with change_mode restart re-rendered")
const failure = false
tm.config.Hooks.Restart(consulTemplateSourceName, "template with change_mode restart re-rendered", failure)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My little const pattern is pretty funky, and I'd be happy to remove it.

I do it because I hate seeing method calls with literal booleans in them and having no idea what those booleans do without looking at the method signature and/or docs.

@@ -1251,8 +1269,7 @@ func TestTaskRunner_Template_NewVaultToken(t *testing.T) {
})

// Error the token renewal
vc := ctx.tr.vaultClient.(*vaultclient.MockVaultClient)
renewalCh, ok := vc.RenewTokens[token]
renewalCh, ok := ctx.vault.RenewTokens[token]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the unrelated test changes. Just made the mocks easier to access.

@@ -25,3 +27,119 @@ func (m *MockCatalog) Service(service, tag string, q *api.QueryOptions) ([]*api.
m.logger.Printf("[DEBUG] mock_consul: Service(%q, %q, %#v) -> (nil, nil, nil)", service, tag, q)
return nil, nil, nil
}

// MockAgent is a fake in-memory Consul backend for ServiceClient.
type MockAgent struct {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not new; moved from unit_test.go and exported for use in client/

// Must test >= because if limit=1, restartAt == first failure
if now.Equal(restartAt) || now.After(restartAt) {
// hasn't become healthy by deadline, restart!
c.logger.Printf("[DEBUG] consul.health: restarting alloc %q task %q due to unhealthy check %q", c.allocID, c.taskName, c.checkName)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INFO or WARN level? Left at DEBUG since Task Events are probably a more accessible source for this information.

execs: make(chan int, 100),
}
}

// fakeConsul is a fake in-memory Consul backend for ServiceClient.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exported as MockAgent

@schmichael schmichael changed the title WIP - Restart unhealthy tasks Restart unhealthy tasks Sep 13, 2017

- `limit` `(int: 0)` - Restart task after `limit` failing health checks. For
example 1 causes a restart on the first failure. The default, `0`, disables
healtcheck based restarts. Failures must be consecutive. A single passing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

healtcheck -> health check


In this example the `mysqld` task has `90s` from startup to begin passing
healthchecks. After the grace period if `mysqld` would remain unhealthy for
`60s` (as determined by `limit * interval`) it would be restarted after `8s`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restarted after 8s isn't really accurate. It would be killed and then wait 8s till starting again. Where is the .25 coming from?

@@ -162,6 +168,72 @@ scripts.
- `tls_skip_verify` `(bool: false)` - Skip verifying TLS certificates for HTTPS
checks. Requires Consul >= 0.7.2.

#### `check_restart` Stanza
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably worth pulling out into its own page and side bar?

@@ -2916,6 +2988,9 @@ type Service struct {

Tags []string // List of tags for the service
Checks []*ServiceCheck // List of checks associated with the service

// CheckRestart will be propagated to Checks if set.
CheckRestart *CheckRestart
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only exist on checks


// If CheckRestart is set propagate it to checks
if s.CheckRestart != nil {
for _, check := range s.Checks {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should be done at the api layer. Internal structs should represent how it is actually used (check_restart only on checks). See how update stanza is handled.

@@ -196,6 +209,25 @@ func (r *RestartTracker) handleWaitResult() (string, time.Duration) {
return structs.TaskRestarting, r.jitter()
}

// handleFailure returns the new state and potential wait duration for
// restarting the task due to a failure like an unhealthy Consul check.
func (r *RestartTracker) handleFailure() (string, time.Duration) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you refactor the three handle methods to use one common base?

} else {
// Since the restart isn't from a failure, restart immediately
// and don't count against the restart policy
r.restartTracker.SetRestartTriggered()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be cleaner for SetRestartTriggered to just take the failure as a parameter?

}

// Update all watched checks as CheckRestart fields aren't part of ID
if check.Watched() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get a test ensuring this

func TestConsul_ChangeTags(t *testing.T) {
ctx := setupFake()

allocID := "allocid"
if err := ctx.ServiceClient.RegisterTask(allocID, ctx.Task, nil, nil); err != nil {
if err := ctx.ServiceClient.RegisterTask(allocID, ctx.Task, ctx.Restarter, nil, nil); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are any of the tests ensuring that the watch gets created?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to TestConsul_ChangeChecks and TestConsul_RegServices

t.Errorf("expected check 2 to not be restarted but found %d", n)
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test checking that Watch() with updated check works

remove: true,
}
select {
case w.watchCh <- &c:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create an unwatchCh and just pass the cid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would create a race between adding and removing watches since select cases are randomly selected.

Instead I'll make watchCh -> watchUpdateCh and create a small checkRestartUpdate struct to handle the add vs remove case.

for {
// Don't start watching until we actually have checks that
// trigger restarts.
for len(checks) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets just create a single select block. We can start the timer when we get a watch: https://github.com/hashicorp/nomad/blob/master/client/alloc_runner_health_watcher.go#L447

checks := map[string]*checkRestart{}

// timer for check polling
checkTimer := time.NewTimer(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A select on this will fire immediately. You need to stop it first and select from the channel

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed below


// Begin polling
if !checkTimer.Stop() {
<-checkTimer.C
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this with a select with a default

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is straight from the docs: https://golang.org/pkg/time/#Timer.Reset

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...but I always forget this caveat on Stop:

assuming the program has not received from t.C already

Oof this API gets me every time.

c.logger.Printf("[DEBUG] consul.health: alloc %q task %q check %q became unhealthy. Restarting in %s if not healthy",
c.allocID, c.taskName, c.checkName, c.timeLimit)
}
c.unhealthyStart = now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename unhealthyStart to unhealthyCheck?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At no other point do I refer to a single run of a check as a check in this file, so I'm not sure I understand the renaming. We don't even have visibility into individual runs of a check since we poll statuses at a fixed interval.

restartDelay time.Duration
grace time.Duration
interval time.Duration
timeLimit time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

document non-obvious fields: interval, timeLimit

@schmichael schmichael force-pushed the f-876-restart-unhealthy branch from 912e5db to ab0cae0 Compare September 13, 2017 23:53
@schmichael schmichael force-pushed the f-876-restart-unhealthy branch from 764fdaa to 0052ec9 Compare September 13, 2017 23:57
Reusing checkRestart for both adds/removes and the main check restarting
logic was confusing.
@dadgar made the excellent observation in #3105 that TaskRunner removes
and re-registers checks on restarts. This means checkWatcher doesn't
need to do *any* internal restart tracking. Individual checks can just
remove themselves and be re-added when the task restarts.
Before this commit if a task had 2 checks cause restarts at the same
time, both would trigger restarts of the task! This change removes all
checks for a task whenever one of them is restarted.
Watched was a silly name
All 3 error/failure cases share restart logic, but 2 of them have
special cased conditions.
@schmichael schmichael force-pushed the f-876-restart-unhealthy branch from 9c8de7b to 1564e1c Compare September 14, 2017 23:50
api/tasks.go Outdated

s.CheckRestart.Canonicalize()

for _, c := range s.Checks {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Place a comment

return r
}

// SetRestartTriggered is used to mark that the task has been signalled to be
// restarted
func (r *RestartTracker) SetRestartTriggered() *RestartTracker {
func (r *RestartTracker) SetRestartTriggered(failure bool) *RestartTracker {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on the param

@@ -99,7 +99,7 @@ func TestClient_RestartTracker_RestartTriggered(t *testing.T) {
p := testPolicy(true, structs.RestartPolicyModeFail)
p.Attempts = 0
rt := newRestartTracker(p, structs.JobTypeService)
if state, when := rt.SetRestartTriggered().GetState(); state != structs.TaskRestarting && when != 0 {
if state, when := rt.SetRestartTriggered(false).GetState(); state != structs.TaskRestarting && when != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test the new case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added cde908e

r.reason = ReasonDelay
return structs.TaskRestarting, r.getDelay()
// Handle restarts due to failures
if r.failure {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invert it:

if !r.failure {
  return "", 0
}
...

} else {
r.reason = ReasonDelay
return structs.TaskRestarting, r.getDelay()
if r.count > r.policy.Attempts {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment how you get to this case.

check_restart {
limit = 3
grace = "90s"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove space


- `limit` `(int: 0)` - Restart task when a health check has failed `limit`
times. For example 1 causes a restart on the first failure. The default,
`0`, disables healtcheck based restarts. Failures must be consecutive. A
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

healtcheck. Can you run a spell check on the doc sections

```

The [`restart` stanza][restart_stanza] controls the restart behavior of the
task. In this case it will wait 10 seconds before restarting. Note that even if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it will stop the task and then wait 10 seconds before starting it again..
Delete subsequent sentence

}
}

// Unwatch a task.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unwatch a check

}
}

// Watch a task and restart it if unhealthy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Watch a check and restart it's task if it becomes unhealthy

@@ -30,6 +30,8 @@ unhealthy for the `limit` specified in a `check_restart` stanza, it is
restarted according to the task group's [`restart` policy][restart_stanza]. The
`check_restart` settings apply to [`check`s][check_stanza], but may also be
placed on [`service`s][service_stanza] to apply to all checks on a service.
`check_restart` settings on `service` will only overwrite unset `check_restart`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If check_restart is set on both the check and service, the stanza's are merged with the check values taking precedence.

@schmichael schmichael merged commit c1cf162 into master Sep 18, 2017
@schmichael schmichael deleted the f-876-restart-unhealthy branch September 18, 2017 02:38
@jippi
Copy link
Contributor

jippi commented Sep 18, 2017

Amazing @schmichael !

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants