Fail deployments if promotion returns error after deadline #6740

tgross · 2019-11-20T14:12:16Z

Follow-up improvement coming out of #6407 (comment)

For each allocation that's part of a deployment, we start up a health hook on the client. Once that hook updates the deployment status of the alloc to healthy (or the deadline passes without doing so), the hook exits. If the allocation fails after it initially reports healthy, the hook is no longer running to report the failure, so the deployment still thinks all its placements are healthy. When we then promote the deployment, the server's state machine rejects the change because there aren't enough canary allocations that are actually healthy.
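To make the mismatch concrete, here is a simplified model of the scenario described above. It is only a sketch, not Nomad's code: `Alloc`, `health_hook`, `deployment_view`, and `promote` are hypothetical names, and the real client hook and server state machine are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Alloc:
    """Hypothetical stand-in for a canary allocation."""
    name: str
    reported_healthy: bool = False  # what the health hook last told the deployment
    running: bool = True            # what is actually true on the client right now


def health_hook(alloc: Alloc) -> None:
    """Report health once, then exit; later failures are never reported."""
    if alloc.running:
        alloc.reported_healthy = True


def deployment_view(allocs) -> int:
    """Healthy count as the deployment sees it (hook reports only)."""
    return sum(a.reported_healthy for a in allocs)


def promote(allocs, desired_canaries: int) -> str:
    """Assumed server-side check: reject unless enough canaries are really healthy."""
    healthy = sum(a.reported_healthy and a.running for a in allocs)
    if healthy < desired_canaries:
        raise ValueError(f"{healthy}/{desired_canaries} canaries healthy")
    return "promoted"


canaries = [Alloc("canary-1"), Alloc("canary-2")]
for c in canaries:
    health_hook(c)            # both report healthy, hooks exit
canaries[1].running = False   # one canary fails after its hook has exited

try:
    promote(canaries, desired_canaries=2)
except ValueError as err:
    print("deployment sees", deployment_view(canaries), "healthy; promotion rejected:", err)
```

In this model the deployment's view and the promotion check disagree, which is the state an operator ends up stuck in.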

This is a pathological case which we're only likely to hit when promotions are made manually or when tasks are slow to start and flappy after start, and getting the behavior to be predictable and understandable to operators in that condition is difficult. There may be some future improvements we can make to the deployments (especially with L7 health checks via Connect on the horizon).

In the meantime, we're going to make a change so that deployments are marked as failed if the promotion hits an error past the promotion deadline.
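The proposed rule can be sketched as a small decision function. This is an illustration under assumptions, not the actual change: the function name, the status strings, and the way the deadline is passed in are all hypothetical, and the issue does not say how the deadline is computed.

```python
from datetime import datetime, timedelta


def status_after_promotion_error(now: datetime, deadline: datetime,
                                 current_status: str = "running") -> str:
    """Return the deployment status after a promotion attempt returned an error.

    Before the deadline the error is left for the operator to retry;
    past the deadline the deployment is marked as failed.
    """
    if now > deadline:
        return "failed"
    return current_status


deadline = datetime(2019, 11, 20, 14, 0, 0)  # arbitrary example value
print(status_after_promotion_error(deadline - timedelta(minutes=1), deadline))
print(status_after_promotion_error(deadline + timedelta(minutes=1), deadline))
```

The point of the rule is that a promotion error no longer leaves the deployment in limbo indefinitely: once the deadline has passed, it resolves to a terminal state operators can act on.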

tgross added type/enhancement theme/deployments labels Nov 20, 2019

tgross added this to the near-term milestone Nov 20, 2019

tgross mentioned this issue Nov 20, 2019

Canary promotion reports wrong healthy count, preventing promotion #6407

Closed

tgross modified the milestones: near-term, unscheduled Jan 9, 2020

tgross removed this from the unscheduled milestone Feb 12, 2021
