Follow-up improvement coming out of #6407 (comment)
For each allocation that's part of a deployment, we start up a health hook on the client. Once that hook updates the deployment status of the alloc to healthy (or the deadline passes without doing so), the hook exits. If the allocation fails after it initially reports healthy, the hook has already exited, so the deployment still thinks all its placements are healthy. When we then promote the deployment, the server's state machine rejects the change because there aren't enough canary allocations that are actually healthy.
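To make the sequence concrete, here is a minimal sketch of the mismatch in Go. It is not Nomad's actual code: `Alloc`, `Deployment`, and `Promote` are hypothetical, simplified stand-ins, and the check in `Promote` only approximates what the server's state machine does.

```go
package main

import "fmt"

// Alloc is a hypothetical, simplified stand-in for a canary allocation.
type Alloc struct {
	ID              string
	ReportedHealthy bool // status recorded by the health hook before it exited
	Running         bool // whether the allocation is actually still running
}

// Deployment is a hypothetical, simplified view of deployment state.
type Deployment struct {
	DesiredCanaries int
	Canaries        []*Alloc
}

// ReportedHealthy counts canaries the deployment believes are healthy.
func (d *Deployment) ReportedHealthy() int {
	n := 0
	for _, a := range d.Canaries {
		if a.ReportedHealthy {
			n++
		}
	}
	return n
}

// ActuallyHealthy counts canaries that reported healthy and are still running.
func (d *Deployment) ActuallyHealthy() int {
	n := 0
	for _, a := range d.Canaries {
		if a.ReportedHealthy && a.Running {
			n++
		}
	}
	return n
}

// Promote approximates the server-side check: the promotion is rejected
// unless enough canaries are actually healthy.
func (d *Deployment) Promote() error {
	if n := d.ActuallyHealthy(); n < d.DesiredCanaries {
		return fmt.Errorf("promotion rejected: %d of %d canaries healthy", n, d.DesiredCanaries)
	}
	return nil
}

func main() {
	d := &Deployment{DesiredCanaries: 2, Canaries: []*Alloc{
		{ID: "a1", ReportedHealthy: true, Running: true},
		{ID: "a2", ReportedHealthy: true, Running: true},
	}}

	// Both health hooks have exited. a2 then fails, and nothing updates
	// its recorded status.
	d.Canaries[1].Running = false

	fmt.Println("reported healthy:", d.ReportedHealthy())
	fmt.Println("promote:", d.Promote())
}
```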
This is a pathological case that we're only likely to hit when promotions are made manually, or when tasks are slow to start and flappy after start. Getting the behavior to be predictable and understandable to operators in that condition is difficult. There may be some future improvements we can make to deployments (especially with L7 health checks via Connect on the horizon).
In the meantime, we're going to make a change so that promotions are marked as failed if the promotion hits an error past the promotion deadline.
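As a rough illustration of the rule described here, the sketch below decides a promotion's outcome from the promotion error and the time elapsed. It is an assumption-laden sketch, not the actual change: `Status`, `resolvePromotion`, and `errRejected` are hypothetical names, and how Nomad tracks the deadline is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Status is a hypothetical outcome for a promotion attempt.
type Status string

const (
	StatusPromoted Status = "promoted"
	StatusPending  Status = "pending" // error before the deadline: not yet failed
	StatusFailed   Status = "failed"  // error past the deadline: marked as failed
)

// errRejected is a hypothetical error standing in for a rejected promotion.
var errRejected = errors.New("not enough healthy canaries")

// resolvePromotion applies the rule from the text: a promotion that hits an
// error is marked as failed only once the promotion deadline has passed.
func resolvePromotion(promoteErr error, elapsed, deadline time.Duration) Status {
	if promoteErr == nil {
		return StatusPromoted
	}
	if elapsed > deadline {
		return StatusFailed
	}
	return StatusPending
}

func main() {
	deadline := 10 * time.Minute

	fmt.Println(resolvePromotion(nil, 15*time.Minute, deadline))
	fmt.Println(resolvePromotion(errRejected, 5*time.Minute, deadline))
	fmt.Println(resolvePromotion(errRejected, 15*time.Minute, deadline))
}
```

Treating an error before the deadline as still pending is a choice made for this sketch; the text only specifies what happens past the deadline.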