Preempted alloc during deploy guarantees failed deployment #8093

Open
djenriquez opened this issue Jun 2, 2020 · 4 comments
djenriquez commented Jun 2, 2020

Nomad version

Output from nomad version
Nomad v0.10.4 (f750636ca68e17dcd2445c1ab9c5a34f9ac69345)

Operating system and Environment details

Amazon Linux 2

Issue

Hello, whenever a job is running a canary/blue-green deploy, if an allocation is preempted before the deployment reaches its healthy count, the entire deployment is doomed to fail at progress_deadline.

This is because the deployment properly counts the canary, but when the canary is preempted it is not replaced. The final count is N-1 running allocations even though the deployment believes all N are placed. The deployment then waits for that preempted allocation to become healthy, which of course never happens since it is no longer running, until the progress_deadline hits; then the red bar comes up and all the progress is rolled back.

This is very difficult to reproduce since I'm not sure what triggers a preemption, but we see it fairly often, and it is extremely problematic when it does happen.
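For context, the update configuration in play is an ordinary canary/blue-green stanza, something along these lines (values are illustrative, not our exact jobspec):

```hcl
job "example" {
  group "app" {
    count = 3

    update {
      canary            = 3      # blue/green: bring up a full replacement set
      max_parallel      = 3
      min_healthy_time  = "30s"
      healthy_deadline  = "5m"
      progress_deadline = "10m"  # the deadline that ultimately fails the deploy
      auto_revert       = true
      auto_promote      = false  # promoted manually after verification
    }
  }
}
```

If one of those 3 canaries is preempted, the deployment still waits for 3 healthy canaries, so it just sits there until progress_deadline fires.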

Here is a deployment where that happened:
[Screenshot: Screen Shot 2020-06-01 at 5 15 41 PM]

Here is the allocation that was redacted:
[Screenshot: Screen Shot 2020-06-01 at 4 51 45 PM]

This may be related to an earlier issue I posted regarding deployment state falling out of sync that @tgross worked on a bit. Is it related? Should the pre-empted allocation have been detected and replaced by the deployment?

Thanks.

djenriquez (Author) commented:

Just wanted to give an update: our organization deploys a few times a day, and we see this issue a few times a week. Any update on a potential fix? It seems serious if canary deploys aren't dependable.

djenriquez (Author) commented:

Wanting to bump this issue. We're running 1.1.5 and can confirm that we still see this; we saw it a handful of times in our org's deployment attempts today.


djenriquez commented Nov 16, 2021

The recent upgrade to 1.1.5 seems to have significantly increased how often a preemption occurs during a deployment, which greatly increased the chance of hitting this bug. We are now experiencing this issue at a noticeable rate. It is incredibly difficult to run blue/green deployments via canaries with this issue.

When we watch deployments and catch it in time, we can "kick" the deployment by adjusting the count or any piece of metadata. However, when deployments are triggered by the autoscaler, likely mixed with migrations from nodes coming and going, all sorts of weird things happen.
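For anyone else hitting this: the "kick" is just any change that produces a new job version. A hypothetical example would be bumping a throwaway meta value and re-registering the job:

```hcl
job "example" {
  # Hypothetical workaround: any jobspec change creates a new job version and a
  # fresh deployment, so bumping a throwaway meta key is enough to "kick" it.
  meta {
    redeploy_nonce = "2021-11-16T10:05:00-08:00"
  }

  group "app" {
    count = 3
    # ... rest of the job unchanged ...
  }
}
```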

[Screenshots: Screen Shot 2021-11-16 at 9 50 44 AM, 9 50 12 AM, 10 01 44 AM, 10 01 40 AM, 10 01 33 AM]

You'll notice the deployment says it has been going on for a month (related: #11267). This is incorrect, as the job was stable just last Friday. You'll also see a progress deadline of the 15th, yet the deployment is still running today.

I managed to correct the job state by kicking off a new deployment by changing the job count, but this issue prevented the job from reaching its desired state, which also disabled autoscaling since the autoscaler does not run against jobs in deployment.

Any ideas on possible mitigations for this problem? Is it possible to disable preemption?
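To be concrete about what I'm asking: if disabling preemption is supported in our version, I'd expect it to look something like the server-side scheduler defaults below (a sketch based on my reading of the docs, not something we've tested; the same settings should also be reachable through the /v1/operator/scheduler/configuration API):

```hcl
# Nomad server agent config (sketch, untested)
server {
  enabled = true

  default_scheduler_config {
    preemption_config {
      system_scheduler_enabled  = false
      service_scheduler_enabled = false
      batch_scheduler_enabled   = false
    }
  }
}
```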


djenriquez commented Feb 3, 2022

Wanted to give an update with a reproducible scenario for this problem:

During canary blue/green deployments, if there are not enough resources in the group of clients that fits the job's constraints, those allocations are queued, as designed, and the deployment waits on a placement failure. In response to the placement failure, a new client is spun up.

When that new client becomes available, the pending allocations are placed on the client BEFORE any system jobs the client should also run. Because of this, one or more of the new allocations MUST be preempted to make room for the system job. When these allocations are preempted, the deployment watcher fails to acknowledge that the preemption occurred, and therefore fails to replace them.

Now we have a deployment that can never succeed, because the number of healthy allocations it expects can never reach the desired count. When a preemption happens during a blue/green deploy, 100% of those deployments fail.

Given the dynamic activity of our clusters, this has become a significant problem: roughly 1 in 4 deployments now fail due to progress_deadline.

So, in short, a deployment will fail if the following conditions occur (a possible priority-based workaround is sketched after the list):

  1. A system job exists that must be scheduled on the Nomad clients.
  2. Pending allocations exist from a deployment.
  3. The pending allocations' total resource requirements only fit on the new client IF the system job is not considered.
  4. The pending allocations are scheduled onto the host before the system job.
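One mitigation we're considering (sketch only, not verified) is keeping the service job's priority at least as high as the system job's, so its allocations are not preemption candidates when the system job is placed:

```hcl
# Hypothetical: run the service job at a priority >= the system job's priority
# so the scheduler has nothing lower-priority to preempt on the new client.
job "example-service" {
  type     = "service"
  priority = 75   # assumed to be higher than the system job's priority

  group "app" {
    count = 3
    # ...
  }
}
```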

We appreciate the time spent on this issue.
