Preempted alloc during deploy guarantees failed deployment #8093

Open
djenriquez opened this issue Jun 2, 2020 · 4 comments
djenriquez commented Jun 2, 2020

Nomad version

Output from nomad version
Nomad v0.10.4 (f750636ca68e17dcd2445c1ab9c5a34f9ac69345)

Operating system and Environment details

Amazon Linux 2

Issue

Hello, whenever a job is running a canary/blue-green deploy, if an allocation is preempted before the deployment reaches its healthy count, the entire deployment is doomed to fail at progress_deadline.

This is because the deployment properly counts the canary, but when the canary is preempted it is not replaced. The final count is N-1 running allocations even though the deployment believes all N are placed. The deployment then waits for that preempted allocation to become healthy, which of course never happens since it is no longer running, until the progress_deadline hits; then the red bar comes up and all the progress is rolled back.

This is very difficult to reproduce since I'm not sure what triggers a preemption, but we see it fairly often, and it is extremely problematic when it does happen.
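For context, the update configuration in play is an ordinary canary/blue-green stanza, something along these lines (values are illustrative, not our exact jobspec):

```hcl
job "example" {
  group "app" {
    count = 3

    update {
      canary            = 3      # blue/green: bring up a full replacement set
      max_parallel      = 3
      min_healthy_time  = "30s"
      healthy_deadline  = "5m"
      progress_deadline = "10m"  # the deadline that ultimately fails the deploy
      auto_revert       = true
      auto_promote      = false  # promoted manually after verification
    }
  }
}
```

If one of those 3 canaries is preempted, the deployment still waits for 3 healthy canaries, so it just sits there until progress_deadline fires.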

Here is a deployment where that happened:
[Screenshot: Screen Shot 2020-06-01 at 5 15 41 PM]

Here is the allocation that was redacted:
[Screenshot: Screen Shot 2020-06-01 at 4 51 45 PM]

This may be related to an earlier issue I posted regarding deployment state falling out of sync that @tgross worked on a bit. Is it related? Should the pre-empted allocation have been detected and replaced by the deployment?

Thanks.

djenriquez (Author) commented:

Just wanted to give an update: our organization deploys a few times a day, and we see this issue a few times a week. Any update on a potential fix? It seems serious if canary deploys aren't dependable.

djenriquez (Author) commented:

Wanting to bump this issue. We're running 1.1.5 and can confirm that we still see this; we saw it a handful of times in our org's deployment attempts today.


djenriquez commented Nov 16, 2021

The recent upgrade to 1.1.5 seems to have significantly increased how often a preemption occurs during a deployment, which greatly increased the chance of hitting this bug. We are now experiencing this issue at a noticeable rate. It is incredibly difficult to run blue/green deployments via canaries with this issue.

When we watch deployments and catch it in time, we can "kick" the deployment by adjusting the count or any piece of metadata. However, when deployments are triggered by the autoscaler, likely mixed with migrations from nodes coming and going, all sorts of weird things happen.
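For anyone else hitting this: the "kick" is just any change that produces a new job version. A hypothetical example would be bumping a throwaway meta value and re-registering the job:

```hcl
job "example" {
  # Hypothetical workaround: any jobspec change creates a new job version and a
  # fresh deployment, so bumping a throwaway meta key is enough to "kick" it.
  meta {
    redeploy_nonce = "2021-11-16T10:05:00-08:00"
  }

  group "app" {
    count = 3
    # ... rest of the job unchanged ...
  }
}
```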

[Screenshots: Screen Shot 2021-11-16 at 9 50 44 AM, 9 50 12 AM, 10 01 44 AM, 10 01 40 AM, 10 01 33 AM]

You'll notice the deployment says it has been going on for a month (related: #11267). This is incorrect, as the job was stable just last Friday. You'll also see a progress deadline of the 15th, yet the deployment is still running today.

I managed to correct the job state by kicking off a new deployment by changing the job count, but this issue prevented the job from reaching its desired state, which also disabled autoscaling since the autoscaler does not run against jobs in deployment.

Any ideas on possible mitigations for this problem? Is it possible to disable preemption?
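To be concrete about what I'm asking: if disabling preemption is supported in our version, I'd expect it to look something like the server-side scheduler defaults below (a sketch based on my reading of the docs, not something we've tested; the same settings should also be reachable through the /v1/operator/scheduler/configuration API):

```hcl
# Nomad server agent config (sketch, untested)
server {
  enabled = true

  default_scheduler_config {
    preemption_config {
      system_scheduler_enabled  = false
      service_scheduler_enabled = false
      batch_scheduler_enabled   = false
    }
  }
}
```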


djenriquez commented Feb 3, 2022

Wanted to give an update with a reproducible scenario for this problem:

During canary blue/green deployments, if there are not enough resources in the group of clients that fits the job's constraints, those allocations are queued, as designed, and the deployment waits on a placement failure. In response to the placement failure, a new client is spun up.

When that new client becomes available, the pending allocations are placed on the client BEFORE any system jobs the client should also run. Because of this, one or more of the new allocations MUST be preempted to make room for the system job. When these allocations are preempted, the deployment watcher fails to acknowledge that the preemption occurred, and therefore fails to replace them.

Now we have a deployment that can never succeed, because the number of healthy allocations it expects can never reach the desired count. When a preemption happens during a blue/green deploy, 100% of those deployments fail.

Given the dynamic activity of our clusters, this has become a significant problem: roughly 1 in 4 deployments now fail due to progress_deadline.

So, in short, a deployment will fail if the following conditions occur (a possible priority-based workaround is sketched after the list):

  1. A system job exists that must be scheduled on the Nomad clients.
  2. Pending allocations exist from a deployment.
  3. The pending allocations' total resource requirements only fit on the new client IF the system job is not considered.
  4. The pending allocations are scheduled onto the host before the system job.
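One mitigation we're considering (sketch only, not verified) is keeping the service job's priority at least as high as the system job's, so its allocations are not preemption candidates when the system job is placed:

```hcl
# Hypothetical: run the service job at a priority >= the system job's priority
# so the scheduler has nothing lower-priority to preempt on the new client.
job "example-service" {
  type     = "service"
  priority = 75   # assumed to be higher than the system job's priority

  group "app" {
    count = 3
    # ...
  }
}
```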

We appreciate the time spent on this issue.
