Canary promotion reports wrong healthy count, preventing promotion #6407
Comments
@endocrimes, I built a quick little service to test this out, and I'm finding interesting behaviors with deployments. Here is randfail, which includes a Nomad jobfile to reproduce the behavior. This is running on Nomad 0.9.5 (server + agent). To replicate the canary issue with Nomad 0.9.5:
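(For readers without the randfail repo handy, a blue/green canary configuration of the kind it exercises generally looks like the sketch below; the job name, image, ports, and timing values here are illustrative assumptions, not the actual randfail jobspec.)
# Illustrative sketch only, not the actual randfail jobspec: blue/green is
# expressed in Nomad by setting canary equal to count, so a full replacement
# set of allocations is placed and must be promoted manually.
job "randfail-example" {
  datacenters = ["dc1"]
  group "web" {
    count = 10
    update {
      canary            = 10     # equal to count, so a full blue/green set
      max_parallel      = 10
      health_check      = "checks"
      min_healthy_time  = "10s"
      healthy_deadline  = "2m"
      progress_deadline = "10m"
      auto_promote      = false  # operator promotes once canaries report healthy
    }
    task "app" {
      driver = "docker"
      config {
        image = "example/randfail:latest"  # assumed image name
        port_map = {
          http = 8080
        }
      }
      service {
        name = "randfail-example"
        port = "http"
        check {
          type     = "http"
          path     = "/"
          interval = "5s"
          timeout  = "2s"
        }
      }
      resources {
        memory = 128
        network {
          mbits = 1
          port "http" {}
        }
      }
    }
  }
}
With this shape, promotion waits on all 10 canaries reporting healthy, which is exactly the state the bug in this issue leaves stuck.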
Hi guys, is there any confirmation on whether this is a known issue? Maybe I'm doing something wrong? Would love to get blue/green deployments going for our org; devs feel rolling deploys are a bit painful in terms of speed 😉.
Tested this on 0.10.0 and verified that the issue still exists. Given that blue/green deployment is a fundamental feature marketed for Nomad, and that this issue makes the feature unusable, when do we foresee traction on it? Nomad 0.10.0 canary promotion failure: @dadgar, are you able to comment?
Interesting. It has been 50 minutes past the progress deadline and the deployment is still running, waiting to be promoted, but it cannot be promoted because of the bug in this issue. So this deployment is completely stuck; it does not fail. It looks like the only way to get it unstuck is to run another deployment.
@eveld, I forgot to ask about this in the webinar this morning 😅. I would really appreciate an update on this issue as I'd love to sell blue/green deploys to my org but can't promote it currently because of this issue... could this issue get some ❤️?
Hi @djenriquez, sorry it took so long to get back to you on this. I've been trying to replicate the behavior you're showing here but unfortunately I haven't been able to do so.
I've been using the randfail jobs you provided (thanks so much for providing test cases!). I checked through the spec pretty carefully along with the docs to make sure that this isn't just a misconfiguration or documentation confusion.
It looks like the initial randfail.hcl job can take a very long time to converge to a successful state if the random number generation is unkind. In my most recent run here I had to place 25 containers before I got 10 that were ok. But once it was running I ran randfail.canary.hcl, waited for 10 healthy canaries to appear, and was able to promote it with no problems.
▶ nomad deployment status b69f6388
ID          = b69f6388
Job ID      = randfail
Job Version = 1
Status      = successful
Description = Deployment completed successfully
Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
randfail    true      10       10        15      10       5          2019-11-08T21:02:55Z
I've assigned this issue to myself and I want to see if I can create a faster-converging test case, but first I need to be able to repro.
I did make one modification to the jobs, which was to reduce the amount of memory used so that I could fit the deployment on my test rig. And I did notice on one run that one of the previous deploy's tasks failed after the deployment had been marked healthy. Given the size of the jobs we're running here, is there any chance that some of the tasks are becoming unhealthy in your environment because of resource restrictions, and that's fouling the deployment?
Hi Tim,
Thank you for getting this on your list! Hmm, if you are unable to replicate it, then maybe something is wrong in my environment. I can replicate this behavior pretty much every time.
We are using Nomad 0.10.0 with Consul 1.6.1. The jobs definitely have enough resources, and it is not an issue after the deployment; it is always during the deployment, as shown in the screenshots in my earlier posts.
We use a three-server Nomad deployment and reach the UI via a load balancer that round-robins requests between the Nomad servers. I wonder whether I would see these issues if I tried a single-server deployment.
@tgross did you try to replicate the issue with a multi-server environment?
@tgross I verified. In a single-server deployment, everything works well. I reduced the server count to 1 for our development environment and ran through the randfail tests. Promoting the deployment results in a successful deployment. However, when I increased the server count to just 2 servers, the issue returned (notice that the previous deployment, shown above, was successful). So this issue only exists in multi-server environments, which I would think covers every production environment that uses Nomad. It appears there is a state synchronization issue. Bug in Raft?
Thanks for that test! That's really interesting and probably narrows down the set of places where this bug could be lurking. I'll continue my investigation and circle back here soon.
Ok, I've been able to reproduce and found some interesting logs! I went through the process and waited for the deployment status to report 10 healthy, and then promoted it:
The job status is showing something strange, however. We're showing 10 healthy canaries for the deployment but only 15 allocations listed in the job status:
Here's the alloc status of one of these jobs (it has no logs, however):
The server logs also show that each server has a different view of the task group health at the moment we try to do the promotion. Server 1:
Server 2:
Server 3:
But the weird thing is that this is persistent. Even after I've waited long enough to write all of this report up, I retried the deployment and they all show the same error message; they haven't converged on an opinion about task group health. I may need to dig into how that gets determined.
It looks like it isn't the allocation exiting by itself that's causing them to be marked complete.
The eval for that allocation was:
Which has the following associated log entries:
So now we need to figure out why Nomad decided that we were supposed to stop these allocations.
I've been leaving this job running as I've been debugging, and a random failed health check on one of the tasks kicked off a new round of evaluations and a few more completed tasks. This brought the count of allegedly healthy allocations up to 8/10. And here are the allocations for our job, which is showing more than 10 of version 4 now:
Which the deployment status agrees with:
@tgross working through some of your questions right now. The jobs are rescheduled via the reschedule stanza timeouts, so setting those and the health-check timings to lower values could speed up the test case quite a bit.
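(To make that concrete, here is a rough sketch of the kind of reschedule and health-check tightening being suggested; these stanzas go inside the job's task group, and the specific values are assumptions for illustration, not taken from the randfail job.)
# Rough sketch (values are illustrative assumptions, not from the randfail job):
# tightening these makes failed canaries get rescheduled, and health verdicts
# arrive, much sooner during a test run.
reschedule {
  attempts       = 3
  interval       = "5m"
  delay          = "10s"
  delay_function = "constant"
  unlimited      = false
}
update {
  min_healthy_time = "5s"    # shorter confirmation window before "healthy"
  healthy_deadline = "30s"   # give up on an allocation's health sooner
}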
I'm running on a 2-server Nomad deployment, and each server shows the same status (might just be my luck; I'll try increasing the fail percentage after this):
ip-10-179-158-186 is reporting as the leader. When I list the allocations, I do see an alloc of the previous (blue) job version that shows as complete:
Its evaluation:
The slow rotation of tasks onto version 4 made me suspicious. So I caused one of the version 3 tasks to lose its network connectivity and fail its health check, and in response Nomad created multiple version 4 tasks:
At this point, Nomad thinks we have 10/10 and a request to promote succeeds:
The version 3 jobs are then stopped, as are any version 4 allocations beyond the 10 we want. @djenriquez at this point I've got a working reproduction and at least some theories as to where the problems could be occurring. I'm going to try to create a more minimal test case and start digging into the areas of code we're touching here next.
@djenriquez I've run a number of tests, changing some of the parameters to see if I can narrow the search space a bit. With a smaller job that fails a bit less often, or with a smaller number of canaries, I've found it's easier to get a successful or failed deployment than this mysterious stuck deployment. So, worst-case scenario, in the meantime you might want to consider using a smaller set of canaries than full blue/green until we have this figured out (see the sketch below).
Here's where we are so far. I've done some scripting to parse through the evaluations API and, unfortunately, it looks like there may be two different buggy behaviors to unpack here. The obvious symptoms described as (1) below aren't necessarily the direct cause of the deploys getting stuck (2), because we can get stuck without (1) happening. But because I can't reproduce either of these behaviors without canaries, that at least narrows it down to the canary code.
That particular message only shows up in a few places in the code base. This graph shows an example of an in-flight deployment, with the blue allocations on the right-hand side.
This is the stuck-state graph, taken after the progress timer should have expired. The failed alloc is the red one.
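(Concretely, the smaller-canary workaround suggested above amounts to an update stanza along these lines; the numbers are assumptions for illustration, not values from the randfail job.)
# Workaround sketch: a partial canary set instead of full blue/green.
# With count = 10, only two canaries are placed; after promotion the
# remaining allocations are rolled according to max_parallel.
update {
  max_parallel      = 2
  canary            = 2      # smaller than count, so not a full blue/green set
  health_check      = "checks"
  min_healthy_time  = "10s"
  progress_deadline = "10m"
  auto_promote      = false
}
Promotion then only needs two healthy canaries, which in the tests above made the stuck state less likely.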
With a slightly improved version of the script I'm using to process the data, I think I've got a clue. The graph below is the same data as (2) above, but with the addition of blue arrows that show the canaries that the deployments are tracking. Note that our stuck deployment is still tracking only its original canaries. So it looks like the deployment is not tracking restarted canaries.
The odd thing here is that we'd expect that this rescheduled allocation shouldn't have been scheduled at all: https://www.nomadproject.io/docs/job-specification/reschedule.html#rescheduling-during-deployments
But the raw data definitely says that it was.
Another workaround for you while we work on this @djenriquez is that if you have the
This is incredibly impressive work on the investigation, @tgross. I'm not sure where I can help you, but I'm more than happy to do so if there are things I'm able to do. As for the workaround, that actually doesn't work well for us, since the problem we experience that triggers this is related to #6567, which fails allocations immediately. It happens, unfortunately, more often than one would hope. There is also #6350, which was pre-0.10.0, but back then we had to register Consul services at the task level, so we registered them on the sidecar proxy task since that represented the network for the service. With a
Ah, right... well, the good news is that my PR to the CNI plugins project was accepted, so hopefully they'll be cutting a release with that soon.
That issue is on my radar as well!
After some more testing I have working single node tests and I've been breaking down the symptoms. There are two real problems and one false one:
One of the symptoms we've seen in the large-scale tests is actually correct behavior, but it's emergent when the other bugs are happening. When a deployment of a job with canaries gets stuck, the next deployment will cause all its canaries to be stopped because they were never promoted. When this happens alongside (1), it looks like we're stopping previous-job-version allocs prematurely.
The issue I've identified as (2) above turns out to be a known tradeoff in the scheduler, which is in our internal design docs but not documented in public. I've opened #6723 to make sure this gets documented. That leaves:
For each allocation that's part of a deployment, we start up a watcher that stops watching once the allocation first reports healthy. Which means that if the allocation fails after it initially reports it's healthy, we're in a state where the deployment thinks everyone is healthy, but when we promote the deployment to make that a reality, the server's state machine rejects the change because there aren't enough canary allocations that are actually healthy.
The "obvious" fix is to not exit the watcher loop, but to allow subsequent health updates to change the deployment status. But it's not clear how this should interact with the progress deadline. If the progress deadline has passed with all canaries healthy, but one of the canaries fails, what should happen? Are there other implications to the deployment behavior I haven't thought of? Alternately (as a somewhat incomplete fix), we could have the promotion process fail the deployment if it hits an error past the promotion deadline.
Minimal reproduction of the remaining issue:
# start a job
nomad job run ./test.hcl
# wait for the deployment to succeed
nomad deployment status $id
# bump the env.version and redeploy
nomad job run ./test.hcl
# wait for all 3 allocs to be marked healthy, so that the
# deployment is pending manual approval
nomad deployment status $id
# note that we have 6 running allocs
nomad job status test
# kill one of the new containers
docker kill $(docker ps | awk '/nginx/{print $1}' | head -1)
# wait for progress deadline to expire, deployment will not be failed
# and promotion will be stuck
nomad job status test
nomad deployment status $id
nomad deployment promote $id
jobspec:
job "test" {
datacenters = ["dc1"]
group "webservers" {
count = 3
task "nginx" {
driver = "docker"
config {
image = "nginx:latest"
port_map = {
http = 80
}
}
env {
version = "0"
}
service {
name = "nginx"
port = "http"
check {
type = "http"
port = "http"
path = "/"
interval = "5s"
timeout = "3s"
check_restart {
limit = 1
grace = "5s"
ignore_warnings = false
}
}
}
resources {
memory = 64
network {
mbits = 10
port "http" {}
}
}
}
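# Note: with attempts = 0 (and the default "fail" mode), any task failure,
# including a restart triggered by check_restart above, fails the whole
# allocation instead of restarting it in place.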
restart {
attempts = 0
delay = "10s"
}
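# Note: canary = 3 with count = 3 makes this a full blue/green deployment, and
# auto_promote = false means it waits for a manual "nomad deployment promote".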
update {
max_parallel = 3
health_check = "checks"
min_healthy_time = "5s"
healthy_deadline = "30s"
progress_deadline = "2m"
auto_revert = false
auto_promote = false
canary = 3
}
}
}
@tgross, I think that as a system user, I would expect that during a deployment, if a canary goes healthy and then fails, a new allocation needs to be added to make up for the failed allocation that happened during the deployment. The edge case I see in this is the possibility of an endless deployment, where an allocation always goes unhealthy and is replaced before the final allocation can be marked as healthy. In this case, the progress deadline will continually be reset as new allocations come in and become healthy for a period of time. Although, I don't think this is incorrect, as that should be the responsibility of the
Also, taking a step back, I want to make sure the focus is on the correct issue. Yes, I do see the problematic scenario presented by this allocation watcher, but why is it not an issue when there is a single server? That, to me, seems more like a shared-state issue than a logical issue with how the system handles failed allocations during a deploy. Remember, with a single server, it didn't matter how many times the allocations failed during the deploy; once the healthy number was hit, we were able to promote.
Yeah, that's a good point.
I know I ended up spewing a whole bunch of debugging info here, so it would have been easy to miss 😀, but it turned out, starting in #6407 (comment), that I was able to reproduce the behavior with a single node, even just in dev mode.
Ahhhh, I didn't know that. Interesting, so the problem goes deeper. Thanks for clarifying that for me.
Nomad version
0.9.5, 0.10.0
Operating system and Environment details
Amazon Linux 2
Issue
When doing canary blue/green deployments, the promote API incorrectly reports failed counts rather than the final healthy count, preventing promotions from occurring.
The screenshots below do not show the original deployment with failed allocations since I created a new deployment to try to mitigate the issue. However, even in this job version, the healthy count does not match what the promote API is reporting.
This issue happens 100% of the time that allocations fail during a canary deploy, even if they are rescheduled and end up healthy.
Also an important detail: IT ALWAYS WORKS when there are no failed allocations. Only when allocations fail does this happen.
EDIT: Added Nomad 0.10.0 as version affected