aws-ecs: ApplicationLoadBalancedFargateService generates stacks hung in "UPDATE_IN_PROGRESS" and failed health checks #30728
Comments
After some more testing, this absolutely has something to do with deploying when there's a new ECR image waiting to be picked up by the task. With some tweaks to health check grace period, I can deploy all day long with no issue, but as soon as a new container image is waiting everything goes bonkers and I have to trash the stack and scratch deploy to recover. This is extremely frustrating, would love to have a workaround.
(Just saw this and maybe I can lend a hand, since I had a very similar issue with 10GB images.) You probably have a large container image that takes a long time to provision (download from ECR) and health checks that are too short. Check the ECS logs and the Service Events tab; those could shed some light as well.
Thanks for this - in this case these images are relatively small, 200-300MB. I seem to recall seeing log output indicating that they start successfully but I'll pay attention the next time I try this. Like I said in the bug report, the events tab just shows an endlessly repeating cycle of start, unhealthy, stop, de-register. I ground away on this all day yesterday and part of my problem seems to be that the defaults are a little asinine. By default, the deployment circuit breaker is disabled, and the minHealthyPercent value appears to be 100. Which seems to me like a recipe for a deadlocked deployment any time you have desiredCount > 1. I turned on the circuit breaker, set a generous grace period, and minHealthyPercent to 50:
And the situation is a little better: the circuit breaker did detect a deadlocked deployment and cancelled it... after 4 hours. At least the stack isn't stuck in an endless update, I guess? My last gasp here is experimenting with just deploying a dummy "hello world" image to get the infrastructure set, and pushing actual image updates in response to git pushes via a CLI script. Which is, frankly, precisely the kind of situation I look to CDK to help me avoid. If that doesn't work then I'll give up and look for some canned Terraform. Edit to add, FWIW, I have a working cluster that was hand-configured and the images I'm deploying here work fine there, so this doesn't feel like an image problem.
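The concern above about minHealthyPercent and desiredCount can be checked with a little arithmetic. The sketch below is not from the thread. It assumes the rounding rules in the ECS documentation as I understand them (the minimum is rounded up, the maximum is rounded down), and `rollingBounds` is a hypothetical helper name, not a CDK or ECS API.

```typescript
// Bounds on the task count during a rolling update (illustrative helper).
interface RollingBounds {
  minRunning: number; // tasks that must stay RUNNING during the deployment
  maxRunning: number; // ceiling on RUNNING plus PENDING tasks
  headroom: number;   // room the scheduler has to start new or stop old tasks
}

function rollingBounds(
  desiredCount: number,
  minHealthyPercent: number,
  maxHealthyPercent: number
): RollingBounds {
  // Assumption: lower limit rounds up, upper limit rounds down.
  const minRunning = Math.ceil((desiredCount * minHealthyPercent) / 100);
  const maxRunning = Math.floor((desiredCount * maxHealthyPercent) / 100);
  return { minRunning, maxRunning, headroom: maxRunning - minRunning };
}

// desiredCount = 2 with min 100 / max 100 leaves no room to move.
console.log(rollingBounds(2, 100, 100));
// desiredCount = 2 with min 50 / max 200 lets one old task stop early.
console.log(rollingBounds(2, 50, 200));
```

Under these assumptions, a headroom of 0 means the scheduler can neither start a replacement task nor stop an old one. A minimum of 100 on its own would only have that effect if the maximum were also 100, so it is worth checking which maxHealthyPercent the service actually ends up with.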
This just seems to be broken and unusable for me. If I build, push and tag an image to ECR and then force a deployment via If, however, I use
Hi. Let me explain a little about this. CDK deploys ECS services via CloudFormation (CFN for short). In CFN, an ECS service deployment has to reach a stable state before CFN enters the
With AWS CLI, when you run Looks like your initial deployment is good and it only fails on your update on the existing deployment? I would like to know:
Hope it helps!
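To make the "stable state" idea above concrete, here is a minimal sketch of the condition that, as far as I know, `aws ecs wait services-stable` polls for: a single deployment remaining and runningCount equal to desiredCount. Whether CloudFormation applies exactly the same check is an assumption. `ServiceSnapshot` and `isStable` are hypothetical names; the field names mirror the ECS DescribeServices response.

```typescript
// Hypothetical, trimmed-down view of one service from DescribeServices.
interface ServiceSnapshot {
  desiredCount: number;
  runningCount: number;
  deployments: { status: string }[]; // e.g. "PRIMARY" (new), "ACTIVE" (old)
}

// A service counts as stable once the old deployment has drained
// and every desired task is running.
function isStable(s: ServiceSnapshot): boolean {
  return s.deployments.length === 1 && s.runningCount === s.desiredCount;
}

// Mid-update: the old deployment is still present, so this is not stable
// even though the running count matches the desired count.
const midUpdate: ServiceSnapshot = {
  desiredCount: 2,
  runningCount: 2,
  deployments: [{ status: "PRIMARY" }, { status: "ACTIVE" }],
};
console.log(isStable(midUpdate));
```

If new tasks keep failing health checks and being replaced, the old deployment is never drained, this condition never becomes true, and a stack waiting on it would sit in UPDATE_IN_PROGRESS.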
Hi, and thanks for the reply. I understand there's some very complex interaction between CDK and CFN and ECS and that both of the latter are by themselves extremely complex systems. I have kind of moved on here since I was not able to get deployments to work reliably. I'm now just using CDK to do initial environment setup, and using ECS CLI commands to do all subsequent task updates. Which is far from ideal, but works. I believe I have narrowed things down to:
If I then modify anything which would cause a new task definition to be created, (e.g. change one of the task definition environment values via In both situations, I have been able to see output from the running task images, in both the ECS and CloudWatch logs, indicating that they have started, and they seem to be running before what I understand through experimentation to be the controlling metric (the health check grace period) has elapsed.
That's a terrible user experience, and frankly if this were my first attempt at provisioning infrastructure via CDK (I am, in fact, very successfully using it to manage a large cloud-native platform), I would have put it down, walked away and never looked back.
Describe the bug
Initial deployments using ApplicationLoadBalancedFargateService from ecs-patterns complete successfully and generate working, healthy, reachable services. All subsequent deployments fail with a particular series of events:
The situation will not resolve itself over a duration of 6 hours.
If a user cancels the cdk deployment script, then:
However, of course the changes in the stack update haven't been applied.
Have reproduced in the following conditions:
This is pretty severe and it's preventing us from using CDK to manage any ECS infrastructure at all.
Expected Behavior
The CF stack should update successfully on subsequent deployments, and ECS service updates should happen only when they are necessary. Based on my testing and experimentation, I'm seeing ECS updates being made when nothing about the service has been changed in my code, which is confusing at best.
Current Behavior
As above. Deployments subsequent to the first fail with a hung "UPDATE_IN_PROGRESS" stack, apparently because ECS health checks are failing. Interestingly, this occurs even when the changes do not affect any ECS services or tasks, only unrelated resources in the same stack, such as an SSM parameter rename or value change.
Reproduction Steps
I'm using CDK through a wrapper package that supplies a bunch of boilerplate for consistent naming and whatnot. Happy to provide more info.
Sample reproduction code (typescript):
Sample CF template:
Possible Solution
No response
Additional Information/Context
Open to alternative suggestions or workarounds. Landed on ecs-patterns because it was the quickest way to get a service up and running from scratch, not married to it.
CDK CLI Version
2.139.1 (and also 2.147.2)
Framework Version
No response
Node.js Version
18 and 21
OS
Linux Ubuntu (real and github workflow runner image)
Language
TypeScript
Language Version
5.0.4 and 5.5.3
Other information
No response