ECS Agent Disconnected becoming more common #985
Comments
I also had this problem.
I also had this problem. VMs on our clusters keep disconnecting and do not automatically reconnect.
@djenriquez @ricardson @aayushchugh07 we're sorry to learn that you're running into such an issue. @djenriquez as far as container instance
I cannot figure out what's causing the following error messages to be printed, especially as I could not find corresponding disconnection logs on our servers, which leads me to suspect that either the ECS Agent, the websocket library, or something else on the instance is resulting in the disconnection.
Can you please let us know how these instances are set up? Are these instances running behind a proxy by any chance? Is there some custom configuration that you're using to bring up these instances? We're trying to replicate this issue in our test setup, and any information you can share would help us debug it. It'd also help us a lot if you could use the ECS Logs Collector to collect logs from such instances (at debug level if possible). Thanks,
Thanks for the reply @aaithal, these instances are running in a private subnet in a VPC, using the AWS NAT service for outbound traffic. Nothing fancy around them, no proxies or anything. There is no custom configuration, just typical ECS config. Let me run the log collector for ya and I'll get them over soon.
@djenriquez thanks for that information. It'd be super awesome if you could provide logs at
@aaithal logs sent. I ran the script with
@djenriquez apologies for the confusion. I meant debug-level logs from the Agent, which should be collected if the Agent was started with
Sure, that host is now collecting logs with
Hi @djenriquez, the main thing that stands out in the log files that you shared is that there are no messages in the log file that correspond to the disconnection. The next ACS connection attempt happens 4 hours later, which is unreasonably long.

An explanation for why the disconnection did not go through is that the Disconnect() method is unable to acquire the lock because it is held by other methods in the client. My primary suspicion was the Connect() method as the entity holding the lock. However, when I added rules on the host to explicitly drop packets destined for ACS, the lock was released after 30s, as the connection failed with an i/o timeout error. The only other place where this lock is held (and that involves network I/O) is the WriteMessage() method. It's plausible that if the connection goes bad, this could hang.

Looking at the websocket library, calling SetWriteDeadline() should address this (Improvement [1]), but we do not do that in the Agent today. We also hold the lock for the entire length of the Connect() method, and we could reduce its scope to just the access of the cs.conn object (Improvement [2]). Another improvement would be to set the HandshakeTimeout in the websocket client to some non-zero value, as it's currently not set at all (Improvement [3]). All of these proposed code changes should help the Agent handle websocket disconnection/reconnection better (a sketch of improvements [1] and [3] follows this comment).

However, I've been unable to reproduce this error in my test cluster, so it's hard to validate whether these changes will actually solve the "lock being held for unreasonably long periods of time" issue. So far I've tried these:
Next, I'm going to try:
Can you share, or think of, anything else that might help us reproduce this issue and validate that the code changes we'll be making will resolve it? Thanks,
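A minimal sketch of what improvements [1] and [3] could look like with the gorilla/websocket package that the discussion above refers to. The timeouts, type name, wrapper methods, and endpoint below are hypothetical placeholders for illustration, not the Agent's actual code.

```go
package main

import (
	"net/http"
	"time"

	"github.com/gorilla/websocket"
)

const (
	// Hypothetical timeout values, chosen only for illustration.
	handshakeTimeout = 10 * time.Second
	writeTimeout     = 30 * time.Second
)

type client struct {
	conn *websocket.Conn
}

func dialACS(url string, header http.Header) (*client, error) {
	// Improvement [3]: a non-zero HandshakeTimeout so the websocket
	// handshake itself cannot hang indefinitely.
	dialer := websocket.Dialer{HandshakeTimeout: handshakeTimeout}
	conn, _, err := dialer.Dial(url, header)
	if err != nil {
		return nil, err
	}
	return &client{conn: conn}, nil
}

func (c *client) writeMessage(payload []byte) error {
	// Improvement [1]: set a write deadline before every write so that a
	// bad connection results in an i/o timeout instead of a hung write
	// (and a lock held indefinitely by the caller).
	if err := c.conn.SetWriteDeadline(time.Now().Add(writeTimeout)); err != nil {
		return err
	}
	return c.conn.WriteMessage(websocket.TextMessage, payload)
}

func main() {
	// Usage example; the endpoint is a placeholder, not a real ACS URL.
	c, err := dialACS("wss://example.invalid/acs", nil)
	if err != nil {
		return
	}
	defer c.conn.Close()
	_ = c.writeMessage([]byte(`{"type":"heartbeat"}`))
}
```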
We configure all of our instances with the same immutable build for our development and production environments (except for some differences in metrics collection). I can tell you that this does not happen in our sandbox environments, so I wonder if the additional load contributes to the problem. There are some services we have that handle a lot of connections, though I don't think they break any world records or anything. We do run these
These, however, are also configured in our development env, which, again, does not run into any problems.
Just wanted to chime in that I have a cluster behaving like this right now.
@antipax As per my earlier response, we'd really appreciate it if you could share logs from the instance to help us debug this further. @djenriquez I have an update on my test setup and issue reproduction attempts here. I have a cluster of 100 instances constantly churning containers at a rate of roughly 4 tasks (8 containers) every 2 minutes in a private subnet. The instances are configured to drop 25% of outgoing packets to simulate networking issues. I have also lowered our inactivity time threshold to simulate Agent-initiated disconnects. This has been running for 2 days now and I still haven't been able to reproduce the issue. Occasionally there's a delay in establishing the websocket connection, but the instances do recover in a matter of minutes and reestablish the connection. That said, I'm going to go ahead and make the changes as per #985 (comment) anyway, because I think all of them would improve connection management in the ECS Agent, which should also help with such issues in the future. Thanks,
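For context, a minimal sketch of how container churn like the above could be driven against a test cluster via the ECS RunTask API. The cluster name "test-cluster" and task definition "churn-task" are hypothetical placeholders; the packet dropping and the lowered inactivity threshold described above would be configured separately at the OS/agent level and are not shown here.

```go
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSession())
	client := ecs.New(sess)

	// Hypothetical names standing in for a real cluster and a registered
	// task definition.
	cluster := "test-cluster"
	taskDef := "churn-task"

	// Launch roughly 4 tasks every 2 minutes, mirroring the churn rate
	// described in the comment above.
	ticker := time.NewTicker(2 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		out, err := client.RunTask(&ecs.RunTaskInput{
			Cluster:        aws.String(cluster),
			TaskDefinition: aws.String(taskDef),
			Count:          aws.Int64(4),
		})
		if err != nil {
			log.Printf("RunTask failed: %v", err)
			continue
		}
		log.Printf("started %d tasks", len(out.Tasks))
	}
}
```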
Thanks @aaithal, I will try to grab some logs. If it helps, it seems to occur more frequently when a task is repeatedly getting restarted by the agent because it is failing.
@aaithal I sent you more logs from a box that ran into this issue, with debug mode enabled for both the ecs-agent and the docker daemon at the time of the event. Let me know if anything in there needs clarification.
@djenriquez how (if at all) are you working around this currently? I'm seeing it a lot lately.
@antipax An Ansible ad-hoc command to look for the disconnected agents, then run
@djenriquez Ah, I have a Lambda set up now that checks every 15 minutes for instances with status disconnected and terminates them (as they are replaced by an autoscaling group). Will share it if it works well; I just added it to our testing environments.
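A minimal sketch of what such a cleanup Lambda might look like in Go, assuming the aws-lambda-go and aws-sdk-go packages. The CLUSTER_NAME environment variable is a hypothetical placeholder, and this is not the commenter's actual function; the 15-minute schedule would be configured on the Lambda trigger (for example, a CloudWatch Events rule), not in the code.

```go
package main

import (
	"log"
	"os"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// handler terminates EC2 instances whose ECS agent is disconnected so that
// the autoscaling group can replace them. CLUSTER_NAME is a hypothetical
// environment variable used here for illustration.
func handler() error {
	sess := session.Must(session.NewSession())
	ecsClient := ecs.New(sess)
	ec2Client := ec2.New(sess)
	cluster := os.Getenv("CLUSTER_NAME")

	list, err := ecsClient.ListContainerInstances(&ecs.ListContainerInstancesInput{
		Cluster: aws.String(cluster),
	})
	if err != nil || len(list.ContainerInstanceArns) == 0 {
		return err
	}

	desc, err := ecsClient.DescribeContainerInstances(&ecs.DescribeContainerInstancesInput{
		Cluster:            aws.String(cluster),
		ContainerInstances: list.ContainerInstanceArns,
	})
	if err != nil {
		return err
	}

	// Collect the EC2 instance IDs of container instances whose agent is
	// reported as disconnected.
	var toTerminate []*string
	for _, ci := range desc.ContainerInstances {
		if ci.AgentConnected != nil && !*ci.AgentConnected {
			toTerminate = append(toTerminate, ci.Ec2InstanceId)
		}
	}
	if len(toTerminate) == 0 {
		return nil
	}

	log.Printf("terminating %d instances with disconnected agents", len(toTerminate))
	_, err = ec2Client.TerminateInstances(&ec2.TerminateInstancesInput{
		InstanceIds: toTerminate,
	})
	return err
}

func main() {
	lambda.Start(handler)
}
```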
That should work if you have the proper immutability setup. The problem, though, is that since these disconnects are so common, there is a good chance multiple instances are terminated at the same time; if a service doesn't have enough containers, that could kill every instance a specific service is running on == outage.
@djenriquez in my particular setup, that is OK, since the service in question is running on all container instances :)
Fault-tolerance FTW. 😉
@djenriquez Thanks for sending that second set of logs. However, the Agent was still not logging at the debug level. Thanks,
Glad to know there's progress! Thanks for your efforts @aaithal and team!
This commit aims to make the websocket connection management better by implementing the following improvements:
1. Set read and write deadlines for websocket ReadMessage and WriteMessage operations. This ensures that these methods do not hang, and instead result in an i/o timeout, if there are issues with the connection.
2. Reduce the scope of the lock in the Connect() method. The lock was being held for the length of the Connect() method, which meant that it wouldn't be relinquished if there was any delay in establishing the connection. The scope of the lock has now been reduced to just accessing the cs.conn variable.
3. Start the ACS heartbeat timer after the connection has been established. The timer was being started before the call to Connect, which meant that the connection could be prematurely terminated for being idle if there was a delay in establishing the connection.
These changes should improve the disconnection behavior of the websocket connection, which should help with scenarios where the Agent never reconnects to ACS because it's forever waiting in the Disconnect() method to acquire the lock (#985)
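A rough sketch of items 2 and 3 from the commit message above, again assuming gorilla/websocket. The clientServer type, field names, and the 5-minute idle threshold are hypothetical and only illustrate the pattern; the actual change is in the linked pull request.

```go
package main

import (
	"net/http"
	"sync"
	"time"

	"github.com/gorilla/websocket"
)

const heartbeatTimeout = 5 * time.Minute // hypothetical idle threshold

type clientServer struct {
	lock sync.Mutex
	conn *websocket.Conn
}

// connect dials the backend without holding the lock for the whole call;
// the lock now only guards the assignment to cs.conn (item 2).
func (cs *clientServer) connect(url string, header http.Header) (*time.Timer, error) {
	dialer := websocket.Dialer{HandshakeTimeout: 10 * time.Second}
	conn, _, err := dialer.Dial(url, header)
	if err != nil {
		return nil, err
	}

	cs.lock.Lock()
	cs.conn = conn
	cs.lock.Unlock()

	// Item 3: the heartbeat/inactivity timer is started only after the
	// connection is established, so a slow handshake can't cause the
	// connection to be treated as idle and torn down prematurely.
	timer := time.AfterFunc(heartbeatTimeout, func() {
		cs.disconnect()
	})
	return timer, nil
}

// disconnect closes the connection; because connect no longer holds the
// lock across network I/O, this can acquire it promptly.
func (cs *clientServer) disconnect() {
	cs.lock.Lock()
	defer cs.lock.Unlock()
	if cs.conn != nil {
		cs.conn.Close()
	}
}

func main() {
	cs := &clientServer{}
	// Placeholder endpoint for illustration only.
	if timer, err := cs.connect("wss://example.invalid/acs", nil); err == nil {
		defer timer.Stop()
		defer cs.disconnect()
	}
}
```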
@djenriquez We've merged #993, which should make things better on the websocket connection management front. This will be included in the next Agent release. Thanks,
@aaithal is there an ETA for the next release, or a schedule of some sort? Very much hoping this fix alleviates the issues we've been seeing. Thanks for the quick responses!
@aaithal is it normal for the ECS agent to close a connection on instantiation? On new boxes, literally the first thing the agent does is start up, then report it hasn't had any connections in a while and close:
That's odd; unless "a while" means 1 minute, the timeout is being triggered on start? This makes using new ECS instances a pain, because we have to wait for the agent to disconnect and reconnect before it actually starts up new tasks.
The timestamps show ~1.5 minutes. On some agents, it continues to report the same thing. Here's a good example:
This guy took ~5.5 minutes from the time the ECS agent started to when it began to pull the Docker image. Is this something the fix for this PR could address? If not, should I open a new issue?
Hmm, this doesn't look to have made the
Actually, it looks like #993 was included; thanks, going to try it out now.
@djenriquez Confirming that this did make the v1.14.5 release (see commits here). Also, I missed your previous post, sorry about that. The behavior that you mentioned in #985 (comment) should also be improved by #993. Please let us know if you continue running into this issue. Thanks,
FYI ... I have seen this issue on 1.16.0 as of 12/20/2017. Don't have a root cause, but we saw it today. The workaround was to terminate and then have ECS replace the nodes -- everything worked after that.
Having similar issues atm.
@russomi @fwirtz can you please send us logs from the ECS agent (hopefully at debug level)? Thanks,
@aaithal thanks for the fast response. I killed the respective instance and recreated it. Also, from taking another look at the agent's logs, I deduced that there were permissions missing.
Having this issue happen very regularly; it has taken down all 3 of our production instances over the course of a few hours. Even though we have multiple containers running, it doesn't make a difference. I'm on the latest AMI and Agent version 1.17.2, but I can't trust the stack to stay up for more than a day. I'm currently looking into other options, as it appears ECS is not stable enough to be used for production when there is no way to set up a failover in case of an ECS agent disconnect. Is there anything that can be done to tie into auto scaling when an agent disconnects for X number of minutes? I know I need to send logs. What do you need me to turn on on the instances to get the right log level?
Hi @marcato15, we're sorry to hear that you're running into this issue. Can you please send us logs at the debug level? Thanks,
@RobertCPhillips No, I did not; it works for me as of now.
Any update on this, guys?
I've just opened a support ticket with Amazon for this problem, which still occurs in 2020... I've been using 1.15.0 since the beginning and never had any problems until now (2+ years up and running). For the fourth day in a row now, the agent disconnects and my docker containers disappear. It even happened twice today. I need to manually restart and re-run the ecs-agent for my application to work again. My next step will be to create a new instance and delete the "faulty" one. Any help or insights appreciated, even if my workaround solves the problem, because this kind of crash makes me lose some trust in AWS services.
This also happened out of the blue for me (21st Jan), and I'd never seen this before now... There should be an ECS health-check failure for "agent disconnected" (since it never recovers). The setup above had been running since December without any alterations. We had to cycle all the underlying instances to get things up and running.
Summary
We're seeing more and more ecs-agents being disconnected recently, running on both 1.14.4 and 1.14.3, that do not recover on their own. We've been needing to connect to the boxes and run
stop ecs && start ecs
after which some will stay connected, while others remain disconnected.
Description
Expected Behavior
Agents stay connected, or reconnect on their own if issues arise.
Observed Behavior
Agents are disconnected and staying disconnected.
Environment Details
Supporting Log Snippets
Some instances say something along these lines:
While some say:
Any idea what's going on?