ECS Agent Disconnected becoming more common #985
Comments
I also had this problem.
I also had this problem. VMs on our clusters keep disconnecting and do not automatically reconnect.
@djenriquez @ricardson @aayushchugh07 we're sorry to learn that you're running into such an issue. @djenriquez as far as container instance
I cannot figure out what's causing the following error messages to be printed, especially as I could not find corresponding disconnection logs on our servers, which leads me to suspect that either the ECS Agent, the websocket library, or something else on the instance is resulting in the disconnection.
Can you please let us know how these instances are set up? Are these instances running behind a proxy by any chance? Is there some custom configuration that you're using to bring up these instances? We're trying to replicate this issue in our test setup, and any information you can share would help us debug it. It'd also help us a lot if you could use the ECS Logs Collector to collect logs from such instances (at debug level if possible). Thanks,
Thanks for the reply @aaithal, these instances are running in a private subnet in a VPC, using the AWS NAT service for outbound traffic. Nothing fancy around them, no proxies or anything. There is no custom configuration, just typical ECS config. Let me run the log collector for ya and I'll get them over soon.
@djenriquez thanks for that information. It'd be super awesome if you could provide logs at
@aaithal logs sent. I ran the script with
@djenriquez apologies for the confusion. I meant debug-level logs from the Agent, which should be collected if the Agent was started with
Sure, that host is now collecting logs with
Hi @djenriquez, the main thing that stands out in the log files that you shared is that there are no messages in the log file that correspond to the disconnection. The next ACS connection attempt happens 4 hours later, which is unreasonably long.

An explanation for why the disconnection did not go through is that the Disconnect() method is unable to acquire the lock because it is held by other methods in the client. My primary suspicion was the Connect() method as the entity holding the lock. However, when I added rules on the host to explicitly drop packets destined for ACS, the lock was released after 30s, as the connection failed with an i/o timeout error. The only other place where this lock is held (and that involves network I/O) is the WriteMessage() method. It's plausible that if the connection goes bad, this could hang.

Looking at the websocket library, calling SetWriteDeadline() should address this (Improvement [1]), but we do not do that in the Agent today. We also hold the lock for the entire length of the Connect() method, and we could reduce its scope to just the access of the cs.conn object (Improvement [2]). Another improvement would be to set the HandshakeTimeout in the websocket client to some non-zero value, as it's currently not set at all (Improvement [3]). All of these proposed code changes should help the Agent handle websocket disconnection/reconnection better (a sketch of improvements [1] and [3] follows this comment).

However, I've been unable to reproduce this error in my test cluster, so it's hard to validate whether these changes will actually solve the "lock being held for unreasonably long periods of time" issue. So far I've tried these:
Next, I'm going to try:
Can you share, or think of, anything else that might help us reproduce this issue and validate that the code changes we'll be making will resolve it? Thanks,
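A minimal sketch of what improvements [1] and [3] could look like with the gorilla/websocket package that the discussion above refers to. The timeouts, type name, wrapper methods, and endpoint below are hypothetical placeholders for illustration, not the Agent's actual code.

```go
package main

import (
	"net/http"
	"time"

	"github.com/gorilla/websocket"
)

const (
	// Hypothetical timeout values, chosen only for illustration.
	handshakeTimeout = 10 * time.Second
	writeTimeout     = 30 * time.Second
)

type client struct {
	conn *websocket.Conn
}

func dialACS(url string, header http.Header) (*client, error) {
	// Improvement [3]: a non-zero HandshakeTimeout so the websocket
	// handshake itself cannot hang indefinitely.
	dialer := websocket.Dialer{HandshakeTimeout: handshakeTimeout}
	conn, _, err := dialer.Dial(url, header)
	if err != nil {
		return nil, err
	}
	return &client{conn: conn}, nil
}

func (c *client) writeMessage(payload []byte) error {
	// Improvement [1]: set a write deadline before every write so that a
	// bad connection results in an i/o timeout instead of a hung write
	// (and a lock held indefinitely by the caller).
	if err := c.conn.SetWriteDeadline(time.Now().Add(writeTimeout)); err != nil {
		return err
	}
	return c.conn.WriteMessage(websocket.TextMessage, payload)
}

func main() {
	// Usage example; the endpoint is a placeholder, not a real ACS URL.
	c, err := dialACS("wss://example.invalid/acs", nil)
	if err != nil {
		return
	}
	defer c.conn.Close()
	_ = c.writeMessage([]byte(`{"type":"heartbeat"}`))
}
```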
We configure all of our instances with the same immutable build for our development and production environments (except for some differences in metrics collection). I can tell you that this does not happen in our sandbox environments, so I wonder if the additional load contributes to the problem. There are some services we have that handle a lot of connections, though I don't think they break any world records or anything. We do run these
These, however, are also configured in our development env, which, again, does not run into any problems.
Just wanted to chime in that I have a cluster behaving like this right now.
@antipax As per my earlier response, we'd really appreciate it if you could share logs from the instance to help us debug this further. @djenriquez I have an update on my test setup and issue reproduction attempts here. I have a cluster of 100 instances constantly churning containers at a rate of roughly 4 tasks (8 containers) every 2 minutes in a private subnet. The instances are configured to drop 25% of outgoing packets to simulate networking issues. I have also lowered our inactivity time threshold to simulate Agent-initiated disconnects. This has been running for 2 days now and I still haven't been able to reproduce the issue. Occasionally there's a delay in establishing the websocket connection, but the instances do recover in a matter of minutes and reestablish the connection. That said, I'm going to go ahead and make the changes as per #985 (comment) anyway, because I think all of them would improve connection management in the ECS Agent, which should also help with such issues in the future. Thanks,
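For context, a minimal sketch of how container churn like the above could be driven against a test cluster via the ECS RunTask API. The cluster name "test-cluster" and task definition "churn-task" are hypothetical placeholders; the packet dropping and the lowered inactivity threshold described above would be configured separately at the OS/agent level and are not shown here.

```go
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSession())
	client := ecs.New(sess)

	// Hypothetical names standing in for a real cluster and a registered
	// task definition.
	cluster := "test-cluster"
	taskDef := "churn-task"

	// Launch roughly 4 tasks every 2 minutes, mirroring the churn rate
	// described in the comment above.
	ticker := time.NewTicker(2 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		out, err := client.RunTask(&ecs.RunTaskInput{
			Cluster:        aws.String(cluster),
			TaskDefinition: aws.String(taskDef),
			Count:          aws.Int64(4),
		})
		if err != nil {
			log.Printf("RunTask failed: %v", err)
			continue
		}
		log.Printf("started %d tasks", len(out.Tasks))
	}
}
```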
Thanks @aaithal, I will try to grab some logs. If it helps, it seems to occur more frequently when a task is repeatedly getting restarted by the agent because it is failing.
@aaithal I sent you more logs from a box that ran into this issue, with debug mode enabled for both the ecs-agent and the docker daemon at the time of the event. Let me know if anything in there needs clarification.
@djenriquez how (if at all) are you working around this currently? I'm seeing it a lot lately.
@antipax An Ansible ad-hoc command to look for the disconnected agents, then run
@djenriquez Ah, I have a Lambda set up now that checks every 15 minutes for instances with status disconnected and terminates them (as they are replaced by an autoscaling group). Will share it if it works well; I just added it to our testing environments.
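A minimal sketch of what such a cleanup Lambda might look like in Go, assuming the aws-lambda-go and aws-sdk-go packages. The CLUSTER_NAME environment variable is a hypothetical placeholder, and this is not the commenter's actual function; the 15-minute schedule would be configured on the Lambda trigger (for example, a CloudWatch Events rule), not in the code.

```go
package main

import (
	"log"
	"os"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// handler terminates EC2 instances whose ECS agent is disconnected so that
// the autoscaling group can replace them. CLUSTER_NAME is a hypothetical
// environment variable used here for illustration.
func handler() error {
	sess := session.Must(session.NewSession())
	ecsClient := ecs.New(sess)
	ec2Client := ec2.New(sess)
	cluster := os.Getenv("CLUSTER_NAME")

	list, err := ecsClient.ListContainerInstances(&ecs.ListContainerInstancesInput{
		Cluster: aws.String(cluster),
	})
	if err != nil || len(list.ContainerInstanceArns) == 0 {
		return err
	}

	desc, err := ecsClient.DescribeContainerInstances(&ecs.DescribeContainerInstancesInput{
		Cluster:            aws.String(cluster),
		ContainerInstances: list.ContainerInstanceArns,
	})
	if err != nil {
		return err
	}

	// Collect the EC2 instance IDs of container instances whose agent is
	// reported as disconnected.
	var toTerminate []*string
	for _, ci := range desc.ContainerInstances {
		if ci.AgentConnected != nil && !*ci.AgentConnected {
			toTerminate = append(toTerminate, ci.Ec2InstanceId)
		}
	}
	if len(toTerminate) == 0 {
		return nil
	}

	log.Printf("terminating %d instances with disconnected agents", len(toTerminate))
	_, err = ec2Client.TerminateInstances(&ec2.TerminateInstancesInput{
		InstanceIds: toTerminate,
	})
	return err
}

func main() {
	lambda.Start(handler)
}
```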
That should work if you have the proper immutability setup. The problem, though, is that since these disconnects are so common, there is a good chance multiple instances are terminated at the same time; if a service doesn't have enough containers, that could kill every instance a specific service is running on == outage.
@djenriquez in my particular setup, that is OK, since the service in question is running on all container instances :)
Fault-tolerance FTW. 😉
@djenriquez Thanks for sending that second set of logs. However, the Agent was still not logging at the debug level. Thanks,
Glad to know there's progress! Thanks for your efforts @aaithal and team!
This commit aims to make the websocket connection management better by implementing the following improvements:
1. Set read and write deadlines for websocket ReadMessage and WriteMessage operations. This ensures that these methods do not hang, and instead result in an i/o timeout, if there are issues with the connection.
2. Reduce the scope of the lock in the Connect() method. The lock was being held for the length of the Connect() method, which meant that it wouldn't be relinquished if there was any delay in establishing the connection. The scope of the lock has now been reduced to just accessing the cs.conn variable.
3. Start the ACS heartbeat timer after the connection has been established. The timer was being started before the call to Connect, which meant that the connection could be prematurely terminated for being idle if there was a delay in establishing the connection.
These changes should improve the disconnection behavior of the websocket connection, which should help with scenarios where the Agent never reconnects to ACS because it's forever waiting in the Disconnect() method to acquire the lock (#985)
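A rough sketch of items 2 and 3 from the commit message above, again assuming gorilla/websocket. The clientServer type, field names, and the 5-minute idle threshold are hypothetical and only illustrate the pattern; the actual change is in the linked pull request.

```go
package main

import (
	"net/http"
	"sync"
	"time"

	"github.com/gorilla/websocket"
)

const heartbeatTimeout = 5 * time.Minute // hypothetical idle threshold

type clientServer struct {
	lock sync.Mutex
	conn *websocket.Conn
}

// connect dials the backend without holding the lock for the whole call;
// the lock now only guards the assignment to cs.conn (item 2).
func (cs *clientServer) connect(url string, header http.Header) (*time.Timer, error) {
	dialer := websocket.Dialer{HandshakeTimeout: 10 * time.Second}
	conn, _, err := dialer.Dial(url, header)
	if err != nil {
		return nil, err
	}

	cs.lock.Lock()
	cs.conn = conn
	cs.lock.Unlock()

	// Item 3: the heartbeat/inactivity timer is started only after the
	// connection is established, so a slow handshake can't cause the
	// connection to be treated as idle and torn down prematurely.
	timer := time.AfterFunc(heartbeatTimeout, func() {
		cs.disconnect()
	})
	return timer, nil
}

// disconnect closes the connection; because connect no longer holds the
// lock across network I/O, this can acquire it promptly.
func (cs *clientServer) disconnect() {
	cs.lock.Lock()
	defer cs.lock.Unlock()
	if cs.conn != nil {
		cs.conn.Close()
	}
}

func main() {
	cs := &clientServer{}
	// Placeholder endpoint for illustration only.
	if timer, err := cs.connect("wss://example.invalid/acs", nil); err == nil {
		defer timer.Stop()
		defer cs.disconnect()
	}
}
```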
@djenriquez We've merged #993, which should make things better on the websocket connection management front. This will be included in the next Agent release. Thanks,
@aaithal is there an ETA for the next release, or a schedule of some sort? Very much hoping this fix alleviates the issues we've been seeing. Thanks for the quick responses!
@aaithal is it normal for the ECS agent to close a connection on instantiation? On new boxes, literally the first thing the agent does is start up, then report it hasn't had any connections in a while and close:
That's odd; unless "a while" means 1 minute, the timeout is being triggered on start? This makes using new ECS instances a pain, because we have to wait for the agent to disconnect and reconnect before it actually starts up new tasks.
The timestamps show ~1.5 minutes. On some agents, it continues to report the same thing. Here's a good example:
This guy took ~5.5 minutes from the time the ECS agent started to when it began to pull the Docker image. Is this something the fix for this PR could address? If not, should I open a new issue?
Hmm, this doesn't look to have made the
Actually, it looks like #993 was included; thanks, going to try it out now.
@djenriquez Confirming that this did make the v1.14.5 release (see commits here). Also, I missed your previous post, sorry about that. The behavior that you mentioned in #985 (comment) should also be improved by #993. Please let us know if you continue running into this issue. Thanks,
FYI ... I have seen this issue on 1.16.0 as of 12/20/2017. Don't have a root cause, but we saw it today. The workaround was to terminate and then have ECS replace the nodes -- everything worked after that.
Having similar issues atm.
@russomi @fwirtz can you please send us logs from the ECS agent (hopefully at debug level)? Thanks,
@aaithal thanks for the fast response. I killed the respective instance and recreated it. Also, from taking another look at the agent's logs, I deduced that there were permissions missing.
Having this issue happen very regularly; it has taken down all 3 of our production instances over the course of a few hours. Even though we have multiple containers running, it doesn't make a difference. I'm on the latest AMI and Agent version 1.17.2, but I can't trust the stack to stay up for more than a day. I'm currently looking into other options, as it appears ECS is not stable enough to be used for production when there is no way to set up a failover in case of an ECS agent disconnect. Is there anything that can be done to tie into auto scaling when an agent disconnects for X number of minutes? I know I need to send logs. What do you need me to turn on on the instances to get the right log level?
Hi @marcato15, we're sorry to hear that you're running into this issue. Can you please send us logs at the debug level? Thanks,
@RobertCPhillips No, I did not; it works for me as of now.
Any update on this, guys?
I've just opened a support ticket with Amazon for this problem, which still occurs in 2020... I've been using 1.15.0 since the beginning and never had any problems until now (2+ years up and running). For the fourth day in a row now, the agent disconnects and my docker containers disappear. It even happened twice today. I need to manually restart and re-run the ecs-agent for my application to work again. My next step will be to create a new instance and delete the "faulty" one. Any help or insights appreciated, even if my workaround solves the problem, because this kind of crash makes me lose some trust in AWS services.
This also happened out of the blue for me (21st Jan), and I'd never seen this before now... There should be an ECS health-check failure for "agent disconnected" (since it never recovers). The setup above had been running since December without any alterations. We had to cycle all the underlying instances to get things up and running.
Summary
We're seeing more and more ecs-agents being disconnected recently, running on both 1.14.4 and 1.14.3, that do not recover on their own. We've been needing to connect to the boxes and run
stop ecs && start ecs
after which some will stay connected, while others remain disconnected.
Description
Expected Behavior
Agents stay connected, or reconnect on their own if issues arise.
Observed Behavior
Agents are disconnected and staying disconnected.
Environment Details
Supporting Log Snippets
Some instances say something along these lines:
While some say:
Any idea what's going on?