
ECS Agent Disconnected becoming more common #985

Closed
djenriquez opened this issue Sep 15, 2017 · 41 comments

Comments

@djenriquez

Summary

We're seeing more and more ECS agents become disconnected recently, running both 1.14.4 and 1.14.3, and they do not recover on their own. We've had to connect to the boxes and run stop ecs && start ecs; after that, some stay connected while others remain disconnected.

Description

(Screenshot: screen shot 2017-09-15 at 2 24 12 pm)

Expected Behavior

Agents stay connected, or reconnect on their own if issues arise.

Observed Behavior

Agents are disconnected and staying disconnected.

Environment Details

$ docker info
Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 15
Server Version: 17.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version:  (expected: 4ab9917febca54791c5f071a9d1f404867857fcc)
runc version: N/A (expected: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe)
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.38-16.35.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.03
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.677 GiB
Name: ip-10-7-18-149
ID: MBVV:K4P6:F7GF:LTJ3:LNAF:GZ63:RBTB:R2KE:EMMY:QARA:HF2M:7KOR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
$ curl http://localhost:51678/v1/metadata
{"Cluster":"<REDACTED>","ContainerInstanceArn":"arn:aws:ecs:<REDACTED>:container-instance/d45c9085-a359-474c-ae5e-b04b2efb966a","Version":"Amazon ECS Agent - v1.14.3 (15de319)"}

Supporting Log Snippets

Some instances say something along these lines:

2017-09-15T21:08:58Z [INFO] Loading configuration
2017-09-15T21:08:58Z [INFO] Loading state! module="statemanager"
2017-09-15T21:08:58Z [INFO] Event stream ContainerChange start listening...
2017-09-15T21:08:58Z [INFO] Registering Instance with ECS
2017-09-15T21:09:09Z [INFO] Registered! module="api client"
2017-09-15T21:09:09Z [INFO] Registration completed successfully. I am running as <REDACTED>
2017-09-15T21:09:09Z [INFO] Saving state! module="statemanager"
2017-09-15T21:09:09Z [INFO] Beginning Polling for updates
2017-09-15T21:09:09Z [INFO] Event stream DeregisterContainerInstance start listening...
2017-09-15T21:09:09Z [INFO] Initializing stats engine
2017-09-15T21:09:09Z [INFO] NO_PROXY set:169.254.169.254,169.254.170.2,/var/run/docker.sock
2017-09-15T21:09:19Z [INFO] Saving state! module="statemanager"
2017-09-15T21:10:36Z [WARN] ACS Connection hasn't had any activity for too long; closing connection

While some say:

2017-09-15T21:03:39Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-09-15T21:05:06Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-09-15T21:07:30Z [INFO] Redundant container state change for task <REDACTED>:42 arn:aws:ecs:<REDACTED>:task/d70fa1a0-5b03-47cf-b590-694f053b02cb, Status: (RUNNING->RUNNING) Containers: [<REDACTED> (RUNNING->RUNNING),]: <REDACTED>(<REDACTED>:ec5903f) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2017-09-15T21:10:20Z [INFO] Begin building map of eligible unused images for deletion
2017-09-15T21:10:20Z [INFO] No eligible images for deletion for this cleanup cycle
2017-09-15T21:10:20Z [INFO] End of eligible images for deletion: No more eligible images for deletion; Still have 1 image states being managed
2017-09-15T21:15:07Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-09-15T21:17:30Z [INFO] Redundant container state change for task <REDACTED>:42 arn:aws:ecs:<REDACTED>:task/d70fa1a0-5b03-47cf-b590-694f053b02cb, Status: (RUNNING->RUNNING) Containers: [<REDACTED> (RUNNING->RUNNING),]: <REDACTED>(<REDACTED>:ec5903f) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2017-09-15T21:25:07Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-09-15T21:26:17Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-09-15T21:27:30Z [INFO] Redundant container state change for task <REDACTED>:42 arn:aws:ecs:<REDACTED>:task/d70fa1a0-5b03-47cf-b590-694f053b02cb, Status: (RUNNING->RUNNING) Containers: [<REDACTED> (RUNNING->RUNNING),]: <REDACTED>(<REDACTED>:ec5903f) (RUNNING->RUNNING) to RUNNING, but already RUNNING

Any idea what's going on?

@r1w1s1

r1w1s1 commented Sep 16, 2017

I also had this problem.

@aayushchugh07

I also had this problem. VMs on our clusters keep disconnecting and do not automatically reconnect.

@aaithal
Contributor

aaithal commented Sep 18, 2017

@djenriquez @ricardson @aayushchugh07 we're sorry to learn that you're running into such an issue.

@djenriquez as far as container instance d45c9085-a359-474c-ae5e-b04b2efb966a is concerned, our server logs indicate that we closed the connection to ECS Agent at the following times as a part of the regular disconnect-reconnect workflow during that hour (between 2017-09-15T21:00:00Z and 2017-09-16T00:00:00Z):

  • 21:03:39,590: Closed connection

  • 21:03:40,199: New connection

  • 21:26:17,765: Closed connection

  • 21:26:18,127: New connection

  • 21:54:08,299: Closed connection

  • 21:54:08,875: New connection

I cannot figure out what's causing the following error messages to be printed, especially as I could not find corresponding disconnection logs on our servers. That leads me to suspect that the ECS Agent, the websocket library, or something else on the instance is causing the disconnection.

2017-09-15T21:05:06Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue
2017-09-15T21:15:07Z [INFO] Connection closed for a valid reason: websocket: close 1000 (normal): ConnectionExpired: Reconnect to continue

Can you please let us know how these instances are set up? Are these instances running behind a proxy by any chance? Is there some custom configuration that you're using to bring up these instances? We're trying to replicate this issue in our test setup and any information that you can share would be helpful for us in debugging this issue.

It'd also help us a lot if you could use the ECS Logs Collector to collect logs from such instances (at debug level if possible, with ECS_LOGLEVEL=debug) and send them to aithal at amazon dot com. /cc @ricardson @aayushchugh07

Thanks,
Anirudh

@djenriquez
Author

Thanks for the reply @aaithal. These instances are running in a VPC, in a private subnet that uses the AWS NAT service for outbound traffic. Nothing fancy around them, no proxies or anything. There is no custom configuration, just a typical ECS config.

Let me run the log collector for you and I'll get the logs over soon.

@aaithal
Contributor

aaithal commented Sep 19, 2017

@djenriquez thanks for that information. It'd be super awesome if you could provide logs at debug level.

@djenriquez
Author

djenriquez commented Sep 19, 2017

@aaithal logs sent. I ran the script with --debug=mode, but surely that doesn't collect debug-level logs, right? It only enables debug mode for future runs of the collector?

@aaithal
Contributor

aaithal commented Sep 19, 2017

@djenriquez apologies for the confusion. I meant debug-level logs from the Agent, which are collected if the Agent was started with the ECS_LOGLEVEL=debug setting in the ecs.config file. The reason for asking is that the connection-initiation messages are unfortunately printed at the debug level. Also, ack that we've received the logs. Thank you for sending those!

@djenriquez
Author

Sure, that host is now collecting logs with ECS_LOGLEVEL=debug, just need to wait for it to hit the issue again.

@aaithal
Contributor

aaithal commented Sep 20, 2017

Hi @djenriquez, the main thing that stands out in the log files you shared is that there are no messages corresponding to "2017-09-18T19:48:42Z [WARN] ACS Connection hasn't had any activity for too long; closing connection" where the disconnection causes a read error due to use of a closed network connection. Typically, for Agent-initiated disconnects, we expect to see this. I grepped the logs for the IP addresses of the Agent's Communication Service at the time the Agent initiated this disconnection and couldn't find any (an example of this can be found earlier in the logs: 2017-09-18T19:33:55Z [ERROR] Error getting message from ws backend: error: [read tcp 10.7.11.243:36512->54.239.18.202:443: use of closed network connection], message: [], messageType: [-1]).

The next ACS connection attempt happens 4 hours later, which is unreasonably long.

An explanation for why the disconnection did not go through is that the Disconnect() method is unable to acquire the lock as it is held by other methods in the client.

My primary suspect for holding the lock was the Connect() method. However, when I added rules on the host to explicitly drop packets destined for ACS, the lock was released after 30s as the connection failed with an i/o timeout error.

The only other place where this lock is held (and that involves network I/O) is the WriteMessage() method. It's plausible that if the connection goes bad, this could hang. Looking at the websocket library, calling SetWriteDeadline() should mitigate this (Improvement [1]), but we do not do that in the Agent today. We also hold the lock for the entire length of the Connect() method, and we could reduce its scope to just the access of the cs.conn object (Improvement [2]). Another improvement would be to set the HandshakeTimeout in the websocket client to some non-zero value, as it's currently not set at all (Improvement [3]).

All of these proposed code changes should improve the handling of the websocket disconnection/reconnection logic. However, I've been unable to reproduce this error in my test cluster, so it's hard to validate whether these changes will actually solve the "lock being held for unreasonably long periods of time" issue.
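
For illustration, here is a minimal sketch of what Improvements [1]-[3] could look like, assuming the gorilla/websocket client; this is not the Agent's actual code, and the type and method shapes below are simplified stand-ins for the Agent's wsclient code:

package wsclient

import (
	"net/http"
	"sync"
	"time"

	"github.com/gorilla/websocket"
)

const (
	// Improvement [3]: bound the websocket/TLS handshake instead of leaving it unset.
	handshakeTimeout = 30 * time.Second
	// Improvement [1]: bound each write so a bad connection yields an i/o timeout.
	writeTimeout = 30 * time.Second
)

type clientServer struct {
	mu   sync.Mutex
	conn *websocket.Conn
}

func (cs *clientServer) Connect(url string, header http.Header) error {
	dialer := websocket.Dialer{HandshakeTimeout: handshakeTimeout}
	conn, _, err := dialer.Dial(url, header)
	if err != nil {
		return err
	}
	// Improvement [2]: hold the lock only while swapping cs.conn,
	// not for the entire duration of Connect().
	cs.mu.Lock()
	cs.conn = conn
	cs.mu.Unlock()
	return nil
}

func (cs *clientServer) WriteMessage(messageType int, data []byte) error {
	cs.mu.Lock()
	defer cs.mu.Unlock()
	// Improvement [1]: without this deadline, WriteMessage can block forever
	// on a bad connection while holding the lock, starving Disconnect().
	if err := cs.conn.SetWriteDeadline(time.Now().Add(writeTimeout)); err != nil {
		return err
	}
	return cs.conn.WriteMessage(messageType, data)
}

With a write deadline in place, a hung write surfaces as an error and releases the lock, so the Disconnect() path can acquire it instead of waiting indefinitely.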

So far I've tried these:

  1. Running a cluster with many instances and seeing if there's a disconnect (public subnet w/ internet gateway)
  2. Explicitly dropping all packets to our websocket backend and seeing that inactivity timer and connection timeout kick in and reestablish the connection

Next, I'm going to try:

  1. Running a cluster with many instances and seeing if there's a disconnect (private subnet w/ nat gateway)
  2. Explicitly dropping a percentage of packets to our websocket backend

Can you share or think of anything else that might help us reproduce this issue and validate that the code changes we'll be making will resolve it?

Thanks,
Anirudh

@djenriquez
Author

We configure all of our instances with the same immutable build for our development and production environments (except for some differences in metrics collection). I can tell you that this does not happen in our sandbox environments, so I wonder if the additional load contributes to the problem.

There are some services we have that handle a lot of connections, though I don't think they break any world records or anything.

We do run these sysctl commands to help optimize connection handling; not sure if any of these could be related?

# Maximize network settings
sysctl -w net.core.somaxconn=16384
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.ipv4.tcp_max_tw_buckets=1440000
sysctl -w net.ipv4.ip_local_port_range='10240 65000'
sysctl -w fs.file-max=999999
sysctl -w net.ipv4.tcp_rmem='4096 4096 16777216'
sysctl -w net.ipv4.tcp_wmem='4096 4096 16777216'
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait=1 && \
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=600 || echo "Netfilter is not yet ready"

These, however, are also configured in our development env, which, again, does not run into any problems.

@ericentin

Just wanted to chime in that I have a cluster behaving like this right now.

@aaithal
Contributor

aaithal commented Sep 22, 2017

@antipax As per my earlier response, we'd really appreciate it if you could share logs from the instance to help us debug this further.

@djenriquez I have an update on my test setup and issue-reproduction attempts. I have a cluster of 100 instances constantly churning containers at a rate of roughly 4 tasks (8 containers) every 2 minutes in a private subnet. The instances are configured to drop 25% of outgoing packets to simulate networking issues. I have also lowered our inactivity time threshold to simulate Agent-initiated disconnects. This has been running for 2 days now and I still haven't been able to reproduce the issue. Occasionally there's a delay in establishing the websocket connection, but the agents do recover in a matter of minutes and reestablish the connection.

That said, I'm going to go ahead and make the changes as per #985 (comment) anyway because I think all of those would help in better connection management on the ECS Agent, which should also help in future with such issues.

Thanks,
Anirudh

@ericentin

Thanks @aaithal I will try to grab some logs. If it helps, it seems to occur more frequently when a task is repeatedly getting restarted by the agent because it is failing.

@djenriquez
Author

djenriquez commented Sep 22, 2017

@aaithal I sent you more logs from a box that ran into this issue with debug mode enabled for both the ecs-agent and the docker daemon at the time of the event.

Let me know if anything in there needs clarification.

@ericentin

@djenriquez how are you working around this currently, if at all? I'm seeing it a lot lately.

@djenriquez
Author

@antipax An Ansible ad-hoc command to look for the disconnected agents and then run stop ecs && start ecs. Not a fun approach, but it fixes the problem.

@ericentin

@djenriquez Ah, I have a lambda set up now that checks every 15 minutes for instances with status DISCONNECTED and terminates them (they are then replaced by an Auto Scaling group). I just added it to our testing environments and will share it if it works well.
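
For reference, here is a rough sketch of that kind of cleanup job (not the actual lambda mentioned above), assuming aws-sdk-go, the ECS cluster query language filter agentConnected == false, and that terminated instances are replaced by an Auto Scaling group; the cluster name is hypothetical:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSession())
	ecsSvc := ecs.New(sess)
	ec2Svc := ec2.New(sess)

	cluster := "my-cluster" // hypothetical cluster name

	// Find container instances whose agent is currently disconnected.
	list, err := ecsSvc.ListContainerInstances(&ecs.ListContainerInstancesInput{
		Cluster: aws.String(cluster),
		Filter:  aws.String("agentConnected == false"),
	})
	if err != nil {
		log.Fatal(err)
	}
	if len(list.ContainerInstanceArns) == 0 {
		return // nothing to do
	}

	// Resolve the underlying EC2 instance IDs.
	desc, err := ecsSvc.DescribeContainerInstances(&ecs.DescribeContainerInstancesInput{
		Cluster:            aws.String(cluster),
		ContainerInstances: list.ContainerInstanceArns,
	})
	if err != nil {
		log.Fatal(err)
	}
	var ids []*string
	for _, ci := range desc.ContainerInstances {
		ids = append(ids, ci.Ec2InstanceId)
	}

	// Terminate them; the Auto Scaling group is expected to launch replacements.
	if _, err := ec2Svc.TerminateInstances(&ec2.TerminateInstancesInput{InstanceIds: ids}); err != nil {
		log.Fatal(err)
	}
}

A real job would likely cap how many instances it terminates per run, for the reason raised in the next comment.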

@djenriquez
Author

That should work if you have the proper immutability setup. The problem, though, is that with these disconnects being as common as they are, there's a good chance multiple instances get terminated at the same time; if a service doesn't have enough containers, that could kill every instance a specific service is running on == outage.

@ericentin

ericentin commented Sep 25, 2017

@djenriquez in my particular setup, that is OK since the service in question is running on all container instances :)

@djenriquez
Author

Fault-tolerance FTW. 😉

@aaithal
Contributor

aaithal commented Sep 25, 2017

@djenriquez Thanks for sending that second set of logs. However, the Agent was still logging at INFO when the issue actually happened. I did take a look at the server-side connection logs for when the issue happened (2017-09-22T19:42:29Z [WARN] ACS Connection hasn't had any activity for too long; closing connection). As far as the Agent's Communication Service was concerned, the Agent did not actually close the connection at this time, which confirms my earlier suspicion that the Disconnect method was waiting on the lock to actually perform the disconnection. We've identified some problematic configurations in code that might be leading to this and are making changes to rectify them. I'll post an update here and link this issue to the PR when I have it ready for review.

Thanks,
Anirudh

@djenriquez
Author

Glad to know there's progress! Thanks for your efforts @aaithal and team!

aaithal added a commit to aaithal/amazon-ecs-agent that referenced this issue Sep 27, 2017
This commit aims to make the websocket connection management
better by implementing the following improvements:

1. Set read and write deadlines for websocket ReadMessage and
WriteMessage operations. This is to ensure that these methods
do not hang and instead result in an i/o timeout if there are
issues with the connection
2. Reduce the scope of the lock in the Connect() method. The
lock was being held for the length of the Connect() method, which
meant that it wouldn't be relinquished if there was any delay
in establishing the connection. The scope of the lock has now
been reduced to just accessing the cs.conn variable
3. Start the ACS heartbeat timer after the connection has been
established. The timer was being started before the call to
Connect, which meant that the connection could be prematurely
terminated for being idle if there was a delay in establishing
the connection

These changes should improve the disconnection behavior of the
websocket connection, which should help with scenarios where the
Agent never reconnects to ACS because it's forever waiting in the
Disconnect() method to acquire the lock (aws#985)

aaithal added a commit that referenced this issue Sep 28, 2017

@aaithal
Contributor

aaithal commented Sep 28, 2017

@djenriquez We've merged #993 which should make things better on the websocket connection management front. This will be included in the next Agent release.

Thanks,
Anirudh

@ericentin

@aaithal is there an ETA for the next release, or a schedule of some sort? Very much hoping this fix alleviates the issues we've been seeing. Thanks for the quick responses!

@djenriquez
Author

djenriquez commented Sep 29, 2017

@aaithal is it normal for the ECS agent to close a connection on instantiation?

On new boxes, literally the first thing the agent does is start up, then report that it hasn't had any activity in a while and close the connection:

2017-09-29T21:58:26Z [INFO] Loading configuration
2017-09-29T21:58:26Z [INFO] Loading state! module="statemanager"
2017-09-29T21:58:26Z [INFO] Event stream ContainerChange start listening...
2017-09-29T21:58:26Z [INFO] Registering Instance with ECS
2017-09-29T21:58:26Z [INFO] Registered! module="api client"
2017-09-29T21:58:26Z [INFO] Registration completed successfully. I am running as 'arn:aws:ecs:us-east-1:568557440549:container-instance/50c81873-de65-4de0-91f4-f46810ad86c7' in cluster 'p-n0-core-frontdoor'
2017-09-29T21:58:26Z [INFO] Saving state! module="statemanager"
2017-09-29T21:58:26Z [INFO] Beginning Polling for updates
2017-09-29T21:58:26Z [INFO] Event stream DeregisterContainerInstance start listening...
2017-09-29T21:58:26Z [INFO] Initializing stats engine
2017-09-29T21:58:27Z [INFO] NO_PROXY set:169.254.169.254,169.254.170.2,/var/run/docker.sock
2017-09-29T21:58:36Z [INFO] Saving state! module="statemanager"
2017-09-29T21:59:27Z [WARN] ACS Connection hasn't had any activity for too long; closing connection
2017-09-29T21:59:27Z [ERROR] Error getting message from ws backend: error: [read tcp 10.7.11.79:20820->54.239.20.16:443: use of closed network connection], message: [], messageType: [-1] 

That's odd; unless "too long" means about a minute, the timeout is being triggered right at start? This makes using new ECS instances a pain, because we have to wait for the agent to disconnect and reconnect before it actually starts up new tasks.

2017-09-29T22:00:51Z [WARN] ACS Connection hasn't had any activity for too long; closing connection
2017-09-29T22:00:51Z [ERROR] Error getting message from ws backend: error: [read tcp 10.7.11.79:18364->54.239.19.128:443: use of closed network connection], message: [], messageType: [-1] 
2017-09-29T22:00:52Z [ERROR] Task not found in the engine for the arn in credentials message, arn: arn:aws:ecs:us-east-1:<REDACTED>:task/d480604b-ad3b-452b-a9a5-a308c4f63c42, messageId: d294a750-0b25-4e83-ad08-fe8589de0307
2017-09-29T22:00:54Z [INFO] Saving state! module="statemanager"

The timestamps show ~1.5 minutes. On some agents, it keeps reporting the ACS Connection hasn't had any activity for too long; closing connection --> Error messages over and over again, and tasks are forever stuck in PENDING.

Here's a good example:

2017-09-29T22:10:15Z [INFO] Loading configuration
2017-09-29T22:10:15Z [INFO] Loading state! module="statemanager"
2017-09-29T22:10:15Z [INFO] Event stream ContainerChange start listening...
2017-09-29T22:10:15Z [INFO] Registering Instance with ECS
2017-09-29T22:10:15Z [INFO] Registered! module="api client"
2017-09-29T22:10:15Z [INFO] Registration completed successfully. I am running as 'arn:aws:ecs:us-east-1:<REDACTED>:container-instance/20a8bbe7-66fa-4b64-9a52-16401c218894' in cluster '<REDACTED>'
2017-09-29T22:10:15Z [INFO] Saving state! module="statemanager"
2017-09-29T22:10:15Z [INFO] Beginning Polling for updates
2017-09-29T22:10:15Z [INFO] Event stream DeregisterContainerInstance start listening...
2017-09-29T22:10:15Z [INFO] Initializing stats engine
2017-09-29T22:10:15Z [INFO] NO_PROXY set:169.254.169.254,169.254.170.2,/var/run/docker.sock
2017-09-29T22:10:25Z [INFO] Saving state! module="statemanager"
2017-09-29T22:11:30Z [WARN] ACS Connection hasn't had any activity for too long; closing connection
2017-09-29T22:11:30Z [ERROR] Error getting message from ws backend: error: [read tcp 10.7.19.157:57858->54.239.18.202:443: use of closed network connection], message: [], messageType: [-1] 
2017-09-29T22:13:18Z [WARN] ACS Connection hasn't had any activity for too long; closing connection
2017-09-29T22:13:18Z [ERROR] Error getting message from ws backend: error: [read tcp 10.7.19.157:13990->54.239.20.16:443: use of closed network connection], message: [], messageType: [-1] 
2017-09-29T22:14:42Z [WARN] ACS Connection hasn't had any activity for too long; closing connection
2017-09-29T22:14:42Z [ERROR] Error getting message from ws backend: error: [read tcp 10.7.19.157:58222->54.239.18.202:443: use of closed network connection], message: [], messageType: [-1] 
2017-09-29T22:15:51Z [WARN] ACS Connection hasn't had any activity for too long; closing connection
2017-09-29T22:15:51Z [ERROR] Error getting message from ws backend: error: [read tcp 10.7.19.157:14244->54.239.20.16:443: use of closed network connection], message: [], messageType: [-1] 
2017-09-29T22:15:56Z [INFO] Saving state! module="statemanager"
2017-09-29T22:15:56Z [INFO] Waiting for any previous stops to complete module="TaskEngine" task="<REDACTED>:47 arn:aws:ecs:us-east-1:<REDACTED>:task/b6020b9d-c0e6-4e05-99e3-f0111ba515bb, Status: (NONE->RUNNING) Containers: [<REDACTED> (NONE->RUNNING),]" seqnum="2"
2017-09-29T22:15:56Z [INFO] Wait over; ready to move towards status: RUNNING module="TaskEngine" task="<REDACTED>:47 arn:aws:ecs:us-east-1:<REDACTED>:task/b6020b9d-c0e6-4e05-99e3-f0111ba515bb, Status: (NONE->RUNNING) Containers: [<REDACTED> (NONE->RUNNING),]"
2017-09-29T22:15:56Z [INFO] Pulling container <REDACTED>(r<REDACTED>/<REDACTED>:62a7d9c) (NONE->RUNNING) concurrently. Task: <REDACTED>:47 arn:aws:ecs:us-east-1:<REDACTED>:task/b6020b9d-c0e6-4e05-99e3-f0111ba515bb, Status: (NONE->RUNNING) Containers: [<REDACTED> (NONE->RUNNING),]

This one took ~5.5 minutes from the time the ECS agent started to when it began pulling the Docker image.

Is this something the fix in that PR could address? If not, should I open a new issue?

@djenriquez
Author

Hmm, this doesn't look to have made the 1.14.5 release; can you confirm, @aaithal?

@djenriquez
Author

Actually, it looks like #993 was included. Thanks, going to try it out now.

@aaithal
Contributor

aaithal commented Oct 3, 2017

@djenriquez Confirming that this did make the v1.14.5 release (See commits here).

Also, I missed your previous post, sorry about that. The behavior that you mentioned in #985 (comment) should also be improved with #993.

Please let us know if you continue running into this issue.

Thanks,
Anirudh

@djenriquez
Author

djenriquez commented Oct 4, 2017

@aaithal @antipax FYI, our clusters have been running 1.14.5 for > 12 hours, and for the first time in probably a month I woke up this morning with no agents disconnected.

Great work! So far so good!

@vsiddharth
Contributor

#993 was released with v1.14.5

Please feel free to reopen if this issue recurs.

@russomi

russomi commented Dec 20, 2017

FYI ... I have seen this issue on 1.16.0 as of 12/20/2017. I don't have a root cause, but we saw it today. The workaround was to terminate the nodes and have ECS replace them -- everything worked after that.

@flowirtz

Having similar issues atm.
@russomi do you have an update on what's going on?

@aaithal
Contributor

aaithal commented Jan 29, 2018

@russomi @fwirtz can you please send us logs from the ECS agent (hopefully at debug level) if you're running into this issue? You can send them to aithal at amazon dot com. Please find instructions on how to do this here. I'd also recommend opening a new issue so that we don't lose track of this.

Thanks,
Anirudh

@flowirtz

@aaithal thanks for the fast response. I killed the respective instance and recreated it. Also, from taking another look at the agent's logs, I deduced that there were permissions missing (ecs:SubmitTaskStateChange and ecs:DiscoverPollEndpoint).
I added an IAM policy and so far everything is working as expected; it hasn't been 24 hours yet, though.
Will keep you posted if the issue occurs again.
Thanks!

@marcato15

Having this issue happen very regularly; it took down all 3 of our production instances over the course of a few hours. Even though we have multiple containers running, it doesn't make a difference. I'm on the latest AMI and agent version 1.17.2, but I can't trust the stack to stay up for more than a day. I'm currently looking into other options, as it appears ECS is not stable enough to be used for production when there is no way to set up a failover in case of an ECS agent disconnect. Is there anything that can be done to tie into auto scaling when an agent disconnects for X number of minutes?

I know I need to send logs. What do I need to do on the instances to turn on the right log level?

@aaithal
Contributor

aaithal commented Apr 9, 2018

Hi @marcato15, we're sorry to hear that you're running into this issue. Can you please send us logs at the debug level (instructions here and here) to help debug this issue? You can send them to aithal at amazon dot com. Also, can you please create a new issue so that it doesn't get conflated with any existing ones?

Thanks,
Anirudh

@RobertCPhillips

@russomi @fwirtz Did you open a new issue as @aaithal suggested?

@flowirtz

@RobertCPhillips No, I did not; it works for me as of now.
Destroying and restarting the respective instance "fixed" it.

@imyashvinder

imyashvinder commented Oct 26, 2018

Any update on this, guys?
If I create an EC2 instance using the ECS-optimized AMI and there is no cluster with the name mentioned in ecs.config, then the ecs-agent Docker container tends to get destroyed after a while.

@cfalletta

cfalletta commented Nov 9, 2020

I've just opened a support ticket with Amazon for this problem, which still occurs in 2020...

I've been using 1.15.0 since the beginning and never had any problems until now (2+ years up and running).
I'm running the latest version now (1.47.0 I believe), but still have the same problem.

For the fourth day in a row now, the agent disconnects and my docker containers disappear. It even happened twice today.

I need to manually restart and re-run ecs-agent for my application to work again.

My next step will be to create a new instance and delete the "faulty" one.

Any help or insights are appreciated, even if my workaround solves the problem, because this kind of crash makes me lose some trust in AWS services.

@richardj-bsquare

This also happened out of the blue for me (21st Jan), and I'd never seen it before now... There should be an ECS health-check failure for "agent disconnected" (since it never recovers).

(screenshot)

The cluster above had been running since December without any alterations. We had to cycle all the underlying instances to get things up and running.
