Bug: Traffic sent to stale endpoints using Cloud Map service discovery and active health checks #213
Comments
Facing a similar issue, where App Mesh routes traffic to an endpoint even while its container is being stopped gracefully. We are using multiple replicas mapped to a single virtual node, but that still doesn't prevent connection issues from the consumer service. We need ALB-like functionality, where proper de-registration takes place once ECS identifies that a particular task is being STOPPED.
Hey @Paritosh-Anand, are you using Cloud Map or DNS for service discovery? Also, what deployment strategy are you using? I'd like to get as much info as I can on this issue to make sure we solve all the edge cases. :) Some of the issues we see folks facing can be mitigated or eliminated with our documented best practices. We're also working on a number of other things (circuit breaking, outlier detection to name two) that will improve the default experience.
Hi @bcelenza, yes, we are using Cloud Map for service discovery. The deployment type is rolling updates with minimum healthy percent = 60. We do blue/green deployments from our own automation, which does NOT use the blue/green approach powered by code commit. If possible, can you share the approach for circuit breaking? It seems like an interesting solution. However, as I mentioned, the prime concern is to gracefully serve in-flight requests. Will these things help solve the problem at hand? Let me know if any more details are required, or if there is any other way I can help contribute towards solving this.
We are actively designing and working on this. Currently slated for Q4 2020 release.
The fix for this bug has been released and deployed in all regions. Feel free to reopen if you're still observing this issue.
Summary
When using Cloud Map for service discovery and active health checks on a Virtual Node, any downstream Envoy routing traffic to that Virtual Node may continue to route traffic to endpoints (instances) which have been removed from Cloud Map and terminated. This will result in the downstream Envoys observing connection timeouts and request failures.
The root cause is intended behavior of Envoy when using active health checks together with Envoy's Endpoint Discovery Service (EDS). When active health checks are configured for clusters which use EDS and an endpoint is removed, the endpoint remains routable in downstream Envoys until the unhealthy threshold is met for the active health checks.
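To make the size of that window concrete, here is a minimal sketch that estimates how long a removed endpoint could stay routable. It assumes the worst case, in which every check against the dead endpoint runs to its full timeout and the next check starts one interval later; the `HealthCheck` class and `worst_case_stale_seconds` function are hypothetical names used only for illustration. The mapping of 10s, 10s, and 5 to interval, timeout, and unhealthy threshold is also an assumption, inferred from the `(10s + 10s) * 5` expression in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    """Active health check settings for a Virtual Node listener (illustrative)."""
    interval_millis: int
    timeout_millis: int
    unhealthy_threshold: int


def worst_case_stale_seconds(hc: HealthCheck) -> float:
    """Upper-bound estimate of how long a removed endpoint stays routable.

    Assumption: each failed check costs one full timeout plus one interval,
    and the endpoint is dropped once `unhealthy_threshold` checks have failed.
    """
    per_check_millis = hc.interval_millis + hc.timeout_millis
    return per_check_millis * hc.unhealthy_threshold / 1000.0


# Assumed mapping of the issue's (10s + 10s) * 5 expression.
slow = HealthCheck(interval_millis=10_000, timeout_millis=10_000, unhealthy_threshold=5)
# Tighter values chosen only for comparison, not taken from the issue.
fast = HealthCheck(interval_millis=5_000, timeout_millis=2_000, unhealthy_threshold=2)

print(worst_case_stale_seconds(slow))  # 100.0
print(worst_case_stale_seconds(fast))  # 14.0
```

Under these assumptions, the stale window scales linearly with each of the three settings, so lowering any of them shortens the period during which downstream Envoys can still pick the terminated endpoint.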
Steps to Reproduce
(10s + 10s) * 5
Are you currently working around this issue?
There are three mitigations which will reduce or remove the request failures.
connection-error is strongly recommended to mitigate against this issue.
UnhealthyThreshold, IntervalMillis, and TimeoutMillis.
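As an illustration of the `connection-error` mitigation, the sketch below builds a route spec with a retry policy that retries on TCP connection errors. The field names (`retryPolicy`, `tcpRetryEvents`, `maxRetries`, `perRetryTimeout`) follow the App Mesh route spec as I understand it and should be checked against the current API reference. The helper names, the virtual node name, and the retry values are hypothetical and are not taken from this issue.

```python
import json


def build_retry_policy(max_retries: int = 2, per_retry_timeout_ms: int = 2000) -> dict:
    """Retry policy that retries when the upstream connection fails.

    The numeric defaults are illustrative assumptions, not recommendations.
    """
    return {
        "maxRetries": max_retries,
        "perRetryTimeout": {"unit": "ms", "value": per_retry_timeout_ms},
        # Retry when Envoy cannot connect, e.g. to a terminated endpoint.
        "tcpRetryEvents": ["connection-error"],
    }


def build_http_route_spec(virtual_node: str, prefix: str = "/") -> dict:
    """HTTP route spec sending all matching traffic to one virtual node."""
    return {
        "httpRoute": {
            "match": {"prefix": prefix},
            "action": {
                "weightedTargets": [{"virtualNode": virtual_node, "weight": 100}]
            },
            "retryPolicy": build_retry_policy(),
        }
    }


# "my-service-vn" is a placeholder virtual node name.
spec = build_http_route_spec("my-service-vn")
print(json.dumps(spec, indent=2))
```

A dictionary of this shape could be passed as the route spec when creating or updating a route through an AWS SDK. With retries in place, a request that lands on a stale endpoint can be retried against another endpoint instead of failing outright.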
Additional context
The behavior of Envoy has been clarified in envoyproxy/envoy#11527.