
Bug: Traffic sent to stale endpoints using Cloud Map service discovery and active health checks #213

Closed
bcelenza opened this issue Jun 11, 2020 · 5 comments

@bcelenza (Contributor) commented Jun 11, 2020

Summary
When using Cloud Map for service discovery and active health checks on a Virtual Node, any downstream Envoy that routes traffic to that Virtual Node may continue to send requests to endpoints (instances) which have been removed from Cloud Map and terminated. As a result, the downstream Envoys observe connection timeouts and request failures.

The root cause is intended Envoy behavior when active health checks are combined with Envoy's Endpoint Discovery Service (EDS). When active health checks are configured for clusters that use EDS and an endpoint is removed, the endpoint remains routable in downstream Envoys until the unhealthy threshold for the active health checks is met.
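For reference, here is a minimal boto3 sketch of the kind of Virtual Node configuration this affects. The mesh, namespace, and service names are hypothetical, as are the specific health check values:

```python
import boto3

appmesh = boto3.client("appmesh")

# Hypothetical names; substitute your own mesh, Cloud Map namespace, and service.
appmesh.create_virtual_node(
    meshName="my-mesh",
    virtualNodeName="backend-vn",
    spec={
        # Cloud Map service discovery: App Mesh pushes these instances to
        # downstream Envoys via EDS.
        "serviceDiscovery": {
            "awsCloudMap": {
                "namespaceName": "local",
                "serviceName": "backend",
            }
        },
        # Active health check: with EDS, a removed endpoint stays routable in
        # downstream Envoys until these checks fail unhealthyThreshold times.
        "listeners": [
            {
                "portMapping": {"port": 8080, "protocol": "http"},
                "healthCheck": {
                    "protocol": "http",
                    "path": "/ping",
                    "port": 8080,
                    "healthyThreshold": 2,
                    "unhealthyThreshold": 5,
                    "intervalMillis": 10000,
                    "timeoutMillis": 10000,
                },
            }
        ],
    },
)
```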

Steps to Reproduce

  1. Create two virtual nodes: (1) one using Cloud Map for service discovery with active health checks configured, and (2) one using any configuration, but specifying the first virtual node as a backend via a Virtual Service with a Virtual Node provider (or optionally with a Virtual Router provider).
  2. Launch a single replica (i.e. task or pod) for each virtual node.
  3. Send a request from the downstream Envoy to the upstream Envoy and assert that connectivity succeeds.
  4. Terminate the replica for the upstream Virtual Node that uses Cloud Map service discovery and replace it with a new one.
  5. Continue sending requests at a regular cadence from the downstream Envoy and observe that requests may continue to fail for the duration of the configured active health check period. For example, with an unhealthy threshold of 5, an interval of 10000ms, and a timeout of 10000ms, this period will be approximately 100 seconds ((10s + 10s) * 5; see the sketch after this list).
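A quick sketch of the worst-case arithmetic from step 5, assuming each failed check consumes the full timeout before the next interval begins:

```python
# Health check settings from step 5.
unhealthy_threshold = 5
interval_millis = 10_000
timeout_millis = 10_000

# Each failed round costs roughly interval + timeout; the endpoint stays
# routable until unhealthy_threshold rounds have failed.
stale_window_seconds = (interval_millis + timeout_millis) * unhealthy_threshold / 1000
print(f"Stale endpoint may receive traffic for ~{stale_window_seconds:.0f}s")  # ~100s
```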

Are you currently working around this issue?
There are three mitigations that will reduce or eliminate the request failures.

  1. Instrument all routes with retry policies per the App Mesh best practices. A TCP retry event of connection-error is strongly recommended to mitigate this issue (see the sketch after this list).
  2. Use lower values for active health checking's UnhealthyThreshold, IntervalMillis, and TimeoutMillis.
  3. Use more than a single replica for a given Virtual Node, which will ensure that a routable endpoint is always available.
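As a sketch of mitigation 1, a route with a retry policy can be created via boto3. The mesh, router, route, and virtual node names below are hypothetical, and the retry values are illustrative rather than prescriptive; tune them against the best-practices guidance:

```python
import boto3

appmesh = boto3.client("appmesh")

# Hypothetical names; substitute your own mesh, virtual router, and target node.
appmesh.create_route(
    meshName="my-mesh",
    virtualRouterName="backend-vr",
    routeName="backend-route",
    spec={
        "httpRoute": {
            "match": {"prefix": "/"},
            "action": {
                "weightedTargets": [{"virtualNode": "backend-vn", "weight": 1}]
            },
            # Retries mask failures while a stale endpoint is still routable.
            "retryPolicy": {
                "maxRetries": 2,
                "perRetryTimeout": {"unit": "ms", "value": 2000},
                "httpRetryEvents": ["server-error", "gateway-error"],
                # The connection-error TCP retry event is the key mitigation here.
                "tcpRetryEvents": ["connection-error"],
            },
        }
    },
)
```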

Additional context
The behavior of Envoy has been clarified in envoyproxy/envoy#11527.

@Paritosh-Anand

Facing a similar issue, where App Mesh keeps routing traffic to an endpoint even while its container is being stopped gracefully. We are using multiple replicas mapped to a single virtual node; however, that still doesn't prevent connection issues from the consumer service.

We need ALB-like functionality where proper de-registration takes place when ECS identifies that a particular task is being STOPPED.

@bcelenza (Contributor, Author) commented Jul 3, 2020

Hey @Paritosh-Anand, are you using Cloud Map or DNS for service discovery? Also, what deployment strategy are you using? I'd like to get as much info as I can on this issue to make sure we solve all the edge cases. :)

Some of the issues we see folks facing can be mitigated or eliminated with our documented best practices. We're also working on a number of other things (circuit breaking, outlier detection to name two) that will improve the default experience.

@Paritosh-Anand

Hi @bcelenza,

Yes, we are using Cloud Map for service discovery. The deployment type is rolling update with:

minimum healthy percent = 60
maximum healthy percent = 200

We do blue/green deployments from our own automation, NOT using the blue/green approach powered by CodeCommit.

If possible, can you share the approach for circuit breaking? It seems like an interesting part of the solution. However, as I mentioned, our prime concern is to gracefully serve in-flight requests. Will these features help solve the problem at hand?

Let me know if any more details are required or any other way I can help contribute towards solving this.

@LancerRainier

We are actively designing and working on this. Currently slated for Q4 2020 release.

@karanvasnani

The fix for this bug has been released and deployed in all regions. Feel free to reopen if you're still observing this issue.
