
Bug: Traffic sent to stale endpoints using Cloud Map service discovery and active health checks #213

Closed
bcelenza opened this issue Jun 11, 2020 · 5 comments

@bcelenza (Contributor) commented Jun 11, 2020

Summary
When using Cloud Map for service discovery and active health checks on a Virtual Node, any downstream Envoy that routes traffic to that Virtual Node may continue to send requests to endpoints (instances) which have been removed from Cloud Map and terminated. As a result, the downstream Envoys observe connection timeouts and request failures.

The root cause is intended Envoy behavior when active health checks are combined with Envoy's Endpoint Discovery Service (EDS). When active health checks are configured for clusters that use EDS and an endpoint is removed, the endpoint remains routable in downstream Envoys until the unhealthy threshold for the active health checks is met.
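For reference, here is a minimal boto3 sketch of the kind of Virtual Node configuration this affects. The mesh, namespace, and service names are hypothetical, as are the specific health check values:

```python
import boto3

appmesh = boto3.client("appmesh")

# Hypothetical names; substitute your own mesh, Cloud Map namespace, and service.
appmesh.create_virtual_node(
    meshName="my-mesh",
    virtualNodeName="backend-vn",
    spec={
        # Cloud Map service discovery: App Mesh pushes these instances to
        # downstream Envoys via EDS.
        "serviceDiscovery": {
            "awsCloudMap": {
                "namespaceName": "local",
                "serviceName": "backend",
            }
        },
        # Active health check: with EDS, a removed endpoint stays routable in
        # downstream Envoys until these checks fail unhealthyThreshold times.
        "listeners": [
            {
                "portMapping": {"port": 8080, "protocol": "http"},
                "healthCheck": {
                    "protocol": "http",
                    "path": "/ping",
                    "port": 8080,
                    "healthyThreshold": 2,
                    "unhealthyThreshold": 5,
                    "intervalMillis": 10000,
                    "timeoutMillis": 10000,
                },
            }
        ],
    },
)
```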

Steps to Reproduce

  1. Create two virtual nodes: (1) one using Cloud Map for service discovery with active health checks configured, and (2) one using any configuration, but specifying the first virtual node as a backend via a Virtual Service with a Virtual Node provider (or optionally with a Virtual Router provider).
  2. Launch a single replica (i.e. task or pod) for each virtual node.
  3. Send a request from the downstream Envoy to the upstream Envoy and assert that connectivity succeeds.
  4. Terminate the replica for the upstream Virtual Node that uses Cloud Map service discovery and replace it with a new one.
  5. Continue sending requests at a regular cadence from the downstream Envoy and observe that requests may continue to fail for the duration of the configured active health check period. For example, with an unhealthy threshold of 5, an interval of 10000ms, and a timeout of 10000ms, this period will be approximately 100 seconds ((10s + 10s) * 5; see the sketch after this list).
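A quick sketch of the worst-case arithmetic from step 5, assuming each failed check consumes the full timeout before the next interval begins:

```python
# Health check settings from step 5.
unhealthy_threshold = 5
interval_millis = 10_000
timeout_millis = 10_000

# Each failed round costs roughly interval + timeout; the endpoint stays
# routable until unhealthy_threshold rounds have failed.
stale_window_seconds = (interval_millis + timeout_millis) * unhealthy_threshold / 1000
print(f"Stale endpoint may receive traffic for ~{stale_window_seconds:.0f}s")  # ~100s
```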

Are you currently working around this issue?
There are three mitigations that will reduce or eliminate the request failures.

  1. Instrument all routes with retry policies per the App Mesh best practices. A TCP retry event of connection-error is strongly recommended to mitigate this issue (see the sketch after this list).
  2. Use lower values for active health checking's UnhealthyThreshold, IntervalMillis, and TimeoutMillis.
  3. Use more than a single replica for a given Virtual Node, which will ensure that a routable endpoint is always available.
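As a sketch of mitigation 1, a route with a retry policy can be created via boto3. The mesh, router, route, and virtual node names below are hypothetical, and the retry values are illustrative rather than prescriptive; tune them against the best-practices guidance:

```python
import boto3

appmesh = boto3.client("appmesh")

# Hypothetical names; substitute your own mesh, virtual router, and target node.
appmesh.create_route(
    meshName="my-mesh",
    virtualRouterName="backend-vr",
    routeName="backend-route",
    spec={
        "httpRoute": {
            "match": {"prefix": "/"},
            "action": {
                "weightedTargets": [{"virtualNode": "backend-vn", "weight": 1}]
            },
            # Retries mask failures while a stale endpoint is still routable.
            "retryPolicy": {
                "maxRetries": 2,
                "perRetryTimeout": {"unit": "ms", "value": 2000},
                "httpRetryEvents": ["server-error", "gateway-error"],
                # The connection-error TCP retry event is the key mitigation here.
                "tcpRetryEvents": ["connection-error"],
            },
        }
    },
)
```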

Additional context
The behavior of Envoy has been clarified in envoyproxy/envoy#11527.

@Paritosh-Anand

Facing a similar issue, where App Mesh keeps routing traffic to an endpoint even while its container is being stopped gracefully. We are using multiple replicas mapped to a single virtual node; however, that still doesn't prevent connection issues from the consumer service.

We need ALB-like functionality where proper de-registration takes place when ECS identifies that a particular task is being STOPPED.

@bcelenza (Contributor, Author) commented Jul 3, 2020

Hey @Paritosh-Anand, are you using Cloud Map or DNS for service discovery? Also, what deployment strategy are you using? I'd like to get as much info as I can on this issue to make sure we solve all the edge cases. :)

Some of the issues we see folks facing can be mitigated or eliminated with our documented best practices. We're also working on a number of other things (circuit breaking, outlier detection to name two) that will improve the default experience.

@Paritosh-Anand

Hi @bcelenza,

Yes, we are using Cloud Map for service discovery. The deployment type is rolling update with:

minimum healthy percent = 60
maximum healthy percent = 200

We do blue/green deployments from our own automation, NOT using the blue/green approach powered by CodeCommit.

If possible, can you share the approach for circuit breaking? It seems like an interesting part of the solution. However, as I mentioned, our prime concern is to gracefully serve in-flight requests. Will these features help solve the problem at hand?

Let me know if any more details are required or any other way I can help contribute towards solving this.

@LancerRainier

We are actively designing and working on this. Currently slated for Q4 2020 release.

@karanvasnani

The fix for this bug has been released and deployed in all regions. Feel free to reopen if you're still observing this issue.
