Transient 503 UH "no healthy upstream" errors during CDS updates #13070
This sounds like another repeat of #13009? cc @htuch @ramaraochavali @snowp
I'm not sure it's the same issue; if it were, I would expect the timestamps to indicate that we're hitting the EDS fetch timeout (15s) without a response, causing cluster warming to complete prior to receiving any hosts. The logs seem to indicate that the hosts are being added to the cluster, then health checked during warming:
I wonder if the problem is that we create a thread-local (TLS) cluster and update its hosts in separate dispatcher events, so they don't happen atomically?
Warming shouldn't complete until the first round of health checking is complete, so I'm not sure what is happening here. I can take a more detailed look later.
The idea I mentioned in my last post isn't even necessarily about warming, but about the fact that we post the thread-local cluster update before we set its hosts, which means we get two dispatcher events: [update TLS cluster, set hosts]. These could be interleaved with a dispatcher event that tries to route to this cluster. Maybe there's something I missed that prevents this from being an issue; I didn't try to reproduce it. If this is the case, then any CDS update could trigger it, so the main reason I'm skeptical about this theory is that I would have expected it to have been discovered a long time ago.
No, I think you are right. This is definitely a race condition and broken. I'm guessing the reason this has not been reported before is that CDS updates are pretty rare and no one has noticed. We will need to fix this.
I am fairly sure this is the cause of istio/istio#23029. I can reproduce this pretty reliably (not 100% of the time, but within a few minutes). The test I am doing is updating a cluster every 2s, changing some field in metadata that does not matter. In the meantime, I send 2k QPS. The cluster, which notably has no TLS or health check, looks like this:
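The exact cluster config from this comment was not preserved in this copy. As a rough sketch, a cluster matching the description (EDS, no TLS, no health check, 3s initial fetch timeout, and an irrelevant metadata field that the test flips on every push) might look something like the following; the cluster name, service name, and metadata key are placeholders:

```yaml
# Illustrative sketch only; not the reporter's exact config.
name: example_cluster
type: EDS
connect_timeout: 1s
lb_policy: ROUND_ROBIN
eds_cluster_config:
  eds_config:
    ads: {}
    # 3s only because of experiments with #13009; 0s (disabled) reproduces the issue too.
    initial_fetch_timeout: 3s
  service_name: example_cluster
# No transport_socket (TLS) and no health_checks are configured.
metadata:
  filter_metadata:
    repro:
      revision: "42"   # placeholder field flipped by each 2s CDS push
```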
Note: the 3s initial fetch timeout was from playing around with #13009; it does not matter what I set this to, and even disabling it with 0s still produces this issue. I provided some trace logging of 2x where the update triggered this and 2x where it did not. While the above test is synthetic, this behavior is consistently reproducible in real Knative workloads as well.
I can take a look at fixing this.
I looked at this and the fix is not difficult. I will fix soon.
Title: Envoy briefly fails to route requests during CDS updates
Description:
When a cluster is updated via CDS, we sometimes observe request failures in the form of HTTP 503 "no healthy upstream" with the UH response flag. The cluster's membership_total gauge remains constant throughout the update.
The membership_healthy gauge drops to 0 shortly after the CDS update begins, then returns to normal shortly after the CDS update completes.
Request failures occur regardless of the panic routing threshold.
The docs around cluster warming (in https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/cluster_manager#cluster-warming) suggest there should be no traffic disruption during a CDS update.
Repro steps:
We observed and reproduced this behavior on both static and EDS clusters by sending a CDS update that changes the health-checking timeout (e.g. from 5s to 10s and back).
For example, we used an EDS cluster with a 1% panic routing threshold, configured along the lines of:
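The original config snippet is not preserved here; a hedged approximation of such a cluster (EDS, 1% healthy panic threshold, and an active health check whose timeout the CDS update flips between 5s and 10s) could look like this, with the cluster name, health-check path, and timings as placeholders:

```yaml
# Approximate reconstruction for illustration; not the reporter's exact config.
name: demo_cluster
type: EDS
connect_timeout: 1s
lb_policy: ROUND_ROBIN
eds_cluster_config:
  eds_config:
    ads: {}
  service_name: demo_cluster
common_lb_config:
  healthy_panic_threshold:
    value: 1.0          # 1% panic routing threshold
health_checks:
- timeout: 5s           # the CDS update changes this to 10s and back
  interval: 1s
  unhealthy_threshold: 3
  healthy_threshold: 1
  http_health_check:
    path: /healthz      # backends are hard-coded to report healthy
```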
The cluster load assignment included a stable pool of 10 healthy backends.
Envoy does not appear to make any health-check request during the CDS update.
The backends are hard-coded to report healthy and run locally on the same host to minimize possible network interference.
Logs:
We've reproduced this using a version of Envoy 1.14.5-dev with some extra log statements:
Based on the trace above, it looks like the load balancer for the updated cluster is used to route a request before its underlying host set has been initialized.
I'm not familiar enough with Envoy internals, so I could use some help understanding whether I'm somehow misconfiguring Envoy or whether something else is happening.