sds: cluster not warming while certificates are being fetched; immediately marked active #11120
Comments
/cc @JimmyCYJ
@Shikugawa would you be able to help on investigating this?
@dio Yes, I'll investigate this problem.
I suspect this causes other issues as well. We are seeing that if we do not include sds config in the XDS connection, we eventually have SDS permanently broken: clusters that reference sds secrets will be stuck warming forever. We have 2 secrets throughout all config, the client cert and the root cert. If we add the cert to the xds cluster (enforcing SDS starts before XDS), as sketched below, then we never see the issue with the client cert. However, the root cert still sometimes gets stuck warming forever. More info: istio/istio#22443. I am not sure if it's the same root cause, but it seems related.
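For reference, "adding the cert to the xds cluster" means giving the bootstrap's static xds cluster a TLS transport socket whose client certificate is itself delivered over SDS, which forces the SDS subscription to start before the XDS connection can be established. A minimal sketch of that shape (the cluster names, secret name, and addresses here are hypothetical, not taken from this issue):

```yaml
static_resources:
  clusters:
  - name: xds_cluster
    type: STRICT_DNS
    connect_timeout: 1s
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: {address: xds.example.com, port_value: 15012}
    # TLS to the xDS server; the client cert comes from SDS, so the SDS
    # subscription has to be up before XDS can connect.
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          tls_certificate_sds_secret_configs:
          - name: default   # hypothetical client-cert secret name
            sds_config:
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc: {cluster_name: sds_cluster}
  - name: sds_cluster       # plaintext static cluster pointing at the SDS server
    type: STRICT_DNS
    connect_timeout: 1s
    http2_protocol_options: {}   # SDS is served over gRPC
    load_assignment:
      cluster_name: sds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: {address: 127.0.0.1, port_value: 8234}
```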
@howardjohn Got it. I think that
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
It's not just the SDS cluster being inactive; the SDS cluster could be active but not return any secrets.
Let me rephrase to see whether we are on the same page.
My understanding is that "warming cluster shows ready" is the only error.
We may be saying the same thing, not sure. But I think it's more like:
@howardjohn I see, the stat is the only liar. I think Envoy is declaring "active" early in stats, not only for clusters but also for listeners. However, the sequence is working as expected: the initial_fetch_timeout is supposed to unblock initialization by announcing itself "ready" or "active". @mandarjog I think the solution is to disable initial_fetch_timeout in SDS if we cannot tolerate fake readiness.
@lambdai I don't think stats is the only issue. See istio/istio#22443. I have less of a clear reproducer, but basically we get into a state where Envoy never sends an SDS request for one of the SDS resources in the XDS response. It seems there are larger problems than just the stat, but maybe it's a completely different issue. Why is the solution to disable it rather than fix the stat?
Got it. Reading the issue to see if I can help.
Oh I see. Yeah, actually we do not want the initial fetch timeout for these in particular. But that is an Istio detail, not an Envoy-side one. One thing I do wonder: does the initial_fetch_timeout stuff work differently for bootstrap vs dynamic? Because what we found is if we requested a secret
@lambdai @mandarjog Sorry, I missed these discussions! For now, I think that the common ground is to add
Why do we need disable_init_fetch_timeout? I think setting it to 0s disables it already? Besides, the fact we want it disabled is an implementation detail of our usage of Envoy; it's still a general issue that will impact others.
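For what it's worth, `initial_fetch_timeout` is a field on the config source itself, so "setting it to 0s" would look roughly like this per secret reference. A sketch, assuming the standard v3 ConfigSource field; the surrounding transport-socket structure and names are illustrative:

```yaml
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    common_tls_context:
      tls_certificate_sds_secret_configs:
      - name: default
        sds_config:
          # 0s disables the initial fetch timeout: initialization blocks
          # (and the cluster keeps warming) until the secret is delivered.
          initial_fetch_timeout: 0s
          api_config_source:
            api_type: GRPC
            grpc_services:
            - envoy_grpc: {cluster_name: sds_cluster}
```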
@howardjohn Got it. Maybe this is caused by the state management of the cluster manager. In general, if we start to create an xDS subscription while not all clusters are initialized, this problem will occur. So I think that to resolve this problem, we should fix the implementation of
Any update on this? This is an intermittent problem on our Istio deployment that stalls pods.
@howardjohn Hey, I'd like to confirm something about this, since I don't completely understand what the problem is. Our problem is: the attached cluster becomes active immediately after its SDS subscription sends a DiscoveryRequest to the SDS cluster, even when the DiscoveryResponse from the SDS cluster doesn't contain the CA, instead of staying warming until the attached cluster's initial_fetch_timeout. We expect the attached cluster to stay warming in this situation. Is this what you said?
Yep. The cluster should be warming until the secret is fetched, but it's active immediately.
We also run into this problem intermittently. In our case we stream file-based certs via SDS using Istio. When we have many clusters and the SDS push for a cluster is delayed, the cluster is incorrectly marked as active, and requests to that cluster fail with the error "OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_UNKNOWN" immediately after initialization. The problem is more prominent when there are many clusters.
Any update on this issue?
/cc @incfly
* cluster manager: avoid immediate activation for dynamically inserted clusters during initialization (#12783). Signed-off-by: Shikugawa <[email protected]>
* sds: keep warming when a dynamically inserted cluster can't extract its secret entity (#13344). This PR depends heavily on #12783. Changed to keep clusters warming if dynamically inserted clusters (while initialization hasn't finished) fail to extract the TLS certificate and certificate validation context; they shouldn't be indicated as ACTIVE clusters. Risk Level: Mid. Testing: Unit. Fixes #11120; future work: #13777. Signed-off-by: Shikugawa <[email protected]> Co-authored-by: Rei Shimizu <[email protected]>
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
@lizan Can we close this?
@Shikugawa no, this is not fixed; I'll have a fix soon.
Fixes #11120; allows more than one init manager to watch the same SDS init target so clusters/listeners won't be marked active immediately. Risk Level: Medium. Testing: integration test. Docs Changes: N/A. Release Notes: Added. Signed-off-by: Lizan Zhou <[email protected]>
Hey @howardjohn, I recall you mentioning that Envoy sometimes does not fetch SDS secrets dynamically for clusters whose transport socket references an SDS config source. I have seen this issue as well. Was this fixed or root-caused? It seems this was addressed in Istio by providing the transport socket certs within the cluster config itself.
When creating clusters that reference SDS certificates, the warming behavior does not seem correct. My expectation is that the cluster will be marked as "warming" until the secret is sent (or until the initial_fetch_timeout expires), and that it will block the rest of initialization from occurring.
What I am actually seeing is that initialization is blocked, but nothing indicates the clusters are warming.
Using this config:

```shell
docker run -v $HOME/kube/local:/config -p 15000:15000 envoyproxy/envoy-dev \
  -c /config/envoy-sds-lds.yaml --log-format-prefix-with-location 0 --reject-unknown-dynamic-fields
```

with envoy version: `49efb9841a58ebdc43a666f55c445911c8e4181c/1.15.0-dev/Clean/RELEASE/BoringSSL`
and config files:
cds.yaml:
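(The attached file contents aren't reproduced here. A minimal sketch of a filesystem CDS response matching the scenario described below, with hypothetical cluster names and endpoints:)

```yaml
version_info: "0"
resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: example_cluster          # hypothetical name
  type: STRICT_DNS
  connect_timeout: 1s
  load_assignment:
    cluster_name: example_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: 127.0.0.1, port_value: 8443}
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
      common_tls_context:
        tls_certificate_sds_secret_configs:
        - name: default
          sds_config:
            # No SDS server actually runs behind sds_cluster, so this
            # fetch fails; 20s matches the timeout described below.
            initial_fetch_timeout: 20s
            api_config_source:
              api_type: GRPC
              grpc_services:
              - envoy_grpc: {cluster_name: sds_cluster}
```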
envoy-sds-lds.yaml:
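(Likewise, a sketch of a bootstrap consistent with the docker command above: admin on 15000, filesystem CDS, and an LDS source with the same 20s timeout. Paths and the placeholder SDS cluster are hypothetical:)

```yaml
admin:
  address:
    socket_address: {address: 0.0.0.0, port_value: 15000}
node:
  id: test
  cluster: test
dynamic_resources:
  cds_config:
    initial_fetch_timeout: 20s
    path: /config/cds.yaml
  lds_config:
    # Nothing ever serves this file in the repro, standing in for the
    # "no real LDS server" case mentioned below.
    initial_fetch_timeout: 20s
    path: /config/lds.yaml
static_resources:
  clusters:
  - name: sds_cluster            # referenced by cds.yaml; nothing listens
    type: STRICT_DNS             # here, so the SDS fetch can never succeed
    connect_timeout: 1s
    http2_protocol_options: {}
    load_assignment:
      cluster_name: sds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: {address: 127.0.0.1, port_value: 8234}
```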
Basically what should happen here is we get a dynamic CDS cluster with SDS config. This SDS config fails, as the sds server is not set up. We have initial_fetch_timeout, so for 20s everything should be warming.
What we see instead:

- We also see `init_fetch_timeout` is 0; this does not change after 20s (note: for simple testing I don't have a real LDS server, but we can see it's not even attempted until 20s in).
- `dynamic_active_clusters` shows the cluster in cds.yaml. I would expect it to be "warming".

The example above is meant to simplify things; I have originally seen this with a normal deployment using an ADS gRPC server (Istio), not just files.
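One way to observe the reported state from outside, assuming the admin port mapping in the docker command above: the admin config dump splits clusters into active and warming lists, and the cluster manager exports a warming gauge.

```shell
# dynamic_active_clusters vs. dynamic_warming_clusters in the config dump
curl -s localhost:15000/config_dump | grep -E '"dynamic_(active|warming)_clusters"'
# gauge of clusters currently warming
curl -s localhost:15000/stats | grep cluster_manager.warming_clusters
```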