GKE with externalTrafficPolicy Local causes massive timeouts #4121
Comments
I'd like to note that the specific option you're toggling is a Kubernetes option (https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/service.go), not an ingress-nginx one, so maybe bring this up in kubernetes/kubernetes. FWIW I've observed similar behaviour (in GKE as well), but only when we roll ingress-nginx pods and the new ones end up on different nodes: this triggers reconfiguration of the upstreams in the GCP network load balancer, because the health checks on the former nodes start failing, as expected, since there is no ingress-nginx pod running on them any more. In my case the issue goes away after a few seconds and happens only during ingress-nginx deploys.
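For anyone who wants to check this from the Kubernetes side: with `externalTrafficPolicy: Local`, the Service also gets a `healthCheckNodePort` that the GCP network load balancer probes, which is why nodes without a local controller pod drop out of rotation. A minimal sketch, assuming the namespace and Service name of a typical helm install:

```sh
# Show the traffic policy and the health-check NodePort the cloud LB probes.
# Namespace and Service name are assumptions; adjust to your install.
kubectl -n ingress-nginx get svc nginx-ingress-controller \
  -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}{.spec.healthCheckNodePort}{"\n"}'
```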
Thanks for the feedback, I'll make sure to post something there too. I did eventually resolve the issue, though not in a satisfactory or sensible way. What you said about pods ending up on different nodes triggered an idea: I deleted the whole ingress controller and completely re-deployed it with the correct traffic policy, and then... it worked. I just have no idea WHY, which is reason enough to post this on kubernetes/kubernetes as well. This ticket is closed for now.
I have had the same problem for quite a while. What do you mean by "correct traffic policy"?
@ayushin I mean externalTrafficPolicy: Local, which was the correct one for my use case. It basically "fixed itself" once I fully removed nginx-ingress (helm delete --purge) and re-deployed it with externalTrafficPolicy: Local. No more issues after that. Before that I had attempted to UPDATE the externalTrafficPolicy to Local in place, but that caused the timeouts. I am not sure why this would make a difference, but YMMV.
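For reference, that full re-deploy amounts to something like the sketch below. The release name is an assumption, and this assumes the Helm 2 era stable/nginx-ingress chart (check the chart's values for the exact key):

```sh
# Remove the existing release entirely (Helm 2 syntax), then reinstall with
# the desired traffic policy. "nginx-ingress" as the release name is an assumption.
helm delete --purge nginx-ingress
helm install stable/nginx-ingress --name nginx-ingress \
  --set controller.service.externalTrafficPolicy=Local
```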
We had this exact problem. The default configuration is externalTrafficPolicy: Local. However, if you have more than one node in the cluster, it means that if the requested service is not running on the node which receives the request, the request will hang until the client times out. So for this reason, and AFAIK with my limited Kubernetes knowledge, the right configuration should be externalTrafficPolicy: Cluster.
I consider this behaviour a bug. I hope it helps :)
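For illustration, this is where the setting lives on the controller's Service object. A minimal sketch with placeholder names, not the exact manifest from either cluster:

```yaml
# Minimal sketch of the controller Service; metadata and selector are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  # Local preserves the client source IP but only sends traffic to nodes
  # that run a controller pod; Cluster load-balances across all nodes but
  # SNATs the traffic, so the original client IP is lost.
  externalTrafficPolicy: Local
  selector:
    app: nginx-ingress
    component: controller
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
```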
Just for the record, am I right in stating that in order to keep
No. That "defeats" the goal of
Sorry, I should have given context. In my case, I had to change it to But you're right, in a classic setup,
I managed to get it working by deleting only the nginx controller Service and recreating it with externalTrafficPolicy: Local.
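A sketch of that narrower fix, assuming the same placeholder names as above and a saved copy of the Service manifest (the file name is hypothetical):

```sh
# Delete just the controller Service (the Deployment and its pods stay in place)...
kubectl -n ingress-nginx delete svc nginx-ingress-controller
# ...then recreate it from a manifest whose spec sets externalTrafficPolicy: Local.
# "controller-svc.yaml" is a hypothetical file name.
kubectl -n ingress-nginx apply -f controller-svc.yaml
```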
Is this a request for help?: This might be a bug, but I cannot find any information on it. Other channels are unresponsive.
What keywords did you search in NGINX Ingress controller issues before filing this one?: GKE nginx-ingress externalTrafficPolicyLocal timeout. Only other ticket found is #2582, which is a completely different issue.
Is this a BUG REPORT or FEATURE REQUEST?: bug report
NGINX Ingress controller version: 0.24.1 installed via helm
Kubernetes version (use kubectl version): v1.12.7-gke.10
Environment:
What happened:
I originally installed nginx-ingress without setting the externalTrafficPolicy and everything worked fine. I then noticed I could not see the client IPs on the app side (which I need), so I upgraded nginx-ingress and set the externalTrafficPolicy to Local as per the documentation.
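For a Helm-based install, that change is roughly the following. This is a sketch, not the exact command used here: the release name is assumed and the value key is from the Helm 2 era stable/nginx-ingress chart.

```sh
# Switch the controller Service's traffic policy in place via the chart value.
# Release name "nginx-ingress" is an assumption.
helm upgrade nginx-ingress stable/nginx-ingress \
  --set controller.service.externalTrafficPolicy=Local
```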
I was now able to see the user IPs in the application; however, about two-thirds of all requests started timing out. Using Stackdriver uptime checks, I could see that the application was routinely unreachable from 4 of 6 locations. Not always the same locations, just generally 4 out of 6. Response times were very slow as well. Lots of people experienced intermittent downtime between page loads: clicking a link would give a timeout, refreshing would fix it, and so on.
I completely wiped and re-deployed the application; no change.
I eventually set the externalTrafficPolicy back to Cluster and all problems stopped. This graph from Stackdriver shows the exact moment I re-deployed nginx-ingress with the policy set to Cluster and the subsequent drop in response times:
What you expected to happen:
I expected it either to not work at all, which would point to a different problem, or to simply work. Instead it works only partially, which is extremely confusing.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
I have two ingresses: one for the actual application and one that uses the nginx.ingress.kubernetes.io/temporal-redirect annotation to redirect some subdomains to the main domain.
Both ingresses use TLS. Certificates are obtained by cert-manager from Let's Encrypt using the HTTP-01 challenge.
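For context, the redirect ingress is shaped roughly like this. Hostnames, the redirect target, the TLS secret name, and the cert-manager issuer annotation are all placeholders, not the real objects from this cluster:

```yaml
# Sketch of the redirect Ingress; all names, hosts, and the issuer are placeholders.
apiVersion: extensions/v1beta1   # Ingress API group available on Kubernetes 1.12
kind: Ingress
metadata:
  name: redirect-to-main
  annotations:
    kubernetes.io/ingress.class: nginx
    # 302-redirect every request on these hosts to the main domain.
    nginx.ingress.kubernetes.io/temporal-redirect: https://www.example.com
    certmanager.k8s.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - old.example.com
      secretName: old-example-com-tls
  rules:
    - host: old.example.com
      http:
        paths:
          - backend:
              serviceName: main-app   # required by the schema; the redirect happens before proxying
              servicePort: 80
```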