
GKE with externalTrafficPolicy Local causes massive timeouts #4121

Closed
gboor opened this issue May 26, 2019 · 9 comments

gboor commented May 26, 2019

Is this a request for help?: This might be a bug, but I cannot find any information on it. Other channels are unresponsive.

What keywords did you search in NGINX Ingress controller issues before filing this one?: GKE nginx-ingress externalTrafficPolicyLocal timeout. Only other ticket found is #2582, which is a completely different issue.


Is this a BUG REPORT or FEATURE REQUEST?: bug report

NGINX Ingress controller version: 0.24.1 installed via helm

Kubernetes version (use kubectl version): v1.12.7-gke.10

Environment:

  • Cloud provider or hardware configuration: GKE

What happened:

I originally installed nginx-ingress without setting the externalTrafficPolicy and everything worked fine. I then noticed I could not get the client IPs on the app side (which I need), so I upgraded the nginx-ingress release and set externalTrafficPolicy to Local as per the documentation.
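
For reference, the change boils down to roughly this on the controller's Service (just a sketch; the actual name, labels and selector come from the helm chart and will differ per release):

apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller    # placeholder; the chart generates its own name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local      # preserves the client source IP; only nodes running a
                                    # controller pod pass the load balancer health check
  selector:
    app: nginx-ingress              # placeholder; must match the controller pods
    component: controller
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https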

I was now able to see the user IPs in the application; however, about two thirds of all requests started timing out. Using Stackdriver uptime checks, I could see that the application was routinely unreachable from 4 of 6 locations (not always the same ones, just generally 4 out of 6). Response times were very slow as well. Lots of people experienced intermittent downtime between page loads, where clicking a link would give a timeout, refreshing would fix it, etc.

I completely wiped and re-deployed the application, no change.

I eventually set the externalTrafficPolicy back to Cluster and all problems stopped. This graph from Stackdriver shows the exact moment I re-deployed nginx-ingress with the policy set to Cluster and the subsequent drop in response times:

[Screenshot: uptime check latency graph]

What you expected to happen:

I either expected it to not work at all, which would point to a different problem, or to just work. It working only partially is extremely confusing.

How to reproduce it (as minimally and precisely as possible):

  1. Set up a cluster in GKE, without HTTP load balancing, public
  2. Deploy nginx-ingress using the helm chart with --set controller.service.externalTrafficPolicy=Local
  3. Deploy a simple hello-world web app with an Ingress that uses this controller (see the sketch after this list).
  4. Monitor.
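
For step 3, something along these lines is enough (a sketch; the host, names and backend Service are placeholders):

apiVersion: extensions/v1beta1         # Ingress API version available on 1.12.x clusters
kind: Ingress
metadata:
  name: hello-world
  annotations:
    kubernetes.io/ingress.class: nginx # route through the nginx-ingress controller
spec:
  rules:
    - host: hello.example.com          # placeholder host
      http:
        paths:
          - path: /
            backend:
              serviceName: hello-world # placeholder Service exposing the app
              servicePort: 80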

Anything else we need to know:

I have 2 Ingresses: one for the actual application and one that uses the "nginx.ingress.kubernetes.io/temporal-redirect" annotation to redirect some subdomains to the main domain.
Both ingresses use TLS. Certificates are obtained by cert-manager using LetsEncrypt with the HTTP challenge.

ElvinEfendi (Member) commented May 26, 2019

I'd like to note that the specific option you're toggling is Kubernetes-specific (https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/service.go); it is not handled by ingress-nginx. So maybe bring this up in kubernetes/kubernetes.

FWIW I've observed similar behaviour (also on GKE), but only when we roll ingress-nginx pods and the new ones land on different nodes. That triggers a reconfiguration of the upstreams in the GCP network load balancer, because the health check on the former nodes starts failing, as expected, since no ingress-nginx pod is running on them anymore. In my case the issue also goes away after a few seconds and only happens during ingress-nginx deploys.
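
Concretely, with Local the Service gets a healthCheckNodePort, and the GCP network load balancer probes that port on every node; a node without a local ingress-nginx pod fails the probe and is taken out of rotation. Roughly this excerpt of the Service (the port value is whatever Kubernetes allocates):

spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  healthCheckNodePort: 31452   # example value; allocated automatically by Kubernetes.
                               # The GCP load balancer health-checks this port on each node,
                               # and only nodes with a running ingress-nginx pod answer it.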

gboor (Author) commented May 28, 2019

Thanks for the feedback, I'll make sure to post something there too.

I did eventually resolve the issue, though not in a satisfactory or sensible way. What you said about pods ending up on different nodes gave me an idea: I ended up completely deleting the whole ingress controller and re-deploying it from scratch with the correct traffic policy, and then... it worked.

I just have no idea WHY, which is reason enough to post this on kubernetes/kubernetes as well. Closing this ticket for now.

gboor closed this as completed May 28, 2019
ayushin commented Jun 9, 2019

I have had the same problem for quite a while. What do you mean by "correct trafficpolicy"?

gboor (Author) commented Jun 10, 2019

@ayushin I mean externalTrafficPolicy: Local - which was the correct one for my use-case.

It basically "fixed itself" once I fully removed nginx-ingress (helm delete --purge) and re-deployed it with externalTrafficPolicy: Local. No more issues after that.

Before that, I had attempted to UPDATE the externalTrafficPolicy to Local in place, but that caused the timeouts. I am not sure why this would make a difference.

But ymmv.

alexcastano commented

We had this exact problem. externalTrafficPolicy: Local is the default and the recommended configuration for GKE:

https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/cloud-generic.yaml

However, if you have more than one node in the cluster, any request that lands on a node where the ingress controller is not running will hang until the client times out. So for this reason, and AFAIK with my limited Kubernetes knowledge, the right configuration should be:

externalTrafficPolicy: Cluster

I consider this behaviour a bug.

I hope it helps :)

mveroone commented Aug 9, 2019

Just for the record, am I right in stating that, in order to keep externalTrafficPolicy: Local, it's best to run nginx-ingress in DaemonSet mode?

aledbf (Member) commented Aug 9, 2019

am I right in stating that, in order to keep externalTrafficPolicy: Local, it's best to run nginx-ingress in DaemonSet mode?

No. That "defeats" the goal of externalTrafficPolicy: Local (send traffic only to the node where the pod is running)

mveroone commented Aug 9, 2019

Sorry, I should have given context. In my case, I had to change it to Local in order to preserve the source IPs of requests, since my cluster is a GKE cluster (see #4401).

But you're right: in a classic setup, Local in DaemonSet mode behaves much like Cluster for everything else.
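
(By "DaemonSet mode" I just mean deploying the controller as a DaemonSet instead of a Deployment, e.g. with a values override along these lines; I'm assuming the chart option is named controller.kind:)

controller:
  kind: DaemonSet   # one controller pod per node, so every node passes the Local health check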

guanzo commented Jul 24, 2020

@ayushin I mean externalTrafficPolicy: Local - which was the correct one for my use-case.

It basically "fixed itself" once I fully removed nginx-ingress (helm delete --purge) and re-deployed it with externalTrafficPolicy: Local. No more issues after that.

Before that, I had attempted to UPDATE the externalTrafficPolicy to Local in place, but that caused the timeouts. I am not sure why this would make a difference.

But ymmv.

I managed to get it working by deleting only the nginx controller Service and recreating it with externalTrafficPolicy: Local, rather than deleting the entire helm installation. Note that your Kubernetes provider may also delete the underlying load balancer when the Service is deleted. I was not sure about that and took a risk by deleting the Service, but it worked out :). I'm on GKE.
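
If anyone tries the same, one way to reduce that risk is to reserve a static external IP in GCP first and pin it on the recreated Service via spec.loadBalancerIP, so you keep the same address even if the provider recreates the forwarding rule. A rough sketch of the recreated Service (name, selector and IP are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller    # placeholder; match whatever your install uses
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  loadBalancerIP: 203.0.113.10      # placeholder; a regional static IP reserved beforehand
  selector:
    app: nginx-ingress              # placeholder; must match the controller pods
    component: controller
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https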
