
GKE with externalTrafficPolicy Local causes massive timeouts #4121

Closed
gboor opened this issue May 26, 2019 · 9 comments

gboor commented May 26, 2019

Is this a request for help?: This might be a bug, but I cannot find any information on it. Other channels are unresponsive.

What keywords did you search in NGINX Ingress controller issues before filing this one?: GKE nginx-ingress externalTrafficPolicyLocal timeout. Only other ticket found is #2582, which is a completely different issue.


Is this a BUG REPORT or FEATURE REQUEST?: bug report

NGINX Ingress controller version: 0.24.1 installed via helm

Kubernetes version (use kubectl version): v1.12.7-gke.10

Environment:

  • Cloud provider or hardware configuration: GKE

What happened:

I originally installed nginx-ingress without setting the externalTrafficPolicy and everything worked fine. I then noticed I could not get the client IPs on the app side (which I need), so I upgraded the nginx-ingress release and set externalTrafficPolicy to Local as per the documentation.
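
For reference, the change boils down to roughly this on the controller's Service (just a sketch; the actual name, labels and selector come from the helm chart and will differ per release):

apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller    # placeholder; the chart generates its own name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local      # preserves the client source IP; only nodes running a
                                    # controller pod pass the load balancer health check
  selector:
    app: nginx-ingress              # placeholder; must match the controller pods
    component: controller
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https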

I was now able to see the user IPs in the application; however, about two thirds of all requests started timing out. Using Stackdriver uptime checks, I could see that the application was routinely unreachable from 4 of 6 locations (not always the same ones, just generally 4 out of 6). Response times were very slow as well. Lots of people experienced intermittent downtime between page loads, where clicking a link would give a timeout, refreshing would fix it, etc.

I completely wiped and re-deployed the application, no change.

I eventually set the externalTrafficPolicy back to Cluster and all problems stopped. This graph from Stackdriver shows the exact moment I re-deployed nginx-ingress with the policy set to Cluster and the subsequent drop in response times:

[Screenshot: uptime check latency graph]

What you expected to happen:

I either expected it to not work at all, which would point to a different problem, or to just work. It working only partially is extremely confusing.

How to reproduce it (as minimally and precisely as possible):

  1. Set up a cluster in GKE, without HTTP load balancing, public
  2. Deploy nginx-ingress using the helm chart with --set controller.service.externalTrafficPolicy=Local
  3. Deploy a simple hello-world web app with an Ingress that uses this controller (see the sketch after this list).
  4. Monitor.
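
For step 3, something along these lines is enough (a sketch; the host, names and backend Service are placeholders):

apiVersion: extensions/v1beta1         # Ingress API version available on 1.12.x clusters
kind: Ingress
metadata:
  name: hello-world
  annotations:
    kubernetes.io/ingress.class: nginx # route through the nginx-ingress controller
spec:
  rules:
    - host: hello.example.com          # placeholder host
      http:
        paths:
          - path: /
            backend:
              serviceName: hello-world # placeholder Service exposing the app
              servicePort: 80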

Anything else we need to know:

I have 2 Ingresses: one for the actual application and one that uses the "nginx.ingress.kubernetes.io/temporal-redirect" annotation to redirect some subdomains to the main domain.
Both ingresses use TLS. Certificates are obtained by cert-manager using LetsEncrypt with the HTTP challenge.

ElvinEfendi (Member) commented May 26, 2019

I'd like to note that the specific option you're toggling is Kubernetes-specific (https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/service.go); it is not handled by ingress-nginx. So maybe bring this up in kubernetes/kubernetes.

FWIW I've observed similar behaviour (also on GKE), but only when we roll ingress-nginx pods and the new ones land on different nodes. That triggers a reconfiguration of the upstreams in the GCP network load balancer, because the health check on the former nodes starts failing, as expected, since no ingress-nginx pod is running on them anymore. In my case the issue also goes away after a few seconds and only happens during ingress-nginx deploys.
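
Concretely, with Local the Service gets a healthCheckNodePort, and the GCP network load balancer probes that port on every node; a node without a local ingress-nginx pod fails the probe and is taken out of rotation. Roughly this excerpt of the Service (the port value is whatever Kubernetes allocates):

spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  healthCheckNodePort: 31452   # example value; allocated automatically by Kubernetes.
                               # The GCP load balancer health-checks this port on each node,
                               # and only nodes with a running ingress-nginx pod answer it.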

gboor (Author) commented May 28, 2019

Thanks for the feedback, I'll make sure to post something there too.

I did eventually resolve the issue, though not in a satisfactory or sensible way. What you said about pods ending up on different nodes gave me an idea: I ended up completely deleting the whole ingress controller and re-deploying it from scratch with the correct traffic policy, and then... it worked.

I just have no idea WHY, which is reason enough to post this on kubernetes/kubernetes as well. Closing this ticket for now.

gboor closed this as completed May 28, 2019
ayushin commented Jun 9, 2019

I have had the same problem for quite a while. What do you mean by "correct trafficpolicy"?

gboor (Author) commented Jun 10, 2019

@ayushin I mean externalTrafficPolicy: Local - which was the correct one for my use-case.

It basically "fixed itself" once I fully removed nginx-ingress (helm delete --purge) and re-deployed it with externalTrafficPolicy: Local. No more issues after that.

Before that, I had attempted to UPDATE the externalTrafficPolicy to Local in place, but that caused the timeouts. I am not sure why this would make a difference.

But ymmv.

alexcastano commented

We had this exact problem. externalTrafficPolicy: Local is the default and the recommended configuration for GKE:

https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/cloud-generic.yaml

However, if you have more than one node in the cluster, any request that lands on a node where the ingress controller is not running will hang until the client times out. So for this reason, and AFAIK with my limited Kubernetes knowledge, the right configuration should be:

externalTrafficPolicy: Cluster

I consider this behaviour a bug.

I hope it helps :)

mveroone commented Aug 9, 2019

Just for the record, am I right in stating that, in order to keep externalTrafficPolicy: Local, it's best to run nginx-ingress in DaemonSet mode?

aledbf (Member) commented Aug 9, 2019

am I right in stating that, in order to keep externalTrafficPolicy: Local, it's best to run nginx-ingress in DaemonSet mode?

No. That "defeats" the goal of externalTrafficPolicy: Local (send traffic only to the node where the pod is running)

mveroone commented Aug 9, 2019

Sorry, I should have given context. In my case, I had to change it to Local in order to preserve the source IPs of requests, since my cluster is a GKE cluster (see #4401).

But you're right: in a classic setup, Local in DaemonSet mode behaves much like Cluster for everything else.
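
(By "DaemonSet mode" I just mean deploying the controller as a DaemonSet instead of a Deployment, e.g. with a values override along these lines; I'm assuming the chart option is named controller.kind:)

controller:
  kind: DaemonSet   # one controller pod per node, so every node passes the Local health check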

guanzo commented Jul 24, 2020

@ayushin I mean externalTrafficPolicy: Local - which was the correct one for my use-case.

It basically "fixed itself" once I fully removed nginx-ingress (helm delete --purge) and re-deployed it with externalTrafficPolicy: Local. No more issues after that.

Before that, I had attempted to UPDATE the externalTrafficPolicy to Local in place, but that caused the timeouts. I am not sure why this would make a difference.

But ymmv.

I managed to get it working by deleting only the nginx controller Service and recreating it with externalTrafficPolicy: Local, rather than deleting the entire helm installation. Note that your Kubernetes provider may also delete the underlying load balancer when the Service is deleted. I was not sure about that and took a risk by deleting the Service, but it worked out :). I'm on GKE.
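
If anyone tries the same, one way to reduce that risk is to reserve a static external IP in GCP first and pin it on the recreated Service via spec.loadBalancerIP, so you keep the same address even if the provider recreates the forwarding rule. A rough sketch of the recreated Service (name, selector and IP are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller    # placeholder; match whatever your install uses
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  loadBalancerIP: 203.0.113.10      # placeholder; a regional static IP reserved beforehand
  selector:
    app: nginx-ingress              # placeholder; must match the controller pods
    component: controller
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https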
