502 Bad Gateway from proxy on retry after timeout #3596
Comments
Some more information: I used the debug sidecar to capture packets. According to the packet capture, the 502 Bad Gateway appears to be coming from the remote service. However, if I remove the linkerd proxy completely and run the request again, I never see the 502 Bad Gateway at all, and retries work correctly.
@rocketraman Is this issue related to the one you described on Slack, where there was thread pool exhaustion in the application code? Let me know if we can close this issue if it's not Linkerd-related.
@ihcsim No, this is a different issue and is Linkerd-related AFAIK.
I have the same situation here. My server makes a lot of outbound HTTP API calls. With a freshly started linkerd it works well, but those outbound API calls start responding with 502 Bad Gateway after some random amount of time has passed. I am not sure the following log is directly related to this issue, but it's always there when the problem happens.
(yyy.xxxxx.com is a redacted value that was originally the HTTP API provider's domain outside of my k8s cluster)
This has happened continuously since I started using linkerd 2.4.0 and is still happening on linkerd 2.6.0 now.
@rocketraman @yjiq150 thanks for the reports. I'm working on a reproduction case now and will update when I have more information.
@rocketraman and @yjiq150 I've got a test case running now, so hopefully I can reproduce the behavior. When you say that you're calling an external service, are you using an ExternalName Service type? Or is your service making an HTTP call directly to the URL of the external service?
No, it's not an ExternalName Service in k8s. They were just HTTP/HTTPS API calls to third-party servers outside the k8s cluster.
Yes, it's a direct HTTP call to the URL of the external service outside the k8s cluster.
@rocketraman @yjiq150 thank you for that. My test case is also making direct calls to a URL that I created. So far I've not seen the behavior described in this issue, but I'll keep testing.
@rocketraman @yjiq150 my test has been running for a few days now and so far I haven't reproduced the error. The external service is set to reply after 10 seconds and returns errors a percentage of the time. I see in the original post that the application is retrying a request after 30 seconds:
What is the duration, on average, for responses where you see this behavior?
@rocketraman I heard from @grampelberg that you found the cause for this issue in a third-party dependency, so I'm going to close it for now. @yjiq150 if you have more information about the scenario where you saw this behavior, please reopen the ticket with details on how to reproduce.
I thought it was two separate issues, but yes, it is possible it's the same cause for both of them. For reference, here is the explanation of the underlying problem, and the workaround:
Thank you for the update! I investigated my issue based on the idea from your case, and I found that my elasticsearch-js (Node.js) client uses a connection pool with keep-alive enabled. No eviction policy was set for idle connections in the pool. I set a max idle time for connections, and my guess at the error scenario is as follows. @cpretzer What do you think about this? Shouldn't it be fine to keep an idle connection open without eviction?
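To make that concrete, here is a minimal sketch of the kind of change involved: bounding how long idle keep-alive sockets may sit in the pool. It assumes the @elastic/elasticsearch v7 client and the agentkeepalive package; the host, the timeout values, and the exact option names (including the `agent` callback and `freeSocketTimeout`) are assumptions and may differ across client versions.

```ts
import { Client } from '@elastic/elasticsearch';
import Agent from 'agentkeepalive'; // assumes esModuleInterop

// Evict sockets that have sat idle in the pool, so the client never reuses
// a connection that a proxy or conntrack may already consider dead.
const keepAliveAgent = new Agent({
  keepAlive: true,
  maxSockets: 50,            // assumed pool size
  freeSocketTimeout: 30_000, // assumed: close sockets idle for more than 30s
});

const client = new Client({
  node: 'http://elasticsearch.example.com:9200', // hypothetical host
  // Hand the client a custom agent so its pool uses the eviction policy above.
  agent: () => keepAliveAgent,
});

// Example call that now runs over connections with a bounded idle lifetime.
client.ping().then(() => console.log('ok')).catch((err) => console.error(err));
```

The point is only that idle connections get retired on the client side before anything in the path decides they are dead.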
@yjiq150 thanks for the update. I don't know about this particular situation, but I've seen some unexpected behavior when a client expects a connection to be open through a keepalive mechanism and a server closes the connection. What was the duration of the
I didn't change
I don't think it's just the client that will/can cause a half-closed connection. I'm not an ES expert, but there are a number of these ES half-closed connection issues out there: elastic/elasticsearch-php#225, https://discuss.elastic.co/t/logstash-output-elasticsearch-connection-reset-while-trying-to-send-bulk-request-to-elasticsearch/173941, https://discuss.elastic.co/t/transportclient-disconnecting-and-not-reconnecting-automatically/13791, https://stackoverflow.com/q/49844295/1144203. They sound very similar to what you are seeing, right? And they all point to different factors that can contribute to half-closed connections.
Thanks for the additional information. However, I still think the problem has something to do with the custom iptables rules installed by linkerd, because there was no problem at all when we used elasticsearch-js without a maxIdle eviction policy and without the linkerd proxy installed.
We were facing the same issue. We were about to give up on Linkerd until we found this post! We are using the Nginx ingress controller, and we increased net.netfilter.nf_conntrack_tcp_timeout_close_wait to 3600 seconds for the ingress controller pods using a PodSecurityPolicy + SecurityContext. Once we applied this fix, the 502 Bad Gateway issues completely went away. I also believe that this issue is specific to Linkerd's implementation and iptables rules. Can we put this information somewhere in the Linkerd documentation for debugging 502 errors? It would save users tons of debugging time and potentially keep many Linkerd users from moving away!
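For anyone wanting to try the same thing, here is a minimal sketch of the pod-level half of that workaround. It assumes the sysctl has been allowlisted as an unsafe sysctl on the kubelet (--allowed-unsafe-sysctls) and in the PodSecurityPolicy, if one is used; the pod name and image are placeholders.

```yaml
# Sketch: raise the conntrack CLOSE_WAIT timeout for this pod only.
# net.netfilter.nf_conntrack_tcp_timeout_close_wait is treated as an
# "unsafe" sysctl, so it must be explicitly allowed on the kubelet and
# in the PodSecurityPolicy (allowedUnsafeSysctls), if any.
apiVersion: v1
kind: Pod
metadata:
  name: ingress-controller          # placeholder name
spec:
  securityContext:
    sysctls:
      - name: net.netfilter.nf_conntrack_tcp_timeout_close_wait
        value: "3600"
  containers:
    - name: controller
      image: nginx                  # placeholder image
```

In practice the same securityContext block would go into the ingress controller's Deployment pod template rather than a bare Pod.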
See #4276 to change this by default. |
Bug Report
What is the issue?
I call an external http/1.1 service from a linkerd2-injected container. The service takes a while to execute each request, and often the client side times out and retries.
The behavior I'm seeing is that the retried request gets a 502 Bad Gateway back from the linkerd proxy.
How can it be reproduced?
I don't have a full reproduction; however, in my environment I can easily reproduce it by blocking the external service from responding so that a timeout occurs.
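To give a feel for the setup, here is a minimal sketch of a slow upstream that could stand in for the external service, not the actual service involved: it simply holds every request longer than the client's timeout so the client gives up and retries through the proxy. The port and delay are arbitrary assumptions.

```ts
import * as http from 'http';

// Hypothetical stand-in for the slow external service: hold every request
// for longer than the caller's timeout so the caller times out and retries.
const RESPONSE_DELAY_MS = 60_000; // assumed: longer than the client's ~30s timeout

http
  .createServer((req, res) => {
    console.log(`received ${req.method} ${req.url}, delaying ${RESPONSE_DELAY_MS}ms`);
    setTimeout(() => {
      res.writeHead(200, { 'Content-Type': 'text/plain' });
      res.end('slow response\n');
    }, RESPONSE_DELAY_MS);
  })
  .listen(8080, () => console.log('slow upstream listening on :8080'));
```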
Logs, error output, etc
linkerd check output
Environment
Possible solution
Additional context