
502 Bad Gateway from proxy on retry after timeout #3596

Closed
rocketraman opened this issue Oct 17, 2019 · 19 comments

@rocketraman
Contributor

rocketraman commented Oct 17, 2019

Bug Report

What is the issue?

I call an external http/1.1 service from a linkerd2-injected container. The service takes a while to execute each request, and often the client side times out and retries.

The behavior I'm seeing is:

  1. Request made
  2. Request times out
  3. 1 second later, my container retries the request, and
  4. the linkerd proxy immediately returns 502 Bad Gateway to the container

How can it be reproduced?

I don't have a full reproduction; however, in my environment I can easily reproduce it by blocking the external service from responding so that a timeout occurs.
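
Roughly, the repro has the shape of the sketch below. This is not the actual app (which is a Ktor client in Kotlin); it's just a TypeScript sketch where the URL, payload size, and timeouts are placeholders, with the client timeout shortened so the failure shows up quickly.

// Sketch of the repro shape: POST a large body to an upstream that never answers,
// time out client-side, retry ~1 s later, and log what the retry returns.
// Placeholders: the URL, the 10 MB payload, and the 5 s timeout. Requires Node 18+.
const upstream = "http://external-service.example.com:7071/api/p"; // stand-in for the external service
const payload = Buffer.alloc(10 * 1024 * 1024);                    // large body, similar in size to the real request

async function callOnce(timeoutMs: number): Promise<string> {
  const res = await fetch(upstream, {
    method: "POST",
    headers: { "content-type": "application/pdf" },
    body: payload,
    signal: AbortSignal.timeout(timeoutMs), // client-side timeout (the real app uses 300 s)
  });
  return `${res.status} ${res.statusText}`;
}

async function main() {
  try {
    console.log("attempt 0:", await callOnce(5_000)); // expected to time out while the upstream is blocked
  } catch (err) {
    console.warn("attempt 0 failed:", (err as Error).name);
  }
  await new Promise((resolve) => setTimeout(resolve, 1_000)); // retry ~1 s later, as in the report
  console.log("attempt 1:", await callOnce(5_000));           // with the proxy injected, this is the call that comes back 502
}

main().catch(console.error);

Running something with this shape inside the meshed pod, with the upstream blocked from responding, is what reproduces the behavior in my environment.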

Logs, error output, etc

app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815158241Z DBUG [210207.178386s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound dst=Some(DstAddr { dst_logical: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), dst_concrete: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), direction: Out, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815356042Z DBUG [210207.178566s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound ep=Some(Endpoint { dst_logical: None, dst_concrete: None, addr: V4(10.254.2.5:7071), identity: None(NoPeerName(NotProvidedByServiceDiscovery)), metadata: Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None }, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815472742Z DBUG [210207.178643s] 10.254.2.5:7071 linkerd2_proxy::proxy::http::client client request: method=POST uri=http://x-ops-fevm.azure.x.com:7071/api/p version=HTTP/1.1 headers={"host": "x-ops-fevm.azure.x.com:7071", "accept": "application/json", "accept-charset": "UTF-8", "user-agent": "Ktor client", "content-type": "application/pdf", "content-length": "9984137", "l5d-dst-canonical": "x-ops-fevm.azure.x.com:7071"}
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815498442Z DBUG [210207.178721s] linkerd2_proxy::transport::connect connecting to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.817578849Z DBUG [210207.180791s] linkerd2_proxy::transport::connect connection established to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.817596449Z DBUG [210207.180878s] linkerd2_proxy::transport::metrics client connection open
[... bunch of linkerd2_proxy::transport::tls::listen ...]
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:18:01.392623774Z DBUG [210314.755744s] linkerd2_proxy::proxy::server http1 server error: Error(BodyWrite, Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })
[... bunch of linkerd2_proxy::transport::tls::listen ...]
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:19:15.452792Z DBUG [210388.815977s] linkerd2_proxy::proxy::tcp forward duplex complete: Connection reset by peer (os error 104)
[... bunch of linkerd2_proxy::transport::tls::listen ...]
app-55db9ff5bf-sthkq app 2019-10-17T16:21:13.8160006Z 2019-10-17 16:21:13,815 WARN  --- [ty-call-handler#2156] MyApp          : Long-running call attempt 0 failed, retrying... Timed out waiting for 300000 ms [[EXCEPTION: kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 300000 ms|       ... suppressed 5 lines| at java.lang.Thread.run(Thread.java:834) [?:?]|]]
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.821686419Z DBUG [210508.184840s] linkerd2_proxy::transport::tls::listen accepted connection from 10.5.1.125:35288 to Some(V4(10.254.2.5:7071)); skipping TLS (loopback)
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.822822123Z DBUG [210508.186064s] linkerd2_proxy::app::outbound outbound addr=Ok(Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }))
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.822892923Z DBUG [210508.186148s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound dst=Some(DstAddr { dst_logical: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), dst_concrete: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), direction: Out, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.823080324Z DBUG [210508.186321s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound ep=Some(Endpoint { dst_logical: None, dst_concrete: None, addr: V4(10.254.2.5:7071), identity: None(NoPeerName(NotProvidedByServiceDiscovery)), metadata: Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None }, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.823131324Z DBUG [210508.186397s] 10.254.2.5:7071 linkerd2_proxy::proxy::http::client client request: method=POST uri=http://x-ops-fevm.azure.x.com:7071/api/p version=HTTP/1.1 headers={"host": "x-ops-fevm.azure.x.com:7071", "accept": "application/json", "accept-charset": "UTF-8", "user-agent": "Ktor client", "content-type": "application/pdf", "content-length": "9984137", "l5d-dst-canonical": "x-ops-fevm.azure.x.com:7071"}
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.823202624Z DBUG [210508.186450s] linkerd2_proxy::transport::connect connecting to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.82501623Z DBUG [210508.188244s] linkerd2_proxy::transport::connect connection established to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.82503323Z DBUG [210508.188303s] linkerd2_proxy::transport::metrics client connection open
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.825265131Z DBUG [210508.188420s] proxy={client=out dst=10.254.2.5:7071 proto=Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false }} linkerd2_proxy::proxy::http::glue http body error: error reading a body from connection: end of file before message length reached
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.825554932Z ERR! [210508.188673s] linkerd2_proxy::app::errors unexpected error: error from user's Payload stream: error reading a body from connection: end of file before message length reached
app-55db9ff5bf-sthkq app 2019-10-17T16:21:14.827497238Z 2019-10-17 16:21:14,826 WARN  --- [ty-call-handler#2156] MyApp          : Long-running call attempt 1 failed, retrying... Unable to do long running call, status=[502 Bad Gateway]

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: 1.12.7
  • Cluster Environment: AKS (via AKS-engine)
  • Host OS: Ubuntu 16.04.6 LTS
  • Linkerd version:
$ linkerd version
Client version: stable-2.6.0
Server version: stable-2.6.0

Possible solution

Additional context

@rocketraman
Contributor Author

rocketraman commented Oct 17, 2019

Some more information:

I used the debug sidecar to capture packets. From what I can tell, the packet capture shows the 502 Bad Gateway coming from the remote service.

However, if I remove the linkerd proxy completely and run the request again, I never see the 502 Bad Gateway at all and retries seem to work correctly.

@ihcsim
Contributor

ihcsim commented Nov 26, 2019

@rocketraman Is this issue related to the one you described on Slack, where there was thread pool exhaustion in the application code? Let me know if we can close this issue, if it's not Linkerd-related.

@rocketraman
Contributor Author

rocketraman commented Nov 26, 2019

@ihcsim No, this is a different issue, and it is linkerd-related AFAIK.

@yjiq150

yjiq150 commented Dec 13, 2019

I have the same situation here.

My server makes a lot of outbound HTTP API calls. With a freshly started linkerd it works well, but after some random amount of time those outbound API calls start coming back with 502 Bad Gateway.
(Sometimes it happens after a few days, sometimes a few hours after the server restarts.)
When I restart the server (the linkerd sidecar container is also restarted), the problem goes away, and then the same thing happens again after a while.

I am not sure whether the following logs are directly related to this issue, but they are always there when the problem happens.

linkerd2_proxy::proxy::http::canonicalize failed to refine 'yyy.xxxxx.com': deadline has elapsed; using original name

(yyy.xxxxx.com is a redacted value; the original is the domain of an HTTP API provider outside of my k8s cluster)

linkerd2_proxy::app::errors unexpected error: error trying to connect: request timed out

It has been happening continuously since I was on linkerd 2.4.0, and it is still happening on linkerd 2.6.0 now.

@cpretzer
Contributor

cpretzer commented Dec 23, 2019

@rocketraman @yjiq150 thanks for the reports. I'm working on a reproduction case now and will update when I have more information.

@cpretzer
Contributor

@rocketraman and @yjiq150 I've got a test case running now, so hopefully I can reproduce the behavior.

When you say that you're calling an external service, are you using an ExternalName Service type?

Or is your service making an http call directly to the URL of the external service?

@yjiq150

yjiq150 commented Dec 27, 2019

No, it's not an ExternalName Service from k8s. They are just plain http/https API calls to 3rd-party servers outside the k8s cluster.

@rocketraman
Contributor Author

Yes, it's a direct http call to the URL of the external service outside the k8s cluster.

@cpretzer
Contributor

@rocketraman @yjiq150 thank you for that. My test case is also making direct calls to a URL that I created. So far I've not seen the behavior described in this issue, but I'll keep testing.

@cpretzer
Contributor

cpretzer commented Jan 3, 2020

@rocketraman @yjiq150 my test has been running for a few days now and so far I haven't reproduced the error.

The external service is set to reply after 10 seconds and returns errors a percentage of the time.

I see in the original post that the application is retrying a request after a 300000 ms (300 second) timeout:

app-55db9ff5bf-sthkq app 2019-10-17T16:21:13.8160006Z 2019-10-17 16:21:13,815 WARN  --- [ty-call-handler#2156] MyApp          : Long-running call attempt 0 failed, retrying... Timed out waiting for 300000 ms [[EXCEPTION: kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 300000 ms|       ... suppressed 5 lines| at java.lang.Thread.run(Thread.java:834) [?:?]|]]

What is the duration, on average, for responses where you see this behavior?

@cpretzer
Contributor

cpretzer commented Jan 6, 2020

@rocketraman I heard from @grampelberg that you found the cause for this issue in a third-party dependency, so I'm going to close it for now.

@yjiq150 if you have more information about the scenario where you saw this behavior, please reopen the ticket with details on how to reproduce.

@cpretzer cpretzer closed this as completed Jan 6, 2020
@rocketraman
Contributor Author

I thought it was two separate issues, but yes, it is possible that the same cause is behind both of them. For reference, here is the explanation of the underlying problem, and the workaround:

#3605 (comment)

@yjiq150

yjiq150 commented Jan 30, 2020

Thank you for the update!

I investigated my issue based on the idea from your case and found that my elasticsearch-js (Node.js) client uses a connection pool with keep-alive enabled.

No eviction policy was set for the idle connections in the pool. Last week I set the max idle time for connections to the same value as net.netfilter.nf_conntrack_tcp_timeout_close_wait.
I haven't seen any more 502 Bad Gateway errors since then (yay!).
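
Roughly, the change looks like the sketch below. This is a sketch, not my exact code: it assumes the agentkeepalive package for idle-socket eviction and the elasticsearch-js `agent` option that accepts a factory function, and the endpoint is a placeholder.

import { Client } from "@elastic/elasticsearch";
import Agent from "agentkeepalive";

// Evict idle sockets after 60 s, i.e. the same value as
// net.netfilter.nf_conntrack_tcp_timeout_close_wait on the nodes.
const keepAliveAgent = new Agent({
  keepAlive: true,
  freeSocketTimeout: 60_000, // max idle time for a pooled connection
});

const client = new Client({
  node: "http://elasticsearch.example.com:9200", // placeholder endpoint
  agent: () => keepAliveAgent,                   // hand the custom agent to the underlying connection
});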

My guess at the error scenario is as follows: the idle connections in the connection pool somehow stay connected to the server, so at the application level a connection still looks alive and normal, but a request sent through that connection gets a 502 Bad Gateway error.
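
A rough way to see this, assuming the guess above is right, is something like the sketch below: hold a pooled keep-alive socket idle for longer than the close_wait window, then reuse it. The host and timings here are made up.

import http from "node:http";

// A pool with keep-alive and no idle eviction, like the default client setup.
const agent = new http.Agent({ keepAlive: true });

function get(): Promise<number> {
  return new Promise((resolve, reject) => {
    http
      .get({ host: "api.example.com", port: 80, path: "/health", agent }, (res) => {
        res.resume();                       // drain the body so the socket goes back to the pool
        resolve(res.statusCode ?? 0);
      })
      .on("error", reject);
  });
}

async function main() {
  console.log("first request:", await get());       // opens a fresh connection
  await new Promise((r) => setTimeout(r, 90_000));   // stay idle longer than the ~60 s close_wait window
  console.log("second request:", await get());       // reuses the idle socket; this is where I expect the failure
}

main().catch(console.error);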

@cpretzer What do you think about this? Shouldn't it be fine to keep an idle connection around without eviction?
(I looked into the Node.js http/socket library that elasticsearch-js uses internally; the library does not seem to have any half-close behavior.)

@cpretzer
Contributor

@yjiq150 thanks for the update.

I don't know about this particular situation, but I've seen some unexpected behavior when a client expects a connection to be open through a keepalive mechanism and a server closes the connection.

What was the duration of the net.netfilter.nf_conntrack_tcp_timeout_close_wait value that you used to reconfigure the max idle time? Is that max idle time configured in the elasticsearch-js library?

@yjiq150

yjiq150 commented Feb 1, 2020

I didn't change the net.netfilter.nf_conntrack_tcp_timeout_close_wait value in the k8s cluster, and I believe the default is 60 seconds. So I just applied an eviction configuration for connections with a 60-second max idle time.

@ihcsim
Contributor

ihcsim commented Feb 2, 2020

(I looked into the Node.js http/socket library that elasticsearch-js uses internally; the library does not seem to have any half-close behavior.)

I don't think it's only the client that can cause a half-closed connection. I'm not an ES expert, but there are a number of these ES half-closed connection issues out there: elastic/elasticsearch-php#225, https://discuss.elastic.co/t/logstash-output-elasticsearch-connection-reset-while-trying-to-send-bulk-request-to-elasticsearch/173941, https://discuss.elastic.co/t/transportclient-disconnecting-and-not-reconnecting-automatically/13791, https://stackoverflow.com/q/49844295/1144203. They sound very similar to what you are seeing, right? And they all point to different factors that can contribute to half-closed connections.

@yjiq150

yjiq150 commented Feb 3, 2020

Thanks for the additional information.

However, I still think the problem has something to do with the custom iptables rules installed by linkerd, because there was no problem at all when we used elasticsearch-js without a max-idle eviction policy and with no linkerd proxy installed.

@ericsuhong

We were facing the same issue. We were about to give up on Linkerd until we found this post!

We are using Nginx Ingress, and we increased net.netfilter.nf_conntrack_tcp_timeout_close_wait to 3600 seconds for the ingress controller pods using PodSecurityPolicy + SecurityContext.

Once we applied this fix, the 502 Bad Gateway issues completely went away.

I also believe that this issue is specific to Linkerd's implementation and iptables rules.

Can we put this information somewhere in the Linkerd documentation for debugging 502 errors? It would save users tons of debugging time and potentially keep many Linkerd users from moving away!

@grampelberg
Contributor

See #4276 to change this by default.
