
502 Bad Gateway from proxy on retry after timeout #3596

Closed
rocketraman opened this issue Oct 17, 2019 · 19 comments

@rocketraman
Contributor

rocketraman commented Oct 17, 2019

Bug Report

What is the issue?

I call an external http/1.1 service from a linkerd2-injected container. The service takes a while to execute each request, and often the client side times out and retries.

The behavior I'm seeing is:

  1. Request made
  2. Request times out
  3. 1 second later, my container retries the request, and
  4. the linkerd proxy immediately returns 502 Bad Gateway to the container

How can it be reproduced?

I don't have a full reproduction; however, in my environment I can easily reproduce it by blocking the external service from responding so that a timeout occurs.
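
Roughly, the repro has the shape of the sketch below. This is not the actual app (which is a Ktor client in Kotlin); it's just a TypeScript sketch where the URL, payload size, and timeouts are placeholders, with the client timeout shortened so the failure shows up quickly.

// Sketch of the repro shape: POST a large body to an upstream that never answers,
// time out client-side, retry ~1 s later, and log what the retry returns.
// Placeholders: the URL, the 10 MB payload, and the 5 s timeout. Requires Node 18+.
const upstream = "http://external-service.example.com:7071/api/p"; // stand-in for the external service
const payload = Buffer.alloc(10 * 1024 * 1024);                    // large body, similar in size to the real request

async function callOnce(timeoutMs: number): Promise<string> {
  const res = await fetch(upstream, {
    method: "POST",
    headers: { "content-type": "application/pdf" },
    body: payload,
    signal: AbortSignal.timeout(timeoutMs), // client-side timeout (the real app uses 300 s)
  });
  return `${res.status} ${res.statusText}`;
}

async function main() {
  try {
    console.log("attempt 0:", await callOnce(5_000)); // expected to time out while the upstream is blocked
  } catch (err) {
    console.warn("attempt 0 failed:", (err as Error).name);
  }
  await new Promise((resolve) => setTimeout(resolve, 1_000)); // retry ~1 s later, as in the report
  console.log("attempt 1:", await callOnce(5_000));           // with the proxy injected, this is the call that comes back 502
}

main().catch(console.error);

Running something with this shape inside the meshed pod, with the upstream blocked from responding, is what reproduces the behavior in my environment.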

Logs, error output, etc

app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815158241Z DBUG [210207.178386s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound dst=Some(DstAddr { dst_logical: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), dst_concrete: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), direction: Out, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815356042Z DBUG [210207.178566s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound ep=Some(Endpoint { dst_logical: None, dst_concrete: None, addr: V4(10.254.2.5:7071), identity: None(NoPeerName(NotProvidedByServiceDiscovery)), metadata: Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None }, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815472742Z DBUG [210207.178643s] 10.254.2.5:7071 linkerd2_proxy::proxy::http::client client request: method=POST uri=http://x-ops-fevm.azure.x.com:7071/api/p version=HTTP/1.1 headers={"host": "x-ops-fevm.azure.x.com:7071", "accept": "application/json", "accept-charset": "UTF-8", "user-agent": "Ktor client", "content-type": "application/pdf", "content-length": "9984137", "l5d-dst-canonical": "x-ops-fevm.azure.x.com:7071"}
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.815498442Z DBUG [210207.178721s] linkerd2_proxy::transport::connect connecting to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.817578849Z DBUG [210207.180791s] linkerd2_proxy::transport::connect connection established to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:16:13.817596449Z DBUG [210207.180878s] linkerd2_proxy::transport::metrics client connection open
[... bunch of linkerd2_proxy::transport::tls::listen ...]
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:18:01.392623774Z DBUG [210314.755744s] linkerd2_proxy::proxy::server http1 server error: Error(BodyWrite, Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })
[... bunch of linkerd2_proxy::transport::tls::listen ...]
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:19:15.452792Z DBUG [210388.815977s] linkerd2_proxy::proxy::tcp forward duplex complete: Connection reset by peer (os error 104)
[... bunch of linkerd2_proxy::transport::tls::listen ...]
app-55db9ff5bf-sthkq app 2019-10-17T16:21:13.8160006Z 2019-10-17 16:21:13,815 WARN  --- [ty-call-handler#2156] MyApp          : Long-running call attempt 0 failed, retrying... Timed out waiting for 300000 ms [[EXCEPTION: kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 300000 ms|       ... suppressed 5 lines| at java.lang.Thread.run(Thread.java:834) [?:?]|]]
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.821686419Z DBUG [210508.184840s] linkerd2_proxy::transport::tls::listen accepted connection from 10.5.1.125:35288 to Some(V4(10.254.2.5:7071)); skipping TLS (loopback)
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.822822123Z DBUG [210508.186064s] linkerd2_proxy::app::outbound outbound addr=Ok(Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }))
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.822892923Z DBUG [210508.186148s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound dst=Some(DstAddr { dst_logical: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), dst_concrete: Name(NameAddr { name: "x-ops-fevm.azure.x.com", port: 7071 }), direction: Out, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.823080324Z DBUG [210508.186321s] x-ops-fevm.azure.x.com:7071 linkerd2_proxy::app::outbound outbound ep=Some(Endpoint { dst_logical: None, dst_concrete: None, addr: V4(10.254.2.5:7071), identity: None(NoPeerName(NotProvidedByServiceDiscovery)), metadata: Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None }, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.823131324Z DBUG [210508.186397s] 10.254.2.5:7071 linkerd2_proxy::proxy::http::client client request: method=POST uri=http://x-ops-fevm.azure.x.com:7071/api/p version=HTTP/1.1 headers={"host": "x-ops-fevm.azure.x.com:7071", "accept": "application/json", "accept-charset": "UTF-8", "user-agent": "Ktor client", "content-type": "application/pdf", "content-length": "9984137", "l5d-dst-canonical": "x-ops-fevm.azure.x.com:7071"}
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.823202624Z DBUG [210508.186450s] linkerd2_proxy::transport::connect connecting to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.82501623Z DBUG [210508.188244s] linkerd2_proxy::transport::connect connection established to 10.254.2.5:7071
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.82503323Z DBUG [210508.188303s] linkerd2_proxy::transport::metrics client connection open
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.825265131Z DBUG [210508.188420s] proxy={client=out dst=10.254.2.5:7071 proto=Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false }} linkerd2_proxy::proxy::http::glue http body error: error reading a body from connection: end of file before message length reached
app-55db9ff5bf-sthkq linkerd-proxy 2019-10-17T16:21:14.825554932Z ERR! [210508.188673s] linkerd2_proxy::app::errors unexpected error: error from user's Payload stream: error reading a body from connection: end of file before message length reached
app-55db9ff5bf-sthkq app 2019-10-17T16:21:14.827497238Z 2019-10-17 16:21:14,826 WARN  --- [ty-call-handler#2156] MyApp          : Long-running call attempt 1 failed, retrying... Unable to do long running call, status=[502 Bad Gateway]

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: 1.12.7
  • Cluster Environment: AKS (via AKS-engine)
  • Host OS: Ubuntu 16.04.6 LTS
  • Linkerd version:
$ linkerd version
Client version: stable-2.6.0
Server version: stable-2.6.0

Possible solution

Additional context

@rocketraman
Contributor Author

rocketraman commented Oct 17, 2019

Some more information:

I used the debug sidecar to capture packets. From what I can tell, the packet capture shows the 502 Bad Gateway coming from the remote service.

However, if I remove the linkerd proxy completely and run the request again, I never see the 502 Bad Gateway at all and retries seem to work correctly.

@ihcsim
Contributor

ihcsim commented Nov 26, 2019

@rocketraman Is this issue related to the one you described on Slack, where there was thread pool exhaustion in the application code? Let me know if we can close this issue, if it's not Linkerd-related.

@rocketraman
Contributor Author

rocketraman commented Nov 26, 2019

@ihcsim No, this is a different issue, and it is linkerd-related AFAIK.

@yjiq150

yjiq150 commented Dec 13, 2019

I have the same situation here.

My server makes a lot of outbound HTTP API calls. With a freshly started linkerd it works well, but after some random amount of time those outbound API calls start coming back with 502 Bad Gateway.
(Sometimes it happens after a few days, sometimes a few hours after the server restarts.)
When I restart the server (the linkerd sidecar container is also restarted), the problem goes away, and then the same thing happens again after a while.

I am not sure whether the following logs are directly related to this issue, but they are always there when the problem happens.

linkerd2_proxy::proxy::http::canonicalize failed to refine 'yyy.xxxxx.com': deadline has elapsed; using original name

(yyy.xxxxx.com is a redacted value; the original is the domain of an HTTP API provider outside of my k8s cluster)

linkerd2_proxy::app::errors unexpected error: error trying to connect: request timed out

It has been happening continuously since I was on linkerd 2.4.0, and it is still happening on linkerd 2.6.0 now.

@cpretzer
Contributor

cpretzer commented Dec 23, 2019

@rocketraman @yjiq150 thanks for the reports. I'm working on a reproduction case now and will update when I have more information.

@cpretzer
Contributor

@rocketraman and @yjiq150 I've got a test case running now, so hopefully I can reproduce the behavior.

When you say that you're calling an external service, are you using an ExternalName Service type?

Or is your service making an http call directly to the URL of the external service?

@yjiq150

yjiq150 commented Dec 27, 2019

No, it's not an ExternalName Service from k8s. They are just plain http/https API calls to 3rd-party servers outside the k8s cluster.

@rocketraman
Contributor Author

Yes, it's a direct http call to the URL of the external service outside the k8s cluster.

@cpretzer
Contributor

@rocketraman @yjiq150 thank you for that. My test case is also making direct calls to a URL that I created. So far I've not seen the behavior described in this issue, but I'll keep testing.

@cpretzer
Contributor

cpretzer commented Jan 3, 2020

@rocketraman @yjiq150 my test has been running for a few days now and so far I haven't reproduced the error.

The external service is set to reply after 10 seconds and returns errors a percentage of the time.

I see in the original post that the application is retrying a request after a 300000 ms (300 second) timeout:

app-55db9ff5bf-sthkq app 2019-10-17T16:21:13.8160006Z 2019-10-17 16:21:13,815 WARN  --- [ty-call-handler#2156] MyApp          : Long-running call attempt 0 failed, retrying... Timed out waiting for 300000 ms [[EXCEPTION: kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 300000 ms|       ... suppressed 5 lines| at java.lang.Thread.run(Thread.java:834) [?:?]|]]

What is the duration, on average, for responses where you see this behavior?

@cpretzer
Contributor

cpretzer commented Jan 6, 2020

@rocketraman I heard from @grampelberg that you found the cause for this issue in a third-party dependency, so I'm going to close it for now.

@yjiq150 if you have more information about the scenario where you saw this behavior, please reopen the ticket with details on how to reproduce.

@cpretzer cpretzer closed this as completed Jan 6, 2020
@rocketraman
Contributor Author

I thought it was two separate issues, but yes, it is possible that the same cause is behind both of them. For reference, here is the explanation of the underlying problem, and the workaround:

#3605 (comment)

@yjiq150

yjiq150 commented Jan 30, 2020

Thank you for the update!

I investigated my issue based on the idea from your case and found that my elasticsearch-js (Node.js) client uses a connection pool with keep-alive enabled.

No eviction policy was set for the idle connections in the pool. Last week I set the max idle time for connections to the same value as net.netfilter.nf_conntrack_tcp_timeout_close_wait.
I haven't seen any more 502 Bad Gateway errors since then (yay!).
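
Roughly, the change looks like the sketch below. This is a sketch, not my exact code: it assumes the agentkeepalive package for idle-socket eviction and the elasticsearch-js `agent` option that accepts a factory function, and the endpoint is a placeholder.

import { Client } from "@elastic/elasticsearch";
import Agent from "agentkeepalive";

// Evict idle sockets after 60 s, i.e. the same value as
// net.netfilter.nf_conntrack_tcp_timeout_close_wait on the nodes.
const keepAliveAgent = new Agent({
  keepAlive: true,
  freeSocketTimeout: 60_000, // max idle time for a pooled connection
});

const client = new Client({
  node: "http://elasticsearch.example.com:9200", // placeholder endpoint
  agent: () => keepAliveAgent,                   // hand the custom agent to the underlying connection
});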

My guess at the error scenario is as follows: the idle connections in the connection pool somehow stay connected to the server, so at the application level a connection still looks alive and normal, but a request sent through that connection gets a 502 Bad Gateway error.
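
A rough way to see this, assuming the guess above is right, is something like the sketch below: hold a pooled keep-alive socket idle for longer than the close_wait window, then reuse it. The host and timings here are made up.

import http from "node:http";

// A pool with keep-alive and no idle eviction, like the default client setup.
const agent = new http.Agent({ keepAlive: true });

function get(): Promise<number> {
  return new Promise((resolve, reject) => {
    http
      .get({ host: "api.example.com", port: 80, path: "/health", agent }, (res) => {
        res.resume();                       // drain the body so the socket goes back to the pool
        resolve(res.statusCode ?? 0);
      })
      .on("error", reject);
  });
}

async function main() {
  console.log("first request:", await get());       // opens a fresh connection
  await new Promise((r) => setTimeout(r, 90_000));   // stay idle longer than the ~60 s close_wait window
  console.log("second request:", await get());       // reuses the idle socket; this is where I expect the failure
}

main().catch(console.error);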

@cpretzer What do you think about this? Shouldn't it be fine to keep an idle connection around without eviction?
(I looked into the Node.js http/socket library that elasticsearch-js uses internally; the library does not seem to have any half-close behavior.)

@cpretzer
Contributor

@yjiq150 thanks for the update.

I don't know about this particular situation, but I've seen some unexpected behavior when a client expects a connection to be open through a keepalive mechanism and a server closes the connection.

What was the duration of the net.netfilter.nf_conntrack_tcp_timeout_close_wait value that you used to reconfigure the max idle time? Is that max idle time configured in the elasticsearch-js library?

@yjiq150

yjiq150 commented Feb 1, 2020

I didn't change the net.netfilter.nf_conntrack_tcp_timeout_close_wait value in the k8s cluster, and I believe the default is 60 seconds. So I just applied an eviction configuration for connections with a 60-second max idle time.

@ihcsim
Contributor

ihcsim commented Feb 2, 2020

(I looked into the Node.js http/socket library that elasticsearch-js uses internally; the library does not seem to have any half-close behavior.)

I don't think it's only the client that can cause a half-closed connection. I'm not an ES expert, but there are a number of these ES half-closed connection issues out there: elastic/elasticsearch-php#225, https://discuss.elastic.co/t/logstash-output-elasticsearch-connection-reset-while-trying-to-send-bulk-request-to-elasticsearch/173941, https://discuss.elastic.co/t/transportclient-disconnecting-and-not-reconnecting-automatically/13791, https://stackoverflow.com/q/49844295/1144203. They sound very similar to what you are seeing, right? And they all point to different factors that can contribute to half-closed connections.

@yjiq150

yjiq150 commented Feb 3, 2020

Thanks for the additional information.

However, I still think the problem has something to do with the custom iptables rules installed by linkerd, because there was no problem at all when we used elasticsearch-js without a max-idle eviction policy and with no linkerd proxy installed.

@ericsuhong

We were facing the same issue. We were about to give up on Linkerd until we found this post!

We are using Nginx Ingress, and we increased net.netfilter.nf_conntrack_tcp_timeout_close_wait to 3600 seconds for the ingress controller pods using PodSecurityPolicy + SecurityContext.

Once we applied this fix, the 502 Bad Gateway issues completely went away.

I also believe that this issue is specific to Linkerd's implementation and iptables rules.

Can we put this information somewhere in the Linkerd documentation for debugging 502 errors? It would save users tons of debugging time and potentially keep many Linkerd users from moving away!

@grampelberg
Contributor

See #4276 to change this by default.
