
gRPC traffic is not retried because there is no content-length header #7141

Closed
cpretzer opened this issue Oct 22, 2021 · 0 comments · Fixed by linkerd/linkerd2-proxy#1341
@cpretzer (Contributor)

What is the issue?

While testing gRPC retries with the emojivoto application, I found that gRPC requests between the web and voting services are never retried. The proxy's retry logic uses the content-length header to decide whether a POST request is retryable, but the gRPC protocol does not include a content-length header.
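To make the failure mode concrete, here is a hedged sketch (illustrative only, not the actual linkerd2-proxy code; the function and constant names are hypothetical) of the retry predicate described in this issue: a request with a body is only considered retryable when a content-length header is present and within the buffer limit, so gRPC requests, which carry a body but no content-length, always fall through to "not retryable".

```rust
// Hypothetical sketch of the old retry predicate, assuming a 64 KB
// buffering limit as described later in this issue.
const MAX_BUFFERED_BODY: u64 = 64 * 1024;

fn can_retry(has_body: bool, content_length: Option<u64>) -> bool {
    if !has_body {
        // Bodiless requests can always be replayed.
        return true;
    }
    match content_length {
        // Known length: retryable only if it fits in the buffer.
        Some(len) => len <= MAX_BUFFERED_BODY,
        // No content-length header: assume the body is too large.
        // gRPC requests always land here.
        None => false,
    }
}

fn main() {
    // Mirrors the trace line below: has_body=true, content_length=None.
    assert!(!can_retry(true, None));
    assert!(can_retry(true, Some(1024)));
    assert!(!can_retry(true, Some(1024 * 1024)));
}
```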

How can it be reproduced?

  1. Deploy Linkerd 2.11 to a cluster (assumes the CLI is installed): linkerd install | kubectl apply -f -
  2. Deploy emojivoto: kubectl apply -f https://run.linkerd.io/emojivoto.yml
  3. Scale vote-bot to 0 replicas so that only the VoteDoughnut route is exercised: kubectl scale -n emojivoto deploy/vote-bot --replicas=0
  4. Deploy the voting ServiceProfile with isRetryable enabled for all routes: kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/emojivoto/main/training/service-profiles/voting-svc-profile.yml
  5. Set the proxy-log-level to linkerd=trace,info for the web pod (see the Linkerd docs on setting the proxy log level)
  6. Port-forward to the web service: kubectl port-forward svc/web -n emojivoto 8080:80
  7. Make a request to vote for the doughnut: curl "http://localhost:8080/api/vote?choice=:doughnut:"
  8. View the proxy logs for the web pod; you will see output like the excerpt below

Logs, error output, etc

2021-10-21T17:33:24.352503923Z [   564.043062s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_stack_tracing: service ready=true ok=true
2021-10-21T17:33:24.352518037Z [   564.043072s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://voting-svc.emojivoto:8080/emojivoto.v1.VotingService/VoteDoughnut, version: HTTP/2.0, headers: {"content-type": "application/grpc", "user-agent": "grpc-go/1.29.1", "te": "trailers", "grpc-trace-bin": "AACG+V5sxnQNK5cC/n1VpjOjAU3DqwK1qCUqAgA"}, body: BoxBody }
2021-10-21T17:33:24.352667528Z [   564.043231s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_service_profiles::http::route_request: Using configured route condition=All([Method(POST), Path(^/emojivoto\.v1\.VotingService/VoteDoughnut$)])
2021-10-21T17:33:24.352675121Z [   564.043242s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_retry: retryable=true
2021-10-21T17:33:24.352679109Z [   564.043246s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_app_core::retry: not retryable req.has_body=true req.content_length=None
2021-10-21T17:33:24.352696058Z [   564.043270s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}:concrete{addr=voting-svc.emojivoto.svc.cluster.local:8080}:endpoint{server.addr=10.42.0.75:8080}: linkerd_reconnect: Ready

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check

Status check results are √

Environment

  • Kubernetes Version: v1.21.3+k3s1
  • Cluster Environment: k3d
  • Host OS: PopOS
  • Linkerd version: 2.11

Possible solution

I'm still researching the gRPC protocol, so I don't have a proposed solution yet. After discussing with the Linkerd maintainers, there are a few things to take into consideration for a solution:

  • The current implementation targets POST requests, which do have a content-length header, so the solution should preserve support for POST bodies
  • gRPC calls can be unary or streaming, which the solution should take into account
  • The gRPC protocol includes a Message-Length field as part of a DATA frame, but there may be a more reliable way to cover both HTTP/1 POST requests and gRPC requests
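For context on the Message-Length field mentioned above: the gRPC over HTTP/2 protocol length-prefixes each message inside a DATA frame with a 1-byte compressed flag and a 4-byte big-endian Message-Length. The sketch below (illustrative names, not linkerd2-proxy code) shows that framing:

```rust
// Encode a single gRPC length-prefixed message frame:
// [compressed: u8][message-length: u32 big-endian][message bytes].
fn encode_grpc_frame(msg: &[u8], compressed: bool) -> Vec<u8> {
    let mut frame = Vec::with_capacity(5 + msg.len());
    frame.push(compressed as u8);
    frame.extend_from_slice(&(msg.len() as u32).to_be_bytes());
    frame.extend_from_slice(msg);
    frame
}

// Read the Message-Length back out of the 5-byte prefix, if present.
fn decode_message_length(frame: &[u8]) -> Option<u32> {
    if frame.len() < 5 {
        return None;
    }
    Some(u32::from_be_bytes([frame[1], frame[2], frame[3], frame[4]]))
}

fn main() {
    let frame = encode_grpc_frame(b"hello", false);
    assert_eq!(frame.len(), 10); // 5-byte prefix + 5-byte payload
    assert_eq!(decode_message_length(&frame), Some(5));
    assert_eq!(decode_message_length(&[0u8; 3]), None);
}
```

This is why the per-message length lives in the body stream rather than in a header: a proxy cannot know the total body length up front for a streaming call, which is consistent with gRPC omitting content-length.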
@olix0r olix0r added this to the stable-2.12.0 milestone Oct 22, 2021
hawkw added a commit to linkerd/linkerd2-proxy that referenced this issue Oct 29, 2021
Currently, the proxy will retry requests with bodies if and only if they
have a `Content-Length` header with a value <= 64 KB. If the request has
a body but no `Content-Length` header, we currently assume that its body
will exceed the maximum buffered size, and skip trying to retry it.
However, this means gRPC requests will never be retried, because it
turns out gRPC requests don't include a `Content-Length` header (see
linkerd/linkerd2#7141). Whoops!

This PR fixes this by changing the retry logic so that `Content-Length`
is treated as a _hint_, rather than a requirement. Now, we will
preemptively skip retrying requests with bodies when there is a
`Content-Length` header with a value that exceeds the limit, but if no
`Content-Length` header is present, we will still attempt to buffer the
body for a potential retry.

We are still protected against unbounded buffering because the buffering
body will stop buffering and discard any previously buffered data if the
buffered length ever exceeds the maximum. Requiring a `Content-Length`
simply allowed us to skip _trying_.

This should probably fix linkerd/linkerd2#7141, although it might be
worth testing with a non-Rust gRPC implementation to make sure it works?
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Nov 1, 2021
Currently, the proxy will retry requests with bodies if and only if they
have a `content-length` header with a value <= 64 KB. If the request has
a body but no `content-length` header, we currently assume that its body
will exceed the maximum buffered size, and skip trying to retry it.
However, this means gRPC requests will never be retried, because it
turns out gRPC requests don't include a `content-length` header (see
linkerd/linkerd2#7141). Whoops!

This PR fixes this by changing the retry logic to use `Body::size_hint` to
determine if buffering should be attempted. This value will be set from
`content-length` when it is set and may be set in additional situations
where the body length is known before the request is processed.

We are still protected against unbounded buffering because the buffering
body will stop buffering and discard any previously buffered data if the
buffered length ever exceeds the maximum.

Co-authored-by: Oliver Gould <[email protected]>
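The fixed behavior described in the commit message above can be sketched as follows (a hedged, self-contained illustration under assumed names, not the actual linkerd2-proxy implementation): the size hint, derived from content-length when it is set, is used only to skip bodies that are known to be too large; an unknown length no longer disqualifies the request, and the buffering body itself enforces the cap.

```rust
const MAX_BUFFERED_BODY: usize = 64 * 1024;

// Decide whether to attempt buffering, treating the size hint
// (content-length, when known) as a hint rather than a requirement.
fn should_attempt_buffering(size_hint_upper: Option<usize>) -> bool {
    match size_hint_upper {
        // Known to exceed the limit: skip preemptively.
        Some(upper) if upper > MAX_BUFFERED_BODY => false,
        // Known-small or unknown length (e.g. gRPC): try to buffer.
        _ => true,
    }
}

// Buffer body chunks, discarding everything if the cap is exceeded;
// a None result means the request is no longer retryable.
fn buffer_body(chunks: &[&[u8]]) -> Option<Vec<u8>> {
    let mut buf = Vec::new();
    for chunk in chunks {
        buf.extend_from_slice(chunk);
        if buf.len() > MAX_BUFFERED_BODY {
            return None;
        }
    }
    Some(buf)
}

fn main() {
    // gRPC request: no content-length, so buffering is now attempted.
    assert!(should_attempt_buffering(None));
    // A 1 MB content-length is still skipped up front.
    assert!(!should_attempt_buffering(Some(1 << 20)));
    // Small bodies buffer successfully; oversized ones are discarded.
    assert!(buffer_body(&[b"small body"]).is_some());
    assert!(buffer_body(&[&[0u8; 70 * 1024]]).is_none());
}
```

The key design point is that the unbounded-buffering protection lives in the buffering body itself, so the header-based check can safely become best-effort.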
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 2, 2021