
gRPC traffic is not retried because there is no content-length header #7141

Closed
cpretzer opened this issue Oct 22, 2021 · 0 comments · Fixed by linkerd/linkerd2-proxy#1341
@cpretzer (Contributor)

What is the issue?

While testing gRPC retries with the emojivoto application, I found that gRPC requests between the web and voting services are never retried. The proxy's retry logic uses the content-length header to decide whether a POST request is retryable, but the gRPC protocol does not include a content-length header.
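To make the failure mode concrete, here is a hedged sketch (illustrative only, not the actual linkerd2-proxy code; the function and constant names are hypothetical) of the retry predicate described in this issue: a request with a body is only considered retryable when a content-length header is present and within the buffer limit, so gRPC requests, which carry a body but no content-length, always fall through to "not retryable".

```rust
// Hypothetical sketch of the old retry predicate, assuming a 64 KB
// buffering limit as described later in this issue.
const MAX_BUFFERED_BODY: u64 = 64 * 1024;

fn can_retry(has_body: bool, content_length: Option<u64>) -> bool {
    if !has_body {
        // Bodiless requests can always be replayed.
        return true;
    }
    match content_length {
        // Known length: retryable only if it fits in the buffer.
        Some(len) => len <= MAX_BUFFERED_BODY,
        // No content-length header: assume the body is too large.
        // gRPC requests always land here.
        None => false,
    }
}

fn main() {
    // Mirrors the trace line below: has_body=true, content_length=None.
    assert!(!can_retry(true, None));
    assert!(can_retry(true, Some(1024)));
    assert!(!can_retry(true, Some(1024 * 1024)));
}
```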

How can it be reproduced?

  1. Deploy Linkerd 2.11 to a cluster (assumes the CLI is installed): linkerd install | kubectl apply -f -
  2. Deploy emojivoto: kubectl apply -f https://run.linkerd.io/emojivoto.yml
  3. Scale vote-bot to 0 replicas so that only the VoteDoughnut route is exercised: kubectl scale -n emojivoto deploy/vote-bot --replicas=0
  4. Deploy the voting ServiceProfile with isRetryable enabled for all routes: kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/emojivoto/main/training/service-profiles/voting-svc-profile.yml
  5. Set the proxy-log-level to linkerd=trace,info for the web pod (see the Linkerd docs on setting the proxy log level)
  6. Port-forward to the web service: kubectl port-forward svc/web -n emojivoto 8080:80
  7. Make a request to vote for the doughnut: curl "http://localhost:8080/api/vote?choice=:doughnut:"
  8. View the proxy logs for the web pod; you will see output like the excerpt below

Logs, error output, etc

2021-10-21T17:33:24.352503923Z [   564.043062s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_stack_tracing: service ready=true ok=true
2021-10-21T17:33:24.352518037Z [   564.043072s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://voting-svc.emojivoto:8080/emojivoto.v1.VotingService/VoteDoughnut, version: HTTP/2.0, headers: {"content-type": "application/grpc", "user-agent": "grpc-go/1.29.1", "te": "trailers", "grpc-trace-bin": "AACG+V5sxnQNK5cC/n1VpjOjAU3DqwK1qCUqAgA"}, body: BoxBody }
2021-10-21T17:33:24.352667528Z [   564.043231s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_service_profiles::http::route_request: Using configured route condition=All([Method(POST), Path(^/emojivoto\.v1\.VotingService/VoteDoughnut$)])
2021-10-21T17:33:24.352675121Z [   564.043242s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_retry: retryable=true
2021-10-21T17:33:24.352679109Z [   564.043246s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_app_core::retry: not retryable req.has_body=true req.content_length=None
2021-10-21T17:33:24.352696058Z [   564.043270s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}:concrete{addr=voting-svc.emojivoto.svc.cluster.local:8080}:endpoint{server.addr=10.42.0.75:8080}: linkerd_reconnect: Ready

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check

Status check results are √

Environment

  • Kubernetes Version: v1.21.3+k3s1
  • Cluster Environment: k3d
  • Host OS: PopOS
  • Linkerd version: 2.11

Possible solution

I'm still researching the gRPC protocol, so I don't have a proposed solution yet. After discussing with the Linkerd maintainers, there are a few things to take into consideration for a solution:

  • The current implementation targets POST requests, which do have a content-length header, so the solution should preserve support for POST bodies
  • gRPC calls can be unary or streaming, which the solution should take into account
  • The gRPC protocol includes a Message-Length field as part of a DATA frame, but there may be a more reliable way to cover both HTTP/1 POST requests and gRPC requests
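For context on the Message-Length field mentioned above: the gRPC over HTTP/2 protocol length-prefixes each message inside a DATA frame with a 1-byte compressed flag and a 4-byte big-endian Message-Length. The sketch below (illustrative names, not linkerd2-proxy code) shows that framing:

```rust
// Encode a single gRPC length-prefixed message frame:
// [compressed: u8][message-length: u32 big-endian][message bytes].
fn encode_grpc_frame(msg: &[u8], compressed: bool) -> Vec<u8> {
    let mut frame = Vec::with_capacity(5 + msg.len());
    frame.push(compressed as u8);
    frame.extend_from_slice(&(msg.len() as u32).to_be_bytes());
    frame.extend_from_slice(msg);
    frame
}

// Read the Message-Length back out of the 5-byte prefix, if present.
fn decode_message_length(frame: &[u8]) -> Option<u32> {
    if frame.len() < 5 {
        return None;
    }
    Some(u32::from_be_bytes([frame[1], frame[2], frame[3], frame[4]]))
}

fn main() {
    let frame = encode_grpc_frame(b"hello", false);
    assert_eq!(frame.len(), 10); // 5-byte prefix + 5-byte payload
    assert_eq!(decode_message_length(&frame), Some(5));
    assert_eq!(decode_message_length(&[0u8; 3]), None);
}
```

This is why the per-message length lives in the body stream rather than in a header: a proxy cannot know the total body length up front for a streaming call, which is consistent with gRPC omitting content-length.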
@olix0r olix0r added this to the stable-2.12.0 milestone Oct 22, 2021
hawkw added a commit to linkerd/linkerd2-proxy that referenced this issue Oct 29, 2021
Currently, the proxy will retry requests with bodies if and only if they
have a `Content-Length` header with a value <= 64 KB. If the request has
a body but no `Content-Length` header, we currently assume that its body
will exceed the maximum buffered size, and skip trying to retry it.
However, this means gRPC requests will never be retried, because it
turns out gRPC requests don't include a `Content-Length` header (see
linkerd/linkerd2#7141). Whoops!

This PR fixes this by changing the retry logic so that `Content-Length`
is treated as a _hint_, rather than a requirement. Now, we will
preemptively skip retrying requests with bodies when there is a
`Content-Length` header with a value that exceeds the limit, but if no
`Content-Length` header is present, we will still attempt to buffer the
body for a potential retry.

We are still protected against unbounded buffering because the buffering
body will stop buffering and discard any previously buffered data if the
buffered length ever exceeds the maximum. Requiring a `Content-Length`
simply allowed us to skip _trying_.

This should probably fix linkerd/linkerd2#7141, although it might be
worth testing with a non-Rust gRPC implementation to make sure it works?
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Nov 1, 2021
Currently, the proxy will retry requests with bodies if and only if they
have a `content-length` header with a value <= 64 KB. If the request has
a body but no `content-length` header, we currently assume that its body
will exceed the maximum buffered size, and skip trying to retry it.
However, this means gRPC requests will never be retried, because it
turns out gRPC requests don't include a `content-length` header (see
linkerd/linkerd2#7141). Whoops!

This PR fixes this by changing the retry logic to use `Body::size_hint` to
determine if buffering should be attempted. This value will be set from
`content-length` when it is set and may be set in additional situations
where the body length is known before the request is processed.

We are still protected against unbounded buffering because the buffering
body will stop buffering and discard any previously buffered data if the
buffered length ever exceeds the maximum.

Co-authored-by: Oliver Gould <[email protected]>
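The fixed behavior described in the commit message above can be sketched as follows (a hedged, self-contained illustration under assumed names, not the actual linkerd2-proxy implementation): the size hint, derived from content-length when it is set, is used only to skip bodies that are known to be too large; an unknown length no longer disqualifies the request, and the buffering body itself enforces the cap.

```rust
const MAX_BUFFERED_BODY: usize = 64 * 1024;

// Decide whether to attempt buffering, treating the size hint
// (content-length, when known) as a hint rather than a requirement.
fn should_attempt_buffering(size_hint_upper: Option<usize>) -> bool {
    match size_hint_upper {
        // Known to exceed the limit: skip preemptively.
        Some(upper) if upper > MAX_BUFFERED_BODY => false,
        // Known-small or unknown length (e.g. gRPC): try to buffer.
        _ => true,
    }
}

// Buffer body chunks, discarding everything if the cap is exceeded;
// a None result means the request is no longer retryable.
fn buffer_body(chunks: &[&[u8]]) -> Option<Vec<u8>> {
    let mut buf = Vec::new();
    for chunk in chunks {
        buf.extend_from_slice(chunk);
        if buf.len() > MAX_BUFFERED_BODY {
            return None;
        }
    }
    Some(buf)
}

fn main() {
    // gRPC request: no content-length, so buffering is now attempted.
    assert!(should_attempt_buffering(None));
    // A 1 MB content-length is still skipped up front.
    assert!(!should_attempt_buffering(Some(1 << 20)));
    // Small bodies buffer successfully; oversized ones are discarded.
    assert!(buffer_body(&[b"small body"]).is_some());
    assert!(buffer_body(&[&[0u8; 70 * 1024]]).is_none());
}
```

The key design point is that the unbounded-buffering protection lives in the buffering body itself, so the header-based check can safely become best-effort.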
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 2, 2021