
Implement gRPC-specific tracing for RPC life cycle #1986

Closed
jsha opened this issue Apr 11, 2018 · 17 comments

Labels: P2, Type: Performance (performance improvements: CPU, network, memory, etc.)
@jsha (Contributor) commented Apr 11, 2018

I'm using gRPC v1.1 and Go v1.10.1 on Linux.

I found that with a single gRPC client and a single gRPC server, when load goes above 250 concurrent requests, a lot of my RPCs are delivered late, often significantly late (90 seconds or more). I believe I was hitting the MaxConcurrentStreams limit. Adjusting that limit higher fixed the problem.

When debugging this, it was hard to gather accurate data. I'd like to generate either a stat or a log line when I hit the concurrent stream limit in the future, so I know that I need to scale up my service or adjust the limit. It looks like the relevant code is around https://github.com/grpc/grpc-go/blob/master/transport/http2_client.go#L539, and I might need to provide an accessor for the current value of t.waitingStreams in order to use it as a Prometheus gauge. What do you think of this approach to the problem? Would you be willing to accept a patch? Are there alternate approaches that you like better?
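
(For reference, a minimal sketch of how such a gauge could be wired up with the Prometheus client library, assuming a hypothetical accessor; grpc-go does not expose one today, and the names below are purely illustrative.)

package main

import "github.com/prometheus/client_golang/prometheus"

// waitingStreams stands in for the hypothetical accessor discussed above;
// grpc-go does not currently expose t.waitingStreams.
func waitingStreams() uint32 { return 0 }

func main() {
    // GaugeFunc samples the accessor on every Prometheus scrape.
    g := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
        Name: "grpc_client_waiting_streams",
        Help: "RPCs currently blocked waiting for MaxConcurrentStreams quota (illustrative).",
    }, func() float64 {
        return float64(waitingStreams())
    })
    prometheus.MustRegister(g)
}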

Thanks,
Jacob

@MakMukhi (Contributor):

MaxConcurrentStreams is a limit that the server sets for its clients by sending out an HTTP/2 settings frame.
I'm not quite sure what you mean by "when I hit the concurrent stream limit". As soon as you create at least as many streams as the server allows, you have hit the limit.
As a user you can spawn as many streams as you like, but the gRPC client needs to respect the limit set by the server, which exists to throttle clients. A server would ideally set the limit based on the resources it has available, so if your server has ample resources and you are willing to increase the limit, why set a limit in the first place?

@jsha (Contributor, Author) commented Apr 11, 2018

As soon as you create at least as many streams as the server allows, you have hit the limit.

Correct, that's exactly what I meant.

why set a limit in the first place?

Golang's HTTP/2 library sets a default limit of 250: https://github.com/golang/net/blob/master/http2/server.go#L56. gRPC inherits that limit.

I am indeed planning to increase the limit from the default. However, I'd like future visibility into whether I'm hitting that new, higher limit. I have visibility into other parts of my system, like how long RPCs are taking, via https://github.com/grpc-ecosystem/go-grpc-prometheus (which uses interceptors). But since the waiting happens after the client interceptor has already handed control to the gRPC library, this particular source of latency falls through the cracks and currently can't be measured.
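
(For context, that existing visibility comes from wiring the go-grpc-prometheus interceptors into the client, roughly as in the sketch below; the address and dial options are illustrative. The interceptor sees the RPC before it is handed to the gRPC library, which is why it can measure total call latency but not time spent waiting inside gRPC.)

package main

import (
    "log"

    grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
    "google.golang.org/grpc"
)

func main() {
    conn, err := grpc.Dial("localhost:50051", // illustrative address
        grpc.WithInsecure(), // illustrative; use real transport credentials in production
        grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
        grpc.WithStreamInterceptor(grpc_prometheus.StreamClientInterceptor),
    )
    if err != nil {
        log.Fatalf("dial: %v", err)
    }
    defer conn.Close()
}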

@MakMukhi (Contributor):

I suppose, right before this line, we could add the following logging:

if t.waitingStreams == 0 {
    info("Client is trying to create more streams than allowed by the server.")
}

What do you think about this? I'd also like to get more input on it from the rest of the team.

Out of curiosity, is gRPC-Go performance something you care about? If so, we'd love to hear what scenario your service runs in.

@jsha (Contributor, Author) commented Apr 11, 2018

The info approach seems like the simplest and fastest, and would meet our most basic visibility needs: we could set an alert to go off when that line shows up in our logs.

One similar but slightly nicer alternative would be to record the time before and after this for loop, so we could print something like fmt.Sprintf("RPC %s spent %s blocking on stream availability; check MaxConcurrentStreams", callHdr.Method, time.Since(loopBegan)). This would allow someone to immediately identify how bad the problem is (milliseconds or seconds).
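
(A sketch of that idea, with timeBlocking and acquireQuota as illustrative stand-ins rather than grpc-go internals:)

package main

import (
    "log"
    "time"
)

// timeBlocking wraps whatever blocks on stream quota, measures how long it
// blocked, and logs if the wait was noticeable. acquireQuota stands in for
// the wait loop inside the transport.
func timeBlocking(method string, acquireQuota func()) {
    loopBegan := time.Now()
    acquireQuota()
    if blocked := time.Since(loopBegan); blocked > time.Millisecond {
        log.Printf("RPC %s spent %s blocking on stream availability; check MaxConcurrentStreams",
            method, blocked)
    }
}

func main() {
    // Simulate a 50ms wait for stream quota.
    timeBlocking("/example.Service/Method", func() { time.Sleep(50 * time.Millisecond) })
}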

The most thorough intervention would allow some way to graph the effect of blocking on available streams. For instance, the client could set a header to indicate when the request was created (before any in-client blocking). A server interceptor could examine that header and compare it to the current server time to calculate an "rpc_lateness" stat: the difference between when the client wanted to make the request and when the server received it (subject to the vagaries of clock skew, of course).
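
(As an illustration, a pair of interceptors could carry the creation time in metadata. This is a sketch, not an existing feature; the "rpc-created-at" key is made up.)

package lateness

import (
    "context"
    "log"
    "strconv"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/metadata"
)

// ClientInterceptor stamps the request with its creation time before handing
// it to the gRPC library (and therefore before any in-client blocking).
func ClientInterceptor(ctx context.Context, method string, req, reply interface{},
    cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
    ctx = metadata.AppendToOutgoingContext(ctx, "rpc-created-at",
        strconv.FormatInt(time.Now().UnixNano(), 10))
    return invoker(ctx, method, req, reply, cc, opts...)
}

// ServerInterceptor compares the stamped creation time to the server clock to
// estimate rpc_lateness (subject to clock skew between the two hosts).
func ServerInterceptor(ctx context.Context, req interface{},
    info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
    if md, ok := metadata.FromIncomingContext(ctx); ok {
        if vals := md.Get("rpc-created-at"); len(vals) == 1 {
            if ns, err := strconv.ParseInt(vals[0], 10, 64); err == nil {
                log.Printf("rpc_lateness %s: %s", info.FullMethod, time.Since(time.Unix(0, ns)))
            }
        }
    }
    return handler(ctx, req)
}

(These would be wired up with grpc.WithUnaryInterceptor(lateness.ClientInterceptor) on the client and grpc.UnaryInterceptor(lateness.ServerInterceptor) on the server.)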

Out of curiosity, is gRPC-Go performance something you care about? If so, we'd love to hear what scenario your service runs in.

Yep! I work on https://github.com/letsencrypt/boulder/, the server for Let's Encrypt, a free Certificate Authority. We use gRPC extensively and care about its performance. So far, performance is great! This is the first snag we hit. We have about 7 different gRPC services. Some of those services call other services in turn. Some are moderately low-latency, like our storage service, which fronts a database and has a timeout of 5 seconds. Others are very high-latency, like our validation service, which makes requests out to the Internet and has a timeout of 90 seconds.

@jsha (Contributor, Author) commented Apr 11, 2018

Another interesting approach, perhaps even better, would be to allow the server to expose its current number of concurrent streams on a per-client basis. This would let us graph that number over time and alert once it reaches half of the limit, so that we could scale up well in advance of hitting any latency issues.
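
(For illustration, one way a server can approximate this today is with a stats.Handler that counts in-flight RPCs per client address; the type and names below are made up, and this tracks RPC begin/end events rather than reading the transport's own stream counter.)

package monitoring

import (
    "context"
    "sync"

    "google.golang.org/grpc/peer"
    "google.golang.org/grpc/stats"
)

// InFlightByClient counts how many RPCs each client address currently has in
// flight on this server. The counts could feed a Prometheus gauge and an alert
// that fires well before they approach MaxConcurrentStreams.
type InFlightByClient struct {
    mu     sync.Mutex
    counts map[string]int
}

func New() *InFlightByClient {
    return &InFlightByClient{counts: make(map[string]int)}
}

func (h *InFlightByClient) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
    return ctx
}

func (h *InFlightByClient) HandleRPC(ctx context.Context, s stats.RPCStats) {
    p, ok := peer.FromContext(ctx) // client address, when available in the RPC context
    if !ok {
        return
    }
    h.mu.Lock()
    defer h.mu.Unlock()
    switch s.(type) {
    case *stats.Begin:
        h.counts[p.Addr.String()]++
    case *stats.End:
        h.counts[p.Addr.String()]--
    }
}

func (h *InFlightByClient) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
    return ctx
}

func (h *InFlightByClient) HandleConn(context.Context, stats.ConnStats) {}

(Registered on the server with grpc.StatsHandler(monitoring.New()).)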

@MakMukhi (Contributor):

What you're describing is part of one of our goals: to have gRPC-specific tracing for the RPC life cycle. We plan to record the time an RPC spends in the various stages of its life cycle: from creation to being scheduled, from being scheduled to being written on the wire, and from there to getting a response back. We might add more events to that list, since we haven't gotten to creating a formal design yet.

Doing this also requires that normal scenarios (when this tracing is turned off) are not affected by the expense of the extra time.Now() calls.
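
(A very rough sketch of what per-stage timestamps and a cheap-when-disabled guard might look like; this is entirely illustrative, not the eventual design.)

package main

import "time"

// traceEnabled would be set once at startup, so the hot path pays only for a
// branch, not a time.Now() call, when tracing is off.
var traceEnabled bool

// rpcTrace is an illustrative container for life-cycle timestamps.
type rpcTrace struct {
    Created      time.Time // RPC created by the application
    Scheduled    time.Time // stream quota acquired / ready to be written
    WroteHeaders time.Time // headers written to the wire
    GotResponse  time.Time // first response bytes received
}

// stamp returns the current time only when tracing is enabled.
func stamp() time.Time {
    if !traceEnabled {
        return time.Time{}
    }
    return time.Now()
}

func main() {
    tr := rpcTrace{Created: stamp()}
    // Later stages would record tr.Scheduled = stamp(), and so on.
    _ = tr
}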

We can use this issue to track progress on this goal.

On the server side, we could have similar info logs (as mentioned in the previous comment) printed out whenever the settings for a connection are updated.

That's great to know. We are actively working on improving performance, and feedback is always welcome.

Additionally, looking at (part of) your code, it looks like you are using gRPC's native server implementation, which defaults to math.MaxUint32 for MaxConcurrentStreams. How does your client get throttled to the Go HTTP/2 server's default limit?

@MakMukhi changed the title from "Monitor blocking on stream availability" to "Implement gRPC-specific tracing for RPC life cycle" Apr 12, 2018
@MakMukhi added the P1 and "Type: Feature" labels and removed the "Status: Requires Reporter Clarification" and "Type: Question" labels Apr 12, 2018
@jsha (Contributor, Author) commented Apr 12, 2018

Additionally, looking at (part of) your code, it looks like you are using gRPC's native server implementation, which defaults to math.MaxUint32 for MaxConcurrentStreams. How does your client get throttled to the Go HTTP/2 server's default limit?

Interesting. I didn't consciously choose a native server implementation vs a non-native one. What controls that? Is that something that's changed since v1.1?

@MakMukhi (Contributor):

It's not new. If you look at the example code in our repo, you'll see that gRPC servers are launched by calling Serve() on them. That uses gRPC's native implementation of the underlying HTTP/2 transport.
Looks like your code does the same thing.
Why then, I wonder, do you see a smaller limit? When you say you increased the MaxConcurrentStreams limit, how did you do that?

One thing I should mention: by default, gRPC clients start with a limit of 100; this limit is overridden when the server sends its first settings frame (part of the handshake process).

@jsha (Contributor, Author) commented Apr 12, 2018

Ah, I think there may be a bug with setting maxStreams to MaxUint32:

maxStreams := config.MaxStreams
if maxStreams == 0 {
    maxStreams = math.MaxUint32
} else {
    isettings = append(isettings, http2.Setting{
        ID:  http2.SettingMaxConcurrentStreams,
        Val: maxStreams,
    })
}

Note that in the default case, maxStreams is set to MaxUint32, but that value is not sent in a settings frame. That works fine if you assume the underlying HTTP/2 library isn't sending a settings frame for MaxConcurrentStreams, but in this case the Go library is sending a default value of 250.

@MakMukhi (Contributor):

Do you have a proxy between your server and client? How did you increase this limit?

@jsha (Contributor, Author) commented Apr 12, 2018

There's no proxy. Here's a branch where I set up a toy client and server to test the behavior: https://github.com/letsencrypt/boulder/compare/chillcli. At first I used it with the default limit and verified the blocking behavior. Then I used grpc.NewServer(..., grpc.MaxConcurrentStreams(1000), ...) to increase the limit, and verified that it fixed the blocking behavior (so long as concurrent requests stayed below 1000).
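
(For anyone following along, raising the limit looks roughly like this; the address and value are illustrative, and service registration is omitted.)

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
)

func main() {
    lis, err := net.Listen("tcp", ":50051") // illustrative address
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    // Advertise a higher stream limit to clients; 1000 matches the value used above.
    s := grpc.NewServer(grpc.MaxConcurrentStreams(1000))
    // pb.RegisterSomethingServer(s, &server{}) would go here.
    if err := s.Serve(lis); err != nil {
        log.Fatalf("serve: %v", err)
    }
}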

@MakMukhi (Contributor):

Ah, man! OK, so this is what's happening:

  1. The version of gRPC-Go that you are using is very old and had a bug such that the client would default to 100 streams and not update this value even after receiving the first settings frame from the server, unless the server explicitly set that setting.

  2. I'm sorry, I got confused when you referred to code from head and thought that you were perhaps using a newer version. I realize now that you did mention using v1.1 a couple of times.

  3. Moreover, that old version has a vastly different implementation of the underlying transport. We have made several major performance improvements to it since then. I'd recommend trying out our latest release. The most recent performance improvement, however, will go out in v1.12. You're more than welcome to try it out at head.

  4. Also, feedback is always appreciated. :)

@jsha (Contributor, Author) commented Apr 12, 2018

Excellent, thanks for the explanation! I will work on upgrading our gRPC dependency. We'll need to do a little work since there was a change to balancers and/or certificate handling that broke our balancer/validator. I'll check back in once I've landed the fix.

@jsha (Contributor, Author) commented Apr 24, 2018

FYI, I upgraded our gRPC dependency, and used the sample client/server I linked earlier. I can confirm that even without setting MaxConcurrentStreams explicitly, I can run a very large number of RPCs concurrently, suggesting that MaxConcurrentStreams is defaulting to MaxUint32 as expected. Thanks for the help! I'll leave this ticket open to track the topic of tracing.

@MakMukhi (Contributor):

Glad that worked.

@dfawley added the P2 label and removed the P1 label Apr 26, 2018
@dfawley added the "Type: Performance" label and removed the "Type: Feature" label May 10, 2019
@stale bot added the stale label Sep 6, 2019
@dfawley removed the stale label Sep 6, 2019
@adtac (Contributor) commented Nov 7, 2019

I should probably be assigned this

@menghanl (Contributor) commented May 3, 2021

Closing due to lack of activity and priority.

@menghanl closed this as completed May 3, 2021
@github-actions bot locked as resolved and limited conversation to collaborators Oct 31, 2021