
Implement gRPC-specific tracing for RPC life cycle #1986

Closed
jsha opened this issue Apr 11, 2018 · 17 comments

Labels: P2, Type: Performance (performance improvements: CPU, network, memory, etc.)
@jsha (Contributor) commented Apr 11, 2018

I'm using gRPC v1.1 and Go v1.10.1 on Linux.

I found that with a single gRPC client and a single gRPC server, when load goes above 250 concurrent requests, a lot of my RPCs are delivered late, often significantly late (90 seconds or more). I believe I was hitting the MaxConcurrentStreams limit. Adjusting that limit higher fixed the problem.

When debugging this, it was hard to gather accurate data. I'd like to generate either a stat or a log line when I hit the concurrent stream limit in the future, so I know that I need to scale up my service or adjust the limit. It looks like the relevant code is around https://github.com/grpc/grpc-go/blob/master/transport/http2_client.go#L539, and I might need to provide an accessor for the current value of t.waitingStreams in order to use it as a Prometheus gauge. What do you think of this approach to the problem? Would you be willing to accept a patch? Are there alternate approaches that you like better?
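
(For reference, a minimal sketch of how such a gauge could be wired up with the Prometheus client library, assuming a hypothetical accessor; grpc-go does not expose one today, and the names below are purely illustrative.)

package main

import "github.com/prometheus/client_golang/prometheus"

// waitingStreams stands in for the hypothetical accessor discussed above;
// grpc-go does not currently expose t.waitingStreams.
func waitingStreams() uint32 { return 0 }

func main() {
    // GaugeFunc samples the accessor on every Prometheus scrape.
    g := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
        Name: "grpc_client_waiting_streams",
        Help: "RPCs currently blocked waiting for MaxConcurrentStreams quota (illustrative).",
    }, func() float64 {
        return float64(waitingStreams())
    })
    prometheus.MustRegister(g)
}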

Thanks,
Jacob

@MakMukhi (Contributor):

MaxConcurrentStreams is a limit that the server sets for its clients by sending out an HTTP/2 settings frame.
I'm not quite sure what you mean by "when I hit the concurrent stream limit". As soon as you create at least as many streams as the server allows, you have hit the limit.
As a user you can spawn as many streams as you like, but the gRPC client needs to respect the limit set by the server, which exists to throttle clients. A server would ideally set the limit based on the resources it has available, so if your server has ample resources and you are willing to increase the limit, why set a limit in the first place?

@jsha (Contributor, Author) commented Apr 11, 2018

As soon as you create at least as many streams as the server allows, you have hit the limit.

Correct, that's exactly what I meant.

why set a limit in the first place?

Golang's HTTP/2 library sets a default limit of 250: https://github.com/golang/net/blob/master/http2/server.go#L56. gRPC inherits that limit.

I am indeed planning to increase the limit from the default. However, I'd like future visibility into whether I'm hitting that new, higher limit. I have visibility into other parts of my system, like how long RPCs are taking, via https://github.com/grpc-ecosystem/go-grpc-prometheus (which uses interceptors). But since the waiting happens after the client interceptor has already handed control to the gRPC library, this particular source of latency falls through the cracks and currently can't be measured.
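
(For context, that existing visibility comes from wiring the go-grpc-prometheus interceptors into the client, roughly as in the sketch below; the address and dial options are illustrative. The interceptor sees the RPC before it is handed to the gRPC library, which is why it can measure total call latency but not time spent waiting inside gRPC.)

package main

import (
    "log"

    grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
    "google.golang.org/grpc"
)

func main() {
    conn, err := grpc.Dial("localhost:50051", // illustrative address
        grpc.WithInsecure(), // illustrative; use real transport credentials in production
        grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
        grpc.WithStreamInterceptor(grpc_prometheus.StreamClientInterceptor),
    )
    if err != nil {
        log.Fatalf("dial: %v", err)
    }
    defer conn.Close()
}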

@MakMukhi (Contributor):

I suppose, right before this line, we could add the following logging:

if t.waitingStreams == 0 {
    info("Client is trying to create more streams than allowed by the server.")
}

What do you think about this? I'd also like to get more input on it from the rest of the team.

Out of curiosity, is gRPC-Go performance something you care about? If so, we'd love to hear what scenario your service runs in.

@jsha (Contributor, Author) commented Apr 11, 2018

The info approach seems like the simplest and fastest, and would meet our most basic visibility needs: we could set an alert to go off when that line shows up in our logs.

One similar but slightly nicer alternative would be to record the time before and after this for loop, so we could print something like fmt.Sprintf("RPC %s spent %s blocking on stream availability; check MaxConcurrentStreams", callHdr.Method, time.Since(loopBegan)). This would allow someone to immediately identify how bad the problem is (milliseconds or seconds).
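
(A sketch of that idea, with timeBlocking and acquireQuota as illustrative stand-ins rather than grpc-go internals:)

package main

import (
    "log"
    "time"
)

// timeBlocking wraps whatever blocks on stream quota, measures how long it
// blocked, and logs if the wait was noticeable. acquireQuota stands in for
// the wait loop inside the transport.
func timeBlocking(method string, acquireQuota func()) {
    loopBegan := time.Now()
    acquireQuota()
    if blocked := time.Since(loopBegan); blocked > time.Millisecond {
        log.Printf("RPC %s spent %s blocking on stream availability; check MaxConcurrentStreams",
            method, blocked)
    }
}

func main() {
    // Simulate a 50ms wait for stream quota.
    timeBlocking("/example.Service/Method", func() { time.Sleep(50 * time.Millisecond) })
}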

The most thorough intervention would allow some way to graph the effect of blocking on available streams. For instance, the client could set a header to indicate when the request was created (before any in-client blocking). A server interceptor could examine that header and compare it to the current server time to calculate an "rpc_lateness" stat: the difference between when the client wanted to make the request and when the server received it (subject to the vagaries of clock skew, of course).
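
(As an illustration, a pair of interceptors could carry the creation time in metadata. This is a sketch, not an existing feature; the "rpc-created-at" key is made up.)

package lateness

import (
    "context"
    "log"
    "strconv"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/metadata"
)

// ClientInterceptor stamps the request with its creation time before handing
// it to the gRPC library (and therefore before any in-client blocking).
func ClientInterceptor(ctx context.Context, method string, req, reply interface{},
    cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
    ctx = metadata.AppendToOutgoingContext(ctx, "rpc-created-at",
        strconv.FormatInt(time.Now().UnixNano(), 10))
    return invoker(ctx, method, req, reply, cc, opts...)
}

// ServerInterceptor compares the stamped creation time to the server clock to
// estimate rpc_lateness (subject to clock skew between the two hosts).
func ServerInterceptor(ctx context.Context, req interface{},
    info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
    if md, ok := metadata.FromIncomingContext(ctx); ok {
        if vals := md.Get("rpc-created-at"); len(vals) == 1 {
            if ns, err := strconv.ParseInt(vals[0], 10, 64); err == nil {
                log.Printf("rpc_lateness %s: %s", info.FullMethod, time.Since(time.Unix(0, ns)))
            }
        }
    }
    return handler(ctx, req)
}

(These would be wired up with grpc.WithUnaryInterceptor(lateness.ClientInterceptor) on the client and grpc.UnaryInterceptor(lateness.ServerInterceptor) on the server.)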

Out of curiosity, is gRPC-Go performance something you care about? If so, we'd love to hear what scenario your service runs in.

Yep! I work on https://github.com/letsencrypt/boulder/, the server for Let's Encrypt, a free Certificate Authority. We use gRPC extensively and care about its performance. So far, performance is great! This is the first snag we hit. We have about 7 different gRPC services. Some of those services call other services in turn. Some are moderately low-latency, like our storage service, which fronts a database and has a timeout of 5 seconds. Others are very high-latency, like our validation service, which makes requests out to the Internet and has a timeout of 90 seconds.

@jsha (Contributor, Author) commented Apr 11, 2018

Another interesting approach, perhaps even better, would be to allow the server to expose its current number of concurrent streams on a per-client basis. This would let us graph that number over time and alert once it reaches half of the limit, so that we could scale up well in advance of hitting any latency issues.
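
(For illustration, one way a server can approximate this today is with a stats.Handler that counts in-flight RPCs per client address; the type and names below are made up, and this tracks RPC begin/end events rather than reading the transport's own stream counter.)

package monitoring

import (
    "context"
    "sync"

    "google.golang.org/grpc/peer"
    "google.golang.org/grpc/stats"
)

// InFlightByClient counts how many RPCs each client address currently has in
// flight on this server. The counts could feed a Prometheus gauge and an alert
// that fires well before they approach MaxConcurrentStreams.
type InFlightByClient struct {
    mu     sync.Mutex
    counts map[string]int
}

func New() *InFlightByClient {
    return &InFlightByClient{counts: make(map[string]int)}
}

func (h *InFlightByClient) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
    return ctx
}

func (h *InFlightByClient) HandleRPC(ctx context.Context, s stats.RPCStats) {
    p, ok := peer.FromContext(ctx) // client address, when available in the RPC context
    if !ok {
        return
    }
    h.mu.Lock()
    defer h.mu.Unlock()
    switch s.(type) {
    case *stats.Begin:
        h.counts[p.Addr.String()]++
    case *stats.End:
        h.counts[p.Addr.String()]--
    }
}

func (h *InFlightByClient) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
    return ctx
}

func (h *InFlightByClient) HandleConn(context.Context, stats.ConnStats) {}

(Registered on the server with grpc.StatsHandler(monitoring.New()).)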

@MakMukhi (Contributor):

What you're describing is part of one of our goals: to have gRPC-specific tracing for the RPC life cycle. We plan to record the time an RPC spends in the various stages of its life cycle: from creation to being scheduled, from being scheduled to being written on the wire, and from there to getting a response back. We might add more events to that list, since we haven't gotten to creating a formal design yet.

Doing this also requires that normal scenarios (when this tracing is turned off) are not affected by the expense of the extra time.Now() calls.
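
(A very rough sketch of what per-stage timestamps and a cheap-when-disabled guard might look like; this is entirely illustrative, not the eventual design.)

package main

import "time"

// traceEnabled would be set once at startup, so the hot path pays only for a
// branch, not a time.Now() call, when tracing is off.
var traceEnabled bool

// rpcTrace is an illustrative container for life-cycle timestamps.
type rpcTrace struct {
    Created      time.Time // RPC created by the application
    Scheduled    time.Time // stream quota acquired / ready to be written
    WroteHeaders time.Time // headers written to the wire
    GotResponse  time.Time // first response bytes received
}

// stamp returns the current time only when tracing is enabled.
func stamp() time.Time {
    if !traceEnabled {
        return time.Time{}
    }
    return time.Now()
}

func main() {
    tr := rpcTrace{Created: stamp()}
    // Later stages would record tr.Scheduled = stamp(), and so on.
    _ = tr
}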

We can use this issue to track progress on this goal.

On the server side, we could have similar info logs (as mentioned in the previous comment) printed out whenever the settings for a connection are updated.

That's great to know. We are actively working on improving performance, and feedback is always welcome.

Additionally, looking at (part of) your code, it looks like you are using gRPC's native server implementation, which defaults to math.MaxUint32 for MaxConcurrentStreams. How does your client get throttled to the Go HTTP/2 server's default limit?

@MakMukhi changed the title from "Monitor blocking on stream availability" to "Implement gRPC-specific tracing for RPC life cycle" Apr 12, 2018
@MakMukhi added the P1 and "Type: Feature" labels and removed the "Status: Requires Reporter Clarification" and "Type: Question" labels Apr 12, 2018
@jsha (Contributor, Author) commented Apr 12, 2018

Additionally, looking at (part of) your code, it looks like you are using gRPC's native server implementation, which defaults to math.MaxUint32 for MaxConcurrentStreams. How does your client get throttled to the Go HTTP/2 server's default limit?

Interesting. I didn't consciously choose a native server implementation vs a non-native one. What controls that? Is that something that's changed since v1.1?

@MakMukhi (Contributor):

It's not new. If you look at the example code in our repo, you'll see that gRPC servers are launched by calling Serve() on them. That uses gRPC's native implementation of the underlying HTTP/2 transport.
Looks like your code does the same thing.
Why then, I wonder, do you see a smaller limit? When you say you increased the MaxConcurrentStreams limit, how did you do that?

One thing I should mention: by default, gRPC clients start with a limit of 100; this limit is overridden when the server sends its first settings frame (part of the handshake process).

@jsha (Contributor, Author) commented Apr 12, 2018

Ah, I think there may be a bug with setting maxStreams to MaxUint32:

maxStreams := config.MaxStreams
if maxStreams == 0 {
    maxStreams = math.MaxUint32
} else {
    isettings = append(isettings, http2.Setting{
        ID:  http2.SettingMaxConcurrentStreams,
        Val: maxStreams,
    })
}

Note that in the default case, maxStreams is set to MaxUint32, but that value is not sent in a settings frame. That works fine if you assume the underlying HTTP/2 library isn't sending a settings frame for MaxConcurrentStreams, but in this case the Go library is sending a default value of 250.

@MakMukhi (Contributor):

Do you have a proxy between your server and client? How did you increase this limit?

@jsha (Contributor, Author) commented Apr 12, 2018

There's no proxy. Here's a branch where I set up a toy client and server to test the behavior: https://github.com/letsencrypt/boulder/compare/chillcli. At first I used it with the default limit and verified the blocking behavior. Then I used grpc.NewServer(..., grpc.MaxConcurrentStreams(1000), ...) to increase the limit, and verified that it fixed the blocking behavior (so long as concurrent requests stayed below 1000).
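
(For anyone following along, raising the limit looks roughly like this; the address and value are illustrative, and service registration is omitted.)

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
)

func main() {
    lis, err := net.Listen("tcp", ":50051") // illustrative address
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    // Advertise a higher stream limit to clients; 1000 matches the value used above.
    s := grpc.NewServer(grpc.MaxConcurrentStreams(1000))
    // pb.RegisterSomethingServer(s, &server{}) would go here.
    if err := s.Serve(lis); err != nil {
        log.Fatalf("serve: %v", err)
    }
}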

@MakMukhi (Contributor):

Ah, man! OK, so this is what's happening:

  1. The version of gRPC-Go that you are using is very old and had a bug such that the client would default to 100 streams and not update this value even after receiving the first settings frame from the server, unless the server explicitly set that setting.

  2. I'm sorry, I got confused when you referred to code from head and thought that you were perhaps using a newer version. I realize now that you did mention using v1.1 a couple of times.

  3. Moreover, that old version has a vastly different implementation of the underlying transport. We have made several major performance improvements to it since then. I'd recommend trying out our latest release. The most recent performance improvement, however, will go out in v1.12. You're more than welcome to try it out at head.

  4. Also, feedback is always appreciated. :)

@jsha (Contributor, Author) commented Apr 12, 2018

Excellent, thanks for the explanation! I will work on upgrading our gRPC dependency. We'll need to do a little work since there was a change to balancers and/or certificate handling that broke our balancer/validator. I'll check back in once I've landed the fix.

@jsha (Contributor, Author) commented Apr 24, 2018

FYI, I upgraded our gRPC dependency, and used the sample client/server I linked earlier. I can confirm that even without setting MaxConcurrentStreams explicitly, I can run a very large number of RPCs concurrently, suggesting that MaxConcurrentStreams is defaulting to MaxUint32 as expected. Thanks for the help! I'll leave this ticket open to track the topic of tracing.

@MakMukhi (Contributor):

Glad that worked.

@dfawley added the P2 label and removed the P1 label Apr 26, 2018
@dfawley added the "Type: Performance" label and removed the "Type: Feature" label May 10, 2019
@stale bot added the stale label Sep 6, 2019
@dfawley removed the stale label Sep 6, 2019
@adtac (Contributor) commented Nov 7, 2019

I should probably be assigned this

@menghanl (Contributor) commented May 3, 2021

Closing due to lack of activity and priority.

@menghanl closed this as completed May 3, 2021
@github-actions bot locked as resolved and limited conversation to collaborators Oct 31, 2021