Implement gRPC-specific tracing for RPC life cycle #1986
Correct, that's exactly what I meant.
Golang's HTTP/2 library sets a default limit of 250: https://github.com/golang/net/blob/master/http2/server.go#L56. gRPC inherits that limit. I am indeed planning to increase the limit from the default, but I'd like future visibility into whether I'm hitting that new, higher limit. I have visibility into other parts of my system, like how long RPCs are taking, via https://github.com/grpc-ecosystem/go-grpc-prometheus (which uses interceptors). However, since the waiting happens after the client interceptor has already handed off control to the gRPC library, this particular source of latency falls between the cracks and is not currently measurable.
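For illustration, here is a minimal sketch of the interceptor-style measurement mentioned above (this is not the go-grpc-prometheus code; the metric name, address, and function names are made up). The limitation is visible in the code: any waiting on stream quota happens inside invoker, so it gets folded into the overall RPC latency rather than showing up as its own signal.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// rpcLatency is a made-up metric for this sketch; go-grpc-prometheus defines its own.
var rpcLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "grpc_client_rpc_duration_seconds",
		Help: "End-to-end RPC latency as observed by the client interceptor.",
	},
	[]string{"method"},
)

// latencyInterceptor times the whole call. Any blocking on stream quota happens
// inside invoker(), so it is folded into this one number and cannot be broken
// out separately.
func latencyInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	start := time.Now()
	err := invoker(ctx, method, req, reply, cc, opts...)
	rpcLatency.WithLabelValues(method).Observe(time.Since(start).Seconds())
	return err
}

func main() {
	prometheus.MustRegister(rpcLatency)
	conn, err := grpc.Dial("localhost:9090", // placeholder address
		grpc.WithInsecure(),
		grpc.WithUnaryInterceptor(latencyInterceptor))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Make RPCs through conn with generated client stubs as usual.
}
```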
I suppose, right before this line, we could add the following logging:

    if t.waitingStreams == 0 {
        info("Client is trying to create more streams than allowed by the server.")
    }

What do you think about this? I'd also like to get more input on this from the rest of the team. Out of curiosity, is gRPC-Go performance something that you care about? If so, we'd love to hear what scenario your service runs in.
One similar but slightly nicer alternative would be to record the time before and after this for loop, so we could log how long stream creation was blocked.

The most thorough intervention would allow some way to graph the effect of blocking on available streams. For instance, the client could set a header to indicate when the request was created (before any in-client blocking). A server interceptor could examine that header and compare it to the current server time to calculate an "rpc_lateness" stat: the difference between when the client wanted to make the request and when the server got it (subject to the vagaries of clock skew, of course).
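A rough sketch of that header-based approach, assuming hand-rolled interceptors (the header name and function names are invented; the gRPC and metadata calls are the standard ones): the client stamps the RPC's creation time into outgoing metadata, and the server derives the lateness from its own clock.

```go
package rpclateness

import (
	"context"
	"strconv"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// headerKey is an invented header name for this sketch.
const headerKey = "x-rpc-created-at-unixnano"

// clientLateness stamps each outgoing RPC with its creation time, before any
// in-client blocking on stream quota can occur.
func clientLateness(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	ctx = metadata.AppendToOutgoingContext(ctx, headerKey,
		strconv.FormatInt(time.Now().UnixNano(), 10))
	return invoker(ctx, method, req, reply, cc, opts...)
}

// serverLateness compares that stamp against the server's clock; the difference
// is the "rpc_lateness" described above, subject to clock skew between hosts.
func serverLateness(ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if vals := md.Get(headerKey); len(vals) > 0 {
			if created, err := strconv.ParseInt(vals[0], 10, 64); err == nil {
				lateness := time.Since(time.Unix(0, created))
				_ = lateness // export this as a histogram or gauge instead of discarding it
			}
		}
	}
	return handler(ctx, req)
}
```

These would be wired in with grpc.WithUnaryInterceptor(clientLateness) on the client and the grpc.UnaryInterceptor(serverLateness) server option.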
Yep! I work on https://github.com/letsencrypt/boulder/, the server for Let's Encrypt, a free Certificate Authority. We use gRPC extensively and care about its performance. So far, performance is great! This is the first snag we hit. We have about 7 different gRPC services. Some of those services call other services in turn. Some are moderately low-latency, like our storage service, which fronts a database and has a timeout of 5 seconds. Others are very high-latency, like our validation service, which makes requests out to the Internet and has a timeout of 90 seconds.
Another interesting approach, perhaps even better, would be to allow the server to expose its current number of concurrent streams on a per-client basis. This would allow us to graph that number over time and alert once it climbs past half the limit, so that we could scale up well in advance of hitting any latency issues.
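A simplified sketch of that idea using grpc-go's stats.Handler hook (all names here are invented, and this counts in-flight RPCs server-wide rather than per client; a per-client split would additionally need TagConn to stash the remote address in the context). It would be installed with grpc.StatsHandler(&concurrencyHandler{}) as a server option, and the counter fed into a Prometheus gauge.

```go
package streamstats

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc/stats"
)

// concurrencyHandler counts RPCs that have started but not yet finished on
// this server.
type concurrencyHandler struct {
	inFlight int64
}

func (h *concurrencyHandler) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return ctx
}

func (h *concurrencyHandler) HandleRPC(_ context.Context, s stats.RPCStats) {
	switch s.(type) {
	case *stats.Begin:
		atomic.AddInt64(&h.inFlight, 1) // feed this into a Prometheus gauge
	case *stats.End:
		atomic.AddInt64(&h.inFlight, -1)
	}
}

func (h *concurrencyHandler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (h *concurrencyHandler) HandleConn(_ context.Context, _ stats.ConnStats) {}
```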
What you're describing is part of one of our goals: to have gRPC-specific tracing for an RPC. We plan to record the time an RPC takes in the various stages of its life cycle: from creation to being scheduled, from being scheduled to being written on the wire, and from there to getting a response back. We might add more events to that list since we haven't gotten to creating a formal design yet. Doing this also requires that normal scenarios (when this tracing is turned off) are not affected by these expensive measurements. We can use this issue to track progress on this goal. On the server side, we could have similar info logs (as mentioned in the previous comment) printed out whenever settings for a connection are updated.

That's great to know. We are actively working on improving our performance and feedback is always welcome. Additionally, looking at (part of) your code, it looks like you are using gRPC's native server implementation, which defaults to math.MaxUint32 for MaxConcurrentStreams.
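As a rough illustration of the kind of life-cycle timing being described (this is not the planned design; it just leans on the existing client-side stats.Handler events and logs elapsed times), installed when dialing with grpc.WithStatsHandler(stageTimer{}):

```go
package lifecycle

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc/stats"
)

type key struct{}

// stageTimer notes when each RPC is created and logs how long it took to
// reach later transport events.
type stageTimer struct{}

func (stageTimer) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return context.WithValue(ctx, key{}, time.Now())
}

func (stageTimer) HandleRPC(ctx context.Context, s stats.RPCStats) {
	start, ok := ctx.Value(key{}).(time.Time)
	if !ok {
		return
	}
	switch s.(type) {
	case *stats.OutPayload:
		log.Printf("request written to the wire %v after creation", time.Since(start))
	case *stats.InHeader:
		log.Printf("response headers received %v after creation", time.Since(start))
	case *stats.End:
		log.Printf("RPC finished %v after creation", time.Since(start))
	}
}

func (stageTimer) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }

func (stageTimer) HandleConn(context.Context, stats.ConnStats) {}
```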
Interesting. I didn't consciously choose a native server implementation vs a non-native one. What controls that? Is that something that's changed since v1.1?
It's not new. If you look at the example code in our repo, you'll see that gRPC servers are launched by calling grpc.NewServer followed by Serve, which uses our native server implementation.

One thing that I should mention: by default, gRPC clients start with a limit set to 100; this limit is overridden when the server sends its first settings frame (part of the handshake process).
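For reference, a minimal sketch of that launch pattern (the address and the commented-out registration line are placeholders): grpc.NewServer plus Serve uses gRPC's native HTTP/2 transport, as opposed to serving through net/http via the server's ServeHTTP method.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":9090") // placeholder address
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	// grpc.NewServer + Serve is the native server path.
	s := grpc.NewServer()
	// pb.RegisterYourServiceServer(s, &yourServer{}) // register generated services here
	if err := s.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}
```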
Ah, I think there may be a bug with setting maxStreams to MaxUint32: see grpc-go/transport/http2_server.go, lines 131 to 139, at commit db0f071.
Note that in the default case, maxStreams is set to MaxUint32, but that is not sent as a settings frame. That works fine if you assume the underlying HTTP/2 library isn't sending a settings frame for MaxConcurrentStreams, but in this case the Go library is sending a default value of 250.
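To make the described behavior concrete, here is a rough paraphrase of that pattern (not the actual grpc-go source; simplified for illustration): the setting is only appended when a non-default value is configured, so in the MaxUint32 default case no SETTINGS_MAX_CONCURRENT_STREAMS is written, and the 250 default from the Go HTTP/2 library ends up governing instead, as described above.

```go
package main

import (
	"fmt"
	"math"

	"golang.org/x/net/http2"
)

func main() {
	// In the default case maxStreams stays at MaxUint32, so the condition
	// below is false and the setting is never advertised to the peer.
	maxStreams := uint32(math.MaxUint32)
	var settings []http2.Setting
	if maxStreams != math.MaxUint32 {
		settings = append(settings, http2.Setting{
			ID:  http2.SettingMaxConcurrentStreams,
			Val: maxStreams,
		})
	}
	fmt.Printf("settings advertised: %v\n", settings) // empty in the default case
}
```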
Do you have a proxy between your server and client? How did you increase this limit?
There's no proxy. Here's a branch where I set up a toy client and server to test the behavior: https://github.com/letsencrypt/boulder/compare/chillcli. At first I used it with the default limit and verified the blocking behavior. Then I used
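For readers who don't want to dig into the linked branch, here is a generic stand-in for that kind of test (not the boulder code; it uses the stock health-check service as the RPC, and the address and request count are placeholders). With a deliberately slow handler on the server side, calls beyond the advertised stream limit show up as extra client-side latency.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	conn, err := grpc.Dial("localhost:9090", grpc.WithInsecure()) // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := healthpb.NewHealthClient(conn)

	const n = 1000 // well above the suspected 250-stream ceiling
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			start := time.Now()
			_, err := client.Check(context.Background(), &healthpb.HealthCheckRequest{})
			log.Printf("rpc %d: err=%v latency=%v", i, err, time.Since(start))
		}(i)
	}
	wg.Wait()
}
```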
Ah man! Ok so this is what's happening:
Excellent, thanks for the explanation! I will work on upgrading our gRPC dependency. We'll need to do a little work since there was a change to balancers and/or certificate handling that broke our balancer/validator. I'll check back in once I've landed the fix.
FYI, I upgraded our gRPC dependency, and used the sample client/server I linked earlier. I can confirm that even without setting MaxConcurrentStreams explicitly, I can run a very large number of RPCs concurrently, suggesting that MaxConcurrentStreams is defaulting to MaxUint32 as expected. Thanks for the help! I'll leave this ticket open to track the topic of tracing.
Glad that worked.
I should probably be assigned this
Closing due to lack of activity and priority.
I'm using gRPC v1.1 and Go v1.10.1 on Linux.
I found that with a single gRPC client and a single gRPC server, when load goes above 250 concurrent requests, a lot of my RPCs are delivered late, often significantly late (90 seconds or more). I believe I was hitting the MaxConcurrentStreams limit. Adjusting that limit higher fixed the problem.
When debugging this, it was hard to gather accurate data. I'd like to generate either a stat or a log line when I hit the concurrent stream limit in the future, so I know that I need to scale up my service or adjust the limit. It looks like the relevant code is around https://github.com/grpc/grpc-go/blob/master/transport/http2_client.go#L539, and I might need to provide an accessor for the current value of t.waitingStreams in order to use it as a Prometheus gauge. What do you think of this approach to the problem? Would you be willing to accept a patch? Are there alternate approaches that you like better?

Thanks,
Jacob
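A sketch of what the gauge idea could look like on the consuming side, assuming a hypothetical accessor for the waiting-stream count (grpc-go does not expose one today; the metric name and port are made up):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// waitingStreams stands in for the hypothetical accessor on the client
// transport discussed above; it would return the current t.waitingStreams value.
func waitingStreams() float64 {
	return 0
}

func main() {
	gauge := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "grpc_client_waiting_streams", // made-up metric name
		Help: "RPCs currently blocked waiting for a stream slot.",
	}, waitingStreams)
	prometheus.MustRegister(gauge)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil)) // placeholder metrics port
}
```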