VA receives PerformValidation RPCs very late #3641

jsha · 2018-04-11T23:22:43Z

During periods of peak traffic, the VA sometimes receives a PerformValidation RPC long after the corresponding POST to the challenge that requested validation. After some analysis, it looks like this is caused by blocking when we hit the MaxConcurrentStreams limit in Go's HTTP/2 stack (which is used by gRPC). See grpc/grpc-go#1986 for more details. The default value is 250, and during the spikes that cause these delayed validations we see traffic of about 550 rps, which should be enough to exceed that limit and cause delays.

I believe this problem manifests in all of our components during peak load, but it is particularly noticeable in the VA, where many RPCs take a long time, and so use up a slot for longer.

This also explains why, in the past, we've seen slow DNS cause timeouts in the IsSafeDomain RPC, even though that RPC almost never hits the network, and when it does, it uses a different resolver than the PerformValidation RPC. I think what was happening in those cases was that the slow PerformValidation RPCs were using up all the slots for the VA service, so some fraction of all RPCs to the VA timed out.

During periods of peak load, some RPCs are significantly delayed (on the order of seconds) by client-side blocking. HTTP/2 clients have to obey a "max concurrent streams" setting sent by the server. In Go's HTTP/2 implementation, this value [defaults to 250](https://github.com/golang/net/blob/master/http2/server.go#L56), so the gRPC default is also 250. So whenever there are more than 250 requests in progress at a time, additional requests will be delayed until there is a slot available. During this peak load, we aren't hitting limits on CPU or memory, so we should increase the max concurrent streams limit to take better advantage of our available resources. This PR adds a config field to do that. Fixes #3641.

jsha added this to the Sprint 2018-04-10 milestone Apr 11, 2018

jsha self-assigned this Apr 11, 2018

jsha mentioned this issue Apr 11, 2018

Allow configuring gRPC's MaxConcurrentStreams #3642

Merged

cpu closed this as completed in #3642 Apr 12, 2018
