During periods of peak traffic, the VA sometimes receives a PerformValidation RPC long after the corresponding POST to the challenge to request validation. After some analysis, it looks like this is caused by blocking when we hit a MaxConcurrentStreams limit in Go's HTTP/2 stack (which is used by gRPC). See grpc/grpc-go#1986 for some more details. The default value is 250, and during the spikes that cause these delayed validations, we see traffic of about 550 rps, which should be enough to cause delays.
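As a rough sanity check on those numbers, Little's law (average in-flight requests = arrival rate × mean duration) suggests how slow RPCs need to be before 550 rps fills 250 slots. The sketch below is only an estimate: it assumes all of the peak traffic counts against a single 250-stream limit, and the mean durations in the loop are hypothetical values, not measurements from the VA.

```go
package main

import "fmt"

// inFlight estimates the average number of concurrent RPCs using
// Little's law: concurrency = arrival rate * mean duration.
func inFlight(rps, meanSeconds float64) float64 {
	return rps * meanSeconds
}

// saturationLatency returns the mean RPC duration (in seconds) at which
// the given request rate occupies every available stream slot.
func saturationLatency(maxStreams, rps float64) float64 {
	return maxStreams / rps
}

func main() {
	const maxStreams = 250.0 // default limit quoted above
	const peakRPS = 550.0    // peak rate quoted above

	fmt.Printf("slots fill once mean RPC time exceeds %.3fs\n",
		saturationLatency(maxStreams, peakRPS))

	// Hypothetical mean durations, for illustration only.
	for _, d := range []float64{0.1, 0.5, 1.0} {
		fmt.Printf("mean %.1fs -> ~%.0f RPCs in flight\n", d, inFlight(peakRPS, d))
	}
}
```

Under these assumptions the limit is reached once the mean RPC takes a little under half a second, which is plausible for validations that have to wait on remote servers.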
I believe this problem manifests in all of our components during peak load, but it is particularly noticeable in the VA, where many RPCs take a long time, and so use up a slot for longer.
This also explains why, in the past, we've seen slow DNS cause timeouts in the IsSafeDomain RPC, even though that RPC almost never hits the network, and when it does, it uses a different resolver than the PerformValidation RPC. I think what was happening in those cases was that the slow PerformValidation RPCs were using up all the slots for the VA service, so some fraction of all RPCs to the VA timed out.
During periods of peak load, some RPCs are significantly delayed (on the order of seconds) by client-side blocking. HTTP/2 clients have to obey a "max concurrent streams" setting sent by the server. In Go's HTTP/2 implementation, this value [defaults to 250](https://github.com/golang/net/blob/master/http2/server.go#L56), so the gRPC default is also 250. So whenever there are more than 250 requests in progress at a time, additional requests will be delayed until there is a slot available.
During this peak load, we aren't hitting limits on CPU or memory, so we should increase the max concurrent streams limit to take better advantage of our available resources. This PR adds a config field to do that.
Fixes #3641.