Clients get 503s on server GracefulStop under heavy load. #1387
Comments
Here's why I think you're getting the 503s (Service Unavailable): Does this hypothesis align with what you're seeing?
@MakMukhi yes that is correct. The "standard" way of handling this is for the server to send 2 HTTP/2 GOAWAY frames. The first has the max stream ID, so new streams are still accepted. Then the server waits for some period of time (typically RTT), and then sends the real GOAWAY, then stops listening.
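For reference, that double-GOAWAY sequence can be expressed directly with the HTTP/2 framer. Below is a minimal sketch using golang.org/x/net/http2; the framer plumbing, error code, and RTT-based delay are illustrative assumptions, not how grpc-go implements its transport.

```go
package main

import (
	"time"

	"golang.org/x/net/http2"
)

// gracefulGoAway sketches the two-step GOAWAY dance: the first frame carries
// the maximum possible stream ID so streams racing with it are still accepted,
// then, after roughly one RTT, a second GOAWAY carries the real last stream ID.
// The framer, error code, and delay are illustrative assumptions.
func gracefulGoAway(fr *http2.Framer, lastStreamID uint32, rtt time.Duration) error {
	const maxStreamID = 1<<31 - 1 // HTTP/2 stream IDs are 31-bit
	if err := fr.WriteGoAway(maxStreamID, http2.ErrCodeNo, nil); err != nil {
		return err
	}
	time.Sleep(rtt) // give streams that raced with the first GOAWAY time to arrive
	return fr.WriteGoAway(lastStreamID, http2.ErrCodeNo, nil)
}
```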
@mattklein123 Just to be clear, I don't suspect that the problem here is in-flight RPCs being cancelled. Also, going through the server code quickly, it looks like in-flight RPCs would still be accepted by the server even after it has sent a GOAWAY. This might be a bug that we'd have to look into; I believe we should be more restrictive and not accept those RPCs (this pertains to the race you mentioned above).
Not necessarily. If Envoy attempts to send a request, and it is reset (even with refused stream), Envoy will return a 503. One could argue that Envoy should retry if refused stream reset is returned, but that's orthogonal to this. This is a very common problem in HTTP/2 servers.
Not quite. This is part of a graceful draining protocol. By the time GOAWAY is sent, a new listener is up and expecting new connections. This is actually how Envoy itself works. It does the double GOAWAY dance during drain and hot restart.
BTW, if this is actually true and you do accept new streams after sending GOAWAY, then the issue is definitely something else and we can close this issue and will need to do more debugging. I had assumed that this was the issue though since it fit.
Do you think there might be a situation where the old server has stopped listening, the new server hasn't started listening yet, and the client is creating a new connection (perhaps a new ClientConn being created, or an existing ClientConn that saw a connection error and is trying to reconnect)? About the server accepting in-flight RPCs even after having sent a GOAWAY: I'll dig deeper into that and perhaps write a simple reproduction of it to make sure this is the case.
@MakMukhi The new server is definitely up and listening. We start the new server, see it bind to the port, and then gracefully stop the old server.
I think this would be very useful to have confirmation of. I looked at the code very briefly, and I think you are correct, but would love to know for sure. (Also this means that when this issue is "fixed" the double GOAWAY will definitely need to be implemented). Assuming that the server is accepting new streams, there must be some issue in terms of how our code is doing the hot restart dance in some cases. We will need to debug further.
After further investigation it turns out that although the server doesn't reject the new stream request outright, it silently ignores it. Moreover, when the client gets a GOAWAY it closes all streams originated after the GOAWAY ID sent by the server, causing Unavailable errors on all those in-flight RPCs.
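To make the client-side symptom concrete: RPCs whose stream IDs fall above the server's GOAWAY ID fail with codes.Unavailable. A hedged sketch of how a caller could detect and retry that case follows; the retry count and backoff are illustrative assumptions, not behavior built into grpc-go.

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callWithRetry re-attempts an RPC when it fails with Unavailable, which is
// the status surfaced when a stream is dropped because of a server GOAWAY.
// The retry count and backoff here are illustrative assumptions.
func callWithRetry(ctx context.Context, do func(context.Context) error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = do(ctx); err == nil {
			return nil
		}
		if status.Code(err) != codes.Unavailable {
			return err // not a drain-related failure; don't retry
		}
		time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
	}
	return err
}
```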
Thanks for the update @MakMukhi
@MakMukhi when do you envision this will be released?
The implementation is not complete yet. I'm working on the server-side part of it, for which I hope to send out a PR today. It should be merged by next week, and if you need a release for it, we can go ahead and create 1.5.1.
Thanks! Will wait for that.
This slipped through the cracks this week. We'll get this in early next week.
Please answer these questions before submitting your issue.
What version of gRPC are you using?
v1.4.0
What version of Go are you using (go version)?
1.8
What operating system (Linux, Windows, …) and version?
Linux, Ubuntu 14.04.5 LTS
What did you do?
If possible, provide a recipe for reproducing the error.
Our server serves about 1.5 million RPM. We get 503s from our network proxy (Envoy, https://github.com/lyft/envoy) on a GracefulStop of the gRPC server. We have a hot restart mechanism where we use SO_REUSEPORT to start a new server and drain the old one. The new server starts up fine and starts handling requests, while clients of the old server report 503s (as the server is doing a graceful stop). According to Matt Klein at Lyft (@mattklein123), the below could be a potential issue:
There is a race condition inherent with GOAWAY and http/2. Basically, the GOAWAY can cross with new streams being sent. Those streams would then be reset by the server that sent GOAWAY. There is a workaround that people use (which Envoy does) which I'm sure Go is not doing. That workaround is basically to send 2 GOAWAY frames with a delay between them. The first GOAWAY has last stream ID set to max stream ID, after a delay, a real GOAWAY is sent.
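As background on the hot-restart setup described above, here is a minimal sketch of opening a listener with SO_REUSEPORT so the new server can bind the same port while the old one drains. It assumes Linux and a Go version with net.ListenConfig (1.11+); the address handling is a placeholder.

```go
package main

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort returns a TCP listener with SO_REUSEPORT set, allowing a
// newly started server process to bind the same address while the old process
// is still draining. Linux-only; assumes Go 1.11+ for net.ListenConfig.
func listenReusePort(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}
```

The returned listener can then be passed to the new process's grpc.Server via Serve, while the old process calls GracefulStop.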
What did you expect to see?
We expected to see a clean draining of requests and no 5xx responses.
What did you see instead?
503s on the client side.