Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clients get 503s on server GracefulStop under heavy load. #1387

Closed
gitsen opened this issue Jul 24, 2017 · 14 comments
Closed

Clients get 503s on server GracefulStop under heavy load. #1387

gitsen opened this issue Jul 24, 2017 · 14 comments
Assignees

Comments

@gitsen
Copy link

gitsen commented Jul 24, 2017

Please answer these questions before submitting your issue.

What version of gRPC are you using?

v1.4.0

What version of Go are you using (go version)?

1.8

What operating system (Linux, Windows, …) and version?

Linux, Ubuntu 14.04.5 LTS

What did you do?

If possible, provide a recipe for reproducing the error.
Our server serves @ 1.5 million RPM. We get 503s from our network proxy (envoy https://github.com/lyft/envoy) on a GracefulStop of the GRPC server.
We have a hot restart mechanism where we use SO_REUSEPORT to start a new server and drain the old server. The new server starts up fine and starts handling requests while clients of the old server on reports 503 (as the server is doing a graceful stop)
According to Matt Klein at Lyft @mattklein123 , the below could be a potential issue:
There is a race condition inherent with GOAWAY and http/2. Basically, the GOAWAY can cross with new streams being sent. Those streams would then be reset by the server that sent GOAWAY. There is a workaround that people use (which Envoy does) which I'm sure Go is not doing. That workaround is basically to send 2 GOAWAY frames with a delay between them. The first GOAWAY has last stream ID set to max stream ID, after a delay, a real GOAWAY is sent.

What did you expect to see?

We expected to see a clean draining of the requests and no 5xx

What did you see instead?

503s from the client.

@MakMukhi
Copy link
Contributor

Here's why I think you're getting the 503s (Service Unavailable):
When the client gets a GoAway for a transport, it GracefullyCloses that transport(i.e. waits to close it until all the active RPCs have finished) but at the same time it tries to create another transport to the server (relevant code). However, in your case the server has stopped serving on that port and presumably the new one hasn't started serving on that port yet. Thus this new connection attempt will result in a 503 Service Unavailable.
For all the transports on all the ClientConns that got a GoAway one 503 will be received.

Does this hypothesis align with what you're seeing?

@mattklein123
Copy link

@MakMukhi yes that is correct. The "standard" way of handling this is for the server to send 2 HTTP/2 GOAWAY frames. The first has max stream ID, so new streams are still accepted. Then the server waits for some period of time (typically RTT), and then sends the real GOAWAY, then stops listening.

@MakMukhi
Copy link
Contributor

@mattklein123 Just to be clear I don't suspect that the problem here is in-flight RPCs being cancelled.
I would expect to see a service unavailable when trying to establish a connection instead of making an RPC call. Also, even if two GoAways were sent the client behavior would remain the same; close the current connection and try to establish a new one. Since the server would stop listening after the 2nd GoAway the client would still see a 503 when it attempts to connect agan.

Also going through the server code quickly it looks like the in-flight RPCs would just be accepted by the server even after having sent a GoAway. This might be a bug that we'd have to look into; I believe we should be more restrictive and not accept those RPCs(this pertains to the race you mentioned above).

@mattklein123
Copy link

I would expect to see a service unavailable when trying to establish a connection instead of making an RPC call.

Not necessarily. If Envoy attempts to send a request, and it is reset (even with refused stream), Envoy will return a 503. One could argue that Envoy should retry if refused stream reset is returned, but that's orthogonal to this. This is a very common problem in HTTP/2 servers.

Also, even if two GoAways were sent the client behavior would remain the same; close the current connection and try to establish a new one. Since the server would stop listening after the 2nd GoAway the client would still see a 503 when it attempts to connect agan.

Not quite. This is part of a graceful draining protocol. By the time GOAWAY is sent, a new listener is up and expecting new connections. This is actually how Envoy itself works. It does the double GOAWAY dance during drain and hot restart.

@mattklein123
Copy link

Also going through the server code quickly it looks like the in-flight RPCs would just be accepted by the server even after having sent a GoAway. This might be a bug that we'd have to look into; I believe we should be more restrictive and not accept those RPCs(this pertains to the race you mentioned above).

BTW, if this is actually true and you do accept new streams after sending GOAWAY, then the issue is definitely something else and we can close this issue and will need to do more debugging. I had assumed that this was the issue though since it fit.

@MakMukhi
Copy link
Contributor

Do you think there might be a situation when the old server has stopped listening but the new server hasn't started listening yet and the client is creating a new connection(perhaps a new ClientConnbeing created or an existing ClientConn saw a connection error and is trying to reconnect).

About the server accepting in-flight RPCs even after having sent a GoAway: I'll dig deeper into that and perhaps write a simple reproduction of it to make sure this is the case.

@gitsen
Copy link
Author

gitsen commented Jul 25, 2017

@MakMukhi The new server is definitely up and listening. We start the new server, see it bind to port and then gracefully stop the old server.

@mattklein123
Copy link

About the server accepting in-flight RPCs even after having sent a GoAway: I'll dig deeper into that and perhaps write a simple reproduction of it to make sure this is the case.

I think this would be very useful to have confirmation of. I looked at the code very briefly, and I think you are correct, but would love to know for sure. (Also this means that when this issue is "fixed" the double GOAWAY will definitely need to be implemented).

Assuming that the server is accepting new streams, there must be some issue in terms of how our code is doing the hot restart dance in some cases. We will need to debug further.

@MakMukhi
Copy link
Contributor

After further investigation it turns out that although the sever doesn't reject the new stream request outright it silently ignores it. Moreover, when the client gets a GoAway it closes all streams originated after the goaway ID sent by the server, causing Unavailable errors on all those in-flight RPCs.
The proposal of sending two GoAways makes sense and is in fact mentioned in the HTTP2 spec itself. It was also proposed in #147 but wasn't implemented.
We'll prioritize this and should have something out by end of this week.

@gitsen
Copy link
Author

gitsen commented Jul 25, 2017

Thanks for the update @MakMukhi

@gitsen
Copy link
Author

gitsen commented Jul 28, 2017

@MakMukhi when do you envision this will be released?

@MakMukhi
Copy link
Contributor

The implementation is not complete yet. I'm working on the server-side part of it which I hope to send the PR out today. By next week it should be merged and if you guys require a release for that, we can go ahead and create 1.5.1.

@gitsen
Copy link
Author

gitsen commented Jul 28, 2017

Thanks! Will wait for that.

@MakMukhi
Copy link
Contributor

MakMukhi commented Aug 4, 2017

This slipped through the cracks this week. We'll get this is in early next week.
Sorry about and thanks for you patience.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants