Clients get 503s on server GracefulStop under heavy load. #1387
Comments
Here's why I think you're getting the 503s (Service Unavailable): Does this hypothesis align with what you're seeing?
@MakMukhi yes that is correct. The "standard" way of handling this is for the server to send 2 HTTP/2 GOAWAY frames. The first has the max stream ID, so new streams are still accepted. Then the server waits for some period of time (typically RTT), and then sends the real GOAWAY, then stops listening.
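For reference, that double-GOAWAY sequence can be expressed directly with the HTTP/2 framer. Below is a minimal sketch using golang.org/x/net/http2; the framer plumbing, error code, and RTT-based delay are illustrative assumptions, not how grpc-go implements its transport.

```go
package main

import (
	"time"

	"golang.org/x/net/http2"
)

// gracefulGoAway sketches the two-step GOAWAY dance: the first frame carries
// the maximum possible stream ID so streams racing with it are still accepted,
// then, after roughly one RTT, a second GOAWAY carries the real last stream ID.
// The framer, error code, and delay are illustrative assumptions.
func gracefulGoAway(fr *http2.Framer, lastStreamID uint32, rtt time.Duration) error {
	const maxStreamID = 1<<31 - 1 // HTTP/2 stream IDs are 31-bit
	if err := fr.WriteGoAway(maxStreamID, http2.ErrCodeNo, nil); err != nil {
		return err
	}
	time.Sleep(rtt) // give streams that raced with the first GOAWAY time to arrive
	return fr.WriteGoAway(lastStreamID, http2.ErrCodeNo, nil)
}
```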
@mattklein123 Just to be clear, I don't suspect that the problem here is in-flight RPCs being cancelled. Also, going through the server code quickly, it looks like in-flight RPCs would still be accepted by the server even after it has sent a GOAWAY. This might be a bug that we'd have to look into; I believe we should be more restrictive and not accept those RPCs (this pertains to the race you mentioned above).
Not necessarily. If Envoy attempts to send a request, and it is reset (even with refused stream), Envoy will return a 503. One could argue that Envoy should retry if refused stream reset is returned, but that's orthogonal to this. This is a very common problem in HTTP/2 servers.
Not quite. This is part of a graceful draining protocol. By the time GOAWAY is sent, a new listener is up and expecting new connections. This is actually how Envoy itself works. It does the double GOAWAY dance during drain and hot restart.
BTW, if this is actually true and you do accept new streams after sending GOAWAY, then the issue is definitely something else and we can close this issue and will need to do more debugging. I had assumed that this was the issue though since it fit.
Do you think there might be a situation where the old server has stopped listening, the new server hasn't started listening yet, and the client is creating a new connection (perhaps a new ClientConn being created, or an existing ClientConn that saw a connection error and is trying to reconnect)? About the server accepting in-flight RPCs even after having sent a GOAWAY: I'll dig deeper into that and perhaps write a simple reproduction of it to make sure this is the case.
@MakMukhi The new server is definitely up and listening. We start the new server, see it bind to the port, and then gracefully stop the old server.
I think this would be very useful to have confirmation of. I looked at the code very briefly, and I think you are correct, but would love to know for sure. (Also this means that when this issue is "fixed" the double GOAWAY will definitely need to be implemented). Assuming that the server is accepting new streams, there must be some issue in terms of how our code is doing the hot restart dance in some cases. We will need to debug further.
After further investigation it turns out that although the server doesn't reject the new stream request outright, it silently ignores it. Moreover, when the client gets a GOAWAY it closes all streams originated after the GOAWAY ID sent by the server, causing Unavailable errors on all those in-flight RPCs.
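To make the client-side symptom concrete: RPCs whose stream IDs fall above the server's GOAWAY ID fail with codes.Unavailable. A hedged sketch of how a caller could detect and retry that case follows; the retry count and backoff are illustrative assumptions, not behavior built into grpc-go.

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callWithRetry re-attempts an RPC when it fails with Unavailable, which is
// the status surfaced when a stream is dropped because of a server GOAWAY.
// The retry count and backoff here are illustrative assumptions.
func callWithRetry(ctx context.Context, do func(context.Context) error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = do(ctx); err == nil {
			return nil
		}
		if status.Code(err) != codes.Unavailable {
			return err // not a drain-related failure; don't retry
		}
		time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
	}
	return err
}
```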
Thanks for the update @MakMukhi
@MakMukhi when do you envision this will be released?
The implementation is not complete yet. I'm working on the server-side part of it, for which I hope to send out a PR today. It should be merged by next week, and if you need a release for it, we can go ahead and create 1.5.1.
Thanks! Will wait for that.
This slipped through the cracks this week. We'll get this in early next week.
Please answer these questions before submitting your issue.
What version of gRPC are you using?
v1.4.0
What version of Go are you using (go version)?
1.8
What operating system (Linux, Windows, …) and version?
Linux, Ubuntu 14.04.5 LTS
What did you do?
If possible, provide a recipe for reproducing the error.
Our server serves about 1.5 million RPM. We get 503s from our network proxy (Envoy, https://github.com/lyft/envoy) on a GracefulStop of the gRPC server. We have a hot restart mechanism where we use SO_REUSEPORT to start a new server and drain the old one. The new server starts up fine and starts handling requests, while clients of the old server report 503s (as the server is doing a graceful stop). According to Matt Klein at Lyft (@mattklein123), the below could be a potential issue:
There is a race condition inherent with GOAWAY and http/2. Basically, the GOAWAY can cross with new streams being sent. Those streams would then be reset by the server that sent GOAWAY. There is a workaround that people use (which Envoy does) which I'm sure Go is not doing. That workaround is basically to send 2 GOAWAY frames with a delay between them. The first GOAWAY has last stream ID set to max stream ID, after a delay, a real GOAWAY is sent.
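As background on the hot-restart setup described above, here is a minimal sketch of opening a listener with SO_REUSEPORT so the new server can bind the same port while the old one drains. It assumes Linux and a Go version with net.ListenConfig (1.11+); the address handling is a placeholder.

```go
package main

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort returns a TCP listener with SO_REUSEPORT set, allowing a
// newly started server process to bind the same address while the old process
// is still draining. Linux-only; assumes Go 1.11+ for net.ListenConfig.
func listenReusePort(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}
```

The returned listener can then be passed to the new process's grpc.Server via Serve, while the old process calls GracefulStop.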
What did you expect to see?
We expected to see a clean draining of requests and no 5xx responses.
What did you see instead?
503s on the client side.