
Bidi streaming leads to memory leak on the server side (CloseSend not called on the client side) #6457

Closed
mju opened this issue Jul 18, 2023 · 13 comments



mju commented Jul 18, 2023

Hello there,

We had a memory leak on our gRPC server. The leaking handler was for a bidirectional streaming gRPC endpoint.

The documentation for NewStream in grpc-go/stream.go (lines 141 to 156 at commit 02946a3) reads:

// NewStream creates a new Stream for the client side. This is typically
// called by generated code. ctx is used for the lifetime of the stream.
//
// To ensure resources are not leaked due to the stream returned, one of the following
// actions must be performed:
//
// 1. Call Close on the ClientConn.
// 2. Cancel the context provided.
// 3. Call RecvMsg until a non-nil error is returned. A protobuf-generated
// client-streaming RPC, for instance, might use the helper function
// CloseAndRecv (note that CloseSend does not Recv, therefore is not
// guaranteed to release all resources).
// 4. Receive a non-nil, non-io.EOF error from Header or SendMsg.
//
// If none of the above happen, a goroutine and a context will be leaked, and grpc
// will not call the optionally-configured stats handler with a stats.End message.
It points out that closing the client connection (grpc.ClientConn) releases the resources; I am referring to point 1, "Call Close on the ClientConn." We tried closing the ClientConn and also tightening up the server by setting https://pkg.go.dev/google.golang.org/[email protected]/keepalive#ServerParameters (a sketch of that kind of configuration follows the heap dump below). The server still leaked, with the following in the heap dump:

go tool pprof -text heap.txt
File: server
Type: inuse_space
Time: ...
Showing nodes accounting for 81.88MB, 90.22% of 90.76MB total
Dropped 645 nodes (cum <= 0.45MB)
      flat  flat%   sum%        cum   cum%
   46.18MB 50.89% 50.89%    46.18MB 50.89%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
   23.11MB 25.47% 76.35%    23.11MB 25.47%  bufio.NewReaderSize (inline)
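
For reference, setting those server-side keepalive parameters looks roughly like the sketch below; the durations are placeholders, not the values we actually used:

```go
package server

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer bounds connection lifetime and pings quiet clients so that a
// misbehaved or vanished client cannot hold server resources forever.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle:     5 * time.Minute,  // GOAWAY connections with no active RPCs
		MaxConnectionAge:      30 * time.Minute, // recycle long-lived connections
		MaxConnectionAgeGrace: time.Minute,      // let in-flight RPCs finish after GOAWAY
		Time:                  2 * time.Minute,  // ping a client that has been idle this long
		Timeout:               20 * time.Second, // close the connection if the ping is not acked
	}))
}
```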

The leak stopped as soon as we invoked CloseSend() on the client side after the last Send().

Here, I mainly want clarification on whether CloseSend() is required so that streaming won't leak on the server side. I didn't find any documentation that says so; if it is required, maybe we can update the docs to suggest it.
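
To make the pattern concrete, here is a minimal sketch of what I mean by calling CloseSend() after the last Send(), written against the generic grpc.ClientStream interface (our real code uses the generated client; newReply is a hypothetical helper that allocates a fresh response message):

```go
package client

import (
	"io"

	"google.golang.org/grpc"
)

// finishStream half-closes the send side of a client bidi stream and then
// drains it. Per the NewStream documentation quoted above, CloseSend alone
// does not guarantee resource release; receiving until a non-nil error
// (io.EOF on a clean shutdown) is what releases the stream's goroutine and
// context on the client.
func finishStream(stream grpc.ClientStream, newReply func() interface{}) error {
	// Tell the server there will be no more Sends; its Recv will see io.EOF.
	if err := stream.CloseSend(); err != nil {
		return err
	}
	// Keep receiving until the stream terminates.
	for {
		if err := stream.RecvMsg(newReply()); err != nil {
			if err == io.EOF {
				return nil
			}
			return err
		}
	}
}
```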

If you can help: what was leaked? Was it also a goroutine and a context? How would you recommend we configure our servers so they are less vulnerable to misbehaved clients?

Thanks a lot.

@arvindbr8 arvindbr8 self-assigned this Jul 24, 2023
arvindbr8 (Member) commented:

Hi @mju,

A memory leak is not expected in this scenario. We could take a better look if you could send us steps to reproduce this issue. Please let us know.


github-actions bot commented Aug 1, 2023

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Aug 1, 2023
@github-actions github-actions bot closed this as completed Aug 8, 2023

mju commented Aug 16, 2023

We have just discovered that this caused a goroutine leak as well; we accumulated 1500+ goroutines. To reproduce, set up a bidi streaming endpoint and do not call CloseSend() on the client side. We are trying to collect a goroutine dump.


easwars commented Aug 16, 2023

When you close the ClientConn, even without calling CloseSend() on your open streams, it should hard-close all open connections to backends. If your server RPC handler is blocked on stream.Recv(), that call should get unblocked with an error, and that error is the signal the server handler uses to exit, thereby cleaning up the goroutines for that stream.

What keepalive parameters do you set on your server?

Also, there is a bug on the client wherein we fail to send a GOAWAY when the ClientConn is closed; see #460. Irrespective of that issue, if your server handler is blocked on Recv(), it should be able to exit when the associated connection is closed.
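
Roughly, a bidi handler body should look something like the sketch below, written against the generic grpc.ServerStream interface (a real handler uses the generated stream type, and newReq here is a hypothetical message allocator). The important part is that any non-nil error from Recv, io.EOF or otherwise, makes the handler return:

```go
package server

import (
	"io"

	"google.golang.org/grpc"
)

// echoLoop receives messages and echoes them back until Recv fails. The exit
// path is what matters: returning on io.EOF (client called CloseSend) or on a
// transport/context error (connection closed) is what lets gRPC free the
// goroutines and other resources associated with this stream.
func echoLoop(stream grpc.ServerStream, newReq func() interface{}) error {
	for {
		req := newReq()
		if err := stream.RecvMsg(req); err != nil {
			if err == io.EOF {
				return nil // client finished sending
			}
			return err // connection torn down or context canceled
		}
		if err := stream.SendMsg(req); err != nil {
			return err
		}
	}
}
```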

github-actions bot commented:

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Aug 23, 2023

mju commented Aug 27, 2023

After more digging on our end, we identified a deadlock on our server side caused by hashicorp/vault#22393. As a result, two goroutines were leaked for every request the server received; it is two because of how gRPC's server.go works. We didn't configure keep-alive on the server side, and we only recently started reusing connections. We had a way to reproduce the problem quite reliably. What was confusing is that once CloseSend() was added, the leak stopped. Would you know why? It is as if the deadlocked goroutines on the server side get cleaned up when CloseSend() is present. We tried to find clues in server.go but weren't able to.

In any case, the main cause of the leak was for sure the deadlock. Thanks for the help.

@github-actions github-actions bot removed the stale label Aug 27, 2023

easwars commented Aug 29, 2023

Calling CloseSend on the client should result in a call to Recv() getting unblocked on the server handler with io.EOF. Are you seeing that? Is your server handler blocked on Recv()?

> We didn't configure keep-alive on the server-side

Are you talking about max_connection_age?


github-actions bot commented Sep 4, 2023

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Sep 4, 2023

mju commented Sep 5, 2023

> Calling CloseSend on the client should result in a call to Recv() getting unblocked on the server handler with io.EOF. Are you seeing that? Is your server handler blocked on Recv()?

No, our server handler didn't block on Recv(). After the server handler got a response from the last Recv() call, it went on to deadlock. After adding CloseSend() on the client side, we were no longer able to reproduce the problem. It may simply have been a coincidence.

> We didn't configure keep-alive on the server-side

> Are you talking about max_connection_age?

Correct. We didn't configure any of these.


easwars commented Sep 6, 2023

> After adding CloseSend() on the client side, we were no longer able to reproduce the problem. It may simply have been a coincidence.

I cannot see any connection between CloseSend() and the issue you are seeing. But can you confirm whether, once you have fixed the deadlock on your side, removing the CloseSend() on the client still leads to problems?

github-actions bot commented:

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Sep 12, 2023

mju commented Sep 15, 2023

> After adding CloseSend() on the client side, we were no longer able to reproduce the problem. It may simply have been a coincidence.

> I cannot see any connection between CloseSend() and the issue you are seeing. But can you confirm whether, once you have fixed the deadlock on your side, removing the CloseSend() on the client still leads to problems?

No. To confirm that fixing the deadlock alone fixed our problem, we tested without a deferred CloseSend(). IIRC, we still had a CloseSend() after the last Send() on the client side, but we were testing an error path, so execution should not have reached that CloseSend().

@github-actions github-actions bot removed the stale label Sep 15, 2023

easwars commented Sep 15, 2023

Thanks. If you run into issues, please file a new issue.

@easwars easwars closed this as completed Sep 15, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 14, 2024