ByteBuffer memory leak in retry mechanism #9340
Comments
Interesting. The message is being serialized but didn't exit the MessageFramer; otherwise we'd see a record for the NettyClientStream. It looks like the problem is that the MessageFramer isn't getting closed (which would need to happen before calling the WriteQueue) when the cancellation triggers. Without retries this would be hard to hit because after each message there is a flush() (I can think of some cases where it could still happen, but they'd be rare). But for hedging in particular that probably isn't guaranteed. I think the fix may be as easy as adding ….

Are you using plain retries, or are you using hedging? I don't know if it is worth it to you to swap briefly to OkHttp until this is fixed. It wouldn't have this problem because it uses heap memory, but it will have different performance behavior.
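To make the suspected failure mode concrete, here is a minimal toy sketch of my own (it is not gRPC's actual MessageFramer or WriteQueue; the class and method names are made up for illustration) showing how a pooled direct buffer gets orphaned when cancellation skips the flush/close step:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class OrphanedBufferSketch {
  private ByteBuf current;  // stand-in for the framer's partially filled buffer

  void writeMessage(byte[] payload) {
    // Serialize into a pooled direct buffer, but don't hand it to the
    // transport yet; that only happens on flush (or when the framer closes).
    current = PooledByteBufAllocator.DEFAULT.directBuffer(payload.length);
    current.writeBytes(payload);
  }

  void flush() {
    if (current != null) {
      // In the real flow the buffer is handed on and released after the
      // socket write; releasing here simulates that hand-off.
      current.release();
      current = null;
    }
  }

  void cancel() {
    // If cancellation never closes/flushes the framer, the reference is
    // dropped while refCnt is still 1: the leak pattern described above.
    current = null;
  }

  public static void main(String[] args) {
    OrphanedBufferSketch s = new OrphanedBufferSketch();
    s.writeMessage(new byte[1024]);
    s.cancel();  // buffer orphaned; Netty's leak detector would report it
  }
}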
After some tweaking I can get a single leaked buffer on startup. I set the deadline to 500 microseconds and see something that appears similar. With a deadline of 1 ms I saw a single leaked buffer after minutes of attempts and runs; 10 ms and 20 ms didn't reproduce for me. If we assume the 500 µs issue I see is the same thing you see (they do look the same), then unfortunately the ….

This is sorta a workaround, but it does resolve the reproduction on my machine:

diff --git a/core/src/main/java/io/grpc/internal/RetriableStream.java b/core/src/main/java/io/grpc/internal/RetriableStream.java
index 55c17c50b..1eaae6aac 100644
--- a/core/src/main/java/io/grpc/internal/RetriableStream.java
+++ b/core/src/main/java/io/grpc/internal/RetriableStream.java
@@ -550,6 +550,7 @@ abstract class RetriableStream<ReqT> implements ClientStream {
     @Override
     public void runWith(Substream substream) {
       substream.stream.writeMessage(method.streamRequest(message));
+      substream.stream.flush();
     }
   }
Thank you for the response! I tested both ideas. Adding …. Adding ….
Normal leak detection is sampled; if you run long enough you should find any leaks that exist. Paranoid samples at 100%, so you will find leaks faster. The sample rate is the only difference. If the leak detector sees something different than writeBytes()/newDirectBuffer() for the #1 record, that would be useful to share. On the other hand, if it looks similar to what you already shared then I'd be stumped; the other flows I see should be identical to when retries are disabled (e.g., when …).

The fix I had above is safe but a bit of a heavy hammer. It really doesn't explain which code path is causing this issue. We might make that change temporarily while we try to fix up retries properly. This does seem subtle, because commenting out the closed-stream short-circuiting when draining also didn't fix the leak (this is the code path when the server notices the deadline before the client does):

grpc-java/core/src/main/java/io/grpc/internal/RetriableStream.java, lines 298 to 300 in d7a6c1e
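For anyone else digging into this, a minimal sketch (my own snippet, assuming Netty is on the classpath; the class name and the point where the client starts are placeholders) of forcing the leak detector to PARANOID programmatically. Passing -Dio.netty.leakDetection.level=paranoid on the JVM command line should be equivalent:

import io.netty.util.ResourceLeakDetector;

public final class ParanoidLeakDetection {
  public static void main(String[] args) {
    // Must be set before any pooled buffers are allocated.
    ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

    // ... start the gRPC client reproduction here; every leaked ByteBuf is
    // then reported with its access records rather than a sampled subset.
  }
}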
Makes sense. After testing with ….
Data is getting orphaned sitting in MessageFramer. This hack thus always flushes data out of the framer so no data can remain sitting there. See grpc#9340
The workaround should make it into 1.48.1. I imagine that will be ~2 weeks away.
Can this be backported to 1.41?
@cristifalcas, why are you using 1.41? Are you using it because it is the Java 7 branch?
We have a project that uses 1.41; I'm not sure if it's because they use an old Java version.
1.41 is a community-supported branch for Java 7, so a backport is feasible and wouldn't be bad, but I'd recommend using the normal supported releases (the two most recent minor versions) unless there's a strong reason not to. That said, 1.47 doesn't have the workaround either; we haven't backported it simply because we are pretty busy and nobody has asked for it (and also because my change was a workaround; I'd much rather backport a real fix, but that isn't a short-term thing).
👋 Hi! Thanks for backporting this to 1.47.x. I wonder if the actual binary has been released on Maven yet?
Thanks a lot for taking the time to fix this 🎉
disable retry in YDB GRPC channel by default due to grpc/grpc-java#9340
gRPC 1.45.3 to fix grpc/grpc-java#9340
What version of gRPC-Java are you using?
The bug was detected on the latest code from the master branch (at the time of this post), and also on 1.45 (the version used in production where the leak was noticed).
What is your environment?
Tested on macOS 12.4 and CentOS 7 with JDK 11. In production, where the leak was originally detected, it's Debian with JRE 17.
What did you expect to see?
These ByteBuffers should be getting released.
What did you see instead?
ByteBuffers are being allocated and not released when a retry is triggered, which is causing OOMs in production (and in my local reproduction).
Steps to reproduce the bug
I modified the hello-world example to showcase this bug. In the src/main/java/io/grpc/examples/helloworld directory you'll find the client and server. Replace the client with my client: https://gist.github.com/davidnadeau/055f88cec499a6afb502c9e5f231a271 (I just changed the stub to be async, added a deadline, and called the greet method in a loop), then run it from the examples directory.

It's definitely related to the retry mechanism, since adding .disableRetry() to the client channel makes the leak go away. I believe the deadline is causing the retry to trigger (you can see in the logs that the deadline is firing).

Full leak log here: https://gist.github.com/davidnadeau/4da26f072482fb58b19f4bd0379f35c7
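For reference, a condensed sketch of the client change described above (the host, port, loop count, pacing, and deadline are placeholders; the Greeter classes are the ones generated by the hello-world example, and uncommenting disableRetry() is what makes the leak go away for me):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.TimeUnit;

public class LeakRepro {
  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel = ManagedChannelBuilder.forAddress("localhost", 50051)
        .usePlaintext()
        // .disableRetry()  // with this line enabled, the leak disappears
        .build();
    GreeterGrpc.GreeterStub stub = GreeterGrpc.newStub(channel);

    for (int i = 0; i < 100_000; i++) {
      // A very short deadline makes the deadline fire and the retry path run.
      stub.withDeadlineAfter(1, TimeUnit.MILLISECONDS)
          .sayHello(HelloRequest.newBuilder().setName("world").build(),
              new StreamObserver<HelloReply>() {
                @Override public void onNext(HelloReply reply) {}
                @Override public void onError(Throwable t) {}
                @Override public void onCompleted() {}
              });
      Thread.sleep(1);  // crude pacing so the loop doesn't overwhelm the channel
    }
    channel.shutdown().awaitTermination(30, TimeUnit.SECONDS);
  }
}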
There is a noticeable latency increase when disabling retries, which prevents it from being a viable solution for my case.