Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ByteBuffer memory leak in retry mechanism #9340

Closed
davidnadeau opened this issue Jul 7, 2022 · 12 comments
Closed

ByteBuffer memory leak in retry mechanism #9340

davidnadeau opened this issue Jul 7, 2022 · 12 comments
Assignees
Labels
Milestone

Comments

@davidnadeau
Copy link

davidnadeau commented Jul 7, 2022

What version of gRPC-Java are you using?

Bug detected on latest code from master branch (at the time of this post), and also 1.45 (the version used in production where the leak was noticed).

What is your environment?

Tested on macos (12.4) and centos7 with jdk11. In production where the leak was originally detected, it's debian with jre17

What did you expect to see?

These bytebuffer should be getting released.

What did you see instead?

Bytebuffers getting allocated and not released when a retry is triggered, it's causing OOMs in production (and in my local code reproduction).

Steps to reproduce the bug

I modified the hello-world example to showcase this bug.

git clone [email protected]:grpc/grpc-java.git
cd grpc-java/examples

then in the src/main/java/io/grpc/examples/helloworld dir you'll see the client and server. Replace the client with my client: https://gist.github.com/davidnadeau/055f88cec499a6afb502c9e5f231a271 (I just change the stub to be async and use a deadline, then call the greet method in a loop). Then you can run from the examples dir:

./gradlew installDist
Then in their own tabs:
./build/install/examples/bin/hello-world-server
./build/install/examples/bin/hello-world-client

with more verbose leak logging
HELLO_WORLD_CLIENT_OPTS="-Dio.grpc.netty.shaded.io.netty.leakDetection.level=paranoid " ./build/install/examples/bin/hello-world-client

It's definitely related to the retry mechanism, since adding .disableRetry() to the client channel makes the leak go away. I believe the deadline is causing the retry to trigger (you can see in the logs that the deadline is triggering).

Full leak log here: https://gist.github.com/davidnadeau/4da26f072482fb58b19f4bd0379f35c7

There is a noticeable latency increase when disabling retries which prevents it from being a viable solution for my case.

@ejona86
Copy link
Member

ejona86 commented Jul 8, 2022

Interesting. The message is being serialized but didn't exit the MessageFramer, otherwise we'd see a record for the NettyClientStream.

Looks like the problem is the MessageFramer isn't getting closed (which'd need to happen before calling the WriteQueue) when the cancellation triggers. Without retries this would be hard to hit because after each message there is a flush() (although I can think of some cases it could still happen, but it'd be rare). But for hedging in particular that probably isn't guaranteed.

I think the fix may be as easy as adding framer.dispose() just before calling sink.cancel().

Are you using plain retries or are you using hedging? I don't know if it is worth it to you to swap briefly to OkHttp until this is fixed. It wouldn't have this problem because it uses heap memory, but it will have different performance behavior.

@ejona86 ejona86 added the bug label Jul 8, 2022
@ejona86
Copy link
Member

ejona86 commented Jul 8, 2022

After some tweaking I can get a single leaked buffer on startup. I set the deadline to 500 microseconds and see something that appears similar. With a deadline of 1 ms I saw a single leaked buffer after minutes of attempts and runs. 10 ms and 20 ms didn't reproduce for me.

If we assume the 500 µs issue I see is the same thing you see (they do look the same), then unfortunately the framer.dispose() does not seem to be sufficient. That's quite surprising. Maybe the cancellation is happening on server-side first and the inbound direction is what is triggering the leak.

This is sorta a workaround, but it does resolve the reproduction on my machine:

diff --git a/core/src/main/java/io/grpc/internal/RetriableStream.java b/core/src/main/java/io/grpc/i
nternal/RetriableStream.java
index 55c17c50b..1eaae6aac 100644
--- a/core/src/main/java/io/grpc/internal/RetriableStream.java
+++ b/core/src/main/java/io/grpc/internal/RetriableStream.java
@@ -550,6 +550,7 @@ abstract class RetriableStream<ReqT> implements ClientStream {
       @Override
       public void runWith(Substream substream) {
         substream.stream.writeMessage(method.streamRequest(message));
+        substream.stream.flush();
       }
     }
 

@davidnadeau
Copy link
Author

davidnadeau commented Jul 9, 2022

Thank you for the response!

I tested both ideas. Adding framer.dispose() doesn't seem to help.

Adding substream.stream.flush(); appears to be a big improvement (maybe even a fix), I don't see any leaks at the normal leakDetection.level. I still see some on the paranoid level though (and I see none when I disableRetry()). I think this fixes the big leak, and there may be a smaller one somewhere as well.

@ejona86
Copy link
Member

ejona86 commented Jul 9, 2022

Normal leak detection is sampled. If you run long enough you should find any leaks that exist. Paranoid samples at 100% so you will find leaks faster. The sample rate is the only difference.

If the leak detector sees something different than writeBytes()/newDirectBuffer() for the # 1 record, then that'd be useful to share. On the other hand, if it looks similar to what you already shared then I'd be stumped; the other flows I see should be identical to when retries are disabled (e.g., when passThrough == true).

The fix I had above is safe but a bit of a heavy hammer. It really doesn't explain which code path is causing this issue. We might make that change temporarily while we try to fix up retries properly. This does seem subtle, because commenting out the closed stream short-circuiting when draining also didn't fix the leak (this is the code path when server notices deadline before client):

if (substream.closed) {
return;
}

@davidnadeau
Copy link
Author

Makes sense. After testing with substream.stream.flush(); on production load and data, it looks great, memory usage is stable. It's also worth noting that there wasn't a noticeable performance impact by adding this additional flush(). Do you have any idea when this fix may make its way into a release/patch?

ejona86 added a commit to ejona86/grpc-java that referenced this issue Jul 11, 2022
Data is getting orphaned sitting in MessageFramer. This hack thus always
flushes data out of the framer so no data can remain sitting there.

See grpc#9340
ejona86 added a commit that referenced this issue Jul 15, 2022
Data is getting orphaned sitting in MessageFramer. This hack thus always
flushes data out of the framer so no data can remain sitting there.

See #9340
@ejona86
Copy link
Member

ejona86 commented Jul 15, 2022

The workaround should make it into 1.48.1. I imagine that will be ~2 weeks away.

@cristifalcas
Copy link

can this be backported to 1.41?

@ejona86 ejona86 added this to the Next milestone Jul 19, 2022
temawi pushed a commit to temawi/grpc-java that referenced this issue Jul 22, 2022
Data is getting orphaned sitting in MessageFramer. This hack thus always
flushes data out of the framer so no data can remain sitting there.

See grpc#9340
@ejona86
Copy link
Member

ejona86 commented Jul 22, 2022

@cristifalcas, why are you using 1.41? Are you using it because it is the Java 7 branch?

@cristifalcas
Copy link

we have a project that uses 1.41, not sure if it's because they use an old java version.
If it's not feasible to backport it to that version, we'll push for an upgrade.

@ejona86
Copy link
Member

ejona86 commented Jul 22, 2022

1.41 is a community-supported branch for Java 7. So it is feasible and wouldn't be bad, but I'd recommend using the normal supported releases (two most recent minor versions) unless there's a strong reason not to. That said, 1.47 doesn't have the fix workaround either, but we haven't simply because we are pretty busy and nobody's asked for it (also because my change was a workaround; I'd much rather backport a real fix, but that isn't a short-term thing).

ejona86 added a commit to ejona86/grpc-java that referenced this issue Jul 28, 2022
Data is getting orphaned sitting in MessageFramer. This hack thus always
flushes data out of the framer so no data can remain sitting there.

See grpc#9340
ejona86 added a commit to ejona86/grpc-java that referenced this issue Jul 28, 2022
Data is getting orphaned sitting in MessageFramer. This hack thus always
flushes data out of the framer so no data can remain sitting there.

See grpc#9340
ejona86 added a commit that referenced this issue Jul 28, 2022
Data is getting orphaned sitting in MessageFramer. This hack thus always
flushes data out of the framer so no data can remain sitting there.

See #9340
ejona86 added a commit that referenced this issue Jul 28, 2022
Data is getting orphaned sitting in MessageFramer. This hack thus always
flushes data out of the framer so no data can remain sitting there.

See #9340
@sheilahu
Copy link

sheilahu commented Aug 8, 2022

👋 Hi! Thanks for porting back to 1.47.x. I wonder if the actual binary has been released on maven yet?

@davidnadeau
Copy link
Author

Thanks a lot for taking the time to fix this 🎉

robot-piglet pushed a commit to yandex-cloud/ydb-java-sdk that referenced this issue Sep 20, 2022
disable retry in YDB GRPC channel by default due to grpc/grpc-java#9340
robot-piglet pushed a commit to yandex-cloud/ydb-java-sdk that referenced this issue Sep 22, 2022
disable retry in YDB GRPC channel by default due to grpc/grpc-java#9340
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 15, 2022
@ejona86 ejona86 modified the milestones: Next, 1.49 Dec 20, 2022
suztomo added a commit to googleapis/java-shared-dependencies that referenced this issue Jan 17, 2023
suztomo added a commit to googleapis/java-shared-dependencies that referenced this issue Jan 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants