ByteBuffer memory leak in retry mechanism #9340
Comments
Interesting. The message is being serialized but didn't exit the MessageFramer; otherwise we'd see a record for the NettyClientStream. It looks like the problem is that the MessageFramer isn't getting closed (which would need to happen before calling the WriteQueue) when the cancellation triggers. Without retries this would be hard to hit because after each message there is a flush() (I can think of some cases where it could still happen, but they'd be rare). But for hedging in particular that probably isn't guaranteed. I think the fix may be as easy as adding ….

Are you using plain retries, or are you using hedging? I don't know if it is worth it to you to swap briefly to OkHttp until this is fixed. It wouldn't have this problem because it uses heap memory, but it will have different performance behavior.
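To make the suspected failure mode concrete, here is a minimal toy sketch of my own (it is not gRPC's actual MessageFramer or WriteQueue; the class and method names are made up for illustration) showing how a pooled direct buffer gets orphaned when cancellation skips the flush/close step:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class OrphanedBufferSketch {
  private ByteBuf current;  // stand-in for the framer's partially filled buffer

  void writeMessage(byte[] payload) {
    // Serialize into a pooled direct buffer, but don't hand it to the
    // transport yet; that only happens on flush (or when the framer closes).
    current = PooledByteBufAllocator.DEFAULT.directBuffer(payload.length);
    current.writeBytes(payload);
  }

  void flush() {
    if (current != null) {
      // In the real flow the buffer is handed on and released after the
      // socket write; releasing here simulates that hand-off.
      current.release();
      current = null;
    }
  }

  void cancel() {
    // If cancellation never closes/flushes the framer, the reference is
    // dropped while refCnt is still 1: the leak pattern described above.
    current = null;
  }

  public static void main(String[] args) {
    OrphanedBufferSketch s = new OrphanedBufferSketch();
    s.writeMessage(new byte[1024]);
    s.cancel();  // buffer orphaned; Netty's leak detector would report it
  }
}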
After some tweaking I can get a single leaked buffer on startup. I set the deadline to 500 microseconds and see something that appears similar. With a deadline of 1 ms I saw a single leaked buffer after minutes of attempts and runs; 10 ms and 20 ms didn't reproduce for me. If we assume the 500 µs issue I see is the same thing you see (they do look the same), then unfortunately the ….

This is sorta a workaround, but it does resolve the reproduction on my machine:

diff --git a/core/src/main/java/io/grpc/internal/RetriableStream.java b/core/src/main/java/io/grpc/internal/RetriableStream.java
index 55c17c50b..1eaae6aac 100644
--- a/core/src/main/java/io/grpc/internal/RetriableStream.java
+++ b/core/src/main/java/io/grpc/internal/RetriableStream.java
@@ -550,6 +550,7 @@ abstract class RetriableStream<ReqT> implements ClientStream {
     @Override
     public void runWith(Substream substream) {
       substream.stream.writeMessage(method.streamRequest(message));
+      substream.stream.flush();
     }
   }
Thank you for the response! I tested both ideas. Adding …. Adding ….
Normal leak detection is sampled; if you run long enough you should find any leaks that exist. Paranoid samples at 100%, so you will find leaks faster. The sample rate is the only difference. If the leak detector sees something different than writeBytes()/newDirectBuffer() for the #1 record, that would be useful to share. On the other hand, if it looks similar to what you already shared then I'd be stumped; the other flows I see should be identical to when retries are disabled (e.g., when …).

The fix I had above is safe but a bit of a heavy hammer. It really doesn't explain which code path is causing this issue. We might make that change temporarily while we try to fix up retries properly. This does seem subtle, because commenting out the closed-stream short-circuiting when draining also didn't fix the leak (this is the code path when the server notices the deadline before the client does):

grpc-java/core/src/main/java/io/grpc/internal/RetriableStream.java, lines 298 to 300 in d7a6c1e
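For anyone else digging into this, a minimal sketch (my own snippet, assuming Netty is on the classpath; the class name and the point where the client starts are placeholders) of forcing the leak detector to PARANOID programmatically. Passing -Dio.netty.leakDetection.level=paranoid on the JVM command line should be equivalent:

import io.netty.util.ResourceLeakDetector;

public final class ParanoidLeakDetection {
  public static void main(String[] args) {
    // Must be set before any pooled buffers are allocated.
    ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

    // ... start the gRPC client reproduction here; every leaked ByteBuf is
    // then reported with its access records rather than a sampled subset.
  }
}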
Makes sense. After testing with ….
Data is getting orphaned sitting in MessageFramer. This hack thus always flushes data out of the framer so no data can remain sitting there. See grpc#9340
The workaround should make it into 1.48.1. I imagine that will be ~2 weeks away.
Can this be backported to 1.41?
@cristifalcas, why are you using 1.41? Are you using it because it is the Java 7 branch?
We have a project that uses 1.41; I'm not sure if it's because they use an old Java version.
1.41 is a community-supported branch for Java 7, so a backport is feasible and wouldn't be bad, but I'd recommend using the normal supported releases (the two most recent minor versions) unless there's a strong reason not to. That said, 1.47 doesn't have the workaround either; we haven't backported it simply because we are pretty busy and nobody has asked for it (and also because my change was a workaround; I'd much rather backport a real fix, but that isn't a short-term thing).
👋 Hi! Thanks for backporting this to 1.47.x. I wonder if the actual binary has been released on Maven yet?
Thanks a lot for taking the time to fix this 🎉
disable retry in YDB GRPC channel by default due to grpc/grpc-java#9340
gRPC 1.45.3 to fix grpc/grpc-java#9340
What version of gRPC-Java are you using?
The bug was detected on the latest code from the master branch (at the time of this post), and also on 1.45 (the version used in production where the leak was noticed).
What is your environment?
Tested on macOS 12.4 and CentOS 7 with JDK 11. In production, where the leak was originally detected, it's Debian with JRE 17.
What did you expect to see?
These ByteBuffers should be getting released.
What did you see instead?
ByteBuffers are being allocated and not released when a retry is triggered, which is causing OOMs in production (and in my local reproduction).
Steps to reproduce the bug
I modified the hello-world example to showcase this bug. In the src/main/java/io/grpc/examples/helloworld directory you'll find the client and server. Replace the client with my client: https://gist.github.com/davidnadeau/055f88cec499a6afb502c9e5f231a271 (I just changed the stub to be async, added a deadline, and called the greet method in a loop), then run it from the examples directory.

It's definitely related to the retry mechanism, since adding .disableRetry() to the client channel makes the leak go away. I believe the deadline is causing the retry to trigger (you can see in the logs that the deadline is firing).

Full leak log here: https://gist.github.com/davidnadeau/4da26f072482fb58b19f4bd0379f35c7
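For reference, a condensed sketch of the client change described above (the host, port, loop count, pacing, and deadline are placeholders; the Greeter classes are the ones generated by the hello-world example, and uncommenting disableRetry() is what makes the leak go away for me):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.TimeUnit;

public class LeakRepro {
  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel = ManagedChannelBuilder.forAddress("localhost", 50051)
        .usePlaintext()
        // .disableRetry()  // with this line enabled, the leak disappears
        .build();
    GreeterGrpc.GreeterStub stub = GreeterGrpc.newStub(channel);

    for (int i = 0; i < 100_000; i++) {
      // A very short deadline makes the deadline fire and the retry path run.
      stub.withDeadlineAfter(1, TimeUnit.MILLISECONDS)
          .sayHello(HelloRequest.newBuilder().setName("world").build(),
              new StreamObserver<HelloReply>() {
                @Override public void onNext(HelloReply reply) {}
                @Override public void onError(Throwable t) {}
                @Override public void onCompleted() {}
              });
      Thread.sleep(1);  // crude pacing so the loop doesn't overwhelm the channel
    }
    channel.shutdown().awaitTermination(30, TimeUnit.SECONDS);
  }
}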
There is a noticeable latency increase when disabling retries, which prevents it from being a viable solution for my case.