High memory usage inside transport.newBufWriter (memory leak?) #6242

Closed · wh75er opened this issue May 2, 2023 · 15 comments

@wh75er commented May 2, 2023

Hello! We've run into an interesting case of memory growth inside our gRPC service (server side). The growth is slow (about 10-13 MB per 24h), but it's noticeable, and it has continued for up to 6 days without leveling off (definitely a problem).

How did it all start?

We updated gRPC to a newer version (from 1.27.0 to 1.47.0) because of a new dependency in our repository. There were no changes at all in our service's code.

After a few days we noticed the memory growth. We waited and watched, but the growth didn't stop.

Then we tried updating gRPC to the latest version, but it didn't solve the problem. We've also collected some profiles to trace the cause.

Interesting fact

Our repository also contains one more gRPC server (let's call it service B). It uses the same vendored gRPC code as service A (the service we have the problem with). Service B has lower RPS than service A and shows no problems at all.

Technical info

  1. golang version: 1.20.1
  2. gRPC version: 1.47.0 / 1.54.0 (migrated from 1.27.0)
  3. gRPC service has only one entry (one service with one method)
  4. RPS: ~5K

Pprof data

We've collected multiple samples (02.05, 25.04, 26.04, 27.04, 28.04). The diff shows that the growth occurs inside gRPC internals:

go tool pprof -drop_negative -inuse_space -diff_base=25.04/pprof-heap-25.04-2.txt 02.05/pprof-heap-02.05-1.txt

(pprof) top
Showing nodes accounting for 17.64MB, 8.21% of 214.84MB total
Dropped 2 nodes (cum <= 1.07MB)
Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
   10.64MB  4.95%  4.95%    10.64MB  4.95%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
    1.50MB   0.7%  5.65%     1.50MB   0.7%  runtime.malg
       1MB  0.47%  6.12%        1MB  0.47%  github.com/GoWebProd/psqr.NewQuantile
       1MB  0.47%  6.58%     8.56MB  3.98%  google.golang.org/grpc/internal/transport.NewServerTransport
       1MB  0.47%  7.05%        1MB  0.47%  bytes.(*Buffer).grow
    0.50MB  0.23%  7.28%     0.50MB  0.23%  bufio.NewWriterSize
    0.50MB  0.23%  7.51%     0.50MB  0.23%  google.golang.org/grpc/metadata.Join
    0.50MB  0.23%  7.75%     0.50MB  0.23%  sync.(*Pool).pinSlow
(pprof) top -cum 25
Showing nodes accounting for 14.64MB, 6.81% of 214.84MB total
Dropped 2 nodes (cum <= 1.07MB)
Showing top 25 nodes out of 63
      flat  flat%   sum%        cum   cum%
   10.64MB  4.95%  4.95%    10.64MB  4.95%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
         0     0%  4.95%     8.56MB  3.98%  google.golang.org/grpc.(*Server).Serve.func3
         0     0%  4.95%     8.56MB  3.98%  google.golang.org/grpc.(*Server).handleRawConn
         0     0%  4.95%     8.56MB  3.98%  google.golang.org/grpc.(*Server).newHTTP2Transport
       1MB  0.47%  5.42%     8.56MB  3.98%  google.golang.org/grpc/internal/transport.NewServerTransport
         0     0%  5.42%     7.56MB  3.52%  google.golang.org/grpc/internal/transport.newFramer
         0     0%  5.42%     2.50MB  1.16%  src/service/api._ProcessRequest_Handler
         0     0%  5.42%     2.50MB  1.16%  google.golang.org/grpc.(*Server).handleStream
         0     0%  5.42%     2.50MB  1.16%  google.golang.org/grpc.(*Server).processUnaryRPC
         0     0%  5.42%     2.50MB  1.16%  google.golang.org/grpc.(*Server).serveStreams.func1.2
         0     0%  5.42%     2.50MB  1.16%  main.(*Service).AuthUnaryInterceptor
    1.50MB   0.7%  6.12%     1.50MB   0.7%  runtime.malg
         0     0%  6.12%     1.50MB   0.7%  runtime.newproc.func1
         0     0%  6.12%     1.50MB   0.7%  runtime.newproc1
         0     0%  6.12%     1.50MB   0.7%  runtime.systemstack
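For reference, a minimal sketch of how such heap samples can be collected, assuming the standard net/http/pprof handler mounted on a side port (the :6060 address is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Debug endpoint on a side port; the gRPC server itself runs elsewhere.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // placeholder: the real service blocks on the gRPC Server.Serve call
}
```

Snapshots can then be fetched with e.g. `curl -o pprof-heap-<date>.txt http://localhost:6060/debug/pprof/heap` and diffed with `go tool pprof -diff_base=...` as above.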

How to reproduce?

I've tried to reproduce it by creating a small load (about ~100 RPS) on the service, but memory usage didn't change. I suspect the problem only reproduces under high load.
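Something along these lines can be used to drive concurrent unary calls against the service (a sketch only; `pb.ServiceClient`, `Process` and `Request` are placeholders for our real generated stubs, not the actual API):

```go
// loadgen runs `workers` goroutines that issue unary calls in a loop until ctx is done.
// The stub and message types are placeholders for the real generated code.
func loadgen(ctx context.Context, client pb.ServiceClient, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				callCtx, cancel := context.WithTimeout(ctx, time.Second)
				_, _ = client.Process(callCtx, &pb.Request{}) // errors ignored in the sketch
				cancel()
			}
		}()
	}
	wg.Wait()
}
```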

Related issues

I've looked through related issues and found that there are indeed some known problems with the write buffer in gRPC internals. Maybe this issue is related to these:

#5751

#5757
#5758
#5759

I haven't found any issue reporting memory growth, though.

Conclusion

Thank you for your time. I would really appreciate it if you could:

  1. suggest what we could check in our code, in case we haven't closed some resources and the problem is on our side
  2. (or) confirm that there is a problem with the gRPC buffer writer and the memory growth is related to it

I'll try to find out how to reproduce it under higher load and check why transport.newBufWriter grows in memory. I'll share my results here if I find something interesting.
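One knob I plan to experiment with while digging into this: if I read the grpc-go options correctly, the buffer allocated in newBufWriter is sized by the server's write-buffer option, so shrinking it should at least bound the cost per connection. A sketch (16 KiB is an arbitrary example value; the default is 32 KiB as far as I can tell):

```go
import "google.golang.org/grpc"

// newServer is a placeholder constructor; it only shows where the option goes.
func newServer(opts ...grpc.ServerOption) *grpc.Server {
	// Cap the per-connection write buffer used by the HTTP/2 transport.
	opts = append(opts, grpc.WriteBufferSize(16*1024))
	return grpc.NewServer(opts...)
}
```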

@easwars easwars self-assigned this May 2, 2023
@easwars (Contributor) commented May 2, 2023

It is not clear to me whether you are seeing an increase in memory utilization of about 10-15MB between your server application running gRPC 1.27 and the one running gRPC 1.47, or whether the memory utilization of the application running gRPC 1.47 is continuously growing.

If it's the former, I'm afraid we can't help much: 1.27.0 was released more than 3 years ago and a lot has changed since then (though the bufWriter code itself hasn't changed much).

If it's the latter, i.e. memory utilization is constantly growing, then we would like to hear more from you.

@wh75er (Author) commented May 3, 2023

Hello, @easwars! Sorry for the confusion. Memory utilization is continuously growing on our server application running gRPC version 1.47 (and later we tried 1.54).

Our memory problem began after migrating to gRPC 1.47.0. We then tried updating to the latest version, 1.54.0 (in case there were relevant fixes), but it didn't help. So the timeline looks like this:

1.27.0 (memory utilization is fine) -(update)-> 1.47.0 (memory utilization is constantly growing) -(update)-> 1.54.0 (memory utilization is constantly growing)

My point was that we have two different gRPC server applications in the same repository, and both use the same vendored gRPC package. On one of them we observe growing memory utilization, while on the other memory usage is fine (it doesn't grow). This makes me think either that there is a bug on our side in the first application (probably not, since we only updated the gRPC package and didn't modify any code), or that the small RPS difference between the two servers causes the issue to show up in only one place (I'm not sure about this either, since the difference is less than ~1K RPS). Maybe one application simply exercises bufWriter more than the other.

@easwars (Contributor) commented May 3, 2023

Can you get a pprof dump that we can analyze?

Also, does the memory utilization come down in the service with high QPS when the QPS comes down? If so, then it could indicate that we are not leaking memory.

@github-actions bot commented May 9, 2023

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label May 9, 2023
@wh75er (Author) commented May 10, 2023

Sorry for the late reply!

> does the memory utilization come down in the service with high QPS when the QPS comes down

No, it doesn't. Memory seems to grow with QPS, but when QPS goes down, utilization stays at the level of the last QPS peak. When QPS starts to grow again, memory utilization continues to grow and then settles at the new level.
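To help tell a real leak apart from the Go runtime simply retaining freed pages (which could also look like a plateau that never comes down), I'm thinking of logging runtime memory stats next to the heap profiles. A minimal sketch:

```go
// Periodically log Go runtime memory stats. HeapInuse growing without bound points
// at live objects, while large HeapIdle/HeapReleased values would instead indicate
// pages the runtime has freed but is holding on to or returning to the OS.
go func() {
	for range time.Tick(time.Minute) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap_inuse=%d heap_idle=%d heap_released=%d num_gc=%d",
			m.HeapInuse, m.HeapIdle, m.HeapReleased, m.NumGC)
	}
}()
```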

> Can you get a pprof dump that we can analyze?

Sure, I've sent 2 dumps to your email so you can check the diff. You can find the message by the subject "high memory usage #6242".

@github-actions github-actions bot removed the stale label May 10, 2023
@wh75er (Author) commented May 10, 2023

Also forgot to tag you, @easwars

Doing it here

@easwars (Contributor) commented May 10, 2023

Thanks @wh75er. We will take a look soon.

@wh75er (Author) commented May 18, 2023

Hello! Are there any updates on the issue?

@easwars (Contributor) commented May 18, 2023

Really sorry. I haven't had cycles to look into this.

@zasweq @arvindbr8 @dfawley
Would one of you be able to look into this? Thanks.

@dfawley (Member) commented May 18, 2023

There are a lot of moving pieces in a system like this, so it will be hard to help here. Some things to consider:

  • Do you track the number of open connections? Is the memory scaling consistently with open connections?
  • Do you have max connection age or max connection idle enabled? https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters
  • Does your service also have outgoing gRPC connections (is it a gRPC client as well)?
  • What is the traffic like? Are all calls unary or are there streams too? Do clients connect and disconnect frequently?
  • If you can reliably reproduce the behavior, but can't debug the code to determine the cause, it would be interesting to determine when the problem appeared (what release, or even better, what commit).

@wh75er (Author) commented May 19, 2023

> Do you track the number of open connections? Is the memory scaling consistently with open connections?

No, I don't. There is no data on how open connections affect memory scaling. I'll try to figure it out.
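One low-effort way to get that data might be the channelz service that ships with grpc-go, which exposes per-server socket (connection) info; a sketch of wiring it up (`newServer` is a placeholder for our real constructor):

```go
import (
	"google.golang.org/grpc"
	channelzsvc "google.golang.org/grpc/channelz/service"
)

// newServer is a placeholder for our real server constructor.
func newServer(opts ...grpc.ServerOption) *grpc.Server {
	s := grpc.NewServer(opts...)
	// Expose channelz so open server sockets can be inspected with an external
	// channelz client (e.g. grpcdebug) while the growth is reproducing.
	channelzsvc.RegisterChannelzServiceToServer(s)
	return s
}
```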

> Do you have max connection age or max connection idle enabled? https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters

Yes, I do. But there have been no problems with that, and this part has not been modified:

keepalive.ServerParameters{
    MaxConnectionIdle:     time.Minute,
    MaxConnectionAgeGrace: 5 * time.Second,
    Time:                  5 * time.Second,
    Timeout:               1 * time.Second,
}
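For completeness, these parameters are passed to the server roughly like this (a sketch; `kasp` is just a placeholder name for the value above):

```go
s := grpc.NewServer(
	grpc.KeepaliveParams(kasp), // kasp: the keepalive.ServerParameters shown above
	// ... our other server options
)
```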

> Does your service also have outgoing gRPC connections (is it a gRPC client as well)?

No, it doesn't

> What is the traffic like? Are all calls unary or are there streams too? Do clients connect and disconnect frequently?

Traffic is not uniform. There is a peak period when RPS is high; after that it decreases.

The application uses only unary calls. There is one interceptor with a relevant feature: it attaches headers to responses. I had an idea that this somehow affects the gRPC writer, but I haven't been able to reproduce the problem myself.
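For context, the header-attaching part of that interceptor follows the usual pattern, roughly like this sketch (the header key/value are placeholders and the auth logic is omitted; this is not our actual code):

```go
// A sketch of a unary server interceptor that attaches a header to the response.
func headerInterceptor(ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	if err := grpc.SetHeader(ctx, metadata.Pairs("x-example-header", "value")); err != nil {
		return nil, err
	}
	return handler(ctx, req)
}
```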

> If you can reliably reproduce the behavior, but can't debug the code to determine the cause, it would be interesting to determine when the problem appeared (what release, or even better, what commit).

No, I cannot reproduce it at all. It seems the problem appeared after the commit that upgraded gRPC.

@easwars (Contributor) commented Jul 26, 2023

At this point, unfortunately without a good way to repro this problem, there is very little that we can help with. We will leave this open for a little bit longer before closing. If you find a good way for us to repro this problem we will gladly take a deeper look. Thank you for your patience.

@github-actions bot commented Aug 2, 2023

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Aug 2, 2023
@github-actions github-actions bot closed this as completed Aug 9, 2023
@arvindbr8 arvindbr8 reopened this Nov 2, 2023
@github-actions github-actions bot removed the stale label Nov 2, 2023
@zasweq zasweq closed this as completed Nov 2, 2023
@shantanubansal

Any update on this issue?

@arvindbr8 (Member)

This issue was closed 4 months ago. If you need to clarify something, please consider opening a new issue instead of commenting here, and reference this issue in the new one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 28, 2024