High memory usage inside transport.newBufWriter (memory leak?) #6242

Closed · wh75er opened this issue May 2, 2023 · 15 comments

@wh75er commented May 2, 2023

Hello! We've run into an interesting case of memory growth inside our gRPC service (server side). The growth is slow (about 10-13 MB per 24h), but it's noticeable, and it has continued for up to 6 days without leveling off (definitely a problem).

How did it all start?

We updated gRPC to a newer version (from 1.27.0 to 1.47.0) because of a new dependency in our repository. There were no changes at all in our service's code.

After a few days we noticed the memory growth. We waited and watched, but the growth didn't stop.

Then we tried updating gRPC to the latest version, but it didn't solve the problem. We've also collected some profiles to trace the cause.

Interesting fact

Our repository also contains one more gRPC server (let's call it service B). It uses the same vendored gRPC code as service A (the service we have the problem with). Service B has lower RPS than service A and shows no problems at all.

Technical info

  1. golang version: 1.20.1
  2. gRPC version: 1.47.0 / 1.54.0 (migrated from 1.27.0)
  3. gRPC service has only one entry (one service with one method)
  4. RPS: ~5K

Pprof data

We've collected multiple samples (02.05, 25.04, 26.04, 27.04, 28.04). The diff shows that the growth occurs inside gRPC internals:

go tool pprof -drop_negative -inuse_space -diff_base=25.04/pprof-heap-25.04-2.txt 02.05/pprof-heap-02.05-1.txt

(pprof) top
Showing nodes accounting for 17.64MB, 8.21% of 214.84MB total
Dropped 2 nodes (cum <= 1.07MB)
Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
   10.64MB  4.95%  4.95%    10.64MB  4.95%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
    1.50MB   0.7%  5.65%     1.50MB   0.7%  runtime.malg
       1MB  0.47%  6.12%        1MB  0.47%  github.com/GoWebProd/psqr.NewQuantile
       1MB  0.47%  6.58%     8.56MB  3.98%  google.golang.org/grpc/internal/transport.NewServerTransport
       1MB  0.47%  7.05%        1MB  0.47%  bytes.(*Buffer).grow
    0.50MB  0.23%  7.28%     0.50MB  0.23%  bufio.NewWriterSize
    0.50MB  0.23%  7.51%     0.50MB  0.23%  google.golang.org/grpc/metadata.Join
    0.50MB  0.23%  7.75%     0.50MB  0.23%  sync.(*Pool).pinSlow
(pprof) top -cum 25
Showing nodes accounting for 14.64MB, 6.81% of 214.84MB total
Dropped 2 nodes (cum <= 1.07MB)
Showing top 25 nodes out of 63
      flat  flat%   sum%        cum   cum%
   10.64MB  4.95%  4.95%    10.64MB  4.95%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
         0     0%  4.95%     8.56MB  3.98%  google.golang.org/grpc.(*Server).Serve.func3
         0     0%  4.95%     8.56MB  3.98%  google.golang.org/grpc.(*Server).handleRawConn
         0     0%  4.95%     8.56MB  3.98%  google.golang.org/grpc.(*Server).newHTTP2Transport
       1MB  0.47%  5.42%     8.56MB  3.98%  google.golang.org/grpc/internal/transport.NewServerTransport
         0     0%  5.42%     7.56MB  3.52%  google.golang.org/grpc/internal/transport.newFramer
         0     0%  5.42%     2.50MB  1.16%  src/service/api._ProcessRequest_Handler
         0     0%  5.42%     2.50MB  1.16%  google.golang.org/grpc.(*Server).handleStream
         0     0%  5.42%     2.50MB  1.16%  google.golang.org/grpc.(*Server).processUnaryRPC
         0     0%  5.42%     2.50MB  1.16%  google.golang.org/grpc.(*Server).serveStreams.func1.2
         0     0%  5.42%     2.50MB  1.16%  main.(*Service).AuthUnaryInterceptor
    1.50MB   0.7%  6.12%     1.50MB   0.7%  runtime.malg
         0     0%  6.12%     1.50MB   0.7%  runtime.newproc.func1
         0     0%  6.12%     1.50MB   0.7%  runtime.newproc1
         0     0%  6.12%     1.50MB   0.7%  runtime.systemstack
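For reference, a minimal sketch of how such heap samples can be collected, assuming the standard net/http/pprof handler mounted on a side port (the :6060 address is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Debug endpoint on a side port; the gRPC server itself runs elsewhere.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // placeholder: the real service blocks on the gRPC Server.Serve call
}
```

Snapshots can then be fetched with e.g. `curl -o pprof-heap-<date>.txt http://localhost:6060/debug/pprof/heap` and diffed with `go tool pprof -diff_base=...` as above.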

How to reproduce?

I've tried to reproduce it by creating a small load (about ~100 RPS) on the service, but memory usage didn't change. I suspect the problem only reproduces under high load.
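Something along these lines can be used to drive concurrent unary calls against the service (a sketch only; `pb.ServiceClient`, `Process` and `Request` are placeholders for our real generated stubs, not the actual API):

```go
// loadgen runs `workers` goroutines that issue unary calls in a loop until ctx is done.
// The stub and message types are placeholders for the real generated code.
func loadgen(ctx context.Context, client pb.ServiceClient, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				callCtx, cancel := context.WithTimeout(ctx, time.Second)
				_, _ = client.Process(callCtx, &pb.Request{}) // errors ignored in the sketch
				cancel()
			}
		}()
	}
	wg.Wait()
}
```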

Related issues

I've looked through related issues and found that there are indeed some known problems with the write buffer in gRPC internals. Maybe this issue is related to these:

#5751

#5757
#5758
#5759

I haven't found any issue reporting memory growth, though.

Conclusion

Thank you for your time. I would really appreciate it if you could:

  1. suggest what we could check in our code, in case we haven't closed some resources and the problem is on our side
  2. (or) confirm that there is a problem with the gRPC buffer writer and the memory growth is related to it

I'll try to find out how to reproduce it under higher load and check why transport.newBufWriter grows in memory. I'll share my results here if I find something interesting.
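One knob I plan to experiment with while digging into this: if I read the grpc-go options correctly, the buffer allocated in newBufWriter is sized by the server's write-buffer option, so shrinking it should at least bound the cost per connection. A sketch (16 KiB is an arbitrary example value; the default is 32 KiB as far as I can tell):

```go
import "google.golang.org/grpc"

// newServer is a placeholder constructor; it only shows where the option goes.
func newServer(opts ...grpc.ServerOption) *grpc.Server {
	// Cap the per-connection write buffer used by the HTTP/2 transport.
	opts = append(opts, grpc.WriteBufferSize(16*1024))
	return grpc.NewServer(opts...)
}
```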

@easwars easwars self-assigned this May 2, 2023
@easwars (Contributor) commented May 2, 2023

It is not clear to me whether you are seeing an increase in memory utilization of about 10-15MB between your server application running gRPC 1.27 and the one running gRPC 1.47, or whether the memory utilization of the application running gRPC 1.47 is continuously growing.

If it's the former, I'm afraid we can't help much: 1.27.0 was released more than 3 years ago and a lot has changed since then (though the bufWriter code itself hasn't changed much).

If it's the latter, i.e. memory utilization is constantly growing, then we would like to hear more from you.

@wh75er (Author) commented May 3, 2023

Hello, @easwars! Sorry for the confusion. Memory utilization is continuously growing on our server application running gRPC version 1.47 (and later we tried 1.54).

Our memory problem began after migrating to gRPC 1.47.0. We then tried updating to the latest version, 1.54.0 (in case there were relevant fixes), but it didn't help. So the timeline looks like this:

1.27.0 (memory utilization is fine) -(update)-> 1.47.0 (memory utilization is constantly growing) -(update)-> 1.54.0 (memory utilization is constantly growing)

My point was that we have two different gRPC server applications in the same repository, and both use the same vendored gRPC package. On one of them we observe growing memory utilization, while on the other memory usage is fine (it doesn't grow). This makes me think either that there is a bug on our side in the first application (probably not, since we only updated the gRPC package and didn't modify any code), or that the small RPS difference between the two servers causes the issue to show up in only one place (I'm not sure about this either, since the difference is less than ~1K RPS). Maybe one application simply exercises bufWriter more than the other.

@easwars (Contributor) commented May 3, 2023

Can you get a pprof dump that we can analyze?

Also, does the memory utilization come down in the service with high QPS when the QPS comes down? If so, then it could indicate that we are not leaking memory.

@github-actions bot commented May 9, 2023

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label May 9, 2023
@wh75er (Author) commented May 10, 2023

Sorry for the late reply!

> does the memory utilization come down in the service with high QPS when the QPS comes down

No, it doesn't. Memory seems to grow with QPS, but when QPS goes down, utilization stays at the level of the last QPS peak. When QPS starts to grow again, memory utilization continues to grow and then settles at the new level.
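To help tell a real leak apart from the Go runtime simply retaining freed pages (which could also look like a plateau that never comes down), I'm thinking of logging runtime memory stats next to the heap profiles. A minimal sketch:

```go
// Periodically log Go runtime memory stats. HeapInuse growing without bound points
// at live objects, while large HeapIdle/HeapReleased values would instead indicate
// pages the runtime has freed but is holding on to or returning to the OS.
go func() {
	for range time.Tick(time.Minute) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap_inuse=%d heap_idle=%d heap_released=%d num_gc=%d",
			m.HeapInuse, m.HeapIdle, m.HeapReleased, m.NumGC)
	}
}()
```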

> Can you get a pprof dump that we can analyze?

Sure, I've sent 2 dumps to your email so you can check the diff. You can find the message by the subject "high memory usage #6242".

@github-actions github-actions bot removed the stale label May 10, 2023
@wh75er (Author) commented May 10, 2023

Also forgot to tag you, @easwars

Doing it here

@easwars (Contributor) commented May 10, 2023

Thanks @wh75er. We will take a look soon.

@wh75er (Author) commented May 18, 2023

Hello! Are there any updates on the issue?

@easwars (Contributor) commented May 18, 2023

Really sorry. I haven't had cycles to look into this.

@zasweq @arvindbr8 @dfawley
Would one of you be able to look into this? Thanks.

@dfawley (Member) commented May 18, 2023

There are a lot of moving pieces in a system like this, so it will be hard to help here. Some things to consider:

  • Do you track the number of open connections? Is the memory scaling consistently with open connections?
  • Do you have max connection age or max connection idle enabled? https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters
  • Does your service also have outgoing gRPC connections (is it a gRPC client as well)?
  • What is the traffic like? Are all calls unary or are there streams too? Do clients connect and disconnect frequently?
  • If you can reliably reproduce the behavior, but can't debug the code to determine the cause, it would be interesting to determine when the problem appeared (what release, or even better, what commit).

@wh75er (Author) commented May 19, 2023

> Do you track the number of open connections? Is the memory scaling consistently with open connections?

No, I don't. There is no data on how open connections affect memory scaling. I'll try to figure it out.
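One low-effort way to get that data might be the channelz service that ships with grpc-go, which exposes per-server socket (connection) info; a sketch of wiring it up (`newServer` is a placeholder for our real constructor):

```go
import (
	"google.golang.org/grpc"
	channelzsvc "google.golang.org/grpc/channelz/service"
)

// newServer is a placeholder for our real server constructor.
func newServer(opts ...grpc.ServerOption) *grpc.Server {
	s := grpc.NewServer(opts...)
	// Expose channelz so open server sockets can be inspected with an external
	// channelz client (e.g. grpcdebug) while the growth is reproducing.
	channelzsvc.RegisterChannelzServiceToServer(s)
	return s
}
```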

> Do you have max connection age or max connection idle enabled? https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters

Yes, I do. But there have been no problems with that, and this part has not been modified:

keepalive.ServerParameters{
    MaxConnectionIdle:     time.Minute,
    MaxConnectionAgeGrace: 5 * time.Second,
    Time:                  5 * time.Second,
    Timeout:               1 * time.Second,
}
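For completeness, these parameters are passed to the server roughly like this (a sketch; `kasp` is just a placeholder name for the value above):

```go
s := grpc.NewServer(
	grpc.KeepaliveParams(kasp), // kasp: the keepalive.ServerParameters shown above
	// ... our other server options
)
```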

> Does your service also have outgoing gRPC connections (is it a gRPC client as well)?

No, it doesn't

> What is the traffic like? Are all calls unary or are there streams too? Do clients connect and disconnect frequently?

Traffic is not uniform. There is a peak period when RPS is high; after that it decreases.

The application uses only unary calls. There is one interceptor with a relevant feature: it attaches headers to responses. I had an idea that this somehow affects the gRPC writer, but I haven't been able to reproduce the problem myself.
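For context, the header-attaching part of that interceptor follows the usual pattern, roughly like this sketch (the header key/value are placeholders and the auth logic is omitted; this is not our actual code):

```go
// A sketch of a unary server interceptor that attaches a header to the response.
func headerInterceptor(ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	if err := grpc.SetHeader(ctx, metadata.Pairs("x-example-header", "value")); err != nil {
		return nil, err
	}
	return handler(ctx, req)
}
```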

> If you can reliably reproduce the behavior, but can't debug the code to determine the cause, it would be interesting to determine when the problem appeared (what release, or even better, what commit).

No, I cannot reproduce it at all. It seems the problem appeared after the commit that upgraded gRPC.

@easwars (Contributor) commented Jul 26, 2023

At this point, unfortunately without a good way to repro this problem, there is very little that we can help with. We will leave this open for a little bit longer before closing. If you find a good way for us to repro this problem we will gladly take a deeper look. Thank you for your patience.

@github-actions bot commented Aug 2, 2023

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Aug 2, 2023
@github-actions github-actions bot closed this as completed Aug 9, 2023
@arvindbr8 arvindbr8 reopened this Nov 2, 2023
@github-actions github-actions bot removed the stale label Nov 2, 2023
@zasweq zasweq closed this as completed Nov 2, 2023
@shantanubansal

Any update on this issue?

@arvindbr8 (Member)

This issue was closed 4 months ago. If you need to clarify something, please consider opening a new issue instead of commenting here, and reference this issue in the new one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 28, 2024