node-agent pods won't give back memory after a successful backup #8138
Comments
Are you running data movement backups?
Could you monitor for a longer time? I don't see the memory really lingering between the 1st backup and the 2nd backup; the memory usage kept dropping, from more than 200G to around 180G.
Hello @Lyndon-Li, thanks for looking at it. We do have Data Mover activated (and we rely on it for cost reasons). Here is a screenshot of about 6 days of operations. We have schedules running every 24h (you can clearly see them in the CPU spikes). I can see that the memory trickles down a bit, but not much, and it keeps increasing after each backup run. I expected the Kopia processes to drop their memory completely after the data move is done, and thus memory usage for all 44 instances should go back to ~10GiB at rest, which is the total that the 44 node-agent instances consume when they freshly boot and subscribe to the necessary events.
Let me try to reproduce it locally, may take some time, will get back.
@Gui13 Good catch. I am trying to profile the node-agent to find the leaks, will get back to you.
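(For readers following along: profiling a Go service for leaks like this is typically done with pprof heap profiles. The sketch below is a generic illustration of exposing a heap profile endpoint; it is not the actual node-agent code or configuration, and the port is an arbitrary example.)

```go
// Generic sketch only: exposing a heap profile from a Go service so that
// retained allocations can be inspected with `go tool pprof`.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// A heap profile can then be fetched with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```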
@Gui13 What is the number of CPU cores in each of your cluster nodes?
On my own cluster, I have 8 cores on a single node. On our production cluster we have 16 cores per node, and we have 44 nodes.
I could reproduce the problem as described here and found the cause. That said, this doesn't account for all of the memory leak in the production env (around 180G).
@Lyndon-Li yes indeed, I suspect that OOMKill is certainly a cause in there. Issue #8135 is there because I expected the volume to be picked up and reconciled at some point (I mean that the DataMover code should perhaps list the leftover PVCs and clean them up). Regarding the RAM usage, my own cluster has WAY fewer files than what my client is using in production, so the "leak" is a lot less visible. I didn't look at the Velero code, but is it possible that you are using Kopia as a Go library instead of calling it as an external tool? In that case it might be a Kopia global variable that's keeping data around?
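(To illustrate the hypothesis above — a package-level variable in a linked-in library keeping data reachable across backups — here is a minimal, hypothetical Go sketch. It is not Kopia's or Velero's actual code; `globalCache` and `runBackup` are made-up names.)

```go
// Hypothetical illustration of the "global variable keeping data around" theory:
// a package-level cache populated during a backup stays reachable after the
// backup finishes, so the GC can never free it while the process lives.
package main

import "fmt"

// globalCache lives for the lifetime of the process, not of a single backup.
var globalCache = map[string][]byte{}

func runBackup(id string) {
	// Anything stored here remains reachable after runBackup returns.
	globalCache[id] = make([]byte, 64<<20) // 64 MiB retained per backup
}

func main() {
	for i := 0; i < 3; i++ {
		runBackup(fmt.Sprintf("backup-%d", i))
	}
	fmt.Printf("cache entries still retained after all backups: %d\n", len(globalCache))
}
```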
OOMKill should not be related to the memory leak, as once node-agent pods are oomkilled, they give back the memory immediately. |
The leak in your own cluster falls into what I mentioned here. However, this memory leak/retention has a ceiling, so even in a larger-scale env the leak should not exceed that ceiling.
@Gui13
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
This issue was closed because it has been stalled for 14 days with no activity. |
What steps did you take and what happened:
When a backup is scheduled and performed, the node-agent daemonset distributes work on all nodes of the cluster. The backup takes a lot of memory in our case (> 6GiB), which is not a problem per se while the backup is occurring (Kopia eats lots of memory, that's expected).
But the issue is that the node-agent processes will keep consuming memory even after all backups are done.
Here is a screenshot of the current memory usage of our node-agent fleet (we have 44 nodes; I cannot fit them all on the screen, but all of them are consuming about the same amount of memory). At that moment, there was no backup, restore or anything else occurring:
This is what happens on our RAM metrics: you can see that the node agents consume a very meager amount of RAM at rest and start eating memory when the backup is running. After the backup ends, memory usage is still hovering at 180GiB of RAM:
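(As a side note for anyone triaging this: one generic way to tell whether memory like this is still reachable inside a Go process or merely not yet returned to the OS is to look at the Go runtime's own accounting. The sketch below is illustrative only and assumes nothing about node-agent's internals.)

```go
// Generic diagnostic sketch (not Velero/node-agent code): the Go runtime's own
// accounting helps distinguish "objects still reachable" from "freed memory the
// runtime has not yet returned to the OS".
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapInuse:    %d MiB (bytes in in-use heap spans)\n", m.HeapInuse>>20)
	fmt.Printf("HeapReleased: %d MiB (physical memory returned to the OS)\n", m.HeapReleased>>20)
	fmt.Printf("Sys:          %d MiB (total memory obtained from the OS)\n", m.Sys>>20)
}
```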
What did you expect to happen:
I expected the RAM usage to go back to close to 0 instead of lingering above 180GiB (for 44 node agents).
The following information will help us better understand what's going on:
Anything else you would like to add:
Some information on the cluster:
Environment:
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.