node-agent pods won't give back memory after a successful backup #8138

Closed
Gui13 opened this issue Aug 21, 2024 · 16 comments

Gui13 commented Aug 21, 2024

What steps did you take and what happened:

When a backup is scheduled and performed, the node-agent daemonset distributes work on all nodes of the cluster.

The backup takes a lot of memory in our case (> 6GiB), which is not a problem per se while the backup is running (Kopia eats lots of memory; that's expected).

But the issue is that the node-agent processes will keep consuming memory even after all backups are done.

Here is a screenshot of the current memory usage of our node-agent fleet (we have 44 nodes; I cannot fit them all on the screen, but all of them are consuming about the same amount of memory). At that moment, there was no backup, restore or anything else occurring:

(Screenshot, 2024-08-21 14:17: node-agent fleet memory usage with no backup running)

This is what our RAM metrics show: the node agents consume a very meager amount of RAM at rest, then start eating memory when the backup is running. After the backup ends, the memory usage is still hovering around 180GiB of RAM:

(Screenshot, 2024-08-21 15:27: RAM metrics before, during, and after the backup)

What did you expect to happen:

I expected the RAM usage to go back to close to 0 instead of lingering above 180GiB (for 44 node agents).

The following information will help us better understand what's going on:

Anything else you would like to add:

Some information on the cluster:

  • 44 nodes, 64GiB of RAM each
  • We have lots of small files to back up for each PVC, which might explain the memory usage
  • We don't use "parallelisation" for the node-agents; there is only 1 concurrent task

Environment:

  • Velero version (use velero version): 1.14.0
  • Velero features (use velero client config get features): features:
  • Kubernetes version (use kubectl version): Client Version: v1.29.0, Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version: v1.27.16
  • Kubernetes installer & version: Azure AKS, provisioned by Terraform
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): Azure Linux distribution

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
Lyndon-Li (Contributor):

Are you running data movement backups?

Lyndon-Li (Contributor):

Could you monitor for a longer time? I see the memory is not really lingering between the 1st and 2nd backups; the usage kept dropping, from more than 200GiB to around 180GiB.
Since there is plenty of memory in your nodes, memory reclaim may not be that aggressive.
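
For context on why that can happen: node-agent is a Go binary, and the Go runtime returns freed heap pages to the OS lazily, so RSS can stay elevated for a while after the work is done. Below is a minimal, self-contained sketch (illustrative only, not Velero code) of the difference between heap that is logically free and memory actually released to the OS:

```go
// Illustrative only: not Velero code. Shows why a Go process (node-agent is
// written in Go) can keep a high RSS after its heap is logically free: the
// runtime hands freed pages back to the OS lazily.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func printHeap(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%-20s HeapIdle=%4d MiB  HeapReleased=%4d MiB\n",
		label, m.HeapIdle>>20, m.HeapReleased>>20)
}

func main() {
	// Allocate ~1GiB in 1MiB chunks, then drop all references.
	bufs := make([][]byte, 1024)
	for i := range bufs {
		bufs[i] = make([]byte, 1<<20)
	}
	bufs = nil

	runtime.GC() // the heap is now logically free...
	printHeap("after GC:")

	// ...but the pages are only returned to the OS gradually, unless we
	// explicitly ask for them to be released right away.
	debug.FreeOSMemory()
	printHeap("after FreeOSMemory:")
}
```

Running this shows HeapReleased jumping only after debug.FreeOSMemory(); in normal operation the runtime performs that release gradually in the background, which is consistent with the slow drop between backups.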

Gui13 (Author) commented Aug 21, 2024

Hello @Lyndon-Li, thanks for looking at it.

We do have Data Mover activated (and we rely on it for cost reasons).

Here is a screenshot of about 6 days of operations:
(Screenshot: ~6 days of node-agent memory and CPU metrics)

We have schedules running every 24h (you can clearly see them from the CPU spikes).

I can see that the memory trickles down a bit, but not much, and it keeps increasing after each backup is run.

I expected the Kopia processes to drop their memory completely after the data move is done (and thus memory usage for all 44 instances should go back to ~10GiB at rest, which is the sum of what all 44 node-agent instances take when they boot fresh and subscribe to the necessary events).

Lyndon-Li self-assigned this Aug 22, 2024
Lyndon-Li (Contributor):

Let me try to reproduce it locally; it may take some time, I will get back to you.

Gui13 (Author) commented Aug 22, 2024

FWIW I can reproduce this on my personal cluster, where I have very few disks and exclusively use FSB (file system backup) with no Data Mover. This is 60 days of operation with 2 manual restarts of node-agent:

(Screenshot: 60 days of node-agent memory on the single-node cluster)

There is only a single node, with 1 node-agent. The cluster is a K3s machine and uses Velero 1.14.0 & Kopia to back up Longhorn-based filesystems.

It is hard to see, but the agent takes ~40MiB of memory at rest after a fresh restart, then shoots up to perform the backup, and never goes back down to 40MiB.
Also, I suspect there's a small leak after each backup, sufficient to be seen across 10+ days. Here's a zoomed version of the above screenshot with my notes:

(Screenshot: zoomed view of the above graph, with annotations)

Lyndon-Li (Contributor):

@Gui13 Good catch. I am trying to profile the node-agent to find the leaks; I will get back to you.
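
For anyone who wants to dig in alongside, the standard way to profile a Go service's heap is net/http/pprof; whether node-agent already exposes such an endpoint is not confirmed here, so the localhost:6060 listener below is an assumption, just a sketch of the general approach:

```go
// A generic sketch of exposing Go heap profiles over HTTP. The localhost:6060
// listener and the idea of wiring this into node-agent are assumptions for
// illustration, not existing Velero flags or endpoints.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// Profiling endpoint; inside a pod it can be reached via kubectl port-forward.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the long-running service doing backups
}
```

With that in place, `go tool pprof http://localhost:6060/debug/pprof/heap` grabs a heap snapshot, and `go tool pprof -base before.pb.gz after.pb.gz` diffs snapshots taken before and after a backup to show what is still retained.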

Lyndon-Li (Contributor):

@Gui13 What is the number of CPU cores in each of your cluster nodes?

Gui13 (Author) commented Aug 24, 2024

On my own cluster, I have 8 cores on 1 node.

On our production cluster we have 16 cores per node, and we have 44 nodes.

Lyndon-Li (Contributor) commented Aug 26, 2024

I could reproduce the problem as described here and found the cause.
This is something like an "expected" memory reservation according to the current code, though it doesn't look rational.
However, this reservation still has a limit; the maximum memory preserved is:
(64KiB * 2048 + 24MiB * 16) * 44 = 22GiB

So this doesn't account for all of the leaked memory in the production env (around 180GiB).
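
Spelling that estimate out (using the buffer counts and sizes quoted above):

```
64KiB × 2048     = 128MiB
24MiB × 16       = 384MiB
per node         = 128MiB + 384MiB = 512MiB
across 44 nodes  = 512MiB × 44 ≈ 22GiB
```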

Lyndon-Li (Contributor):

@Gui13 I see you opened another issue, #8135. I suspect the memory leak may be related to that issue, which results in the data path not being closed appropriately. Therefore, let's consider these two issues together.

Gui13 (Author) commented Aug 26, 2024

@Lyndon-Li yes indeed, I suspect that OOMKill is certainly a cause there. Issue #8135 is there because I expected the volume to be picked up and reconciled at some point (I mean that the DataMover code should maybe list the leftover PVCs and clean them up).

Regarding the RAM usage, my own cluster has WAY fewer files than what my client is using in production, so the "leak" is a lot less visible.

I didn't look at Velero's code, but is it possible that you are using Kopia as a Go library instead of calling it as an external tool? In that case it might be a Kopia global variable that's keeping data around?
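
To illustrate the kind of retention I mean (purely hypothetical code, not Kopia's or Velero's actual implementation): a package-level cache inside a library that is linked into a long-running process keeps its contents reachable between backups unless the host explicitly resets it:

```go
// Hypothetical illustration only: the package and identifiers are made up and
// are not Kopia or Velero APIs. A package-level cache inside an embedded
// library keeps memory reachable across backups unless the host resets it.
package blobcache

import "sync"

var (
	mu    sync.Mutex
	cache = map[string][]byte{} // lives as long as the process does
)

// Get returns a cached blob, loading and retaining it on a miss.
func Get(key string, load func() []byte) []byte {
	mu.Lock()
	defer mu.Unlock()
	if b, ok := cache[key]; ok {
		return b
	}
	b := load()
	cache[key] = b // still reachable after the backup finishes
	return b
}

// Reset drops every cached blob; if the host process never calls this, the GC
// can never reclaim the memory, no matter how long the process stays idle.
func Reset() {
	mu.Lock()
	defer mu.Unlock()
	cache = map[string][]byte{}
}
```

If something like this exists anywhere in the backup data path, the memory stays reachable and the Go garbage collector can never free it, regardless of how long the process idles.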

Lyndon-Li (Contributor):

OOMKill should not be related to the memory leak; once node-agent pods are OOMKilled, they give back the memory immediately.
And for #8135, the expectation is that if node-agent restarts for any reason (e.g., being OOMKilled), it is supposed to clean up any leftovers, unless there is a bug.

Lyndon-Li (Contributor):

my own cluster has WAY fewer files than what my client is using in production, so the "leak" is a lot less visible

The leak in your own cluster falls into what I mentioned here. However, this memory leak/reservation has a ceiling, so even in a larger-scale environment the leak should not exceed it.
Therefore, it cannot explain the leak of GiBs of memory on each node in your production env.

Lyndon-Li (Contributor):

@Gui13
If you don't mind, you can find me in the Velero user Slack channel or my personal Slack channel; let's discuss and see if we can get more info/ideas.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.


This issue was closed because it has been stalled for 14 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 10, 2024