node-agent pods won't give back memory after a successful backup #8138

Closed
Gui13 opened this issue Aug 21, 2024 · 16 comments

Gui13 commented Aug 21, 2024

What steps did you take and what happened:

When a backup is scheduled and performed, the node-agent daemonset distributes work on all nodes of the cluster.

The backup takes a lot of memory in our case (> 6GiB), which is not a problem per se while the backup is running (Kopia eats lots of memory; that's expected).

But the issue is that the node-agent processes will keep consuming memory even after all backups are done.

Here is a screenshot of the current memory usage of our node-agent fleet (we have 44 nodes; I cannot fit them all on the screen, but all of them are consuming about the same amount of memory). At that moment, there was no backup, restore or anything else occurring:

(Screenshot, 2024-08-21 14:17: node-agent fleet memory usage with no backup running)

This is what our RAM metrics show: the node agents consume a very meager amount of RAM at rest, then start eating memory when the backup is running. After the backup ends, the memory usage is still hovering around 180GiB of RAM:

(Screenshot, 2024-08-21 15:27: RAM metrics before, during, and after the backup)

What did you expect to happen:

I expected the RAM usage to go back to close to 0 instead of lingering above 180GiB (for 44 node agents).

The following information will help us better understand what's going on:

Anything else you would like to add:

Some information on the cluster:

  • 44 nodes, 64GiB of RAM each
  • We have lots of small files to back up for each PVC, which might explain the memory usage
  • We don't use "parallelisation" for the node-agents; there is only 1 concurrent task

Environment:

  • Velero version (use velero version): 1.14.0
  • Velero features (use velero client config get features): features:
  • Kubernetes version (use kubectl version): Client Version: v1.29.0, Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version: v1.27.16
  • Kubernetes installer & version: Azure AKS, provisioned by Terraform
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): Azure Linux distribution

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
Lyndon-Li (Contributor):

Are you running data movement backups?

Lyndon-Li (Contributor):

Could you monitor for a longer time? I see the memory is not really lingering between the 1st and 2nd backups; the usage kept dropping, from more than 200GiB to around 180GiB.
Since there is plenty of memory in your nodes, memory reclaim may not be that aggressive.
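
For context on why that can happen: node-agent is a Go binary, and the Go runtime returns freed heap pages to the OS lazily, so RSS can stay elevated for a while after the work is done. Below is a minimal, self-contained sketch (illustrative only, not Velero code) of the difference between heap that is logically free and memory actually released to the OS:

```go
// Illustrative only: not Velero code. Shows why a Go process (node-agent is
// written in Go) can keep a high RSS after its heap is logically free: the
// runtime hands freed pages back to the OS lazily.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func printHeap(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%-20s HeapIdle=%4d MiB  HeapReleased=%4d MiB\n",
		label, m.HeapIdle>>20, m.HeapReleased>>20)
}

func main() {
	// Allocate ~1GiB in 1MiB chunks, then drop all references.
	bufs := make([][]byte, 1024)
	for i := range bufs {
		bufs[i] = make([]byte, 1<<20)
	}
	bufs = nil

	runtime.GC() // the heap is now logically free...
	printHeap("after GC:")

	// ...but the pages are only returned to the OS gradually, unless we
	// explicitly ask for them to be released right away.
	debug.FreeOSMemory()
	printHeap("after FreeOSMemory:")
}
```

Running this shows HeapReleased jumping only after debug.FreeOSMemory(); in normal operation the runtime performs that release gradually in the background, which is consistent with the slow drop between backups.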

Gui13 (Author) commented Aug 21, 2024

Hello @Lyndon-Li, thanks for looking at it.

We do have Data Mover activated (and we rely on it for cost reasons).

Here is a screenshot of about 6 days of operations:
(Screenshot: ~6 days of node-agent memory and CPU metrics)

We have schedules running every 24h (you can clearly see them from the CPU spikes).

I can see that the memory trickles down a bit, but not much, and it keeps increasing after each backup is run.

I expected the Kopia processes to drop their memory completely after the data move is done (and thus memory usage for all 44 instances should go back to ~10GiB at rest, which is the sum of what all 44 node-agent instances take when they boot fresh and subscribe to the necessary events).

Lyndon-Li self-assigned this Aug 22, 2024
Lyndon-Li (Contributor):

Let me try to reproduce it locally; it may take some time, I will get back to you.

Gui13 (Author) commented Aug 22, 2024

FWIW I can reproduce this on my personal cluster, where I have very few disks and exclusively use FSB (file system backup) with no Data Mover. This is 60 days of operation with 2 manual restarts of node-agent:

(Screenshot: 60 days of node-agent memory on the single-node cluster)

There is only a single node, with 1 node-agent. The cluster is a K3s machine and uses Velero 1.14.0 & Kopia to back up Longhorn-based filesystems.

It is hard to see, but the agent takes ~40MiB of memory at rest after a fresh restart, then shoots up to perform the backup, and never goes back down to 40MiB.
Also, I suspect there's a small leak after each backup, sufficient to be seen across 10+ days. Here's a zoomed version of the above screenshot with my notes:

(Screenshot: zoomed view of the above graph, with annotations)

Lyndon-Li (Contributor):

@Gui13 Good catch. I am trying to profile the node-agent to find the leaks; I will get back to you.
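
For anyone who wants to dig in alongside, the standard way to profile a Go service's heap is net/http/pprof; whether node-agent already exposes such an endpoint is not confirmed here, so the localhost:6060 listener below is an assumption, just a sketch of the general approach:

```go
// A generic sketch of exposing Go heap profiles over HTTP. The localhost:6060
// listener and the idea of wiring this into node-agent are assumptions for
// illustration, not existing Velero flags or endpoints.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// Profiling endpoint; inside a pod it can be reached via kubectl port-forward.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the long-running service doing backups
}
```

With that in place, `go tool pprof http://localhost:6060/debug/pprof/heap` grabs a heap snapshot, and `go tool pprof -base before.pb.gz after.pb.gz` diffs snapshots taken before and after a backup to show what is still retained.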

Lyndon-Li (Contributor):

@Gui13 What is the number of CPU cores in each of your cluster nodes?

Gui13 (Author) commented Aug 24, 2024

On my own cluster, I have 8 cores on 1 node.

On our production cluster we have 16 cores per node, and we have 44 nodes.

Lyndon-Li (Contributor) commented Aug 26, 2024

I could reproduce the problem as described here and found the cause.
This is something like an "expected" memory reservation according to the current code, though it doesn't look rational.
However, this reservation still has a limit; the maximum memory preserved is:
(64KiB * 2048 + 24MiB * 16) * 44 = 22GiB

So this doesn't account for all of the leaked memory in the production env (around 180GiB).
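
Spelling that estimate out (using the buffer counts and sizes quoted above):

```
64KiB × 2048     = 128MiB
24MiB × 16       = 384MiB
per node         = 128MiB + 384MiB = 512MiB
across 44 nodes  = 512MiB × 44 ≈ 22GiB
```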

Lyndon-Li (Contributor):

@Gui13 I see you opened another issue, #8135. I suspect the memory leak may be related to that issue, which results in the data path not being closed appropriately. Therefore, let's consider these two issues together.

Gui13 (Author) commented Aug 26, 2024

@Lyndon-Li yes indeed, I suspect that OOMKill is certainly a cause there. Issue #8135 is there because I expected the volume to be picked up and reconciled at some point (I mean that the DataMover code should maybe list the leftover PVCs and clean them up).

Regarding the RAM usage, my own cluster has WAY fewer files than what my client is using in production, so the "leak" is a lot less visible.

I didn't look at Velero's code, but is it possible that you are using Kopia as a Go library instead of calling it as an external tool? In that case it might be a Kopia global variable that's keeping data around?
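
To illustrate the kind of retention I mean (purely hypothetical code, not Kopia's or Velero's actual implementation): a package-level cache inside a library that is linked into a long-running process keeps its contents reachable between backups unless the host explicitly resets it:

```go
// Hypothetical illustration only: the package and identifiers are made up and
// are not Kopia or Velero APIs. A package-level cache inside an embedded
// library keeps memory reachable across backups unless the host resets it.
package blobcache

import "sync"

var (
	mu    sync.Mutex
	cache = map[string][]byte{} // lives as long as the process does
)

// Get returns a cached blob, loading and retaining it on a miss.
func Get(key string, load func() []byte) []byte {
	mu.Lock()
	defer mu.Unlock()
	if b, ok := cache[key]; ok {
		return b
	}
	b := load()
	cache[key] = b // still reachable after the backup finishes
	return b
}

// Reset drops every cached blob; if the host process never calls this, the GC
// can never reclaim the memory, no matter how long the process stays idle.
func Reset() {
	mu.Lock()
	defer mu.Unlock()
	cache = map[string][]byte{}
}
```

If something like this exists anywhere in the backup data path, the memory stays reachable and the Go garbage collector can never free it, regardless of how long the process idles.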

Lyndon-Li (Contributor):

OOMKill should not be related to the memory leak; once node-agent pods are OOMKilled, they give back the memory immediately.
And for #8135, the expectation is that if node-agent restarts for any reason (e.g., being OOMKilled), it is supposed to clean up any leftovers, unless there is a bug.

Lyndon-Li (Contributor):

my own cluster has WAY fewer files than what my client is using in production, so the "leak" is a lot less visible

The leak in your own cluster falls into what I mentioned here. However, this memory leak/reservation has a ceiling, so even in a larger-scale environment the leak should not exceed it.
Therefore, it cannot explain the leak of GiBs of memory on each node in your production env.

Lyndon-Li (Contributor):

@Gui13
If you don't mind, you can find me in the Velero user Slack channel or my personal Slack channel; let's discuss and see if we can get more info/ideas.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.


This issue was closed because it has been stalled for 14 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 10, 2024