
Mechanism to run maintenance of Kopia's cache directory within the node-agent #8443

Closed
mpryc opened this issue Nov 22, 2024 · 19 comments

Labels: 1.16-candidate, Needs info (Waiting for information), Needs triage (We need discussion to understand problem and decide the priority)

@mpryc (Contributor) commented Nov 22, 2024:

What steps did you take and what happened:

During backup/restore with node-agents utilizing Kopia, it was observed that the size of Kopia's cache folder keeps growing:

$ kubectl get pods -n velero-ns -l name=node-agent -o name | while read pod; do echo -e "$(kubectl exec -n velero-ns $pod -- du -hs /var | awk '{print $1}')\t$pod"; done | sort -h | awk 'BEGIN {print "KOPIA CACHE SIZE\tPOD\n-------------------------"} {print $0}'
KOPIA CACHE SIZE	POD
-------------------------
11M	pod/node-agent-7zqm9
11M	pod/node-agent-dzl6s
244M	pod/node-agent-9tmdw

What did you expect to happen:
Velero should have a mechanism to automatically clean up, or run some maintenance on, the node-agent's Kopia cache after a backup or restore operation.

@weshayutin (Contributor) commented:

@msfrucht please review and work w/ @mpryc on this

@kaovilai (Member) commented:

We had already confirmed that running immediate maintenance won't delete backups. Are we saying there's a benefit to reducing the cache directory size?

weshayutin added this to OADP on Nov 22, 2024
@Lyndon-Li (Contributor) commented:

The above tests cannot prove that Velero needs to intervene in the Kopia repo's cache management:

  1. The Kopia repo has its own policy to manage the cache, e.g., there are several margins that decide when and how cache entries are removed.
  2. The cache management is repo-wide, not operation-wide. In other words, the cache remains effective across more than one repo-level operation, i.e., backup/restore or maintenance.
  3. At present, Velero's default (hard) cache limit is 5G for data and for metadata respectively.
  4. Compared to that cache limit, the above tests cannot prove that Velero needs to manage the cache somehow, because the observed sizes are still within the scope of the Kopia repo's own management; nor can we prove it theoretically, as explained above.

Generally speaking, setting the cache limit, which we already support, is the more graceful way to control the Kopia repo's cache, and it should be enough; unless it is really necessary, we would rather have the Kopia repo itself manage the cache all the time.

We only need to consider Velero intervening in the corner case where the cache falls out of the Kopia repo's control and is left behind, e.g., when a repo is deleted or is never visited again for a long time.
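
As a rough illustration of that corner case, leftover per-repository cache directories could be spotted with something like the sketch below. The /home/velero/.cache/kopia root and the 30-day threshold are assumptions for illustration only; this is not something Velero does today:

# Sketch: show the size of each per-repository cache subdirectory, then list
# the ones not modified for 30+ days as candidates for manual cleanup.
CACHE_ROOT=/home/velero/.cache/kopia   # assumed cache root (HOME=/home/velero)
du -sh "$CACHE_ROOT"/*/ 2>/dev/null
find "$CACHE_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +30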

reasonerjt added the "Needs info (Waiting for information)" label on Nov 25, 2024
@reasonerjt (Contributor) commented:

@mpryc
Per the comment from @Lyndon-Li above, it seems the cache growth is expected. Please elaborate on why this is a severe issue.

@mpryc (Contributor, Author) commented Nov 25, 2024:

We have observed the cache folder growing well above 5G:

# Before node-agent pod restart (after backup & restore operation):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G  237G  211G  53% /var

# After node-agent pod restart:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G   25G  422G   6% /var

While this growth is expected, the cache is not being freed even when the data used for the restore is no longer needed. This causes disk usage to accumulate until the node-agent is restarted.

Possibly a dynamically set hard limit would be sufficient here? I know there is a way to set the limit; however, it would help if the limit were calculated dynamically from the available disk size on the node, because the cache can grow to the point where the node dies from a full disk and the only option at that point is to restart the node-agent. If that happens during a backup or restore operation, that backup or restore never succeeds.
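
For illustration, this is the kind of calculation I have in mind (only a sketch: the 20% fraction and the /home/velero path are arbitrary assumptions, and how the resulting value would actually be fed into Velero/Kopia is exactly the open question here):

# Sketch: derive a candidate hard cache limit as ~20% of the space currently
# available on the filesystem that backs the cache directory.
CACHE_FS=/home/velero                              # assumed cache location
AVAIL_KB=$(df -Pk "$CACHE_FS" | awk 'NR==2 {print $4}')
LIMIT_MB=$(( AVAIL_KB / 1024 / 5 ))                # 20% of available space, in MB
echo "candidate cache hard limit: ${LIMIT_MB} MB"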

@Lyndon-Li (Contributor) commented:

How many BSLs do you have and how many namespaces are you backing up?

@mpryc (Contributor, Author) commented Nov 25, 2024:

How many BSLs do you have and how many namespaces are you backing up?

It's a single namespace, a single BSL, and a single pod with a big PV (1TB) that holds a lot of data. Many files within that PV are around 10GB each.

@Lyndon-Li (Contributor) commented:

Are you using 1.15?

@mpryc (Contributor, Author) commented Nov 25, 2024:

Are you using 1.15?

Forgot to mention that in the bug, sorry: it's not 1.15, it's 1.14.

@Lyndon-Li (Contributor) commented:

This is a known issue in 1.14 and has been fixed in 1.15, so you could upgrade to 1.15 and do another test.

shubham-pampattiwar added this to the v1.16 milestone on Dec 5, 2024
@weshayutin (Contributor) commented:

@Lyndon-Li I think @mpryc would be willing to do the work here if you want to assign it to him :)

Lyndon-Li assigned mpryc and unassigned Lyndon-Li on Dec 6, 2024
@Lyndon-Li (Contributor) commented:

I am assigning this issue to @mpryc for more investigation.
As far as we understand, Kopia's own cache management mechanism is enough; we don't need to interfere with it, which could introduce bugs or degrade performance.
After the investigation, if we have new findings, we can discuss how to handle them.

reasonerjt added the "Needs triage (We need discussion to understand problem and decide the priority)" label on Dec 6, 2024
@msfrucht (Contributor) commented Dec 9, 2024:

@Lyndon-Li

I'm not sure this is the kopia cache, though. On OpenShift the kopia cache is set to /home/velero. I don't know whether /var is used as the location of the kopia cache on ordinary k8s. @mpryc, can you check if there is a hidden .cache folder under there? That is the typical name for the kopia cache folder.
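
Something like this should show it (a sketch only; it assumes HOME=/home/velero inside the node-agent pods and reuses the velero-ns namespace from the original report):

# Sketch: report the size of the hidden kopia cache directory in every node-agent pod.
kubectl get pods -n velero-ns -l name=node-agent -o name | while read pod; do
  echo -e "$(kubectl exec -n velero-ns "$pod" -- du -sh /home/velero/.cache 2>/dev/null | awk '{print $1}')\t$pod"
done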

We have observed internally that the cache controls for Kopia do not control the size of the index cache. As a result, the index cache grows pretty much as needed.

This will also cause kopia's memory use to skyrocket during connect, due to a combination of kernel issues and index cache files not being streamed from object storage.

Index object not being streamed during download and upload: kopia/kopia#4267
Kernel issue: kopia/kopia#4162

In particular, workloads with high counts of unique blocks will cause the index cache (and the number of indices) to eventually spiral out of control, though this is a fairly unusual workload for most applications.

Our worst case scenarios have seen cache use on the order of 30GB, though these are extraordinary cases at around 100 million+ unique blocks for kopia to keep track of.

Right now, a workaround of using a PVC to hold the kopia cache during maintenance has been successful at avoiding severe issues with local nodes, up to and including malfunction. However, it requires adding a maintenance Job TTL to make sure the PVC cache storage gets cleaned up, which is not available in Velero's current minimum k8s version of 1.18; it requires 1.23. Since our organization doesn't support anything older than 1.23, that's not an issue internally.

The same approach could work with node-agents, right up until the PVC becomes full.

@Lyndon-Li (Contributor) commented:

@msfrucht
We need to treat these two problems separately: the cache size on disk and the memory usage.
Though they may emerge at the same time, the cache is never the cause of the memory usage, nor vice versa. And they may or may not share the same root cause:

  • The cache size is always related to the number of indexes and the number of unique contents. The cache lives on disk and never relies on memory directly.
  • The memory usage varies across stages, i.e., connection, maintenance, backup, restore. For repository management stages, Kopia sometimes needs to load the full indexes into memory and sometimes maps part of the indexes or contents into memory; for the backup/restore phase, it also needs to load indexes and contents, and it additionally needs to load file system objects into memory, for which the memory usage is decided by the complexity and scale of the file system. During the traversal of the file system, the paging IO cache and the file system item cache, which are owned by the system (e.g., ~40 GB Memory leak (?) kopia/kopia#4162), are also counted by the cgroup but are out of our control; this is why we set Velero's pods' resource policy to BestEffort.

To conclude, at present we consider these two areas separately. Here is the current status:

  • For the index and content cache, we leverage Kopia-aware ways to manage them, i.e., the various cache margins and a separate location for the cache (see Make kopia repo cache place configurable #7725). In 1.14 there is a known Velero issue with setting the cache margins, as a result of which the cache size may exceed the limit; 1.15 has fixed that problem.
  • For the memory usage problems, there are some known issues; several of them were fixed during the Kopia 0.17/Velero 1.15 cycle, and there are still some open issues I am working on.

@mpryc (Contributor, Author) commented Dec 10, 2024:

After some more tests, there are actually 4 parts in the equation:

For each node-agent running on each worker node of the k8s cluster:

  1. Kopia cache size, which lives in the home directory (~/.cache); in our case that is /home/velero/.cache
  2. ephemeralStorage within node-agent pod
  3. CSI DataMover - mounted PVs for the node-agent
  4. Pod memory usage

Items 1 and 2 are directly connected, because the Kopia cache is inside the user's home directory, which is not mounted from external storage and therefore counts toward the pod's ephemeral storage; we have #7725 for that.
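
To see how items 1 and 2 add up in practice, the kubelet summary endpoint reports per-pod ephemeral-storage usage. A sketch (the node name is a placeholder, the velero-ns namespace is from my earlier command, and jq is assumed to be installed locally):

# Sketch: per-pod ephemeral-storage usage as accounted by the kubelet on one node.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | jq -r '.pods[]
           | select(.podRef.namespace == "velero-ns")
           | [.podRef.name, ."ephemeral-storage".usedBytes] | @tsv'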

The 4th, as you mentioned, is about kopia memory leaks and other issues, which are also not part of this bug.

I also see that the 1.14 problem is no longer visible on 1.15. On top of that, the observation of ever-growing size was a mixture of two things: the problems that were fixed in 1.15 by moving the actual transfer from the node-agent to backupPods / restorePods (??), and our mistake of looking under the /var folder, where we miscalculated the data. The data was growing because the tests were creating new workloads with new restored PVs, which were mounted under /var/ for each successful restore:

$ for file in $(ls /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/); do du -sh /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/$file; done
338M /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/4f8c141182b62cbd1d4b7c6ad87dfee9279346eb2b60832ee8c7d0a960aea94d
368M /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/f9f710147023b88d6be7e98941c95ec60fa4fab4bf6cf223402e3ce25a7fc44f

If you agree, we can close this one as no longer a bug, since we will keep #7725, unless there is something to be done for the CSI DataMover and its mounted PVs?

@msfrucht (Contributor) commented:

@mpryc I agree with that.

I have implemented #7725 internally for kopia maintenance jobs, and it worked wonderfully to reduce ephemeral storage usage. Prometheus was showing at worst about ~1MB of ephemeral-storage usage on kopia maintenance jobs afterwards.

The changes I implemented reflect that our organization was only interested in kopia caching, so they are not generalized to fit all existing Velero workflows. The cache directories for Restic and Kopia are in different locations and, with the existing layering, are not easy to look up in code.

The default Restic cache location that would need a PVC mount is the environment variable VELERO_SCRATCH_DIR, or /scratch by default.
The default Kopia cache location that would need a PVC mount is whatever HOME is set to, typically /home/velero.

The install option --cache-dir will override the location for Restic, but not for Kopia, which was very confusing for a while because --help makes no mention of that; I only realized it after looking at the code.
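
For anyone else trying to confirm the effective locations on their install, something like this can read them out of a running node-agent pod (a sketch; it assumes the velero-ns namespace used earlier in this issue and that the image ships a shell):

# Sketch: print the variables that currently drive the two cache locations
# inside one node-agent pod.
pod=$(kubectl get pods -n velero-ns -l name=node-agent -o name | head -1)
kubectl exec -n velero-ns "$pod" -- sh -c 'echo "HOME=$HOME  VELERO_SCRATCH_DIR=$VELERO_SCRATCH_DIR"'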

#7725 would be a good place to design and implement something more complete for specifying the cache dir for both Restic and Kopia.

Due to the differing cache locations, consolidating to a single location, regardless of whether the datamover is Restic or Kopia, would make the #7725 implementation less complicated.

I couldn't find a cache location design in the design documentation. It appears to have grown organically into incomplete options and environment variables.

The basic design for cache volumes I used was fairly simple due to our limited internal use case: add a "storage" section to maintenanceJobConfig and nodeAgentConfig, with fields storageClassName, size, and accessMode (ReadWriteOnce, ReadWriteMany, or ReadWriteOncePod; ROX is a validation failure). The extra access modes are just in case a storage class exists somewhere that allows ReadWriteMany and not ReadWriteOnce.

storageClassName defaulted to nil to allow the default storage class to be used.
accessMode defaulted to ReadWriteOnce.
size was a required string in Kubernetes storage-request format.

During maintenance job creation, attach an ephemeral volume using the above spec and mount it into the cache dir.
During datamover deployment, also add an ephemeral volume, the same as for maintenance.
For node-agent, that would require install work to add such a volume. I never got that far, since internally we only cared about kopia datamovers controlled by DataUpload/DataDownload.

Ephemeral PVC volumes attached to Jobs do not get removed until the Job is deleted. With a default of 3 maintenance Jobs kept around at a time, this causes excessive PVC storage use unless a Job TTL is attached (k8s 1.23+); a rough sketch of the idea follows.
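
The sketch below is not Velero's actual maintenance Job spec, only an illustration of the mechanism: a generic ephemeral PVC mounted at the kopia cache location plus ttlSecondsAfterFinished, so the PVC is garbage-collected together with the Job. The namespace, image tag, size, and the /home/velero/.cache path are assumptions.

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: repo-maintenance-cache-demo
  namespace: velero-ns
spec:
  ttlSecondsAfterFinished: 3600        # k8s 1.23+: the Job and its ephemeral PVC get cleaned up
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: maintenance
        image: velero/velero:v1.15.0   # placeholder image
        command: ["sleep", "10"]       # placeholder for the real maintenance command
        volumeMounts:
        - name: kopia-cache
          mountPath: /home/velero/.cache
      volumes:
      - name: kopia-cache
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              # storageClassName omitted -> cluster default storage class
              resources:
                requests:
                  storage: 10Gi
EOF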

@Lyndon-Li (Contributor) commented:

Ephemeral PVC volumes attached to Jobs do not get removed until the Job is deleted. With a default 3 maintenance Jobs kept around at a time, causes excessive PVC storage

Issue #7923 would solve this problem, so for the design of #7725 it is fine to focus only on cache location management.

@Lyndon-Li (Contributor) commented:

The basic design for a cache volumes I used was fairly simple due to our internal limited use case, add a "storage" section to maintenanceJobConfig and nodeAgentConfig. Fields of storageClassName, size, and an accessMode

The cache location configuration should be per volume, rather than a general setting for all DU/DD/maintenance jobs, because not every volume requires a separate cache volume, e.g., when the repo size/scale is small. In this way, we could reduce the number of PVCs/PVs.
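
Purely as a strawman of what "per volume" could look like (none of these field names exist in Velero today; the sketch only illustrates that a cache PVC would be requested solely for the volumes that need one):

# Hypothetical sketch only: a per-volume cache entry in a node-agent style config.
cat <<'EOF' > node-agent-cache-sketch.yaml
cacheVolumes:
- volume:
    pvcName: big-data-pvc        # only this source volume gets a dedicated cache PVC
  storageClassName: fast-ssd
  size: 20Gi
  accessMode: ReadWriteOnce
EOF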

@Lyndon-Li (Contributor) commented:

I am closing this issue, as we agreed that it is covered by #7725. Feel free to reopen it if required.
For the cache location management discussion, let's continue in #7725.
