Mechanism to run maintenance of Kopia's cache directory within the node-agent #8443
So we had already confirmed that running immediate maintenance won't delete backups. Are we saying there's a benefit to reducing the cache directory size?
The above tests cannot prove that Velero needs to intervene in the Kopia repo's cache management:
Generally speaking, setting the cache limit, which we already have, is a more graceful way to control the Kopia repo's cache, and it should be enough. Unless really necessary, we should let the Kopia repo itself manage the cache all the time. We only need to consider Velero intervening in the corner case where the cache is out of the Kopia repo's control and so is left behind, e.g., a repo is deleted or is never visited again for a long time.
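For reference, a minimal sketch of what inspecting and capping the cache looks like with Kopia's own CLI (Velero normally drives the limit through its own repository configuration rather than this CLI, and flag names can differ between Kopia versions):

```bash
# Sketch only: run while connected to the repo whose cache you want to inspect.
kopia cache info                          # show the cache directory and current sizes
kopia cache set --content-cache-size-mb=5120 --metadata-cache-size-mb=5120
kopia cache clear                         # drop everything already cached
```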
@mpryc
Currently it is observed that the cache folder grows way above 5G:

```
# Before node-agent pod restart (after backup & restore operation):
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G  237G  211G  53% /var

# After node-agent pod restart:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G   25G  422G   6% /var
```

While this growth is expected, the cache is not being freed even when the data used for the restore is no longer needed, so disk usage accumulates until the node-agent is restarted.

Possibly dynamically setting a hard limit would be sufficient here? I know there is a way to set the limit, but it would help if the limit were calculated dynamically from the available disk space on the node, because the cache can grow to the point where the node dies from a full disk, and the only option at that point is to restart the node-agent. If that happens during a backup or restore operation, the backup or restore never succeeds.
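A minimal sketch of what such a dynamically calculated limit could look like, purely illustrative and not something Velero does today; the cache path is an assumption and the 20% ratio is arbitrary:

```bash
# Sketch only: derive a cache cap from the free space of the filesystem
# backing the cache directory (adjust the path to your deployment).
CACHE_DIR=/home/velero/.cache
AVAIL_MB=$(df --output=avail --block-size=1M "$CACHE_DIR" | tail -n 1 | tr -dc '0-9')
LIMIT_MB=$(( AVAIL_MB / 5 ))   # e.g. let the cache use at most 20% of the free space
echo "would cap the Kopia cache at ${LIMIT_MB} MB (of ${AVAIL_MB} MB currently available)"
```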
How many BSLs do you have and how many namespaces are you backing up?
It's a single namespace, a single BSL, and a single pod with a big PV (1TB) that has a lot of data in it: many files of around 10GB each.
Are you using 1.15?
Forgot to mention that in the bug, sorry: it's not 1.15, it's 1.14.
This is a known issue in 1.14 and has been fixed in 1.15, so you could upgrade to 1.15 and do another test.
@Lyndon-Li I think @mpryc would be willing to do the work here if you want to assign it to him :)
I am assigning this issue to @mpryc for more investigation.
I'm not sure this is the Kopia cache, though. On OpenShift the Kopia cache is set to /home/velero. I don't know if /var is used for the location of the Kopia cache in ordinary k8s. @mpryc Can you check if there is a hidden .cache folder under there? That is the typical name for the Kopia cache folder.

We have observed internally that the cache controls for Kopia do not control the size of the index cache. As a result, the index cache grows pretty much as needed. This will also cause Kopia's memory use to skyrocket during connect, due to a combination of kernel issues and index cache files not being streamed from object storage. Index objects not being streamed during download and upload: kopia/kopia#4267

In particular, workloads with high counts of unique blocks will cause the index cache (and the number of indices) to eventually spiral out of control, though this is a fairly unusual workload for most applications. Our worst-case scenarios have seen cache use on the order of 30GB, though these are extraordinary cases at around 100 million+ unique blocks for Kopia to keep track of.

Right now, a workaround of using a PVC to hold the Kopia cache during maintenance has been successful at avoiding severe issues with local nodes, up to and including malfunction (sketched below). However, this requires adding a maintenance Job ttl to make sure the PVC cache storage gets cleaned up, which is not available at the current Velero minimum k8s version of 1.18 and requires 1.23. Since our organization doesn't support anything older than 1.23, this is not an issue internally. The same approach could work with the node-agents, right up until the PVC becomes full.
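An illustrative sketch of that workaround's shape, not Velero's actual maintenance Job spec; the Job name, image, command, and cache path are assumptions:

```bash
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: repo-maintenance-cache-example
  namespace: velero
spec:
  ttlSecondsAfterFinished: 3600           # TTL-after-finished is GA in k8s 1.23
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: maintenance
        image: busybox                    # placeholder; the real Job runs Velero's repo maintenance
        command: ["sh", "-c", "echo maintenance would run here"]
        volumeMounts:
        - name: kopia-cache
          mountPath: /home/velero         # assumed Kopia cache location, as noted above for OpenShift
      volumes:
      - name: kopia-cache
        ephemeral:                        # generic ephemeral volume: the PVC is deleted with the Pod/Job
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 50Gi
              # storageClassName omitted -> cluster default StorageClass
EOF
```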
@msfrucht
To conclude, at present we consider these two areas separately; here is the current status:
After some more tests, there are actually 4 parts in the equation, for each node-agent that is running on each worker node of the k8s cluster:

Parts 1 and 2 are directly connected, because the Kopia cache is within the home directory of the user, which is not mounted from external storage and is part of the pod's ephemeral storage; we have #7725 for that. The 4th, as you mentioned, is around some Kopia memory leaks and other issues, which is also not part of this bug. I also see the 1.14 problem is no longer visible on 1.15.

To add to it, the observation of ever-growing size was a mixture of both problems, which were fixed in 1.15 by moving the actual transfer from the node-agent to backupPods / restorePods (??), and our wrong way of looking at the usage under:

```
$ for file in $(ls /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/); do du -sh /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/$file; done
338M    /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/4f8c141182b62cbd1d4b7c6ad87dfee9279346eb2b60832ee8c7d0a960aea94d
368M    /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/f9f710147023b88d6be7e98941c95ec60fa4fab4bf6cf223402e3ce25a7fc44f
```

If you agree, we can close this one as not a bug anymore, since we will keep #7725, unless there is something to be done for the CSI Datamover and its mounted PVs?
@mpryc I agree with that. I have implemented #7725 internally for Kopia maintenance jobs and it worked wonderfully to reduce ephemeral storage usage. Prometheus was showing roughly 1MB of ephemeral-storage usage at worst on Kopia maintenance jobs afterwards.

The changes implemented in our organization were only concerned with Kopia caching, so they are not generalized to fit all existing Velero workflows. The cache directories for Restic and Kopia are in different locations and not easy to look up in code with the existing layering. The default Restic cache location needed for PVC mounts is the environment variable VELERO_SCRATCH_DIR, or /scratch by default. The install option --cache-dir will override it for Restic, but not for Kopia, which was very confusing for a while because there is no mention of that in --help; I only found it by looking at the code.

#7725 would be a good place to design and implement something more complete for specifying the cache dir for both the Restic and Kopia caches. Due to the differing cache locations, consolidating to a single location, regardless of whether the datamover is Restic or Kopia, would make the #7725 implementation less complicated. I couldn't find a cache location design in the design documentation; it appears to have grown organically into incomplete options and environment variables.

The basic design for cache volumes I used was fairly simple due to our limited internal use case: add a "storage" section to maintenanceJobConfig and nodeAgentConfig with fields storageClassName, size, and accessMode (ReadWriteOnce, ReadWriteMany, ReadWriteOncePod; ROX is a validation failure), as sketched below. The accessMode is just in case a storage class exists somewhere that allows ReadWriteMany and not ReadWriteOnce. storageClassName defaults to nil to allow use of the default storage class. During maintenance job creation, attach an ephemeral volume using the above spec and mount it into the cache dir.

Ephemeral PVC volumes attached to Jobs do not get removed until the Job is deleted; with the default of 3 maintenance Jobs kept around at a time, this causes excessive PVC storage unless a Job ttl is attached (k8s 1.23+).
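A hypothetical rendering of that "storage" section, just to make the shape concrete; the field names come from the commenter's internal design, not an existing Velero API, and the file name and ConfigMap wiring are assumptions:

```bash
# Sketch only: what the internally-used "storage" section might look like when
# written out for a node-agent / maintenance-job config.
cat <<'EOF' > node-agent-config.json
{
  "storage": {
    "storageClassName": "standard",
    "size": "50Gi",
    "accessMode": "ReadWriteOnce"
  }
}
EOF
# storageClassName may be left out (nil) to fall back to the default StorageClass;
# ReadWriteMany and ReadWriteOncePod are also accepted, while ReadOnlyMany fails validation.
```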
The cache location configuration should be per volume, instead of general for all DU/DD/maintenance jobs, because not every volume requires a separate cache volume, e.g., if the repo size/scale is small. In this way, we could reduce the number of PVCs/PVs.
What steps did you take and what happened:
During backup/restore with the node-agents utilizing Kopia, it was observed that the size of Kopia's cache folder keeps growing:
What did you expect to happen:
Velero should have a mechanism to automatically clean up / run maintenance on the node-agent's Kopia cache after a backup or restore operation.