
Mechanism to run maintenance of Kopia's cache directory within the node-agent #8443

Closed
mpryc opened this issue Nov 22, 2024 · 19 comments

Labels: 1.16-candidate, Needs info (Waiting for information), Needs triage (We need discussion to understand problem and decide the priority)

@mpryc (Contributor) commented Nov 22, 2024:

What steps did you take and what happened:

During backup/restore with node-agents utilizing Kopia, it was observed that the size of Kopia's cache folder keeps growing:

$ kubectl get pods -n velero-ns -l name=node-agent -o name | while read pod; do echo -e "$(kubectl exec -n velero-ns $pod -- du -hs /var | awk '{print $1}')\t$pod"; done | sort -h | awk 'BEGIN {print "KOPIA CACHE SIZE\tPOD\n-------------------------"} {print $0}'
KOPIA CACHE SIZE	POD
-------------------------
11M	pod/node-agent-7zqm9
11M	pod/node-agent-dzl6s
244M	pod/node-agent-9tmdw

What did you expect to happen:
Velero should have a mechanism to automatically clean up, or run some maintenance on, the node-agent's Kopia cache after a backup or restore operation.

@weshayutin (Contributor) commented:

@msfrucht please review and work w/ @mpryc on this

@kaovilai (Member) commented:

We had already confirmed that running immediate maintenance won't delete backups. Are we saying there's a benefit to reducing the cache directory size?

weshayutin added this to OADP on Nov 22, 2024
@Lyndon-Li (Contributor) commented:

The above tests cannot prove that Velero needs to intervene in the Kopia repo's cache management:

  1. The Kopia repo has its own policy to manage the cache, e.g., there are several margins that decide when and how cache entries are removed.
  2. The cache management is repo-wide, not operation-wide. In other words, the cache remains effective across more than one repo-level operation, i.e., backup/restore or maintenance.
  3. At present, Velero's default (hard) cache limit is 5G for data and for metadata respectively.
  4. Compared to that cache limit, the above tests cannot prove that Velero needs to manage the cache somehow, because the observed sizes are still within the scope of the Kopia repo's own management; nor can we prove it theoretically, as explained above.

Generally speaking, setting the cache limit, which we already support, is the more graceful way to control the Kopia repo's cache, and it should be enough; unless it is really necessary, we would rather have the Kopia repo itself manage the cache all the time.

We only need to consider Velero intervening in the corner case where the cache falls out of the Kopia repo's control and is left behind, e.g., when a repo is deleted or is never visited again for a long time.
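
As a rough illustration of that corner case, leftover per-repository cache directories could be spotted with something like the sketch below. The /home/velero/.cache/kopia root and the 30-day threshold are assumptions for illustration only; this is not something Velero does today:

# Sketch: show the size of each per-repository cache subdirectory, then list
# the ones not modified for 30+ days as candidates for manual cleanup.
CACHE_ROOT=/home/velero/.cache/kopia   # assumed cache root (HOME=/home/velero)
du -sh "$CACHE_ROOT"/*/ 2>/dev/null
find "$CACHE_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +30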

reasonerjt added the "Needs info (Waiting for information)" label on Nov 25, 2024
@reasonerjt (Contributor) commented:

@mpryc
Per the comment from @Lyndon-Li above, it seems the cache growth is expected. Please elaborate on why this is a severe issue.

@mpryc (Contributor, Author) commented Nov 25, 2024:

We have observed the cache folder growing well above 5G:

# Before node-agent pod restart (after backup & restore operation):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G  237G  211G  53% /var

# After node-agent pod restart:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G   25G  422G   6% /var

While this growth is expected, the cache is not being freed even when the data used for the restore is no longer needed. This causes disk usage to accumulate until the node-agent is restarted.

Possibly a dynamically set hard limit would be sufficient here? I know there is a way to set the limit; however, it would help if the limit were calculated dynamically from the available disk size on the node, because the cache can grow to the point where the node dies from a full disk and the only option at that point is to restart the node-agent. If that happens during a backup or restore operation, that backup or restore never succeeds.
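
For illustration, this is the kind of calculation I have in mind (only a sketch: the 20% fraction and the /home/velero path are arbitrary assumptions, and how the resulting value would actually be fed into Velero/Kopia is exactly the open question here):

# Sketch: derive a candidate hard cache limit as ~20% of the space currently
# available on the filesystem that backs the cache directory.
CACHE_FS=/home/velero                              # assumed cache location
AVAIL_KB=$(df -Pk "$CACHE_FS" | awk 'NR==2 {print $4}')
LIMIT_MB=$(( AVAIL_KB / 1024 / 5 ))                # 20% of available space, in MB
echo "candidate cache hard limit: ${LIMIT_MB} MB"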

@Lyndon-Li (Contributor) commented:

How many BSLs do you have and how many namespaces are you backing up?

@mpryc (Contributor, Author) commented Nov 25, 2024:

How many BSLs do you have and how many namespaces are you backing up?

It's a single namespace, a single BSL, and a single pod with a big PV (1TB) that holds a lot of data. Many files within that PV are around 10GB each.

@Lyndon-Li (Contributor) commented:

Are you using 1.15?

@mpryc (Contributor, Author) commented Nov 25, 2024:

Are you using 1.15?

Forgot to mention that in the bug, sorry: it's not 1.15, it's 1.14.

@Lyndon-Li (Contributor) commented:

This is a known issue in 1.14 and has been fixed in 1.15, so you could upgrade to 1.15 and do another test.

shubham-pampattiwar added this to the v1.16 milestone on Dec 5, 2024
@weshayutin (Contributor) commented:

@Lyndon-Li I think @mpryc would be willing to do the work here if you want to assign it to him :)

Lyndon-Li assigned mpryc and unassigned Lyndon-Li on Dec 6, 2024
@Lyndon-Li (Contributor) commented:

I am assigning this issue to @mpryc for more investigation.
As far as we understand, Kopia's own cache management mechanism is enough; we don't need to interfere with it, which could introduce bugs or degrade performance.
After the investigation, if we have new findings, we can discuss how to handle them.

reasonerjt added the "Needs triage (We need discussion to understand problem and decide the priority)" label on Dec 6, 2024
@msfrucht (Contributor) commented Dec 9, 2024:

@Lyndon-Li

I'm not sure this is the kopia cache, though. On OpenShift the kopia cache is set to /home/velero. I don't know whether /var is used as the location of the kopia cache on ordinary k8s. @mpryc, can you check if there is a hidden .cache folder under there? That is the typical name for the kopia cache folder.
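
Something like this should show it (a sketch only; it assumes HOME=/home/velero inside the node-agent pods and reuses the velero-ns namespace from the original report):

# Sketch: report the size of the hidden kopia cache directory in every node-agent pod.
kubectl get pods -n velero-ns -l name=node-agent -o name | while read pod; do
  echo -e "$(kubectl exec -n velero-ns "$pod" -- du -sh /home/velero/.cache 2>/dev/null | awk '{print $1}')\t$pod"
done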

We have observed internally that the cache controls for Kopia do not control the size of the index cache. As a result, the index cache grows pretty much as needed.

This will also cause kopia's memory use to skyrocket during connect, due to a combination of kernel issues and index cache files not being streamed from object storage.

Index object not being streamed during download and upload: kopia/kopia#4267
Kernel issue: kopia/kopia#4162

In particular, workloads with high counts of unique blocks will cause the index cache (and the number of indices) to eventually spiral out of control, though this is a fairly unusual workload for most applications.

Our worst case scenarios have seen cache use on the order of 30GB, though these are extraordinary cases at around 100 million+ unique blocks for kopia to keep track of.

Right now, a workaround of using a PVC to hold the kopia cache during maintenance has been successful at avoiding severe issues with local nodes, up to and including malfunction. However, it requires adding a maintenance Job TTL to make sure the PVC cache storage gets cleaned up, which is not available in Velero's current minimum k8s version of 1.18; it requires 1.23. Since our organization doesn't support anything older than 1.23, that's not an issue internally.

The same approach could work with node-agents, right up until the PVC becomes full.

@Lyndon-Li (Contributor) commented:

@msfrucht
We need to treat these two problems separately: the cache size on disk and the memory usage.
Though they may emerge at the same time, the cache is never the cause of the memory usage, nor vice versa. And they may or may not share the same root cause:

  • The cache size is always related to the number of indexes and the number of unique contents. The cache lives on disk and never relies on memory directly.
  • The memory usage varies across stages, i.e., connection, maintenance, backup, restore. For repository management stages, Kopia sometimes needs to load the full indexes into memory and sometimes maps part of the indexes or contents into memory; for the backup/restore phase, it also needs to load indexes and contents, and it additionally needs to load file system objects into memory, for which the memory usage is decided by the complexity and scale of the file system. During the traversal of the file system, the paging IO cache and the file system item cache, which are owned by the system (e.g., ~40 GB Memory leak (?) kopia/kopia#4162), are also counted by the cgroup but are out of our control; this is why we set Velero's pods' resource policy to BestEffort.

To conclude, at present we consider these two areas separately. Here is the current status:

  • For the index and content cache, we leverage Kopia-aware ways to manage them, i.e., the various cache margins and a separate location for the cache (see Make kopia repo cache place configurable #7725). In 1.14 there is a known Velero issue with setting the cache margins, as a result of which the cache size may exceed the limit; 1.15 has fixed that problem.
  • For the memory usage problems, there are some known issues; several of them were fixed during the Kopia 0.17/Velero 1.15 cycle, and there are still some open issues I am working on.

@mpryc (Contributor, Author) commented Dec 10, 2024:

After some more tests, there are actually 4 parts in the equation:

For each node-agent running on each worker node of the k8s cluster:

  1. Kopia cache size, which lives in the home directory (~/.cache); in our case that is /home/velero/.cache
  2. ephemeralStorage within node-agent pod
  3. CSI DataMover - mounted PVs for the node-agent
  4. Pod memory usage

Items 1 and 2 are directly connected, because the Kopia cache is inside the user's home directory, which is not mounted from external storage and therefore counts toward the pod's ephemeral storage; we have #7725 for that.
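
To see how items 1 and 2 add up in practice, the kubelet summary endpoint reports per-pod ephemeral-storage usage. A sketch (the node name is a placeholder, the velero-ns namespace is from my earlier command, and jq is assumed to be installed locally):

# Sketch: per-pod ephemeral-storage usage as accounted by the kubelet on one node.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | jq -r '.pods[]
           | select(.podRef.namespace == "velero-ns")
           | [.podRef.name, ."ephemeral-storage".usedBytes] | @tsv'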

The 4th, as you mentioned, is about kopia memory leaks and other issues, which are also not part of this bug.

I also see that the 1.14 problem is no longer visible on 1.15. On top of that, the observation of ever-growing size was a mixture of two things: the problems that were fixed in 1.15 by moving the actual transfer from the node-agent to backupPods / restorePods (??), and our mistake of looking under the /var folder, where we miscalculated the data. The data was growing because the tests were creating new workloads with new restored PVs, which were mounted under /var/ for each successful restore:

$ for file in $(ls /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/); do du -sh /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/$file; done
338M /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/4f8c141182b62cbd1d4b7c6ad87dfee9279346eb2b60832ee8c7d0a960aea94d
368M /var/data/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/f9f710147023b88d6be7e98941c95ec60fa4fab4bf6cf223402e3ce25a7fc44f

If you agree, we can close this one as no longer a bug, since we will keep #7725, unless there is something to be done for the CSI DataMover and its mounted PVs?

@msfrucht (Contributor) commented:

@mpryc I agree with that.

I have implemented #7725 internally for kopia maintenance jobs, and it worked wonderfully to reduce ephemeral storage usage. Prometheus was showing at worst about ~1MB of ephemeral-storage usage on kopia maintenance jobs afterwards.

The changes I implemented reflect that our organization was only interested in kopia caching, so they are not generalized to fit all existing Velero workflows. The cache directories for Restic and Kopia are in different locations and, with the existing layering, are not easy to look up in code.

The default Restic cache location that would need a PVC mount is the environment variable VELERO_SCRATCH_DIR, or /scratch by default.
The default Kopia cache location that would need a PVC mount is whatever HOME is set to, typically /home/velero.

The install option --cache-dir will override the location for Restic, but not for Kopia, which was very confusing for a while because --help makes no mention of that; I only realized it after looking at the code.
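
For anyone else trying to confirm the effective locations on their install, something like this can read them out of a running node-agent pod (a sketch; it assumes the velero-ns namespace used earlier in this issue and that the image ships a shell):

# Sketch: print the variables that currently drive the two cache locations
# inside one node-agent pod.
pod=$(kubectl get pods -n velero-ns -l name=node-agent -o name | head -1)
kubectl exec -n velero-ns "$pod" -- sh -c 'echo "HOME=$HOME  VELERO_SCRATCH_DIR=$VELERO_SCRATCH_DIR"'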

#7725 would be a good place to design and implement something more complete for specifying the cache dir for both Restic and Kopia.

Due to the differing cache locations, consolidating to a single location, regardless of whether the datamover is Restic or Kopia, would make the #7725 implementation less complicated.

I couldn't find a cache location design in the design documentation. It appears to have grown organically into incomplete options and environment variables.

The basic design for cache volumes I used was fairly simple due to our limited internal use case: add a "storage" section to maintenanceJobConfig and nodeAgentConfig, with fields storageClassName, size, and accessMode (ReadWriteOnce, ReadWriteMany, or ReadWriteOncePod; ROX is a validation failure). The extra access modes are just in case a storage class exists somewhere that allows ReadWriteMany and not ReadWriteOnce.

storageClassName defaulted to nil to allow the default storage class to be used.
accessMode defaulted to ReadWriteOnce.
size was a required string in Kubernetes storage-request format.

During maintenance job creation, attach an ephemeral volume using the above spec and mount it into the cache dir.
During datamover deployment, also add an ephemeral volume, the same as for maintenance.
For node-agent, that would require install work to add such a volume. I never got that far, since internally we only cared about kopia datamovers controlled by DataUpload/DataDownload.

Ephemeral PVC volumes attached to Jobs do not get removed until the Job is deleted. With a default of 3 maintenance Jobs kept around at a time, this causes excessive PVC storage use unless a Job TTL is attached (k8s 1.23+); a rough sketch of the idea follows.
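
The sketch below is not Velero's actual maintenance Job spec, only an illustration of the mechanism: a generic ephemeral PVC mounted at the kopia cache location plus ttlSecondsAfterFinished, so the PVC is garbage-collected together with the Job. The namespace, image tag, size, and the /home/velero/.cache path are assumptions.

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: repo-maintenance-cache-demo
  namespace: velero-ns
spec:
  ttlSecondsAfterFinished: 3600        # k8s 1.23+: the Job and its ephemeral PVC get cleaned up
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: maintenance
        image: velero/velero:v1.15.0   # placeholder image
        command: ["sleep", "10"]       # placeholder for the real maintenance command
        volumeMounts:
        - name: kopia-cache
          mountPath: /home/velero/.cache
      volumes:
      - name: kopia-cache
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              # storageClassName omitted -> cluster default storage class
              resources:
                requests:
                  storage: 10Gi
EOF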

@Lyndon-Li (Contributor) commented:

Ephemeral PVC volumes attached to Jobs do not get removed until the Job is deleted. With a default 3 maintenance Jobs kept around at a time, causes excessive PVC storage

Issue #7923 would solve this problem, so for the design of #7725 it is fine to focus only on cache location management.

@Lyndon-Li (Contributor) commented:

The basic design for a cache volumes I used was fairly simple due to our internal limited use case, add a "storage" section to maintenanceJobConfig and nodeAgentConfig. Fields of storageClassName, size, and an accessMode

The cache location configuration should be per volume, rather than a general setting for all DU/DD/maintenance jobs, because not every volume requires a separate cache volume, e.g., when the repo size/scale is small. In this way, we could reduce the number of PVCs/PVs.
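
Purely as a strawman of what "per volume" could look like (none of these field names exist in Velero today; the sketch only illustrates that a cache PVC would be requested solely for the volumes that need one):

# Hypothetical sketch only: a per-volume cache entry in a node-agent style config.
cat <<'EOF' > node-agent-cache-sketch.yaml
cacheVolumes:
- volume:
    pvcName: big-data-pvc        # only this source volume gets a dedicated cache PVC
  storageClassName: fast-ssd
  size: 20Gi
  accessMode: ReadWriteOnce
EOF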

@Lyndon-Li (Contributor) commented:

I am closing this issue, as we agreed that it is covered by #7725. Feel free to reopen it if required.
For the cache location management discussion, let's continue in #7725.
