
Make kopia repo cache place configurable #7725

Open
Lyndon-Li opened this issue Apr 23, 2024 · 22 comments

Comments

@Lyndon-Li
Contributor

Related to issues #7499 and #7718.
The cache policy determines the root file system disk usage in the pod where data movement is running; on the other hand, it also significantly impacts restore performance.
Therefore, besides storing the cache in the root file system, we should allow users to add a dedicated volume to hold the cache.
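For illustration, here is a minimal sketch of what a dedicated cache volume could look like on a data-mover-style pod, assuming the Kopia cache stays at its default location under /root/.cache/kopia; the pod name, image, and size limit are hypothetical, and this is not an existing Velero option.

```yaml
# Hypothetical sketch only: mount a dedicated volume over the default
# Kopia cache path so the cache no longer consumes the root file system.
apiVersion: v1
kind: Pod
metadata:
  name: example-data-mover            # hypothetical pod name
spec:
  containers:
    - name: data-mover
      image: velero/velero:main       # placeholder image
      volumeMounts:
        - name: kopia-cache
          mountPath: /root/.cache/kopia   # default Kopia cache location
  volumes:
    - name: kopia-cache
      emptyDir:
        sizeLimit: 10Gi               # caps cache disk usage; could also be a PVC
```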


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@blackpiglet
Contributor

unstale

@github-actions github-actions bot removed the staled label Jul 2, 2024
@itayrin

itayrin commented Sep 3, 2024

Hi @Lyndon-Li and hi all,
I'm currently trying to use Velero 6.7.0 with CLI 13.2 and need it to work in production.
I'm experiencing the same issue described in:
#7620 (comment)

I saw that in the next version of Velero (1.15) there should be a fix that allows configuring the Kopia cache - but until then, is there a recommended workaround?
Currently, the only thing I managed to do that allows both the restore to succeed and the ephemeral storage not to explode is:
polling '/var/lib/containerd' for the locations consuming the most disk, recursively reaching the path "../root/.cache/kopia//contents", and deleting that directory.

Other things I tried which solved the disk space issue but caused the restore operations to fail:

  • Assigning ephemeral-storage requests and limits on the node-agent pod
  • Assigning a PVC to the node-agent DaemonSet in order to mount the cache directory on CSI-Ceph. I couldn't make the PVC dynamic, and I don't want to create a one-off PVC.
  • I can't access the node-agent pod, as it does not have a shell
  • AFAIK, the Velero charts can't be configured to pass Kopia the relevant flags to limit the cache

If you have any better ideas for a workaround, I would be happy to hear them, thanks.

@itayrin

itayrin commented Sep 4, 2024

Update - This solution of mine also doesn't work - it makes the restore fail with:
Operation Error: error to initialize data path: error to boost backup repository connection default-xdr-mt-kopia: error to connect backup repo: error to connect repo with storage: error to connect to repository: unable to create shared content manager: error setting up read manager caches: unable to initialize content cache: unable to create base cache: error during initial scan of contents: error listing contents: error processing directory shards: error reading directory: readdirent /root/.cache/kopia/b3adfaac80582fd3/contents: no such file or directory Progress description: Failed

@s4ndalHat

Hi, can this enhancement solve the DiskPressure problem that occurs when restoring large data (e.g. 500 GB) when we do not have 500 GB of free space on the nodes?

@Lyndon-Li
Contributor Author

Lyndon-Li commented Sep 12, 2024

Hi, can this enhancement solve the DiskPressure problem that occurs when restoring large data (e.g. 500 GB) when we do not have 500 GB of free space on the nodes?

Issue #7620 fixes it in 1.15. This issue is a further enhancement.

@s4ndalHat

@Lyndon-Li thanks for your response, waiting for v1.15 ;) May I ask if there is a planned release date for this version?

@Lyndon-Li
Contributor Author

See 1.15 roadmap https://github.com/vmware-tanzu/velero/wiki/1.15-Roadmap

@reasonerjt
Contributor

reasonerjt commented Oct 12, 2024

Tentatively moving this out of the milestone for now, because there may be more complexities.

We need a design because we need to handle the cases for the velero pod, node-agent, and data-mover pod.

@reasonerjt
Contributor

The impact is low given that we have a limit on the size of the cache.

@reasonerjt reasonerjt removed this from the v1.16 milestone Oct 31, 2024
@msfrucht
Contributor

msfrucht commented Nov 8, 2024

@reasonerjt Our systems have some workloads that cause the index cache to become very large, 10+ GB. Kopia's existing metadata and content cache limits don't cover this.

Admittedly, these workloads are unusual and approach the worst case possible for deduplication, with very high counts of unique blocks.

Combining this with a Job TTL (requires Kubernetes 1.23 at minimum) is likely required. During testing of a local implementation, I found that for Jobs without a TTL and with a PVC-backed cache, the PVC remains past Job completion. Excessive amounts of reserved storage may otherwise go unused, attached to finished maintenance jobs.
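To illustrate the combination described above, here is a hedged sketch (the name, image, and sizes are hypothetical): a maintenance-style Job with ttlSecondsAfterFinished, using a generic ephemeral volume for the cache so the PVC is deleted together with the Job's pod once the TTL cleanup runs.

```yaml
# Hypothetical sketch: Job TTL plus a generic ephemeral cache volume, so the
# cache PVC does not outlive the maintenance Job.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-repo-maintenance         # hypothetical name
spec:
  ttlSecondsAfterFinished: 3600          # requires Kubernetes 1.23+
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: maintenance
          image: velero/velero:main      # placeholder image
          volumeMounts:
            - name: kopia-cache
              mountPath: /root/.cache/kopia
      volumes:
        - name: kopia-cache
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 20Gi        # illustrative size
```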

@Lyndon-Li
Contributor Author

@shubham-pampattiwar
I am adding you as the assignee of this issue. There are several details to be covered, e.g., how to differentially assign volumes to backupPods/restorePods and how/whether to allow shared volumes among multiple backupPods/restorePods, so we need a design for it.
Feel free to add it to the 1.16 milestone if you regard this as a high priority on your side.

@shubham-pampattiwar
Collaborator

@mpryc will be assisting with this issue. Thank you @mpryc !

@Lyndon-Li
Contributor Author

Lyndon-Li commented Dec 11, 2024

@mpryc Please comment in this issue; I couldn't add you as an assignee. Please also share your plan for this issue: if you want to make some progress (e.g., pure design or design + implementation) in 1.16, we can move this issue to the 1.16 milestone.

@msfrucht
Contributor

Resolving the cache location divergence between the Restic and Kopia data movers would be a useful element of this item.

The default Restic cache location needed for PVC mounts is the environment variable VELERO_SCRATCH_DIR, or /scratch by default.
The default Kopia cache location needed for PVC mounts is whatever HOME is set to, typically /home/velero.

The --cache-dir install option will override this for Restic, but not for Kopia.
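Building only on the locations mentioned in this comment, a hedged sketch of pointing both caches at a single mounted volume via environment variables (the paths, image, and claim name are illustrative assumptions, not an existing Velero install option):

```yaml
# Hypothetical pod-spec fragment: route both caches onto one mounted volume.
# VELERO_SCRATCH_DIR is the Restic scratch location noted above; Kopia derives
# its default cache directory from HOME.
containers:
  - name: node-agent
    image: velero/velero:main          # placeholder image
    env:
      - name: VELERO_SCRATCH_DIR
        value: /cache/scratch          # Restic scratch/cache location
      - name: HOME
        value: /cache/home             # Kopia cache ends up under $HOME/.cache/kopia
    volumeMounts:
      - name: repo-cache
        mountPath: /cache
volumes:
  - name: repo-cache
    persistentVolumeClaim:
      claimName: repo-cache-pvc        # hypothetical pre-created PVC
```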

@Lyndon-Li
Contributor Author

Resolving the cache location divergence between the Restic and Kopia data movers would be a useful element of this item.

The default Restic cache location needed for PVC mounts is the environment variable VELERO_SCRATCH_DIR, or /scratch by default. The default Kopia cache location needed for PVC mounts is whatever HOME is set to, typically /home/velero.

The --cache-dir install option will override this for Restic, but not for Kopia.

Restic has been deprecated since 1.15, so we will try to cover this gap in the design, but if we eventually find that we need to do special things for the Restic path, we would probably drop it.

@mpryc
Contributor

mpryc commented Dec 12, 2024

Based on this answer from Kopia's founder, it seems feasible to share Kopia's cache across node-agents using the same persistent volume (PV).

If separate caches are required for each node-agent, the PVs would need to be created before deployment with explicit nodeAffinity, which would ensure that specific PVs are accessible only from specific nodes (pinned to node-agents).

Let me know if the above makes sense; however, I don't know whether sharing the same cache would be feasible for Restic, or whether the node affinity approach is worth exploring further.

The challenge here is that the node-agents are created using a DeploymentConfig, so we are limited in how PVs can be dynamically attached to the pods; that's why I am thinking of the above approach.
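For the per-node cache option described above, here is a hedged sketch of a pre-created PV pinned to one node via nodeAffinity (the name, size, storage class, path, and node name are all illustrative):

```yaml
# Hypothetical sketch: a local PV pinned to a single node, so only workloads
# scheduled on that node can use it for the Kopia cache.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kopia-cache-node-a            # hypothetical name
spec:
  capacity:
    storage: 20Gi                     # illustrative size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage     # hypothetical class
  local:
    path: /mnt/kopia-cache            # hypothetical host path
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-a              # pin to this node
```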

@Lyndon-Li
Contributor Author

Lyndon-Li commented Dec 12, 2024

@mpryc I think we only need to consider data mover for now; fs-backup, which requires node-agent changes, is lower priority. cc @reasonerjt

For the question you mentioned, the answer is yes: we can share the same volume for multiple backupRepositories, and that is actually the recommended way, as it reduces the number of PVC/PV resources created. But it does require RWX volumes; if those don't exist, we have to place the caches into separate volumes.
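For reference, a minimal sketch of the shared-cache variant, assuming an RWX-capable storage class is available (the claim name, namespace, class, and size are illustrative):

```yaml
# Hypothetical sketch: one RWX claim that multiple backupPods/restorePods
# could mount at the Kopia cache path at the same time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-kopia-cache            # hypothetical name
  namespace: velero
spec:
  accessModes:
    - ReadWriteMany                   # required for sharing across nodes
  storageClassName: cephfs            # any RWX-capable class; illustrative
  resources:
    requests:
      storage: 50Gi                   # illustrative size
```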

@msfrucht
Contributor

msfrucht commented Dec 13, 2024

ReadWriteOnce access mode does not stop multiple Pods from attaching to the same volume. It just forces the additional Pods to be scheduled onto the same node in order for them to start.

Even with ReadWriteOnce, sharing the same volume is possible, provided one keeps track of which node is running which data mover for which backup repository. It would be a major increase in implementation difficulty.

@reasonerjt
Contributor

Based on the discussion, we don't have agreement on the scope and solution for this issue.
I'm removing the "candidate" label before the v1.16 "Feature Freeze" and leaving it in the "backlog".
It doesn't block us from continuing to discuss and write a design, though.

@mpryc
Contributor

mpryc commented Dec 19, 2024

There are two parts to this, each a rather separate option:

  1. Data mover with an option to attach external storage (as @Lyndon-Li wrote)
    Here we have two approaches; if there are more, please let me know, and also tell me whether my thinking is right:

    • Allowing users to specify an existing Persistent Volume that will be mounted to the new microservice
    • Allowing users to create a new PV and then mount it to the new microservice. This, however, might add significant complexity, especially if it involves user-specified parameters such as size, storage class, and the other sub-options that a PVC allows, and eventually handling cases where the PVC already exists (use it or bail out).
  2. Node agent using similarly mounted PVs - @Lyndon-Li, why do you think this is much more complex?

@Lyndon-Li
Contributor Author

Node agent using similarly mounted PVs - @Lyndon-Li, why do you think this is much more complex?

Whenever we attach a new volume to the node-agent, all node-agent pods are restarted. This is not acceptable if we do it outside of the controller, because once the node-agent pods restart, all fs-backup and data mover backups/restores are affected.
Therefore, we need to design a sophisticated user interaction, so that when users configure the volumes, no fs-backup or data mover backup/restore operations are running.

Additionally, I think doing this for the node-agent is lower priority, since data mover is the preferred backup method; fs-backup is not used unless data mover is unavailable.
