Make kopia repo cache place configurable #7725
This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days. If a Velero team member has requested logs or more information, please provide the output of the shared commands.
unstale
Hi @Lyndon-Li and hi all, I saw that in the next version of Velero (1.15) there should be a fix allowing configuration of the Kopia cache, but until then is there a recommended workaround? Other things I tried, which solved the disk space issue but caused the restore operations to fail:
If you have any better ideas for a workaround I would be happy to hear, thanks.
Update: this solution of mine also doesn't work; it makes the restore fail on:
Hi, can this enhancement solve the DiskPressure problem that occurs when restoring large data (for example 500 GB) when we do not have 500 GB of free space on the nodes?
Issue #7620 fixes it in 1.15. This issue is a further enhancement.
@Lyndon-Li thanks for your response, waiting for v1.15 ;) may I ask if there is a planned release date for this version?
See the 1.15 roadmap: https://github.com/vmware-tanzu/velero/wiki/1.15-Roadmap
Tentatively moving it out of the milestone for now, because there may be more complexities. We need a design, because we need to handle the cases for the velero pod, node-agent, and data-mover pods.
The impact is low given we have a limitation on the size of the cache.
@reasonerjt Our systems have some workloads that cause the index cache to grow very large, 10+ GB. Kopia's existing metadata and content cache limits don't cover this. Admittedly, these workloads are unusual and approach the worst case for deduplication, with very high counts of unique blocks. Combining this with a Job TTL (requires Kubernetes 1.23 minimum) is likely required: during testing of a local implementation we found that for Jobs without a TTL and with a PVC cache, the PVC remains past Job completion. Excessive amounts of reserved storage may otherwise go unused, attached to completed maintenance jobs.
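For context, a minimal sketch of how a maintenance-style Job could avoid leaving its cache PVC behind, assuming Kubernetes 1.23+ for `ttlSecondsAfterFinished` and using a generic ephemeral volume; the names, image, and mount path below are hypothetical placeholders, not Velero's actual maintenance job spec:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: repo-maintenance-example       # hypothetical name
  namespace: velero
spec:
  ttlSecondsAfterFinished: 3600        # Job and its Pod are deleted after completion; needs k8s 1.23+
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: maintenance
          image: busybox               # placeholder; the real maintenance workload is omitted
          command: ["sh", "-c", "echo maintenance placeholder"]
          volumeMounts:
            - name: kopia-cache
              mountPath: /cache        # assumed cache path, for illustration only
      volumes:
        - name: kopia-cache
          ephemeral:                   # generic ephemeral volume: the PVC is owned by the Pod
            volumeClaimTemplate:       # and garbage-collected when the Pod is deleted
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 20Gi
```

With this pattern the reserved cache storage would not linger attached to completed maintenance jobs.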
@shubham-pampattiwar
@mpryc Please comment in this issue; I couldn't add you as assignee. Please also share your plan for this issue. If you want to make some progress (e.g., pure design or design + implementation) in 1.16, we can move this issue to the 1.16 milestone.
Resolving the cache location divergence between Velero's Restic and Kopia data movers would be a useful element of this item. The default Restic cache location needed for PVC mounts is the environment variable VELERO_SCRATCH_DIR, or /scratch by default. The --cache-dir install option will override it for restic, but not for kopia.
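For illustration only, one way to point the restic side at a different scratch directory, assuming the default node-agent DaemonSet name and the velero namespace; the env var is the one named above and, per the comment, this does not affect kopia:

```sh
# Assumes namespace "velero" and DaemonSet "node-agent"; restic only, per the comment above.
# The target directory is a hypothetical example.
kubectl -n velero set env daemonset/node-agent VELERO_SCRATCH_DIR=/var/velero-scratch
```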
Restic has been announced as deprecated since 1.15, so we will try to cover this gap in the design, but if we eventually find we need to do special things for the Restic path, we would probably drop it.
Based on this answer from Kopia's founder, it seems feasible to share Kopia's cache across node-agents using the same persistent volume (PV). If separate caches are required for each node-agent, PVs would need to be created before deployment with explicit nodeAffinity, which would ensure that specific PVs are accessible only by specific nodes (pinned to node-agents), as sketched below. Let me know if the above makes sense; however, I don't know if sharing the same cache would be feasible for restic, or whether node affinity is worth exploring further. The challenge here is that the node-agents are created using a DeploymentConfig, so we are limited in how PVs can be dynamically attached to the pods, which is why I am thinking of the above approach.
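A rough sketch of the per-node option described above, using a pre-provisioned local PV pinned to a single node via nodeAffinity; the names, sizes, storage class, and host path are illustrative assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kopia-cache-node-a             # hypothetical: one PV per node
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-cache        # hypothetical class for the cache claims to match
  local:
    path: /mnt/velero-cache            # must exist on the node beforehand
  nodeAffinity:                        # pins the PV (and thus the cache) to one node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-a               # the node whose node-agent should use this cache
```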
@mpryc I think we only need to consider the data mover for now; fs-backup, which requires node-agent changes, is lower priority. cc @reasonerjt For the question you mentioned, the answer is yes, we can share the same volume for multiple backupRepositories, and it is actually the recommended way, as it saves on the PVC/PV resources created. But it does require RWX volumes (see the sketch below); if they are not available, we have to place the caches into separate volumes.
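For the shared-volume case, a minimal example of the kind of RWX claim this would rely on; the name and storage class are assumptions, and any RWX-capable provisioner would do:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-kopia-cache             # hypothetical name
  namespace: velero
spec:
  accessModes:
    - ReadWriteMany                    # required so pods on different nodes can share the cache
  storageClassName: nfs-csi            # assumption: any RWX-capable storage class
  resources:
    requests:
      storage: 100Gi
```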
The ReadWriteOnce access mode does not stop multiple Pods from attaching to the same volume; it just forces the scheduler to place the Pods on the same node in order for the additional Pods to start. Even with ReadWriteOnce, sharing the same volume is possible, provided one keeps track of which node is running which data mover for which backup repository. It would, however, be a major increase in implementation difficulty.
Based on the discussion, we don't have agreement on the scope and solution for this issue.
There are two parts to this, each a rather separate option:
Whenever we attach a new volume to node-agent, all node-agent pods will be restarted. This is not acceptable if done outside the controller, because once node-agent pods restart, all fs-backup and data mover backup/restore operations are affected. Additionally, I think doing this for node-agent is lower priority, since data mover is the preferred backup method and fs-backup is not used unless data mover is unavailable.
Related to issues #7499 and #7718.
The cache policy determines the root file system disk usage in the pod where data movement is running; on the other hand, it also impacts the restore performance significantly.
Therefore, besides storing the cache in the root file system, we should allow users to add a dedicated volume to hold the cache.
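As a non-authoritative illustration of what a dedicated cache volume could look like for the node-agent case; the mount path, size, and use of emptyDir are assumptions rather than the adopted design, and note the caveat above that patching the DaemonSet restarts all node-agent pods:

```sh
# Sketch only: attach a dedicated cache volume to the node-agent DaemonSet.
# The mount path /cache is hypothetical; a PVC could be used instead of emptyDir
# to keep the cache off the node disk entirely.
kubectl -n velero patch daemonset node-agent --type strategic -p '
spec:
  template:
    spec:
      containers:
        - name: node-agent
          volumeMounts:
            - name: repo-cache
              mountPath: /cache
      volumes:
        - name: repo-cache
          emptyDir:
            sizeLimit: 20Gi
'
```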