
gc: remove cache files that were already pushed to remote #2036

Closed
shaunirwin opened this issue May 22, 2019 · 22 comments · Fixed by #9350
Assignees
Labels: A: data-management (Related to dvc add/checkout/commit/move/remove), A: gc (Related to garbage collection), enhancement (Enhances DVC), feature request (Requesting a new feature), p1-important (Important, aka current backlog of things to do)

Comments

@shaunirwin

It would be useful to be able to clear up disk space on one's local machine by removing the cache files of specific targets.

@efiop efiop changed the title Add option to dvc gc to remove cache files for specific targets gc: remove cache files for specific targets May 22, 2019
@efiop efiop added the feature request Requesting a new feature label May 22, 2019
@efiop
Contributor

efiop commented May 22, 2019

Hi @shaunirwin !

The problem with applying gc to specific dvc files is that the same cache files might be used by other dvc files, so it is not trivial to solve in the general case. Maybe you could describe your scenario so we can see whether there is a better approach for your case?

Thanks,
Ruslan
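
The sharing problem mentioned above can be illustrated with a toy model. This is only a sketch, not DVC's internals: the `refs` mapping, the hash strings, and the `exclusive_hashes` helper are all hypothetical names made up for illustration. It assumes each .dvc file references a set of cache hashes and shows that only hashes referenced by a single .dvc file could be dropped safely when targeting that file.

```python
from collections import defaultdict
from typing import Dict, Set


def exclusive_hashes(refs: Dict[str, Set[str]], target: str) -> Set[str]:
    """Return the hashes referenced by `target` and by no other .dvc file."""
    owners = defaultdict(set)
    for dvc_file, hashes in refs.items():
        for h in hashes:
            owners[h].add(dvc_file)
    # A hash is safe to drop for `target` only if `target` is its sole owner.
    return {h for h in refs[target] if owners[h] == {target}}


# Hypothetical example: "bbb" is shared by both .dvc files.
refs = {
    "data1.dvc": {"aaa", "bbb"},
    "data2.dvc": {"bbb", "ccc"},
}
print(exclusive_hashes(refs, "data1.dvc"))
```

In this model, removing everything that data1.dvc references would also delete "bbb", which data2.dvc still needs.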

@nik123
Contributor

nik123 commented Jul 17, 2019

I think I have a possible use case for this request.

TL;DR: there should be a way for dvc gc to remove cache files that have already been pushed to remote DVC storage.

Detailed use case

I have two dirs under dvc control: data1 and data2. Here is the file structure:

├── data1
├── data1.dvc
├── data2
└── data2.dvc

At first I worked with data1:

$ dvc add data1
$ git add .
$ git commit
$ dvc push
$ git push

Since I committed and pushed data1, it hasn't changed. Now I'm working with data2. The problem is that data1 is big and I'm running out of disk space on my local machine. I want to delete data1 from both my workspace and my cache. Deleting the data1 cache is safe because it has already been pushed to remote DVC storage. I can delete data1 from the workspace with dvc remove, but I can't clear my cache because dvc gc considers the data1 cache "current" and does nothing.

P.S.:

I'm not sure that dvc gc for specific targets is the best option. Probably dvc gc should provide an option like "clear all cache files which are already synchronized with a specific remote storage".
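
As a rough sketch of the "clear everything already synchronized" idea, the selection could be modeled as below. This is not how DVC implements anything: `pushed_objects`, the hash names, and the sizes are hypothetical, and it assumes the set of hashes present in the remote is already known.

```python
from typing import Dict, Set, Tuple


def pushed_objects(
    local_sizes: Dict[str, int], remote_hashes: Set[str]
) -> Tuple[Set[str], int]:
    """Pick local cache objects that also exist in the remote.

    Returns the hashes that could be deleted locally and the bytes freed.
    """
    safe = {h for h in local_sizes if h in remote_hashes}
    freed = sum(local_sizes[h] for h in safe)
    return safe, freed


# Hypothetical cache: data1 objects were pushed, the data2 object was not.
local = {"d1a": 500, "d1b": 300, "d2a": 50}
remote = {"d1a", "d1b"}
print(pushed_objects(local, remote))
```

Under these assumptions only the data1 objects are selected, and the unpushed data2 object stays in the local cache.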

@efiop efiop changed the title gc: remove cache files for specific targets gc: remove cache files that were already pushed to remote Jul 22, 2019
@efiop efiop added enhancement Enhances DVC p3-nice-to-have It should be done this or next sprint labels Jul 22, 2019
@kaiogu
Contributor

kaiogu commented Nov 27, 2019

My use case:
I'm working on multiple projects in parallel that all have large datasets tracked by dvc. They are located on a (small) SSD, so I can't have all datasets on disk simultaneously. DVC-tracked files are backed up on an SSH remote. I want to be able to quickly clear the space used by files from one project so that I can dvc checkout the files from another.

@efiop efiop added p2-medium Medium priority, should be done, but less important p1-important Important, aka current backlog of things to do and removed p3-nice-to-have It should be done this or next sprint p2-medium Medium priority, should be done, but less important p1-important Important, aka current backlog of things to do labels Nov 27, 2019
@efiop
Contributor

efiop commented Nov 28, 2019

In terms of implementation, it would look like this:

  1. Add a new flag to https://github.com/iterative/dvc/blob/master/dvc/command/gc.py . Not sure what to call it though, maybe anyone has any ideas about it? 🙂 Seems like it should be an additional flag for -c|--cloud though.
  2. Add support for that new flag to https://github.com/iterative/dvc/blob/master/dvc/repo/gc.py . To support it, if the flag is specified, in the last if cloud block, instead of used we should supply all local cache. E.g. a POC would look something like:
from dvc.cache import NamedCache
...
if cloud:
    if our_new_flag:
        used = [NamedCache.make("local", checksum, None) for checksum in self.cache.local.all()]
    _do_gc("remote", self.cloud.get_remote(remote, "gc -c").gc, used)

That last "used =" is needed to transform str checksums returned by all() to NamedCache, which is expected by gc(). Might be a nicer way to organize this, but it def gets the job done 🙂 And that is pretty much it in terms of implementation.

@kaiogu
Contributor

kaiogu commented Nov 28, 2019

@efiop I'd like to take a swing at it :)

@efiop
Contributor

efiop commented Nov 28, 2019

@kaiogu Please let us know if you need any help 🙂

@kaiogu
Contributor

kaiogu commented Jan 23, 2020

Hi, I was on long vacation, but am back now and will start this today.

@efiop
Contributor

efiop commented Jan 29, 2020

@kaiogu Sounds good, let us know if you have any questions :) Btw, we have a dev-general channel in our discord, please feel free to join 🙂

@kaiogu
Contributor

kaiogu commented Jan 30, 2020

@efiop, I was thinking of calling it "safe" because it would only clear the cache if there was a backup safe in a remote. I'll let you know if I get stuck :)

@efiop
Contributor

efiop commented Feb 3, 2020

@kaiogu safe is too generic for my personal taste 🙂 , maybe @iterative/engineering have any thoughts/ideas about it?

@casperdcl
Contributor

local, pushed, synced?

@Suor
Contributor

Suor commented Feb 4, 2020

Another option is to include this into push somehow:

dvc push --remove

@efiop
Contributor

efiop commented Feb 6, 2020

Or dvc remote gc as it was suggested during the gc discussion.

@Suor
Contributor

Suor commented Feb 19, 2020

I would expect dvc remote gc to gc the remote, not the local cache.

@Viktor2k

Viktor2k commented Nov 9, 2021

Any updates on this feature?

@efiop efiop removed the help wanted label Nov 17, 2021
@efiop
Contributor

efiop commented Nov 17, 2021

@Viktor2k Not actively working on this directly yet 🙁

@pmrowla pmrowla added the A: data-management Related to dvc add/checkout/commit/move/remove label May 11, 2022
@aschuh-hf

My company and I would also be interested in this! It seems this feature would be quite useful for using DVC to share locally downloaded datasets across multiple projects on remote servers. For this, we need a way to clean up data in the local cache that is not currently needed, even though that data is still required to reproduce results in the future (hence it is still referenced in dvc.yaml files and a backup is stored in the remote storage).

@pmrowla pmrowla added the A: gc Related go garbage collection label May 11, 2022
@alexmojaki
Contributor

@efiop this comment:

In terms of implementation, it would look like this:

... in the last if cloud block, instead of used we should supply all local cache. E.g. POC would look something like

from dvc.cache import NamedCache
...
if cloud:
    if our_new_flag:
        used = [NamedCache.make("local", checksum, None) for checksum in self.cache.local.all()]
    _do_gc("remote", self.cloud.get_remote(remote, "gc -c").gc, used)

and this one:

Or dvc remote gc as it was suggested during the gc discussion.

seem to suggest that you're thinking about removing remote files, not local ones. Isn't that the opposite of the requested feature? Am I misunderstanding something?
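
The difference in direction raised here can be made concrete with a toy model. This is a sketch under assumptions, not DVC code: the `gc` function below is a hypothetical stand-in that removes from a store everything not listed in `used`, and the hash names are invented.

```python
from typing import Set


def gc(store: Set[str], used: Set[str]) -> Set[str]:
    """Toy gc: return the objects that would be removed from `store`."""
    return store - used


local = {"a", "b", "c"}
remote = {"b", "c", "d"}

# Direction of the quoted POC: gc the remote, treating the whole local
# cache as "used". Objects that exist only in the remote are removed.
removed_from_remote = gc(remote, used=local)  # {"d"}

# Direction of the requested feature: gc the local cache, treating as
# "used" only what is NOT yet in the remote. Pushed objects are removed.
removed_from_local = gc(local, used=local - remote)  # {"b", "c"}

print(removed_from_remote, removed_from_local)
```

Read this way, the POC frees space in the remote, while the request in this issue is about freeing space on the local machine.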

@mvkvc

mvkvc commented Sep 15, 2022

So is there currently no supported command to clear the local cache (keeping files backed up in remote)? And as a workaround could you just delete the cache folder and run dvc pull to repopulate it when needed?

@dberenbaum
Collaborator

That all sounds right @mvkvc
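
The workaround confirmed above could be scripted roughly as follows. This is a minimal sketch, assuming the cache lives in the default .dvc/cache directory under the repository root (a custom cache location is not handled) and that everything in it has already been pushed. The `clear_local_cache` helper is a hypothetical name, not part of DVC. If the workspace links into the cache (for example with symlinks), deleting the cache would presumably break those files until they are pulled again.

```python
import shutil
from pathlib import Path


def clear_local_cache(repo_root, dry_run=True):
    """Delete <repo_root>/.dvc/cache and return the number of bytes it held.

    With dry_run=True (the default) nothing is deleted; only the size
    that would be freed is reported.
    """
    cache_dir = Path(repo_root) / ".dvc" / "cache"
    if not cache_dir.is_dir():
        return 0
    size = sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())
    if not dry_run:
        shutil.rmtree(cache_dir)
    return size
```

After calling clear_local_cache(".", dry_run=False), running dvc pull in the project repopulates the cache when the data is needed again.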

@paulmueller

I came here looking for an equivalent of "git annex drop" for dvc. Is manually deleting the cache and the dvc-pulled file really the only option to free disk space?

@dberenbaum
Collaborator

It seems the primary blocker here is UI. Let's make a decision so we could move forward. How about dvc gc --not-in-remote?

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Apr 15, 2023
@daavoo daavoo self-assigned this Apr 18, 2023
@daavoo daavoo added this to DVC Apr 18, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Apr 18, 2023
@daavoo daavoo moved this from Backlog to Todo in DVC Apr 18, 2023
@daavoo daavoo moved this from Todo to In Progress in DVC Apr 19, 2023
daavoo added a commit that referenced this issue Apr 20, 2023
@daavoo daavoo linked a pull request Apr 20, 2023 that will close this issue
daavoo added a commit that referenced this issue Apr 20, 2023
daavoo added a commit that referenced this issue Apr 20, 2023
daavoo added a commit that referenced this issue Apr 21, 2023
@daavoo daavoo moved this from In Progress to Review In Progress in DVC Apr 21, 2023
daavoo added a commit that referenced this issue Apr 27, 2023
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in DVC Apr 27, 2023
daavoo added a commit that referenced this issue Apr 27, 2023