Mechanism to update a dataset w/o downloading it first #4657
I'm facing a similar issue, and I think this can be easily fixed in the case where the entire folder is tracked by dvc. If someone wants to upload a single image to a folder tracked by dvc, the whole operation comes down to fetching the directory's `.dir` manifest, uploading the new file to the cache, and uploading an updated manifest: just 2 file uploads and 1 download. Maybe the solution is to allow the user to run these steps through dvc directly.
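For concreteness, here is a minimal Python sketch of that flow. It assumes a remote reachable as a local path, DVC's md5 object layout (`<root>/<md5[:2]>/<md5[2:]>`), and the `.dir` manifest format (a JSON list of `{"md5", "relpath"}` entries); all names and values are hypothetical, and real dvc may serialize the manifest differently, so the computed hash is illustrative only:

```python
import hashlib
import json
import os
import shutil

# Hypothetical example values: a remote mounted at a local path, a new
# image, and the old directory hash taken from the .dvc file (`md5: <hash>.dir`).
REMOTE = "/mnt/dvc-remote"
NEW_FILE = "img_1001.jpg"
RELPATH = "images/img_1001.jpg"   # destination inside the tracked dir
OLD_DIR_HASH = "1f9a4e7c0..."     # from data.dvc

def object_path(md5, suffix=""):
    # DVC addresses cache/remote objects as <root>/<md5[:2]>/<md5[2:]>
    return os.path.join(REMOTE, md5[:2], md5[2:] + suffix)

def file_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# The 1 download: the directory manifest, a JSON list of entries.
with open(object_path(OLD_DIR_HASH, ".dir")) as f:
    entries = json.load(f)

# Upload #1: the new file, stored under its own md5.
md5 = file_md5(NEW_FILE)
os.makedirs(os.path.dirname(object_path(md5)), exist_ok=True)
shutil.copyfile(NEW_FILE, object_path(md5))

# Upload #2: the updated manifest. (Real dvc may serialize the JSON
# differently, so treat the resulting hash as illustrative.)
entries.append({"md5": md5, "relpath": RELPATH})
raw = json.dumps(entries, sort_keys=True).encode()
new_dir_hash = hashlib.md5(raw).hexdigest()
os.makedirs(os.path.dirname(object_path(new_dir_hash)), exist_ok=True)
with open(object_path(new_dir_hash, ".dir"), "wb") as f:
    f.write(raw)

# Finally, data.dvc in Git must be pointed at the new manifest:
print(f"data.dvc -> md5: {new_dir_hash}.dir")
```

The last step, rewriting the hash in `data.dvc`, is what has to go through Git, which is where the concurrency questions discussed later in this thread come in.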
Fully agree with @MetalBlueberry: it would be nice to have an option to pull only directory listings from the remote, plus a `dvc add --merge-into data.dvc data/` command that performs the operations described above.

As an additional feature, we could implement a merge of two different .dvc files for the same directory:

```
$ dvc add data/ --file data_update.dvc
$ dvc add --merge-into data.dvc data_update.dvc
```

This feature can be useful when teammates are working in different branches and different data was collected. @shcheklein @efiop, I can start the implementation of this feature if you're okay with this view.
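One possible semantics for such a merge, sketched at the level of the `.dir` manifests rather than the .dvc files themselves (the function name and file paths are hypothetical; a real implementation would also need a conflict policy instead of letting the update silently win):

```python
import json

def merge_dir_manifests(base_path, update_path, out_path):
    """Union-merge two .dir manifests (JSON lists of {"md5", "relpath"}
    entries); on relpath collisions, the update manifest wins."""
    with open(base_path) as f:
        merged = {e["relpath"]: e for e in json.load(f)}
    with open(update_path) as f:
        for e in json.load(f):
            merged[e["relpath"]] = e  # update overrides base
    entries = sorted(merged.values(), key=lambda e: e["relpath"])
    with open(out_path, "w") as f:
        json.dump(entries, f)
    return entries
```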
@MetalBlueberry @puhoshville Great ideas! My 2c:
- Usually we just automatically pull `.dir` files when we need them without asking (e.g. line 383 in fcdb503), so a `--dir-only` flag may not be needed.
- No need to remove the old `.dir` file: dvc doesn't ever remove cache files except during `dvc gc`.
- @puhoshville I see that you use …
Hm, if I understand you correctly, you probably mean https://dvc.org/doc/user-guide/merge-conflicts#append-only-directories ?
We could also create a separate command for it:

```
$ dvc append data.dvc file --path data/raw/file1
# appends `file` as `data/raw/file1` into the `data` directory
# (i.e. the `out` of `data.dvc`). It will pull …
```
I like the idea of …

Ideally, a user should not create a separate data dir and mess with dvc files (which might not exist if we decide to use run-cache for data files at some point). We should utilize the existing dir structure and the regular CLI experience. It might look like …

**Navigation**

We might need separate functionality for navigating these virtual/partially-downloaded dirs, like …

**Naming**

Agree with @skshetry: a separate command might be helpful. But I'd not limit it to append; we also need remove, update, and navigate. It might look like …

PS: the navigation and file deletion are not part of this issue, but we should come up with an API that won't block us from these scenarios.
Hey fellows, @shcheklein asked me to chime in on this ticket with my use case. So, here it is. I have created a DVC data registry with the intention of updating it automatically whenever one of my edge devices decides to push data to it. It looks something like this: …

What I have tried: initially, I thought I could just create a …

The other way I was guided on the Discord channel was to add data with more granularity, instead of just adding a directory as a whole, so that each individual machine can push data to a directory independently. Something like this:

```
Data Registry
> Session
    > Data_A_timestamp.dvc
    > Data_A_timestamp
        ...
    ...
    > Data_C_timestamp.dvc
    > Data_C_timestamp
        ...
```

In this case, there is no monolithic sessions directory (no more …). I have not settled on a definitive answer to this problem. Hopefully, adding my two cents to this issue will help me in finding one.
Hello again! We've developed a system to upload files to a dvc-tracked folder without actually using dvc. It basically performs the steps described in the previous comment. Seeing this post, it looks like there are more people interested in having cron jobs upload data to dvc-tracked repositories from edge devices. Is it worth developing a small CLI app for this purpose? If I find people interested in this, I will be really happy to implement it.
@MetalBlueberry that sounds great. I would love to have that feature in DVC, or to try out your system (if you've open-sourced it). If any help is required in implementing this feature, I'm completely up for it.
@MetalBlueberry you got me intrigued :) Could you share some details, please? Especially regarding the …

@RafayAK thanks for taking your time, btw, and for sharing the use case. We are working right now on a POC for this problem. I can't promise that it will be no-git/no-python, but at least an update won't require pulling the whole dataset. If multiple machines update it simultaneously, one of them will get a Git conflict and will need to run the operation again. Would that be a reasonable workflow for you?
These are the steps: …

I'm planning to use the following packages: go-git and go-cloud/blob.

So the idea is to download the git repo in memory with go-git, then parse the `.dvc/config` to discover the remote configuration, and then establish the connection using go-cloud/blob. All the other steps should be straightforward. The good thing about doing this in Go is that the program can be compiled into a small binary of around 25MB that doesn't require any dependency. Not even Python or Git.
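For readers following along without Go, the config-discovery step looks roughly like this in Python. It is a sketch that assumes the INI-style layout DVC writes to `.dvc/config` (DVC itself reads these files with the configobj library, including the quoted `'remote "name"'` section headers):

```python
# pip install configobj  (the same INI library DVC itself uses)
from configobj import ConfigObj

def remote_url(config_path=".dvc/config"):
    """Discover the default remote's URL from a DVC repo config.

    Assumes the layout DVC writes, e.g.:
        [core]
            remote = myremote
        ['remote "myremote"']
            url = s3://bucket/path
    """
    cfg = ConfigObj(config_path)
    name = cfg["core"]["remote"]          # default remote's name
    return cfg[f'remote "{name}"']["url"] # that remote's URL
```

With the URL in hand, a tool only needs a storage client for the matching scheme (s3, gs, azure, ...) to talk to the remote directly, which is exactly what go-cloud/blob provides on the Go side.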
@MetalBlueberry got it, it makes sense. Clearly it won't be easy to make it part of DVC, but I think the whole community can benefit, and it would be cool to see some parts of DVC written in Go.
@MetalBlueberry very cool way of updating the repo, I like it. One question, though: when you add a new piece of data into your DVC-tracked directory, the hash of the directory should change. How will you retrieve the old hash of the directory? I suppose calculating the hash without the new additions would do the trick, but then at some point wouldn't you also have to update the hash of the tracked directory in Git/GitHub? Also, race conditions could become a problem if the md5 `.dir` file is being changed by multiple machines.

@shcheklein I would love to check out your POC whenever it's ready. As far as resolving Git conflicts goes, I think your solution may work out.

One more thing to add: I was going through the DVC docs and stumbled on "Append-only directories". Do you think that this is a better way than resolving merge conflicts directly?
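On the "how will you retrieve the old hash" question: the old hash lives in the .dvc file itself, which is committed to Git as a small YAML file with an out like `md5: <hash>.dir`. A hedged sketch of reading and rewriting it (assumes a single-out .dvc file; newer DVC versions add fields like `size` and `nfiles`, which are left untouched here):

```python
import yaml  # pip install pyyaml

def bump_dir_hash(dvc_file, new_dir_hash):
    """Repoint a .dvc file at a new directory manifest.

    .dvc files are small YAML files committed to Git; for a tracked
    directory the out looks like `md5: <hash>.dir`, so the "old hash"
    is always recoverable from the file (or from Git history).
    """
    with open(dvc_file) as f:
        meta = yaml.safe_load(f)
    out = meta["outs"][0]          # assumes a single tracked out
    old_hash = out["md5"]          # e.g. "1f9a4e7c0....dir"
    out["md5"] = new_dir_hash + ".dir"
    with open(dvc_file, "w") as f:
        yaml.safe_dump(meta, f, default_flow_style=False)
    return old_hash
```

Two machines rewriting this one line concurrently is precisely what produces the Git conflict @shcheklein described, which is why the retry-on-conflict workflow above makes sense.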
@RafayAK yes! I think it's a good solution for merging append-only dirs! It doesn't solve the problem of pulling the whole data, though.
@RafayAK here is your answer. As promised, I've implemented a simple CLI to upload data to a dvc-tracked folder. The current status is POC, but I would like to get feedback from you ASAP. Also, we can continue the discussion in that repository, because this is not the right place to have a long discussion.
Hello everyone! 👋 I've created a POC of how we could potentially integrate such functionality into DVC (based on all of the discussion here so far). The flow would essentially be: …

I have an initial working version in #4900 that you can start testing. You can also take a look at the PR description, which covers this new concept in much more depth. Looking forward to your thoughts on it! Cheers!
Hi all, just chiming in to say that I would really love it if this was implemented! I'm currently using DVC just for data management, and I'm hoping to avoid the long download before adding e.g. new results files to the model results directory.
Hello everybody! I'd really like to have this (especially the use case described in the OP). What's the status of this issue? It seems there has not been much progress since PR #4900. Are there any news or plans regarding this issue?
@nik123 Thank you for your interest! Unfortunately, there is no active progress right now. We are keeping it in mind while reworking our architecture, but it is pretty clear that this will require serious research and design before implementation can start.
I'd also be interested in this feature. FWIW, I was imagining something like #4657 (comment), but as an option to …
Hello, for us this would be a critical feature: when a new annotated batch of images is done, it is sent to a GitLab CI pipeline that downloads the new data, uses DVC to update the existing dataset, and pushes the new dataset. Without this feature, our pipelines will get slower and slower and require more and more storage as the dataset grows. This is a huge irritant, since it adds a problem to the equation: we now have to worry about data storage for a pipeline that ultimately will not store the data long term (it's just needed during the update job). It might also prevent the pipeline from running entirely if the dataset becomes too big, since storage capacity will be reached... In my opinion, this limitation goes against the main purpose of using DVC, because you have to carry the size of your dataset around. Any update on this?
@AlexandreBrown Hey, thanks for your interest! We are currently working on some prerequisites for that feature (e.g. we need a nicer way to represent this virtual data structure so that we can modify it). I expect this to be available for some early testing in the next month or so, so please stay tuned :)
I too would really appreciate this feature! Otherwise, I'm stuck between two infeasible options: …
I have a dataset that I need to update on a daily basis, but the instance on which DVC runs to update my database has only 100GB of disk space and is short-lived (killed by the end of the day), while my total dataset is around 8TB. I cannot afford to pull 8TB of data every day just to update 3-4GB of this dataset. For this reason, and seeing that this is still impossible to achieve with DVC, I will have to just ditch it.
There's a way upstream in `dvc-data` now. You need to tell dvc how to update the `.dir` tree through a JSON patch file, e.g. `patch.json`:

```json
[
    {"op": "remove", "path": "test/0/00004.png"},
    {"op": "move", "path": "test/1/00003.png", "to": "test/0/00003.png"},
    {"op": "copy", "path": "test/1/00003.png", "to": "test/1/11113.png"},
    {"op": "test", "path": "test/1/00003.png"},
    {"op": "add", "path": "local/path/to/patch.json", "to": "foo"},
    {"op": "modify", "path": "local/path/to/patch.json", "to": "bar"}
]
```

The path for the `add` and `modify` operations is a path in the local filesystem; `to` is the destination path inside the tracked directory. And, using:

```
$ dvc-data update-tree <.dir-short-hash> <json_patch_file>
```

e.g.:

```
$ dvc-data update-tree f23d4 patch.json
object 30a856795d9872289fa45530f40884f9.dir
```

Note that this is a very recent and experimental change: it might get changed or removed without any notice, and the UX might be clunky at the moment. I am sharing it in the hope that someone might find it useful. :) For now, this might need iterative/dvc-data#82 to work properly.
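To make the op semantics concrete, here is a rough Python model of what such a patch does to a directory manifest. This is my reading of the example above, not dvc-data's actual implementation: in particular, `add`/`modify` insert a placeholder here, whereas dvc-data would hash and transfer the local file named by `"path"`:

```python
def apply_patch(entries, ops):
    """Apply patch.json-style ops to a .dir manifest
    (a list of {"md5", "relpath"} entries)."""
    index = {e["relpath"]: dict(e) for e in entries}
    for op in ops:
        kind, path = op["op"], op["path"]
        if kind == "remove":
            del index[path]
        elif kind == "move":
            entry = index.pop(path)
            entry["relpath"] = op["to"]
            index[op["to"]] = entry
        elif kind == "copy":
            entry = dict(index[path])
            entry["relpath"] = op["to"]
            index[op["to"]] = entry
        elif kind == "test":
            assert path in index, f"{path} not in tree"
        elif kind in ("add", "modify"):
            # `path` is a local file; a real implementation would
            # hash it and upload the blob to the remote/cache.
            index[op["to"]] = {"relpath": op["to"],
                               "md5": f"<md5 of local file {path}>"}
    return sorted(index.values(), key=lambda e: e["relpath"])
```

Note that none of these operations requires the dataset's contents, only the manifest, which is why this approach avoids the full pull.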
Thanks @skshetry! Great to know it's finally at least possible.

In the long term (not prioritized yet, so no timeframe available), this workflow should happen automatically, by being able to: …

For example: …
Note that the only reason why dvc-data cannot handle pulling and pushing is … Of course, if we want to go with the granular method, the solution is more complex, and the issue is much larger than just updating a virtual directory.
Pulling/pushing the …
In our case, we solved many of our issues (such as updating without pulling the dataset entirely) by switching from DVC to Iterative's new (alpha) tool, ldb, and we're really happy so far.
Would love a feature like this.
A proposal for this (adapted from https://www.notion.so/iterative/Auto-Manage-Directories-in-DVC-cf0b318c09384e40b4304b9434db3c5f for visibility) is to allow granular add, modify, and remove operations (to be prioritized in that order) on DVC-tracked directories.

Edit: note that this mostly summarizes the discussion above into one doc.

**Granular dataset operations**

DVC should automatically track nested directories internally and manage overlaps with existing paths (similar to Git).

```
$ mkdir data
$ dvc add data
Added data

# Add new file to directory
$ touch data/foo
$ dvc add data/foo
Added data/foo

# Modify a single file in a directory
$ echo foo >> data/foo
$ dvc add data/foo
Added data/foo

# Remove file from directory
$ dvc remove data/foo
Removed data/foo
```

**Virtual directories**

Use granular dataset operations to work with virtual directories, even if the directory's contents aren't available in the workspace. Assume you start with a tracked `data` directory whose contents haven't been pulled into the workspace:

```
# Add new file to empty tracked directory
$ cp newfile data/newfile
$ dvc add data/newfile
Added data/newfile

# Checkout/pull a file and modify it
$ dvc pull data/file1
$ echo newdata >> data/file1
$ dvc add data/file1
Added data/file1

# Stop tracking a file from an empty directory
$ dvc remove data/file2
Removed data/file2
```

**Implementation notes**

For virtual directory operations (where the full directory contents don't exist in the local workspace), it will be necessary to have the contents of the directory's `.dir` metafile. These could be downloaded from the remote as needed automatically by DVC, or users might be required (or have the option) to download these metafiles themselves.
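For instance, resolving `dvc pull data/file1` against the `.dir` metafile is just a manifest lookup. A sketch under the manifest format discussed earlier in the thread (the function is hypothetical; `target` is a path relative to the tracked directory, since that is how `relpath` entries are stored):

```python
import json

def objects_for(manifest_path, target):
    """List the cache object hashes needed to materialize one file or
    subdirectory of a tracked dir, using only the .dir manifest.
    This is why virtual-directory operations only ever need the small
    metafile, never the dataset itself."""
    with open(manifest_path) as f:
        entries = json.load(f)
    prefix = target.rstrip("/") + "/"
    return [e["md5"] for e in entries
            if e["relpath"] == target or e["relpath"].startswith(prefix)]
```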
**Story**

- When I have a dataset with 1M images
- and I want to update it (add one more image file),
- I currently have to download and check out the previous version first,
- which takes a long time.

**Request**

Come up with a set of commands/options and a flow to do this efficiently, w/o downloading the data first.