dvc status: import stages take a long time #9304
Labels
A: status
awaiting response
optimize
performance
Bug Report
dvc status: import stages take a very long time
Description
We have a dataset repository, which contains a processed version of OpenImages. Overall, we have 17 archive files of ~2 GB size each, containing training, validation and test data. We also have a model repository which imports these archive files and some additional metadata files.
Example dvc import file test.tar.gz.dvc:
We have 17 of these files (15 for train, 1 for val, 1 for test).
Running dvc status takes about a minute because of these import stages. As the log shows, most of that time is spent repeatedly checking out the data repo.
We want to keep the individual imports so that we can run tests on subsets of the data. If we instead import the entire tar directory, dvc status takes 10 seconds, which I find acceptable.
Example dvc import file tar.dvc:
Running dvc status -c takes a few seconds in either case.
Reproduce
Have one data repo with several large files. Import these files into another repo. Run
dvc status
Expected
I would expect the check to be faster, e.g. by collecting imports from the same repo first and checking out the data repo only once.
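To illustrate the idea, here is a rough sketch of the grouping in Python. It is not DVC's actual code: the entry layout (`url`, `rev`, `path`), the `open_repo` callable, and the `has_changed` method are all hypothetical stand-ins for whatever DVC does internally. The only point is that the expensive clone/checkout step runs once per (url, rev) pair instead of once per import stage.

```python
from collections import defaultdict


def group_imports_by_repo(imports):
    """Group import entries by (url, rev) so each source repo is handled once.

    Each entry is a plain dict with hypothetical keys "url", "rev", "path".
    """
    groups = defaultdict(list)
    for entry in imports:
        groups[(entry["url"], entry["rev"])].append(entry["path"])
    return dict(groups)


def check_imports(imports, open_repo):
    """Return {path: changed}, calling open_repo once per (url, rev) group.

    open_repo is a hypothetical callable standing in for the expensive
    clone/checkout step. It must return an object with has_changed(path).
    """
    results = {}
    for (url, rev), paths in group_imports_by_repo(imports).items():
        repo = open_repo(url, rev)  # expensive step, done once per group
        for path in paths:
            results[path] = repo.has_changed(path)
    return results


class FakeRepo:
    """Stand-in for a checked-out data repo; reports every path as unchanged."""

    def has_changed(self, path):
        return False
```

With 17 imports that all point at the same data repo and revision, `check_imports` would call `open_repo` a single time rather than 17 times.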
Environment information
Output of dvc doctor:
Additional Information (if any):
Output of dvc status
Output of dvc status -c