New command for updating dependencies hashes. #6386
If I'm understanding this correctly, this is already possible using
You are absolutely right. Until now I solved it this way, but as new datasets came in, it started to take too much time. (When I touched on this idea on Discord, I mentioned it.)

Problem instance

This is my project organization structure. (This is an artificial example reflecting the structure of my project; the names and number of the datasets differ from the real ones.)
.
├── data
└── preparation
├── dataset_1
│ ├── dataset_1.dvc
│ ├── dataset_1_to_standardized_structure.py
│ ├── dvc.lock
│ ├── dvc.yaml
│ └── requirements.txt
├── dataset_2
│ ├── dataset_2.dvc
│ ├── dataset_2_to_standardized_structure.py
│ ├── dvc.lock
│ ├── dvc.yaml
│ └── requirements.txt
├── dataset_3
│ ├── dataset_3.dvc
│ ├── dataset_3_to_standardized_structure.py
│ ├── dvc.lock
│ ├── dvc.yaml
│ └── requirements.txt
├── __init__.py
└── utils
├── base_dataset_processor.py
├── __init__.py
└── misc.py
Pipeline example:

```yaml
stages:
  processing_raw_dataset:
    vars:
      - input_path: dataset_1
      - output_path: ../../data/dataset_1
      - script_path: dataset_1_to_standardized_structure.py
      - processing_utils_path: ../utils
    cmd: python ${script_path} ${input_path} ${output_path}
    deps:
      - ${input_path}
      - ${script_path}
      - ${processing_utils_path}
    outs:
      - ${output_path}
```

In my use case I'm creating a data registry that transforms raw datasets into a standardized format. To keep my code DRY I created a processing template. These files affect the results produced by the pipelines, so they should be added as dependencies of the pipelines, and this is the source of the issue. Of course I can use the approach mentioned above. My idea is to introduce a command that doesn't require all files to be present on disk. This still forces you to manually update all files, but you don't need the massive files anymore, the updating process will be a lot faster than it is currently, and you are not obligated to commit changes to the local cache. (The whole process should take minutes instead of hours.)

Summary
A similar problem will occur in every project where some of the code is moved to a module used in multiple places.
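To illustrate why a shared module hurts here: DVC treats a directory dependency such as `../utils` as changed whenever any file inside it changes. A rough sketch of that behaviour (a simplification for illustration, not DVC's exact manifest format) looks like:

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()


def dir_fingerprint(root: Path) -> str:
    """Combined hash over (relative path, content hash) pairs for every file
    under root. Editing any file -- even only adding a comment -- changes it,
    which is what invalidates every pipeline depending on the directory."""
    entries = sorted(
        (str(p.relative_to(root)), file_md5(p))
        for p in root.rglob("*")
        if p.is_file()
    )
    return hashlib.md5(repr(entries).encode()).hexdigest()
```

So a one-line docstring change in `utils/misc.py` marks the dataset_1, dataset_2, and dataset_3 pipelines as stale, even though their outputs are unchanged.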
Thanks for the explanation. I think this is really asking for a combination of two existing feature requests:
I'll leave this ticket open for now in case there is any further discussion, but unless there's some additional feature that I'm missing, this can be closed as a duplicate in the future.
Thanks for linking these issues. I didn't see them when I opened this one. The problem presented in this issue will be solved if the functionality provided as a result of #4657 allows updating dependencies without downloading any data and works when only dependencies are modified.
Closing in favor of the existing discussions |
Idea
The main idea is to introduce a new command allowing the dependency hashes in the dvc.lock file to be refreshed without running the pipeline again. This command will be useful when one or more dependencies are modified and the modifications don't affect the results, for example adding a comment or a new function to a utils module.
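As a minimal sketch of what such a command would do internally (assuming DVC's per-file MD5 hashes; `refresh_hashes` below is a hypothetical helper, not part of DVC's API), it only needs to re-hash the named dependencies and splice the results back into dvc.lock:

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """MD5 of the file contents, the kind of hash DVC records per dependency."""
    h = hashlib.md5()
    with path.open("rb") as f:
        # Stream in chunks so large data files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def refresh_hashes(deps: list[Path]) -> dict[str, str]:
    """Hypothetical helper: a fresh path -> md5 mapping to write back into
    dvc.lock, without re-running the stage that produced the outputs."""
    return {str(p): file_md5(p) for p in deps}
```

Crucially, only the listed dependencies have to be readable on disk; outputs and other datasets could stay in remote storage.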
Problem
Currently, updating dependencies is possible only by running the pipeline once again. This solution has some drawbacks:
Possible solution
Introducing a new command, `dvc refresh`, that updates dvc.lock with recomputed hashes.

Interface
Options
- `-d <stage> <filename>` - recompute the hash of the given file. If the file is tracked by DVC, ask for confirmation when the file is modified. (Can be used multiple times to specify more targets.)
- `-f`, `--force` - overwrite existing hashes in the dvc.lock file without asking for confirmation.
- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no problems arise.
- `-h`, `--help` - print the usage/help message and exit.
- `-v`, `--verbose` - display detailed tracing information.

Behavior
- If no `-d` option is given, apply to all dependencies in the pipeline.
- If the `-d` option occurs at least once, apply only to those dependencies.

Benefits
Drawbacks
Final Notes
I would greatly appreciate your feedback on this idea.