
New command for updating dependencies hashes. #6386

Closed
naibatsuteki opened this issue Aug 4, 2021 · 5 comments

Comments

@naibatsuteki

Idea

The main idea is to introduce a new command that refreshes dependency hashes in the dvc.lock file without running the pipeline again. This command would be useful when one or more dependencies are modified but the modifications don't affect the results, for example adding a comment or a new function to a utils module.

Problem

Currently, updating dependency hashes is only possible by running the pipeline again. This solution has some drawbacks:

  1. Pipeline execution can be time-consuming.
  2. To reproduce the pipeline it's necessary to download the input data. (This can be problematic when we want to make changes from a new machine.)
  3. Using a module shared by multiple pipelines compounds the previous problems. (A modification in this module causes the need to update many lock files.)

Possible solution

Introduce a new command, dvc refresh, that recomputes hashes.

Interface

usage: dvc refresh [-h] [-q | -v] [-f] [-d <stage> <filename>] target
    

positional arguments:
  target - Limit the command scope to a specific pipeline.

Options

  • -d <stage> <filename> - recompute the hash of a file. If the file is tracked by DVC, ask for confirmation when the file is modified. (Can be used multiple times to specify more targets.)
  • -f, --force - overwrite existing hashes in the dvc.lock file without asking for confirmation.
  • -q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise.
  • -h, --help - print the usage/help message and exit.
  • -v, --verbose - display detailed tracing information.

Behavior

  • Because the command can corrupt state, it can only be used with a specified target.
  • If the command is executed without any -d option, it applies to all dependencies in the pipeline.
  • If the -d option occurs at least once, it applies only to those dependencies.
  • If a file is tracked by DVC, ask before updating its hash:
    • yes - update the file hash and raise an error if the file doesn't exist.
    • no - don't modify the hash.
  • This command doesn't commit changes to the cache.
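
To make the proposal concrete, here is a hypothetical session with the proposed command (the stage name, file names, prompt wording, and output are all made up to illustrate the interface above):

# refresh only the hash of one dependency of the 'prepare' stage
$ dvc refresh -d prepare data/raw.csv dvc.yaml
'data/raw.csv' is tracked by DVC and has been modified.
Update its hash in dvc.lock? [y/n] y
Updated hash of 'data/raw.csv' in dvc.lock.

# refresh the hashes of all dependencies in the pipeline, without prompting
$ dvc refresh -f dvc.yaml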

Benefits

  • Pipeline dependencies can be updated without running the pipeline again.
  • Downloading massive data is no longer needed to update pipeline dependencies.
  • Exporting code to shared modules will be easier.
  • Hashes can be updated with surgical precision.

Drawbacks

  • The command can corrupt the tracking state.

Final Notes

I would greatly appreciate your feedback on this idea.

@pmrowla
Contributor

pmrowla commented Aug 5, 2021

If I'm understanding this correctly, this is already possible using dvc commit. If you have modifications to dependencies or outputs in your local workspace, dvc commit will commit the current state of those files from your workspace into dvc.lock, without the need to run dvc repro.
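
For reference, that workflow looks roughly like this (a sketch with made-up file names; dvc commit records the current workspace state in dvc.lock without re-executing the stage):

# edit a shared dependency in a way that does not change the stage outputs
$ vim src/utils.py
# record the new dependency hash into dvc.lock without running 'dvc repro'
$ dvc commit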

@pmrowla pmrowla added the awaiting response we are waiting for your reply, please respond! :) label Aug 5, 2021
@naibatsuteki
Author

naibatsuteki commented Aug 5, 2021

You are absolutely right. Until now, I solved it this way, but as new datasets came in, it started to take too much time.
In my previous comment, I did not correctly capture the nature of the problem; the example below should make it clearer.

(When I touched this idea on discord, I mentioned dvc commit, and here I forgot about it. 😞 )

Problem instance

This is my project organization structure. (It's an artificial example that reflects the structure of my real project; the names of the datasets and their number are different than in reality.)

  • data - Standardized dataset destination directory
  • preparation/<dataset_name> - The directory that contains all components needed to process the dataset. (Without shared parts)
  • preparation/utils - Code shared during processing datasets
.
├── data
└── preparation
    ├── dataset_1
    │   ├── dataset_1.dvc
    │   ├── dataset_1_to_standardized_structure.py
    │   ├── dvc.lock
    │   ├── dvc.yaml
    │   └── requirements.txt
    ├── dataset_2
    │   ├── dataset_2.dvc
    │   ├── dataset_2_to_standardized_structure.py
    │   ├── dvc.lock
    │   ├── dvc.yaml
    │   └── requirements.txt
    ├── dataset_3
    │   ├── dataset_3.dvc
    │   ├── dataset_3_to_standardized_structure.py
    │   ├── dvc.lock
    │   ├── dvc.yaml
    │   └── requirements.txt
    ├── __init__.py
    └── utils
        ├── base_dataset_processor.py
        ├── __init__.py
        └── misc.py

Pipeline example

stages:
  processing_raw_dataset:
    vars:
      - input_path: dataset_1
      - output_path: ../../data/dataset_1
      - script_path: dataset_1_to_standardized_structure.py
      - processing_utils_path: ../utils
    cmd: python ${script_path} ${input_path} ${output_path}
    deps:
      - ${input_path}
      - ${script_path}
      - ${processing_utils_path}
    outs:
      - ${output_path}

In my use case I'm creating a data registry that transforms raw datasets into a standardized format. To keep my code DRY I created a processing template, base_dataset_processor.py, and moved the shared code to misc.py.

These files affect the results produced by the pipelines, so they should be added as dependencies of the pipelines. And this is the source of the issue:
when something changes in these files, I have to refresh the dependencies in every dvc.lock.

Of course I can use dvc commit to update dvc.lock, but as far as I know dvc commit requires all the files to be present on disk.
This solution stops working when the volume of the datasets exceeds the capacity of the disk or of the network connection. You can still do it, but it takes a lot of time.
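
Concretely, with dvc commit the update loop after a change in preparation/utils looks roughly like this (a sketch, repeated for every dataset directory in the tree above):

$ cd preparation/dataset_1
$ dvc pull      # download the dataset just to have the files on disk
$ dvc commit    # refresh the dependency hashes in dvc.lock
$ cd ../dataset_2
# ...and so on for every dataset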

My idea is to introduce a command that doesn't require all files to be present on disk. This still forces you to manually update all the lock files, but you no longer need the massive data files, the updating process will be a lot faster than it is currently, and you are not obligated to commit changes to the local cache. (The whole process should take minutes instead of hours.)

Summary

  • If the modification doesn't affect the outputs:
    • dvc commit is appropriate to update dvc.lock for the dataset you are currently working on.
    • dvc commit is not enough when you want to update dvc.lock for all datasets.
  • If a modification in shared code affects the outputs, you must run the pipeline again.

A similar problem will occur in every project where some of the code is moved into a module used in multiple places.

@pmrowla
Contributor

pmrowla commented Aug 6, 2021

Thanks for the explanation.

I think this is really asking for a combination of two existing feature requests:

I'll leave this ticket open for now in case there is any further discussion, but unless there's some additional feature that I'm missing, this can be closed as a duplicate in the future.

@naibatsuteki
Author

Thanks for linking these issues. I didn't see them when I opened this one.

The problem presented in this issue will be solved if the functionality provided as a result of #4657 allows updating dependencies without downloading any data and works when only dependencies are modified.

@pmrowla
Contributor

pmrowla commented Aug 9, 2021

Closing in favor of the existing discussions

@pmrowla pmrowla closed this as completed Aug 9, 2021
@pmrowla pmrowla removed the awaiting response we are waiting for your reply, please respond! :) label Aug 9, 2021