
New command for updating dependencies hashes. #6386

Closed
naibatsuteki opened this issue Aug 4, 2021 · 5 comments

Comments

@naibatsuteki

Idea

The main idea is to introduce a new command that refreshes dependency hashes in the dvc.lock file without running the pipeline again. This command would be useful when one or more dependencies are modified but the modifications don't affect the results, for example adding a comment or a new function to a utils module.

Problem

Currently, updating dependency hashes is only possible by running the pipeline again. This solution has some drawbacks:

  1. Pipeline execution can be time-consuming.
  2. To reproduce the pipeline it's necessary to download the input data. (This can be problematic when we want to make changes from a new machine.)
  3. Using a module shared by multiple pipelines compounds the previous problems. (A modification in this module causes the need to update many lock files.)

Possible solution

Introduce a new command, dvc refresh, that recomputes hashes.

Interface

usage: dvc refresh [-h] [-q | -v] [-f] [-d <stage> <filename>] target
    

positional arguments:
  target - Limit the command scope to a specific pipeline.

Options

  • -d <stage> <filename> - recompute the hash of a file. If the file is tracked by DVC, ask for confirmation when the file is modified. (Can be used multiple times to specify more targets.)
  • -f, --force - overwrite existing hashes in the dvc.lock file without asking for confirmation.
  • -q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise.
  • -h, --help - print the usage/help message and exit.
  • -v, --verbose - display detailed tracing information.

Behavior

  • Because the command can corrupt state, it can only be used with a specified target.
  • If the command is executed without any -d option, it applies to all dependencies in the pipeline.
  • If the -d option occurs at least once, it applies only to those dependencies.
  • If a file is tracked by DVC, ask before updating its hash:
    • yes - update the file hash and raise an error if the file doesn't exist.
    • no - don't modify the hash.
  • This command doesn't commit changes to the cache.
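
To make the proposal concrete, here is a hypothetical session with the proposed command (the stage name, file names, prompt wording, and output are all made up to illustrate the interface above):

# refresh only the hash of one dependency of the 'prepare' stage
$ dvc refresh -d prepare data/raw.csv dvc.yaml
'data/raw.csv' is tracked by DVC and has been modified.
Update its hash in dvc.lock? [y/n] y
Updated hash of 'data/raw.csv' in dvc.lock.

# refresh the hashes of all dependencies in the pipeline, without prompting
$ dvc refresh -f dvc.yaml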

Benefits

  • Pipeline dependencies can be updated without running the pipeline again.
  • Downloading massive data is no longer needed to update pipeline dependencies.
  • Exporting code to shared modules will be easier.
  • Hashes can be updated with surgical precision.

Drawbacks

  • The command can corrupt the tracking state.

Final Notes

I would greatly appreciate your feedback on this idea.

@pmrowla
Contributor

pmrowla commented Aug 5, 2021

If I'm understanding this correctly, this is already possible using dvc commit. If you have modifications to dependencies or outputs in your local workspace, dvc commit will commit the current state of those files from your workspace into dvc.lock, without the need to run dvc repro.
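
For reference, that workflow looks roughly like this (a sketch with made-up file names; dvc commit records the current workspace state in dvc.lock without re-executing the stage):

# edit a shared dependency in a way that does not change the stage outputs
$ vim src/utils.py
# record the new dependency hash into dvc.lock without running 'dvc repro'
$ dvc commit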

@pmrowla pmrowla added the awaiting response we are waiting for your reply, please respond! :) label Aug 5, 2021
@naibatsuteki
Author

naibatsuteki commented Aug 5, 2021

You are absolutely right. Until now, I solved it this way, but as new datasets came in, it started to take too much time.
In my previous comment, I did not correctly capture the nature of the problem; the example below should make it clearer.

(When I touched this idea on discord, I mentioned dvc commit, and here I forgot about it. 😞 )

Problem instance

This is my project organization structure. (It's an artificial example that reflects the structure of my real project; the names of the datasets and their number are different than in reality.)

  • data - Standardized dataset destination directory
  • preparation/<dataset_name> - The directory that contains all components needed to process the dataset. (Without shared parts)
  • preparation/utils - Code shared during processing datasets
.
├── data
└── preparation
    ├── dataset_1
    │   ├── dataset_1.dvc
    │   ├── dataset_1_to_standardized_structure.py
    │   ├── dvc.lock
    │   ├── dvc.yaml
    │   └── requirements.txt
    ├── dataset_2
    │   ├── dataset_2.dvc
    │   ├── dataset_2_to_standardized_structure.py
    │   ├── dvc.lock
    │   ├── dvc.yaml
    │   └── requirements.txt
    ├── dataset_3
    │   ├── dataset_3.dvc
    │   ├── dataset_3_to_standardized_structure.py
    │   ├── dvc.lock
    │   ├── dvc.yaml
    │   └── requirements.txt
    ├── __init__.py
    └── utils
        ├── base_dataset_processor.py
        ├── __init__.py
        └── misc.py

Pipeline example

stages:
  processing_raw_dataset:
    vars:
      - input_path: dataset_1
      - output_path: ../../data/dataset_1
      - script_path: dataset_1_to_standardized_structure.py
      - processing_utils_path: ../utils
    cmd: python ${script_path} ${input_path} ${output_path}
    deps:
      - ${input_path}
      - ${script_path}
      - ${processing_utils_path}
    outs:
      - ${output_path}

In my use case I'm creating a data registry that transforms raw datasets into a standardized format. To keep my code DRY I created a processing template, base_dataset_processor.py, and moved the shared code to misc.py.

These files affect the results produced by the pipelines, so they should be added as dependencies of the pipelines. And this is the source of the issue:
when something changes in these files, I have to refresh the dependencies in every dvc.lock.

Of course I can use dvc commit to update dvc.lock, but as far as I know dvc commit requires all the files to be present on disk.
This solution stops working when the volume of the datasets exceeds the capacity of the disk or of the network connection. You can still do it, but it takes a lot of time.
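
Concretely, with dvc commit the update loop after a change in preparation/utils looks roughly like this (a sketch, repeated for every dataset directory in the tree above):

$ cd preparation/dataset_1
$ dvc pull      # download the dataset just to have the files on disk
$ dvc commit    # refresh the dependency hashes in dvc.lock
$ cd ../dataset_2
# ...and so on for every dataset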

My idea is to introduce a command that doesn't require all files to be present on disk. This still forces you to manually update all the lock files, but you no longer need the massive data files, the updating process will be a lot faster than it is currently, and you are not obligated to commit changes to the local cache. (The whole process should take minutes instead of hours.)

Summary

  • If the modification doesn't affect the outputs:
    • dvc commit is appropriate to update dvc.lock for the dataset you are currently working on.
    • dvc commit is not enough when you want to update dvc.lock for all datasets.
  • If a modification in shared code affects the outputs, you must run the pipeline again.

A similar problem will occur in every project where some of the code is moved into a module used in multiple places.

@pmrowla
Contributor

pmrowla commented Aug 6, 2021

Thanks for the explanation.

I think this is really asking for a combination of two existing feature requests:

I'll leave this ticket open for now in case there is any further discussion, but unless there's some additional feature that I'm missing, this can be closed as a duplicate in the future.

@naibatsuteki
Author

Thanks for linking these issues. I didn't see them when I opened this one.

The problem presented in this issue will be solved if the functionality provided as a result of #4657 allows updating dependencies without downloading any data and works when only dependencies are modified.

@pmrowla
Contributor

pmrowla commented Aug 9, 2021

Closing in favor of the existing discussions

@pmrowla pmrowla closed this as completed Aug 9, 2021
@pmrowla pmrowla removed the awaiting response we are waiting for your reply, please respond! :) label Aug 9, 2021