diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index af1813267c..5d0ae6c040 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -400,7 +400,7 @@ run that stage (`models/model.pkl` `data/test_data/`) while skipping the rest of the stages: ```cli -$ dvc reproduce +$ dvc repro --pull --allow-missing 'data/pool_data.dvc' didn't change, skipping Stage 'data_split' didn't change, skipping Stage 'train' didn't change, skipping diff --git a/content/docs/user-guide/pipelines/running-pipelines.md b/content/docs/user-guide/pipelines/running-pipelines.md index ac644d0e1f..19b5ecc24d 100644 --- a/content/docs/user-guide/pipelines/running-pipelines.md +++ b/content/docs/user-guide/pipelines/running-pipelines.md @@ -86,8 +86,7 @@ DVC will skip that stage: Stage 'prepare' didn't change, skipping ``` -DVC will also recover the outputs from previous runs using the -[run cache](/doc/user-guide/pipelines/run-cache): +DVC will also recover the outputs from previous runs using the [run cache]. ``` Stage 'prepare' is cached - skipping run, checking out outputs @@ -108,6 +107,173 @@ stages: always_changed: true ``` +## Pull Missing Data + +By default, DVC expects that all data to run the pipeline is available locally. +Any missing data will be considered deleted and may cause the pipeline to fail. +`--pull` will download missing dependencies (and will download the cached +outputs of previous runs saved in the [run cache]), so you don't need to pull +all data for your project before running the pipeline. `--allow-missing` will +skip stages with no other changes than missing data. You can combine the +`--pull` and `--allow-missing` flags to run a pipeline while only pulling the +data that is actually needed to run the changed stages. + +Given the pipeline used in +[example-get-started-experiments](https://github.com/iterative/example-get-started-experiments): + +```cli +$ dvc dag + +--------------------+ + | data/pool_data.dvc | + +--------------------+ + * + * + * + +------------+ + | data_split | + +------------+ + ** ** + ** ** + * ** ++-------+ * +| train | ** ++-------+ ** + ** ** + ** ** + * * + +----------+ + | evaluate | + +----------+ +``` + +If we are in a machine where all the data is missing: + +```cli +$ dvc status +Not in cache: + (use "dvc fetch ..." to download files) + models/model.pkl + data/pool_data/ + data/test_data/ + data/train_data/ +``` + +We can modify the `evaluate` stage and DVC will only pull the necessary data to +run that stage (`models/model.pkl` `data/test_data/`) while skipping the rest of +the stages: + +```cli +$ dvc exp run --pull --allow-missing --set-param evaluate.n_samples_to_save=20 +Reproducing experiment 'hefty-tils' +'data/pool_data.dvc' didn't change, skipping +Stage 'data_split' didn't change, skipping +Stage 'train' didn't change, skipping +Running stage 'evaluate': +... +``` + +## Verify Pipeline Status + +In scenarios like CI jobs, you may want to check that the pipeline is up to date +without pulling or running anything. `dvc exp run --dry` will check which +pipeline stages to run without actually running them. However, if data is +missing, `--dry` will fail because DVC does not know whether that data simply +needs to be pulled or is missing for some other reason. To check which stages to +run and ignore any missing data, use `dvc exp run --dry --allow-missing`. + +This command will succeed if nothing has changed: + +
+ +### Clean example + +In the example below, data is missing because nothing has been pulled, but +otherwise the pipeline is up to date. + +```cli +$ dvc status +data_split: + changed deps: + deleted: data/pool_data + changed outs: + not in cache: data/test_data + not in cache: data/train_data +train: + changed deps: + deleted: data/train_data + changed outs: + not in cache: models/model.pkl +evaluate: + changed deps: + deleted: data/test_data + deleted: models/model.pkl +data/pool_data.dvc: + changed outs: + not in cache: data/pool_data +``` + +
+ +```cli +$ dvc exp run --allow-missing --dry +Reproducing experiment 'agley-nuke' +'data/pool_data.dvc' didn't change, skipping +Stage 'data_split' didn't change, skipping +Stage 'train' didn't change, skipping +Stage 'evaluate' didn't change, skipping +``` + +If anything is not up to date, the command will fail: + +
+ +### Dirty example + +In the example below, the `data_split` parameter in `params.yaml` was modified, +so the pipeline is not up to date. + +```cli +$ dvc status +data_split: + changed deps: + deleted: data/pool_data + params.yaml: + modified: data_split + changed outs: + not in cache: data/test_data + not in cache: data/train_data +train: + changed deps: + deleted: data/train_data + changed outs: + not in cache: models/model.pkl +evaluate: + changed deps: + deleted: data/test_data + deleted: models/model.pkl +data/pool_data.dvc: + changed outs: + not in cache: data/pool_data +``` + +
+
+```cli
+$ dvc exp run --allow-missing --dry
+Reproducing experiment 'dozen-jogs'
+'data/pool_data.dvc' didn't change, skipping
+ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '.../example-get-started-experiments/data/pool_data'
+```
+
+To be sure that the missing data is not actually lost, you can also check
+that all of it exists on the remote. The command below succeeds (exit code
+`0`) if all data is found on the remote; otherwise it fails (exit code `1`).
+
+```cli
+$ dvc data status --not-in-remote --json | grep -v not_in_remote
+true
+```
+
 ## Debugging Stages
 
 If you are using advanced features to interpolate values for your pipeline, like
@@ -132,3 +298,4 @@ stage train: {'model': {'batch_size': 512, 'latent_dim': 8,
 
 [templating]: /doc/user-guide/project-structure/pipelines-files#templating
 [hydra composition]: /docs/user-guide/experiment-management/hydra-composition
+[run cache]: /doc/user-guide/pipelines/run-cache
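As an illustration of how these checks fit together in a CI job, here is a minimal sketch chaining the two commands shown above. The chaining and the CI context are assumptions, not documented DVC behavior beyond what is described in this section, and it presumes the project's default remote is configured and reachable from the job:

```cli
# Minimal CI sketch (assumes the default remote is configured for this repo):
# 1. verify the pipeline is up to date, ignoring data that is merely not pulled
# 2. verify that all pipeline data can be found on the remote
$ dvc exp run --allow-missing --dry && \
  dvc data status --not-in-remote --json | grep -v not_in_remote
```

Because both commands signal problems through their exit codes, the `&&` chain makes the job fail if either the pipeline is not up to date or some data is missing from the remote.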