standard way to refer to DVC-files #422

Merged
merged 7 commits into from
Jun 12, 2019
2 changes: 1 addition & 1 deletion static/docs/changelog/0.35.md
@@ -47,7 +47,7 @@ improvements) we have done in the last few months:
- **SSH remotes (data storage) support** - config options to set port, key
files, timeouts, password, etc + improved stability and Windows support!
Introduced **HTTP remotes** - external dependencies and as a read-only cache.
- **Control over where DVC files are located in your project** - place them
- **Control over where DVC-files are located in your project** - place them
wherever you want with the `-f` option supported by all relevant commands -
`dvc add`, `dvc run`, and `dvc import`.
- 🙂A lot of **UI improvements** . Starting from the finally fixed nasty issue
60 changes: 32 additions & 28 deletions static/docs/commands-reference/add.md
@@ -1,6 +1,7 @@
# add

Take a data file or a directory under DVC control.
Take a data file or a directory under DVC control (by creating a corresponding
DVC-file).

## Synopsis

@@ -25,7 +26,7 @@ Under the hood a few actions are taken for each file in the target(s):
1. Calculate the file checksum.
2. Move the file content to the DVC cache (default location is `.dvc/cache`).
3. Replace the file by a link to the file in the cache (see details below).
4. Create a corresponding DVC file (metafile `.dvc`) and store the checksum to
4. Create a corresponding DVC-file (`.dvc` extension) and store the checksum to
identify the cache entry.
5. Add the _target_ filename to `.gitignore` (if Git is used in this workspace)
to prevent it from being committed to the Git repository.
@@ -35,15 +36,19 @@ Under the hood a few actions are taken for each file in the target(s):
the repository.
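
Steps 1 and 2 above can be sketched in Python. This is a minimal illustration, not DVC's actual implementation: it assumes the checksum is a plain MD5 of the file bytes, and that the cache splits the checksum into a two-character directory plus the remaining characters (the layout suggested by the `.dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b` path in the example further down). The function names `file_md5` and `cache_entry_path` are hypothetical.

```python
import hashlib
import posixpath


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read fixed-size chunks so large data files are never loaded whole.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_entry_path(checksum, cache_dir=".dvc/cache"):
    """Map a checksum to an assumed cache location: <cache_dir>/<2 chars>/<rest>."""
    return posixpath.join(cache_dir, checksum[:2], checksum[2:])
```

Under these assumptions, `cache_entry_path("d8acabbfd4ee51c95da5d7628c7ef74b")` gives the cache path that the `data.xml` example inspects with `file`.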

The result is data file is added to the DVC cache, and DVC metafiles (`.dvc`)
can be tracked via Git or other version control system. The stage file
(metafile) lists the added file as an `out` (output) of the stage, and
references the DVC cache entry using the checksum. See
[DVC File Format](/doc/user-guide/dvc-file-format) for the detailed description
can be tracked via Git or other version control system. The DVC-file (metafile)
lists the added file as an `out` (output) of the DVC-file, and references the
DVC cache entry using the checksum. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for the detailed description
of the DVC _metafile_ format.

> Note that DVC-files created by this command are _orphans_: they have no
> dependencies. _Orphaned_ "stage files" are always considered _changed_ by
> `dvc repro`, which always executes them.

By default DVC tries to use reflinks (see
[File link types](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
to avoid copying any file contents and to optimize DVC file operations for large
to avoid copying any file contents and to optimize DVC-file operations for large
files. DVC also supports other link types for use on file systems without
`reflink` support, but they have to be specified manually. Refer to the
`cache.type` config option in `dvc config cache` for more information.
@@ -55,10 +60,10 @@ to work with directory hierarchies with `dvc add`.
added individually as described above. This means every file has its own
`.dvc` file, and a corresponding DVC cache entry is made (unless
`--no-commit` flag is added).
2. When not using `--recursive` a DVC stage file is created for the top of the
2. When not using `--recursive` a DVC-file is created for the top of the
directory (`dirname.dvc`), and every file in the hierarchy is added to the
DVC cache (unless `--no-commit` flag is added), but these files do not have
individual DVC files. Instead the DVC file for the directory has a
individual DVC-files. Instead the DVC-file for the directory has a
corresponding file in the DVC cache containing references to the files in the
directory hierarchy.
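
The difference between the two modes can be illustrated with a small Python sketch that only lists which DVC-files each mode is described as producing. It does not call DVC; the helper name `expected_dvc_files` is hypothetical, and placing each per-file DVC-file next to its data file is an assumption made for the illustration.

```python
import posixpath


def expected_dvc_files(directory, files, recursive=False):
    """List the DVC-files `dvc add` is described as creating for a directory.

    `files` are paths relative to `directory`.
    """
    if recursive:
        # With --recursive: one DVC-file per data file (location is assumed).
        return sorted(posixpath.join(directory, f) + ".dvc" for f in files)
    # Without --recursive: a single DVC-file for the top of the directory.
    return [directory.rstrip("/") + ".dvc"]


images = ["train/cats/cat.150.jpg", "train/cats/cat.130.jpg"]
print(expected_dvc_files("pics", images))
print(expected_dvc_files("pics", images, recursive=True))
```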

@@ -72,9 +77,9 @@ This way you bring data provenance and make your project reproducible.
## Options

- `-R`, `--recursive` - recursively add each file under the named directory. For
each file a new DVC file is created using the process described earlier.
each file a new DVC-file is created using the process described earlier.

- `--no-commit` - do not put files/directories into cache. A stage file is
- `--no-commit` - do not put files/directories into cache. A DVC-file is
created, and an entry is added to `.dvc/state`, while nothing is added to the
cache (`.dvc/cache`). The `dvc status` command will mention that the file is
`not in cache`. The `dvc commit` command will add the file to the DVC cache.
@@ -87,7 +92,7 @@ This way you bring data provenance and make your project reproducible.

- `-v`, `--verbose` - displays detailed tracing information.

- `-f`, `--file` - specify name of the DVC file it generates. It should be
- `-f`, `--file` - specify name of the DVC-file it generates. It should be
either `Dvcfile` or have a `.dvc` file extension (e.g. `data.dvc`) in order
for `dvc` to be able to find it later.
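
The naming rule for `-f` can be written as a short check. This is a sketch of the rule as stated in the option description, not of DVC's own lookup code; `is_discoverable_dvc_file` is a hypothetical name.

```python
import posixpath


def is_discoverable_dvc_file(name):
    """Return True if `name` is `Dvcfile` or has a `.dvc` file extension."""
    base = posixpath.basename(name)
    # splitext treats a bare ".dvc" as a name without extension, so the
    # `.dvc` directory itself is not mistaken for a DVC-file.
    return base == "Dvcfile" or posixpath.splitext(base)[1] == ".dvc"
```

For example, `data.dvc` and `Dvcfile` pass the check, while a name such as `data.yaml` does not.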

@@ -107,7 +112,7 @@ To track the changes with git run:
git add .gitignore data.xml.dvc
```

As the output says, stage file have been created for the file. Let us explore
As the output says, a DVC-file has been created for `data.xml`. Let us explore
the result:

```dvc
@@ -131,13 +136,12 @@ meta: #key to contain arbitary user data
email: [email protected]
```

This is a standard DVC stage file with only an `outs` entry. The checksum should
This is a standard DVC-file with only an `outs` entry. The checksum should
correspond to an entry in the cache.

If user overwrites the `.dvc` file, comments and meta values are not preserved
between multiple executions of `dvc add` command.


```dvc
$ file .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b

@@ -183,9 +187,9 @@ To track the changes with git run:
git add pics.dvc
```

There are no DVC files generated within this directory structure, but the images
There are no DVC-files generated within this directory structure, but the images
are all added to the DVC cache. DVC prints a message to that effect, saying that
`md5` values are computed for each directory. A DVC file is generated for the
`md5` values are computed for each directory. A DVC-file is generated for the
top-level directory, and it contains this:

```yaml
@@ -201,31 +205,31 @@ wdir: .
If instead you use the `--recursive` option, the output looks as so:

```dvc
$ dvc add --recursive pix
$ dvc add --recursive pics

Saving 'pix/train/cats/cat.150.jpg' to cache '.dvc/cache'.
Saving 'pix/train/cats/cat.130.jpg' to cache '.dvc/cache'.
Saving 'pix/train/cats/cat.111.jpg' to cache '.dvc/cache'.
Saving 'pix/train/cats/cat.438.jpg' to cache '.dvc/cache'.
Saving 'pics/train/cats/cat.150.jpg' to cache '.dvc/cache'.
Saving 'pics/train/cats/cat.130.jpg' to cache '.dvc/cache'.
Saving 'pics/train/cats/cat.111.jpg' to cache '.dvc/cache'.
Saving 'pics/train/cats/cat.438.jpg' to cache '.dvc/cache'.
...
```

In this case a DVC file corresponding to each file is generated, and no
top-level DVC file is generated. But this is less convenient.
In this case a DVC-file corresponding to each file is generated, and no
top-level DVC-file is generated. But this is less convenient.

With the `dvc add pics` a single DVC file is generated, `pics.dvc`, which lets
With the `dvc add pics` a single DVC-file is generated, `pics.dvc`, which lets
us treat the entire directory structure in one unit. It lets you pass the whole
directory tree as input to a `dvc run` stage like so:
directory tree as a dependency to a `dvc run` stage like so:

```dvc
$ dvc run -f train.dvc \
-d train.py -d data \
-d train.py -d pics \
-M metrics.json -o model.h5 \
python train.py
```

To see this whole example go to
[Example: Versioning](/doc/get-started/example-versioning).

Since no top-level DVC file is generated with the `--recursive` option we cannot
Since no top-level DVC-file is generated with the `--recursive` option we cannot
use the directory structure as a whole.
63 changes: 33 additions & 30 deletions static/docs/commands-reference/checkout.md
@@ -10,7 +10,7 @@ usage: dvc checkout [-h] [-q | -v]
[targets [targets ...]]

positional arguments:
targets DVC files.
targets DVC-files.
```

## Description
@@ -20,12 +20,12 @@ directory is to be used, using the checksum saved in the `outs` fields. The
`dvc checkout` command updates the workspace data to match with the cache files
corresponding to those checksums.

Using an SCM like Git, the DVC files are kept under version control. At a given
branch or tag of the SCM workspace, the DVC files will contain checksums for the
Using an SCM like Git, the DVC-files are kept under version control. At a given
branch or tag of the SCM workspace, the DVC-files will contain checksums for the
corresponding data files kept in the DVC cache. After an SCM command like
`git checkout` is run, the DVC files will change to the state at the specified
`git checkout` is run, the DVC-files will change to the state at the specified
branch or commit or tag. Afterwards, the `dvc checkout` command is required in
order to synchronize the data files with the currently checked out DVC files.
order to synchronize the data files with the currently checked out DVC-files.

This command must be executed after `git checkout` since Git does not handle
files that are under DVC control. For convenience a Git hook is available,
@@ -34,11 +34,12 @@ simply by running `dvc install`, that will automate running `dvc checkout` after

The execution of `dvc checkout` does:

- Scan the `outs` entries in DVC files to compare with the currently checked out
data files. The scanned DVC files is limited by the listed targets (if any) on
the command line. And if the `--with-deps` option is specified, it scans
backward in the pipeline from the named targets.
- For any data files where the checksum does not match with the DVC file entry,
- Scan the `outs` entries in DVC-files to compare with the currently checked out
data files. The scanned DVC-files is limited by the listed `targets` (if any)
on the command line. And if the `--with-deps` option is specified, it scans
backward in the [pipeline](https://dvc.org/doc/get-started/pipeline) from the
named targets.
- For any data files where the checksum does not match with the DVC-file entry,
the data file is restored from the cache. The link strategy used (`reflink`,
`hardlink`, `symlink`, or `copy`) depends on the OS and the configured value
for `cache.type` – See `dvc config cache`.
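
The comparison step in the list above can be sketched as follows. This is an illustration only, not DVC's code: `outs_to_restore` is a hypothetical helper, the `outs` list mimics the `md5`/`path` entries of a DVC-file, and the workspace checksums are supplied by hand rather than computed. The two checksums are the `model.pkl` values from the example later in this file.

```python
def outs_to_restore(outs, workspace_checksums):
    """Return paths whose workspace checksum differs from the DVC-file entry.

    `workspace_checksums` maps a path to the checksum of the file currently in
    the workspace; a missing path means the file does not exist there.
    """
    return [
        out["path"]
        for out in outs
        if workspace_checksums.get(out["path"]) != out["md5"]
    ]


outs = [{"md5": "a66489653d1b6a8ba989799367b32c43", "path": "model.pkl"}]
workspace = {"model.pkl": "3863d0e317dee0a55c4e59d2ec0eef33"}
print(outs_to_restore(outs, workspace))  # ['model.pkl']
```

Every path returned here is a file that would then be restored from the cache.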
@@ -70,17 +71,19 @@ checked out without error will be restored.
There are two methods to restore a file missing from the cache, depending on the
situation. In some cases the pipeline must be rerun using the `dvc repro`
command. In other cases the cache can be pulled from a remote cache using the
`dvc pull` command.
`dvc pull` command. See also `dvc pipeline`

## Options

- `-d`, `--with-deps` - determines the files to download by searching backwards
in the pipeline from the named stage(s). The only files which will be updated
are associated with the named stage, and the stages which execute earlier in
the pipeline.
- `-d`, `--with-deps` - determine workspace files to update by tracking
dependencies to the named target DVC-file(s). This option only has effect when
one or more `targets` are specified. By traversing all stage dependencies, DVC
searches backward through the pipeline from the named target(s). This means
DVC will not checkout files referenced later in the pipeline than the named
target(s).

- `-f`, `--force` - does not prompt when removing workspace files. Changing the
current set of DVC files with SCM commands like `git checkout` can result in
current set of DVC-files with SCM commands like `git checkout` can result in
the need for DVC to remove files which should not exist in the current state
and are missing in the local cache (they are not committed in DVC terms). This
option controls whether the user will be asked to confirm these files removal.
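
The backward search described for `--with-deps` amounts to a traversal of the stage dependency graph. The sketch below is a hedged illustration rather than DVC's implementation: `stages_with_deps` is a hypothetical function, the graph is a hand-written dictionary, and `evaluate.dvc` is an invented stage added only to show that later stages are left out.

```python
def stages_with_deps(target, depends_on):
    """Collect `target` and every stage it depends on, directly or indirectly.

    `depends_on` maps a DVC-file name to the DVC-files whose outputs it uses.
    """
    seen = []
    stack = [target]
    while stack:
        stage = stack.pop()
        if stage in seen:
            continue
        seen.append(stage)
        # Walk backward: queue the stages this one depends on.
        stack.extend(depends_on.get(stage, []))
    return seen


# Hypothetical pipeline: featurize.dvc -> train.dvc -> evaluate.dvc
pipeline = {
    "train.dvc": ["featurize.dvc"],
    "evaluate.dvc": ["train.dvc"],
}
print(stages_with_deps("train.dvc", pipeline))  # ['train.dvc', 'featurize.dvc']
```

With `train.dvc` as the target, the later `evaluate.dvc` stage is not visited, which matches the description that files referenced later in the pipeline are not checked out.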
@@ -97,9 +100,10 @@ command. In other cases the cache can be pulled from a remote cache using the

## Examples

To explore `dvc checkout` let's consider a simple workspace with several stages,
and a few Git tags. Then with `git checkout` and `dvc checkout` we can see what
happens as we shift from tag to tag.
To explore `dvc checkout` let's consider a simple workspace with several
[stages](/doc/commands-reference/run), and a few Git tags. Then with
`git checkout` and `dvc checkout` we can see what happens as we shift from tag
to tag.

<details>

@@ -183,25 +187,24 @@ Note: checking out 'baseline'.
HEAD is now at 40cc182...
```

Let's check the `model.pkl` and `train.dvc` files again:
Let's check the `model.pkl` entry in `train.dvc` again:

```yaml
outs:
md5: a66489653d1b6a8ba989799367b32c43
path: model.pkl
```

but if you check the `model.pkl` the file is still the same:
But if you check `model.pkl`, the file hash is still the same:

```dvc
$ md5 model.pkl
MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
```

What's happened is that `git checkout` changed `featurize.dvc`, `train.dvc`, and
other DVC files. But it did nothing with the `model.pkl` and `matrix.pkl` files.
Git does not manage those files. Instead DVC manages those files, and we must
therefore do this:
This is because `git checkout` changed `featurize.dvc`, `train.dvc`, and other
DVC-files. But it did nothing with the `model.pkl` and `matrix.pkl` files. Git
does not manage those files, DVC does, and we must therefore do this:

```dvc
$ dvc fetch
@@ -210,11 +213,11 @@ $ md5 model.pkl
MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43
```

What's happened is that DVC went through the sole existing DVC stage file and
adjusted the current set of files to match the `outs` of that stage. `dvc fetch`
command runs once to download missing data from the remote storage to the local
cache. Alternatively, we could have just run `dvc pull` in this case to
automatically do `dvc fetch` + `dvc checkout`.
What happened is that DVC went through the sole existing DVC-file and adjusted
the current set of files to match the `outs` of that stage. `dvc fetch` command
runs once to download missing data from the remote storage to the local cache.
Alternatively, we could have just run `dvc pull` in this case to automatically
do `dvc fetch` + `dvc checkout`.

## Automating `dvc checkout`
