
Commit

Another CR pass - redirects fixes and some copy editing
omesser committed Apr 12, 2023
1 parent 120df9e commit c88fe8a
Showing 4 changed files with 72 additions and 94 deletions.
118 changes: 48 additions & 70 deletions content/docs/start/experiments/experiment-pipelines.md
@@ -7,35 +7,41 @@ description:

# Get Started: Experiment Pipelines

If you've been following the guide in order, you might have gone through the
chapter about [data pipelines](/doc/start/data-management/data-pipelines)
already. Here, we will use the same functionality as a basis for an
experimentation build system.

Running an <abbr>experiment</abbr> is achieved by executing <abbr>DVC
pipelines</abbr>, and the term refers to the set of trackable changes associated
with this execution. This includes code changes and resulting artifacts like
plots, charts, and models. The various `dvc exp` subcommands allow you to
execute, share, and manage experiments in various ways. Below, we'll build an
experiment pipeline and use `dvc exp run` to execute it, with a few very handy
capabilities like experiment queueing and parametrization.
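
For example, several runs can be queued up with different parameter overrides
and then executed together. This is a sketch: `--queue`, `--set-param` (`-S`),
and `--run-all` are real `dvc exp run` flags, but the parameter name below is a
hypothetical stand-in for one defined in the project's params file:

```cli
$ dvc exp run --queue --set-param train.img_size=96
$ dvc exp run --queue --set-param train.img_size=128
$ dvc exp run --run-all
```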

## Stepping up and out of the notebook

After some time spent in your IPython notebook (e.g.
[Jupyter](https://jupyter-notebook.readthedocs.io/en/latest/)) doing data
exploration and basic modeling, managing your notebook cells may start to feel
fragile, and you may want to structure your project and code for reproducible
execution, testing and further automation. When you are ready to
[migrate from notebooks to scripts](https://towardsdatascience.com/from-jupyter-notebook-to-sc-582978d3c0c),
DVC <abbr>Pipelines</abbr> help you standardize your workflow following software
engineering best practices:

- **Modularization**: Split the different logical steps in your notebook into
separate scripts.

- **Parametrization**: Adapt your scripts to decouple the configuration from the
source code.
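
In DVC projects these tunable values typically live in a params file
(`params.yaml` by default). As a rough, stdlib-only sketch of the decoupling
idea — the file name, defaults, and parameter keys below are hypothetical — a
script reads its configuration instead of hard-coding it:

```python
import json
from pathlib import Path

# Hypothetical defaults -- stand-ins for values you would otherwise
# hard-code inside a notebook cell.
DEFAULTS = {"train": {"img_size": 256, "epochs": 10}}


def load_params(path: str = "params.json") -> dict:
    """Read tunable values from a config file, falling back to defaults."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return DEFAULTS


if __name__ == "__main__":
    params = load_params()
    print(f"training for {params['train']['epochs']} epochs")
```

Once parameters live in a file like this, `dvc stage add -p` can register the
relevant keys as stage dependencies, so changing them invalidates exactly the
stages that use them.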

## Creating the experiment pipeline

In our
[example repo](https://github.com/iterative/example-get-started-experiments), we
first extract data preparation logic from the
[original notebook](https://github.com/iterative/example-get-started-experiments/blob/main/notebooks/TrainSegModel.ipynb)
into
[`data_split.py`](https://github.com/iterative/example-get-started-experiments/blob/main/src/data_split.py).

```python
def data_split():
    ...
```

We now use `dvc stage add` commands to transform our scripts into individual
<abbr>stages</abbr>, starting with a `data_split` stage for
[`data_split.py`](https://github.com/iterative/example-get-started-experiments/blob/main/src/data_split.py):

```cli
$ dvc stage add --name data_split \
    ...
    python src/data_split.py
```

A `dvc.yaml` file is automatically generated with the stage details.

<details>

### Expand to see the created `dvc.yaml`

It includes information about the stage we added, like the executable command
(`python src/data_split.py`), its <abbr>dependencies</abbr>,
<abbr>parameters</abbr>, and <abbr>outputs</abbr>:

```yaml
stages:
  data_split:
    ...
    outs:
      - data/test_data
```

</details>

`dvc exp run` will run all stages in the `dvc.yaml` file:

```cli
$ dvc exp run
'data/pool_data.dvc' didn't change, skipping
Running stage 'data_split':
> python src/data_split.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
...
```

<admon type="info">

Learn more about [Stages](/doc/user-guide/pipelines/defining-pipelines#stages)

</admon>

## Building a DAG

By using `dvc stage add` multiple times and defining <abbr>outputs</abbr> of a
stage as <abbr>dependencies</abbr> of another, you describe a sequence of
commands which forms a [pipeline](/doc/user-guide/pipelines/defining-pipelines),
also called a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

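
Schematically, the chaining looks like this — a hand-written sketch based on
the stage definitions in this chapter, not the project's literal `dvc.yaml`:

```yaml
stages:
  data_split:
    cmd: python src/data_split.py
    outs:
      - data/train_data
  train:
    cmd: python src/train.py
    deps:
      - data/train_data # produced by data_split above
    outs:
      - models/model.pkl
```

Because `data/train_data` appears in one stage's `outs` and another's `deps`,
DVC knows `train` must run after `data_split`.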
Let's create a `train` stage using
[`train.py`](https://github.com/iterative/example-get-started-experiments/blob/main/src/train.py)
to train the model:

```cli
$ dvc stage add -n train \
    -p base,train \
    -d src/train.py -d data/train_data \
    -o models/model.pkl \
    python src/train.py
```

`dvc exp run` checks the `data_split` stage first and then the `train` stage
since it depends on the <abbr>outputs</abbr> of `data_split`. If a stage has not
changed or has been run before with the same <abbr>dependencies</abbr> and
<abbr>parameters</abbr>, it will be
[skipped](/doc/user-guide/pipelines/run-cache):

```cli
$ dvc exp run
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Running stage 'train':
> python src/train.py
...
```
Finally, let's add an `evaluate` stage:

```cli
$ dvc stage add -n evaluate \
    -p base,evaluate \
    -d src/evaluate.py -d models/model.pkl -d data/test_data \
    ...
    python src/evaluate.py
```

<details>

## Visualizing the experiment DAG

As the number of stages grows, the `dvc dag` command becomes handy for
[...]

it by running `dvc exp run` to create and track new experiment runs. This
enables some new features in DVC, like queueing experiments, and a canonical way
to work with parameters and hyperparameters.

</details>

## Modifying parameters

You can modify <abbr>parameters</abbr> from the CLI using
[...]
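
As an illustration — `--set-param` is the real `dvc exp run` flag, while the
parameter name below is a hypothetical stand-in for one defined in the
project's `params.yaml`:

```cli
$ dvc exp run --set-param train.img_size=128
```

This records the override with the experiment run, without editing
`params.yaml` by hand.
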
5 changes: 2 additions & 3 deletions content/docs/start/index.md
@@ -63,9 +63,8 @@ scenarios:
code, and use DVC as a build system for reproducible, data driven pipelines.

- **Experiment Management** - Easily track your experiments and their progress
by only instrumenting your code, and collaborate on ML experiments like
software engineers do for code.

The following chapters are grouped into the two trails above, and each is
pretty self-contained.
@@ -1,17 +1,10 @@
# Discovering and accessing data


Assuming you've learned the basics of how to
[track and version data](/doc/start/data-management/data-versioning) with DVC,
you might wonder: How can we access and use these artifacts _outside_ of the DVC
project? How do we download a model to deploy it? How to download a specific
version of a model? How to reuse datasets across different projects?

<admon type="tip">

[...] instead of the original file name such as `model.pkl` or `data.xml`).

</admon>

<details>

### 🎬 Click to watch a video about sharing data and models

https://youtu.be/EE7Gk84OZY8

</details>

Remember those `.dvc` files `dvc add` generates? Those files (and `dvc.lock`)
have their history in Git. DVC's remote storage config is also saved in Git, and
contains all the information needed to access and download any version of
datasets, files, and models. It means that a Git repository with <abbr>DVC
files</abbr> becomes an entry point, and can be used instead of accessing files
directly.

## Find a file or directory
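
The body of this section is collapsed in the diff above. As a hedged
illustration of the entry-point idea — the repo URL is DVC's public
get-started example, and the output is abridged — `dvc list` can browse a DVC
repository's contents:

```cli
$ dvc list https://github.com/iterative/example-get-started data
data.xml
data.xml.dvc
```

`dvc list` reads the Git repo plus its `.dvc` files, so DVC-tracked artifacts
show up alongside Git-tracked ones.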

6 changes: 3 additions & 3 deletions redirects-list.json
@@ -47,7 +47,7 @@
"^/doc/start/data-and-model-access(/.*)?$ /doc/user-guide/data-management/discovering-and-accessing-data 302",
"^/doc/start/data-pipelines(/.*)?$ /doc/start/data-management/data-pipelines 302",
"^/doc/start/metrics-parameters-plots(/.*)?$ /doc/start/data-management/metrics-parameters-plots 302",
"^/doc/start/experiment-management(/.*)?$ /doc/start/experiments",
"^/doc/tutorial(/.*)?$ /doc/start",
"^/doc/tutorials(/.*)? /doc/start",
"^/doc/tutorials/get-started(/.*)?$ /doc/start",
@@ -98,9 +98,9 @@
"^/doc/command-reference/run$ /doc/command-reference/stage/add",
"^/doc/command-reference/exp/init$ /doc/command-reference/stage/add",

"^/doc/dvclive/dvclive-with-dvc$ /doc/start/experiments",
"^/doc/dvclive/api-reference/$ /doc/dvclive/",
"^/doc/dvclive/get-started$ /doc/start/experiments",

"^/doc/cml(/.*)?$ https://cml.dev/doc$1",

