This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

GTO: updating docs (#235)
* draft

* updating text

* add GTO+MLEM page

* move pages under user-guide

* updating dvc and mlem pages

* small enhancements

* fix sidebar

* update 'what's next' section in GS

* fix links

* Apply suggestions from code review

Co-authored-by: Jorge Orpinel <[email protected]>

* fix lint

* Apply suggestions from code review

Co-authored-by: Jorge Orpinel <[email protected]>

* lint

* remove extra details

* Apply suggestions from code review

Co-authored-by: Jorge Orpinel <[email protected]>

* Restyled by prettier-markdown (#253)

Co-authored-by: Restyled.io <[email protected]>

* hide mlem page

Co-authored-by: Jorge Orpinel <[email protected]>
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
4 people authored Dec 8, 2022
1 parent 9c9185d commit e306e13
Showing 8 changed files with 355 additions and 33 deletions.
5 changes: 0 additions & 5 deletions content/docs/gto/dvc.md

This file was deleted.

8 changes: 4 additions & 4 deletions content/docs/gto/get-started.md
@@ -89,7 +89,7 @@ during registrations and promotions. The benefit of these Git-native mechanisms
is that you can act upon GTO operations in any Git-based system downstream, for
example automating model deployments with CI/CD.

[tags]: /doc/gto/user-guide#git-tags-message-format
[tags]: /doc/gto/user-guide#git-tags-format

<details>

@@ -160,9 +160,9 @@ page.
Thanks for completing this Get Started!

- Learn how to
[specify important artifact's metainformation](/doc/gto/user-guide#annotations-in-artifactsyaml)
like `path`, `type` and `description` in the Artifact registry.
- Learn more about [acting in CI/CD](/doc/gto/user-guide#acting-in-ci-cd) upon
[get your artifacts](/doc/gto/user-guide#getting-artifacts-downstream) when
you need them (e.g. get the latest version or the version in specific stage).
- Learn more about [acting in CI/CD](/doc/gto/user-guide#acting-in-cicd) upon
version registrations and stage assignments.
- Reach out to us in [GitHub issues](https://github.com/iterative/gto/issues) or in
[Discord](https://discord.com/invite/dvwXA2N) to get your questions answered!
5 changes: 0 additions & 5 deletions content/docs/gto/mlem.md

This file was deleted.

109 changes: 109 additions & 0 deletions content/docs/gto/user-guide/dvc.md
@@ -0,0 +1,109 @@
# GTO with DVC

Large files are typically not stored in a Git repository, so they need to be
downloaded from external sources. [DVC](https://dvc.org) is a great way to store
your GTO artifact files while keeping a pointer in the repo, and simplifying
[data management] and synchronization.

[data management]: https://dvc.org/doc/user-guide/data-management

<admon icon="book">

If you're new to DVC, [get started here](https://dvc.org/doc/start) first.

</admon>

## Tracking an artifact with DVC

First, we need to start tracking the artifact with DVC. If you produce this artifact
in DVC Pipelines, it's done automatically.

If the artifact is located inside your Git repo, you can use `dvc add`:

```cli
$ dvc add model.pkl
$ git add model.pkl.dvc
```

If the artifact is located in some external storage, we can use `dvc import-url`
to still keep metainformation about it in the repo (use `--no-download` to skip
downloading it):

```cli
$ dvc import-url --no-download s3://container/model.pkl
$ git add model.pkl.dvc
```

## Annotating DVC-tracked artifacts

Once the artifact is tracked with DVC within your repo, we can annotate it with
GTO:

```cli
$ gto annotate model --path model.pkl
```

This will write the following to `artifacts.yaml`:

```yaml
model:
  path: model.pkl
```

Commit the changes to Git in order to `gto register` artifact versions and
`gto assign` them to deployment stages referencing the new commit.

```cli
$ git add artifacts.yaml
$ git commit -m "version artifact binaries with DVC and annotate it with GTO"
```

To share your work, you'll need [remote storage] set up in DVC. You can then
upload the artifact files and the changes to the repo:

[remote storage]: https://dvc.org/doc/command-reference/remote

```cli
$ dvc push
$ git push
```

## Downloading artifacts

To download GTO artifact files tracked with DVC, you can use the `dvc get` or
`dvc import` commands (or simply use `dvc pull` if you `cd` inside the repo).

```cli
$ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $OUTPUT_PATH
```

Check out the [User Guide](/doc/gto/user-guide#getting-artifacts-downstream) to
learn how to find `ARTIFACT_PATH` and `REVISION`.

<details>

### Example: downloading from outside the repo

If you need to download the latest version of `model`, that would be:

```cli
$ REVISION=$(gto show --repo $REPO model@greatest --ref)
$ ARTIFACT_PATH=$(gto describe --repo $REPO model --rev $REVISION --path)
$ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $ARTIFACT_PATH
```

</details>
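
The steps in this example could also be wrapped in a small shell function. This
is only a sketch: `fetch_latest_artifact` is a hypothetical helper name (not
part of GTO or DVC), it assumes both `gto` and `dvc` are installed, and it
reuses only the flags shown above.

```shell
# Hypothetical helper (not part of GTO or DVC): download the greatest version
# of an artifact from a registry repo.
# Usage: fetch_latest_artifact <repo-url-or-path> <artifact-name>
fetch_latest_artifact() (
  repo=$1
  name=$2

  # Resolve the Git tag that points at the greatest registered version.
  revision=$(gto show --repo "$repo" "$name@greatest" --ref) || exit 1

  # Look up the path annotated in artifacts.yaml at that revision.
  artifact_path=$(gto describe --repo "$repo" "$name" --rev "$revision" --path) || exit 1

  # Stop early if either lookup came back empty.
  if [ -z "$revision" ] || [ -z "$artifact_path" ]; then
    echo "could not resolve '$name' in '$repo'" >&2
    exit 1
  fi

  # Download the DVC-tracked file at that revision to the same relative path.
  dvc get "$repo" "$artifact_path" --rev "$revision" -o "$artifact_path"
)
```

Calling `fetch_latest_artifact $REPO model` would then run `dvc get` only if
both lookups succeed. The function body runs in a subshell, so its variables do
not leak into the calling shell.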

<details>

### Example: downloading in CI

If you need to download an artifact from the same repo, that would be a bit
simpler (taking GitHub Actions as an example):

```cli
$ ARTIFACT_PATH=$(gto describe model --rev $GITHUB_REF --path)
$ dvc pull $ARTIFACT_PATH
```

</details>
Original file line number Diff line number Diff line change
@@ -2,11 +2,15 @@

GTO helps you build an Artifact Registry out of your Git repository. It creates
annotated [Git tags](https://git-scm.com/book/en/v2/Git-Basics-Tagging) with a
[special format](#git-tag-message-format) in their message, and manages an
`artifacts.yaml` file. Since committing large files to Git is not a good
practice, you should consider not committing your artifacts to Git. Instead, use
[DVC](https://dvc.org/doc), Git-lfs, or any method commit pointers to the binary
files instead.
[special format](#git-tags-format), and manages an `artifacts.yaml` file.

<admon type="tip">

Storing large files in Git repos is not a good practice. Avoid committing your
ML artifacts to Git. You can [use DVC](/doc/gto/user-guide/dvc), Git LFS, or any
other method to commit pointers to the data, models, etc. instead.

</admon>

## Annotations in artifacts.yaml

@@ -17,19 +21,60 @@ or `dataset`), or any other useful information about them. For simple projects
artifacts (e.g. [in CI/CD](#acting-in-ci-cd)). But for more advanced cases, we
should codify them in the registry itself.

To keep this metainformation, GTO uses `artifacts.yaml` file. Commands like
`gto annotate` and `gto remove` are used to modify it, while `gto describe`
helps get them when they're needed.
To keep this metadata, GTO uses a human-readable `artifacts.yaml` file. The
`gto describe`, `gto annotate`, and `gto remove` commands are used to display
and manage its contents.

<admon type="tip">

An example `artifacts.yaml` can be found
[in the `example-gto` repo](https://github.com/iterative/example-gto).

</admon>

## Getting artifacts in systems downstream

You may need to get a specific artifact version to a certain environment, most
likely the latest one or the one currently assigned to the stage. Use `gto show`
to find the [Git reference] (tag) you need (note that
[CI platforms](#acting-in-ci-cd) may expose it for you, e.g. the `GITHUB_REF`
[env var in GitHub Actions]):

[git reference]: https://git-scm.com/book/en/v2/Git-Internals-Git-References
[env var in github actions]:
https://docs.github.com/en/actions/learn-github-actions/environment-variables

If you would like to see an example of `artifacts.yaml`, check out the
[example-gto](https://github.com/iterative/example-gto/blob/main/artifacts.yaml)
repo.
<admon type="tip">

GTO doesn't provide a way to deliver the artifacts, but you can [use DVC] or
[employ MLEM] for that.

[use dvc]: /doc/gto/user-guide/dvc
[employ mlem]: https://mlem.ai

</admon>

```cli
$ gto show churn@latest --ref
[email protected]
$ gto show churn#prod --ref # by assigned stage
[email protected]
```

You may need the artifact's file path. If
[annotated](#annotations-in-artifactsyaml), it can be discovered with
`gto describe`:

```cli
$ gto describe churn --rev [email protected] --path
models/churn.pkl
```
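
In CI, the ref that triggered the job may need to be split into an artifact name
and a version before calling `gto describe`. The sketch below is one way to do
that in plain shell. It assumes version tags have the form `<name>@<version>`
and that the CI platform exposes them as `refs/tags/<tag>` (as GitHub Actions
does in `GITHUB_REF`); `parse_version_ref` is a hypothetical helper name, not a
GTO command.

```shell
# Hypothetical helper (not a GTO command): split a version tag ref into
# "<name> <version>". Assumes tags look like <name>@<version>.
parse_version_ref() {
  # Drop the "refs/tags/" prefix if the CI platform added it.
  gto_ref=${1#refs/tags/}
  case "$gto_ref" in
    *@*)
      # Text before the first "@" is the name, text after it is the version.
      printf '%s %s\n' "${gto_ref%%@*}" "${gto_ref#*@}"
      ;;
    *)
      # Not a version tag (e.g. a branch ref): signal failure to the caller.
      return 1
      ;;
  esac
}
```

In GitHub Actions this could be called as `parse_version_ref "$GITHUB_REF"`.
Refs without an `@` make the function return a non-zero status, so a job can
skip them.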

## Acting in CI/CD

Once Git tags are pushed, you can start acting in systems downstream. A popular
options is to use CI/CD (triggered when Git tags are pushed). For general
details, check out something like
A popular deployment option is to use CI/CD (triggered when Git tags are
pushed). For general details, check out something like
[GitHub Actions](https://github.com/features/actions),
[GitLab CI/CD](https://docs.gitlab.com/ee/ci/) or
[Circle CI](https://circleci.com).
@@ -79,7 +124,7 @@ Alternatively, you can use environment variables (note the `GTO_` prefix)
$ GTO_EMOJIS=false gto show
```

## Git tag message format
## Git tags format

<admon type="tip">
