This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

GTO: updating docs #235

Merged
merged 19 commits into from
Dec 8, 2022
Changes from 11 commits
5 changes: 0 additions & 5 deletions content/docs/gto/dvc.md

This file was deleted.

8 changes: 4 additions & 4 deletions content/docs/gto/get-started.md
@@ -89,7 +89,7 @@ during registrations and promotions. The benefit of these Git-native mechanism
is that you can act upon GTO operations in any Git-based system downstream, for
example automating model deployments with CI/CD.

[tags]: /doc/gto/user-guide#git-tags-message-format
[tags]: /doc/gto/user-guide#git-tags-format

<details>

@@ -160,9 +160,9 @@ page.
Thanks for completing this Get Started!

- Learn how to
[specify important artifact's metainformation](/doc/gto/user-guide#annotations-in-artifactsyaml)
like `path`, `type` and `description` in the Artifact registry.
- Learn more about [acting in CI/CD](/doc/gto/user-guide#acting-in-ci-cd) upon
[get your artifacts](/doc/gto/user-guide#getting-artifacts-downstream) when
you need them (e.g. get the latest version or the version in specific stage).
- Learn more about [acting in CI/CD](/doc/gto/user-guide#acting-in-cicd) upon
version registrations and stage assignments.
- Reach out to us in [GH issues](https://github.com/iterative/gto/issues) or in
[Discord](https://discord.com/invite/dvwXA2N) to get your questions answered!
5 changes: 0 additions & 5 deletions content/docs/gto/mlem.md

This file was deleted.

144 changes: 144 additions & 0 deletions content/docs/gto/user-guide/dvc.md
@@ -0,0 +1,144 @@
# GTO with DVC

In many ML projects, data isn't stored in a Git repository and needs to be
downloaded from external sources. [DVC](https://dvc.org) is a common way to
store binaries for the artifacts registered with GTO.

<details>

### Learn about different approaches to this

1. You can commit artifacts directly to the Git repo. This is not recommended
unless the files are small. To bypass this limitation, you can use
[Git-lfs](https://git-lfs.github.com).
2. You can version binaries with [DVC](https://dvc.org/) and commit pointers to
them to the repo. This is the recommended approach for large files.
3. You can version binaries manually elsewhere, specifying their URL as `path`
in `artifacts.yaml`. This works if your artifacts are already versioned by
some external system.

</details>

If you are new to DVC, start with the
[DVC Get Started](https://dvc.org/doc/start), and then come back to this
guide.

<!-- ```

dvc init --no-scm
dvc remote add az azure://container-name
export AZURE_STORAGE_CONNECTION_STRING='YOUR_CONNECTION_STRING'

dvc import-url --no-download azure://container-name/data.parquet

``` -->

## Tracking an artifact with DVC

First, we need to start tracking the artifact with DVC. If you produce this artifact
Comment on lines +16 to +18
Contributor

I don't think we need this section at all: it's DVC documentation (redundant since we already linked to the Get Started)...

Contributor Author

But, this tells the bare minimum so people could understand the full workflow. DVC GS is quite huge, so just redirecting there without any context would be confusion. Do you disagree?

Contributor Author

Besides, this shows dvc import-url, which could be an important scenario for some folks starting to use MR while having some models outside.

Contributor Author

People without context will get the bare minimum from this part. People with context will just see some story here.

Contributor Author

WDYT?

Contributor

@jorgeorpinel jorgeorpinel Dec 23, 2022

I think that linking to https://dvc.org/doc/start/data-management/data-versioning or even straight to https://dvc.org/doc/command-reference/add would be acceptable.

Not a super strong opinion since this section is small, but I think that in principle we should not cross-document products. Logic: what if the DVC feature changes? dvc.org/doc will probably get updated, but not here.

Besides, this shows dvc import-url

Not seeing why that's particularly important. BTW, the blocks are missing git add .gitignore

in DVC Pipelines, it's done automatically.

If the artifact is located inside your Git repo, you can use `dvc add`:

```cli
$ dvc add model.pkl
$ git add model.pkl.dvc .gitignore
```

If the artifact is located in some external storage, we can use `dvc import-url`
to still keep metainformation about it in the repo (use `--no-download` to skip
downloading it):

```cli
$ dvc import-url --no-download s3://container/model.pkl
$ git add model.pkl.dvc .gitignore
```

## Annotating DVC-tracked artifacts

Once the artifact is tracked with DVC within your repo, we can annotate it with
GTO:
Comment on lines +39 to +40
Contributor

... Instead, just link to DVC docs for specific operations of relevance, like this:

Suggested change
Once the artifact is tracked with DVC within your repo, we can annotate it with
GTO:
Once the artifact is
[tracked with DVC](https://dvc.org/doc/command-reference/add) within your repo,
`gto annotate` it:

p.s. remember to try and use the commands in-line at least once (not just in fenced code blocks) so they get auto-linked (to the cmd ref).


```cli
$ gto annotate model --path model.pkl
```

This will modify `artifacts.yaml`, adding:

```yaml
model:
  path: model.pkl
```

Now you should commit changes to Git, and you can register versions and assign
stages referencing the new commit.

```cli
$ git add artifacts.yaml
$ git commit -m "version artifact binaries with DVC and annotate it with GTO"
$ git push
```

Now your changes are live in your Git repo and you can download your artifact to
use it 🙌

## Downloading artifacts

To download a GTO artifact whose binaries are stored with DVC, use the
`dvc get` or `dvc import` command:

```cli
$ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $OUTPUT_PATH
```

Check out the [User Guide](/doc/gto/user-guide#getting-artifacts-downstream) to
learn how to find `ARTIFACT_PATH` and `REVISION`.

<details>

### Example: downloading from outside the repo

If you need to download the latest version of `model`, that would be:

```cli
$ REVISION=$(gto show --repo $REPO model@greatest --ref)
$ ARTIFACT_PATH=$(gto describe --repo $REPO model --rev $REVISION --path)
$ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $ARTIFACT_PATH
```

</details>
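
The lookup-then-download steps above can be wrapped into a small script. This is only a minimal sketch: it assumes `gto` and `dvc` are installed and that the flags behave as shown in this guide. The function name `fetch_latest` and its arguments are hypothetical and not part of either tool.

```bash
#!/usr/bin/env bash
# Sketch: download the latest registered version of an artifact from a remote
# GTO registry. `fetch_latest` is a hypothetical helper, not a GTO or DVC command.
set -euo pipefail

fetch_latest() {
  local repo="$1"      # Git URL (or local path) of the registry repo
  local artifact="$2"  # artifact name as annotated in artifacts.yaml
  local out="${3:-}"   # optional output path; defaults to the annotated path
  local revision artifact_path

  # Git tag of the greatest registered version
  revision="$(gto show --repo "$repo" "${artifact}@greatest" --ref)"
  # path of the artifact at that revision (only known if the artifact is annotated)
  artifact_path="$(gto describe --repo "$repo" "$artifact" --rev "$revision" --path)"

  if [ -z "$artifact_path" ]; then
    echo "no path annotated for ${artifact} at ${revision}" >&2
    return 1
  fi

  # download the DVC-tracked file at that revision
  dvc get "$repo" "$artifact_path" --rev "$revision" -o "${out:-$artifact_path}"
}
```

Calling `fetch_latest "$REPO" model` would then run the same three commands as the example, stopping early if no `path` is annotated for the artifact.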

<details>

### Example: downloading in CI

If you need to download an artifact from the same repo, that would be a bit
simpler (taking GH Actions as an example):

```cli
$ ARTIFACT_PATH=$(gto describe model --rev $GITHUB_REF --path)
$ dvc pull $ARTIFACT_PATH
```

</details>

<!--
Refer to DVC install guide and Get Started to learn DVC first. If you're already
familiar with DVC, let's set it up for GTO repo (use your remote storage instead
of `s3://mybucket/myrepo`):

```cli
$ dvc init
$ dvc remote add myremote -d s3://mybucket/myrepo
$ git add .dvc/config
```

##

If you want to version your artifact with DVC, you need to add it to DVC first
(skip this step if you already have DVC-tracked artifacts):

```cli
$ dvc add path/to/artifact
$ dvc push
$ git add
``` -->
@@ -2,11 +2,11 @@

GTO helps you build an Artifact Registry out of your Git repository. It creates
annotated [Git tags](https://git-scm.com/book/en/v2/Git-Basics-Tagging) with a
[special format](#git-tag-message-format) in their message, and manages an
[special format](#git-tags-format) in their message, and manages an
`artifacts.yaml` file. Since committing large files to Git is not a good
practice, you should consider not committing your artifacts to Git. Instead, use
[DVC](https://dvc.org/doc), Git-lfs, or any method commit pointers to the binary
files instead.
practice, you should consider not committing your artifacts to Git. Instead,
[use DVC](/doc/gto/user-guide/dvc), Git-lfs, or any other method to commit
pointers to the binary files.

## Annotations in artifacts.yaml
Contributor

Minor but I'm a bit worried about the use of term annotation here (and in gto annotate) given that Git tags also have annotations AND since we specifically use annotated Git tags only. May be confusing (esp. for users familiar with Git tags).

Contributor Author

Yep. Don't remember why we decided on using "annotations" considering this collision.


@@ -17,19 +17,62 @@ itself doesn't contain path to the artifact, type of it (it could be `model` or
[in a CI/CD system](#acting-in-ci-cd) downstream. But for more advanced cases,
we should codify them in the registry itself.

To keep this metainformation, GTO uses `artifacts.yaml` file. Commands like
To keep this metadata, GTO uses `artifacts.yaml` file. Commands like
`gto annotate` and `gto remove` are used to modify it, while `gto describe`
helps get them when they're needed.

If you would like to see an example of `artifacts.yaml`, check out the
<admon type="tip">

An example of `artifacts.yaml` can be found in
[example-gto](https://github.com/iterative/example-gto/blob/main/artifacts.yaml)
repo.

</admon>

## Getting artifacts in systems downstream

You may need to get a specific artifact version to a certain environment, most
likely the latest one or the one currently assigned to the stage. Use `gto show`
to find the [Git reference] (tag) you need (note that
[CI platforms](#acting-in-ci-cd) may expose it for you, e.g. the `GITHUB_REF`
[env var in GitHub Actions]):

[git reference]: https://git-scm.com/book/en/v2/Git-Internals-Git-References
[env var in github actions]:
https://docs.github.com/en/actions/learn-github-actions/environment-variables

<admon type="tip">

GTO doesn't provide a way to deliver the artifacts, but you can [use DVC] or
[employ MLEM] for that.

[use dvc]: /doc/gto/user-guide/dvc
[employ mlem]: /doc/gto/user-guide/mlem

</admon>

```cli
# getting the Git reference for the latest version
$ gto show churn@greatest --ref
[email protected]

$ gto show churn#prod --ref # by assigned stage
[email protected]
```

You may need the artifact's file path. If
[annotated](#annotations-in-artifactsyaml), it can be discovered with
`gto describe`:

```cli
$ gto describe churn --rev [email protected] --path
models/churn.pkl
```
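
In CI, the pushed tag often arrives as a full Git reference (for example in `GITHUB_REF`) rather than as a bare tag name. The following is a minimal sketch of splitting such a reference into artifact name and version with shell parameter expansion. It assumes version-registration tags are named `<artifact>@<version>`; the function `parse_version_tag` and the variables it sets are hypothetical, and it deliberately rejects anything else (such as stage-assignment tags).

```bash
#!/usr/bin/env bash
# Sketch: split a Git tag reference into artifact name and version.
# Assumes version-registration tags are named "<artifact>@<version>".
# `parse_version_tag` is a hypothetical helper, not part of GTO.

parse_version_tag() {
  local ref="$1"
  local tag="${ref#refs/tags/}"   # drop the "refs/tags/" prefix if present

  case "$tag" in
    ?*@?*) ;;                     # looks like "<artifact>@<version>"
    *) echo "not a version tag: $tag" >&2; return 1 ;;
  esac

  ARTIFACT_NAME="${tag%@*}"       # everything before the last "@"
  ARTIFACT_VERSION="${tag##*@}"   # everything after the last "@"
}

parse_version_tag "refs/tags/churn@v3.1.1"
echo "$ARTIFACT_NAME $ARTIFACT_VERSION"   # prints "churn v3.1.1"
```

The parsed name can then be passed to `gto describe` to look up the annotated path for that version.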

## Acting in CI/CD

Once Git tags are pushed, you can start acting in systems downstream. A popular
options is to use CI/CD (triggered when Git tags are pushed). For general
details, check out something like
A popular deployment option is to use CI/CD (triggered when Git tags are
pushed). For general details, check out something like
[GitHub Actions](https://github.com/features/actions),
[GitLab CI/CD](https://docs.gitlab.com/ee/ci/) or
[Circle CI](https://circleci.com).
@@ -79,7 +122,7 @@ Alternatively, you can use environment variables (note the `GTO_` prefix)
$ GTO_EMOJIS=false gto show
```

## Git tag message format
## Git tags format
Comment on lines -82 to +127
Contributor

@jorgeorpinel jorgeorpinel Nov 30, 2022

But it's really the tag's annotation message that has a format right? Not all tags are annotated in Git (in fact it's rare to use annotated tags, I think) so this could be confusing/misleading wording.

Contributor Author

@aguschin aguschin Nov 30, 2022

No, it's Git tag name. E.g. to register a version, you need a Git tag called [email protected], annotated with some message. It could be annotated with any text, but without a text annotation it would be a lightweight tag, not an annotated one (and GTO ignores lightweight tags). For example, see https://github.com/iterative/example-gto/tags

image

`churn@v3.1.1` is a tag name and "Registering artifact churn version v3.1.1" is a Git tag message.

Contributor

It could be annotated with any text, but without a text annotation it would be a lightweight tag...

I see. May I ask why does GTO ignore lightweight tags?

Contributor Author

Sure, we had a discussion about it here iterative/gto#127


<admon type="tip">
