This repository has been archived by the owner on Oct 16, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* draft
* updating text
* add GTO+MLEM page
* move pages under user-guide
* updating dvc and mlem pages
* small enhancements
* fix sidebar
* update 'what's next' section in GS
* fix links
* Apply suggestions from code review
  Co-authored-by: Jorge Orpinel <[email protected]>
* fix lint
* Apply suggestions from code review
  Co-authored-by: Jorge Orpinel <[email protected]>
* lint
* remove extra details
* Apply suggestions from code review
  Co-authored-by: Jorge Orpinel <[email protected]>
* Restyled by prettier-markdown (#253)
  Co-authored-by: Restyled.io <[email protected]>
* hide mlem page

Co-authored-by: Jorge Orpinel <[email protected]>
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
1 parent 9c9185d, commit e306e13
Showing 8 changed files with 355 additions and 33 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
# GTO with DVC | ||
|
||
Large files are typically not stored in a Git repository, so they need to be | ||
downloaded from external sources. [DVC](https://dvc.org) is a great way to store | ||
your GTO artifact files while keeping a pointer in the repo, and simplifying | ||
[data management] and synchronization. | ||
|
||
[data management]: https://dvc.org/doc/user-guide/data-management | ||
|
||
<admon icon="book"> | ||
|
||
If you're new to DVC, [get started here](https://dvc.org/doc/start) first. | ||
|
||
</admon> | ||
|
||
## Tracking an artifact with DVC | ||
|
||
First, we need to start tracking the artifact with DVC. If a DVC pipeline | ||
produces the artifact, it is tracked automatically. | ||
|
||
If the artifact is located inside your Git repo, you can use `dvc add`: | ||
|
||
```cli | ||
$ dvc add model.pkl | ||
$ git add model.pkl.dvc | ||
``` | ||
|
||
If the artifact is located in external storage, we can use `dvc import-url` to | ||
keep metadata about it in the repo (add `--no-download` to skip downloading the | ||
file itself): | ||
|
||
```cli | ||
$ dvc import-url --no-download s3://container/model.pkl | ||
$ git add model.pkl.dvc | ||
``` | ||
|
||
## Annotating DVC-tracked artifacts | ||
|
||
Once the artifact is tracked with DVC within your repo, we can annotate it with | ||
GTO: | ||
|
||
```cli | ||
$ gto annotate model --path model.pkl | ||
``` | ||
|
||
This will write the following to `artifacts.yaml`: | ||
|
||
```yaml | ||
model: | ||
path: model.pkl | ||
``` | ||
|
||
Commit the changes to Git. You can then `gto register` artifact versions and | ||
`gto assign` them to deployment stages, referencing the new commit. | ||
|
||
```cli | ||
$ git add artifacts.yaml | ||
$ git commit -m "version the artifact binary with DVC and annotate it with GTO" | ||
``` | ||
|
||
To share your work, you'll need [remote storage] set up in DVC. You can then | ||
upload the artifact files and push the changes to the repo: | ||
|
||
[remote storage]: https://dvc.org/doc/command-reference/remote | ||
|
||
```cli | ||
$ dvc push | ||
$ git push | ||
``` | ||
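The steps above can be chained into one helper. This is a minimal sketch, not an official workflow: `publish_artifact` is a hypothetical function name, and it assumes `dvc`, `git`, and `gto` are on the `PATH`, that DVC remote storage is already configured, and that the artifact file lives inside the Git repo.

```shell
# Sketch: track a file with DVC, annotate it with GTO, and share the result.
# publish_artifact is a hypothetical helper, not part of DVC or GTO.
publish_artifact() {
  name=$1   # artifact name to record in artifacts.yaml, e.g. "model"
  path=$2   # file inside the Git repo, e.g. "model.pkl"

  if [ -z "$name" ] || [ -z "$path" ]; then
    echo "usage: publish_artifact NAME PATH" >&2
    return 2
  fi

  # Stop at the first failing step so nothing half-finished gets pushed.
  dvc add "$path" || return 1                        # creates "$path.dvc"
  git add "$path.dvc" || return 1
  gto annotate "$name" --path "$path" || return 1    # updates artifacts.yaml
  git add artifacts.yaml || return 1
  git commit -m "track $name with DVC and annotate it with GTO" || return 1
  dvc push || return 1                               # upload the file to remote storage
  git push                                           # upload the pointer and the annotation
}
```

Calling `publish_artifact model model.pkl` runs the same commands shown in this guide, in the same order.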
|
||
## Downloading artifacts | ||
|
||
To download GTO artifact files tracked with DVC, you can use the `dvc get` or | ||
`dvc import` commands (or simply `dvc pull` if you are working inside the repo). | ||
|
||
```cli | ||
$ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $OUTPUT_PATH | ||
``` | ||
|
||
See the | ||
[User Guide](/doc/gto/user-guide#getting-artifacts-in-systems-downstream) to | ||
learn how to find `ARTIFACT_PATH` and `REVISION`. | ||
|
||
<details> | ||
|
||
### Example: downloading from outside the repo | ||
|
||
To download the latest version of `model` from outside the repo, run: | ||
|
||
```cli | ||
$ REVISION=$(gto show --repo $REPO model@greatest --ref) | ||
$ ARTIFACT_PATH=$(gto describe --repo $REPO model --rev $REVISION --path) | ||
$ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $OUTPUT_PATH | ||
``` | ||
|
||
</details> | ||
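In a script, it helps to stop early when GTO returns nothing. Below is a hedged sketch of the same lookup with basic error checks. `fetch_artifact` is a hypothetical helper name. The `gto` and `dvc` flags follow the ones used in this guide, and both tools are assumed to be on the `PATH`.

```shell
# Sketch: download the greatest registered version of an artifact from another repo.
# fetch_artifact is a hypothetical helper, not a GTO or DVC command.
fetch_artifact() {
  repo=$1     # Git URL or local path of the registry repo
  name=$2     # artifact name, e.g. "model"
  output=$3   # where to save the downloaded file

  # Git reference (tag) of the greatest registered version.
  revision=$(gto show --repo "$repo" "$name@greatest" --ref) || return 1
  if [ -z "$revision" ]; then
    echo "no registered version found for $name" >&2
    return 1
  fi

  # File path recorded in artifacts.yaml at that revision.
  artifact_path=$(gto describe --repo "$repo" "$name" --rev "$revision" --path) || return 1
  if [ -z "$artifact_path" ]; then
    echo "$name has no path annotation at $revision" >&2
    return 1
  fi

  dvc get "$repo" "$artifact_path" --rev "$revision" -o "$output"
}
```

For example, `fetch_artifact "$REPO" model model.pkl` would save the file as `model.pkl` in the current directory.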
|
||
<details> | ||
|
||
### Example: downloading in CI | ||
|
||
If you need to download an artifact from the same repo, the process is a bit | ||
simpler (taking GitHub Actions as an example): | ||
|
||
```cli | ||
$ ARTIFACT_PATH=$(gto describe model --rev $GITHUB_REF --path) | ||
$ dvc pull $ARTIFACT_PATH | ||
``` | ||
|
||
</details> |
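The CI example hard-codes the artifact name `model`. When a workflow is triggered by a pushed tag, the name can instead be derived from the tag itself. The sketch below assumes GTO tags look like `name@v1.2.3` for registered versions and `name#stage#N` for stage assignments, and that the CI platform exposes the ref as `refs/tags/<tag>` (as `GITHUB_REF` does for tag pushes). `parse_gto_ref` is a hypothetical helper, not a GTO command.

```shell
# Sketch: split a Git tag ref created by GTO into "<name> <kind> <value>".
# Assumes tags of the form "name@v1.2.3" (version) or "name#stage#N" (stage).
parse_gto_ref() {
  tag=${1#refs/tags/}   # drop the "refs/tags/" prefix if present
  case $tag in
    *@*)
      name=${tag%%@*}       # text before the "@"
      version=${tag#*@}     # text after the "@"
      echo "$name version $version"
      ;;
    *\#*)
      name=${tag%%\#*}      # text before the first "#"
      rest=${tag#*\#}       # "stage#N"
      stage=${rest%%\#*}    # text before the next "#"
      echo "$name stage $stage"
      ;;
    *)
      echo "not a GTO tag: $tag" >&2
      return 1
      ;;
  esac
}
```

The first word of the output is the artifact name, which could be passed to `gto describe` in place of the hard-coded `model`.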
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,11 +2,15 @@ | |
|
||
GTO helps you build an Artifact Registry out of your Git repository. It creates | ||
annotated [Git tags](https://git-scm.com/book/en/v2/Git-Basics-Tagging) with a | ||
[special format](#git-tag-message-format) in their message, and manages an | ||
`artifacts.yaml` file. Since committing large files to Git is not a good | ||
practice, you should consider not committing your artifacts to Git. Instead, use | ||
[DVC](https://dvc.org/doc), Git-lfs, or any method commit pointers to the binary | ||
files instead. | ||
[special format](#git-tags-format), and manages an `artifacts.yaml` file. | ||
|
||
<admon type="tip"> | ||
|
||
Storing large files in Git repos is not a good practice. Avoid committing your | ||
ML artifacts to Git. You can [use DVC](/doc/gto/user-guide/dvc), Git LFS, or any | ||
other method to commit pointers to the data, models, etc. instead. | ||
|
||
</admon> | ||
|
||
## Annotations in artifacts.yaml | ||
|
||
|
@@ -17,19 +21,60 @@ or `dataset`), or any other useful information about them. For simple projects | |
artifacts (e.g. [in CI/CD](#acting-in-ci-cd)). But for more advanced cases, we | ||
should codify them in the registry itself. | ||
|
||
To keep this metainformation, GTO uses `artifacts.yaml` file. Commands like | ||
`gto annotate` and `gto remove` are used to modify it, while `gto describe` | ||
helps get them when they're needed. | ||
To keep this metadata, GTO uses a human-readable `artifacts.yaml` file. The | ||
`gto describe`, `gto annotate`, and `gto remove` commands are used to display | ||
and manage its contents. | ||
|
||
<admon type="tip"> | ||
|
||
An example `artifacts.yaml` can be found | ||
[in the `example-gto` repo](https://github.com/iterative/example-gto). | ||
|
||
</admon> | ||
|
||
## Getting artifacts in systems downstream | ||
|
||
You may need to get a specific artifact version to a certain environment, most | ||
likely the latest one or the one currently assigned to the stage. Use `gto show` | ||
to find the [Git reference] (tag) you need (note that | ||
[CI platforms](#acting-in-ci-cd) may expose it for you, e.g. the `GITHUB_REF` | ||
[env var in GitHub Actions]): | ||
|
||
[git reference]: https://git-scm.com/book/en/v2/Git-Internals-Git-References | ||
[env var in github actions]: | ||
https://docs.github.com/en/actions/learn-github-actions/environment-variables | ||
|
||
If you would like to see an example of `artifacts.yaml`, check out the | ||
[example-gto](https://github.com/iterative/example-gto/blob/main/artifacts.yaml) | ||
repo. | ||
<admon type="tip"> | ||
|
||
GTO doesn't provide a way to deliver the artifacts, but you can [use DVC] or | ||
[employ MLEM] for that. | ||
|
||
[use dvc]: /doc/gto/user-guide/dvc | ||
[employ mlem]: https://mlem.ai | ||
|
||
</admon> | ||
|
||
```cli | ||
$ gto show churn@latest --ref | ||
[email protected] | ||
$ gto show churn#prod --ref # by assigned stage | ||
[email protected] | ||
``` | ||
|
||
You may need the artifact's file path. If | ||
[annotated](#annotations-in-artifactsyaml), it can be discovered with | ||
`gto describe`: | ||
|
||
```cli | ||
$ gto describe churn --rev [email protected] --path | ||
models/churn.pkl | ||
``` | ||
|
||
## Acting in CI/CD | ||
|
||
Once Git tags are pushed, you can start acting in systems downstream. A popular | ||
options is to use CI/CD (triggered when Git tags are pushed). For general | ||
details, check out something like | ||
A popular deployment option is to use CI/CD (triggered when Git tags are | ||
pushed). For general details, check out something like | ||
[GitHub Actions](https://github.com/features/actions), | ||
[GitLab CI/CD](https://docs.gitlab.com/ee/ci/) or | ||
[Circle CI](https://circleci.com). | ||
|
@@ -79,7 +124,7 @@ Alternatively, you can use environment variables (note the `GTO_` prefix) | |
$ GTO_EMOJIS=false gto show | ||
``` | ||
|
||
## Git tag message format | ||
## Git tags format | ||
|
||
<admon type="tip"> | ||
|
||
|