diff --git a/content/docs/gto/dvc.md b/content/docs/gto/dvc.md deleted file mode 100644 index 315a042b..00000000 --- a/content/docs/gto/dvc.md +++ /dev/null @@ -1,5 +0,0 @@ -## Working with large artifacts - -...Git-lfs, DVC and what else is there... - -...(I'll remove this page before merging this PR)... diff --git a/content/docs/gto/get-started.md b/content/docs/gto/get-started.md index 1a5b2321..6a7d2768 100644 --- a/content/docs/gto/get-started.md +++ b/content/docs/gto/get-started.md @@ -89,7 +89,7 @@ during registrations and promotions. The benefit of these Git-native mechanism is that you can act upon GTO operations in any Git-based system downstream, for example automating model deployments with CI/CD. -[tags]: /doc/gto/user-guide#git-tags-message-format +[tags]: /doc/gto/user-guide#git-tags-format
@@ -160,9 +160,9 @@ page. Thanks for completing this Get Started! - Learn how to - [specify important artifact's metainformation](/doc/gto/user-guide#annotations-in-artifactsyaml) - like `path`, `type` and `description` in the Artifact registry. -- Learn more about [acting in CI/CD](/doc/gto/user-guide#acting-in-ci-cd) upon + [get your artifacts](/doc/gto/user-guide#getting-artifacts-downstream) when + you need them (e.g. get the latest version or the version in a specific stage). +- Learn more about [acting in CI/CD](/doc/gto/user-guide#acting-in-cicd) upon version registrations and stage assignments. - Reach us out in [GH issues](https://github.com/iterative/gto/issues) or in [Discord](https://discord.com/invite/dvwXA2N) to get your questions answered! diff --git a/content/docs/gto/mlem.md b/content/docs/gto/mlem.md deleted file mode 100644 index be396e3a..00000000 --- a/content/docs/gto/mlem.md +++ /dev/null @@ -1,5 +0,0 @@ -# Deploy models with MLEM - -...tell about MLEM and show `example-gto` repo `mlem` branch... - -...(I'll remove this page before merging this PR)... diff --git a/content/docs/gto/user-guide/dvc.md b/content/docs/gto/user-guide/dvc.md new file mode 100644 index 00000000..7e8af6e3 --- /dev/null +++ b/content/docs/gto/user-guide/dvc.md @@ -0,0 +1,109 @@ +# GTO with DVC + +Large files are typically not stored in a Git repository, so they need to be +downloaded from external sources. [DVC](https://dvc.org) is a great way to store +your GTO artifact files while keeping a pointer in the repo, and simplifying +[data management] and synchronization. + +[data management]: https://dvc.org/doc/user-guide/data-management + + + +If you're new to DVC, [get started here](https://dvc.org/doc/start) first. + + + +## Tracking an artifact with DVC + +First, we need to start tracking the artifact with DVC. If you produce it +in DVC Pipelines, it's done automatically. 
+ +If the artifact is located inside your Git repo, you can use `dvc add`: + +```cli +$ dvc add model.pkl +$ git add model.pkl.dvc +``` + +If the artifact is located in some external storage, we can use `dvc import-url` +to still keep metainformation about it in the repo (use `--no-download` to skip +downloading it): + +```cli +$ dvc import-url --no-download s3://container/model.pkl +$ git add model.pkl.dvc +``` + +## Annotating DVC-tracked artifacts + +Once the artifact is tracked with DVC within your repo, we can annotate it with +GTO: + +```cli +$ gto annotate model --path model.pkl +``` + +This will write the following to `artifacts.yaml`: + +```yaml +model: + path: model.pkl +``` + +Commit the changes to Git in order to `gto register` artifact versions and +`gto assign` them to deployment stages referencing the new commit. + +```cli +$ git add artifacts.yaml +$ git commit -m "version artifact binaries with DVC and annotate it with GTO" +``` + +To share your work, you'll need [remote storage] set up in DVC. You can then +upload the artifact files and the changes to the repo: + +[remote storage]: https://dvc.org/doc/command-reference/remote + +```cli +$ dvc push +$ git push +``` + +## Downloading artifacts + +To download GTO artifact files tracked with DVC, you can use the `dvc get` or +`dvc import` commands (or simply use `dvc pull` if you `cd` inside the repo). + +```cli +$ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $OUTPUT_PATH +``` + +Check out the [User Guide](/doc/gto/user-guide#getting-artifacts-downstream) to +learn how to find out `ARTIFACT_PATH` and `REVISION`. + +
+ +### Example: downloading from outside the repo + +If you need to download the latest version of `model`, that would be: + +```cli +$ REVISION=$(gto show --repo $REPO model@greatest --ref) +$ ARTIFACT_PATH=$(gto describe --repo $REPO model --rev $REVISION --path) +$ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $ARTIFACT_PATH +``` + +
+ +
+ +### Example: downloading in CI + +If you need to download an artifact from the same repo, that would be a bit +simpler (taking GH Actions as an example): + +```cli +$ ARTIFACT_PATH=$(gto describe model --rev $GITHUB_REF --path) +$ dvc pull $ARTIFACT_PATH +``` + +
diff --git a/content/docs/gto/user-guide.md b/content/docs/gto/user-guide/index.md similarity index 67% rename from content/docs/gto/user-guide.md rename to content/docs/gto/user-guide/index.md index a785c4c0..050e0ae6 100644 --- a/content/docs/gto/user-guide.md +++ b/content/docs/gto/user-guide/index.md @@ -2,11 +2,15 @@ GTO helps you build an Artifact Registry out of your Git repository. It creates annotated [Git tags](https://git-scm.com/book/en/v2/Git-Basics-Tagging) with a -[special format](#git-tag-message-format) in their message, and manages an -`artifacts.yaml` file. Since committing large files to Git is not a good -practice, you should consider not committing your artifacts to Git. Instead, use -[DVC](https://dvc.org/doc), Git-lfs, or any method commit pointers to the binary -files instead. +[special format](#git-tags-format), and manages an `artifacts.yaml` file. + + + +Storing large files in Git repos is not a good practice. Avoid committing your +ML artifacts to Git. You can [use DVC](/doc/gto/user-guide/dvc), Git LFS, or any +other method to commit pointers to the data, models, etc. instead. + + ## Annotations in artifacts.yaml @@ -17,19 +21,60 @@ or `dataset`), or any other useful information about them. For simple projects artifacts (e.g. [in CI/CD](#acting-in-ci-cd)). But for more advanced cases, we should codify them in the registry itself. -To keep this metainformation, GTO uses `artifacts.yaml` file. Commands like -`gto annotate` and `gto remove` are used to modify it, while `gto describe` -helps get them when they're needed. +To keep this metadata, GTO uses a human-readable `artifacts.yaml` file. The +`gto describe`, `gto annotate`, and `gto remove` commands are used to display +and manage its contents. + + + +An example `artifacts.yaml` can be found +[in the `example-gto` repo](https://github.com/iterative/example-gto). 
+ + + +## Getting artifacts in systems downstream + +You may need to get a specific artifact version to a certain environment, most +likely the latest one or the one currently assigned to the stage. Use `gto show` +to find the [Git reference] (tag) you need (note that +[CI platforms](#acting-in-ci-cd) may expose it for you, e.g. the `GITHUB_REF` +[env var in GitHub Actions]): + +[git reference]: https://git-scm.com/book/en/v2/Git-Internals-Git-References +[env var in github actions]: + https://docs.github.com/en/actions/learn-github-actions/environment-variables -If you would like to see an example of `artifacts.yaml`, check out the -[example-gto](https://github.com/iterative/example-gto/blob/main/artifacts.yaml) -repo. + + +GTO doesn't provide a way to deliver the artifacts, but you can [use DVC] or +[employ MLEM] for that. + +[use dvc]: /doc/gto/user-guide/dvc +[employ mlem]: https://mlem.ai + + + +```cli +$ gto show churn@latest --ref +churn@v3.1.1 + +$ gto show churn#prod --ref # by assigned stage +churn@v3.0.0 +``` + +You may need the artifact's file path. If +[annotated](#annotations-in-artifactsyaml), it can be discovered with +`gto describe`: + +```cli +$ gto describe churn --rev churn@v3.0.0 --path +models/churn.pkl +``` ## Acting in CI/CD -Once Git tags are pushed, you can start acting in systems downstream. A popular -options is to use CI/CD (triggered when Git tags are pushed). For general -details, check out something like +A popular deployment option is to use CI/CD (triggered when Git tags are +pushed). For general details, check out something like [GitHub Actions](https://github.com/features/actions), [GitLab CI/CD](https://docs.gitlab.com/ee/ci/) or [Circle CI](https://circleci.com). 
@@ -79,7 +124,7 @@ Alternatively, you can use environment variables (note the `GTO_` prefix) $ GTO_EMOJIS=false gto show ``` -## Git tag message format +## Git tags format diff --git a/content/docs/gto/user-guide/mlem.md b/content/docs/gto/user-guide/mlem.md new file mode 100644 index 00000000..de746a4e --- /dev/null +++ b/content/docs/gto/user-guide/mlem.md @@ -0,0 +1,172 @@ +# Deploy models with MLEM + +Creating model versions and assigning stages in Model Registry is usually done +to trigger some action downstream. To easily build Docker images with your +models for new model versions, or simply deploy them upon stage assignments, you +can use [MLEM](/doc). + +If you're new to MLEM, please head to [Get Started](/doc/get-started) to learn +MLEM basics. + +## Annotating MLEM models with GTO + +A model saved by MLEM typically consists of a model binary file (for example, +`nn.pkl`) and a metadata file (`nn.pkl.mlem`). + +To annotate a model with GTO: + +```cli +$ gto annotate model --path nn.pkl +``` + +This will modify `artifacts.yaml`, adding: + +```yaml +model: + path: nn.pkl +``` + +Now you should commit changes to Git, and you can register versions and assign +stages referencing the new commit. + +```cli +$ git add artifacts.yaml +$ git commit -m "annotate MLEM model with GTO" +$ git push +``` + +Now your changes are live in your Git repo and ready to be used 🙌 + +## Using GTO artifacts with MLEM + +When you want to use a GTO artifact with MLEM, you need to get the right +revision and path (see +[User Guide](/doc/gto/user-guide#getting-artifacts-downstream)). Since MLEM can +work with remote artifacts, just point to it in any MLEM command (taking +`mlem build` as an example): + +```cli +$ mlem build docker \ + --project $REPO \ + --model $ARTIFACT_PATH \ + --rev $REVISION \ + --image.name mlem-model +``` + +## Creating a CI/CD workflow + +Since Git tags can trigger a CI/CD workflow, now we need to add the workflow we +need to the repo. 
+ + + + +This workflow will build a Docker image out of the model and push it to +Docker Hub ([learn more](/doc/user-guide/building) about configuring the build and +using other destinations). + +```yaml +# .github/workflows/build.yaml +on: + push: + tags: + - '*' + +jobs: + act: + name: Build a Docker image for new model versions + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - id: gto + uses: iterative/gto-action@v1 + - uses: actions/setup-python@v2 + - name: Install dependencies + run: | + pip install --upgrade pip setuptools wheel + pip install -r requirements.txt + - if: steps.gto.outputs.event == 'registration' + run: | + # TODO: check this works + # What credentials we need to specify to publish image somewhere? + mlem build docker \ + --model '${{ steps.gto.outputs.path }}' \ + --image.name ${{ steps.gto.outputs.name }} \ + --image.tag '${{ steps.gto.outputs.version }}' \ + --env.registry docker_io +``` + +Note that the builder can be +[pre-configured](/doc/user-guide/building#pre-configured-builders) to specify +some options that should be fixed. 
+ + + + +This workflow will deploy a model to Heroku upon stage assignment: + +```yaml +# .github/workflows/deploy.yaml +on: + push: + tags: + - '*' + +# specify credentials needed to run deployment and keep the deployment state +env: + HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }} + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} + +jobs: + act: + name: Deploy a model upon stage assignment + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - id: gto + uses: iterative/gto-action@v1 + - uses: actions/setup-python@v2 + - name: Install dependencies + run: | + pip install --upgrade pip setuptools wheel + pip install -r requirements.txt + - if: steps.gto.outputs.event == 'assignment' + run: | + # TODO: check this works + mlem deployment run \ + --load deploy/${{ steps.gto.outputs.stage }} \ + --model ${{ steps.gto.outputs.path }} +``` + +This relies on having [deployment declarations](/doc/user-guide/deploying) in +the `deploy/` directory, such as: + +```yaml +# deploy/dev.yaml +object_type: deployment +type: heroku +app_name: mlem-dev +``` + +This declaration is read by MLEM in CI and the model promoted to `dev` is +deployed to https://mlem-dev.herokuapp.com. + +Note that you need to provide environment variables to deploy to Heroku and +update the [deployment state](/doc/user-guide/deploying). The location for the +state should be +[configured](/doc/user-guide/deploying#setting-up-remote-state-manager) in the MLEM +config file: + +```yaml +# .mlem.yaml +core: + state: + uri: s3://bucket/path +``` + +Check out [another example](https://github.com/iterative/example-gto/tree/mlem) +of MLEM model deployment in the `mlem` branch of the `example-gto` repo. 
+ + + diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 50ca8c31..ca484e4b 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -617,7 +617,14 @@ { "slug": "user-guide", "label": "User Guide", - "source": "user-guide.md" + "source": "user-guide/index.md", + "children": [ + { + "slug": "dvc", + "label": "GTO with DVC", + "source": "dvc.md" + } + ] }, { "slug": "command-reference", diff --git a/content/docs/use-cases/model-registry/index.md b/content/docs/use-cases/model-registry/index.md index 4b267bcb..c7a11f23 100644 --- a/content/docs/use-cases/model-registry/index.md +++ b/content/docs/use-cases/model-registry/index.md @@ -2,15 +2,14 @@ A **model registry** is a tool to catalog ML models and their versions. Models from your data science projects can be discovered, tested, shared, deployed, and -audited from there. [DVC](https://github.com/iterative/dvc), [GTO], and [MLEM] +audited from there. [DVC](https://github.com/iterative/dvc), [GTO], and MLEM enable these capabilities on top of Git, so you can stick to en existing software engineering stack. No more divide between ML engineering and operations! -[gto]: https://github.com/iterative/gto -[mlem]: https://mlem.ai/ +[gto]: /doc/gto ML model registries give your team key capabilities: