docs: rework README (#617)
* start on reworking README

* more README reworking

* BitBucket => Bitbucket

* ghcr.io on GHA

* consistent backend naming

* consistent backend ordering

* review comments

* example PAT requirements
casperdcl authored Jul 4, 2021
1 parent 62c05fc commit 9dc794e
Showing 4 changed files with 125 additions and 97 deletions.
188 changes: 108 additions & 80 deletions README.md
@@ -5,28 +5,32 @@
[![GHA](https://img.shields.io/github/v/tag/iterative/setup-cml?label=GitHub%20Actions&logo=GitHub)](https://github.com/iterative/setup-cml)
[![npm](https://img.shields.io/npm/v/@dvcorg/cml?logo=npm)](https://www.npmjs.com/package/@dvcorg/cml)

**What is CML?** Continuous Machine Learning (CML) is an open-source library for
implementing continuous integration & delivery (CI/CD) in machine learning
projects. Use it to automate parts of your development workflow, including model
training and evaluation, comparing ML experiments across your project history,
and monitoring changing datasets.
**What is CML?** Continuous Machine Learning (CML) is an open-source CLI tool
for implementing continuous integration & delivery (CI/CD) with a focus on
MLOps. Use it to automate development workflows — including machine
provisioning, model training and evaluation, comparing ML experiments across
project history, and monitoring changing datasets.

![](https://static.iterative.ai/img/cml/github_cloud_case_lessshadow.png) _On
every pull request, CML helps you automatically train and evaluate models, then
generates a visual report with results and metrics. Above, an example report for
a [neural style transfer model](https://github.com/iterative/cml_cloud_case)._
CML can help train and evaluate models — and then generate a visual report with
results and metrics — automatically on every pull request.

We built CML with these principles in mind:
![](https://static.iterative.ai/img/cml/github_cloud_case_lessshadow.png) _An
example report for a
[neural style transfer model](https://github.com/iterative/cml_cloud_case)._

CML principles:

- **[GitFlow](https://nvie.com/posts/a-successful-git-branching-model/) for data
science.** Use GitLab or GitHub to manage ML experiments, track who trained ML
models or modified data and when. Codify data and models with
[DVC](#using-cml-with-dvc) instead of pushing to a Git repo.
- **Auto reports for ML experiments.** Auto-generate reports with metrics and
plots in each Git Pull Request. Rigorous engineering practices help your team
plots in each Git pull request. Rigorous engineering practices help your team
make informed, data-driven decisions.
- **No additional services.** Build your own ML platform using just GitHub or
GitLab and your favourite cloud services: AWS, Azure, GCP. No databases,
- **No additional services.** Build your own ML platform using GitLab,
Bitbucket, or GitHub. Optionally, use
[cloud storage](#configuring-cloud-storage-providers) as well as either
self-hosted or cloud runners (such as AWS EC2, Azure, or GCP). No databases,
services or complex setup needed.

:question: Need help? Just want to chat about continuous integration for ML?
@@ -36,29 +40,40 @@ We built CML with these principles in mind:
[YouTube video series](https://www.youtube.com/playlist?list=PL7WG7YrwYcnDBDuCkFbcyjnZQrdskFsBz)
for hands-on MLOps tutorials using CML!

## Table of contents
## Table of Contents

1. [Usage](#usage)
2. [Getting started (tutorial)](#getting-started)
3. [Using CML with DVC](#using-cml-with-dvc)
4. [Using self-hosted runners](#using-self-hosted-runners)
5. [Install CML as a package](#install-cml-as-a-package)
6. [Example Projects](#see-also)
1. [Setup (GitLab, GitHub, Bitbucket)](#setup)
2. [Usage](#usage)
3. [Getting started (tutorial)](#getting-started)
4. [Using CML with DVC](#using-cml-with-dvc)
5. [Advanced Setup (Self-hosted, local package)](#advanced-setup)
6. [Example projects](#see-also)

## Usage
## Setup

You'll need a GitHub or GitLab account to begin. Users may wish to familiarize
themselves with [Github Actions](https://help.github.com/en/actions) or
You'll need a GitLab, GitHub, or Bitbucket account to begin. Users may wish to
familiarize themselves with [GitHub Actions](https://help.github.com/en/actions)
or
[GitLab CI/CD](https://about.gitlab.com/stages-devops-lifecycle/continuous-integration).
Here, will discuss the GitHub use case.

- **GitLab users**: Please see our
[docs about configuring CML with GitLab](https://github.com/iterative/cml/wiki/CML-with-GitLab).
- **Bitbucket Cloud users**: Please see our
[docs on CML with Bitbucket Cloud](https://github.com/iterative/cml/wiki/CML-with-Bitbucket-Cloud).
_Bitbucket Server support estimated to arrive by May 2021._
- **GitHub Actions users**: The key file in any CML project is
`.github/workflows/cml.yaml`:
### GitLab

Please see our docs on
[CML with GitLab CI/CD](https://github.com/iterative/cml/wiki/CML-with-GitLab)
and in particular the
[personal access token](https://github.com/iterative/cml/wiki/CML-with-GitLab#variables)
requirement.

### Bitbucket

Please see our docs on
[CML with Bitbucket Cloud](https://github.com/iterative/cml/wiki/CML-with-Bitbucket-Cloud).
_Bitbucket Server support estimated to arrive by mid 2021._

### GitHub

The key file in any CML project is `.github/workflows/cml.yaml`:

```yaml
name: your-workflow-name
@@ -68,6 +83,7 @@ jobs:
runs-on: [ubuntu-latest]
# optionally use a convenient Ubuntu LTS + CUDA + DVC + CML image
# container: docker://dvcorg/cml:0-dvc2-base1-gpu
# container: docker://ghcr.io/iterative/cml:0-dvc2-base1-gpu
steps:
- uses: actions/checkout@v2
# may need to setup NodeJS & Python3 on e.g. self-hosted
@@ -92,38 +108,42 @@ jobs:
cml-send-comment report.md
```
## Usage
We helpfully provide CML and other useful libraries pre-installed on our
[custom Docker images](https://github.com/iterative/cml/blob/master/Dockerfile).
In the above example, uncommenting the field
`container: docker://dvcorg/cml:0-dvc2-base1-gpu` will make the GitHub Actions
`container: docker://dvcorg/cml:0-dvc2-base1-gpu` (or
`container: docker://ghcr.io/iterative/cml:0-dvc2-base1-gpu`) will make the
runner pull the CML Docker image. The image already has NodeJS, Python 3, DVC
and CML set up on an Ubuntu LTS base with CUDA libraries and
[Terraform](https://www.terraform.io) installed for convenience.

### CML Functions

CML provides a number of helper functions to help package the outputs of ML
workflows (including numeric data and visualizations about model performance)
into a CML report.
CML provides a number of functions to help package the outputs of ML workflows
(including numeric data and visualizations about model performance) into a CML
report.

Below is a table of CML functions for writing markdown reports and delivering
those reports to your CI system (GitHub Actions or GitLab CI).
those reports to your CI system.

| Function | Description | Inputs |
| ----------------------- | -------------------------------------------------------------- | ----------------------------------------------------------- |
| `cml-runner` | Starts a runner locally or in cloud providers | See [Arguments](https://github.com/iterative/cml#arguments) |
| `cml-publish` | Publish an image for writing to CML report. | `<path to image> --title <image title> --md` |
| `cml-send-comment` | Return CML report as a comment in your GitHub/GitLab workflow. | `<path to report> --head-sha <sha>` |
| `cml-send-github-check` | Return CML report as a check in GitHub | `<path to report> --head-sha <sha>` |
| `cml-pr` | Create a pull request. | TODO |
| `cml-tensorboard-dev` | Return a link to a Tensorboard.dev page | `--logdir <path to logs> --title <experiment title> --md` |
| Function | Description | Example Inputs |
| ----------------------- | ---------------------------------------------------------------- | ----------------------------------------------------------- |
| `cml-runner` | Launch a runner locally or hosted by a cloud provider | See [Arguments](https://github.com/iterative/cml#arguments) |
| `cml-publish` | Publicly host an image for displaying in a CML report | `<path to image> --title <image title> --md` |
| `cml-send-comment` | Return CML report as a comment in your GitLab/GitHub workflow | `<path to report> --head-sha <sha>` |
| `cml-send-github-check` | Return CML report as a check in GitHub | `<path to report> --head-sha <sha>` |
| `cml-pr` | Commit the given files to a new branch and create a pull request | `<path>...` |
| `cml-tensorboard-dev` | Return a link to a Tensorboard.dev page | `--logdir <path to logs> --title <experiment title> --md` |

### Customizing your CML report
#### CML Reports

CML reports are written in
[GitHub Flavored Markdown](https://github.github.com/gfm/). That means they can
contain images, tables, formatted text, HTML blocks, code snippets and more —
really, what you put in a CML report is up to you. Some examples:
The `cml-send-comment` command can be used to post reports. CML reports are
written in [GitHub Flavored Markdown](https://github.github.com/gfm/). That
means they can contain images, tables, formatted text, HTML blocks, code
snippets and more — really, what you put in a CML report is up to you. Some
examples:

:spiral_notepad: **Text** Write to your report using whatever method you prefer.
For example, copy the contents of a text file containing the results of ML model
@@ -142,7 +162,7 @@ report. For example, if `graph.png` is output by `python train.py`, run:
cml-publish graph.png --md >> report.md
```

## Getting Started
### Getting Started

1. Fork our
[example project repository](https://github.com/iterative/example_cml).
@@ -196,13 +216,13 @@ git add . && git commit -m "modify forest depth"
git push origin experiment
```

5. In GitHub, open up a Pull Request to compare the `experiment` branch to
5. In GitHub, open up a pull request to compare the `experiment` branch to
`master`.

![](https://static.iterative.ai/img/cml/make_pr.png)

Shortly, you should see a comment from `github-actions` appear in the Pull
Request with your CML report. This is a result of the `cml-send-comment`
Shortly, you should see a comment from `github-actions` appear in the pull
request with your CML report. This is a result of the `cml-send-comment`
function in your workflow.

![](https://static.iterative.ai/img/cml/first_report.png)
@@ -218,7 +238,7 @@ performance metrics and visualizations — in GitHub checks and comments. What
kind of workflow you want to run, and want to put in your CML report, is up to
you.

## Using CML with DVC
### Using CML with DVC

In many ML projects, data isn't stored in a Git repository, but needs to be
downloaded from external sources. [DVC](https://dvc.org) is a common way to
@@ -235,7 +255,7 @@ on: [push]
jobs:
run:
runs-on: [ubuntu-latest]
container: docker://dvcorg/cml:0-dvc2-base1
container: docker://ghcr.io/iterative/cml:0-dvc2-base1
steps:
- uses: actions/checkout@v2
- name: Train model
@@ -273,7 +293,11 @@ jobs:
> :warning: If you're using DVC with cloud storage, take note of environment
> variables for your storage format.

### Environment variables for supported cloud providers
#### Configuring Cloud Storage Providers

There are many
[supported cloud storage providers](https://dvc.org/doc/command-reference/remote/modify#available-parameters-per-storage-type).
Here are a few examples for some of the most frequently used providers:

<details>
<summary>
@@ -356,7 +380,9 @@ env:

</details>

## Using self-hosted runners
## Advanced Setup

### Self-hosted Runners

GitHub Actions are run on GitHub-hosted runners by default. However, there are
many great reasons to use your own runners: to take advantage of GPUs; to
@@ -367,7 +393,7 @@ data.
> [official GitHub documentation](https://help.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners)
> to get started setting up your own self-hosted runner.

### Allocating cloud resources with CML
#### Allocating Cloud Compute Resources with CML

When a workflow requires computational resources (such as GPUs), CML can
automatically allocate cloud instances using `cml-runner`. You can spin up
@@ -400,8 +426,8 @@ jobs:
cml-runner \
--cloud aws \
--cloud-region us-west \
--cloud-type=t2.micro \
--labels=cml-runner
--cloud-type t2.micro \
--labels cml-runner
model-training:
needs: [deploy-runner]
runs-on: [self-hosted, cml-runner]
@@ -424,10 +450,12 @@ instance in the `us-west` region. The `model-training` step then runs on the
newly-launched instance.

> :tada: **Note that you can use any container with this workflow!** While you
> must [have CML and its dependencies set up](#install-cml-as-a-package) to use
> functions such `cml-send-comment` from your instance, you can create your
> favourite training environment in the cloud by pulling the Docker container of
> your choice.
> must [have CML and its dependencies set up](#local-package) to use functions
> such as `cml-send-comment` from your instance, you can create your favourite
> training environment in the cloud by pulling the Docker container of your
> choice.

#### Docker Images

We like the CML container (`docker://dvcorg/cml`) because it comes loaded with
Python, CUDA, `git`, `node` and other essentials for full-stack data science.
@@ -442,7 +470,7 @@ image tags. The tag convention is `{CML_VER}-dvc{DVC_VER}-base{BASE_VER}{-gpu}`:
For example, `docker://dvcorg/cml:0-dvc2-base1-gpu`, or
`docker://ghcr.io/iterative/cml:0-dvc2-base1`.
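
As an illustration of the tag convention above, here is a minimal JavaScript sketch that splits a tag into its parts. The helper name `parseCmlImageTag` and the regular expression are hypothetical and not part of CML; the sketch assumes each version component contains no hyphen and keeps the components as strings rather than guessing at their numeric format.

```javascript
// Hypothetical helper (not part of CML): splits an image tag that follows
// {CML_VER}-dvc{DVC_VER}-base{BASE_VER}{-gpu} into its components.
// Assumption: no version component contains a hyphen.
const TAG_PATTERN = /^([^-]+)-dvc([^-]+)-base([^-]+)(-gpu)?$/;

function parseCmlImageTag(imageOrTag) {
  // Accept either a bare tag or a full reference such as
  // docker://ghcr.io/iterative/cml:0-dvc2-base1 (tag follows the last colon).
  const tag = imageOrTag.slice(imageOrTag.lastIndexOf(':') + 1);
  const match = TAG_PATTERN.exec(tag);
  if (!match) {
    throw new Error(`unrecognised CML image tag: ${tag}`);
  }
  const [, cmlVer, dvcVer, baseVer, gpuSuffix] = match;
  // The optional group is undefined when the -gpu suffix is absent.
  return { cmlVer, dvcVer, baseVer, gpu: gpuSuffix !== undefined };
}

console.log(parseCmlImageTag('docker://dvcorg/cml:0-dvc2-base1-gpu'));
console.log(parseCmlImageTag('docker://ghcr.io/iterative/cml:0-dvc2-base1'));
```

Keeping the components as strings avoids assuming that every published tag uses plain integers.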

### Arguments
#### Arguments

The `cml-runner` function accepts the following arguments:

@@ -497,10 +525,10 @@ Options:
-h Show help [boolean]
```

### Environment variables
#### Environment Variables

> :warning: You will need to
> [create a personal access token](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line)
> [create a personal access token (PAT)](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line)
> with repository read/write access and workflow privileges. In the example
> workflow, this token is stored as `PERSONAL_ACCESS_TOKEN`.

@@ -509,26 +537,25 @@ compute resources as secrets. In the above example, `AWS_ACCESS_KEY_ID` and
`AWS_SECRET_ACCESS_KEY` are required to deploy EC2 instances.

Please see our docs about
[environment variables needed to authenticate with supported cloud services](#environment-variables-for-supported-cloud-providers).
[configuring cloud storage providers](#configuring-cloud-storage-providers).

### On-premise (local) runners
#### On-premise (Local) Runners

This means using on-premise machines as self-hosted runners. The `cml-runner`
function is used to set up a local self-hosted runner. On your local machine or
on-premise GPU cluster, [install CML as a package](#install-cml-as-a-package)
and then run:
on-premise GPU cluster, [install CML as a package](#local-package) and then run:

```bash
cml-runner \
--repo $your_project_repository_url \
--token=$PERSONAL_ACCESS_TOKEN \
--token $PERSONAL_ACCESS_TOKEN \
--labels tf \
--idle-timeout 180
```

Now your machine will be listening for workflows from your project repository.

## Install CML as a package
### Local Package

In the examples above, CML is installed by the `setup-cml` action, or comes
pre-installed in a custom Docker image pulled by a CI runner. You can also
@@ -550,21 +577,19 @@ npm install -g vega-cli vega-lite
CML and Vega-Lite package installation require the NodeJS package manager
(`npm`) which ships with NodeJS. Installation instructions are below.

### Install NodeJS in GitHub
#### Install NodeJS

This is probably not necessary when using GitHub's default containers or one of
CML's Docker containers. Self-hosted runners may need to use a set up action to
install NodeJS:
- **GitHub**: This is probably not necessary when using GitHub's default
containers or one of CML's Docker containers. Self-hosted runners may need to
use a set up action to install NodeJS:

```bash
uses: actions/setup-node@v2
with:
node-version: '12'
```

### Install NodeJS in GitLab

GitLab requires direct installation of NodeJS:
- **GitLab**: Requires direct installation.

```bash
curl -sL https://deb.nodesource.com/setup_12.x | bash
@@ -580,4 +605,7 @@ These are some example projects using CML.
- [CML with DVC to pull data](https://github.com/iterative/cml_dvc_case)
- [CML with Tensorboard](https://github.com/iterative/cml_tensorboard_case)
- [CML with a small EC2 instance](https://github.com/iterative/cml-runner-base-case)
- [CML with EC2 GPU](https://github.com/iterative/cml_cloud_case)
:key:
- [CML with EC2 GPU](https://github.com/iterative/cml_cloud_case) :key:

:key: needs a [PAT](#environment-variables).
4 changes: 2 additions & 2 deletions src/cml.js
@@ -6,7 +6,7 @@ const git = require('simple-git/promise')('./');

const Gitlab = require('./drivers/gitlab');
const Github = require('./drivers/github');
const BitBucketCloud = require('./drivers/bitbucket_cloud');
const BitbucketCloud = require('./drivers/bitbucket_cloud');
const { upload, exec, watermarkUri } = require('./utils');

const {
@@ -65,7 +65,7 @@ const getDriver = (opts) => {

if (driver === GITHUB) return new Github({ repo, token });
if (driver === GITLAB) return new Gitlab({ repo, token });
if (driver === BB) return new BitBucketCloud({ repo, token });
if (driver === BB) return new BitbucketCloud({ repo, token });

throw new Error(`driver ${driver} unknown!`);
};
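
The rename above touches both the `require` binding and the constructor call, and the two have to change together. The following self-contained sketch shows the same dispatch pattern so that it can be run in isolation. The stub classes, the constant values, and the destructured parameter list are assumptions for illustration only; the real definitions live in lines this diff does not show.

```javascript
// Hypothetical stand-ins for the driver classes required from ./drivers/*.
class Github {
  constructor({ repo, token }) { this.repo = repo; this.token = token; }
}
class Gitlab {
  constructor({ repo, token }) { this.repo = repo; this.token = token; }
}
class BitbucketCloud {
  constructor({ repo, token }) { this.repo = repo; this.token = token; }
}

// Assumed values; the real constants are defined outside the hunks shown here.
const GITHUB = 'github';
const GITLAB = 'gitlab';
const BB = 'bitbucket';

// Same dispatch shape as the hunk above: one branch per backend, and an
// error for anything else. The parameter destructuring is an assumption.
const getDriver = ({ driver, repo, token }) => {
  if (driver === GITHUB) return new Github({ repo, token });
  if (driver === GITLAB) return new Gitlab({ repo, token });
  if (driver === BB) return new BitbucketCloud({ repo, token });

  throw new Error(`driver ${driver} unknown!`);
};

const example = getDriver({ driver: BB, repo: 'https://example.invalid/repo', token: 'dummy' });
console.log(example instanceof BitbucketCloud);
```

If only one of the two `BitBucketCloud` occurrences had been renamed, the Bitbucket branch would throw a `ReferenceError` the first time it ran, which is why the commit changes both lines.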