---
layout: default
nav_order: 3
parent: Resources
grand_parent: Maintainer Docs
title: GitHub Actions
---
# {{ page.title }}
{:.no_toc}
## Overview
{:.no_toc}
The RAPIDS team is migrating from Jenkins to GitHub Actions for CI/CD. This page describes the GitHub Actions implementation provided by the RAPIDS Ops team. The [official GitHub Actions documentation](https://docs.github.com/en/actions) is also a useful reference.
### Intended audience
{: .no_toc }
Operations
{: .label .label-purple}
Developers
{: .label .label-green}
## Table of contents
{: .no_toc .text-delta }
1. TOC
{:toc}
## Implementation
The RAPIDS Ops team provides GPU-enabled self-hosted runners for use with GitHub Actions to RAPIDS and other select GitHub organizations.
To ensure proper usage of these GPU-enabled CI machines, the RAPIDS Ops team has adopted a strategy known as _Marking code as trusted by pushing upstream_, which is described in [this CircleCI blog post](https://circleci.com/blog/triggering-trusted-ci-jobs-on-untrusted-forks/).
The gist of the strategy is that the source code from trusted pull requests can be copied to a prefixed branch (e.g. `pull-request/<PR_NUMBER>`) within the source repository and CI can be configured to test only those prefixed branches rather than the pull requests themselves.
Pull requests authored by members of the given GitHub organization are considered trusted and therefore are copied to a `pull-request/*` branch for testing automatically.
Pull requests from authors outside of the GitHub organization must first be reviewed by a repository member with `write` permissions (or greater) to ensure that the code changes are legitimate and benign. That reviewer must leave an `/ok to test` (or `/okay to test`) comment on the pull request before its code is copied to a `pull-request/*` branch for testing.
The `/ok to test` comment is only valid for a single commit. Subsequent commits must be re-reviewed and validated with another `/ok to test` comment.
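As a rough illustration of the copy step described above, the sketch below fetches a pull request's head commit and pushes it to the matching `pull-request/<PR_NUMBER>` branch. It is not the RAPIDS Ops automation: the function name `copy_pr_to_branch`, the remote name `upstream`, and the PR number `123` are assumptions made for this example. It relies on GitHub exposing each pull request's head commit at the ref `pull/<PR_NUMBER>/head`.

```sh
# Hypothetical helper: mirror a reviewed pull request onto a prefixed branch.
# Arguments: <remote> <pr_number>
copy_pr_to_branch() {
    remote="$1"
    pr_number="$2"
    # GitHub publishes the head commit of every pull request at pull/<N>/head.
    git fetch "$remote" "pull/${pr_number}/head" || return 1
    # Push the fetched commit to the prefixed branch that CI is configured to test.
    git push "$remote" "FETCH_HEAD:refs/heads/pull-request/${pr_number}"
}

# Example (assumed remote name and PR number):
#   copy_pr_to_branch upstream 123
```

In practice this copy happens automatically once a pull request is trusted, so the snippet is only meant to show what ends up on the `pull-request/*` branch.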
### Ignoring Pull Request Branches in `git`
One consequence of the strategy described above is that a lot of `pull-request/*` branches will be created and deleted in GitHub as pull requests are opened and closed. To avoid having these branches fetched locally, you can run the following `git config` command, where `upstream` in `remote.upstream.fetch` is the `git` remote name corresponding to the source repository:
```sh
git config \
  --global \
  --add "remote.upstream.fetch" \
  '^refs/heads/pull-request/*'
```
Note that this `git` configuration option requires `git` version `2.29` or greater to support negative refspecs ([source](https://github.blog/2020-10-19-git-2-29-released/#user-content-negative-refspecs)).
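Because negative refspecs depend on the `git` version, it can help to check the installed version before adding the configuration. The sketch below does this in plain POSIX shell. The function name `supports_negative_refspecs` is made up for this example, and the parsing assumes that `git --version` prints output of the form `git version X.Y.Z`.

```sh
# Hypothetical helper: succeed if the given git version string is 2.29 or newer.
supports_negative_refspecs() {
    version="$1"              # e.g. "2.39.2"
    major="${version%%.*}"    # text before the first dot
    rest="${version#*.}"      # text after the first dot
    minor="${rest%%.*}"       # text before the next dot
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 29 ]; }
}

# Only run the check when git is actually installed.
if command -v git >/dev/null 2>&1; then
    # The third field of "git version X.Y.Z" is the version number.
    installed="$(git --version | awk '{print $3}')"
    if supports_negative_refspecs "$installed"; then
        echo "git ${installed} supports negative refspecs"
    else
        echo "git ${installed} is too old; upgrade to 2.29 or newer"
    fi
fi
```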
### Downloading CI Artifacts
For NVIDIA employees with VPN access, artifacts from both pull requests and branch builds can be accessed at [https://downloads.rapids.ai/](https://downloads.rapids.ai/).
Every C++ and Python build job ends with a link to the build artifacts for that particular workflow run.
![](/assets/images/downloads.png)
### Skipping CI for Commits
See the GitHub Actions documentation page below on how to prevent GitHub Actions from running on certain commits. This is useful for preventing GitHub Actions from running on pull requests that are not fully complete. This also helps preserve the finite GPU resources provided by the RAPIDS Ops team.
With GitHub Actions, it is not possible to skip CI for all commits of a pull request at once. Skipping must be requested on each commit individually.
**Link**: [https://docs.github.com/en/actions/managing-workflow-runs/skipping-workflow-runs](https://docs.github.com/en/actions/managing-workflow-runs/skipping-workflow-runs)
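For example, the documentation linked above lists `[skip ci]`, `[ci skip]`, `[no ci]`, `[skip actions]`, and `[actions skip]` as commit message keywords that skip `push` and `pull_request` workflows. The sketch below is a hypothetical local helper (`message_skips_ci` is not part of `git` or GitHub) that checks whether a message contains one of those keywords, followed by a commented example commit.

```sh
# Hypothetical helper: succeed if a commit message contains one of the
# skip keywords listed in the GitHub documentation.
message_skips_ci() {
    case "$1" in
        *"[skip ci]"*|*"[ci skip]"*|*"[no ci]"*|*"[skip actions]"*|*"[actions skip]"*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

if message_skips_ci "Fix typo in README [skip ci]"; then
    echo "CI would be skipped for this commit"
fi

# Example commit that skips CI (the message text is illustrative):
#   git commit -m "Fix typo in README [skip ci]"
```

The keyword only applies to the commit that carries it, so a later commit without the keyword triggers CI as usual.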
### Rerunning Failed GitHub Actions
See the GitHub Actions documentation page below on how to rerun failed workflows. In addition to rerunning an entire workflow, GitHub Actions also provides the ability to rerun only the failed jobs in a workflow.
At this time there are no alternative ways to rerun tests with GitHub Actions beyond what is described in the documentation (e.g. there is no `rerun tests` comment for GitHub Actions).
**Link**: [https://docs.github.com/en/actions/managing-workflow-runs/re-running-workflows-and-jobs](https://docs.github.com/en/actions/managing-workflow-runs/re-running-workflows-and-jobs)
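The linked page also covers rerunning from the command line with the GitHub CLI. The following is a minimal sketch, assuming `gh` is installed and authenticated; the wrapper name `rerun_failed_jobs` and the run ID `1234567890` are placeholders chosen for illustration.

```sh
# Hypothetical wrapper: rerun only the failed jobs of a workflow run.
# Argument: <run_id>, the numeric ID that appears in the workflow run's URL.
rerun_failed_jobs() {
    gh run rerun "$1" --failed
}

# Example (placeholder run ID):
#   rerun_failed_jobs 1234567890
```

Dropping `--failed` reruns the entire workflow run instead of only its failed jobs.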
## Self-Hosted Runners
The RAPIDS Ops team provides a set of self-hosted runners that can be used in GitHub Actions workflows throughout supported organizations. The tables below list the available labels and their specifications.
### CPU Labels
The CPU-labeled runners are backed by various EC2 instances and do not have any GPUs installed.
| Label | EC2 Machine Type |
| ------------------- | --------------------------- |
| `linux-amd64-cpu4` | `m5d.xlarge` <sub>1</sub> |
| `linux-amd64-cpu8` | `m5d.2xlarge` <sub>1</sub> |
| `linux-amd64-cpu16` | `m5d.4xlarge` <sub>1</sub> |
| `linux-arm64-cpu4` | `m6gd.xlarge` <sub>2</sub> |
| `linux-arm64-cpu8` | `m6gd.2xlarge` <sub>2</sub> |
| `linux-arm64-cpu16` | `m6gd.4xlarge` <sub>2</sub> |
Additional specifications:
1. [https://aws.amazon.com/ec2/instance-types/m5/](https://aws.amazon.com/ec2/instance-types/m5/)
2. [https://aws.amazon.com/ec2/instance-types/m6g/](https://aws.amazon.com/ec2/instance-types/m6g/)
The CPU label names consist of the following components:
```text
linux-amd64-cpu4
  ^     ^    ^ ^
  |     |    | |
  |     |    | CPU Core Count
  |     |    CPU Designator
  |     Architecture
  Operating System
```
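The naming scheme above can be taken apart with shell parameter expansion, for instance to read the core count from a label inside a script. This is only a sketch; `describe_cpu_label` is a hypothetical helper, not something provided by the runners.

```sh
# Hypothetical helper: split a CPU runner label into its components.
describe_cpu_label() {
    label="$1"              # e.g. "linux-amd64-cpu4"
    os="${label%%-*}"       # text before the first dash
    rest="${label#*-}"      # text after the first dash
    arch="${rest%%-*}"      # text before the next dash
    cpu="${rest#*-}"        # remaining text, e.g. "cpu4"
    cores="${cpu#cpu}"      # strip the "cpu" designator to leave the core count
    echo "os=${os} arch=${arch} cores=${cores}"
}

describe_cpu_label linux-amd64-cpu4
# prints: os=linux arch=amd64 cores=4
```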
### GPU Labels
{% assign earliest_driver_version = "470" %}
{% assign latest_driver_version = "525" %}
The GPU-labeled runners are backed by lab machines and have the GPUs specified in the table below installed.
**IMPORTANT**: GPU jobs have two requirements. If these requirements aren't met, the GitHub Actions job will fail. See the _Usage_ section below for an example.
1. They must run in a container (e.g. `nvidia/cuda:11.8.0-base-ubuntu22.04`)
2. They must set the {% raw %}`NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}`{% endraw %} container environment variable
Due to our limited GPU capacity and the overhead associated with manually rotating self-hosted runner labels when GPU drivers are updated, there are no driver-specific self-hosted runner labels (e.g. `linux-amd64-gpu-t4-525-1`).
Instead, the driver-version designators `earliest` and `latest` are used. Each designator maps to a GPU driver version that RAPIDS uses for testing.
The table below is kept up to date with the driver version that each designator currently maps to.
Supported organizations will be notified whenever these versions are scheduled to be updated.
{% include gpu-labels-table.html %}
The GPU label names consist of the following components:
```text
linux-amd64-gpu-t4-latest-1
  ^     ^    ^  ^    ^    ^
  |     |    |  |    |    |
  |     |    |  |    |    Number of GPUs Available
  |     |    |  |    GPU Driver Version
  |     |    |  GPU Type
  |     |    GPU Designator
  |     Architecture
  Operating System
```
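As a sketch of how this scheme could be checked in a script, the hypothetical helper below (`parse_gpu_label` is an illustrative name) splits a GPU label on dashes and rejects labels whose driver component is not `earliest` or `latest`, since driver-specific labels are not offered.

```sh
# Hypothetical helper: split a GPU runner label and check its driver designator.
parse_gpu_label() {
    # Split the label on dashes into its six components.
    IFS='-' read -r os arch designator gpu_type driver gpu_count <<EOF
$1
EOF
    case "$driver" in
        earliest|latest) ;;
        *)
            echo "unsupported driver designator: ${driver}" >&2
            return 1 ;;
    esac
    echo "os=${os} arch=${arch} gpu=${gpu_type} driver=${driver} count=${gpu_count}"
}

parse_gpu_label linux-amd64-gpu-t4-latest-1
# prints: os=linux arch=amd64 gpu=t4 driver=latest count=1
```

A driver-specific label such as `linux-amd64-gpu-t4-525-1` makes the helper print an error to standard error and return a non-zero status.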
### Usage
The code snippet below shows how the labels above may be used in a GitHub Actions workflow.
```yaml
name: Test Self Hosted Runners
on: push
jobs:
  job1_cpu:
    runs-on: linux-amd64-cpu8
    steps:
      - name: hello
        run: echo "hello"
  job2_gpu:
    runs-on: linux-amd64-gpu-v100-latest-1
    container: # GPU jobs must run in a container
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      env:
        NVIDIA_VISIBLE_DEVICES: {% raw %}${{ env.NVIDIA_VISIBLE_DEVICES }}{% endraw %} # GPU jobs must set this container env variable
    steps:
      - name: hello
        run: |
          echo "hello"
          nvidia-smi
```
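Since a GPU job fails when `NVIDIA_VISIBLE_DEVICES` is not passed into the container, a `run` step could check for the variable first so that the failure is easier to diagnose. The helper below is a hypothetical sketch (`check_gpu_env` is not provided by the runners) and makes no assumption about the format of the variable's value.

```sh
# Hypothetical helper: fail early with a readable message when the
# NVIDIA_VISIBLE_DEVICES container environment variable is missing.
check_gpu_env() {
    if [ -z "${NVIDIA_VISIBLE_DEVICES:-}" ]; then
        echo "NVIDIA_VISIBLE_DEVICES is not set; add it to the container env" >&2
        return 1
    fi
    echo "GPUs visible to this container: ${NVIDIA_VISIBLE_DEVICES}"
}

# Example use inside a step, before any GPU work:
#   check_gpu_env && nvidia-smi
```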
For additional details on self-hosted runner usage, see the official GitHub Actions documentation page: [https://docs.github.com/en/actions/hosting-your-own-runners/using-self-hosted-runners-in-a-workflow](https://docs.github.com/en/actions/hosting-your-own-runners/using-self-hosted-runners-in-a-workflow)