Add some quick notes on how to get GPU Operator working #10067

Merged 1 commit on Oct 18, 2020
80 changes: 78 additions & 2 deletions docs/gpu.md
@@ -1,5 +1,81 @@
# GPU Support

You can use [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) to install NVIDIA device drivers and tools in your cluster.

## Creating a cluster with GPU nodes

Due to the cost of GPU instances, you want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the [getting started documentation](https://kops.sigs.k8s.io/getting_started/aws/).
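
For example, creating a minimal regular cluster might look like the following sketch. The cluster name and state store here are placeholders; adjust them to your environment:

```sh
# Minimal sketch; gpu.example.com and the S3 bucket are placeholders.
export KOPS_STATE_STORE=s3://your-kops-state-store
kops create cluster \
  --name gpu.example.com \
  --zones eu-central-1c \
  --yes
```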

Once the cluster is running, add an instance group with GPUs:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
  taints:
  - nvidia.com/gpu=present:NoSchedule
```
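
Assuming the manifest above is saved as `gpu-nodes.yaml` (a hypothetical filename), one way to create the instance group and roll out the change is:

```sh
# Create the instance group from the manifest, then apply the change.
# Depending on your environment you may also need --name and --state flags.
kops create -f gpu-nodes.yaml
kops update cluster --yes
```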

Note the taint used above. It prevents pods from being scheduled on the GPU nodes unless they explicitly tolerate it; the GPU Operator resources tolerate this taint by default.
Also note the node label we set. It is used below to ensure the GPU Operator resources run on the GPU nodes.

## Install GPU Operator
GPU Operator is installed using `helm`. See the [general install instructions for GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator).

In order to match the _kops_ environment, create a `values.yaml` file with the following content:

```yaml
operator:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

driver:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

toolkit:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

devicePlugin:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

dcgmExporter:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

gfd:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

node-feature-discovery:
  worker:
    nodeSelector:
      kops.k8s.io/instancegroup: gpu-nodes
```
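
With `values.yaml` in place, the install typically looks something like the following. The chart repository and name are taken from the NVIDIA instructions linked above; verify against the current docs, as they may have changed:

```sh
# Add the NVIDIA helm repository and install the chart with our values.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name nvidia/gpu-operator -f values.yaml
```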

Once you have installed the _helm chart_, you should see the GPU Operator resources being spawned in the `gpu-operator-resources` namespace.
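
To verify, list the pods in that namespace with plain `kubectl`:

```sh
kubectl get pods -n gpu-operator-resources
```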

You should now be able to schedule workloads on the GPU nodes by adding the following properties to the pod spec:
```yaml
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
```
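
For example, a complete test pod that requests one GPU and runs `nvidia-smi` might look like this. The CUDA image tag is an assumption; pick one compatible with the installed driver:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:11.0-base  # assumed tag; match your driver version
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # request one GPU via the device plugin
```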