Monitor Nvidia GPU utilization and ship metrics to Datadog.
- Import the dashboard into Datadog
- Update your `DD_API_KEY` in deploy.yaml
- `kubectl apply -f deploy.yaml`
Alternative install method (Helm Chart) coming soon!
You can also run the watchdog directly with `python3 gpu_watchdog.py` or `docker run -it gpu-watchdog:latest`.
Note that this application requires:

- access to the host's PID space (set via the `hostPID` flag)
- permission to list pods in all namespaces
The following configuration options are available as env vars:
- `LOG_LEVEL`: change the logging verbosity; defaults to `INFO`
- `DD_SITE`: Datadog site to use for metrics submission; defaults to `datadoghq.com`, but can be set to `datadoghq.eu`
- `DD_API_KEY`: API key for Datadog
- `SECONDS_BETWEEN_SAMPLES`: how often to query `nvidia-smi`; defaults to `"5"`
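
For illustration, a minimal sketch of reading these variables with their documented defaults (how the real script parses them is an assumption):

```python
import os

# Defaults mirror the documented values; reading them this way is an
# illustrative assumption, not necessarily how gpu_watchdog.py does it.
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
DD_API_KEY = os.environ.get("DD_API_KEY")  # required; no documented default
SECONDS_BETWEEN_SAMPLES = int(os.environ.get("SECONDS_BETWEEN_SAMPLES", "5"))
```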
We query `nvidia-smi --query-compute-apps=pid,used_memory --format=csv`, and use the returned PIDs to extract the container ID from the host's `/proc/<PID>/cgroup` file(s) (inspired by pid2pod).
It works by looking up the target process's cgroup metadata in `/proc/$PID/cgroup`. This metadata contains the names of each cgroup assigned to the process. For Docker containers created by the docker CLI or by the kubelet, these cgroup names contain the Docker container ID. We can map this container ID to a Kubernetes pod by doing a lookup against the local kubelet API.
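
For illustration, a rough sketch of that lookup chain, assuming a Docker-style cgroup layout and the kubelet's read-only `/pods` endpoint; the helper names are hypothetical and the real code may differ:

```python
import csv
import json
import re
import subprocess
import urllib.request


def gpu_compute_apps():
    """Yield (pid, used_memory_mib) pairs reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = list(csv.reader(out.strip().splitlines()))
    for row in rows[1:]:  # skip the "pid, used_gpu_memory [MiB]" header row
        try:
            pid = int(row[0].strip())
            used_mib = int(row[1].strip().split()[0])  # value looks like "1234 MiB"
        except (ValueError, IndexError):
            continue  # e.g. "[Not Supported]" on some GPUs
        yield pid, used_mib


def container_id_for_pid(pid):
    """Pull a 64-hex-character container ID out of /proc/<pid>/cgroup, if any."""
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            match = re.search(r"[0-9a-f]{64}", line)
            if match:
                return match.group(0)
    return None


def pod_for_container_id(container_id):
    """Look the container ID up in the local kubelet's pod list.

    The read-only kubelet port (10255) is an assumption; clusters that
    disable it need the authenticated port (10250) instead.
    """
    with urllib.request.urlopen("http://localhost:10255/pods") as resp:
        pods = json.load(resp)["items"]
    for pod in pods:
        for status in pod.get("status", {}).get("containerStatuses", []):
            # status["containerID"] looks like "docker://<64-hex-id>"
            if container_id and container_id in status.get("containerID", ""):
                return pod
    return None
```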
We enrich the `used_memory` stat with metadata from the associated Pod, and publish to Datadog as `kubernetes.gpu.usage`.
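
A minimal sketch of that last step using the datadogpy client; the tag names and the use of `api.Metric.send` are assumptions for illustration:

```python
import os
import time

from datadog import api, initialize

# Credentials and site come from the env vars documented above.
initialize(
    api_key=os.environ["DD_API_KEY"],
    api_host="https://api." + os.environ.get("DD_SITE", "datadoghq.com"),
)


def publish_gpu_usage(used_memory_mib, pod):
    """Send one sample, tagged with metadata from the associated Pod.

    The exact tag set is an illustrative assumption.
    """
    meta = pod["metadata"]
    tags = [
        "kube_namespace:" + meta["namespace"],
        "pod_name:" + meta["name"],
    ]
    api.Metric.send(
        metric="kubernetes.gpu.usage",
        points=[(time.time(), used_memory_mib)],
        tags=tags,
        type="gauge",
    )
```

Tying the sketches together, one loop iteration would call the helpers above for each PID reported by `nvidia-smi`, then sleep for `SECONDS_BETWEEN_SAMPLES` before sampling again.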