🐕 gpu-watchdog

Monitor Nvidia GPU utilization and ship metrics to Datadog.

Example Datadog Dashboard

installation

  1. Import the dashboard into Datadog
  2. Update your DD_API_KEY in deploy.yaml
  3. kubectl apply -f deploy.yaml

Alternative install method (Helm Chart) coming soon!

usage

python3 gpu_watchdog.py or docker run -it gpu-watchdog:latest

Note that this application requires nvidia-smi on the host, read access to the host's /proc filesystem, and access to the local kubelet API for pod metadata.

configuration

The following configuration options are available as env vars (see the example snippet after the list):

  • LOG_LEVEL: change the logging verbosity; defaults to INFO
  • DD_SITE: Datadog site to use for metrics submission; defaults to datadoghq.com, but can be set to datadoghq.eu
  • DD_API_KEY: API Key for Datadog
  • SECONDS_BETWEEN_SAMPLES: how often to query nvidia-smi; defaults to "5"
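A minimal sketch of reading these variables in Python; the defaults match the list above, but the variable names inside the script are illustrative, not necessarily the ones gpu_watchdog.py uses:

```python
import os

# Sketch only: names below are illustrative, defaults taken from the list above.
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
DD_API_KEY = os.environ["DD_API_KEY"]  # required; no default
SECONDS_BETWEEN_SAMPLES = int(os.environ.get("SECONDS_BETWEEN_SAMPLES", "5"))
```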

how it works

We query nvidia-smi --query-compute-apps=pid,used_memory --format=csv and use the returned PIDs to extract the corresponding container IDs from the host's /proc/<PID>/cgroup file(s) (inspired by pid2pod).
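A sketch of that query in Python; the `noheader,nounits` format flags and the helper name are additions for easier parsing, not necessarily what gpu_watchdog.py does:

```python
import csv
import io
import subprocess

def query_gpu_processes():
    """Return a list of (pid, used_memory) tuples reported by nvidia-smi.

    Assumes nvidia-smi is on the PATH; noheader,nounits strips the CSV header
    and the "MiB" suffix so the values can be parsed as integers.
    """
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [
        (int(pid), int(used_memory))
        for pid, used_memory in csv.reader(io.StringIO(out))
    ]
```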

It works by looking up the target process's cgroup metadata in /proc/$PID/cgroup. This metadata contains the name of each cgroup assigned to the process. For Docker containers created with the docker CLI or by the kubelet, these cgroup names contain the Docker container ID. We can map that container ID to a Kubernetes pod by doing a lookup against the local kubelet API.
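An illustrative sketch of that lookup; the 64-hex-digit container ID pattern and the kubelet read-only port (10255) are assumptions, since cgroup layouts and kubelet access vary between clusters:

```python
import re
from typing import Optional

import requests

# Docker container IDs are 64 hex characters embedded in the cgroup names.
CONTAINER_ID_RE = re.compile(r"[0-9a-f]{64}")

def container_id_for_pid(pid: int) -> Optional[str]:
    """Pull a container ID out of /proc/<pid>/cgroup, if one is present."""
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            match = CONTAINER_ID_RE.search(line)
            if match:
                return match.group(0)
    return None

def pod_for_container_id(container_id: str, kubelet_url: str = "http://127.0.0.1:10255"):
    """Map a container ID to its pod via the kubelet's /pods endpoint."""
    pods = requests.get(f"{kubelet_url}/pods", timeout=5).json().get("items", [])
    for pod in pods:
        for status in pod.get("status", {}).get("containerStatuses", []):
            # containerID looks like "docker://<64-hex-digit-id>"
            if container_id in status.get("containerID", ""):
                return pod
    return None
```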

We enrich the used_memory stat with metadata from the associated Pod, and publish to Datadog as kubernetes.gpu.usage.
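A sketch of that submission against Datadog's v1 series API; the tag names built from pod metadata are illustrative, and gpu-watchdog may attach different or additional tags:

```python
import os
import time

import requests

def publish_gpu_usage(used_memory_mib: int, pod: dict) -> None:
    """Submit the enriched used_memory value as kubernetes.gpu.usage."""
    meta = pod["metadata"]
    payload = {
        "series": [
            {
                "metric": "kubernetes.gpu.usage",
                "type": "gauge",
                "points": [[int(time.time()), used_memory_mib]],
                "tags": [
                    f"pod_name:{meta['name']}",
                    f"kube_namespace:{meta['namespace']}",
                ],
            }
        ]
    }
    site = os.environ.get("DD_SITE", "datadoghq.com")
    resp = requests.post(
        f"https://api.{site}/api/v1/series",
        headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
```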
