Monitor Nvidia GPU utilization and ship metrics to Datadog.
- Import the dashboard into Datadog
- Update your `DD_API_KEY` in deploy.yaml
- `kubectl apply -f deploy.yaml`
Alternative install method (Helm Chart) coming soon!
You can also run the watchdog directly with `python3 gpu_watchdog.py` or `docker run -it gpu-watchdog:latest`.
Note that this application requires:

- access to the host's PID space (set via the `hostPID` flag)
- permission to list pods in all namespaces
The following configuration options are available as env vars:
- `LOG_LEVEL`: change the logging verbosity; defaults to `INFO`
- `DD_SITE`: Datadog site to use for metrics submission; defaults to `datadoghq.com`, but can be set to `datadoghq.eu`
- `DD_API_KEY`: API key for Datadog
- `SECONDS_BETWEEN_SAMPLES`: how often to query `nvidia-smi`; defaults to `"5"`
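
For illustration, a minimal sketch of reading these variables with their documented defaults (how the real script parses them is an assumption):

```python
import os

# Defaults mirror the documented values; reading them this way is an
# illustrative assumption, not necessarily how gpu_watchdog.py does it.
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
DD_API_KEY = os.environ.get("DD_API_KEY")  # required; no documented default
SECONDS_BETWEEN_SAMPLES = int(os.environ.get("SECONDS_BETWEEN_SAMPLES", "5"))
```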
We query `nvidia-smi --query-compute-apps=pid,used_memory --format=csv`, and use the returned PIDs to extract the container ID from the host's `/proc/<PID>/cgroup` file(s) (inspired by pid2pod).
It works by looking up the target process's cgroup metadata in `/proc/$PID/cgroup`. This metadata contains the names of each cgroup assigned to the process. For Docker containers created by the docker CLI or by the kubelet, these cgroup names contain the Docker container ID. We can map this container ID to a Kubernetes pod by doing a lookup against the local kubelet API.
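
For illustration, a rough sketch of that lookup chain, assuming a Docker-style cgroup layout and the kubelet's read-only `/pods` endpoint; the helper names are hypothetical and the real code may differ:

```python
import csv
import json
import re
import subprocess
import urllib.request


def gpu_compute_apps():
    """Yield (pid, used_memory_mib) pairs reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = list(csv.reader(out.strip().splitlines()))
    for row in rows[1:]:  # skip the "pid, used_gpu_memory [MiB]" header row
        try:
            pid = int(row[0].strip())
            used_mib = int(row[1].strip().split()[0])  # value looks like "1234 MiB"
        except (ValueError, IndexError):
            continue  # e.g. "[Not Supported]" on some GPUs
        yield pid, used_mib


def container_id_for_pid(pid):
    """Pull a 64-hex-character container ID out of /proc/<pid>/cgroup, if any."""
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            match = re.search(r"[0-9a-f]{64}", line)
            if match:
                return match.group(0)
    return None


def pod_for_container_id(container_id):
    """Look the container ID up in the local kubelet's pod list.

    The read-only kubelet port (10255) is an assumption; clusters that
    disable it need the authenticated port (10250) instead.
    """
    with urllib.request.urlopen("http://localhost:10255/pods") as resp:
        pods = json.load(resp)["items"]
    for pod in pods:
        for status in pod.get("status", {}).get("containerStatuses", []):
            # status["containerID"] looks like "docker://<64-hex-id>"
            if container_id and container_id in status.get("containerID", ""):
                return pod
    return None
```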
We enrich the `used_memory` stat with metadata from the associated Pod, and publish to Datadog as `kubernetes.gpu.usage`.
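
A minimal sketch of that last step using the datadogpy client; the tag names and the use of `api.Metric.send` are assumptions for illustration:

```python
import os
import time

from datadog import api, initialize

# Credentials and site come from the env vars documented above.
initialize(
    api_key=os.environ["DD_API_KEY"],
    api_host="https://api." + os.environ.get("DD_SITE", "datadoghq.com"),
)


def publish_gpu_usage(used_memory_mib, pod):
    """Send one sample, tagged with metadata from the associated Pod.

    The exact tag set is an illustrative assumption.
    """
    meta = pod["metadata"]
    tags = [
        "kube_namespace:" + meta["namespace"],
        "pod_name:" + meta["name"],
    ]
    api.Metric.send(
        metric="kubernetes.gpu.usage",
        points=[(time.time(), used_memory_mib)],
        tags=tags,
        type="gauge",
    )
```

Tying the sketches together, one loop iteration would call the helpers above for each PID reported by `nvidia-smi`, then sleep for `SECONDS_BETWEEN_SAMPLES` before sampling again.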