enable DCGM_FI_DEV_GPU_UTIL && update docker file #8

Closed
wants to merge 1 commit into from

zhu733756

Hello, friends.
Thanks to this repo, we can install DCGM exporter with Helm and scrape its metrics with Prometheus.
However, one thing puzzles me.
Why is the DCGM_FI_DEV_GPU_UTIL metric disabled by default?
I hope someone can answer this question.
Thank you in advance.

BTW, this PR enables the DCGM_FI_DEV_GPU_UTIL metric and fixes a build error in the Dockerfile.
The error looks like this:

E: Failed to fetch https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/751939d95516afc289908a19e447f0acc1506367f72ed356431a2b1a469cc8ca  404  Not Found [IP: 45.43.32.211 443]
E: Some index files failed to download. They have been ignored, or old ones used instead.

Thank you for your reply.

@nikkon-dev
Collaborator

@zhu733756,

Thank you for the PR.
The explanation for removing DCGM_FI_DEV_GPU_UTIL from the metrics enabled by default can be found here: NVIDIA/gpu-monitoring-tools#143 (comment)
In short, that metric is deprecated because its value is misleading and expensive to acquire. The DCGM_FI_PROF_* family of metrics is meant to replace the deprecated ones.

WBR,
Nik
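
For readers who still want to toggle a metric locally, here is a minimal Python sketch. It assumes the exporter reads a counters CSV with one `name, type, help` entry per line and treats lines starting with `#` as disabled; that layout, the sample help strings, and the helper names `set_metric_enabled` and `active_metrics` are illustrative assumptions, not something confirmed in this thread.

```python
def set_metric_enabled(csv_text: str, metric: str, enabled: bool) -> str:
    """Comment or uncomment the line that defines `metric`.

    Assumes a counters file with 'name, type, help' entries and '#' comments.
    """
    out = []
    for line in csv_text.splitlines():
        # Drop any leading '#' markers to look at the underlying entry.
        body = line.lstrip().lstrip("#").lstrip()
        name = body.split(",", 1)[0].strip()
        if name == metric and "," in body:
            line = body if enabled else "# " + body
        out.append(line)
    return "\n".join(out)


def active_metrics(csv_text: str) -> list:
    """Return the names of all entries that are not commented out."""
    names = []
    for line in csv_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            names.append(stripped.split(",", 1)[0].strip())
    return names


# Hypothetical sample content; the help strings are placeholders.
SAMPLE = """\
# Utilization
# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
"""

enabled = set_metric_enabled(SAMPLE, "DCGM_FI_DEV_GPU_UTIL", True)
print(active_metrics(enabled))
```

Given the maintainer's note above, keeping the DCGM_FI_PROF_* entries active and leaving DCGM_FI_DEV_GPU_UTIL commented out is the safer default; the helper only makes the local override explicit.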

@zhu733756
Author

zhu733756 commented Aug 27, 2021

Great, thanks. @nikkon-dev

@zhu733756 zhu733756 closed this Aug 27, 2021