How to monitor GPU metrics on container level #2095
/cc
Hi @WanLinghao, what do you mean by /cc? Best Regards!
@Cherishty hello, I am interested in both GPUs and cAdvisor, and I want to know about any updates on this issue. I will share anything I find out about a solution ^_^
cc @mindprince
https://github.com/google/cadvisor/blob/master/docs/running.md#hardware-accelerator-monitoring has all the details. You need to make sure that however you are running cAdvisor, it satisfies the two conditions mentioned there: access to the NVML library and access to GPU devices. If you are running cAdvisor embedded into the kubelet, then the kubelet should have access to the NVML library (i.e. its LD_LIBRARY_PATH should contain the location where NVML is present). Similarly, it should have access to GPU devices. If you are running cAdvisor/kubelet inside a container itself, things are complicated, but the two requirements are the same. The link above explains how to satisfy these two requirements when running cAdvisor inside a container. It's hard to debug individual cases without access to the environment.
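For illustration, a minimal DaemonSet sketch that satisfies both requirements might look like the following. The host library path (/usr/lib64/nvidia, as used elsewhere in this thread), the image tag, and the blunt `privileged: true` approach to device access are assumptions to adapt to your environment, not something prescribed by the linked docs.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      containers:
      - name: cadvisor
        image: google/cadvisor:latest
        env:
        - name: LD_LIBRARY_PATH        # requirement 1: cAdvisor must be able to find NVML
          value: /usr/lib64/nvidia
        securityContext:
          privileged: true             # requirement 2: access to /dev/nvidiactl, /dev/nvidia0, ...
        volumeMounts:
        - name: nvidia-libs
          mountPath: /usr/lib64/nvidia
          readOnly: true
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
      volumes:
      - name: nvidia-libs
        hostPath:
          path: /usr/lib64/nvidia
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /var/lib/docker
```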
I got a container running using NVML a couple weeks ago: https://github.com/dashpole/example-gpu-monitor/blob/master/deploy/kubernetes/daemonset.yaml. It probably isn't a great example in terms of security, as it just makes the pod privileged, but it is somewhere to start.
@mindprince I really appreciate your attention and help! Yes, for the embedded cAdvisor I have set the right LD_LIBRARY_PATH. I am not sure what you mean by "access to GPU devices", but it should have it, since TensorFlow GPU training runs successfully on my GPU node. Additionally, could you confirm which metrics represent GPU usage? Thanks!!!
@Cherishty just exec into the pod, and ...
@dashpole Thanks for your guidance.
As I mentioned before, I am using ...
Please do not use the example GPU monitor. I just shared that as an example of how to get NVML to work from inside a container. I am adding this patch for the cAdvisor daemonset to show how to get cAdvisor working.
It shouldn't matter how your container is consuming GPUs. cAdvisor interacts directly with the cgroup tree.
@dashpole I cannot thank you enough for your patient clarification and guidance!
However, the pod hits CrashLoopBackOff, and I collected the error logs below:
and
Seems ...
@mindprince Sorry to disturb you, but I still want to clarify how the cAdvisor embedded in the kubelet can be given access so that it exposes GPU metrics. Best Regards!
hmmm. I confirmed that the ...
@Cherishty it looks like NVML loaded successfully: ...
By "access to GPU devices", I meant that the process running cAdvisor should have access to the GPU devices in /dev/ (/dev/nvidiactl, /dev/nvidia0, and so on). You should see a log line like "NVML initialized, number of nvidia devices: N" from the process running cAdvisor.
@dashpole After fixing the issue you mentioned, the only remaining error is that the prometheus-to-sd container said:
So I removed it and re-created cAdvisor, and now it can export the GPU metrics. Again, thanks a lot for your contribution and kindness 💯
@mindprince Sorry to say that I am not quite familiar with how cAdvisor is embedded in k8s, so I assume cAdvisor runs in ...
Glad you got cAdvisor working. I don't see any unusual errors in the kubelet log you provided. It doesn't look like it is actually the full log from kubelet startup (it usually starts with the flags provided to the kubelet). Try ...
Sorry, but it returns nothing :(
Can you provide the full kubelet log from a run after you restart the kubelet?
Sure, here is my environment:
I have two VMs with the same configuration and the same behavior. Below is the log provided by ...
Additionally, below is my configuration for ...
Best Regards!
Ah, it looks like the log line in question is emitted at ...
That's right.
I have set $LD_LIBRARY_PATH=/usr/lib64/nvidia, which is where the NVML library is located. Any clues or suggestions?
BTW, for the cAdvisor running in a container, which has been shown to work, I can only find ...; should any other metrics reflecting GPU utilization be provided?
@dashpole: I am using a kOps cluster in an AWS cloud environment, with 1 master and 2 worker nodes (p2.xlarge and t2.medium). I am trying to collect GPU metrics using cAdvisor but am unable to gather them. Below is the relevant excerpt of my cadvisor.yaml:

```yaml
---
- name: kubelet-podresources
  mountPath: /var/lib/kubelet/
- --device-cgroup-rule 'c 195:* mrw'
- name: kubelet-podresoures
  hostPath:
    path: /var/lib/kubelet/
---
```

It would be great if you could please check and let me know what I missed here, so I can resolve the issue at the earliest. Thanks a ton in advance. Regards,
@reachmeselva at first glance it looks like everything you need is present. Can you check the prometheus API at ...
Yeah, you can see all of the series pushed to influxDB. If the prometheus endpoint returns the correct metrics, then all that needs to be done is add accelerator metrics there.
@dashpole: Thanks for your reply. It would be great if you could please provide the GPU metric names and a sample link showing how to add metrics to influxDB. Thanks in advance.
In storage/influxdb/influxdb.go#L224, you add the influxDB "points" from the v1 container stats. You just need to take the stats in the Accelerators portion of container stats (info/v1/container.go#L583) and change them to the influxDB format.
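For illustration, here is a rough, self-contained sketch of that conversion. The `point` struct, the "accelerator_stats" measurement name, and the field names are stand-ins (the real influxdb.go driver has its own point type and helpers), so treat this as a shape to adapt rather than a drop-in patch.

```go
package main

import (
	"fmt"
	"time"

	info "github.com/google/cadvisor/info/v1"
)

// point is a stand-in for whatever point type the influxDB storage driver
// already uses; only the shape of the data matters for this sketch.
type point struct {
	measurement string
	tags        map[string]string
	fields      map[string]interface{}
	timestamp   time.Time
}

// acceleratorStatsToPoints takes the Accelerators portion of v1 container
// stats and emits one point per accelerator, tagged by make/model/id.
func acceleratorStatsToPoints(stats *info.ContainerStats) []point {
	points := make([]point, 0, len(stats.Accelerators))
	for _, acc := range stats.Accelerators {
		points = append(points, point{
			measurement: "accelerator_stats", // hypothetical series name
			tags: map[string]string{
				"make":  acc.Make,
				"model": acc.Model,
				"id":    acc.ID,
			},
			fields: map[string]interface{}{
				"memory_total": acc.MemoryTotal,
				"memory_used":  acc.MemoryUsed,
				"duty_cycle":   acc.DutyCycle,
			},
			timestamp: stats.Timestamp,
		})
	}
	return points
}

func main() {
	// Fabricated sample stats, just to show the mapping.
	stats := &info.ContainerStats{
		Timestamp: time.Now(),
		Accelerators: []info.AcceleratorStats{
			{Make: "nvidia", Model: "Tesla K80", ID: "GPU-0", MemoryTotal: 12 << 30, MemoryUsed: 1 << 30, DutyCycle: 37},
		},
	}
	for _, p := range acceleratorStatsToPoints(stats) {
		fmt.Printf("%s tags=%v fields=%v\n", p.measurement, p.tags, p.fields)
	}
}
```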
@dashpole hi
@GuiBin2013 cAdvisor monitors GPUs in two steps: (1) it discovers which GPU devices each container has access to from the container's devices cgroup, and (2) it queries NVML for metrics for those GPUs.
Step (2) can only give us metrics for the entire GPU, as NVIDIA GPUs aren't natively supported in Linux cgroups. So if you are sharing a GPU between two containers, both containers will report identical metrics for it. I am closing this issue, as the original question has been answered. For further questions, please open a new issue.
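To make step (1) concrete, here is a rough, self-contained sketch of discovering which NVIDIA devices a container's devices cgroup allows. It assumes cgroup v1 and that NVIDIA character devices use major number 195 (as in the --device-cgroup-rule quoted earlier in this thread); the real logic lives in cAdvisor's accelerators code and differs in detail.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const nvidiaMajor = "195" // major number for NVIDIA character devices

// nvidiaMinorsForContainer parses a devices.list file from a container's
// devices cgroup (cgroup v1) and returns the minor numbers of the NVIDIA
// devices that container may access. The second return value is true when a
// wildcard entry grants access to every device.
func nvidiaMinorsForContainer(devicesListPath string) (minors []int, all bool, err error) {
	data, err := os.ReadFile(devicesListPath)
	if err != nil {
		return nil, false, err
	}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fields := strings.Fields(line) // e.g. "c 195:0 rwm" or "a *:* rwm"
		if len(fields) != 3 {
			continue
		}
		if fields[0] == "a" {
			return nil, true, nil // all devices allowed (e.g. privileged container)
		}
		if fields[0] != "c" {
			continue
		}
		majMin := strings.SplitN(fields[1], ":", 2)
		if len(majMin) != 2 || majMin[0] != nvidiaMajor {
			continue
		}
		if majMin[1] == "*" {
			return nil, true, nil // every NVIDIA GPU on the node
		}
		minor, convErr := strconv.Atoi(majMin[1])
		if convErr != nil {
			continue
		}
		minors = append(minors, minor)
	}
	return minors, false, nil
}

func main() {
	// Hypothetical cgroup path; the real path depends on the container runtime.
	minors, all, err := nvidiaMinorsForContainer("/sys/fs/cgroup/devices/kubepods/pod-example/container-example/devices.list")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if all {
		fmt.Println("container can see every NVIDIA GPU on the node")
		return
	}
	fmt.Println("visible NVIDIA GPU minor numbers:", minors)
}
```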
I am a newcomer to cAdvisor. When I attempt to deploy kube-prometheus on my k8s cluster to monitor my GPUs, there is no GPU usage info at either the container level or the machine level.
My k8s version is v1.9.5, and I use NVIDIA GPUs in containers by setting --feature-gates=Accelerators=true on the kubelet instead of using the device plugin. It does work when running TensorFlow with a GPU in a container. I checked "Not able to collect metrics for nvidia GPU", where @mindprince claimed that ...
I also found "add accelerator info to container spec", where @mindprince claimed that ...
I checked cAdvisor's running.md, which says it can support monitoring GPUs by adding some parameters when starting cAdvisor.
So my questions are:
Can anyone kindly give me a hand, since we really do need to monitor our GPU jobs running in k8s? Thanks!