operator-inventory should wait for the nvdp to fully init before deciding on the GPU count #207
Comments
It's possible that provider 0.5.12 fixed this, but that is yet to be confirmed.
@andy108369 will validate this in the near future; it is not a critical matter.
It just happened to me. Sometimes when you shut down the provider it can show wrong GPU values. Whenever you shut down a node for more than about 5 minutes, it may fail to label the node again and then shows 0 GPUs. A normal restart won't cause this; the node has to be down for ~5 minutes. The node then isn't labeled properly, which causes the 0 GPU value. So after each shutdown I have to verify that the node is actually labeled; if it isn't, bouncing the inventory pod fixes it. Here are my versions:
I think the issue is primarily related to
Yet, I think
Currently, the operator-inventory reports 0 GPUs upon first install or server reboot unless manually restarted.
I think operator-inventory should wait for the nvdp to fully init before deciding on the GPU count.
Damir and I have been deploying ~5 providers with provider v0.5.11 and have worked with the providers quite extensively this week.
We have noticed that operator-inventory almost always reports 0 GPUs upon first provider install or after a server reboot.
The workaround is to simply bounce it, e.g.:
The point is that operator-inventory should be more robust: if it doesn't detect the GPU upon first run or after a server reboot, it should not wait for an admin to kick it.
The "first install" case could be explained by operator-inventory being installed first, before we install the nvdp (nvidia-device-plugin).
The reboot case might be the same: it is highly likely that the nvdp plugin can't init in time to detect the GPU, while operator-inventory has already been initialized.
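
To illustrate the requested behavior, here is a minimal sketch of a wait-and-retry loop. It is not operator-inventory's actual code: `wait_for_gpus`, the `probe` callback, and the timeout and interval values are all hypothetical, and how the GPU count would really be obtained (for example from the node's allocatable resources once nvdp has registered) is left to the caller. The timeout exists because a node without GPUs legitimately reports 0 and should not block forever.

```python
import time
from typing import Callable


def wait_for_gpus(
    probe: Callable[[], int],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll `probe` until it reports at least one GPU or the timeout expires.

    `probe` is a hypothetical callback returning the currently visible GPU
    count. Returns the detected count, or 0 if nothing showed up in time.
    """
    deadline = clock() + timeout_s
    while True:
        count = probe()
        if count > 0:
            # The device plugin has finished initializing; trust this value.
            return count
        if clock() >= deadline:
            # Give up: either the node has no GPUs or the plugin never came up.
            return 0
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated device plugin that needs three polls before GPUs appear.
    readings = iter([0, 0, 0, 8])
    detected = wait_for_gpus(lambda: next(readings), timeout_s=60, interval_s=0.1)
    print(f"detected GPUs: {detected}")
```

With something along these lines, the inventory would publish a GPU count only after the plugin reports one (or after the timeout), instead of latching 0 at startup and needing a manual bounce.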