
operator-inventory should wait for the nvdp to fully init before deciding on the GPU count #207

Open
andy108369 opened this issue Apr 4, 2024 · 5 comments
Labels
repo/provider Akash provider-services repo issues

Comments

@andy108369
Contributor

Currently, operator-inventory reports 0 GPUs upon first install or after a server reboot unless it is manually restarted.
I think operator-inventory should wait for nvdp to fully initialize before deciding on the GPU count.

Damir and I have been deploying ~5 providers with provider v0.5.11 and have been working with them quite extensively over this week.

We have noticed that operator-inventory would almost always report 0 GPUs upon first provider install or after a server reboot.

The workaround is to simply bounce it, e.g.:

kubectl rollout restart deployment/operator-inventory -n akash-services

The point is that operator-inventory should be more robust: if it doesn't detect the GPUs upon first run or after a server reboot, it should retry on its own rather than wait for an admin to kick it.

The "first install" case can be explained by install order: operator-inventory gets installed first, before we install the nvdp (nvidia-device-plugin).

The reboot case is likely the same: the nvdp plugin can't initialize and detect the GPUs in time, while operator-inventory has already initialized.
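One way the race could be worked around is to poll for a nonzero GPU count with a timeout before trusting the result. A minimal shell sketch, assuming the probe command is supplied by the caller (in a real cluster it might read the node's allocatable `nvidia.com/gpu` resource via kubectl; that resource path is an assumption, not something confirmed by this thread):

```shell
# wait_for_gpus PROBE [TIMEOUT] [INTERVAL]
# Polls PROBE (a command that prints a GPU count) until it reports a
# nonzero count or TIMEOUT seconds have elapsed. Prints the final count;
# returns 0 on success, 1 on timeout.
# PROBE is hypothetical; on a cluster it could be something like:
#   kubectl get node "$NODE" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
wait_for_gpus() {
  probe="$1"; timeout="${2:-120}"; interval="${3:-5}"; elapsed=0
  while :; do
    count="$("$probe" 2>/dev/null || echo 0)"
    count="${count:-0}"
    # Non-numeric output makes the test fail quietly and counts as "no GPUs yet".
    if [ "$count" -gt 0 ] 2>/dev/null; then
      echo "$count"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo 0
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

The same poll-until-ready idea could equally live inside operator-inventory itself rather than in a wrapper script.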

@andy108369 andy108369 added repo/provider Akash provider-services repo issues awaiting-triage labels Apr 4, 2024
@andy108369
Contributor Author

It's possible that provider 0.5.12 fixed this, but that's yet to be confirmed.

@chainzero
Collaborator

@andy108369 Will validate in the near future, but this is not a critical matter.

@TormenTeDx

It just happened to me.

Sometimes when you shut down a node it can show wrong GPU values. Whenever you shut a node down for more than ~5 minutes, it may not get labeled again and then shows 0 GPUs. A normal restart won't cause this; you have to shut the node down for roughly 5 minutes or more. The node then isn't labeled properly, and that causes the 0 GPU value. So after each shutdown I have to verify that the node is actually labeled; if it isn't, simply bouncing the inventory pod fixes it.
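That post-shutdown check can be scripted. A small sketch of the decision logic (the deployment name and namespace come from the thread; the jsonpath query in the comment is an assumption about where the GPU count appears on the node object):

```shell
# needs_bounce COUNT
# Given the allocatable GPU count reported for a node (empty string when
# the node lost its labels), decide whether operator-inventory should be
# restarted.
needs_bounce() {
  [ -z "$1" ] || [ "$1" = "0" ]
}

# Hypothetical usage on a live cluster:
#   gpus="$(kubectl get node "$NODE" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')"
#   if needs_bounce "$gpus"; then
#     kubectl rollout restart deployment/operator-inventory -n akash-services
#   fi
```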

Here are my versions:

root@node1:~# helm list -A
NAME                    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                            APP VERSION
akash-hostname-operator akash-services          4               2024-04-16 14:30:27.443699966 +0000 UTC deployed        akash-hostname-operator-9.1.3    0.5.13     
akash-node              akash-services          2               2024-04-16 14:36:42.381124574 +0000 UTC deployed        akash-node-9.0.3                 0.32.3     
akash-provider          akash-services          16              2024-04-16 14:31:46.980977588 +0000 UTC deployed        provider-9.2.6                   0.5.13     
ingress-nginx           ingress-nginx           3               2024-03-05 12:54:08.500925969 +0000 UTC deployed        ingress-nginx-4.10.0             1.10.0     
inventory-operator      akash-services          5               2024-04-16 14:30:35.530259476 +0000 UTC deployed        akash-inventory-operator-9.1.3   0.5.13     
nvdp                    nvidia-device-plugin    3               2024-03-05 14:26:41.729594744 +0000 UTC deployed        nvidia-device-plugin-0.14.5      0.14.5   

@andy108369
Contributor Author

andy108369 commented Jun 5, 2024

Looks like it still happens:

[screenshot]

I think the issue is primarily related to k8s-device-plugin limitations:

This functionality is not production ready and includes a number of known issues including:

  • The device plugin may show as started before it is ready to allocate shared GPUs while waiting for the CUDA MPS control daemon to come online.

Source: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0

Still, I think operator-inventory should introduce some sort of workaround for that.

@andy108369 andy108369 removed their assignment Jun 5, 2024
@andy108369
Contributor Author

The issue still persists after node reboots, when nvdp hasn't fully initialized yet but operator-inventory starts querying it too early:

[screenshot]

provider 0.6.2
akash 0.36.0
