
operator-inventory should wait for the nvdp to fully init before deciding on the GPU count #207

Open
andy108369 opened this issue Apr 4, 2024 · 5 comments
Labels
repo/provider Akash provider-services repo issues

Comments

@andy108369
Contributor

Currently, operator-inventory reports 0 GPUs upon first install or after a server reboot unless it is manually restarted.
I think operator-inventory should wait for nvdp to fully initialize before deciding on the GPU count.

Damir and I have been deploying ~5 providers with provider v0.5.11 and have been working with them quite extensively over this week.

We have noticed that operator-inventory would almost always report 0 GPUs upon first provider install or after a server reboot.

The workaround is to simply bounce it, e.g.:

kubectl rollout restart deployment/operator-inventory -n akash-services

The point is that operator-inventory should be more robust: if it doesn't detect the GPUs upon first run or after a server reboot, it should retry on its own rather than wait for an admin to kick it.

The "first install" case can be explained by install order: operator-inventory gets installed first, before we install the nvdp (nvidia-device-plugin).

The reboot case is likely the same: the nvdp plugin can't initialize and detect the GPUs in time, while operator-inventory has already initialized.
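One way the race could be worked around is to poll for a nonzero GPU count with a timeout before trusting the result. A minimal shell sketch, assuming the probe command is supplied by the caller (in a real cluster it might read the node's allocatable `nvidia.com/gpu` resource via kubectl; that resource path is an assumption, not something confirmed by this thread):

```shell
# wait_for_gpus PROBE [TIMEOUT] [INTERVAL]
# Polls PROBE (a command that prints a GPU count) until it reports a
# nonzero count or TIMEOUT seconds have elapsed. Prints the final count;
# returns 0 on success, 1 on timeout.
# PROBE is hypothetical; on a cluster it could be something like:
#   kubectl get node "$NODE" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
wait_for_gpus() {
  probe="$1"; timeout="${2:-120}"; interval="${3:-5}"; elapsed=0
  while :; do
    count="$("$probe" 2>/dev/null || echo 0)"
    count="${count:-0}"
    # Non-numeric output makes the test fail quietly and counts as "no GPUs yet".
    if [ "$count" -gt 0 ] 2>/dev/null; then
      echo "$count"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo 0
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

The same poll-until-ready idea could equally live inside operator-inventory itself rather than in a wrapper script.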

@andy108369 andy108369 added repo/provider Akash provider-services repo issues awaiting-triage labels Apr 4, 2024
@andy108369
Contributor Author

It's possible that provider 0.5.12 fixed this, but that's yet to be confirmed.

@chainzero
Collaborator

@andy108369 Will validate in the near future, but this is not a critical matter.

@TormenTeDx

It just happened to me.

Sometimes when you shut down a node it can show wrong GPU values. Whenever you shut a node down for more than ~5 minutes, it may not get labeled again and then shows 0 GPUs. A normal restart won't cause this; you have to shut the node down for roughly 5 minutes or more. The node then isn't labeled properly, and that causes the 0 GPU value. So after each shutdown I have to verify that the node is actually labeled; if it isn't, simply bouncing the inventory pod fixes it.
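That post-shutdown check can be scripted. A small sketch of the decision logic (the deployment name and namespace come from the thread; the jsonpath query in the comment is an assumption about where the GPU count appears on the node object):

```shell
# needs_bounce COUNT
# Given the allocatable GPU count reported for a node (empty string when
# the node lost its labels), decide whether operator-inventory should be
# restarted.
needs_bounce() {
  [ -z "$1" ] || [ "$1" = "0" ]
}

# Hypothetical usage on a live cluster:
#   gpus="$(kubectl get node "$NODE" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')"
#   if needs_bounce "$gpus"; then
#     kubectl rollout restart deployment/operator-inventory -n akash-services
#   fi
```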

Here are my versions:

root@node1:~# helm list -A
NAME                    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                            APP VERSION
akash-hostname-operator akash-services          4               2024-04-16 14:30:27.443699966 +0000 UTC deployed        akash-hostname-operator-9.1.3    0.5.13     
akash-node              akash-services          2               2024-04-16 14:36:42.381124574 +0000 UTC deployed        akash-node-9.0.3                 0.32.3     
akash-provider          akash-services          16              2024-04-16 14:31:46.980977588 +0000 UTC deployed        provider-9.2.6                   0.5.13     
ingress-nginx           ingress-nginx           3               2024-03-05 12:54:08.500925969 +0000 UTC deployed        ingress-nginx-4.10.0             1.10.0     
inventory-operator      akash-services          5               2024-04-16 14:30:35.530259476 +0000 UTC deployed        akash-inventory-operator-9.1.3   0.5.13     
nvdp                    nvidia-device-plugin    3               2024-03-05 14:26:41.729594744 +0000 UTC deployed        nvidia-device-plugin-0.14.5      0.14.5   

@andy108369
Contributor Author

andy108369 commented Jun 5, 2024

Looks like it still happens:

[screenshot]

I think the issue is primarily related to k8s-device-plugin limitations:

This functionality is not production ready and includes a number of known issues including:

  • The device plugin may show as started before it is ready to allocate shared GPUs while waiting for the CUDA MPS control daemon to come online.

Source: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0

Still, I think operator-inventory should introduce some sort of workaround for that.

@andy108369 andy108369 removed their assignment Jun 5, 2024
@andy108369
Contributor Author

The issue still persists after node reboots, when nvdp hasn't fully initialized yet but operator-inventory starts querying it too early:

[screenshot]

provider 0.6.2
akash 0.36.0
