
provider status endpoint: hurricane provider reports excessively large amount of available CPUs #232

Open
andy108369 opened this issue Jun 27, 2024 · 5 comments
Labels: P1, repo/provider (Akash provider-services repo issues)

@andy108369 (Contributor)

hurricane provider reports excessively large amount of available CPUs

$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 405.635368
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"                                "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"                                 "0/0/0"       "1.82/1.69/0.13"       "25.54/25.54/0"
"worker-01.hurricane2"   "102/18446744073709504/-18446744073709404"  "1/1/0"       "196.45/57.48/138.97"  "1808.76/1443.1/365.67"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
34.2          0      64.88       314.4             0             0             11

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          575.7

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

provider_info2.sh script https://github.com/arno01/akash-tools/blob/main/provider_info2.sh
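
For what it's worth, the magnitude of the bogus value looks like an unsigned 64-bit underflow in the milli-CPU accounting: if allocatable minus allocated goes negative and the result is kept as a uint64, it wraps around to just under 2^64. Whether the operator really computes it this way is an assumption on my part, but the arithmetic lines up exactly if ~149.616 CPU were still counted as allocated against the node's 102 CPU (the 149.616 figure is inferred from the reported number, not measured):

# milli-CPU: 102 CPU capacity minus ~149.6 CPU allocated, printed as an unsigned 64-bit value
$ printf '%u\n' $(( 102000 - 149616 ))
18446744073709504000
# divided by 1000 milli-CPU per core -> 18446744073709504 "available" cores, the exact value reported above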

Versions

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                                          IMAGE
akash-node-1-0                                                ghcr.io/akash-network/node:0.36.0
akash-provider-0                                              ghcr.io/akash-network/provider:0.6.2
operator-hostname-6dddc6db79-hmmxd                            ghcr.io/akash-network/provider:0.6.2
operator-inventory-6fdf575d44-rnfj4                           ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-control-01.hurricane2   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-worker-01.hurricane2    ghcr.io/akash-network/provider:0.6.2
operator-ip-d9d6df8cd-t9zw9                                   ghcr.io/akash-network/provider:0.6.2

Logs

I've tried restarting the operator-inventory, which previously used to "fix" this issue, but to no avail this time.

kubectl -n akash-services rollout restart deployment/operator-inventory
$ kubectl -n akash-services logs deployment/operator-inventory --timestamps
2024-06-27T15:25:29.979755238Z I[2024-06-27|15:25:29.979] using in cluster kube config                 cmp=provider
2024-06-27T15:25:30.993714193Z INFO	rook-ceph	   ADDED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:25:31.022718552Z INFO	rest listening on ":8080"
2024-06-27T15:25:31.022730122Z INFO	nodes.nodes	waiting for nodes to finish
2024-06-27T15:25:31.022777911Z INFO	grpc listening on ":8081"
2024-06-27T15:25:31.022824901Z INFO	watcher.storageclasses	started
2024-06-27T15:25:31.022976338Z INFO	watcher.config	started
2024-06-27T15:25:31.027880682Z INFO	rook-ceph	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-06-27T15:25:31.029378292Z INFO	nodes.node.monitor	starting	{"node": "worker-01.hurricane2"}
2024-06-27T15:25:31.029383612Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "control-01.hurricane2"}
2024-06-27T15:25:31.029386222Z INFO	nodes.node.monitor	starting	{"node": "control-01.hurricane2"}
2024-06-27T15:25:31.029390481Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "worker-01.hurricane2"}
2024-06-27T15:25:31.063512161Z INFO	rancher	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-06-27T15:25:31.066705538Z W0627 15:25:31.066598       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:31.066875795Z W0627 15:25:31.066601       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:32.087372741Z W0627 15:25:32.087218       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:32.087522389Z W0627 15:25:32.087456       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:33.093759624Z W0627 15:25:33.093649       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:33.096327860Z W0627 15:25:33.096250       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:35.614448848Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:25:35.664476772Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "control-01.hurricane2"}
2024-06-27T15:25:35.780999348Z INFO	nodes.node.monitor	started	{"node": "control-01.hurricane2"}
2024-06-27T15:25:36.239976215Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "worker-01.hurricane2"}
2024-06-27T15:25:36.454713184Z INFO	nodes.node.monitor	started	{"node": "worker-01.hurricane2"}
2024-06-27T15:26:36.900875467Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:27:38.206330676Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:28:39.486188220Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:29:40.787165193Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
andy108369 added the repo/provider (Akash provider-services repo issues) and awaiting-triage labels on Jun 27, 2024
@andy108369 (Contributor, Author)

The CPU value returned to normal, even without restarting the operator-inventory, after deleting the containers stuck in the "ContainerStatusUnknown" state:

$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 408.364243
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"                                "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"                                 "0/0/0"       "1.82/1.69/0.13"       "25.54/25.54/0"
"worker-01.hurricane2"   "102/18446744073709490/-18446744073709384"  "1/1/0"       "196.45/49.67/146.78"  "1808.76/1435.28/373.48"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
34.2          0      64.88       314.4             0             0             11

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          575.7

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
arno@x1:~$ kubectl get pods -A --field-selector status.phase=Failed 
NAMESPACE                                       NAME                   READY   STATUS                   RESTARTS   AGE
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-bbgqz   0/1     ContainerStatusUnknown   1          2d22h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-f2fpj   0/1     ContainerStatusUnknown   1          3d20h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-g7xbd   0/1     ContainerStatusUnknown   1          3d3h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-hv4qs   0/1     ContainerStatusUnknown   1          9h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-p4h7j   0/1     ContainerStatusUnknown   1          4d7h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-rcr45   0/1     ContainerStatusUnknown   1          30h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-4cq86   0/1     ContainerStatusUnknown   1          20d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-5ddrg   0/1     ContainerStatusUnknown   1          13d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-7nl6p   0/1     ContainerStatusUnknown   1          5d7h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-9jsn7   0/1     ContainerStatusUnknown   1          19d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-bnjfh   0/1     ContainerStatusUnknown   1          20d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-d2nfr   0/1     ContainerStatusUnknown   1          7d12h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-dk95v   0/1     ContainerStatusUnknown   1          17d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-fgfl4   0/1     ContainerStatusUnknown   1          7d19h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-gh9bb   0/1     ContainerStatusUnknown   1          16d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-gltgh   0/1     ContainerStatusUnknown   1          9d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-j9tnr   0/1     ContainerStatusUnknown   1          15d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-mmqfk   0/1     ContainerStatusUnknown   1          6d5h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-ph89h   0/1     ContainerStatusUnknown   1          11d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-pjrg4   0/1     ContainerStatusUnknown   1          17d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-pwbzv   0/1     ContainerStatusUnknown   1          13d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-rd7z5   0/1     ContainerStatusUnknown   1          12d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-t6vt9   0/1     ContainerStatusUnknown   1          6d15h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-vht5l   0/1     ContainerStatusUnknown   1          9d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-wd8w4   0/1     ContainerStatusUnknown   1          7d23h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-xnsvt   0/1     ContainerStatusUnknown   1          13d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-zmzbf   0/1     ContainerStatusUnknown   1          12d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-zw2st   0/1     ContainerStatusUnknown   1          10d

arno@x1:~$ ns=2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu
arno@x1:~$ kubectl -n $ns get deployment
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
web    1/1     1            1           116d
arno@x1:~$ kubectl -n $ns get rs
NAME             DESIRED   CURRENT   READY   AGE
web-57478ff56c   0         0         0       4d17h
web-5df9f7c798   1         1         1       4d17h
web-7f5fdfd87c   0         0         0       53d
web-85fc6b7694   0         0         0       54d
web-85ff75fdc5   0         0         0       70d
arno@x1:~$ kubectl -n $ns delete rs web-85ff75fdc5
replicaset.apps "web-85ff75fdc5" deleted
arno@x1:~$ kubectl -n $ns delete rs web-85fc6b7694
replicaset.apps "web-85fc6b7694" deleted
arno@x1:~$ kubectl -n $ns delete rs web-7f5fdfd87c
replicaset.apps "web-7f5fdfd87c" deleted
arno@x1:~$ kubectl -n $ns delete rs web-57478ff56c
replicaset.apps "web-57478ff56c" deleted
arno@x1:~$ kubectl -n $ns get rs
NAME             DESIRED   CURRENT   READY   AGE
web-5df9f7c798   1         1         1       4d17h
arno@x1:~$ kubectl get pods -A --field-selector status.phase=Failed 
NAMESPACE                                       NAME                   READY   STATUS                   RESTARTS   AGE
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-bbgqz   0/1     ContainerStatusUnknown   1          2d22h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-f2fpj   0/1     ContainerStatusUnknown   1          3d20h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-g7xbd   0/1     ContainerStatusUnknown   1          3d3h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-hv4qs   0/1     ContainerStatusUnknown   1          9h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-p4h7j   0/1     ContainerStatusUnknown   1          4d7h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-rcr45   0/1     ContainerStatusUnknown   1          30h
arno@x1:~$ kubectl delete pods -A --field-selector status.phase=Failed 
pod "web-5df9f7c798-bbgqz" deleted
pod "web-5df9f7c798-f2fpj" deleted
pod "web-5df9f7c798-g7xbd" deleted
pod "web-5df9f7c798-hv4qs" deleted
pod "web-5df9f7c798-p4h7j" deleted
pod "web-5df9f7c798-rcr45" deleted
$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 408.364243
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"         "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"          "0/0/0"       "1.82/1.69/0.13"       "25.54/25.54/0"
"worker-01.hurricane2"   "102/47.995/54.005"  "1/1/0"       "196.45/104.36/92.09"  "1808.76/1489.97/318.79"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
34.2          0      64.88       314.4             0             0             11

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          575.7

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

@andy108369 (Contributor, Author)

Checked with Max C.; they said they are deploying the updates through the Console (Dockerfile).

Interestingly, while I'm seeing a few more new ContainerStatusUnknown pods, it doesn't appear to be impacting the provider status endpoint (8443/status) 🤔
[screenshot]
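
For reference, the raw numbers can also be pulled straight from the 8443/status endpoint mentioned above, which is what provider_info2.sh summarizes (jq just pretty-prints here; the exact JSON layout may differ between provider versions):

$ curl -sk https://provider.hurricane.akash.pub:8443/status | jq .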

@andy108369
Copy link
Contributor Author

I think this issue (the excessively high CPU count report) happens more often when there are ReplicaSet leftovers in Kubernetes.
As you can see in this screenshot, there is just one ReplicaSet and it is active; that's probably why the issue isn't being triggered.

[screenshot]
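
If the ReplicaSet-leftover theory holds, a quick way to spot cleanup candidates is to list ReplicaSets that are scaled down to zero across all namespaces (only a sketch; whether deleting them actually prevents the issue is unconfirmed):

$ kubectl get rs -A --no-headers | awk '$3 == 0 && $4 == 0'   # columns 3/4 are DESIRED/CURRENT when using -A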

@chainzero (Collaborator)

I have attempted to reproduce the excessively high CPU count stats associated with issue 232 across several test cluster builds. The issue is difficult to replicate because:

  • As mentioned in the prior issue update, reproducing the theorized root cause (the inventory operator being impacted by pods in a ContainerStatusUnknown state) will not necessarily provoke the reporting issue. The Hurricane provider currently has five such pods, all associated with the Console stats page deployment, and CPU reporting in the inventory operator is not impacted.

  • It is difficult to put pods into ContainerStatusUnknown intentionally. Getting pods into a crash backoff or out-of-memory state is easy, but that has not proved effective in provoking the issue, and deliberately provoking ContainerStatusUnknown is harder. In my experience, killing the kubelet on the worker node is the best mechanism, since the control plane then cannot determine pod status. But that is not a feasible way to reproduce this issue, because it also takes down the inventory operator's hardware discovery pod on the same node; without a running discovery pod able to communicate with the inventory operator, we cannot check that node's discovery stats to see whether the issue was reproduced. I have also tried other techniques, such as manipulating iptables and network policies on the worker node to block the control plane's status checks against specific pods, but that did not put the pods into the required status either. And even after all the effort to get pods into ContainerStatusUnknown, it may still not provoke the issue, based on the current Hurricane provider observations.

Given these difficulties in provoking the issue intentionally on a test provider, I would suggest we wait for the Hurricane provider and/or another provider (preferably OCL-owned) to hit the issue naturally, leave the provider in that state (during which only new bidding should be affected), and troubleshoot further there.

It's interesting that the only pods in ContainerStatusUnknown are Console stats pods. The most likely cause of that state is a sequence such as: a rolling update is triggered > the updated pod is spawned > Kubernetes attempts to delete the original pod > the container is already gone from containerd for some reason > Kubernetes reports ContainerStatusUnknown because it cannot determine the state of a pod that is not present in the container runtime. But I have deployed the Console stats app multiple times on test providers and triggered updates to the deployment without hitting any such issue. To my knowledge these deployments are the sole contributors to the ContainerStatusUnknown pods; the stats deployment also has many pods in Completed status.
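
One possible way to force that exact sequence on a test cluster (untested, and it assumes containerd with crictl available on the worker node; <container-id> is a placeholder) would be to delete the container from the runtime behind the kubelet's back and then trigger a rolling update:

# on the worker node
$ crictl ps --name web                # find the container of the console-stats pod
$ crictl rm -f <container-id>         # remove it directly from the runtime
# from the control plane
$ kubectl -n $ns rollout restart deployment/web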

@andy108369 (Contributor, Author)

This has just occurred again on the Hurricane provider; below are the steps I took to patch it (not a permanent fix):

  • The CPU reporting issue is visible on the worker node:
arno@x1:~$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 419.278529
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"                                "gpu(t/a/u)"  "mem(t/a/u GiB)"      "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"                                 "0/0/0"       "1.82/1.69/0.13"      "25.54/25.54/0"
"worker-01.hurricane2"   "102/18446744073709550/-18446744073709450"  "1/1/0"       "186.61/7.43/179.18"  "1808.76/1066.36/742.41"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
61.45         0      141.71      727.76            0             0             495

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          -0.63

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
  • Removed the Failed pods (those in ContainerStatusUnknown state):
arno@x1:~$ kubectl get events -A --sort-by='.lastTimestamp' 
NAMESPACE                                       LAST SEEN   TYPE      REASON             OBJECT                               MESSAGE
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   32m         Normal    Scheduled          pod/web-6647cd677-n2z9g              Successfully assigned 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu/web-6647cd677-n2z9g to worker-01.hurricane2
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   32m         Warning   Evicted            pod/web-6647cd677-grjc2              Pod ephemeral local storage usage exceeds the total limit of containers 524288k.
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   32m         Normal    Killing            pod/web-6647cd677-grjc2              Stopping container web
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   32m         Normal    Pulled             pod/web-6647cd677-n2z9g              Container image "redm4x/console-stats-web:0.19.7" already present on machine
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   32m         Normal    Created            pod/web-6647cd677-n2z9g              Created container web
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   32m         Normal    Started            pod/web-6647cd677-n2z9g              Started container web
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   32m         Normal    SuccessfulCreate   replicaset/web-6647cd677             Created pod: web-6647cd677-n2z9g

arno@x1:~$ ns=2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu
arno@x1:~$ kubectl -n $ns get pods
NAME                  READY   STATUS                   RESTARTS      AGE
web-6647cd677-4bg4s   0/1     Completed                0             8d
web-6647cd677-69cp7   0/1     Completed                0             7d13h
web-6647cd677-7d9g2   0/1     ContainerStatusUnknown   1             5d19h
web-6647cd677-7tsgt   0/1     Completed                0             8d
web-6647cd677-87pzn   0/1     Error                    0             3d3h
web-6647cd677-87x5v   0/1     Completed                0             24h
web-6647cd677-8pfks   0/1     Completed                0             9d
web-6647cd677-9h5vn   0/1     ContainerStatusUnknown   1             42h
web-6647cd677-9l9n9   0/1     Completed                0             9d
web-6647cd677-bwnq7   0/1     Completed                0             46h
web-6647cd677-crn7q   0/1     Completed                0             3d19h
web-6647cd677-d6xj6   0/1     Completed                0             7d7h
web-6647cd677-dn2d8   0/1     ContainerStatusUnknown   1             6d21h
web-6647cd677-dpc79   0/1     Completed                0             3d12h
web-6647cd677-frvg8   0/1     ContainerStatusUnknown   1             4d20h
web-6647cd677-gjcv2   0/1     ContainerStatusUnknown   1             6d13h
web-6647cd677-grjc2   0/1     Completed                0             7h7m
web-6647cd677-gtg5b   0/1     ContainerStatusUnknown   1             5d8h
web-6647cd677-gth6w   0/1     ContainerStatusUnknown   1             15h
web-6647cd677-lq9cq   0/1     Completed                0             2d6h
web-6647cd677-n2862   0/1     Completed                0             7d19h
web-6647cd677-n2z9g   1/1     Running                  0             33m
web-6647cd677-pf4b8   0/1     Completed                0             4d3h
web-6647cd677-pt5kz   0/1     ContainerStatusUnknown   1             36h
web-6647cd677-qdm9d   0/1     Completed                1 (10d ago)   10d
web-6647cd677-r6m4q   0/1     ContainerStatusUnknown   1             4d12h
web-6647cd677-t8gwk   0/1     Completed                0             31h
web-6647cd677-vcsdx   0/1     Completed                0             6d6h
web-6647cd677-vn7xt   0/1     ContainerStatusUnknown   1             8d
web-6647cd677-xhbkf   0/1     Completed                0             9d
web-6647cd677-zbz5v   0/1     ContainerStatusUnknown   1             2d18h

arno@x1:~$ kubectl get pods -A --field-selector status.phase=Failed 
NAMESPACE                                       NAME                  READY   STATUS                   RESTARTS   AGE
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-7d9g2   0/1     ContainerStatusUnknown   1          5d19h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-87pzn   0/1     Error                    0          3d3h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-9h5vn   0/1     ContainerStatusUnknown   1          42h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-dn2d8   0/1     ContainerStatusUnknown   1          6d21h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-frvg8   0/1     ContainerStatusUnknown   1          4d20h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-gjcv2   0/1     ContainerStatusUnknown   1          6d13h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-gtg5b   0/1     ContainerStatusUnknown   1          5d8h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-gth6w   0/1     ContainerStatusUnknown   1          15h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-pt5kz   0/1     ContainerStatusUnknown   1          36h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-r6m4q   0/1     ContainerStatusUnknown   1          4d12h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-vn7xt   0/1     ContainerStatusUnknown   1          8d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-6647cd677-zbz5v   0/1     ContainerStatusUnknown   1          2d18h

arno@x1:~$ kubectl delete pods -A --field-selector status.phase=Failed 
pod "web-6647cd677-7d9g2" deleted
pod "web-6647cd677-87pzn" deleted
pod "web-6647cd677-9h5vn" deleted
pod "web-6647cd677-dn2d8" deleted
pod "web-6647cd677-frvg8" deleted
pod "web-6647cd677-gjcv2" deleted
pod "web-6647cd677-gtg5b" deleted
pod "web-6647cd677-gth6w" deleted
pod "web-6647cd677-pt5kz" deleted
pod "web-6647cd677-r6m4q" deleted
pod "web-6647cd677-vn7xt" deleted
pod "web-6647cd677-zbz5v" deleted
  • The CPU report looks good after that:
arno@x1:~$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 419.284058
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"         "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/0.7/1.3"          "0/0/0"       "1.82/1.19/0.63"       "25.54/23.54/2"
"worker-01.hurricane2"   "102/10.745/91.255"  "1/1/0"       "186.61/11.29/175.32"  "1808.76/1070.22/738.55"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
61.45         0      141.71      727.76            0             0             495

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          -0.68

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
1.5           0      2.5         4                 0             0             0
arno@x1:~$ 
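
Until the root cause is found, a possible stopgap (just a suggestion, not something we run anywhere yet) would be to clear Failed pods periodically so their requests cannot linger in the accounting, e.g. from cron on a machine with kubectl access:

# hourly cleanup of all Failed pods (ContainerStatusUnknown / Error / Evicted), cluster-wide
0 * * * * kubectl delete pods -A --field-selector status.phase=Failed --ignore-not-found >/dev/null 2>&1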
