The customer's RKE2 certificates expired on their downstream Elemental cluster. We tried to rotate the certificates from Rancher, but Rancher got stuck on "waiting for [nodename] certificate rotation". The rotation could not proceed because the rancher-system-agent process on every server node kept restarting with the error: "Fatal error running: unable to parse connection info file: invalid character 'n' looking for beginning of value". When we checked the "/var/lib/rancher/agent/rancher2_connection_info.json" file, it contained the message "machine not found by request" instead of valid JSON. Deleting a node so that the node pool would re-register it did not help either. Hostbusters engineering attributed this to a problem with the Elemental tokens.
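The agent error quoted above is a JSON parse failure on the connection info file. As a minimal sketch, the following Python check could be run on an affected node to confirm whether that file parses. The function name check_connection_info is hypothetical and not part of any Rancher tooling, and the expectation that the file holds a JSON object is an assumption.

```python
import json


def check_connection_info(text):
    """Return (True, None) if text parses as a JSON object, else (False, reason)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    # Assumption: a healthy connection info file is a JSON object.
    if not isinstance(data, dict):
        return False, "valid JSON, but not an object"
    return True, None


if __name__ == "__main__":
    # Path taken from the issue description.
    path = "/var/lib/rancher/agent/rancher2_connection_info.json"
    try:
        with open(path) as fh:
            ok, reason = check_connection_info(fh.read())
    except OSError as exc:
        ok, reason = False, f"cannot read file: {exc}"
    print("OK" if ok else f"Problem: {reason}")
```

A file containing plain text such as "machine not found by request" is reported as not valid JSON, which matches the symptom described in this issue.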
Business impact:
The customer's cluster was inaccessible through the Rancher UI.
Troubleshooting steps:
Unpause the CAPI cluster by editing the clusters.cluster.x-k8s.io object and setting spec.paused: false.
Make sure the Machine object for the node being provisioned has the rke.cattle.io/machine-id label, set to the value in the /etc/rancher/agent/cattle_id file.
Stop the rancher-system-agent and elemental-system-agent.
Symlink /opt/rke2/bin/rke2 to /usr/local/bin/rke2 (problem specific to SLE Micro).
Re-register the node using the registration command from kubectl get clusterregistrationtokens.management.cattle.io/default-token -n -o yaml; the command is in the status.nodeCommand section.
Scale the node pool down to two nodes.
Once Elemental has added the node back to the inventory, scale the cluster back up to 3 server nodes.
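Some of the steps above can be sketched as small Python helpers. This is an illustration under stated assumptions, not the procedure engineering used. The helper names are hypothetical. The unpause helper uses kubectl patch with a merge patch instead of the interactive edit described in the steps. The Machine and token objects are assumed to have been fetched with kubectl get ... -o json and parsed into dictionaries.

```python
import json


def unpause_command(cluster_name, namespace):
    """Build a kubectl argv list that sets spec.paused to false via a merge patch."""
    patch = json.dumps({"spec": {"paused": False}})
    return [
        "kubectl", "patch", "clusters.cluster.x-k8s.io", cluster_name,
        "-n", namespace, "--type", "merge", "-p", patch,
    ]


def machine_id_matches(machine, cattle_id_text):
    """Compare the rke.cattle.io/machine-id label with the cattle_id file content."""
    labels = (machine.get("metadata") or {}).get("labels") or {}
    # The file content may end with a newline, so strip whitespace first.
    return labels.get("rke.cattle.io/machine-id") == cattle_id_text.strip()


def node_command(token):
    """Return status.nodeCommand from a registration token object, or None."""
    return (token.get("status") or {}).get("nodeCommand")
```

For example, machine_id_matches(machine, open("/etc/rancher/agent/cattle_id").read()) returns False when the label is missing or differs from the file, which is the condition the second step asks you to rule out.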
SURE-8954