[SURE-8954] problem with elemental tokens preventing rancher from re-provisioning nodes #1671

Open
kkaempf opened this issue Jan 14, 2025 · 0 comments
Labels
JIRA must shout kind/bug Something isn't working

kkaempf (Contributor) commented Jan 14, 2025

SURE-8954

Issue description:

The customer's RKE2 certificates expired on their downstream Elemental cluster. We tried to rotate the certificates from Rancher, which left Rancher stuck on "waiting for [nodename] certificate rotation". Rotation could not proceed because the rancher-system-agent processes on all server nodes kept restarting with the error: "Fatal error running: unable to parse connection info file: invalid character 'n' looking for beginning of value". When we checked the "/var/lib/rancher/agent/rancher2_connection_info.json" file, it contained: "machine not found by request". Deleting the node so that the node pool would re-register it did not help either. Hostbusters engineering said this is caused by a problem with the Elemental tokens.
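
The parse failure above can be confirmed on a node by checking whether the connection info file is valid JSON at all. The following is a minimal sketch, not part of rancher-system-agent: the function name `check_connection_info` is hypothetical, and the only assumption about the file is that the agent expects it to contain JSON, as the error message suggests.

```python
import json


def check_connection_info(text):
    """Return (ok, detail) for the raw contents of the connection info file.

    Hypothetical helper: it only tests whether the text parses as JSON and
    does not validate any particular schema.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # Include the raw content so a plain-text error message written
        # into the file is visible in the report.
        return False, f"not valid JSON ({err.msg}); raw content: {text.strip()!r}"
    return True, "parses as JSON"
```

On an affected node, the text could be read from /var/lib/rancher/agent/rancher2_connection_info.json and passed to the function; a file holding the plain string quoted in this report would be flagged as invalid.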

Business impact:

This made the customer's cluster inaccessible through the Rancher UI.

Troubleshooting steps:

  • Unpause the CAPI cluster by editing the clusters.cluster.x-k8s.io object and setting spec.paused: false.
  • Make sure the Machine object for the node being provisioned has the rke.cattle.io/machine-id label, set to the value in the /etc/rancher/agent/cattle_id file.
  • Stop the rancher-system-agent and elemental-system-agent.
  • Symlink /opt/rke2/bin/rke2 to /usr/local/bin/rke2 (problem specific to SLE Micro).
  • Re-register the node using the registration command from kubectl get clusterregistrationtokens.management.cattle.io/default-token -n -o yaml; the command is in the status.nodeCommand section.
  • Scale down the nodePool to two nodes.
  • Once Elemental was able to add the node back to the inventory, scale the cluster back up to 3 server nodes.
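
The machine-id check in the second step can be sketched as a small comparison. This is an illustration only: the helper `machine_id_matches` is hypothetical, the labels dictionary is assumed to come from the Machine object's metadata.labels, and the file text is assumed to be the contents of /etc/rancher/agent/cattle_id read on the node.

```python
# Label name taken from the troubleshooting steps above.
MACHINE_ID_LABEL = "rke.cattle.io/machine-id"


def machine_id_matches(machine_labels, cattle_id_text):
    """Return True if the Machine's label equals the node's cattle_id.

    machine_labels: dict of labels from the Machine object (assumed input).
    cattle_id_text: raw contents of the cattle_id file (assumed input).
    """
    # Ignore surrounding whitespace such as a trailing newline in the file.
    expected = cattle_id_text.strip()
    actual = machine_labels.get(MACHINE_ID_LABEL)
    # An empty file or a missing label counts as a mismatch.
    return bool(expected) and actual == expected
```

A mismatch or a missing label would indicate that the Machine object needs the label corrected before re-registering the node.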
@kkaempf kkaempf added kind/bug Something isn't working JIRA must shout labels Jan 14, 2025
@kkaempf kkaempf moved this to 🗳️ To Do in Elemental Jan 14, 2025