0.8.7 pre-start script fails after rollback from 0.8.7 to 0.8.6 #87

Closed
gberche-orange opened this issue Jan 9, 2025 · 1 comment · Fixed by #86

Comments

gberche-orange (Member) commented Jan 9, 2025

Given a bosh deployment running k3s-wrapper 0.8.6 with K3S_DATA_DIR set to /var/vcap/store/k3s-xx
And a bosh deploy upgrading to 0.8.7 that fails
And a bosh cck that restores the last known version, 0.8.6 (*)
And a new bosh deploy upgrading to 0.8.7
Then the 0.8.7 pre-start script fails with the following message:

cat pre-start.stderr.log
#> mv: cannot move '/var/vcap/store/k3s-agent' to '/var/vcap/store/k3s-datadir': Directory not empty

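This is because starting version 0.8.6 (*) created a new '/var/vcap/store/k3s-agent' directory, which the 0.8.7 pre-start script then cannot move onto the existing, non-empty '/var/vcap/store/k3s-datadir'.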
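The failing sequence can be replayed with a small Python model of the data-dir migration. This is a sketch under assumptions: `migrate_data_dir` is a hypothetical helper, not the release's actual pre-start script (which this issue does not show), and it only assumes that 0.8.7 moves `k3s-agent` to `k3s-datadir`, as the `mv` error suggests. It shows why a plain rename fails once 0.8.6 has recreated `k3s-agent`, and one idempotent alternative that reports the conflict instead.

```python
import tempfile
from pathlib import Path


def migrate_data_dir(old: Path, new: Path) -> str:
    """Hypothetical idempotent version of a pre-start data-dir migration.

    Returns a status string instead of failing with a bare
    "Directory not empty" error when both directories already exist.
    """
    if not old.exists():
        return "nothing-to-migrate"  # already migrated, or fresh install
    if new.exists() and any(new.iterdir()):
        # Both directories hold data: the rollback scenario from this issue.
        # Choosing one automatically could discard k3s state, so stop here.
        return "conflict"
    if new.exists():
        new.rmdir()  # an empty target directory is safe to replace
    old.rename(new)  # same-filesystem rename(2), as for two paths under /var/vcap/store
    return "migrated"


if __name__ == "__main__":
    # Replay the sequence from this issue in a scratch directory.
    with tempfile.TemporaryDirectory() as store:
        old = Path(store, "k3s-agent")
        new = Path(store, "k3s-datadir")
        old.mkdir()
        (old / "state").write_text("0.8.6 data")
        print(migrate_data_dir(old, new))  # first 0.8.7 deploy: migrated
        old.mkdir()  # 0.8.6 restarts after bosh cck and recreates k3s-agent
        (old / "state").write_text("fresh node")
        try:
            old.rename(new)  # what a plain mv/rename attempts
        except OSError as err:
            print("plain rename fails:", err.strerror)
        print(migrate_data_dir(old, new))  # second 0.8.7 deploy: conflict
```

Returning `conflict` rather than picking a directory is a deliberate choice in this sketch: both directories contain k3s state, and silently discarding either one could lose cluster data.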

Note: during this process, k3s treats the node as a fresh install and overwrites '/etc/rancher/node/password'.

As a result, even after fixing the pre-start issue, the new node password prevents the node from joining the existing server nodes, failing with the error:

time="2025-01-06T11:53:24Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: /var/vcap/store/k3s-datadir/agent/serving-kubelet.crt: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

The workaround is then to delete the k8s node (e.g. kubectl delete node/edge-agents-r1-z2-0)

See https://docs.k3s.io/architecture#how-agent-node-registration-works

If the /etc/rancher/node directory of an agent is removed, or you wish to rejoin a node using an existing name, the node should be deleted from the cluster. This will clean up both the old node entry, and the node password secret, and allow the node to (re)join the cluster.
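To make the registration rule concrete, here is a minimal in-memory Python model of the behaviour described in the k3s docs quoted above: the first registration stores the node password, later registrations must present the same one, and deleting the node clears it. `Server`, `register` and `delete_node` are hypothetical names; real k3s stores an scrypt hash in a kube-system Secret rather than the password itself, so this only sketches the decision logic.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


class NodePasswordRejected(Exception):
    """Raised when a node presents a password that differs from the stored one."""


@dataclass
class Server:
    # Simplification: real k3s keeps an scrypt hash in a kube-system Secret,
    # not the password itself; this model only captures the decision logic.
    secrets: Dict[str, str] = field(default_factory=dict)  # node name -> password
    nodes: Set[str] = field(default_factory=set)

    def register(self, node_name: str, password: str) -> None:
        stored = self.secrets.get(node_name)
        if stored is None:
            self.secrets[node_name] = password  # first join: remember the password
        elif stored != password:
            raise NodePasswordRejected(node_name)  # e.g. password file was regenerated
        self.nodes.add(node_name)

    def delete_node(self, node_name: str) -> None:
        # Mirrors the documented workaround: deleting the node also removes its
        # node-password secret, so the next registration is treated as a first join.
        self.nodes.discard(node_name)
        self.secrets.pop(node_name, None)


if __name__ == "__main__":
    server = Server()
    server.register("edge-agents-r1-z2-0", "old-password")
    try:
        # After the rollback, k3s regenerated /etc/rancher/node/password.
        server.register("edge-agents-r1-z2-0", "new-password")
    except NodePasswordRejected as err:
        print("rejected:", err)
    server.delete_node("edge-agents-r1-z2-0")  # like: kubectl delete node/edge-agents-r1-z2-0
    server.register("edge-agents-r1-z2-0", "new-password")
    print("joined:", sorted(server.nodes))
```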

gberche-orange changed the title from "pre-start script fails after rollback from 0.8.7 to 0.8.6" to "0.8.7 pre-start script fails after rollback from 0.8.7 to 0.8.6" on Jan 9, 2025
gberche-orange (Member, Author) commented:

By default, k3s uses the hostname as the node name (aka node identifier). However, bosh assigns the agent_id as the hostname (see cloudfoundry/bosh-dns-release#30), and the agent_id changes each time a new VM is created. Therefore, to support bosh recreation of instances, the k3s wrapper release sets a stable, predictable k8s node name in the following template snippet:

export K3S_NODE_NAME=<%= spec.name %>-<%= spec.index %>
<% if_p('k3s.node_name_prefix') do |prefix| %>
export K3S_NODE_NAME=<%= prefix %>-<%= spec.index %>
<% end %>
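The naming logic of this template can be sketched in Python to show why a recreated VM maps to the same k8s node, and therefore to the same node-password secret. The function names are hypothetical; the secret naming follows the `<node>.node-password.k3s` pattern visible in the kubectl command further down in this comment, and lowercasing it is our assumption (Kubernetes object names must be lowercase anyway).

```python
from typing import Optional


def k3s_node_name(spec_name: str, spec_index: int, node_name_prefix: Optional[str] = None) -> str:
    """Mirror of the ERB snippet above that renders K3S_NODE_NAME."""
    # Default: <instance group name>-<index>. When the optional
    # k3s.node_name_prefix property is set, it replaces the instance group name.
    base = node_name_prefix if node_name_prefix is not None else spec_name
    return f"{base}-{spec_index}"


def node_password_secret(node_name: str) -> str:
    """Name of the kube-system Secret that holds the node-password hash."""
    # Lowercasing is an assumption (Kubernetes object names must be lowercase).
    return f"{node_name}.node-password.k3s".lower()


if __name__ == "__main__":
    # The agent_id hostname changes on every VM recreation; spec.name and
    # spec.index do not, so the recreated VM reuses the same node and secret.
    name = k3s_node_name("edge-agents-r1-z2", 0)  # hypothetical instance group name
    print(name)                        # edge-agents-r1-z2-0
    print(node_password_secret(name))  # edge-agents-r1-z2-0.node-password.k3s
```

This stability is what makes bosh recreation work, and also why a regenerated password on the same node name is rejected until the node (and its secret) is deleted.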

This leverages the K3S_NODE_NAME environment variable in k3s:

https://docs.k3s.io/networking/basic-network-options?_highlight=k3s_node_name#nodes-without-a-hostname

You can run K3s with the --node-name flag or K3S_NODE_NAME environment variable and this will pass the node name to resolve this issue.

The node password hash stored in the k8s API can be inspected with:

kubectl get -n kube-system secrets edge-agents-r1-z2-3.node-password.k3s -o "jsonpath={.data['hash']}" | base64 -d
#>$1:9cd6c33604e7df7c:15:8:1:PRTNHSo8kopiYCFpHb0wClsK47rnDiH2mnRqnWU6yx0ym3PTdLKuX8vJJB0Y+J1Swg0DFgQcNf0xkj1J5NeA8A

# compare with the content of /etc/rancher/node/password, fetched via bosh ssh:
93c9d53f362229674c341014fdee4b5b
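The hash string above can be split into its fields to see which scrypt parameters were used. A minimal Python sketch, assuming the layout `$<version>:<hex salt>:<log2 N>:<r>:<p>:<unpadded base64 key>` from our reading of the k3s `scrypt.go` linked below; `parse_node_password_hash` is a hypothetical helper. Note that the plaintext password from `/etc/rancher/node/password` cannot be compared with this string directly: it has to be re-hashed with the same salt and parameters.

```python
import base64
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePasswordHash:
    version: int
    salt: bytes
    log2_n: int  # scrypt cost parameter N = 2**log2_n
    r: int       # scrypt block size
    p: int       # scrypt parallelization
    key: bytes   # derived key


def parse_node_password_hash(value: str) -> NodePasswordHash:
    """Split a k3s node-password hash into its fields.

    Assumed layout (from our reading of k3s pkg/authenticator/hash/scrypt.go):
    $<version>:<salt as hex>:<log2 N>:<r>:<p>:<derived key as unpadded base64>
    """
    fields = value.strip().split(":")
    if len(fields) != 6 or not fields[0].startswith("$"):
        raise ValueError("unexpected node-password hash format")
    version, salt_hex, log2_n, r, p, key_b64 = fields
    # The key is base64 without "=" padding; restore it for the stdlib decoder.
    key = base64.b64decode(key_b64 + "=" * (-len(key_b64) % 4))
    return NodePasswordHash(int(version[1:]), bytes.fromhex(salt_hex),
                            int(log2_n), int(r), int(p), key)


if __name__ == "__main__":
    h = parse_node_password_hash(
        "$1:9cd6c33604e7df7c:15:8:1:PRTNHSo8kopiYCFpHb0wClsK47rnDiH2mnRqnWU6yx0ym3PTdLKuX8vJJB0Y+J1Swg0DFgQcNf0xkj1J5NeA8A"
    )
    print(f"salt={h.salt.hex()} N=2**{h.log2_n} r={h.r} p={h.p} key_bytes={len(h.key)}")
    # salt=9cd6c33604e7df7c N=2**15 r=8 p=1 key_bytes=64
```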

The function that creates the secret hash from the node password is at https://github.com/k3s-io/k3s/blob/f345697c0a65c5c427817ea60b712a27c6c159d9/pkg/nodepassword/nodepassword.go#L70-L88

The functions that verify a node password against its hash are in the following Go sources, but there is no documented CLI command to run this verification externally:

https://github.com/k3s-io/k3s/blob/f345697c0a65c5c427817ea60b712a27c6c159d9/pkg/nodepassword/nodepassword.go#L54-L67
https://github.com/k3s-io/k3s/blob/f345697c0a65c5c427817ea60b712a27c6c159d9/pkg/authenticator/hash/scrypt.go#L58-L88
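Since there is no documented CLI for this, the check can be reproduced outside the cluster. The Python sketch below mirrors our reading of `CreateHash`/`VerifyHash` in the linked `scrypt.go` using the standard library's `hashlib.scrypt`; the defaults (8-byte salt, N=2^15, r=8, p=1, 64-byte key) are inferred from the hash shown above and should be treated as assumptions, as should the function names `create_hash` and `verify_hash`.

```python
import base64
import hashlib
import hmac
import os

# Assumed defaults, inferred from the hash shown above ($1:<salt>:15:8:1:<key>).
LOG2_N, R, P, SALT_LEN, KEY_LEN = 15, 8, 1, 8, 64
# N=2**15 with r=8 needs slightly more than 32 MiB, above OpenSSL's default maxmem.
MAXMEM = 64 * 1024 * 1024


def _scrypt(password: str, salt: bytes, log2_n: int, r: int, p: int, dklen: int) -> bytes:
    return hashlib.scrypt(password.encode(), salt=salt, n=1 << log2_n, r=r, p=p,
                          maxmem=MAXMEM, dklen=dklen)


def create_hash(password: str) -> str:
    """Hash a node password into the assumed "$1:salt:logN:r:p:key" format."""
    salt = os.urandom(SALT_LEN)
    key = _scrypt(password, salt, LOG2_N, R, P, KEY_LEN)
    enc = base64.b64encode(key).decode().rstrip("=")  # unpadded, like Go's RawStdEncoding
    return f"$1:{salt.hex()}:{LOG2_N}:{R}:{P}:{enc}"


def verify_hash(value: str, password: str) -> bool:
    """Re-derive the key with the stored salt and parameters, then compare."""
    fields = value.strip().split(":")
    if len(fields) != 6 or fields[0] != "$1":
        raise ValueError("unexpected node-password hash format")
    _, salt_hex, log2_n, r, p, key_b64 = fields
    expected = base64.b64decode(key_b64 + "=" * (-len(key_b64) % 4))
    actual = _scrypt(password, bytes.fromhex(salt_hex), int(log2_n), int(r), int(p), len(expected))
    return hmac.compare_digest(expected, actual)  # constant-time comparison
```

For example, `verify_hash(secret_hash, password.strip())`, with `secret_hash` decoded from the kubectl command above and `password` read from `/etc/rancher/node/password`, returning `False` would be consistent with the "Node password rejected" error. Whether k3s strips a trailing newline from the password file in the same way is an assumption.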
