Autopilot with controller+worker roles fails to update #4450

Closed
4 tasks done
jnummelin opened this issue May 20, 2024 · 2 comments · Fixed by #4457
Comments

@jnummelin
Member

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version; "main" branch docs are usually ahead of released versions.

Platform

ALL

Version

1.30.0

Sysinfo

`k0s sysinfo`
➡️ Please replace this text with the output of `k0s sysinfo`. ⬅️

What happened?

When k0s is running with controller+worker nodes, autopilot fails to update. Say you have 3 controller+worker nodes. When autopilot updates the first one, it is very likely that the node will lose its leadership (if it was the leader in the first place).

When autopilot then updates the k0s binary and restarts it (via systemd/openrc), the controller part does start properly. The worker part will NOT start properly, as it does not find the worker-config-default-1.30 configmap. This is because the controller does not apply the new version of the CM, as it is not the leader. Hence we reach a "deadlock": autopilot cannot proceed because the first controller does not update successfully.

This doesn't happen 100% of the time: if one is lucky enough that autopilot first updates the controller that is the leader, and everything is fast enough that the node does not lose its leadership, the update will succeed.
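
A quick way to confirm the stuck state from the (still reachable) API server is to check whether the versioned worker profile ConfigMap exists and which controller currently holds a lease. This is only a sketch: the ConfigMap name comes from the report above, and the lease listing assumes k0s's usual per-controller leases in the kube-node-lease namespace.

```sh
# Does the worker profile ConfigMap for the new version exist yet?
kubectl -n kube-system get configmap worker-config-default-1.30

# Which controllers currently hold leases? (k0s normally keeps per-controller
# leases in kube-node-lease; adjust the namespace/names if yours differ)
kubectl -n kube-node-lease get leases
```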

Steps to reproduce

  1. Install 3 controller+worker nodes with version 1.29.4 (the exact starting version does not really matter AFAIK)
  2. Apply a Plan to update to 1.30.0 (a sketch of such a Plan is shown after this list)
  3. Observe the deadlock
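
For step 2, a Plan roughly along these lines can be used. This is a sketch based on my reading of the Autopilot docs, not a verified manifest: the plan id, download URL, sha256 and node names are placeholders that need to be adjusted for the actual environment.

```sh
# Apply a minimal Autopilot update Plan (placeholders: id, url, sha256, node names)
cat <<'EOF' | kubectl apply -f -
apiVersion: autopilot.k0sproject.io/v1beta2
kind: Plan
metadata:
  name: autopilot
spec:
  id: id-1.30.0-update
  timestamp: now
  commands:
    - k0supdate:
        version: v1.30.0+k0s.0
        platforms:
          linux-amd64:
            url: https://github.com/k0sproject/k0s/releases/download/v1.30.0+k0s.0/k0s-v1.30.0+k0s.0-amd64
            sha256: <sha256 of the k0s binary>
        targets:
          controllers:
            discovery:
              static:
                nodes:
                  - ctrl-worker-0
                  - ctrl-worker-1
                  - ctrl-worker-2
EOF
```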

Expected behavior

Autopilot successfully updates controller+worker nodes

Actual behavior

Autopilot gets into "deadlock" where the worker part on an updated controller does not start properly unless the updated node happens to also be the leader.

Screenshots and logs

No response

Additional context

No response

@makhov
Contributor

makhov commented May 20, 2024

I've noticed this error, which probably means that kubelet can't connect to kube-apiserver. It looks like kubelet won't be operational even if the worker-config configmap is present:

May 16 14:16:48 k0s-cluster-ctr2 k0s[1761293]: time="2024-05-16 14:16:48" level=info msg="E0516 14:16:48.793590 1761329 authentication.go:73] \"Unable to authenticate the request\" err=\"[invalid bearer token, service account token has been invalidated]\"" component=kube-apiserver stream=stderr
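
In case it helps others debugging, these errors can usually be found in the node's k0s logs. This assumes the default systemd install, where a controller+worker node runs under the k0scontroller unit:

```sh
# On the updated controller+worker node (default systemd unit name assumed)
journalctl -u k0scontroller --no-pager | grep -Ei 'worker-config|bearer token|Unable to authenticate'
```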

@twz123
Member

twz123 commented May 22, 2024

NB: This is not Autopilot doing something wrong; it is a general conceptual problem with updating HA clusters that use controller+worker nodes, which Autopilot runs into: the worker parts of newly added controller+worker nodes won't get ready, so Autopilot fails to proceed with the update.
