Autopilot with controller+worker roles fails to update #4450

Closed
4 tasks done
jnummelin opened this issue May 20, 2024 · 2 comments · Fixed by #4457
Comments

@jnummelin
Member

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version; "main" branch docs are usually ahead of released versions.

Platform

ALL

Version

1.30.0

Sysinfo

`k0s sysinfo`
➡️ Please replace this text with the output of `k0s sysinfo`. ⬅️

What happened?

When k0s is running with controller+worker nodes, autopilot fails to update. Say you have 3 controller+worker nodes. When autopilot updates the first one, it is very likely that the node will lose its leadership (if it was the leader in the first place).

When autopilot then updates the k0s binary and restarts it (via systemd/openrc), the controller part does start properly. The worker part will NOT start properly, as it does not find the worker-config-default-1.30 configmap. This is because the controller does not apply the new version of the CM, as it is not the leader. Hence we reach a "deadlock": autopilot cannot proceed because the first controller does not update successfully.

This doesn't happen 100% of the time: if one is lucky enough that autopilot first updates the controller that is the leader, and everything is fast enough that the node does not lose its leadership, the update will succeed.
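
A quick way to confirm the stuck state from the (still reachable) API server is to check whether the versioned worker profile ConfigMap exists and which controller currently holds a lease. This is only a sketch: the ConfigMap name comes from the report above, and the lease listing assumes k0s's usual per-controller leases in the kube-node-lease namespace.

```sh
# Does the worker profile ConfigMap for the new version exist yet?
kubectl -n kube-system get configmap worker-config-default-1.30

# Which controllers currently hold leases? (k0s normally keeps per-controller
# leases in kube-node-lease; adjust the namespace/names if yours differ)
kubectl -n kube-node-lease get leases
```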

Steps to reproduce

  1. Install 3 controller+worker nodes with version 1.29.4 (the exact starting version does not really matter AFAIK)
  2. Apply a Plan to update to 1.30.0 (a sketch of such a Plan is shown after this list)
  3. Observe the deadlock
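
For step 2, a Plan roughly along these lines can be used. This is a sketch based on my reading of the Autopilot docs, not a verified manifest: the plan id, download URL, sha256 and node names are placeholders that need to be adjusted for the actual environment.

```sh
# Apply a minimal Autopilot update Plan (placeholders: id, url, sha256, node names)
cat <<'EOF' | kubectl apply -f -
apiVersion: autopilot.k0sproject.io/v1beta2
kind: Plan
metadata:
  name: autopilot
spec:
  id: id-1.30.0-update
  timestamp: now
  commands:
    - k0supdate:
        version: v1.30.0+k0s.0
        platforms:
          linux-amd64:
            url: https://github.com/k0sproject/k0s/releases/download/v1.30.0+k0s.0/k0s-v1.30.0+k0s.0-amd64
            sha256: <sha256 of the k0s binary>
        targets:
          controllers:
            discovery:
              static:
                nodes:
                  - ctrl-worker-0
                  - ctrl-worker-1
                  - ctrl-worker-2
EOF
```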

Expected behavior

Autopilot successfully updates controller+worker nodes

Actual behavior

Autopilot gets into "deadlock" where the worker part on an updated controller does not start properly unless the updated node happens to also be the leader.

Screenshots and logs

No response

Additional context

No response

@makhov
Contributor

makhov commented May 20, 2024

I've noticed this error, which probably means that kubelet can't connect to kube-apiserver. It looks like kubelet won't be operational even if the worker-config configmap is present:

May 16 14:16:48 k0s-cluster-ctr2 k0s[1761293]: time="2024-05-16 14:16:48" level=info msg="E0516 14:16:48.793590 1761329 authentication.go:73] \"Unable to authenticate the request\" err=\"[invalid bearer token, service account token has been invalidated]\"" component=kube-apiserver stream=stderr
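
In case it helps others debugging, these errors can usually be found in the node's k0s logs. This assumes the default systemd install, where a controller+worker node runs under the k0scontroller unit:

```sh
# On the updated controller+worker node (default systemd unit name assumed)
journalctl -u k0scontroller --no-pager | grep -Ei 'worker-config|bearer token|Unable to authenticate'
```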

@twz123
Member

twz123 commented May 22, 2024

NB: This is not Autopilot doing something wrong; it is a general conceptual problem with updating HA clusters that use controller+worker nodes, which Autopilot runs into: the worker parts of newly added controller+worker nodes won't get ready, so Autopilot fails to proceed with the update.
