
Single node upgrades are broken by container_manager replacement logic #8609

Closed
cristicalin opened this issue Mar 6, 2022 · 1 comment · Fixed by #8662
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@cristicalin
Contributor

Environment:

  • Cloud provider or hardware configuration:
    Baremetal single node clusters.

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):

Linux 5.13.0-28-generic x86_64
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
  • Version of Ansible (ansible --version):
ansible [core 2.11.9] 
  config file = /root/kubespray/ansible.cfg
  configured module search path = ['/root/kubespray/library']
  ansible python module location = /root/venv/lib/python3.8/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/venv/bin/ansible
  python version = 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0]
  jinja version = 2.11.3
  libyaml = True
  • Version of Python (python --version):
Python 3.8.10

Kubespray version (commit) (git rev-parse --short HEAD):

0fc453fe

Network plugin used:

calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

[all]
ubuntu-nuc-00.kaveman.intra ansible_connection=local local_as=64512

[all:vars]
upgrade_cluster_setup=True
force_certificate_regeneration=True
etcd_kubeadm_enabled=True
download_container=False
peer_with_router=False

[kube_control_plane]
ubuntu-nuc-00.kaveman.intra

[kube_control_plane:vars]

[etcd:children]
kube_control_plane

[kube_node]
ubuntu-nuc-00.kaveman.intra

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

[k8s_cluster:vars]
kube_version=v1.22.7
calico_version=v3.22.0
helm_enabled=True
metrics_server_enabled=True
ingress_nginx_enabled=True
cert_manager_enabled=False
metallb_enabled=True
metallb_speaker_enabled=False
metallb_protocol=bgp
metallb_controller_tolerations=[{'effect':'NoSchedule','key':'node-role.kubernetes.io/master'},{'effect':'NoSchedule','key':'node-role.kubernetes.io/control-plane'}]
metallb_ip_range=["10.5.0.0/16"]
kube_proxy_strict_arp=True
kube_encrypt_secret_data=True
container_manager=containerd
kubernetes_audit=True
calico_datastore="kdd"
calico_iptables_backend="NFT"
calico_advertise_cluster_ips=True
calico_felix_prometheusmetricsenabled=True
calico_ipip_mode=Never
calico_vxlan_mode=Never
calico_advertise_service_loadbalancer_ips=["10.5.0.0/16"]
calico_ip_auto_method="interface=eno1"
kube_network_plugin_multus=True
kata_containers_enabled=False
runc_version=v1.1.0
typha_enabled=True
nodelocaldns_external_zones=[{'cache': 30,'zones':['kaveman.intra'],'nameservers':['192.168.0.1']}]
nodelocaldns_bind_metrics_host_ip=True
csi_snapshot_controller_enabled=True
deploy_netchecker=True
krew_enabled=True

Command used to invoke ansible:

ansible-playbook -i ../inventory.ini cluster.yml -vvv

Output of ansible run:

The Ansible run breaks by cordoning off the node, since the container-engine/validate-container-engine role invokes the remove-node/pre-remove role on every run. There appears to be an issue in the container_manager detection logic that triggers this action even though container_manager has not changed.
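
For reference, a minimal sketch of the kind of guard that could avoid this, assuming detection keys off the Docker service; the task layout, the docker_active variable, and the conditional are illustrative only, not the actual validate-container-engine implementation:

```yaml
# Hypothetical guard: only hand off to remove-node/pre-remove when the
# currently active runtime differs from the configured container_manager.
- name: Check whether a Docker engine is currently active
  ansible.builtin.command: systemctl is-active docker.service
  register: docker_active
  failed_when: false
  changed_when: false

- name: Cordon and drain the node only when switching away from Docker
  ansible.builtin.include_role:
    name: remove-node/pre-remove
  when:
    - docker_active.rc == 0
    - container_manager != 'docker'
```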

Anything else we need to know:

This container_manager replacement logic should be tested in CI to ensure it works properly and does not break existing deployments before we tag 2.19.

/cc @cyril-corbon

@cristicalin added the kind/bug label on Mar 6, 2022
@cristicalin
Contributor Author

Upon further diagnosis, it looks like the container manager detection logic may be at fault when an old container_manager has not been properly cleaned up. In my case I had docker.service files lying around from a previous install. I think a service check should also be performed to verify the service is actually running (see the sketch below).

Another symptom, which I was unable to reproduce, is that the cluster.yml run actually triggered a re-install of the Docker engine, which I cannot quite explain.
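
For illustration, a hedged sketch of such a service check (the docker_running fact name is made up for this example), so that leftover docker.service unit files alone do not count as an installed container manager:

```yaml
# Hypothetical check: treat Docker as present only if its unit is actually
# running, not merely because docker.service files are still on disk.
- name: Gather service facts
  ansible.builtin.service_facts:

- name: Consider Docker installed only when its service is running
  ansible.builtin.set_fact:
    docker_running: "{{ ansible_facts.services.get('docker.service', {}).get('state', '') == 'running' }}"
```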
