
Single node upgrades are broken by container_manager replacement logic #8609

Closed
cristicalin opened this issue Mar 6, 2022 · 1 comment · Fixed by #8662
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@cristicalin
Contributor

Environment:

  • Cloud provider or hardware configuration:
    Baremetal single node clusters.

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):

Linux 5.13.0-28-generic x86_64
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
  • Version of Ansible (ansible --version):
ansible [core 2.11.9] 
  config file = /root/kubespray/ansible.cfg
  configured module search path = ['/root/kubespray/library']
  ansible python module location = /root/venv/lib/python3.8/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/venv/bin/ansible
  python version = 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0]
  jinja version = 2.11.3
  libyaml = True
  • Version of Python (python --version):
Python 3.8.10

Kubespray version (commit) (git rev-parse --short HEAD):

0fc453fe

Network plugin used:

calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

[all]
ubuntu-nuc-00.kaveman.intra ansible_connection=local local_as=64512

[all:vars]
upgrade_cluster_setup=True
force_certificate_regeneration=True
etcd_kubeadm_enabled=True
download_container=False
peer_with_router=False

[kube_control_plane]
ubuntu-nuc-00.kaveman.intra

[kube_control_plane:vars]

[etcd:children]
kube_control_plane

[kube_node]
ubuntu-nuc-00.kaveman.intra

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

[k8s_cluster:vars]
kube_version=v1.22.7
calico_version=v3.22.0
helm_enabled=True
metrics_server_enabled=True
ingress_nginx_enabled=True
cert_manager_enabled=False
metallb_enabled=True
metallb_speaker_enabled=False
metallb_protocol=bgp
metallb_controller_tolerations=[{'effect':'NoSchedule','key':'node-role.kubernetes.io/master'},{'effect':'NoSchedule','key':'node-role.kubernetes.io/control-plane'}]
metallb_ip_range=["10.5.0.0/16"]
kube_proxy_strict_arp=True
kube_encrypt_secret_data=True
container_manager=containerd
kubernetes_audit=True
calico_datastore="kdd"
calico_iptables_backend="NFT"
calico_advertise_cluster_ips=True
calico_felix_prometheusmetricsenabled=True
calico_ipip_mode=Never
calico_vxlan_mode=Never
calico_advertise_service_loadbalancer_ips=["10.5.0.0/16"]
calico_ip_auto_method="interface=eno1"
kube_network_plugin_multus=True
kata_containers_enabled=False
runc_version=v1.1.0
typha_enabled=True
nodelocaldns_external_zones=[{'cache': 30,'zones':['kaveman.intra'],'nameservers':['192.168.0.1']}]
nodelocaldns_bind_metrics_host_ip=True
csi_snapshot_controller_enabled=True
deploy_netchecker=True
krew_enabled=True

Command used to invoke ansible:

ansible-playbook -i ../inventory.ini cluster.yml -vvv

Output of ansible run:

The Ansible run breaks by cordoning off the node, since the container-engine/validate-container-engine role invokes the remove-node/pre-remove role on every run. There appears to be an issue in the container_manager detection logic that triggers this action even though container_manager has not changed.
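
For reference, a minimal sketch of the kind of guard that could avoid this, assuming detection keys off the Docker service; the task layout, the docker_active variable, and the conditional are illustrative only, not the actual validate-container-engine implementation:

```yaml
# Hypothetical guard: only hand off to remove-node/pre-remove when the
# currently active runtime differs from the configured container_manager.
- name: Check whether a Docker engine is currently active
  ansible.builtin.command: systemctl is-active docker.service
  register: docker_active
  failed_when: false
  changed_when: false

- name: Cordon and drain the node only when switching away from Docker
  ansible.builtin.include_role:
    name: remove-node/pre-remove
  when:
    - docker_active.rc == 0
    - container_manager != 'docker'
```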

Anything else we need to know:

This container_manager replacement logic should be tested in CI to ensure it works properly and does not break existing deployments before we tag 2.19.

/cc @cyril-corbon

@cristicalin added the kind/bug label on Mar 6, 2022
@cristicalin
Contributor Author

Upon further diagnosis, it looks like the container manager detection logic may be at fault when an old container_manager has not been properly cleaned up. In my case I had docker.service files lying around from a previous install. I think a service check should also be performed to verify the service is actually running (see the sketch below).

Another symptom, which I was unable to reproduce, is that the cluster.yml run actually triggered a re-install of the Docker engine, which I cannot quite explain.
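
For illustration, a hedged sketch of such a service check (the docker_running fact name is made up for this example), so that leftover docker.service unit files alone do not count as an installed container manager:

```yaml
# Hypothetical check: treat Docker as present only if its unit is actually
# running, not merely because docker.service files are still on disk.
- name: Gather service facts
  ansible.builtin.service_facts:

- name: Consider Docker installed only when its service is running
  ansible.builtin.set_fact:
    docker_running: "{{ ansible_facts.services.get('docker.service', {}).get('state', '') == 'running' }}"
```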
