
Failure to rollback etcd datadir in case of errors during kubeadm inplace upgrade from 1.10.5 to 1.11.0 #65580

Closed
fgbreel opened this issue Jun 28, 2018 · 3 comments
Labels
area/kubeadm kind/bug Categorizes issue or PR as related to a bug. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments


fgbreel commented Jun 28, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
During an upgrade from v1.10.5 to v1.11.0, the etcd data rollback procedure of kubeadm does not delete /var/lib/etcd/member before copying the backup data back into /var/lib/etcd.

These are the last lines from the etcd container after the rollback procedure:

2018-06-28 10:39:47.809770 N | etcdserver/membership: updated the cluster version from 3.1 to 3.2
2018-06-28 10:39:47.809807 C | etcdserver/membership: cluster cannot be downgraded (current version: 3.1.12 is lower than determined cluster version: 3.2).

Because the member directory still contains data written by the newer etcd 3.2, the restarted 3.1.12 binary refuses to start. Manually removing /var/lib/etcd/member, copying the backup data from kubeadm-backup-etcd-2018-06-28-12-29-22/etcd into /var/lib/etcd, and deleting the etcd container makes the next etcd container (running the old version) stay alive.
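
Roughly, this is the manual recovery (a sketch; the backup path is taken from the kubeadm output below, and the docker command assumes etcd runs as a kubelet-managed Docker container, so adjust it for other runtimes):

# remove the member dir left behind by the failed etcd 3.2
rm -rf /var/lib/etcd/member
# restore the backup taken by kubeadm before the upgrade
cp -a /etc/kubernetes/tmp/kubeadm-backup-etcd-2018-06-28-12-29-22/etcd/. /var/lib/etcd/
# delete the running etcd container; the kubelet recreates it from the
# (already rolled back) static pod manifest with the old etcd version
docker rm -f $(docker ps -q --filter name=k8s_etcd)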

I believe the etcd upgrade failed in the first place because of a mistake on my part: I installed kubelet before executing kubeadm upgrade apply v1.11.0.

Mostly because, during the installation of kubeadm, the file /etc/systemd/system/kubelet.service.d/10-kubeadm.conf was replaced and kubelet started failing, leaving the node NotReady. (Maybe this is another bug.)

I had to copy /etc/systemd/system/kubelet.service.d/10-kubeadm.conf from another node and restart kubelet.
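
A rough sketch of that recovery (the host name k8s-node-0 is just an example; any node that still has the old drop-in works):

# copy the old drop-in back from a node that still has it (example host name)
scp root@k8s-node-0:/etc/systemd/system/kubelet.service.d/10-kubeadm.conf \
    /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# reload unit files and restart the kubelet so the node goes back to Ready
systemctl daemon-reload
systemctl restart kubelet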

This is the output of kubeadm:

root@k8s-master-0 /home/gfrancisco # kubeadm upgrade apply v1.11.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
I0628 12:29:20.391687   19474 feature_gate.go:230] feature gates: &{map[]}
[upgrade/apply] Respecting the --cri-socket flag that is set with higher priority than the config file.
[upgrade/version] You have chosen to change the cluster version to "v1.11.0"
[upgrade/versions] Cluster version: v1.10.5
[upgrade/versions] kubeadm version: v1.11.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler etcd]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.11.0"...
Static pod: kube-apiserver-k8s-master-0 hash: e2da7405c3d64b205e2cc70c4441024a
Static pod: kube-controller-manager-k8s-master-0 hash: 4dc08321fa06fbe902c2092ec1f31846
Static pod: kube-scheduler-k8s-master-0 hash: bba7685b70c1361fba54bcb8dcbaf72f
Static pod: etcd-k8s-master-0 hash: 339a6e903e445d5775cf37a441ad419d
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests751988143/etcd.yaml"
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing etcd/server certificate and key.
[certificates] Using the existing etcd/peer certificate and key.
[certificates] Using the existing etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2018-06-28-12-29-22/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
Static pod: etcd-k8s-master-0 hash: 339a6e903e445d5775cf37a441ad419d
Static pod: etcd-k8s-master-0 hash: 339a6e903e445d5775cf37a441ad419d
Static pod: etcd-k8s-master-0 hash: 14468a549576a1e43d0aa13ded97dd7b
[apiclient] Found 1 Pods for label selector component=etcd
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]
[upgrade/etcd] Waiting for previous etcd to become available
[util/etcd] Waiting 0s for initial delay
[util/etcd] Attempting to see if all cluster endpoints are available 1/10
[util/etcd] Attempt timed out
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 4/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 5/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 6/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 7/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 8/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 9/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 10/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[upgrade/etcd] Failed to healthcheck previous etcd: timeout waiting for etcd cluster to be available
[upgrade/etcd] Rolling back etcd data
[upgrade/etcd] Etcd data rollback successful
[upgrade/etcd] Waiting for previous etcd to become available
[util/etcd] Waiting 0s for initial delay
[util/etcd] Attempting to see if all cluster endpoints are available 1/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 4/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 5/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 6/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 7/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 8/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 9/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 10/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[upgrade/etcd] Failed to healthcheck previous etcd: timeout waiting for etcd cluster to be available
[upgrade/apply] FATAL: fatal error rolling back local etcd cluster manifest: timeout waiting for etcd cluster to be available, the backup of etcd database is stored here:(/etc/kubernetes/tmp/kubeadm-backup-etcd-2018-06-28-12-29-22)

What you expected to happen:
Successful rollback of the etcd data directory in case of failures during the upgrade.

How to reproduce it (as minimally and precisely as possible):

  1. dpkg -i kubectl_1.11.0-00_amd64.deb kubeadm_1.11.0-00_amd64.deb
  2. During the installation of kubeadm, the file /etc/systemd/system/kubelet.service.d/10-kubeadm.conf will be replaced and kubelet will start failing. (Maybe this is another bug.)
  3. Copy /etc/systemd/system/kubelet.service.d/10-kubeadm.conf from another node and restart kubelet.
  4. dpkg -i kubelet_1.11.0-00_amd64.deb # now I know this is a mistake, but it is needed to trigger the etcd data dir rollback :)
  5. kubeadm upgrade plan
  6. kubeadm upgrade apply v1.11.0

Anything else we need to know?:
This cluster has already been through the following upgrade path:

1.8.0 (ok) -> 1.8.3 (ok) -> 1.9.3 (ok) -> 1.9.6 (ok) -> 1.10.2 (ok) -> 1.10.5 (ok) -> 1.11.0 (okish)

I moved kube-proxy from iptables to IPVS when running 1.10.2.

I downgraded kubelet to v1.10.5, ran kubeadm upgrade apply v1.11.0 again, and it worked; my cluster is healthy.
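
For completeness, this is roughly the sequence that worked (a sketch; the 1.10.5 package file name is an assumption following the same convention as the 1.11.0 packages above, so adjust it to whatever your repository provides):

# put kubelet back on the old version before re-running the upgrade
dpkg -i kubelet_1.10.5-00_amd64.deb   # assumed file name, same pattern as above
systemctl restart kubelet
# re-run the upgrade once the node is Ready again
kubeadm upgrade apply v1.11.0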

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux k8s-master-0 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
  • Install tools:
kubeadm version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:14:41Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jun 28, 2018

fgbreel commented Jun 28, 2018

@kubernetes/sig-cluster-lifecycle-bugs

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 28, 2018
@k8s-ci-robot
Contributor

@fgbreel: Reiterating the mentions to trigger a notification:
@kubernetes/sig-cluster-lifecycle-bugs

In response to this:

@kubernetes/sig-cluster-lifecycle-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


luxas commented Jun 30, 2018

Hi, thanks for taking the time to create this issue 👋!
Please reopen this issue in https://github.com/kubernetes/kubeadm/issues.
Thank you!
