
Failure to rollback etcd datadir in case of errors during kubeadm inplace upgrade from 1.10.5 to 1.11.0 #65580

Closed
fgbreel opened this issue Jun 28, 2018 · 3 comments
Labels
area/kubeadm kind/bug Categorizes issue or PR as related to a bug. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments


fgbreel commented Jun 28, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
During an upgrade from v1.10.5 to v1.11.0, the etcd data rollback procedure of kubeadm does not delete /var/lib/etcd/member before copying the backup data back into /var/lib/etcd.

These are the last lines from the etcd container after the rollback procedure:

2018-06-28 10:39:47.809770 N | etcdserver/membership: updated the cluster version from 3.1 to 3.2
2018-06-28 10:39:47.809807 C | etcdserver/membership: cluster cannot be downgraded (current version: 3.1.12 is lower than determined cluster version: 3.2).

Because the member directory still contains data written by the newer etcd 3.2, the restarted 3.1.12 binary refuses to start. Manually removing /var/lib/etcd/member, copying the backup data from kubeadm-backup-etcd-2018-06-28-12-29-22/etcd into /var/lib/etcd, and deleting the etcd container makes the next etcd container (running the old version) stay alive.
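
Roughly, this is the manual recovery (a sketch; the backup path is taken from the kubeadm output below, and the docker command assumes etcd runs as a kubelet-managed Docker container, so adjust it for other runtimes):

# remove the member dir left behind by the failed etcd 3.2
rm -rf /var/lib/etcd/member
# restore the backup taken by kubeadm before the upgrade
cp -a /etc/kubernetes/tmp/kubeadm-backup-etcd-2018-06-28-12-29-22/etcd/. /var/lib/etcd/
# delete the running etcd container; the kubelet recreates it from the
# (already rolled back) static pod manifest with the old etcd version
docker rm -f $(docker ps -q --filter name=k8s_etcd)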

I believe the etcd upgrade failed in the first place because of a mistake on my part: I installed kubelet before executing kubeadm upgrade apply v1.11.0.

Mostly because, during the installation of kubeadm, the file /etc/systemd/system/kubelet.service.d/10-kubeadm.conf was replaced and kubelet started failing, leaving the node NotReady. (Maybe this is another bug.)

I had to copy /etc/systemd/system/kubelet.service.d/10-kubeadm.conf from another node and restart kubelet.
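
A rough sketch of that recovery (the host name k8s-node-0 is just an example; any node that still has the old drop-in works):

# copy the old drop-in back from a node that still has it (example host name)
scp root@k8s-node-0:/etc/systemd/system/kubelet.service.d/10-kubeadm.conf \
    /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# reload unit files and restart the kubelet so the node goes back to Ready
systemctl daemon-reload
systemctl restart kubelet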

This is the output of kubeadm:

root@k8s-master-0 /home/gfrancisco # kubeadm upgrade apply v1.11.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
I0628 12:29:20.391687   19474 feature_gate.go:230] feature gates: &{map[]}
[upgrade/apply] Respecting the --cri-socket flag that is set with higher priority than the config file.
[upgrade/version] You have chosen to change the cluster version to "v1.11.0"
[upgrade/versions] Cluster version: v1.10.5
[upgrade/versions] kubeadm version: v1.11.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler etcd]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.11.0"...
Static pod: kube-apiserver-k8s-master-0 hash: e2da7405c3d64b205e2cc70c4441024a
Static pod: kube-controller-manager-k8s-master-0 hash: 4dc08321fa06fbe902c2092ec1f31846
Static pod: kube-scheduler-k8s-master-0 hash: bba7685b70c1361fba54bcb8dcbaf72f
Static pod: etcd-k8s-master-0 hash: 339a6e903e445d5775cf37a441ad419d
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests751988143/etcd.yaml"
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing etcd/server certificate and key.
[certificates] Using the existing etcd/peer certificate and key.
[certificates] Using the existing etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2018-06-28-12-29-22/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
Static pod: etcd-k8s-master-0 hash: 339a6e903e445d5775cf37a441ad419d
Static pod: etcd-k8s-master-0 hash: 339a6e903e445d5775cf37a441ad419d
Static pod: etcd-k8s-master-0 hash: 14468a549576a1e43d0aa13ded97dd7b
[apiclient] Found 1 Pods for label selector component=etcd
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]
[upgrade/etcd] Waiting for previous etcd to become available
[util/etcd] Waiting 0s for initial delay
[util/etcd] Attempting to see if all cluster endpoints are available 1/10
[util/etcd] Attempt timed out
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 4/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 5/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 6/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 7/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 8/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 9/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 10/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[upgrade/etcd] Failed to healthcheck previous etcd: timeout waiting for etcd cluster to be available
[upgrade/etcd] Rolling back etcd data
[upgrade/etcd] Etcd data rollback successful
[upgrade/etcd] Waiting for previous etcd to become available
[util/etcd] Waiting 0s for initial delay
[util/etcd] Attempting to see if all cluster endpoints are available 1/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 2/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 3/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 4/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 5/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 6/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 7/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 8/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 9/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to see if all cluster endpoints are available 10/10
[util/etcd] Attempt failed with error: dial tcp 127.0.0.1:2379: connect: connection refused
[upgrade/etcd] Failed to healthcheck previous etcd: timeout waiting for etcd cluster to be available
[upgrade/apply] FATAL: fatal error rolling back local etcd cluster manifest: timeout waiting for etcd cluster to be available, the backup of etcd database is stored here:(/etc/kubernetes/tmp/kubeadm-backup-etcd-2018-06-28-12-29-22)

What you expected to happen:
Successful rollback of the etcd data directory in case of failures during the upgrade.

How to reproduce it (as minimally and precisely as possible):

  1. dpkg -i kubectl_1.11.0-00_amd64.deb kubeadm_1.11.0-00_amd64.deb
  2. During the installation of kubeadm, the file /etc/systemd/system/kubelet.service.d/10-kubeadm.conf will be replaced and kubelet will start failing. (Maybe this is another bug.)
  3. Copy /etc/systemd/system/kubelet.service.d/10-kubeadm.conf from another node and restart kubelet.
  4. dpkg -i kubelet_1.11.0-00_amd64.deb # now I know this is a mistake, but it is needed to trigger the etcd data dir rollback :)
  5. kubeadm upgrade plan
  6. kubeadm upgrade apply v1.11.0

Anything else we need to know?:
This cluster has already been through the following upgrade path:

1.8.0 (ok) -> 1.8.3 (ok) -> 1.9.3 (ok) -> 1.9.6 (ok) -> 1.10.2 (ok) -> 1.10.5 (ok) -> 1.11.0 (okish)

I moved kube-proxy from iptables to IPVS when running 1.10.2.

I downgraded kubelet to v1.10.5, ran kubeadm upgrade apply v1.11.0 again, and it worked; my cluster is healthy.
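
For completeness, this is roughly the sequence that worked (a sketch; the 1.10.5 package file name is an assumption following the same convention as the 1.11.0 packages above, so adjust it to whatever your repository provides):

# put kubelet back on the old version before re-running the upgrade
dpkg -i kubelet_1.10.5-00_amd64.deb   # assumed file name, same pattern as above
systemctl restart kubelet
# re-run the upgrade once the node is Ready again
kubeadm upgrade apply v1.11.0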

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux k8s-master-0 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
  • Install tools:
kubeadm version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:14:41Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jun 28, 2018

fgbreel commented Jun 28, 2018

@kubernetes/sig-cluster-lifecycle-bugs

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 28, 2018
@k8s-ci-robot
Contributor

@fgbreel: Reiterating the mentions to trigger a notification:
@kubernetes/sig-cluster-lifecycle-bugs

In response to this:

@kubernetes/sig-cluster-lifecycle-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


luxas commented Jun 30, 2018

Hi, thanks for taking the time to create this issue 👋!
Please reopen this issue in https://github.com/kubernetes/kubeadm/issues.
Thank you!
