ETCD cluster unhealthy during kubeadm upgrade to v1.22.5 #2682
@neolit123 and I have been discussing this issue over the kubeadm Slack channel.
slack discussion: do you have proof that
@neolit123 The trace below is from running
what do you see in the etcd logs for that member if the etcd manifest is at 3.5?
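A sketch of how those logs can be pulled on a docker-based node; the container ID below is a placeholder:

```bash
# Find the etcd container on the affected node (the kubelet names containers
# like k8s_etcd_etcd-<node>_kube-system_<uid>_<attempt>).
docker ps -a --filter name=k8s_etcd

# Tail the logs of the crash-looping container; <container-id> is a
# placeholder for the ID reported by the command above.
docker logs --tail 200 <container-id>
```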
Also, who is responsible for updating the static manifest file? I thought kubeadm upgrade handles it?
@neolit123 Will this step not set the manifest file back to 3.4.13?
This is from the kubeadm upgrade node failure log I pasted above.
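One quick way to check which image the manifest currently references, assuming the default kubeadm manifest path:

```bash
# Show which etcd image the static pod manifest currently references
# (default kubeadm manifest location).
grep ' image:' /etc/kubernetes/manifests/etcd.yaml
```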
These are the docker logs I could gather from the crashing etcd container on the third master. Note that the pod was not even running on that master after the failure; the container itself was being restarted continuously with the following trace
yes, it rolls back etcd to the old manifest if the 3.5 pod fails to start.
kubeadm upgrades etcd for you.
let's say you have 3 etcd members at 3.4.
upgrade on the first node will result in: 3.5, 3.4, 3.4.
after upgrade on the second node you will have: 3.5, 3.5, 3.4.
at that point only one member is at 3.4, but the minimum version of etcd is still at 3.4, because that's how etcd works.
once you call upgrade on the third node you will get: 3.5, 3.5, 3.5.
that would upgrade the etcd minimum version to 3.5 (NOTE: etcd does that internally, not kubeadm).
this is what i suspect happened, but the question is why the 3.5 etcd pod failed in the first place after the minimum version was set?
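A quick way to see each member's own server version and the cluster-wide minimum version is etcd's /version endpoint; a sketch assuming the default kubeadm certificate paths and placeholder member IPs:

```bash
# Print each member's own server version ("etcdserver") and the
# cluster-wide minimum version ("etcdcluster") described above.
# 10.0.0.1-3 are placeholder member IPs; certificate paths are the
# kubeadm defaults.
for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
  curl -s \
    --cacert /etc/kubernetes/pki/etcd/ca.crt \
    --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
    "https://${ip}:2379/version"
  echo
done
```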
this sequence of messages does not make sense. that sounds like an etcd bug.
if this happens the kubeadm rollback is incorrect, but we actually can't do much about it because the etcd minimum version is internal. as mentioned on slack you can manually update the 3rd node manifest to 3.5 and see if it starts, but if it still contains the errors related to downgrade and 3.4 that's very strange and sounds like an etcd database bug.
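A sketch of that manual bump, with placeholder image tags:

```bash
# On the 3rd control-plane node: bump the image tag in the etcd static
# pod manifest; <old-tag>/<new-tag> are placeholders (use the tag the two
# already-upgraded members are running). The kubelet recreates the pod
# automatically when the manifest changes.
sed -i 's|etcd:<old-tag>|etcd:<new-tag>|' /etc/kubernetes/manifests/etcd.yaml
```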
what you can try is to follow the standard etcd recovery procedure and delete the 3rd member's data:
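A sketch of the standard member-replacement flow, assuming the default kubeadm paths; the member ID, name and IPs are placeholders:

```bash
# Sketch only: placeholder IPs, member ID and name; kubeadm default certs.
etcdctl_k8s() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://10.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key "$@"
}

# 1. from a healthy control-plane node: drop the broken member
etcdctl_k8s member list
etcdctl_k8s member remove <member-id>

# 2. on the 3rd node: stop the static pod and wipe its data
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
rm -rf /var/lib/etcd/member

# 3. re-add it as a new member, then put the manifest back, making sure
#    --initial-cluster / --initial-cluster-state match the "member add" output
etcdctl_k8s member add <member-name> --peer-urls=https://10.0.0.3:2380
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
```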
see if the pod starts. not much else can be done here.
Thanks @neolit123. We will make a note of these steps and try them out if we hit this issue again. As I said, this is a very rare issue.
yeah, just comment here again if you see it. it's much more rare than 1/20. in our CI we create multiple HA clusters every 2 hours. |
What keywords did you search in kubeadm issues before filing this one?
This issue is similar to kubernetes/kubernetes#65580 raised some time back
Is this a BUG REPORT or FEATURE REQUEST? BUG REPORT
Versions
kubeadm version (use `kubeadm version`): 1.22.5
Environment:
- Kubernetes version (use `kubectl version`): 1.22.5
- Kernel (use `uname -a`): 4.18.0-305.25.1.el8_4.x86_64

What happened?
While performing a K8S upgrade from version 1.21 to 1.22 using kubeadm, we hit a failure: the etcd pod failed to come up on one of the masters with the following trace
What you expected to happen?
A successful K8S upgrade to 1.22.5, or a healthy etcd cluster in case of failure.
How to reproduce it (as minimally and precisely as possible)?
This issue is not hit consistently; we hit it in roughly 1 out of 20 upgrade runs.
Anything else we need to know?
All etcd pods were up and healthy before attempting the kubeadm upgrade; it is only during the kubeadm upgrade that the etcd pod on one of the masters fails to come up, with the trace pasted above.
This is what the kubeadm upgrade failure trace looks like
Before attempting the kubeadm upgrade, all etcd pods were in a healthy state.
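One way such a check can be run from a control-plane node (a sketch with placeholder endpoints and the default kubeadm certificate paths):

```bash
# Health check against all three members (placeholder IPs; kubeadm
# default certificate paths).
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health
```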
After the kubeadm upgrade, the etcd pod on etcd-node-11 failed to come up: the etcd cluster was already at version 3.5, and the etcd-node-11 pod kept crashing with the docker logs pasted above in this bug.