kube-aws v0.9.6-rc.2 - Unstable Etcd3 cluster #660
Comments
Might be caused by #640 |
I've confirmed that this is reproducible. Steps to reproduce, with etcd logs shipped to a pre-existing ELK stack (hope this helps with etcd failure analysis):
|
It appears that after this failure the cluster stays degraded. Presumably, the probability of going below quorum decreases with increasing (odd) cluster size... but for bug reproduction, a 3-node cluster will most likely exhibit the failure. So far, 3 out of 3 tests with fresh clusters have had the problem. |
Hi @trinitronx, thank you very much for the detailed feedback 👍 First of all, please let me sync with you on expectations; sorry if this release missed your expectations in that regard and made you frustrated. Back to the topic -
It does seem like you have been affected by #640, which has been fixed since v0.9.7-rc.1. |
@mumoshu: Thanks for the quick response! Ah, I suppose my assumption of "stable" was incorrect. This is definitely good to know! I did see that the etcd auto-recovery was only just added, so I did not assume it would work 100%. However, I did not expect the cluster to go unhealthy so fast in the first place, and was hoping not to need it ;-) We have been running an older Kubernetes cluster for a while, so I think it may make sense to test out the newer release. Hopefully my reproduction & ELK logging for the etcd nodes can provide some debugging value. Initially I thought that I would be able to debug the cluster myself, but I am seeing basically only the logs I posted. This time, quorum was kept, with 2 nodes coming back up. When I test the connection to the bad node, there are unexplained connection errors. Basically, what I've tried so far is in the steps I detailed above. I also tested the connection against each node:
Bad node result:
Good node result:
|
@trinitronx Thanks again for the info 👍 Just a quick response, but could you run the following on a bad node:
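The exact commands aren't shown here; a plausible set, based on the unit names that appear later in this thread, would be:

# status of the etcd member and the etcdadm helper units
$ systemctl status etcd-member etcdadm-reconfigure etcdadm-check etcdadm-save.timer
# recent journal output for the member and the recovery service
$ journalctl -u etcd-member --no-pager -n 200
$ journalctl -u etcdadm-reconfigure --no-pager -n 200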
|
@mumoshu So I did dig in a bit and checked these logs. It seems that the recovery service is not succeeding on this node. Here is a Gist with scrubbed output from the commands above:
SystemD Unit "status" output
Conclusion:
So, the bad node is not recovering on its own.

More Etcd connection debugging

On the bad node, etcd does not seem to respond to requests on the client port.
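A health check of this sort can be run against the member; a minimal sketch, assuming the TLS asset paths kube-aws places under /etc/ssl/certs (adjust to your cluster's actual certificate locations):

# hit the etcd client port directly (cert/key paths are assumptions)
$ curl --cacert /etc/ssl/certs/etcd-trusted-ca.pem \
       --cert /etc/ssl/certs/etcd-client.pem \
       --key /etc/ssl/certs/etcd-client-key.pem \
       https://127.0.0.1:2379/health
# or ask etcd itself via the v3 API
$ ETCDCTL_API=3 etcdctl \
    --cacert /etc/ssl/certs/etcd-trusted-ca.pem \
    --cert /etc/ssl/certs/etcd-client.pem \
    --key /etc/ssl/certs/etcd-client-key.pem \
    --endpoints https://127.0.0.1:2379 endpoint health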
Additional information

I do see symptoms of #640: when I list exited Docker containers, I see many of them piling up.
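For reference, a quick way to count and clean up the exited containers:

# count exited containers on the node
$ docker ps -aq --filter status=exited | wc -l
# remove them (the filter matches only exited containers)
$ docker rm $(docker ps -aq --filter status=exited)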
|
@mumoshu: Perhaps one last bit of helpful info: the bad etcd node has an IAM Role with this policy attached (private info scrubbed):
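For anyone wanting to pull the same information, a sketch (assumes the role name is read from the instance metadata and that the IAM calls are run from somewhere with IAM read access):

# on the node: find the role name backing the instance profile
$ curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
# from a workstation with IAM read access: list and dump the inline policies
$ aws iam list-role-policies --role-name <role-name-from-above>
$ aws iam get-role-policy --role-name <role-name-from-above> --policy-name <policy-name>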
|
Further Debugging

Today I was able to dig deeper into the S3 issue. I was able to find the exact aws s3 command that etcdadm was failing on.
I then ran that command on my system using:
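The literal command isn't shown here; based on the etcdadm log quoted in the PR below, it is an aws s3 listing of the instance's snapshot key, roughly along these lines (bucket, cluster name, and instance id are placeholders):

# the same existence check etcdadm performs, with CLI debug output enabled
$ aws s3 ls --debug s3://<bucket-name>/kube-aws/clusters/<cluster-name>/instances/<instance-uuid>/snapshot.db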
The output in debug mode shows that the IAM Role for this node is being used:
Testing Solution

So for some reason the default S3 bucket policy is not letting this instance read from the bucket. I experimented by adding a new, more permissive policy:
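The policy body isn't reproduced here; based on the fix described in the PR below (s3:ListBucket on the bucket ARN, s3:GetObject* on the cluster prefix), the more permissive statements looked roughly like this, attached as an extra inline policy (bucket, cluster, and role names are placeholders; the write permissions were already present):

$ cat <<'EOF' > /tmp/etcd-s3-read.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<bucket-name>"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject*"],
      "Resource": "arn:aws:s3:::<bucket-name>/kube-aws/clusters/<cluster-name>/*"
    }
  ]
}
EOF
# attach as an inline policy on the etcd nodes' IAM role
$ aws iam put-role-policy --role-name <etcd-role-name> \
    --policy-name etcd-s3-read --policy-document file:///tmp/etcd-s3-read.json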
Validation

Using this new policy, the etcd nodes are able to read from and write to this S3 location.
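A minimal read/write check from one of the nodes, assuming the same placeholder bucket layout as above:

# read side: list the cluster prefix and fetch the snapshot
$ aws s3 ls s3://<bucket-name>/kube-aws/clusters/<cluster-name>/
$ aws s3 cp s3://<bucket-name>/kube-aws/clusters/<cluster-name>/instances/<instance-uuid>/snapshot.db /tmp/snapshot.db
# write side: push a test object from stdin and remove it again
$ echo test | aws s3 cp - s3://<bucket-name>/kube-aws/clusters/<cluster-name>/write-test
$ aws s3 rm s3://<bucket-name>/kube-aws/clusters/<cluster-name>/write-test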
After rebooting all 3 nodes, 2 of them were able to restore successfully:
1 Node Restore Problem

The third one did not. Here are its logs. It seems that all 3 of the nodes restore from the same snapshot.db.
@mumoshu: Is this the intended operation of the recovery process? After rebooting this node a couple of times and restarting all of the etcd-related units, it still has not recovered. |
Edit: Updated step-by-step after some testing.

I most likely ran into this issue this weekend. One of my etcd instances (the only one that survived, anyway) had 20k+ stopped containers, so I manually added a workaround for that.

Just to add more information to the table, I'll explain what happened on my end. I'm running a three-node etcd cluster, and one of the instances suddenly started to consume 100% CPU (the same thing happened later with another node, but luckily I had time to recover the first failing node before the other one went down). I was unable to log into it via SSH to see what was happening (most likely due to the high load), so I terminated the instance (thinking back, it would probably have been better to enable termination protection on the instance and reboot it, but I guess I'll never know by now). Then, when the new instance came up, I saw all these rafthttp errors, causing the new instance to be unable to join the cluster. After many hours of trial and error, this is how I managed to restore the cluster:

# on some working etcd node, remove and re-add the member
$ etcdctl member remove <old-member-id>
$ etcdctl member add etcd<N> https://<hostname>:2380
# on the new etcd node, stop etcd-related services and clean up the node data
$ sudo systemctl stop etcd etcd2 etcdadm-save.timer etcd-member etcdadm-reconfigure etcdadm-update-status etcdadm-check
$ sudo rm -R /var/lib/etcd2/*
# replace the values of ETCD_INITIAL_CLUSTER and ETCD_INITIAL_CLUSTER_STATE by the ones returned by the `member add` command and re-start the etcd-member service
$ sudo vi /etc/etcd-environment
$ sudo systemctl start etcd-member

The step-by-step might not be 100% accurate as I'm not an etcd expert myself but, if I remember correctly, this is how I managed to restore the two failing members on my cluster. |
Adds 's3:ListBucket' permission explicitly to statement with only bucket ARN. Adds 's3:GetObject*' permission to statement with s3:prefix and bucket ARN. Should fix etcd snapshot.db recovery process.

This should fix S3 permission denied errors during etcdadm-reconfigure such as:

etcdadm[28736]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: checking existence of s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7
etcdadm[28736]: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

Fixes kubernetes-retired#660
@danielfm: Thanks for the helpful procedure to restore etcd. If this happens again I'll be sure to try this. |
Update

I've made the changes I mentioned above to the S3 bucket policy in the pull request referenced above.
|
@trinitronx Thank you very much for your efforts 🙇 I'm trying to understand - so probably we had been missing an IAM permission for checking the existence of snapshot.db, which resulted in the inability to recover a permanently failed etcd node from the snapshot? // Too bad that I missed such a defect in my manual testing 😢 |
Yes. Currently, every etcd node is recovered from the same snapshot during a disaster recovery process, so that the second and following etcd nodes won't need to catch up on a huge amount of data from the etcd leader. The comments in the etcdadm script should help explain the details. Of course, any questions are always welcome 👍 |
@mumoshu: Just wanted to check back in to give some battle-testing results for the cluster so far. We have not seen any more failures for the etcd nodes in this cluster, and they have been staying up this week. However, we are seeing some strange behavior when trying to interact with the cluster via kubectl and the Kubernetes API.
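Typical checks for this kind of intermittent failure look like the following (a sketch; the endpoint name is a placeholder):

# quick sanity checks against the API server
$ kubectl get componentstatuses
$ kubectl get nodes
# hit the API server's health endpoint directly through the ELB
$ curl -k https://<apiserver-elb-dns-name>/healthz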
When hitting a Kubernetes API endpoint, we sometimes get errors.

This seems to be intermittent; sometimes we can hit the endpoint successfully. When it fails, it returns:
Other times, the same API endpoint returns a successful response. So there is something still causing intermittent API failures. Still not sure what the root cause is, but I thought I'd provide the feedback. |
Update:

Seems that there is some issue on the controller nodes.
Solution (for now)

Going onto the master nodes and issuing:
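The exact command isn't shown here; a plausible kick on a kube-aws controller, assuming the usual flanneld/docker/kubelet units, would be something like:

# restart the networking and kubelet stack on the controller
# (unit names are assumptions; note this restarts all containers on the node)
$ sudo systemctl restart flanneld docker kubelet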
|
I've noticed that with our cluster there was also a problem with some of the other components. Here were the error messages in their container logs:

Worker nodes
Controller nodes
|
Update

@mumoshu: Just to loop back in on this one... We have been implementing AWS NACLs as a separate task in our VPC, and it turns out that we were blocking too much at one point. This is what caused some of our intermittent issues. However, this also brings up an unhandled case in the etcd recovery logic. I believe, after most of this testing, that the current cluster is not tolerant of full network-level failures and at least requires manual intervention to kickstart recovery of etcd. The logged message was strange in that it kept reporting how many more nodes were required until the quorum was met.
This message was repeatedly looped each time the recovery script ran. We were very confused by this, so we added some debug logging to the script to check the values going into the equation:

else
# At least N/2+1 members are NOT working
local running_num
local remaining_num
local total_num
running_num=$(cluster_num_running_nodes)
total_num=$member_count
remaining_num=$(( quorum - running_num + 1 ))
### BEGIN ADDED DEBUG LOGGING
_info "${running_num} nodes are running"
_info "${quorum} nodes required for quorum"
_info "${total_num} nodes total"
### END ADDED DEBUG LOGGING
_info "${remaining_num} more nodes are required until the quorum is met"
if (( remaining_num >= 2 )); then
member_set_unit_type simple
else
member_set_unit_type notify
fi

The resulting log output was:
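(The actual log lines aren't reproduced here. For a 3-node cluster with quorum = 2, the remaining_num arithmetic above works out as follows:)

# remaining_num = quorum - running_num + 1, with member_count = 3 and quorum = 2:
#   running_num = 0  ->  remaining_num = 3  ->  unit type "simple"
#   running_num = 1  ->  remaining_num = 2  ->  unit type "simple"
#   running_num = 2  ->  remaining_num = 1  ->  unit type "notify"
#   running_num = 3  ->  remaining_num = 0  ->  unit type "notify"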
So this means:
This seems to mean that it calls one of the member_set_unit_type branches above. But the script was not able to recover the entire cluster from this state on its own without intervention. We ended up restarting everything as I detailed in the 2 previous comments above (here and here). Additionally, we have still been unable to recover a single etcd node, which appears to be a separate issue. This one is restoring from snapshot, but once restored it still does not come back healthy. I hope this extra information is helpful when working on stabilizing this new etcd recovery feature! At least this appears to be reproducible by:
|
Hello, we have been testing out kube-aws and had started a cluster using kube-aws v0.9.6-rc.2, started on Apr 17 15:46:43 GMT-600 2017. Since then, we have seen multiple failures & degradation events for the Etcd3 cluster, without any action or change required to cause this.

It appears that for some reason, the AutoScalingGroup for each etcd node is reporting errors similar to the following for instances such as i-01162861a56c5fb21.

This has happened a couple of times for each node, resulting in times when, with a 3-node cluster, it was put into a degraded state. After this point, kube-apiserver is unable to contact etcd, and any changes to the cluster requiring kube-apiserver access or kubectl commands will not work.

Here are the details on each AutoScaling Group as shown in the EC2 Console (a CLI sketch for pulling the same activity errors follows this list):

Etcd0
Etcd1
Etcd2
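As referenced above, the scaling-activity errors for each etcd ASG can be pulled with the AWS CLI (a sketch; ASG names are placeholders):

# find the etcd ASGs created by the kube-aws stack
$ aws autoscaling describe-auto-scaling-groups \
    --query 'AutoScalingGroups[?contains(AutoScalingGroupName, `Etcd`)].AutoScalingGroupName'
# show recent scaling activities (including failures) for one of them
$ aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name <etcd0-asg-name> --max-items 20 \
    --query 'Activities[].[StatusCode,StartTime,Cause]'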