kube-aws v0.9.6-rc.2 - Unstable Etcd3 cluster #660

trinitronx · 2017-05-17T19:59:24Z

Hello, we have been testing out kube-aws and had started a cluster using kube-aws v0.9.6-rc.2 started on Apr 17 15:46:43 GMT-600 2017. Since then, we have seen multiple failures & degradation events for the Etcd3 cluster, without any action or change required to cause this.

It appears that for some reason, the AutoScalingGroup for each Etcd node is reporting errors similar to the following:


Description:	Terminating EC2 instance: `i-01162861a56c5fb21`
Cause:	At 2017-05-03T00:39:05Z an instance was taken out of service in response to a EC2 instance status checks failure.

This has happened a couple times for each node, resulting in times when with a 3 node cluster it was put into a degraded state. After this point, kube-apiserver is unable to contact Etcd, and any changes to the cluster requiring kube-apiserver access or kubectl commands will not work.

Here are the details on each AutoScaling Group as shown in the EC2 Console:

Etcd0

Etcd1

Etcd2

The text was updated successfully, but these errors were encountered:

redbaron · 2017-05-17T20:12:33Z

Might be caused by #640

trinitronx · 2017-05-31T23:04:23Z

I've confirmed that this is reproducible on kube-aws v0.9.6 "stable".

Steps to reproduce with EtcD logs shipped to a Pre-existing ELK stack (hope this helps in Etcd failure analysis):

Make sure your ElasticSearch nodes are contactable within a pre-existing VPC
Use this cluster.yaml fill in your VPC & custom cluster details, follow normal kube-aws up process (using v.0.9.6 stable):

kube-aws render credentials --generate-ca
kube-aws render stack
kube-aws up --export --s3-uri s3://etcd-fail-example/
kube-aws validate --s3-uri s3://etcd-fail-example/
kube-aws up --s3-uri s3://etcd-fail-example/

Let cluster come up, watch logs come in to your ELK stack
Leave cluster alone doing absolutely nothing for about 3 days
Come back later to an unhealthy Etcd3 cluster

Check the ELK logs. You will probably see:

 2017-05-31 21:16:56.742308 E | rafthttp: failed to find member 123abc456badbeef in cluster fffabc123badcafe
 ### [...SNIP...]  Many of this  ^^^^^ error repeatedly
 2017-05-31 22:17:50.356072 W | rafthttp: failed to process raft message (raft: stopped)

In AWS AutoScalingGroup history page you will most likely see "Terminating EC2 instance" for all 3 nodes at some time in the past 3 days.

Try checking EtcD3 cluster health by connecting via SSH to one of the nodes, then run:

 source /var/run/coreos/etcdadm-environment
 docker run -ti  -e ETCDCTL_API=3 -v /etc/ssl/certs:/etc/ssl/certs:ro  quay.io/coreos/etcd:v3.1.7 \
     etcdctl   --endpoints="$ETCD_ENDPOINTS" --cert=$ETCDCTL_CERT_FILE --key=$ETCDCTL_KEY_FILE --cacert=$ETCDCTL_CACERT \
                 endpoint health
 
 ## You will probably see:
 2017-05-31 23:02:09.548464 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
 2017-05-31 23:02:09.549244 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
 2017-05-31 23:02:09.549949 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
 https://ec2-12-123-231-12.compute-1.amazonaws.com:2379 is healthy: successfully committed proposal: took = 59.700185ms
 https://ec2-55-66-123-231.compute-1.amazonaws.com:2379 is healthy: successfully committed proposal: took = 164.421588ms
 https://ec2-77-88-123-1.compute-1.amazonaws.com:2379 is unhealthy: failed to connect: grpc: timed out when dialing

trinitronx · 2017-05-31T23:08:03Z

It appears that after this failure, the etcdadm-reconfigure.service tries to recover the cluster. It may or may not be successful depending on the timing of the node failures & terminations. I have had this happen where the cluster goes below quorum and is non-recoverable by the automatic process, and also where 2 nodes are still healthy.

Presumably, the probability of going below quorum decreases with increasing odd cluster size... but for bug reproduction a 3 node cluster will most likely exhibit the failure. So far 3 out of 3 tests with fresh clusters have had the problem.

mumoshu · 2017-05-31T23:29:10Z

Hi @trinitronx, thank you very much for the detailed feedback 👍

First of all, please let me sync.
v0.9.6 is not a "stable" version in regard to the fact that it is the first version introduced the automatic disaster recovery feature. I'd say it just "v0.9.6 final".
Also, kube-aws is a community-maintained project. Many of its active maintainers/contributors, including me, do run their production clusters with kube-aws but that doesn't mean every "final" release has every feature "stabilized". I won't say a feature "stable" until someone battle-test it.

Sorry if it missed your expectation in that regard and made you frustrated.
However, a detailed feedback like yours is SUPER HELPFUL to improve kube-aws!
Thank you very much for sharing your hard-won experience 🙇

Backing to the topic -

you will most likely see "Terminating EC2 instance" for all 3 nodes at some time in the past 3 days.

It does seem like you have affected by #640, which has been fixed since v0.9.7-rc.1.
If I could ask you for more, would you mind confirming if it actually fix your case?

trinitronx · 2017-05-31T23:54:56Z

@mumoshu : Thanks for the quick response! Ah, I suppose my assumption of "stable" was incorrect. This is definitely good to know! I did see that the etcd auto-recovery was just added, so I did not assume this would work 100%. However, I did not expect the cluster to go unhealthy so fast in the first place, and was hoping not to need it ;-)

We have been running an older kubernetes cluster with etcd2 and luckily have never had to touch the etcd2 nodes at all. I suppose this implementation is different enough, and uses etcd3, so my initial assumption based on prior experience was definitely not correct.

I think it may make sense to test out this v0.9.7-rc.1, although I had thought that this fix was in v0.9.6 (also wrong assumption).

Hopefully my reproduction & ELK logging for Etcd nodes can provide some debugging value. Initially I thought that I would be able to debug the cluster myself, but I am seeing basically the logs I posted.

This time, the quorum was kept with 2 nodes coming back healthy. The third node was not able to auto-recover, and is logging the rafthttp errors I mentioned above.

When I test connection to this node, there are unexplained connection errors. I try to connect to etcd on port 2379 of the unhealthy node public hostname from $ETCD_ENDPOINTS, even from localhost or local EC2 VPC private IP on that same node. Tried with both curl and etcdctl, same result just a connection timeout. So this appears very odd because all nodes are still in the same SecurityGroup, and indeed no SG should apply for localhost connection. I haven't had much experience with a deep-dive debug into Etcd, so any help on where to start there would be appreciated.

Basically what I've tried so far is in the steps I detailed above. I also have tested out this curl command on the bad node, and on other healthy nodes:

source /var/run/coreos/etcdadm-environment
curl  -vv --cert $ETCDCTL_CERT_FILE --key $ETCDCTL_KEY_FILE --cacert $ETCDCTL_CACERT -X GET https://127.0.0.1:2379/metrics

Bad node result:

curl  -vv --cert $ETCDCTL_CERT_FILE --key $ETCDCTL_KEY_FILE --cacert $ETCDCTL_CACERT -X GET https://10.100.2.111:2379/metrics
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 10.100.2.111...
* TCP_NODELAY set
* Connected to 10.100.2.111 (10.100.2.111) port 2379 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca.pem
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to 10.100.2.111:2379
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to 10.100.2.111:2379

Good node result:

curl  -vv --cert $ETCDCTL_CERT_FILE --key $ETCDCTL_KEY_FILE --cacert $ETCDCTL_CACERT -X GET https://ec2-12-34-123-321.compute-1.amazonaws.com:2379/metrics
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 10.100.2.123...
* TCP_NODELAY set
* Connected to ec2-12-34-123-321.compute-1.amazonaws.com (10.100.2.123) port 2379 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca.pem
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Request CERT (13):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS handshake, CERT verify (15):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=kube-aws; CN=kube-etcd
*  start date: May 23 23:27:50 2017 GMT
*  expire date: May 23 23:27:51 2018 GMT
*  subjectAltName: host "ec2-12-34-123-321.compute-1.amazonaws.com" matched cert's "*.compute-1.amazonaws.com"
*  issuer: O=kube-aws; CN=kube-ca
*  SSL certificate verify ok.
> GET /metrics HTTP/1.1
> Host: ec2-12-34-123-321.compute-1.amazonaws.com:2379
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Length: 29606
< Content-Type: text/plain; version=0.0.4
< Date: Wed, 31 May 2017 23:47:15 GMT
<
[... SNIP ...]
[... PROMETHEUS METRICS HERE ...]

mumoshu · 2017-06-01T00:10:34Z

@trinitronx Thanks again for the info 👍

Just a quick response but could you run the followings on a bad node:

journalctl -u etcdadm-reconfigure
journalctl -u etcd-member
systemctl status etcdadm-reconfigure
systemctl status etcd-member

etcd-member is the systemd unit that runs etcd3 in a rkt container, which should listen on the port 2379 and respond according to your request if it it is actually running. So, my guess is that it isn't running at all and running journalctl and systemctl status on the etcd-member unit would reveals if my guess is true or not.

etcdadm-reconfigure, as you've seen unfortunately, may be causing the instability of etcd-member. It is a supplemental systemd service to dynamically and periodically configure etcd-member service accordingly to the current state of an etcd cluster. If it failed uselessly, etcd-member would never come up.

trinitronx · 2017-06-01T00:51:50Z

@mumoshu So I did dig in a bit and check these logs. It seems that the recovery service etcdadm-reconfigure is trying to recover, but is stuck on an S3 access denied error.

Here is a Gist with scrubbed output from etcdadm-reconfigure JournalD logs. Basically in summary it is trying to recover by deleting the data dir, then failing with:

An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

SystemD Unit "status" output

core@ip-10-100-2-111 ~ $ systemctl status etcdadm-reconfigure
● etcdadm-reconfigure.service - etcdadm reconfigure runner
   Loaded: loaded (/etc/systemd/system/etcdadm-reconfigure.service; enabled; vendor preset: disabled)
   Active: active (exited) since Wed 2017-05-31 22:17:44 UTC; 2h 25min ago
  Process: 29453 ExecStartPost=/usr/bin/systemctl start etcdadm-check.timer (code=exited, status=0/SUCCESS)
 Main PID: 28736 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 32768)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/etcdadm-reconfigure.service

May 31 22:17:42 ip-10-100-2-111.ec2.internal etcdadm[28736]: /opt/bin/etcdadm: info: member_reconfigure: cluster is already healthy but still in bootstrap process after the disaster recovery. searching for a etcd snapshot to rec
May 31 22:17:42 ip-10-100-2-111.ec2.internal etcdadm[28736]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: checking existence of s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7
May 31 22:17:43 ip-10-100-2-111.ec2.internal etcdadm[28736]: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
May 31 22:17:44 ip-10-100-2-111.ec2.internal etcdadm[28736]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etc
May 31 22:17:44 ip-10-100-2-111.ec2.internal etcdadm[28736]: /opt/bin/etcdadm: info: member_bootstrap: remote snapshot for etcd1 does not exist. skipped downloading
May 31 22:17:44 ip-10-100-2-111.ec2.internal etcdadm[28736]: /opt/bin/etcdadm: info: member_local_snapshot_exists: checking existence of file /var/run/coreos/etcdadm/snapshots/etcd1.db
May 31 22:17:44 ip-10-100-2-111.ec2.internal etcdadm[28736]: /opt/bin/etcdadm: info: member_bootstrap: backup not found. starting brand new etcd1...
May 31 22:17:44 ip-10-100-2-111.ec2.internal etcdadm[28736]: /opt/bin/etcdadm: info: member_set_initial_cluster_state: setting initial cluster state to: new
May 31 22:17:44 ip-10-100-2-111.ec2.internal etcdadm[28736]: /opt/bin/etcdadm: info: _systemctl_daemon_reload: running `systemctl daemon-reload` to reload etcd-member.service
May 31 22:17:44 ip-10-100-2-111.ec2.internal systemd[1]: Started etcdadm reconfigure runner.

core@ip-10-100-2-111 ~ $ systemctl status etcd-member
● etcd-member.service - etcd (System Application Container)
   Loaded: loaded (/usr/lib/systemd/system/etcd-member.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/etcd-member.service.d
           └─20-aws-cluster.conf, 30-unit-type.conf, 40-version.conf
   Active: activating (start) since Wed 2017-05-31 22:17:44 UTC; 2h 28min ago
     Docs: https://github.com/coreos/etcd
  Process: 29494 ExecStartPre=/usr/bin/chown -R etcd:etcd /var/lib/etcd2 (code=exited, status=0/SUCCESS)
  Process: 29490 ExecStartPre=/usr/bin/systemctl is-active decrypt-assets.service (code=exited, status=0/SUCCESS)
  Process: 29487 ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service (code=exited, status=0/SUCCESS)
  Process: 29466 ExecStartPre=/usr/bin/rkt rm --uuid-file=/var/lib/coreos/etcd-member-wrapper.uuid (code=exited, status=0/SUCCESS)
  Process: 29456 ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos (code=exited, status=0/SUCCESS)
 Main PID: 29496 (etcd)
    Tasks: 6 (limit: 32768)
   Memory: 30.9M
      CPU: 29min 2.035s
   CGroup: /system.slice/etcd-member.service
           └─29496 /usr/local/bin/etcd

Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.700705 E | rafthttp: failed to find member 123abc456badbeef in cluster fffabc123badcafe
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.702247 E | rafthttp: failed to find member 123abc456badbeef in cluster fffabc123badcafe
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.703720 W | rafthttp: failed to process raft message (raft: stopped)
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.836365 W | rafthttp: failed to process raft message (raft: stopped)
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.836583 E | rafthttp: failed to find member 5555abcdef000000 in cluster fffabc123badcafe
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.836678 E | rafthttp: failed to find member 5555abcdef000000 in cluster fffabc123badcafe
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.884561 E | rafthttp: failed to find member 123abc456badbeef in cluster fffabc123badcafe
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.893164 E | rafthttp: failed to find member 123abc456badbeef in cluster fffabc123badcafe
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.894621 W | rafthttp: failed to process raft message (raft: stopped)
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.895131 W | rafthttp: failed to process raft message (raft: stopped)
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.970338 E | rafthttp: failed to find member 5555abcdef000000 in cluster fffabc123badcafe
Jun 01 00:46:11 ip-10-100-2-111.ec2.internal etcd-wrapper[29496]: 2017-06-01 00:46:11.971201 E | rafthttp: failed to find member 5555abcdef000000 in cluster fffabc123badcafe

Conclusion:

etcd-member service is up, but in a non-recovered state due to S3 permission error when etcdadm-reconfigure tried to download the snapshot backup.

So, etcd is indeed up but somehow in a bad state. Additionally, connecting to the API on this unhealthy node is not working.

More Etcd connection debugging

On the bad node, etcd is supposedly listening on that port:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp      129      0 10.100.2.123:2379        0.0.0.0:*               LISTEN      29496/etcd
tcp        0      0 10.100.2.123:2380        0.0.0.0:*               LISTEN      29496/etcd

Does not seem to respond to a telnet connect either:

core@ip-10-100-2-123 ~ $ /usr/bin/toolbox
Downloading sha256:2bf01635e2a [=============================] 72.9 MB / 72.9 MB
successfully removed aci for image: "sha512-42c13efebbc9ba9bd6324f07293c6eadb424042564821024eca129748e8f6054"
rm: 1 image(s) successfully removed
Spawning container core-fedora-24 on /var/lib/toolbox/core-fedora-24.
Press ^] three times within 1s to kill container.
[root@ip-10-100-2-123 ~]#
[root@ip-10-100-2-123 ~]#
[root@ip-10-100-2-123 ~]# dnf -y install telnet
Fedora 24 - x86_64 - Updates                                      22 MB/s |  23 MB     00:01
Fedora 24 - x86_64                                                23 MB/s |  47 MB     00:02
Last metadata expiration check: 0:01:06 ago on Thu Jun  1 00:10:21 2017.
Dependencies resolved.
=================================================================================================
 Package             Arch                Version                      Repository            Size
=================================================================================================
Installing:
 telnet              x86_64              1:0.17-66.fc24               updates               67 k

Transaction Summary
=================================================================================================
Install  1 Package

Total download size: 67 k
Installed size: 113 k
Downloading Packages:
telnet-0.17-66.fc24.x86_64.rpm                                   496 kB/s |  67 kB     00:00
-------------------------------------------------------------------------------------------------
Total                                                            171 kB/s |  67 kB     00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Installing  : telnet-1:0.17-66.fc24.x86_64                                                 1/1
  Verifying   : telnet-1:0.17-66.fc24.x86_64                                                 1/1

Installed:
  telnet.x86_64 1:0.17-66.fc24

Complete!

[root@ip-10-100-2-123 ~]#
[root@ip-10-100-2-123 ~]#
[root@ip-10-100-2-123 ~]#  telnet 10.100.2.123 2379
Trying 10.100.2.123...
Connected to 10.100.2.123.
Escape character is '^]'.
GET /

[... NO RESPONSE ...]
[... Maybe only SSL handshake is expected? ...]
[... Ctrl+D ...]

core@ip-10-100-2-123 ~ $ echo 'TOOLBOX_BIND="--bind=/:/media/root --bind=/usr:/media/root/usr --bind=/run:/media/root/run --bind=/etc/ssl/certs:/etc/ssl/certs"' >> .toolboxrc
core@ip-10-100-2-123 ~ $ toolbox 
[root@ip-10-100-2-123 ~]# dnf -yq install openssl

[root@ip-10-100-2-123 ~]# openssl s_client -connect 10.100.2.123:2379 -CAfile /etc/ssl/certs/ca.pem -cert /etc/ssl/certs/etcd-client.pem -key /etc/ssl/certs/etcd-client-key.pem
CONNECTED(00000003)
GET /

write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 201 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1496276787
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---

Additional information

I do see symptoms of #640 when I list exited docker containers I see many exited old awscli containers:

core@ip-10-100-2-123 ~ $ sudo docker ps -a  | grep -c awscli
3369

trinitronx · 2017-06-01T01:01:59Z

@mumoshu : Perhaps last bit of helpful info:

On the bad etcd node, it has an IAM Role with this policy attached (private info scrubbed):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "kms:Decrypt",
      "Resource": "arn:aws:kms:<AWS_REGION_HERE>:<AWS_ACCOUNT_ID_HERE>:key/<KMS_KEY_ID_HERE>",
      "Effect": "Allow"
    },
    {
      "Action": "ec2:DescribeTags",
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": "ec2:DescribeVolumes",
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": "ec2:AttachVolume",
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": "ec2:DescribeVolumeStatus",
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": "ec2:AssociateAddress",
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::test-etcd-fail-example/kube-aws/clusters/etcd-fail/exported/stacks/control-plane/userdata-etcd*",
      "Effect": "Allow"
    },
    {
      "Condition": {
        "StringLike": {
          "s3:prefix": "/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/*"
        }
      },
      "Action": [
        "s3:List*"
      ],
      "Resource": "arn:aws:s3:::test-etcd-fail-example",
      "Effect": "Allow"
    },
    {
      "Action": [
        "s3:*"
      ],
      "Resource": "arn:aws:s3:::test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/*",
      "Effect": "Allow"
    },
    {
      "Action": "ec2:DescribeInstances",
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}

trinitronx · 2017-06-03T01:05:41Z

Further Debugging

Today I was able to dig deeper into the S3 issue. I was able to find the docker run command that was running awscli with the following:

source /var/run/coreos/etcdadm-environment
export $(cat  /var/run/coreos/etcdadm-environment  /etc/etcd-environment | sed -e 's/\([A-Z_0-9]+\{0,\}\)=\(.*\)/\1/g'  | sort | uniq | tr '\n' ' ')
sudo -E  /opt/bin/etcdadm  _docker_awscli_command | tr '\n' ' '

I then ran that command on my system using:

docker run -e AWS_DEFAULT_REGION=us-east-1 --net=host --volume /var/run/coreos:/var/run/coreos quay.io/coreos/awscli aws s3 ls s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/

An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

The output in debug mode shows that the IAM Role for this node is being used:

$ docker run -e AWS_DEFAULT_REGION=us-east-1 --net=host --volume /var/run/coreos:/var/run/coreos quay.io/coreos/awscli aws s3 ls s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/ --debug


2017-06-03 00:38:32,376 - MainThread - awscli.clidriver - DEBUG - CLI version: aws-cli/1.11.15 Python/2.7.12 Linux/4.9.24-coreos botocore/1.4.72
2017-06-03 00:38:32,376 - MainThread - awscli.clidriver - DEBUG - Arguments entered to CLI: ['s3', 'ls', 's3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/', '--debug']
2017-06-03 00:38:32,377 - MainThread - botocore.hooks - DEBUG - Event session-initialized: calling handler <function add_scalar_parsers at 0x7f8314242a28>
2017-06-03 00:38:32,377 - MainThread - botocore.hooks - DEBUG - Event session-initialized: calling handler <function inject_assume_role_provider_cache at 0x7f83146468c0>
2017-06-03 00:38:32,377 - MainThread - botocore.credentials - DEBUG - Skipping environment variable credential check because profile name was explicitly set.
2017-06-03 00:38:32,377 - MainThread - botocore.hooks - DEBUG - Event building-command-table.s3: calling handler <function add_waiters at 0x7f83141cbc08>
2017-06-03 00:38:32,378 - MainThread - botocore.hooks - DEBUG - Event load-cli-arg.custom.s3.anonymous: calling handler <function uri_param at 0x7f831466d050>
2017-06-03 00:38:32,378 - MainThread - botocore.hooks - DEBUG - Event building-command-table.ls: calling handler <function add_waiters at 0x7f83141cbc08>
2017-06-03 00:38:32,379 - MainThread - botocore.hooks - DEBUG - Event load-cli-arg.custom.ls.paths: calling handler <function uri_param at 0x7f831466d050>
2017-06-03 00:38:32,379 - MainThread - botocore.hooks - DEBUG - Event load-cli-arg.custom.ls.human-readable: calling handler <function uri_param at 0x7f831466d050>
2017-06-03 00:38:32,379 - MainThread - botocore.hooks - DEBUG - Event process-cli-arg.custom.ls: calling handler <awscli.argprocess.ParamShorthandParser object at 0x7f8314624d10>
2017-06-03 00:38:32,379 - MainThread - botocore.hooks - DEBUG - Event load-cli-arg.custom.ls.page-size: calling handler <function uri_param at 0x7f831466d050>
2017-06-03 00:38:32,379 - MainThread - botocore.hooks - DEBUG - Event load-cli-arg.custom.ls.anonymous: calling handler <function uri_param at 0x7f831466d050>
2017-06-03 00:38:32,379 - MainThread - botocore.hooks - DEBUG - Event load-cli-arg.custom.ls.request-payer: calling handler <function uri_param at 0x7f831466d050>
2017-06-03 00:38:32,379 - MainThread - botocore.hooks - DEBUG - Event load-cli-arg.custom.ls.summarize: calling handler <function uri_param at 0x7f831466d050>
2017-06-03 00:38:32,379 - MainThread - botocore.hooks - DEBUG - Event process-cli-arg.custom.ls: calling handler <awscli.argprocess.ParamShorthandParser object at 0x7f8314624d10>
2017-06-03 00:38:32,380 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: env
2017-06-03 00:38:32,380 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: assume-role
2017-06-03 00:38:32,380 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: shared-credentials-file
2017-06-03 00:38:32,380 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: config-file
2017-06-03 00:38:32,380 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: ec2-credentials-file
2017-06-03 00:38:32,380 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: boto-config
2017-06-03 00:38:32,380 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: container-role
2017-06-03 00:38:32,380 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: iam-role
2017-06-03 00:38:32,395 - MainThread - botocore.vendored.requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): 169.254.169.254
2017-06-03 00:38:32,396 - MainThread - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "GET /latest/meta-data/iam/security-credentials/ HTTP/1.1" 200 60
2017-06-03 00:38:32,398 - MainThread - botocore.vendored.requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): 169.254.169.254
2017-06-03 00:38:32,398 - MainThread - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "GET /latest/meta-data/iam/security-credentials/etcd-fail-Controlplane-1ABCD12345678-IAMRoleEtcd-ABCDEFOOOOOK HTTP/1.1" 200 894
2017-06-03 00:38:32,399 - MainThread - botocore.credentials - INFO - Found credentials from IAM Role: etcd-fail-Controlplane-1ABCD12345678-IAMRoleEtcd-ABCDEFOOOOOK
2017-06-03 00:38:32,400 - MainThread - botocore.loaders - DEBUG - Loading JSON file: /usr/lib/python2.7/site-packages/botocore/data/endpoints.json
2017-06-03 00:38:32,509 - MainThread - botocore.loaders - DEBUG - Loading JSON file: /usr/lib/python2.7/site-packages/botocore/data/s3/2006-03-01/service-2.json
2017-06-03 00:38:32,690 - MainThread - botocore.client - DEBUG - Registering retry handlers for service: s3
2017-06-03 00:38:32,720 - MainThread - botocore.hooks - DEBUG - Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x7f8314f13500>
2017-06-03 00:38:32,720 - MainThread - botocore.hooks - DEBUG - Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x7f8314f0cc08>
2017-06-03 00:38:32,720 - MainThread - botocore.args - DEBUG - The s3 config key is not a dictionary type, ignoring its value of: None
2017-06-03 00:38:32,751 - MainThread - botocore.endpoint - DEBUG - Setting s3 timeout as (60, 60)
2017-06-03 00:38:32,752 - MainThread - botocore.client - DEBUG - Defaulting to S3 virtual host style addressing with path style addressing fallback.
2017-06-03 00:38:32,755 - MainThread - botocore.loaders - DEBUG - Loading JSON file: /usr/lib/python2.7/site-packages/botocore/data/s3/2006-03-01/paginators-1.json
2017-06-03 00:38:32,756 - MainThread - botocore.hooks - DEBUG - Event before-parameter-build.s3.ListObjects: calling handler <function set_list_objects_encoding_type_url at 0x7f8314f43410>
2017-06-03 00:38:32,756 - MainThread - botocore.hooks - DEBUG - Event before-parameter-build.s3.ListObjects: calling handler <function validate_bucket_name at 0x7f8314f40500>
2017-06-03 00:38:32,781 - MainThread - botocore.hooks - DEBUG - Event before-parameter-build.s3.ListObjects: calling handler <bound method S3RegionRedirector.redirect_from_cache of <botocore.utils.S3RegionRedirector object at 0x7f8313d52ed0>>
2017-06-03 00:38:32,782 - MainThread - botocore.hooks - DEBUG - Event before-call.s3.ListObjects: calling handler <function add_expect_header at 0x7f8314f40938>
2017-06-03 00:38:32,782 - MainThread - botocore.hooks - DEBUG - Event before-call.s3.ListObjects: calling handler <bound method S3RegionRedirector.set_request_url of <botocore.utils.S3RegionRedirector object at 0x7f8313d52ed0>>
2017-06-03 00:38:32,782 - MainThread - botocore.endpoint - DEBUG - Making request for OperationModel(name=ListObjects) (verify_ssl=True) with params: {'body': '', 'url': u'https://s3.amazonaws.com/test-etcd-fail-example?delimiter=%2F&prefix=kube-aws%2Fclusters%2Fetcd-fail%2Finstances%2F9f6c6e20-400f-11e7-8527-500c32cd4462%2Fetcd-snapshots%2F&encoding-type=url', 'headers': {'User-Agent': 'aws-cli/1.11.15 Python/2.7.12 Linux/4.9.24-coreos botocore/1.4.72'}, 'context': {'encoding_type_auto_set': True, 'client_region': 'us-east-1', 'signing': {'bucket': u'test-etcd-fail-example'}, 'has_streaming_input': False, 'client_config': <botocore.config.Config object at 0x7f8313d52a10>}, 'query_string': {u'delimiter': '/', u'prefix': u'kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/', u'encoding-type': 'url'}, 'url_path': u'/test-etcd-fail-example', 'method': u'GET'}
2017-06-03 00:38:32,783 - MainThread - botocore.hooks - DEBUG - Event request-created.s3.ListObjects: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7f8313d52990>>
2017-06-03 00:38:32,783 - MainThread - botocore.hooks - DEBUG - Event before-sign.s3.ListObjects: calling handler <function fix_s3_host at 0x7f831547e488>
2017-06-03 00:38:32,783 - MainThread - botocore.utils - DEBUG - Checking for DNS compatible bucket for: https://s3.amazonaws.com/test-etcd-fail-example?delimiter=%2F&prefix=kube-aws%2Fclusters%2Fetcd-fail%2Finstances%2F9f6c6e20-400f-11e7-8527-500c32cd4462%2Fetcd-snapshots%2F&encoding-type=url
2017-06-03 00:38:32,783 - MainThread - botocore.utils - DEBUG - URI updated to: https://test-etcd-fail-example.s3.amazonaws.com/?delimiter=%2F&prefix=kube-aws%2Fclusters%2Fetcd-fail%2Finstances%2F9f6c6e20-400f-11e7-8527-500c32cd4462%2Fetcd-snapshots%2F&encoding-type=url
2017-06-03 00:38:32,783 - MainThread - botocore.auth - DEBUG - Calculating signature using hmacv1 auth.
2017-06-03 00:38:32,783 - MainThread - botocore.auth - DEBUG - HTTP request method: GET
2017-06-03 00:38:32,783 - MainThread - botocore.auth - DEBUG - StringToSign:
GET


Sat, 03 Jun 2017 00:38:32 GMT
x-amz-security-token:[... Base64 Encoded signature SCRUBBED ...]
/test-etcd-fail-example/
2017-06-03 00:38:32,784 - MainThread - botocore.endpoint - DEBUG - Sending http request: <PreparedRequest [GET]>
2017-06-03 00:38:32,784 - MainThread - botocore.vendored.requests.packages.urllib3.connectionpool - INFO - Starting new HTTPS connection (1): test-etcd-fail-example.s3.amazonaws.com
2017-06-03 00:38:32,884 - MainThread - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "GET /?delimiter=%2F&prefix=kube-aws%2Fclusters%2Fetcd-fail%2Finstances%2F9f6c6e20-400f-11e7-8527-500c32cd4462%2Fetcd-snapshots%2F&encoding-type=url HTTP/1.1" 403 None
2017-06-03 00:38:32,885 - MainThread - botocore.parsers - DEBUG - Response headers: {'date': 'Sat, 03 Jun 2017 00:38:31 GMT', 'x-amz-id-2': '[... SCRUBBED ...]', 'server': 'AmazonS3', 'transfer-encoding': 'chunked', 'x-amz-request-id': '12345678ABCDEFGH', 'x-amz-bucket-region': 'us-east-1', 'content-type': 'application/xml'}
2017-06-03 00:38:32,885 - MainThread - botocore.parsers - DEBUG - Response body:
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>12345678ABCDEFGH</RequestId><HostId>[... SCRUBBED ...]</HostId></Error>
2017-06-03 00:38:32,885 - MainThread - botocore.hooks - DEBUG - Event needs-retry.s3.ListObjects: calling handler <botocore.retryhandler.RetryHandler object at 0x7f8313a98610>
2017-06-03 00:38:32,885 - MainThread - botocore.retryhandler - DEBUG - No retry needed.
2017-06-03 00:38:32,885 - MainThread - botocore.hooks - DEBUG - Event needs-retry.s3.ListObjects: calling handler <bound method S3RegionRedirector.redirect_from_error of <botocore.utils.S3RegionRedirector object at 0x7f8313d52ed0>>
2017-06-03 00:38:32,885 - MainThread - botocore.hooks - DEBUG - Event after-call.s3.ListObjects: calling handler <function decode_list_object at 0x7f8314f43488>
2017-06-03 00:38:32,886 - MainThread - botocore.hooks - DEBUG - Event after-call.s3.ListObjects: calling handler <function enhance_error_msg at 0x7f8314242758>
2017-06-03 00:38:32,886 - MainThread - awscli.clidriver - DEBUG - Exception caught in main()
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/awscli/clidriver.py", line 186, in main
    return command_table[parsed_args.command](remaining, parsed_args)
  File "/usr/lib/python2.7/site-packages/awscli/customizations/commands.py", line 190, in __call__
    parsed_globals)
  File "/usr/lib/python2.7/site-packages/awscli/customizations/commands.py", line 187, in __call__
    return self._run_main(parsed_args, parsed_globals)
  File "/usr/lib/python2.7/site-packages/awscli/customizations/s3/subcommands.py", line 472, in _run_main
    bucket, key, parsed_args.page_size, parsed_args.request_payer)
  File "/usr/lib/python2.7/site-packages/awscli/customizations/s3/subcommands.py", line 499, in _list_all_objects
    for response_data in iterator:
  File "/usr/lib/python2.7/site-packages/botocore/paginate.py", line 102, in __iter__
    response = self._make_request(current_kwargs)
  File "/usr/lib/python2.7/site-packages/botocore/paginate.py", line 174, in _make_request
    return self._method(**current_kwargs)
  File "/usr/lib/python2.7/site-packages/botocore/client.py", line 251, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python2.7/site-packages/botocore/client.py", line 537, in _make_api_call
    raise ClientError(parsed_response, operation_name)
ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
2017-06-03 00:38:32,893 - MainThread - awscli.clidriver - DEBUG - Exiting with rc 255

An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

Testing Solution

So for some reason the default S3 bucket policy is not letting this instance read from the bucket.

I experimented by adding a new more permissive policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:<AWS_REGION_HERE>:<AWS_ACCOUNT_ID_HERE>:key/<KMS_KEY_ID_HERE>",
            "Effect": "Allow"
        },
        {
            "Action": "ec2:DescribeTags",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": "ec2:DescribeVolumes",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": "ec2:AttachVolume",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": "ec2:DescribeVolumeStatus",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": "ec2:AssociateAddress",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::test-etcd-fail-example/kube-aws/clusters/etcd-fail/exported/stacks/control-plane/userdata-etcd*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:List*",
                "s3:GetObject*",
                "s3:PutObject*"
            ],
            "Resource": "arn:aws:s3:::test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::test-etcd-fail-example",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringLike": {
                    "s3:prefix": "/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/*"
                }
            },
            "Action": [
                "s3:List*",
                "s3:GetObject*"
            ],
            "Resource": "arn:aws:s3:::test-etcd-fail-example",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:*"
            ],
            "Resource": "arn:aws:s3:::test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/*",
            "Effect": "Allow"
        },
        {
            "Action": "ec2:DescribeInstances",
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Validation

Using this new policy, the Etcd nodes are able to Read / Write to this snapshot.db file:

$ sudo journalctl -xn10 -u 'etcdadm*' -f
Jun 03 00:48:40 10.100.2.123.ec2.internal etcdadm[4586]: /opt/bin/etcdadm: info: member_upload_snapshot: uploading /var/run/coreos/etcdadm/snapshots/etcd0.db to s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/snapshot.db
Jun 03 00:48:42 10.100.2.123.ec2.internal etcdadm[4586]: [2.0K blob data]
Jun 03 00:48:42 10.100.2.123.ec2.internal etcdadm[4586]: [2.0K blob data]
Jun 03 00:48:42 10.100.2.123.ec2.internal etcdadm[4586]: [2.0K blob data]
Jun 03 00:48:42 10.100.2.123.ec2.internal etcdadm[4586]: [2.0K blob data]
Jun 03 00:48:43 10.100.2.123.ec2.internal etcdadm[4586]: [852B blob data]

docker run -e AWS_DEFAULT_REGION=us-east-1 --net=host --volume /var/run/coreos:/var/run/coreos quay.io/coreos/awscli aws s3 ls s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/
2017-06-03 00:47:05   42872864 snapshot.db

After rebooting all 3 nodes, 2 of them were able to restore successfully:

Jun 03 00:52:17 ip-10-100-2-123.ec2.internal systemd[1]: Starting etcd health check...
-- Subject: Unit etcdadm-check.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcdadm-check.service has begun starting up.
Jun 03 00:52:17 ip-10-100-2-123.ec2.internal systemctl[11309]: active
Jun 03 00:52:17 ip-10-100-2-123.ec2.internal etcdadm[11322]: declare -x ETCDCTL_CACERT="/etc/ssl/certs/ca.pem"
Jun 03 00:52:17 ip-10-100-2-123.ec2.internal etcdadm[11322]: declare -x ETCDCTL_CA_FILE="/etc/ssl/certs/ca.pem"
Jun 03 00:52:17 ip-10-100-2-123.ec2.internal etcdadm[11322]: declare -x ETCDCTL_CERT="/etc/ssl/certs/etcd-client.pem"
Jun 03 00:52:17 ip-10-100-2-123.ec2.internal etcdadm[11322]: declare -x ETCDCTL_CERT_FILE="/etc/ssl/certs/etcd-client.pem"
Jun 03 00:52:17 ip-10-100-2-123.ec2.internal etcdadm[11322]: declare -x ETCDCTL_KEY="/etc/ssl/certs/etcd-client-key.pem"
Jun 03 00:52:17 ip-10-100-2-123.ec2.internal etcdadm[11322]: declare -x ETCDCTL_KEY_FILE="/etc/ssl/certs/etcd-client-key.pem"
Jun 03 00:52:19 ip-10-100-2-123.ec2.internal etcdadm[11322]: [ 5140.543532] etcd[5]: https://ec2-12-34-123-321.compute-1.amazonaws.com:2379 is healthy: successfully committed proposal: took = 1.633271ms
Jun 03 00:52:22 ip-10-100-2-123.ec2.internal etcdadm[11322]: [ 5142.943404] etcd[5]: https://ec2-12-34-123-321.compute-1.amazonaws.com:2379 is healthy: successfully committed proposal: took = 2.274348ms

1 Node Restore Problem

The third one did not. Here are his logs

It seems that all 3 of the etcd nodes are storing snapshot.db to the same location:

s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/snapshot.db

@mumoshu : Is this intended operation of the etcdadm-save.service to write all snapshots to one file? I'm not familiar enough with etcd and the snapshot save & restore process right now to know if this is desirable or not.

After rebooting this node a couple times and restarting all etcd* services, I can get it to attempt to restore and fail in the same way each time. Maybe this one is confused about the member id somehow? I'm not sure why it doesn't come back up cleanly.

danielfm · 2017-06-05T13:47:51Z

Edit: Updated step-by-step after some testing.

I most likely ran into this issue this weekend.

One of my etcd instances (the only one that survived anyway) had 20k+ stopped containers, so I manually added the --rm flag to /opt/bin/etcdadm and cleaned things up with the famous docker rm $(docker ps -qa).

Just to add more information to the table, I'll explain what happened on my end.

I'm running a three-node etcd cluster, and one of the instances suddenly started to consume 100% CPU (the same thing happened later with other node, but luckily I had time to recover the first failing node before the other one went down).

I was unable to log into it via SSH to see what was happening (most likely due to the high load), so I terminated the instance (thinking back, it would probably be better to enable the termination protection on the instance and reboot it, but I guess I'll never know by now).

Then, when the new instance went up, I saw all these rafthttp errors, causing the new instance to be unable to join the cluster. After many hours of trial and error, this is how I managed to restore the cluster:

# on some working etcd node, remove and re-add the member
$ etcdctl member remove <old-member-id>
$ etcdctl member add etcd<N> https://<hostname>:2380

# on the new etcd node, stop etcd-related services and clean up the node data
$ sudo systemctl stop etcd etcd2 etcdadm-save.timer etcd-member etcdadm-reconfigure etcdadm-update-status etcdadm-check
$ sudo rm -R /var/lib/etcd2/*

# replace the values of ETCD_INITIAL_CLUSTER and ETCD_INITIAL_CLUSTER_STATE by the ones returned by the `member add` command and re-start the etcd-member service
$ sudo vi /etc/etcd-environment
$ sudo systemctl start etcd-member

The step-by-step might not be accurate as I'm not an etcd expert myself, but, if I remember correctly, this is how I manage to restore two failing members on my cluster.

Adds 's3:ListBucket' permission explicitly to statement with only bucket ARN. Adds 's3:GetObject*' permission to statement with s3:prefix and bucket ARN. Should fix etcd snapshot.db recovery process. This should fix S3 permission denied errors during etcdadm-reconfigure such as: etcdadm[28736]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: checking existence of s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7 etcdadm[28736]: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied Fixes kubernetes-retired#660

trinitronx · 2017-06-07T05:53:03Z

@danielfm: Thanks for the helpful procedure to restore etcd. If this happens again I'll be sure to try this.

trinitronx · 2017-06-07T05:58:53Z

Update

I've made the changes I mentioned above to the S3 bucket policy in kube-aws here. This appears to fix the S3 permission errors I was seeing such as:

May 31 22:17:42 ip-10-100-2-111.ec2.internal etcdadm[28736]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: checking existence of s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7-8527-500c32cd4462/etcd-snapshots/snapshot.db
May 31 22:17:43 ip-10-100-2-111.ec2.internal etcdadm[28736]: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

mumoshu · 2017-06-07T07:42:33Z

@trinitronx Thank you very much for your efforts 🙇

I'm trying to understand - so probably we had been missing an IAM policy for checking existence of snapshot.db, which resulted in inability to recover a permanently failed etcd node from the snapshot?

// Too bad if I had missed such a defect in my manual testing 😢

mumoshu · 2017-06-07T08:42:15Z

@trinitronx

Is this intended operation of the etcdadm-save.service to write all snapshots to one file? I'm not familiar enough with etcd and the snapshot save & restore process right now to know if this is desirable or not.

Yes. Currently, every etcd node is tried to be recovered from the same snapshot when in a disaster recovery process, so that the second and the following etcd nodes won't need to catch up huge data from the etcd leader. The bunch of comments in etcdadm_reconfigure() in the etcdadm script would help understading how it works.

Of course, any questions are always welcomed 👍

trinitronx · 2017-06-16T23:47:30Z

@mumoshu : Just wanted to check back in to give some battle testing results of the cluster so far. We have not seen any more failures for the etcd nodes in this cluster, and they have been staying up this week.

However, we are seeing some strange behavior when trying to interact with the cluster via kubectl or kubernetes-dashboard. Basically we are seeing intermittent timeout errors for kubectl commands and kubernetes-dashboard pages.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.3", GitCommit:"0480917b552be33e2dba47386e51decb1a211df6", GitTreeState:"clean", BuildDate:"2017-06-16T21:16:54Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.3+coreos.0", GitCommit:"70bc0937018d89b0d529d428eed8c444d8e7729a", GitTreeState:"clean", BuildDate:"2017-05-11T18:15:57Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl get pods
Error from server (ServerTimeout): the server cannot complete the requested operation at this time, try again later (get pods)

# A couple seconds later, we get a response but the pod is stuck in Pending

$ kubectl get pods
NAME                                        READY     STATUS    RESTARTS   AGE
example-pod-cronjob-test-1234567890-abcd1   0/1       Pending   0          3h

When hitting a kubernetes-dashboard page: https://example.com/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard/#!/workload?namespace=default

Gateway Timeout (504)

the server cannot complete the requested operation at this time, try again later (get pods)

This seems to be intermittent, sometimes we can hit the kube-apiserver directly from the same browser using: https://example.com/api/v1/namespaces/default/pods

When it fails, it returns:

{"metadata":{},"status":"Failure","message":"The list operation against pods could not be completed at this time, please try again.","reason":"ServerTimeout","details":{"name":"list","kind":"pods"},"code":500}

Other times, the same API endpoint returns 200 OK with results all within 300-400ms according to Google Chrome's Developer Tools -> Network request info.

So there is something still causing intermittent kube-apiserver failures, presumably also causing pods to be stuck in Pending.

Still not sure what the root cause is, but I thought I'd provide the feedback.

trinitronx · 2017-06-17T00:38:35Z

Update:

Seems that there is some issue with flanneld:

Jun 12 19:40:51 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:40:51.149236    1451 watch.go:272] Subnet watch failed: client: etcd cluster is unavailable or misconfigured
Jun 12 19:40:52 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:40:52.952952    1451 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured
Jun 12 19:40:53 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:40:53.155630    1451 watch.go:272] Subnet watch failed: client: etcd cluster is unavailable or misconfigured
Jun 12 19:40:54 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:40:54.961201    1451 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured
Jun 12 19:40:55 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:40:55.167600    1451 watch.go:272] Subnet watch failed: client: etcd cluster is unavailable or misconfigured
Jun 12 19:40:56 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:40:56.996751    1451 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured
Jun 12 19:40:57 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:40:57.173963    1451 watch.go:272] Subnet watch failed: client: etcd cluster is unavailable or misconfigured
Jun 12 19:40:59 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:40:59.004762    1451 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured
Jun 12 19:40:59 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:40:59.189276    1451 watch.go:272] Subnet watch failed: client: etcd cluster is unavailable or misconfigured
Jun 12 19:41:01 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:41:01.011583    1451 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured
Jun 12 19:41:01 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:41:01.197205    1451 watch.go:272] Subnet watch failed: client: etcd cluster is unavailable or misconfigured
Jun 12 19:41:03 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:41:03.019284    1451 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured
Jun 12 19:41:03 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:41:03.204723    1451 watch.go:272] Subnet watch failed: client: etcd cluster is unavailable or misconfigured
Jun 12 19:41:05 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:41:05.026410    1451 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured
Jun 12 19:41:05 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:41:05.211500    1451 watch.go:272] Subnet watch failed: client: etcd cluster is unavailable or misconfigured
Jun 12 19:41:07 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:41:07.033600    1451 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured
Jun 12 19:41:07 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: E0612 19:41:07.218702    1451 watch.go:272] Subnet watch failed: client: etcd cluster is unavailable or misconfigured

Solution (for now)

Going onto the master nodes and issuing: sudo systemctl restart flanneld seems to have fixed it

$  journalctl -xn -u flanneld.service

-- Subject: Unit flanneld.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has begun shutting down.
Jun 17 00:32:15 ip-10-123-1-12.ec2.internal flannel-wrapper[1451]: I0617 00:32:15.881177    1451 main.go:172] Exiting...
Jun 17 00:32:15 ip-10-123-1-12.ec2.internal rkt[11157]: "98765432-1000-ffff-1111-abcdef012345"
Jun 17 00:32:15 ip-10-123-1-12.ec2.internal systemd[1]: Stopped flannel - Network fabric for containers (System Application Container).
-- Subject: Unit flanneld.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has finished shutting down.
Jun 17 00:32:15 ip-10-123-1-12.ec2.internal systemd[1]: Starting flannel - Network fabric for containers (System Application Container)...
-- Subject: Unit flanneld.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has begun starting up.
Jun 17 00:32:16 ip-10-123-1-12.ec2.internal rkt[11183]: "98765432-1000-ffff-1111-abcdef012345"
Jun 17 00:32:16 ip-10-123-1-12.ec2.internal systemctl[11205]: active
Jun 17 00:32:24 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850266.821530] awscli[5]: decrypting assets
Jun 17 00:32:24 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850266.823152] awscli[5]: decrypting /etc/kubernetes/ssl/apiserver-key.pem.enc
Jun 17 00:32:25 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850267.803033] awscli[5]: decrypting /etc/kubernetes/ssl/apiserver.pem.enc
Jun 17 00:32:26 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850268.692520] awscli[5]: decrypting /etc/kubernetes/ssl/ca-key.pem.enc
Jun 17 00:32:26 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850269.531869] awscli[5]: decrypting /etc/kubernetes/ssl/ca.pem.enc
Jun 17 00:32:27 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850270.462332] awscli[5]: decrypting /etc/kubernetes/ssl/etcd-client-key.pem.enc
Jun 17 00:32:28 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850271.247576] awscli[5]: decrypting /etc/kubernetes/ssl/etcd-client.pem.enc
Jun 17 00:32:29 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850272.065044] awscli[5]: decrypting /etc/kubernetes/auth/kubelet-tls-bootstrap-token.tmp.enc
Jun 17 00:32:30 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850272.938101] awscli[5]: generating /etc/kubernetes/auth/tokens.csv
Jun 17 00:32:30 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: [850272.948515] awscli[5]: done.
Jun 17 00:32:30 ip-10-123-1-12.ec2.internal decrypt-assets[11209]: "12345678-abcd-ef00-4321-cafef00dbad0"
Jun 17 00:32:31 ip-10-123-1-12.ec2.internal etcdctl[11400]: {"Network" : "10.2.0.0/16", "Backend" : {"Type" : "vxlan"}}
Jun 17 00:32:31 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/flannel-wrapper.uuid --trust-keys-from-https --volume ssl,kind=host,source=/etc/kubernetes/ssl,readOnly=true --mount volume=ssl,target=/etc/kubernetes/ssl --mount volume=notify,target=/run/systemd/notify --volume notify,kind=host,source=/run/systemd/notify --set-env=NOTIFY_SOCKET=/run/systemd/notify --net=host --volume run-flannel,kind=host,source=/run/flannel,readOnly=false --volume etc-ssl-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=run-flannel,target=/run/flannel --mount volume=etc-ssl-certs,target=/etc/ssl/certs --mount volume=usr-share-certs,target=/usr/share/ca-certificates --mount volume=etc-hosts,target=/etc/hosts --mount volume=etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/flannel:v0.7.0 -- --ip-masq=true
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.195919   11405 main.go:132] Installing signal handlers
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.196359   11405 manager.go:124] Searching for interface using 10.123.1.12
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.207266   11405 manager.go:149] Using interface with name eth0 and address 10.123.1.12
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.207631   11405 manager.go:166] Defaulting external address to interface address (10.123.1.12)
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.582709   11405 local_manager.go:134] Found lease (10.2.27.0/24) for current IP (10.123.1.12), reusing
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.722927   11405 ipmasq.go:47] Adding iptables rule: -s 10.2.0.0/16 -d 10.2.0.0/16 -j RETURN
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.726667   11405 ipmasq.go:47] Adding iptables rule: -s 10.2.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.742158   11405 ipmasq.go:47] Adding iptables rule: ! -s 10.2.0.0/16 -d 10.2.0.0/16 -j MASQUERADE
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.752014   11405 manager.go:250] Lease acquired: 10.2.27.0/24
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal systemd[1]: Started flannel - Network fabric for containers (System Application Container).
-- Subject: Unit flanneld.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has finished starting up.
--
-- The start-up result is done.
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.830637   11405 network.go:58] Watching for L3 misses
Jun 17 00:32:33 ip-10-123-1-12.ec2.internal flannel-wrapper[11405]: I0617 00:32:33.831017   11405 network.go:66] Watching for new subnet leases

trinitronx · 2017-06-17T05:06:37Z

I've noticed that with our cluster there was also a problem with kube-proxy not being able to hit the apiserver. Presumably this was due to broken flanneld.

Here were the error messages in container logs for kube-proxy:

Worker nodes

$ docker logs --tail=100 -f  $(docker ps -a | grep kube-proxy | grep -v pause | awk '{ print $1 }')

E0612 20:30:34.634628       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:49: Failed to list *api.Endpoints: the server cannot complete the requested operation at this time, try again later (get endpoints)
E0612 20:31:35.638726       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:49: Failed to list *api.Endpoints: the server cannot complete the requested operation at this time, try again later (get endpoints)
E0612 20:32:36.641849       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:49: Failed to list *api.Endpoints: the server cannot complete the requested operation at this time, try again later (get endpoints)

E0612 20:47:49.031903       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:46: Failed to list *api.Service: the server cannot complete the requested operation at this time, try again later (get services)
E0612 20:48:50.035194       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:46: Failed to list *api.Service: the server cannot complete the requested operation at this time, try again later (get services)
E0612 20:49:51.050341       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:46: Failed to list *api.Service: the server cannot complete the requested operation at this time, try again later (get services)

Controller nodes

$ docker logs --tail=100 -f  $(docker ps -a | grep kube-proxy | grep -v pause | awk '{ print $1 }')

E0617 03:29:58.726256       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:46: Failed to list *api.Service: Get http://127.0.0.1:8080/api/v1/services?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused
E0617 03:29:58.727810       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:49: Failed to list *api.Endpoints: Get http://127.0.0.1:8080/api/v1/endpoints?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused
E0617 03:29:59.727147       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:46: Failed to list *api.Service: Get http://127.0.0.1:8080/api/v1/services?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused
E0617 03:29:59.728610       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:49: Failed to list *api.Endpoints: Get http://127.0.0.1:8080/api/v1/endpoints?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused
E0617 03:30:00.749474       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:46: Failed to list *api.Service: Get http://127.0.0.1:8080/api/v1/services?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused
E0617 03:30:00.749618       1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:49: Failed to list *api.Endpoints: Get http://127.0.0.1:8080/api/v1/endpoints?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused

`kube-proxy` restart fix:

I was able to manually fix the issue on the controller and worker nodes with:

proxies=$(docker ps  | grep -i kube-proxy | awk '{ print $1 }') ; for c in $proxies; do docker stop  $c; sync; sleep 1;   docker rm -v $c; sync; sleep 1;  done ;

docker logs --tail=100 -f  $(docker ps -a | grep kube-proxy | grep -v pause | awk '{ print $1 }')
I0617 04:04:31.563985       1 server.go:225] Using iptables Proxier.
W0617 04:04:31.663074       1 proxier.go:298] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0617 04:04:31.663198       1 server.go:249] Tearing down userspace rules.
I0617 04:04:31.811253       1 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0617 04:04:31.811926       1 conntrack.go:66] Setting conntrack hashsize to 32768
I0617 04:04:31.812072       1 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0617 04:04:31.812157       1 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600

After this, testing the kube-proxy connection to localhost:8080 apiserver worked:

$ docker exec -ti  $(docker ps -a | grep kube-proxy | grep -v pause | awk '{ print $1 }') /bin/sh
$ apt-get update && apt-get -y install curl
$ curl -v 'http://127.0.0.1:8080/api/v1/endpoints?resourceVersion=0'
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET /api/v1/endpoints?resourceVersion=0 HTTP/1.1
> User-Agent: curl/7.38.0
> Host: 127.0.0.1:8080
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Sat, 17 Jun 2017 04:02:38 GMT
< Transfer-Encoding: chunked
<
{
  "kind": "EndpointsList",
  "apiVersion": "v1",

[...SNIP...]
[ JSON api response data here ]

I had to repeat these steps for all nodes, and kube-apiserver began responding well again. I have not stress tested this, but it works for now.

trinitronx · 2017-06-20T01:45:02Z

Update

@mumoshu :

Just to loop back in on this one... We have been implementing AWS NACLs as a separate task in our VPC, and it turns out that we were blocking too much at one point. This is what caused some of our issues with flanneld, kube-apiserver, kube-proxy, kubelet etc...

However, this also brings up an unhandled case in the etcdadm-reconfigure.service which requires intervention to recover from. It also is a good way to reproduce the issue for testing purposes 😉

I believe after most of this testing that the current cluster state is not tolerant of full network-level failures and at least requires manual intervention to kickstart recovery of etcd. We also had to restart flanneld, kubelet, and kube-proxy on all nodes. When everything goes down the etcdadm-reconfigure systemd unit logs that it gets stuck in handling the case here in /opt/bin/etcdadm:700

The logged message was strange in that it said -1 more nodes are required:

Jun 19 18:15:46 ip-10-123-1-183.ec2.internal etcdadm[5444]: /opt/bin/etcdadm: info: member_reconfigure: observing cluster state: quorum=3 healthy=0
Jun 19 18:15:49 ip-10-123-1-183.ec2.internal etcdadm[5444]: /opt/bin/etcdadm: info: member_reconfigure: -1 more nodes are required until the quorum is met
Jun 19 18:15:49 ip-10-123-1-183.ec2.internal etcdadm[5444]: /opt/bin/etcdadm: info: member_set_unit_type: setting etcd unit type to "notify". `systemctl daemon-reload` required afterwards
Jun 19 18:15:49 ip-10-123-1-183.ec2.internal etcdadm[5444]: /opt/bin/etcdadm: info: member_reconfigure: all the nodes are present but cluster is still unhealthy, which means the initial bootstrap is still in progress. keep retrying a while
Jun 19 18:15:49 ip-10-123-1-183.ec2.internal etcdadm[5444]: /opt/bin/etcdadm: info: _systemctl_daemon_reload: running `systemctl daemon-reload` to reload etcd-member.service
Jun 19 18:15:49 ip-10-123-1-183.ec2.internal systemd[1]: Started etcdadm reconfigure runner.

This message was repeatedly looped each time the etcdadm-reconfigure.service restarted.

We were very confused by this, so we added some debug logging to the script to check the values going into the equation remaining_num=$(( quorum - running_num + 1 )):

  else
    # At least N/2+1 members are NOT working

    local running_num
    local remaining_num
    local total_num
    running_num=$(cluster_num_running_nodes)
    total_num=$member_count
    remaining_num=$(( quorum - running_num + 1 ))

### BEGIN ADDED DEBUG LOGGING
    _info "${running_num} nodes are running"
    _info "${quorum} nodes required for quorum"
    _info "${total_num} nodes total"
### END ADDED DEBUG LOGGING
    _info "${remaining_num} more nodes are required until the quorum is met"

    if (( remaining_num >= 2 )); then
      member_set_unit_type simple
    else
      member_set_unit_type notify
    fi

The resulting log output was:

Jun 19 21:32:35 ip-10-123-1-183.ec2.internal etcdadm[25019]: /opt/bin/etcdadm: info: member_reconfigure: observing cluster state: quorum=3 healthy=0
Jun 19 21:32:39 ip-10-123-1-183.ec2.internal etcdadm[25019]: /opt/bin/etcdadm: info: member_reconfigure: 5 nodes are running
Jun 19 21:32:39 ip-10-123-1-183.ec2.internal etcdadm[25019]: /opt/bin/etcdadm: info: member_reconfigure: 3 nodes required for quorum
Jun 19 21:32:39 ip-10-123-1-183.ec2.internal etcdadm[25019]: /opt/bin/etcdadm: info: member_reconfigure: 5 nodes total
Jun 19 21:32:39 ip-10-123-1-183.ec2.internal etcdadm[25019]: /opt/bin/etcdadm: info: member_reconfigure: -1 more nodes are required until the quorum is met

So this means:

Variable	Value
running_num	5
quorum	3
total_num	5
remaining_num	-1

remaining_num = quorum - running_num + 1
remaining_num = 3 - 5 + 1 = -1

this seems to mean that it calls: member_set_unit_type notify

But the script was not able to recover the entire cluster from this state on it's own without intervention. We ended up restarting everything as I detailed in the 2 previous comments above here and here.

Additionally, we have still been unable to recover a single etcd node which appears to be a separate issue. This one is restoring from snapshot, but once etcd-member.service starts up, it fails to re-join the cluster and logs many errors. I may have to create a separate issue for this one.

I hope this extra information is helpful when working on stabilizing this new etcd recovery feature!

At least this appears to be reproducible by:

Start up a cluster with kube-aws
Set up a NACL rule to break the network connection
Undo NACL rule change
Watch as etcd nodes try to recover but are stuck

Adds 's3:ListBucket' permission explicitly to statement with only bucket ARN. Adds 's3:GetObject*' permission to statement with s3:prefix and bucket ARN. Should fix etcd snapshot.db recovery process. This should fix S3 permission denied errors during etcdadm-reconfigure such as: etcdadm[28736]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: checking existence of s3://test-etcd-fail-example/kube-aws/clusters/etcd-fail/instances/9f6c6e20-400f-11e7 etcdadm[28736]: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied Fixes kubernetes-retired#660

trinitronx mentioned this issue Jun 5, 2017

ETCd crash after few days #691

Closed

trinitronx mentioned this issue Jun 7, 2017

core: Add ListBucket explicitly to EtcdSnapshotsS3Bucket #695

Merged

mumoshu closed this as completed in #695 Jun 9, 2017

ytsarev mentioned this issue Jun 16, 2017

Use docker instead of rkt for regular etcdadm tasks (#4) #705

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kube-aws v0.9.6-rc.2 - Unstable Etcd3 cluster #660

kube-aws v0.9.6-rc.2 - Unstable Etcd3 cluster #660

trinitronx commented May 17, 2017

redbaron commented May 17, 2017

trinitronx commented May 31, 2017 •

edited

Loading

trinitronx commented May 31, 2017

mumoshu commented May 31, 2017 •

edited

Loading

trinitronx commented May 31, 2017

mumoshu commented Jun 1, 2017 •

edited

Loading

trinitronx commented Jun 1, 2017 •

edited

Loading

trinitronx commented Jun 1, 2017

trinitronx commented Jun 3, 2017

danielfm commented Jun 5, 2017 •

edited

Loading

trinitronx commented Jun 7, 2017

trinitronx commented Jun 7, 2017 •

edited

Loading

mumoshu commented Jun 7, 2017

mumoshu commented Jun 7, 2017

trinitronx commented Jun 16, 2017

trinitronx commented Jun 17, 2017

trinitronx commented Jun 17, 2017

trinitronx commented Jun 20, 2017

kube-aws v0.9.6-rc.2 - Unstable Etcd3 cluster #660

kube-aws v0.9.6-rc.2 - Unstable Etcd3 cluster #660

Comments

trinitronx commented May 17, 2017

Etcd0

Etcd1

Etcd2

redbaron commented May 17, 2017

trinitronx commented May 31, 2017 • edited Loading

trinitronx commented May 31, 2017

mumoshu commented May 31, 2017 • edited Loading

trinitronx commented May 31, 2017

Bad node result:

Good node result:

mumoshu commented Jun 1, 2017 • edited Loading

trinitronx commented Jun 1, 2017 • edited Loading

SystemD Unit "status" output

Conclusion:

More Etcd connection debugging

Additional information

trinitronx commented Jun 1, 2017

trinitronx commented Jun 3, 2017

Further Debugging

Testing Solution

Validation

1 Node Restore Problem

danielfm commented Jun 5, 2017 • edited Loading

trinitronx commented Jun 7, 2017

trinitronx commented Jun 7, 2017 • edited Loading

Update

mumoshu commented Jun 7, 2017

mumoshu commented Jun 7, 2017

trinitronx commented Jun 16, 2017

trinitronx commented Jun 17, 2017

Update:

Solution (for now)

trinitronx commented Jun 17, 2017

Worker nodes

Controller nodes

kube-proxy restart fix:

trinitronx commented Jun 20, 2017

Update

trinitronx commented May 31, 2017 •

edited

Loading

mumoshu commented May 31, 2017 •

edited

Loading

mumoshu commented Jun 1, 2017 •

edited

Loading

trinitronx commented Jun 1, 2017 •

edited

Loading

danielfm commented Jun 5, 2017 •

edited

Loading

trinitronx commented Jun 7, 2017 •

edited

Loading

`kube-proxy` restart fix: