
intermittent etcd failures post build #1372

Open
ryane opened this issue Apr 21, 2016 · 5 comments

@ryane
Contributor

ryane commented Apr 21, 2016

  • Ansible version (ansible --version): 1.9.4
  • Python version (python --version): 2.7.6
  • Git commit hash or branch: master
  • Cloud Environment: gce, aws
  • Terraform version (terraform version): v0.6.11

Over the last few builds, across platforms, I have occasionally seen some worker nodes come up with a failing etcd distributive check:

{"message":"Internal Server Error"}

On the affected node, the etcd service is running but there are lots of errors like this in the logs:

Apr 21 13:05:33 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:05:33 rafthttp: failed to find member 96bed1dbce03eb25 in cluster 7eda40fd26f24de5

This does not happen every time, so I am not sure yet how to reproduce it consistently.

We should figure out how to prevent this and document how to fix it if it does occur.
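A quick way to confirm what the failing distributive check is seeing is to query etcd directly on the affected node. A minimal sketch, assuming the v2 etcdctl is present and etcd is listening on its default client port 2379:

# health endpoint; a healthy member returns {"health": "true"}
curl http://127.0.0.1:2379/health

# members this node knows about, plus overall cluster health
etcdctl member list
etcdctl cluster-health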

@ryane
Contributor Author

ryane commented Apr 21, 2016

hm, after about 45 minutes, the problem resolved itself:

Apr 21 12:23:32 resching-gce-worker-02.c.asteris-mi.internal etcd-service-start.sh[7696]: 2016/04/21 12:23:32 rafthttp: failed to find member beed3541e47a9276 in cluster 7eda40fd26f24de5
... (lots of rafthttp errors) ...
Apr 21 13:08:08 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:08:08 rafthttp: the connection with fe4d96286d46b0e6 became active
Apr 21 13:08:12 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:08:12 etcdserver: publish error: etcdserver: request timed out
Apr 21 13:08:12 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:08:12 etcdserver: published {Name:resching-gce-worker-02 ClientURLs:[http://resching-gce-worker-02:2379]} to cluster 7eda40fd26f24de5

And the service is healthy in consul now.

@ryane ryane modified the milestone: 1.1 Apr 22, 2016
@ryane ryane modified the milestones: Feature Backlog, 1.1 Apr 28, 2016
@stevendborrelli
Contributor

I just saw this problem, and I believe it is due to etcd not running on every node in the defined ETCD_INITIAL_CLUSTER= during the initial boot-up of the cluster.

This can happen on existing clusters if you don't push etcd to all the nodes.

Should we make the initial etcd cluster more like consul and only require quorum from the control nodes?

It feels like the non-control nodes should be running as proxies and not taking part in the Raft election: https://coreos.com/etcd/docs/latest/proxy.html
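For reference, a minimal sketch of what proxy mode could look like on a worker, using etcd's standard environment variables; the control-node hostnames are placeholders, not Mantl's actual configuration:

# hypothetical etcd environment for a worker running as a proxy
ETCD_PROXY=on
ETCD_INITIAL_CLUSTER=control-01=http://control-01:2380,control-02=http://control-02:2380,control-03=http://control-03:2380
ETCD_LISTEN_CLIENT_URLS=http://127.0.0.1:2379

A proxy forwards client requests to the listed members but does not join the Raft quorum, so a worker coming up late should not affect the election.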

@distributorofpain

distributorofpain commented Jun 3, 2016

Seeing this in the newest build. When I access the Mesos UI, it frequently says the server is not available. The /var/log/messages log is scrolling with this message:

Note: Consul is green, no issues listed...

Jun  3 04:52:09 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:09 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:09 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:09 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,346] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,347] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,347] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,348] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:11 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:11 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,349] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,350] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,350] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,350] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:12 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:12 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,349] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,350] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,350] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,351] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)
Jun  3 04:52:14 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:14 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:14 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:14 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:15 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:15,562] INFO 10.128.36.67 -  -  [03/Jun/2016:04:52:15 +0000] "GET / HTTP/1.1" 200 3492 "-" "Consul Health Check" (mesosphere.chaos.http.ChaosRequestLog:15)
Jun  3 04:52:15 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:15 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:15 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:15 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,353] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,354] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,354] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,355] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)

Some error messages:

Failed to connect to slave '9bb05ba1-3873-4257-8257-0319c5b1f91a-S3' on '/mesos/slave/9bb05ba1-3873-4257-8257-0319c5b1f91a-S3'.
Potential reasons:
• The slave is not accessible from your network
• The slave timed out or went offline
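Since the rafthttp errors above all point at 10.128.35.83:2380, a first diagnostic step on the affected worker might be a basic reachability check of that peer; a rough sketch, assuming curl and nc are available on the node:

ping -c 3 10.128.35.83
curl -v --max-time 5 http://10.128.35.83:2380/
nc -zv -w 5 10.128.35.83 2380

The "no route to host" failures in particular usually point at a routing or firewall problem rather than a slow peer.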

@distributorofpain

I couldn't get the worker nodes to come back online. When I dug deeper, I found that one of the control nodes (the one in the error message above) was not reachable. I then rebooted the node, and after the reboot, the Consul piece would not come back online. The worker nodes continued to fail (rather than switching to another control node?). At this point three worker nodes and one control node were not working; my guess is the cause was the control node not coming back online.

I destroyed the environment and rebuilt it. After the rebuild, two of the worker nodes showed the same issue. Again, one control node wasn't reachable. I was able to restart that control node without issue, though, and it was fully functional after the reboot. What I did observe this time was that I could not ping the control node over the private network from the two worker nodes (or vice versa, as expected). I was able to ping all the other nodes from the control node and from the worker nodes as well; it was just that those particular worker nodes wouldn't talk to that particular control node.

I put the two worker nodes into Consul maintenance mode, but was unable to get them into Mesos maintenance mode (is there a guide for this somewhere? Killing the process doesn't work, since it just restarts). I went to bed, and when I woke up this morning, the servers were all able to talk to each other.
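On the maintenance-mode question: Consul has a built-in subcommand for this, and Mesos (0.25+) exposes maintenance primitives through the leading master's HTTP API. A rough sketch; the master address and payload file below are placeholders:

# run on the node itself to toggle Consul maintenance mode
consul maint -enable -reason "etcd connectivity issue"
consul maint -disable

# Mesos maintenance is scheduled against the master, e.g. POST /maintenance/schedule
# with a machine list and time window (address and JSON file are illustrative)
curl -X POST http://<leading-master>:5050/maintenance/schedule -d @maintenance-schedule.json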

So this leads me to believe there is some sort of temporary firewall rule being activated during the install, probably not intentionally, but perhaps it's being triggered?
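If that hypothesis is right, it should be visible in the host firewall on the unreachable control node. A minimal check, assuming the nodes use iptables/firewalld:

# look for rules that could drop or reject traffic on the etcd peer port
sudo iptables -S
sudo iptables -L -n -v | grep 2380

# if firewalld is managing the rules
sudo firewall-cmd --state
sudo firewall-cmd --list-all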

At this point the environment is green.

@langston-barrett
Contributor

The next time anyone sees this, please try #1616
