
intermittent etcd failures post build #1372

Open
ryane opened this issue Apr 21, 2016 · 5 comments

@ryane
Contributor

ryane commented Apr 21, 2016

  • Ansible version (ansible --version): 1.9.4
  • Python version (python --version): 2.7.6
  • Git commit hash or branch: master
  • Cloud Environment: gce, aws
  • Terraform version (terraform version): v0.6.11

Over the last few builds, across platforms, I have occasionally seen some worker nodes come up with a failing etcd distributive check:

{"message":"Internal Server Error"}

On the affected node, the etcd service is running but there are lots of errors like this in the logs:

Apr 21 13:05:33 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:05:33 rafthttp: failed to find member 96bed1dbce03eb25 in cluster 7eda40fd26f24de5

This does not happen every time, so I am not sure yet how to reproduce it consistently.

We should figure out how to prevent this and document how to fix it if it does occur.
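A quick way to confirm what the failing distributive check is seeing is to query etcd directly on the affected node. A minimal sketch, assuming the v2 etcdctl is present and etcd is listening on its default client port 2379:

# health endpoint; a healthy member returns {"health": "true"}
curl http://127.0.0.1:2379/health

# members this node knows about, plus overall cluster health
etcdctl member list
etcdctl cluster-health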

@ryane
Contributor Author

ryane commented Apr 21, 2016

hm, after about 45 minutes, the problem resolved itself:

Apr 21 12:23:32 resching-gce-worker-02.c.asteris-mi.internal etcd-service-start.sh[7696]: 2016/04/21 12:23:32 rafthttp: failed to find member beed3541e47a9276 in cluster 7eda40fd26f24de5
... (lots of rafthttp errors) ...
Apr 21 13:08:08 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:08:08 rafthttp: the connection with fe4d96286d46b0e6 became active
Apr 21 13:08:12 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:08:12 etcdserver: publish error: etcdserver: request timed out
Apr 21 13:08:12 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:08:12 etcdserver: published {Name:resching-gce-worker-02 ClientURLs:[http://resching-gce-worker-02:2379]} to cluster 7eda40fd26f24de5

And the service is healthy in consul now.

@ryane ryane modified the milestone: 1.1 Apr 22, 2016
@ryane ryane modified the milestones: Feature Backlog, 1.1 Apr 28, 2016
@stevendborrelli
Contributor

I just saw this problem, and I believe it is due to etcd not running on every node in the defined ETCD_INITIAL_CLUSTER= during the initial boot-up of the cluster.

This can happen on existing clusters if you don't push etcd to all the nodes.

Should we make the initial etcd cluster more like consul and only require quorum from the control nodes?

It feels like the non-control nodes should be running as proxies and not taking part in the Raft election: https://coreos.com/etcd/docs/latest/proxy.html
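For reference, a minimal sketch of what proxy mode could look like on a worker, using etcd's standard environment variables; the control-node hostnames are placeholders, not Mantl's actual configuration:

# hypothetical etcd environment for a worker running as a proxy
ETCD_PROXY=on
ETCD_INITIAL_CLUSTER=control-01=http://control-01:2380,control-02=http://control-02:2380,control-03=http://control-03:2380
ETCD_LISTEN_CLIENT_URLS=http://127.0.0.1:2379

A proxy forwards client requests to the listed members but does not join the Raft quorum, so a worker coming up late should not affect the election.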

@distributorofpain

distributorofpain commented Jun 3, 2016

Seeing this in the newest build. When I access the Mesos UI, it frequently says the server is not available. The /var/log/messages log is scrolling with this message:

Note: Consul is green, no issues listed...

Jun  3 04:52:09 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:09 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:09 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:09 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,346] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,347] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,347] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,348] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:11 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:11 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,349] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,350] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,350] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,350] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:12 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:12 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,349] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,350] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,350] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,351] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)
Jun  3 04:52:14 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:14 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:14 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:14 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:15 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:15,562] INFO 10.128.36.67 -  -  [03/Jun/2016:04:52:15 +0000] "GET / HTTP/1.1" 200 3492 "-" "Consul Health Check" (mesosphere.chaos.http.ChaosRequestLog:15)
Jun  3 04:52:15 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:15 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:15 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:15 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,353] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,354] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,354] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,355] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)

Some error messages:

Failed to connect to slave '9bb05ba1-3873-4257-8257-0319c5b1f91a-S3' on '/mesos/slave/9bb05ba1-3873-4257-8257-0319c5b1f91a-S3'.
Potential reasons:
• The slave is not accessible from your network
• The slave timed out or went offline
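Since the rafthttp errors above all point at 10.128.35.83:2380, a first diagnostic step on the affected worker might be a basic reachability check of that peer; a rough sketch, assuming curl and nc are available on the node:

ping -c 3 10.128.35.83
curl -v --max-time 5 http://10.128.35.83:2380/
nc -zv -w 5 10.128.35.83 2380

The "no route to host" failures in particular usually point at a routing or firewall problem rather than a slow peer.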

@distributorofpain

I couldn't get the worker nodes to come back online. When I dug deeper, I found that one of the control nodes (the one in the error message above) was not reachable. I then rebooted the node, and after the reboot, the Consul piece would not come back online. The worker nodes continued to fail (rather than switching to another control node?). At this point three worker nodes and one control node were not working; my guess is the cause was the control node not coming back online.

I destroyed the environment and rebuilt it. After the rebuild, two of the worker nodes showed the same issue. Again, one control node wasn't reachable. I was able to restart that control node without issue, though, and it was fully functional after the reboot. What I did observe this time was that I could not ping the control node over the private network from the two worker nodes (or vice versa, as expected). I was able to ping all the other nodes from the control node and from the worker nodes as well; it was just that those particular worker nodes wouldn't talk to that particular control node.

I put the two worker nodes into Consul maintenance mode, but was unable to get them into Mesos maintenance mode (is there a guide for this somewhere? Killing the process doesn't work, since it just restarts). I went to bed, and when I woke up this morning, the servers were all able to talk to each other.
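On the maintenance-mode question: Consul has a built-in subcommand for this, and Mesos (0.25+) exposes maintenance primitives through the leading master's HTTP API. A rough sketch; the master address and payload file below are placeholders:

# run on the node itself to toggle Consul maintenance mode
consul maint -enable -reason "etcd connectivity issue"
consul maint -disable

# Mesos maintenance is scheduled against the master, e.g. POST /maintenance/schedule
# with a machine list and time window (address and JSON file are illustrative)
curl -X POST http://<leading-master>:5050/maintenance/schedule -d @maintenance-schedule.json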

So this leads me to believe there is some sort of temporary firewall rule being activated during the install, probably not intentionally, but perhaps it's being triggered?
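If that hypothesis is right, it should be visible in the host firewall on the unreachable control node. A minimal check, assuming the nodes use iptables/firewalld:

# look for rules that could drop or reject traffic on the etcd peer port
sudo iptables -S
sudo iptables -L -n -v | grep 2380

# if firewalld is managing the rules
sudo firewall-cmd --state
sudo firewall-cmd --list-all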

At this point the environment is green.

@langston-barrett
Contributor

The next time anyone sees this, please try #1616
