Frequent membership loss for 50-100 node cluster in AWS #916

Closed

Amit-PivotalLabs opened this issue May 7, 2015 · 67 comments
@Amit-PivotalLabs

The main problem we're seeing: we have KV data associated with a lock, and hence with a session, and the agents tied to those sessions "frequently" lose membership briefly, which invalidates the associated KV data. By "frequently", I mean that in a cluster of 3 Consul servers and 50-100 other nodes running Consul agents, we see isolated incidents of agent membership loss roughly every 3-4 hours (sometimes ~30 minutes apart, sometimes 7-8 hours apart, but the gaps appear randomly distributed with an average of 3-4 hours).

We have a heterogeneous cluster configured as a LAN, deployed across 3 AZs in EC2. There appears to be no correlation between the type of node and the loss of membership (e.g. if we have 10 instances of service A and 50 of service B, we will see roughly 2 membership losses among A instances and 10 among B instances over a period of a few days). It also appears to be independent of stress on the system, i.e. we see roughly the same distribution of membership loss when the system is idle as when it's under heavy load.

We use consul for service discovery/DNS as well as a KV store. We use the KV store specifically for the purpose of:

  • maintaining presence -- certain nodes maintain the presence of a key with a TTL on it to indicate they are up and running, and when those disappear another node can react and re-schedule the workload that was being performed by that node, and
  • locks -- for multiple instances of the same service to determine which one actually provides the service, in the case where the service needs to be a singleton

In addition to the occasional member loss, we also have the issue that when we roll the consul server cluster, which triggers a leader election, the KV store is temporarily unavailable. This potentially prevents a presence maintainer from bumping its lock before the TTL expires.

Questions:

  1. Are there ways to configure things so that membership is more robust/more forgiving? We've seen that WAN mode has more conservative parameters, e.g. a larger SuspicionMult (suspicion multiplier), but switching to WAN feels like it would just be papering over the problem and would only reduce its frequency.
  2. Is there any advice for a better way to implement "presence maintenance"? Because we use locks, and hence sessions, we're sensitive to the lossiness of UDP and the probabilistic nature of gossip, even though the application maintaining its presence in the KV store is itself running fine. E.g. is there a way to establish a lock but side-step the coupling to the membership of its agent?
  3. What information would help diagnose this problem? We have plenty of server and agent configuration, consul server logs, consul agent logs, and application logs from services running alongside consul agents, all spanning several days.

Thanks,
Amit + @matt-royal

@armon
Member

armon commented May 7, 2015

@Amit-PivotalLabs That is certainly strange. Could you attach some log files from one of these losses? It is generally useful to see both sides of the log (i.e. the node that was suspected of failure and the node that did the suspecting; it should be clear from the logs which two nodes are involved). This may also expose a pattern of pairs.

With respect to the questions:

  1. There is currently no way to tune the underlying gossip values, but exposing the low-level tunables of both Raft and Serf is a goal for 0.6. For now we've just tried to ship very conservative defaults. You could always do a custom build and update those values, however.

  2. This is totally possible. When a session is created, it associates with the "serfHealth" check by default, but that is not necessary. You can override the default and not associate with that check, so it will not be used to automatically invalidate the session. Instead you can use TTL-based sessions, which are not susceptible to UDP issues (see the sketch after this list).

  3. Logs primarily. Enabling telemetry is generally useful as well, since you can look for correlations between various metrics that may be impacting the system. E.g. any spikes in the gossip metrics would be interesting.
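
For readers landing here later, here is a minimal sketch of what answer 2 describes, using the Go API client. The session/key names and the TTL and renew interval are made up for illustration, not recommendations: a TTL-only session with no serfHealth association holds a presence key and is kept alive by the application itself.

    package main

    import (
        "log"
        "time"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig()) // talks to the local agent
        if err != nil {
            log.Fatal(err)
        }

        // Create a session tied only to a TTL, not to the node's serfHealth check,
        // so a brief gossip flap cannot invalidate it.
        id, _, err := client.Session().CreateNoChecks(&api.SessionEntry{
            Name:     "presence-cell_z2-14",     // hypothetical name
            TTL:      "30s",                     // illustrative TTL
            Behavior: api.SessionBehaviorDelete, // drop held keys only if the TTL truly expires
        }, nil)
        if err != nil {
            log.Fatal(err)
        }

        // Hold a presence key under that session.
        ok, _, err := client.KV().Acquire(&api.KVPair{
            Key:     "presence/cell_z2-14", // hypothetical key
            Value:   []byte("alive"),
            Session: id,
        }, nil)
        if err != nil || !ok {
            log.Fatalf("could not acquire presence key: ok=%v err=%v", ok, err)
        }

        // Renew well inside the TTL; this, not UDP gossip between agents,
        // is what keeps the session (and the key) alive.
        for range time.Tick(10 * time.Second) {
            if _, _, err := client.Session().Renew(id, nil); err != nil {
                log.Printf("renew failed: %v", err)
            }
        }
    }

If the renew loop ever stalls past the TTL, the delete behavior above removes the key, which is exactly the "presence disappeared" signal a scheduler can react to.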

@Amit-PivotalLabs
Author

Thanks @armon, that's great info. We'll look into overriding the serfHealth check association, TTL-based sessions, and enabling telemetry.

Here are a few example log snippets:

Node: cell_z2-14
IP: 10.10.6.88
Time: 2015/04/28 11:39:07
Agent Logs from Failed Node:
    2015/04/28 11:39:07 [WARN] memberlist: Refuting a dead message (from: cell_z2-3)
Most Informative Server Logs:
    2015/04/28 11:39:07 [INFO] memberlist: Marking cell_z2-14 as failed, suspect timeout reached
    2015/04/28 11:39:07 [INFO] serf: EventMemberFailed: cell_z2-14 10.10.6.88
    2015/04/28 11:39:07 [INFO] consul: member 'cell_z2-14' failed, marking health critical
    2015/04/28 11:39:07 [INFO] serf: EventMemberJoin: cell_z2-14 10.10.6.88
    2015/04/28 11:39:07 [INFO] consul: member 'cell_z2-14' joined, marking health alive
Node: cell_z1-23
IP: 10.10.5.89
Time: 2015/04/28 07:35:44
Agent Logs from Failed Node:
    2015/04/28 07:35:51 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
    2015/04/28 07:35:56 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
    2015/04/28 07:36:01 [ERR] http: Request /v1/session/create, error: rpc error: rpc error: Check 'serfHealth' is in critical state
    2015/04/28 07:36:06 [ERR] http: Request /v1/session/create, error: rpc error: rpc error: Check 'serfHealth' is in critical state
    2015/04/28 07:36:11 [ERR] http: Request /v1/session/create, error: rpc error: rpc error: Check 'serfHealth' is in critical state
    2015/04/28 07:36:16 [ERR] http: Request /v1/session/create, error: rpc error: rpc error: Check 'serfHealth' is in critical state
    2015/04/28 07:36:21 [ERR] http: Request /v1/session/create, error: rpc error: rpc error: Check 'serfHealth' is in critical state
    2015/04/28 07:36:24 [WARN] memberlist: Refuting a suspect message (from: cell_z1-23)
    2015/04/28 07:38:56 [INFO] memberlist: Marking cell_z2-18 as failed, suspect timeout reached
    2015/04/28 07:38:56 [INFO] serf: EventMemberFailed: cell_z2-18 10.10.6.72
    2015/04/28 07:39:35 [INFO] serf: EventMemberJoin: cell_z2-18 10.10.6.72
Most Informative Server Logs #1:
    2015/04/28 07:35:44 [INFO] serf: EventMemberFailed: cell_z1-23 10.10.5.89
    2015/04/28 07:35:44 [INFO] consul: member 'cell_z1-23' failed, marking health critical
    2015/04/28 07:36:24 [INFO] serf: EventMemberJoin: cell_z1-23 10.10.5.89
    2015/04/28 07:36:24 [INFO] consul: member 'cell_z1-23' joined, marking health alive
Most Informative Server Logs #2:
    2015/04/28 07:35:44 [INFO] memberlist: Marking cell_z1-23 as failed, suspect timeout reached
    2015/04/28 07:35:44 [INFO] serf: EventMemberFailed: cell_z1-23 10.10.5.89
Node: cell_z2-4
IP: 10.10.6.75
Time: 2015/04/28 04:10:56
Agent Logs from Failed Node:
    2015/04/28 04:10:56 [WARN] memberlist: Refuting a dead message (from: brain_z2-0)
Most Informative Server Logs:
    2015/04/28 04:10:56 [INFO] memberlist: Marking cell_z2-4 as failed, suspect timeout reached
    2015/04/28 04:10:56 [INFO] serf: EventMemberFailed: cell_z2-4 10.10.6.75
    2015/04/28 04:10:56 [INFO] consul: member 'cell_z2-4' failed, marking health critical
    2015/04/28 04:10:57 [INFO] serf: EventMemberJoin: cell_z2-4 10.10.6.75
Node: cell_z2-3
IP: 10.10.6.69
Time: 2015/04/28 00:59:49
Agent Logs from Failed Node:
    2015/04/28 00:51:57 [INFO] serf: EventMemberFailed: api_z2-0 10.10.3.66
    2015/04/28 00:52:19 [INFO] serf: EventMemberJoin: api_z2-0 10.10.3.66
    2015/04/28 00:59:39 [INFO] memberlist: Suspect stress_tests-0 has failed, no acks received
    2015/04/28 00:59:56 [WARN] memberlist: Refuting a suspect message (from: cell_z2-3)
Most Informative Server Logs #1:
    2015/04/28 00:59:49 [INFO] memberlist: Marking cell_z2-3 as failed, suspect timeout reached
    2015/04/28 00:59:49 [INFO] serf: EventMemberFailed: cell_z2-3 10.10.6.69
    2015/04/28 00:59:56 [INFO] serf: EventMemberJoin: cell_z2-3 10.10.6.69
Most Informative Server Logs #2:
    2015/04/28 00:59:49 [INFO] serf: EventMemberFailed: cell_z2-3 10.10.6.69
    2015/04/28 00:59:49 [INFO] consul: member 'cell_z2-3' failed, marking health critical
    2015/04/28 00:59:56 [INFO] serf: EventMemberJoin: cell_z2-3 10.10.6.69
    2015/04/28 00:59:56 [INFO] consul: member 'cell_z2-3' joined, marking health alive
Node: cell_z1-24
IP: 10.10.5.79
Time: 2015/04/27 19:48:37
Agent Logs from Failed Node:
    2015/04/27 19:48:43 [ERR] http: Request /v1/session/create, error: rpc error: Check 'serfHealth' is in critical state
    2015/04/27 19:48:47 [WARN] memberlist: Refuting a suspect message (from: cell_z1-24)
Most Informative Server Logs:
    2015/04/27 19:48:37 [INFO] serf: EventMemberFailed: cell_z1-24 10.10.5.79
    2015/04/27 19:48:37 [INFO] consul: member 'cell_z1-24' failed, marking health critical
    2015/04/27 19:48:48 [INFO] serf: EventMemberJoin: cell_z1-24 10.10.5.79
    2015/04/27 19:48:48 [INFO] consul: member 'cell_z1-24' joined, marking health alive

@aj-jester

@Amit-PivotalLabs @armon

I can confirm we see the same frequent membership loss issue in AWS. I will try to get some logs.

@armon
Member

armon commented Jul 22, 2015

I missed this in my inbox earlier. Based on the logs it looks 100% like UDP routing issues. I would suspect iptables, routing rules, security groups, or other NAT/firewall rules at play.

@Amit-PivotalLabs
Author

Thanks @armon.

/cc @fraenkel @ematpl @luan see above comment from @armon re UDP routing issues.

@armon
Member

armon commented Jul 22, 2015

@Amit-PivotalLabs As a heads up, the newest Consul builds from master do an additional TCP-based health check. They provide more debug output to help indicate a UDP routing issue. Might be worth building from master to investigate.

@emalm

emalm commented Jul 22, 2015

Thanks, @armon and @Amit-PivotalLabs!

@bruno-loyal3

Hi @ematpl, @armon, and @Amit-PivotalLabs, did you ever find the root cause of this error? We are seeing the same errors in our systems, also with 3 Consul servers (one of them on-prem) in different AZs. The same errors appear on AWS hosts as well as on the on-prem Consul server. Is there any way to increase the timeout?
Checking the logs, it happens almost every hour on each client in our case, so it's "frequent" enough.

We are seeing the same errors in our logs:

2015/08/26 15:12:23 [INFO] serf: EventMemberJoin: ip-10-44-90-183 10.44.90.183
2015/08/26 15:12:23 [INFO] serf: EventMemberJoin: ip-10-44-94-94 10.44.94.94
2015/08/26 15:12:23 [INFO] serf: EventMemberJoin: ip-10-44-94-232 10.44.94.232
2015/08/26 15:12:23 [INFO] consul: member 'ip-10-44-191-249' joined, marking health alive
2015/08/26 15:12:23 [INFO] consul: member 'ip-10-44-90-183' joined, marking health alive
2015/08/26 15:12:23 [INFO] consul: member 'ip-10-44-94-94' joined, marking health alive
2015/08/26 15:12:23 [INFO] consul: member 'ip-10-44-94-232' joined, marking health alive
2015/08/26 15:12:39 [INFO] memberlist: Marking ip-10-44-94-232 as failed, suspect timeout reached
2015/08/26 15:12:39 [INFO] serf: EventMemberFailed: ip-10-44-94-232 10.44.94.232
2015/08/26 15:12:39 [INFO] consul: member 'ip-10-44-94-232' failed, marking health critical
2015/08/26 15:12:39 [INFO] serf: EventMemberFailed: ip-10-44-93-21 10.44.93.21
2015/08/26 15:12:39 [INFO] consul: member 'ip-10-44-93-21' failed, marking health critical

@armon
Member

armon commented Aug 28, 2015

@bruno-loyal3 This symptom almost always indicates a UDP routing issue. The best bet is to go to the nodes being marked as failed; their logs should indicate which peer suspected them of failure. E.g. node "ip-10-44-93-21" will have a log saying something like "Refuting a [suspect|dead] message (from: <peer>)". This indicates a routing issue between "ip-10-44-93-21" and that peer.

@cya9nide

cya9nide commented Sep 4, 2015

Having this same issue with a cluster whose Consul servers span AWS and our on-prem VMware environment; some nodes consistently join and leave over and over with the following message.

2015/09/04 14:36:54 [WARN] memberlist: Refuting a suspect message

@aj-jester

I wonder if this is a problem with AWS's "classic" network. Has anyone tried their VPC network and experienced the same issue?

@Amit-PivotalLabs
Author

@aj-jester: Everything we (myself, @ematpl, @fraenkel, @luan) have done has been in a VPC.

@cya9nide

cya9nide commented Sep 4, 2015

@aj-jester VPC here as well. Can't figure this one out...yet.

@zonzamas

zonzamas commented Nov 3, 2015

Just wanted to say that we are having the same issue in a VPC network, with around 35 Consul members (5 server nodes). From time to time some of the nodes leave the cluster for a brief time (a few seconds), which triggers consul-template to reload some services.

@djenriquez

Identical issue to #1212.

Can you try modifying the MTU of your nodes to 1500?

@changwuf31

Just want to say that issue #1335 is connected to this also.

@djenriquez I've tried setting the MTU to 1500 on all nodes where Consul is installed, but we still face this issue, currently running around 17 Consul members with just 1 leader (so we can say this has nothing to do with any leader election problem).

@sstarcher

This looks similar to the issue I'm still having while running 0.6.3 on AWS in docker.

@draxly

draxly commented Jan 25, 2016

Running 0.6.3 on AWS as well (using Docker 1.9.1) and seeing serf members leaving/joining as described throughout this post. Does the 0.6.3 version of Consul "expose low-level tunables of both Raft and Serf", @armon?

@armon
Member

armon commented Jan 26, 2016

@draxly Likely 0.7 will expose more of the low-level tunables; 0.6.3 does not. A high level of flapping almost certainly still indicates a network configuration problem, as there are customers running thousands of nodes in AWS without issue. Later versions of Consul try to provide more helpful diagnostic log messages, particularly in the case of misconfigured UDP connectivity.

@sstarcher

@armon I would love my issue to be a configuration issue. I have been running the cluster for over a year and it has been a consistent low-level annoyance. Over the weekend it went 2 days without any flapping, but lost connection on Sunday evening.

@armon
Member

armon commented Jan 26, 2016

@sstarcher It's hard to give any real blanket answer, unfortunately. Without access to a lot more information, there are so many potential root causes. The most common issue is simple misconfiguration. We've seen everything from Xen hypervisor bugs, to SYN floods, to driver bugs, to CPU/NIC exhaustion, to Serf bugs as the root cause.

If it's a very low level of flapping on a large cluster, oftentimes it falls within the acceptable level of false positives, since no failure detector is perfect (it's a trade-off of time to detection vs. FP rate). If it's a very high level of flapping, it's likely misconfiguration, as Serf works pretty well. If it's somewhere in the middle, then fairly extensive forensics needs to take place.

Unfortunately, we have only so much time to provide support to the community in addition to development on the core. Given reproduction cases or detailed reports, we do our absolute best to solve the issue. Reports that are open-ended with many possible root causes are much harder, as they consume an enormous amount of time and are not necessarily an issue with Consul at the root.

@draxly

draxly commented Jan 26, 2016

Thanks @armon for the information regarding configuration. I have now upgraded all my AWS instance sizes to m3.large and still see occasional serf health check problems.
Or perhaps I am misinterpreting the Consul logs? On the Consul server (running only 1 server, but that should not affect serf, right?) I can see the following in the logs:

2016/01/26 10:42:18 [INFO] serf: EventMemberFailed: ip-10-7-1-190 10.7.1.190
2016/01/26 10:42:18 [INFO] consul: member 'ip-10-7-1-190' failed, marking health critical
2016/01/26 10:42:34 [INFO] serf: EventMemberJoin: ip-10-7-1-190 10.7.1.190
2016/01/26 10:42:34 [INFO] consul: member 'ip-10-7-1-190' joined, marking health alive

Does this imply that the server 10.7.1.190 is not available for service discovery during the problem period?

@draxly

draxly commented Jan 26, 2016

We are using Ubuntu 14.04.2 LTS and I have tried to locate problems regarding our network configuration. Can anyone point out some special configuration that is recommended with regards to UDP for serf to work better?

@sstarcher

@draxly We are running m3.mediums on Ubuntu 14.04.2 LTS, running inside of Docker 1.9.2.
We have UDP, TCP, and ICMP fully open within our subnets and see similar issues.

@armon
Member

armon commented Jan 26, 2016

@sstarcher @draxly What you want to look for is the source of the flapping in the cluster. The failure detector works on roughly a ping/ack model, so Node A pings Node B and, if that fails, gossips that Node B has failed (as a high-level summary). This log:

2016/01/26 10:42:18 [INFO] serf: EventMemberFailed: ip-10-7-1-190 10.7.1.190

Indicates that ip-10-7-1-190 is being marked as failed (Node B). That same node refutes the claim about 20 seconds later and is marked as healthy again. I would check the logs on ip-10-7-1-190 to see which peer is suspecting it of failing (Node A). There may be a network issue between A <-> B. Alternatively, maybe there is some correlated issue with either A or B (high load, IO starvation, process crashed, etc.).
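
To make the ping/suspect/refute cycle above concrete, here is a heavily simplified, self-contained Go sketch. It is not memberlist's actual code; the node name, timeouts, and incarnation handling are illustrative only:

    package main

    import (
        "fmt"
        "time"
    )

    type member struct {
        name        string
        alive       bool
        incarnation int // bumped by a node to refute a suspicion about itself
    }

    // suspectThenResolve models one round of the cycle: a probe with no ack makes
    // the prober gossip "suspect"; the suspected node then has until the suspicion
    // timeout to refute with a higher incarnation, otherwise it is declared failed.
    func suspectThenResolve(m *member, refutedInTime bool, suspicionTimeout time.Duration) {
        fmt.Printf("suspect %s: no ack received\n", m.name)
        if refutedInTime {
            m.incarnation++ // the refutation outranks the suspicion
            m.alive = true
            fmt.Printf("%s refuting a suspect message (incarnation %d) -> stays alive\n", m.name, m.incarnation)
            return
        }
        time.Sleep(suspicionTimeout)
        m.alive = false
        fmt.Printf("marking %s as failed, suspect timeout reached\n", m.name)
    }

    func main() {
        b := &member{name: "ip-10-7-1-190", alive: true}

        // The case seen in the logs above: a UDP ack was lost but the node is up,
        // so it refutes and "flaps" back to alive shortly afterwards.
        suspectThenResolve(b, true, 5*time.Second)

        // A genuinely dead node: no refutation arrives before the timeout.
        suspectThenResolve(b, false, 1*time.Second)
    }

The flapping in these logs corresponds to the first case: an ack was lost somewhere, the node was gossiped as failed, and its refutation brought it back as alive.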

@armon
Member

armon commented Jan 26, 2016

The extra wrinkle here is that you are both running Consul in Docker. Depending on how you are using Docker, it adds an extra layer of network hops and a UDP proxy, which can introduce packet loss and higher CPU load, and which has known issues with ARP caching. It's entirely possible that just moving Consul out of Docker will resolve the issue.

@sstarcher

@armon I'm using Docker with --net=host to remove the ARP problems. My servers are highly over-provisioned and run at 15% CPU for Consul. I do see sporadic jumps in CPU every day or so, at random. I was hoping my issue was #1592, but after upgrading to 0.6.3 I still see issues, though it's starting to look like they occur less often.

@armon
Member

armon commented Jan 26, 2016

@sstarcher One trick I've used is to grep the logs for the Failed/Join events and use awk to aggregate by node. See whether the failures are random (each node equally likely to fail) or concentrated; a concentration sometimes points to a problem node. We've sometimes had to just recycle EC2 instances that were particularly problematic.
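
If awk isn't handy, a rough Go equivalent of that aggregation might look like the following (the log path is a placeholder; point it at wherever your agent or server writes its logs):

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("/var/log/consul/consul.log") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        counts := map[string]int{}
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            line := scanner.Text()
            // Lines look like: "... [INFO] serf: EventMemberFailed: cell_z2-14 10.10.6.88"
            if i := strings.Index(line, "EventMemberFailed: "); i >= 0 {
                fields := strings.Fields(line[i+len("EventMemberFailed: "):])
                if len(fields) > 0 {
                    counts[fields[0]]++ // aggregate failure events by node name
                }
            }
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }

        for node, n := range counts {
            fmt.Printf("%5d %s\n", n, node)
        }
    }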

@sstarcher

@armon In the past 24 hours I have 2,907 occurrences of EventMemberFailed, but the large majority of these are for members and not the 5 leader nodes.

On just my 5 leader nodes I have 253 occurrences of EventMemberFailed. At some point in the last 24 hours, every leader node has had someone report EventMemberFailed for it.

4 of the 5 nodes have at certain times reported serf EventMemberFailed for every other node. Only a single node has no entries for EventMemberFailed.

This raises an interesting point: the 4 nodes that are complaining about each other do so within 30 seconds of each other.

Jan 25th 19:50:43, 23:08:55, 23:09:11, 23:10:00
Jan 26th 00:17:09

@hack-s

hack-s commented Mar 8, 2016

It's 0.5.2, and it was probably a little premature to say it solved the issue, since I can still see a flap once in a while, but the rate decreased a lot. I will continue to investigate today, and update to 0.6.3.

@calvn
Contributor

calvn commented Mar 11, 2016

@hack-s any conclusions as to what on 0.5.2 might be triggering nodes to flap? Did you have a mix of 0.5.2 and 0.6.3 nodes before? I am also experiencing nodes flapping from time to time; the cluster contains a mix of the two versions.

@draxly

draxly commented Mar 17, 2016

Consul 0.6.4 released yesterday and this bug fix looks promising:
"Updated memberlist to pull in a fix for leaking goroutines when performing TCP fallback pings. This affected users with frequent UDP connectivity problems. [GH-1802]"

@draxly

draxly commented Mar 17, 2016

Got this problem on my test environment (with no load whatsoever) on EC2:
2016/03/17 13:25:31 [ERR] memberlist: Failed TCP fallback ping: read tcp 10.6.1.207:42175->10.6.1.213:8301: i/o timeout

This is the problem I have had all along. Any more advice? =) For the life of me I cannot understand why the Consul agent times out here.

@slackpad
Contributor

@draxly can you vet the connectivity between those two hosts over TCP and UDP, in both directions, on port 8301? It looks like something is black-holing the traffic, so it doesn't refuse to connect but times out trying.
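
For anyone doing that check, here is a rough Go probe along those lines. The target address is a placeholder taken from the log above; run it from each host toward the other, and remember that a bare UDP send cannot confirm delivery, so watch the far side with tcpdump, a UDP listener, or the agent logs:

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        target := "10.6.1.213:8301" // placeholder: the peer's serf LAN address/port

        // TCP: a completed handshake proves the port is reachable and not filtered.
        tcpConn, err := net.DialTimeout("tcp", target, 3*time.Second)
        if err != nil {
            fmt.Println("TCP 8301 unreachable:", err)
        } else {
            fmt.Println("TCP 8301 reachable")
            tcpConn.Close()
        }

        // UDP: a send can succeed locally even if the packet is dropped in transit,
        // so confirm arrival on the far side.
        udpConn, err := net.Dial("udp", target)
        if err != nil {
            fmt.Println("UDP dial failed:", err)
            return
        }
        defer udpConn.Close()
        if _, err := udpConn.Write([]byte("ping")); err != nil {
            fmt.Println("UDP send failed:", err)
        } else {
            fmt.Println("UDP datagram sent (verify receipt on the far side)")
        }
    }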

@draxly

draxly commented Mar 21, 2016

@slackpad, I believe I have verified that by using netcat on both servers and seeing that TCP and UDP connections are successful.
Also, this problem occurs a couple of times per day across the entire environment (10 servers), so it is a pretty infrequent problem. I'm starting to believe it might be something in the network between the Amazon EC2 instances, but I'm not sure how to verify this.

@calvn
Contributor

calvn commented Mar 21, 2016

@draxly I am also seeing TCP fallback errors, and the cluster is on AWS if that helps. Looking at the CloudWatch logs, I don't see any network I/O issue when the error appears. As you mentioned, this issue only occurs from time to time, so it's definitely not a blocked port or anything of that sort. I also suspect that it could be a network issue between the EC2 instances, but am likewise unsure how to verify this theory.

@draxly

draxly commented Mar 31, 2016

Hi! This is perhaps common knowledge, but it turned out to make a big difference for us running on Amazon, so I think it's worth mentioning.
By using the enhanced networking option for our EC2 instances, a lot of the problems with flapping nodes have disappeared. We are now running Consul servers and agents with Docker using --net and it looks really promising.

@blakeblackshear

@draxly This is probably for the same reason, but I do not see the flapping within our VPC that is set to dedicated tenancy. Just another data point suggesting it is network related. Unfortunately, we use smaller instance types. I don't see why I need to run such a beefy machine with enhanced networking for Consul to work properly. Seems like we just need some tunable thresholds within Consul.

@byrnedo

byrnedo commented Apr 17, 2016

I'm also seeing this on AWS with Consul 0.6.3. My Consul servers don't do a lot of work apart from health checks and DNS, so they're running on t2.micro instances.

@sstarcher

@byrnedo We were previously using t2 instance types and noticed our CPU credits were never being used up. Because of this we assumed the load on the server was not high enough to matter. After we moved to m3.medium servers we immediately saw we were wrong. We also see significantly fewer leader elections.

@rgardam

rgardam commented Apr 26, 2016

I'm really surprised this hasn't been getting the attention this ticket deserves.

Is there anyone using consul in aws that's not seeing this issue?

This is causing major headaches for us, as we're using consul-template, and if a node is ever marked as unhealthy a service restart is triggered.

This pretty much means that consul/consul-template is unusable for us.

I spoke with some people from AWS about this and they were also surprised to hear about this.

Any update from the HashiCorp side about this?
If they're offering a supported product in the future, this is one issue that needs to be resolved.

@draxly

draxly commented Apr 26, 2016

@rgardam, have you tried using EC2 instances with enhanced networking enabled? It does require a larger instance size, for example m4.large, but it solved the problem for us.
It seems that 0.7 of Consul will include settings for configuring the agents to be more "forgiving".

@rgardam

rgardam commented Apr 26, 2016

@draxly I get that this potentially fixes the issue, but it's a big increase in cost that I feel shouldn't be necessary. I am running other systems without any issues on t2.micros.

I'm hoping that 0.7 reduces this issue.

@sstarcher

@rgardam With any amount of real load Consul requires CPU. We were seeing constant leader elections even with m4.larges under our workload. The moment we moved to c4s, pretty much all elections went away. If you want to lower CPU load, instead of using consul-template for individual keys, store the entire config inside one key.

@byrnedo

byrnedo commented Apr 27, 2016

@sstarcher That seems to have solved our problems (moving to m3.medium), thanks!
@rgardam I'm with you on the increase in cost

@rgardam

rgardam commented May 6, 2016

After watching this
https://youtu.be/3qln2u1Vr2E?t=1108
I think this could have something to do with the fact that some packet magic is happening on all non-enhanced-networking systems, i.e. the mapping service intercepts ARP requests and rewrites ARPs and other headers.
Could it be that Consul is also trying to do some packet magic?

@slackpad
Contributor

@rgardam We don't do any packet magic in Consul; that's an interesting presentation, though. We've spent some effort to make this better in the upcoming release of Consul - #2101. A given host experiencing problems may still flap, but we've worked to reduce the damage that host can cause to the rest of the cluster.

@sstarcher

@rgardam Without spending time looking into it, my guess would be that it's related to GOMAXPROCS/Docker/t2 instance types. When running inside of Docker on a t2 instance type, Consul seems to use very little CPU, but the moment I moved to an m3.medium the CPU usage shot up.

@paladin8

paladin8 commented Aug 21, 2016

We are running into this issue as well, and as we start to rely more on Consul it's becoming quite problematic, since we get a bunch of false alarms for nodes being down. Here's a bunch of information; let me know what else would be useful to get this figured out. It's a huge issue with an otherwise awesome piece of infrastructure.

AWS: us-west-2 region, multi-AZ, EC2 classic
Servers: 3 (c3.xlarge)
Members: 391 (mixed)
Security groups: 8300, 8301, 8302 open for TCP/UDP between all agents

consul info:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 1
build:
    prerelease = 
    revision = 26a0ef8c
    version = 0.6.4
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    server = true
raft:
    applied_index = 9566406
    commit_index = 9566406
    fsm_pending = 0
    last_contact = 0
    last_log_index = 9566406
    last_log_term = 607
    last_snapshot_index = 9564535
    last_snapshot_term = 607
    num_peers = 2
    state = Leader
    term = 607
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 1080
    max_procs = 4
    os = linux
    version = go1.6
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 259
    failed = 0
    intent_queue = 0
    left = 3
    member_time = 8655
    members = 390
    query_queue = 0
    query_time = 8
serf_wan:
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 8
    members = 1
    query_queue = 0
    query_time = 1

We see sporadic failed events for nodes in our cluster -- sometimes we'll get a burst of 2-4 nodes failing out at the same time, but usually it's just a single one. They might happen every 5 minutes for a while, or there might be a 6+ hour gap between any instances. The "failed" node rejoins sometime between 2 seconds and 2 minutes afterwards (slightly after it refutes the suspect/dead message), but usually in the 5-20s range.

Analyzing the failed checks, it looks like the agent marking another as failed/dead is random (pretty even across the cluster), but the agents being marked as failed/dead are somewhat correlated. They tend to be nodes with higher network/IO load, but it's not always the case, and it's not very clear cut (over the last couple of days 20+ different nodes have been marked dead/failed). Additionally, I haven't found any real correlation with load on the failed machine at the time of the membership loss.

Couple of questions:

  • Can an agent that is overloaded cause a separate agent to appear failed/dead?
  • What are the improvements in 0.7 for stability of cluster membership? When do you plan to have a production-ready release of 0.7?

Please let me know what else I can provide to help debug this. Thanks for any help!

@jhmartin
Contributor

jhmartin commented Aug 21, 2016

@paladin8 An overloaded agent can definitely cause another node to show as failed. All the nodes randomly health-check each other, so a bad node B that is overloaded and attempts to check good node G can erroneously time out and gossip that G is down.

My understanding is that other nodes will then recheck G, so it takes a confirmation to actually mark it as dead. However, if you have multiple overloaded nodes then there is a chance that both the primary and secondary check could land on overloaded nodes and mark G as down.

@paladin8

paladin8 commented Aug 22, 2016

@jhmartin Thanks for the reply. If that's the case, I would guess that a specific node or set of nodes would appear to be frequently marking random other nodes as failed/dead, but we aren't observing any patterns among the nodes that are doing the marking (i.e. those logging messages like "memberlist: Marking {host} as failed, suspect timeout reached"). Does that sound like it rules out any issue related to an overloaded machine performing a check?

@slackpad
Contributor

@paladin8 if it doesn't look like there's a small set of source hosts then that seems like it might be a network issue. That timeout reached message is the right one to look for in the logs to see who's actually detecting another node as failed.

@slackpad
Contributor

I'm going to close this out as Consul 0.7 introduced Lifeguard enhancements to the gossip layer to combat these kinds of issues. Since there's a lot of noise here, please open a new issue (potentially linking to this one) if you are seeing trouble with Consul 0.7 and later.

duckhan pushed a commit to duckhan/consul that referenced this issue Oct 24, 2021
* Update consul and consul-k8s images
* Provide NET_ADMIN capability to PSP when running with tproxy

Co-authored-by: Kyle Schochenmaier <[email protected]>