Frequent membership loss for 50-100 node cluster in AWS #916
Comments
@Amit-PivotalLabs That is certainly strange. Could you attach some log files from one of these losses? It is generally useful to see both sides of the log (e.g. the node that was suspected of failure and the node that did the suspecting; it should be clear from the logs which two nodes are involved). This may also expose some pattern of pairs. With respect to the questions:
Thanks @armon, that's great info. We'll look into overriding the serfHealth check association, using TTL-based sessions, and enabling telemetry. Here are a few examples of log snippets:
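For anyone else going down the TTL-session route, a minimal sketch using curl against a local agent (the key and session names below are just examples, not anything from this cluster); passing an empty Checks list detaches the session from serfHealth, so a brief gossip flap alone does not invalidate it:

    # Create a TTL-only session (no serfHealth association).
    SESSION_ID=$(curl -s -X PUT http://127.0.0.1:8500/v1/session/create \
      -d '{"Name": "example-lock", "TTL": "30s", "Checks": []}' \
      | python -c 'import json,sys; print(json.load(sys.stdin)["ID"])')

    # Acquire a lock on a key with that session.
    curl -s -X PUT "http://127.0.0.1:8500/v1/kv/example/leader?acquire=${SESSION_ID}" -d 'some-payload'

    # The session then has to be renewed before the TTL runs out (e.g. every TTL/2).
    curl -s -X PUT "http://127.0.0.1:8500/v1/session/renew/${SESSION_ID}"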
I can confirm we see the same frequent membership loss issue in AWS. I will try to get some logs.
I missed this in my inbox earlier. Based on the logs it looks 100% like UDP routing issues. I would suspect iptables, routing rules, security groups, or some other NAT/firewall layer at play.
@Amit-PivotalLabs As a heads up, the newest Consul builds from master do an additional TCP-based health check and provide more debug output that can help indicate a UDP routing issue. It might be worth building from master to debug.
Thanks, @armon and @Amit-PivotalLabs!
Hi @ematpl, @armon, and @Amit-PivotalLabs, did you ever find the root cause of this error? We are seeing the same error in our systems, also with 3 Consul servers (one of them on-prem) in different AZs. The same errors appear on the AWS hosts as well as on the on-prem Consul server. Is there any way to increase the timeout? We are seeing errors like these in our logs:
@bruno-loyal3 This symptom almost always indicates a UDP routing issue. The best bet is to go to the nodes being marked as failed; their logs should indicate which peer suspected them of failure. E.g. node "ip-10-44-93-21" will have a log line saying something like "refuting [suspect|dead] message from". This indicates a routing issue between "ip-10-44-93-21" and the peer named in that message.
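A rough sketch of how to line up the two sides of one of these events, assuming the agent logs land in /var/log/consul.log (the path and exact log phrasing vary by setup and version):

    # On the node that was marked failed: find the refutations and note their timestamps.
    grep -i 'refut' /var/log/consul.log

    # On the other nodes: find who raised the suspicion around those same timestamps.
    grep -iE 'memberlist: (Suspect|Marking .* as failed)' /var/log/consul.log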
Having this same issue with a cluster whose Consul servers are in AWS and our on-prem VMware environment; some nodes consistently join/leave over and over with the following message: 2015/09/04 14:36:54 [WARN] memberlist: Refuting a suspect message
I wonder if this is a problem with AWS's "classic" network. Has anyone tried their VPC network and experienced the same issue?
@aj-jester: Everything we (myself, @ematpl, @fraenkel, @luan) have done has been in a VPC.
@aj-jester VPC here as well. Can't figure this one out...yet.
Just wanted to say that we are having the same issue in a VPC network, with around 35 consul members (5 leaders). From time to time some of the nodes leave the cluster for a brief time (a few seconds), which triggers consul-template to reload some services.
Identical issue to #1212. Can you try modifying the MTU of your nodes to 1500?
Just want to say that issue #1335 is connected to this as well. @djenriquez I've tried setting the MTU to 1500 on all nodes where Consul is installed, but we still face this issue. We're currently running around 17 Consul members with just 1 leader (so we can say this has nothing to do with any leader election problem).
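For anyone else trying the MTU suggestion, a quick sketch assuming the relevant interface is eth0 (VPC instances often default to 9001 for jumbo frames):

    # Check the current MTU.
    ip link show eth0 | grep -o 'mtu [0-9]*'

    # Temporarily drop it to 1500 to test; make it persistent via your distro's
    # network configuration if it helps.
    sudo ip link set dev eth0 mtu 1500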
This looks similar to the issue I'm still having while running 0.6.3 on AWS in Docker.
Running 0.6.3 on AWS as well (using Docker 1.9.1) and seeing Serf members leaving/joining as has been described throughout this post. Does the 0.6.3 version of Consul "expose low-level tunables of both Raft and Serf", @armon?
@draxly 0.7 will likely expose more of the low-level tunables; 0.6.3 does not. A high level of flapping almost certainly still indicates a network configuration problem, as there are customers with thousands of nodes in AWS without issue. Later versions of Consul try to provide more helpful diagnostic log messages, particularly in the case of misconfigured UDP connectivity.
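For reference, a hedged sketch of the kind of stanza later versions (0.7+) accept for the Raft timing side of this; the Serf gossip internals are a separate set of knobs, and the config path below is just an example:

    # raft_multiplier scales the Raft timeouts: 1 is the tightest (production)
    # setting, 5 is the more lenient default.
    cat <<'EOF' > /etc/consul.d/performance.json
    {
      "performance": {
        "raft_multiplier": 1
      }
    }
    EOF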
@armon I would love my issue to be a configuration issue. I have been running the cluster for over a year and it has been a consistent low-level annoyance. Over the weekend it went 2 days without any flapping, but lost connection on Sunday evening.
@sstarcher It's hard to give any real blanket answer, unfortunately. Without access to a lot more information, there are so many potential root causes. The most common issue is simple misconfiguration. We've seen everything from Xen hypervisor bugs, to SYN floods, to driver bugs, to CPU/NIC exhaustion, to Serf bugs being the issue. If it's a very low level of flapping on a large cluster, it often falls into the acceptable level of false positives, since no failure detector is perfect (it's a trade-off of time to detection vs. false-positive rate). If it's a very high level of flapping, it's likely misconfiguration, as Serf works pretty well. If it's somewhere in the middle, then fairly extensive forensics needs to take place. Unfortunately, we have only so much time to do support for the community in addition to development on the core. Given reproduction cases or detailed reports we do our absolute best to solve the issue. Reports that are open-ended with many possible root causes are much harder, as they consume an enormous amount of time and are not necessarily an issue with Consul at the root.
Thanks @armon for the information regarding configuration. I have now upgraded all my AWS instances to m3.large and still see occasional serfHealth check problems.
Does this imply that the server 10.7.1.190 is not available for service discovery during the problem period?
We are using Ubuntu 14.04.2 LTS and I have tried to locate problems in our network configuration. Can anyone point out any special configuration recommended with regard to UDP that would help Serf work better?
@draxly We are running m3.mediums on Ubuntu 14.04.2 LTS, running inside of Docker 1.9.2.
@sstarcher @draxly What you want to look for is the source of the flapping in the cluster. The failure detector works on roughly a ping/ack model: Node A pings Node B and, if that fails, gossips that Node B is suspected of failure (as a high-level summary). This log:
Indicates that
The extra wrinkle to this is that you both are running Consul in Docker. Depending on how you are using Docker, it adds an extra layer of network hops and a UDP proxy, which can introduce packet loss and higher CPU load, and has known issues with ARP caching. It's entirely possible that just moving Consul out of Docker will resolve the issue.
@armon I'm using Docker with --net=host to remove the ARP problems. My servers are highly over-provisioned and run at 15% CPU for Consul. I do see sporadic jumps in CPU every day or so, at random. I was hoping my issue was #1592, but after upgrading to 0.6.3 I still see issues, though they seem to be less frequent.
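For comparison, the general shape of a host-networked agent container (the image name, bind address, and data dir below are placeholders, not the exact setup described here):

    # --net=host avoids the docker-proxy UDP hop and the ARP-cache issues
    # mentioned above.
    docker run -d --name consul --net=host \
      -v /opt/consul/data:/data \
      my-consul-image agent -server -bootstrap-expect=3 \
      -bind=10.0.0.10 -data-dir=/data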
@sstarcher One trick I've used is to grep the logs for the Failed/Join events and use awk to aggregate by node. See whether the failures are random (each node equally likely to fail) or whether you see a concentration; sometimes that points to a problem node. We've sometimes had to just recycle EC2 instances that were particularly problematic.
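Something along these lines (the log path and awk field number depend on your exact log setup and prefix, so adjust as needed):

    # Count failure events per node to spot a concentration.
    grep 'serf: EventMemberFailed' /var/log/consul.log | awk '{print $6}' | sort | uniq -c | sort -rn

    # Same for joins, to separate flapping nodes from ones that left for good.
    grep 'serf: EventMemberJoin' /var/log/consul.log | awk '{print $6}' | sort | uniq -c | sort -rn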
@armon In the past 24 hours I have 2,907 occurrences of that failure message. On just my 5 leader nodes I have 253 occurrences. 4 of the 5 nodes have at certain times reported that every other node has failed. This does raise an interesting point: the 4 nodes that are complaining about each other do so within 30 seconds of one another, e.g. Jan 25th 19:50:43, 23:08:55, 23:09:11, 23:10:00.
It's 0.5.2, and it was probably a little premature to say it solved the issue, since I can still see a flap once in a while, but the rate decreased a lot. I will continue to investigate today and update to 0.6.3.
@hack-s any conclusions as to what might be causing nodes to flap on 0.5.2? Did you have a mix of 0.5.2 and 0.6.3 nodes before? I am also experiencing nodes flapping from time to time; the cluster contains a mix of the two versions.
Consul 0.6.4 was released yesterday and this bug fix looks promising:
Got this problem in my test environment (with no load whatsoever) on EC2: This is the problem I have had all along. Any more advice? =) For the life of me I cannot understand why the Consul agent times out here.
@draxly can you vet the connectivity between those two hosts over TCP and UDP in both directions on port 8301? It looks like something is black-holing the traffic, so it doesn't refuse to connect but times out trying.
@slackpad, I believe I have verified that by using netcat on both servers and seeing that TCP and UDP connections are successful.
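For reference, one way to vet 8301 in both directions with netcat (the IPs are placeholders, and flags differ slightly between netcat variants, e.g. traditional netcat wants -l -p 8301). Note that nc reports UDP "success" optimistically since there is no handshake, so a listener on the far side is the more trustworthy check:

    # TCP from host A to host B.
    nc -vz 10.0.1.20 8301

    # UDP: start a listener on host B...
    nc -u -l 8301
    # ...then send a datagram from host A and confirm it shows up on B.
    echo ping | nc -u 10.0.1.20 8301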
@draxly I am also seeing TCP fallback errors and the cluster is on AWS, if that helps. Looking at the CloudWatch logs, I don't see any network I/O issue when the error appears. As you mentioned, this issue only occurs from time to time, so it's definitely not a blocked port or anything of that sort. I also suspect it could be a network issue between the EC2 instances, but am likewise unsure how to verify this theory.
Hi! This is perhaps common knowledge, but it turned out to make a big difference for us running on Amazon, so I think it is worth mentioning.
@draxly This is probably for the same reason, but I do not see the flapping within our VPC that is set to dedicated tenancy. Just another data point suggesting it is network related. Unfortunately, we use smaller instance types. I don't see why I need to run such a beefy machine with enhanced networking for Consul to work properly. It seems like we just need some tunable thresholds within Consul.
I'm also seeing this on AWS with Consul 0.6.3. My Consul servers don't do a lot of work apart from health checks and DNS, so they're running on t2.micro instances.
@byrnedo We were previously using t2 instance types and noticed our CPU credits were never being used up. Because of this we assumed the load on the servers was not high enough to matter. After we moved to m3.medium servers we immediately saw we were wrong. We also see significantly fewer leader elections.
I'm really surprised this hasn't been getting the attention this ticket deserves. Is there anyone using Consul in AWS who is not seeing this issue? This is causing major headaches for us: we're using consul-template, and if a node is ever marked as unhealthy a service restart is triggered. This pretty much means that consul/consul-template is unusable for us. I spoke with some people from AWS about this and they were also surprised to hear about it. Any update from the HashiCorp side?
@rgardam, have you tried using EC2 instances with enhanced networking enabled? It does require a larger instance size, for example m4.large, but it solved the problem for us.
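A quick way to confirm enhanced networking is actually active on an instance (the interface name and instance ID below are placeholders):

    # On the instance: the driver should be ixgbevf (SR-IOV) or ena, not xen_netfront.
    ethtool -i eth0 | grep driver

    # From a machine with the AWS CLI: check the SR-IOV attribute.
    aws ec2 describe-instance-attribute --instance-id i-0123456789abcdef0 --attribute sriovNetSupport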
@draxly I get that this potentially fixes the issue, but this is a big increase in cost that I feel shouldn't be necessary. I am running other systems without any issues on t2.micros. I'm hoping that 0.7 reduces this issue.
@rgardam with any amount of real load, Consul requires CPU. We were seeing constant leader elections even with m4.larges under our workload; the moment we moved to c4s, pretty much all elections went away. If you want to lower CPU load, instead of using consul-template for individual keys, store the entire config inside of one key.
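A sketch of the single-key approach (the key path and file name are made up for illustration):

    # Write the whole config blob to one key...
    curl -s -X PUT --data-binary @myservice-config.json \
      http://127.0.0.1:8500/v1/kv/config/myservice

    # ...and have consul-template watch just that one key, e.g.
    #   {{ key "config/myservice" }}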
@sstarcher That seems to have solved our problems (moving to c4 instances).
After watching this presentation:
@rgardam we don't do any packet magic in Consul; that's an interesting presentation, though. We've spent some effort to make this better in the upcoming release of Consul (#2101). A given host experiencing problems may still flap, but we've worked to reduce the damage that host is able to cause to the rest of the cluster.
@rgardam Without spending time looking into it, my guess would be that it's related to GOMAXPROCS/Docker/t2 instance types. When running inside of Docker on a t2 instance type, Consul seems to use very little CPU, but the moment I moved to an m3.medium the CPU usage shot up.
We are running into this issue as well, and as we start to rely more on Consul it's becoming quite problematic, since we get a bunch of false alarms for nodes being down. Here's a bunch of information; let me know what else would be useful to get this figured out. It's a huge issue with an otherwise awesome piece of infrastructure. AWS: us-west-2 region, multi-AZ, EC2-Classic. consul info:
We see sporadic failed events for nodes in our cluster -- sometimes we'll get a burst of 2-4 nodes failing out at the same time, but usually it's just a single one. They might happen every 5 minutes for a while, or there might be a 6+ hour gap between incidents. The "failed" node rejoins somewhere between 2 seconds and 2 minutes afterwards (slightly after it refutes the suspect/dead message), but usually in the 5-20s range. Analyzing the failed checks, it looks like which agent marks another as failed/dead is random (pretty even across the cluster), but the agents being marked as failed/dead are somewhat correlated: they tend to be nodes with higher network/IO load, though that's not always the case and it's not very clear-cut (over the last couple of days 20+ different nodes have been marked dead/failed). Additionally, I haven't found any real correlation with load on the failed machine at the time of the membership loss. A couple of questions:
Please let me know what else I can provide to help debug this. Thanks for any help!
@paladin8 An overloaded agent can definitely cause another node to show as failed. All the nodes randomly health-check each other, so a bad node B that is overloaded and attempts to check good node G can erroneously time out and gossip that G is down. My understanding is that other nodes will then re-check G, so it takes a confirmation to actually mark it as dead. However, if you have multiple overloaded nodes, there is a chance that both the primary and secondary check could land on overloaded nodes and mark G as down.
@jhmartin Thanks for the reply. If that's the case, I would guess that a specific node or set of nodes would appear to be frequently marking random other nodes as failed/dead, but we aren't observing any patterns among the nodes that are doing the marking (i.e. those logging messages like "memberlist: Marking {host} as failed, suspect timeout reached"). Does that sound like it rules out any issue related to an overloaded machine performing a check?
@paladin8 if it doesn't look like there's a small set of source hosts, then it does seem like it might be a network issue. That "suspect timeout reached" message is the right one to look for in the logs to see who's actually detecting another node as failed.
I'm going to close this out, as Consul 0.7 introduced Lifeguard enhancements to the gossip layer to combat these kinds of issues. Since there's a lot of noise here, please open a new issue (potentially linking to this one) if you are seeing trouble with Consul 0.7 and later.
Original issue description
The main problem we're seeing is that we have KV data associated with a lock, and hence a session, and the agents associated with those sessions "frequently" lose membership briefly, invalidating the associated KV data. By "frequently", I mean that in a cluster consisting of 3 Consul servers and 50-100 other nodes running Consul agents, we see isolated incidents of agent membership loss roughly every 3-4 hours (sometimes ~30 minutes apart, sometimes 7-8 hours apart, but generally randomly distributed around an average of 3-4 hours).
We have a heterogeneous cluster configured as a LAN, deployed across 3 AZs in EC2. There appears to be no correlation between the type of node and the loss of membership (e.g. if we have 10 instances of service A and 50 of service B, we will see roughly 2 membership losses of A instances and 10 of B instances over a period of a few days). It also appears to be independent of stress on the system, i.e. we see roughly the same distribution of membership loss when the system is idle as when it's under heavy load.
We use consul for service discovery/DNS as well as a KV store. We use the KV store specifically for the purpose of:
In addition to the occasional member loss, we also have the issue that when we roll the consul server cluster, which triggers a leader election, the KV store is temporarily unavailable. This potentially prevents a presence maintainer from bumping its lock before the TTL expires.
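For illustration, a rough sketch of a presence-maintainer renew loop that rides out brief outages such as a leader election; the TTL, interval, and error handling here are assumptions about how such a maintainer might be written, not our actual code:

    SESSION_ID="$1"   # a session created earlier with a TTL, e.g. "30s"

    # Renew well before the TTL expires and tolerate a few failed attempts,
    # for example while the servers are electing a new leader.
    while true; do
      if ! curl -sf -X PUT "http://127.0.0.1:8500/v1/session/renew/${SESSION_ID}" > /dev/null; then
        echo "$(date) session renew failed, will retry" >&2
      fi
      sleep 10
    done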
Questions:
Thanks,
Amit + @matt-royal