Cluster join i/o failure leading to slow client joins (over time) #11010
Labels: theme/ecs, type/question
Overview of the Issue
As part of our infrastructure, we are using a Consul cluster with three servers and a number of client agents that frequently join and leave the cluster. After (re-)starting the cluster (that is, starting all servers), client agents join quickly, as usual. But after several hours of joining and leaving, most and eventually all join operations become slow (taking 10 s or more), although they eventually succeed. This hinders fast scale-out for our services, because the main service needs to wait for the Consul client to join the cluster before it can use service discovery.
We are basically seeking advice on how to narrow down and further investigate the problem. What could be the cause of this slowly "degrading" behavior?
Reproduction Steps
Unfortunately, we have not yet been able to reproduce the issue in a smaller environment.
Consul info for both Client and Server
The clients that are showing this problem are running as containers on ECS Fargate (see below) and we are missing a (practical) way to execute consul info there.
This is the configuration of the ECS sidecar Consul clients:
Server info
Operating system and Environment details
Consul version: 1.10.1
Environment: AWS, Consul runs inside a VPC (in the example below: IPv4 CIDR: ***.**.108.0/23).
Servers: 3 Consul servers running on 3 Linux (Ubuntu 18.04.5 LTS) EC2 machines (ASG, but configured stable at 3 instances).
Clients: Clients are running on other EC2 instances (Windows, static) or as a sidecar container (Linux) in ECS Fargate (based on the Consul Docker image). Because of up/down scaling of our services running in ECS, the sidecar client agents are joining and leaving frequently (around 30-250 times an hour).
Security Groups: The relevant security groups allow for TCP and UDP access (ingress and egress) on ports 8300, 8301 and 8302.
Note: Since this VPC has a rather small address space, IP addresses for the ECS client agents are re-used eventually.
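One way to narrow this down might be to time a plain TCP connection from a client-side host to each server's Serf LAN port (8301) and compare the result shortly after a server restart with the result after several hours. The following is a minimal sketch under that assumption; `tcp_connect_time` is a hypothetical helper name, and it checks TCP only, not UDP.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=10.0):
    """Return the seconds needed to open a TCP connection, or None on failure.

    Hypothetical helper for illustration; it does not speak the Consul
    protocol, it only measures how long the TCP handshake takes.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Calling `tcp_connect_time(server_ip, 8301)` for each of the three servers (with `server_ip` replaced by the real address) would show whether the delay already appears at the TCP level or only inside the join itself.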
Log Fragments
In this example, the servers have the IP addresses [***.**.109.107 ***.**.109.12 ***.**.108.218]
Client joining (slow)
There is a 10-second gap between "Joining cluster" and the following error message (logged as DEBUG) about the i/o timeout. After that, the join happens virtually immediately. Note that the problem can be observed with all three servers, depending on which server the client agent tries to join first.
After this, the client leaves again gracefully as intended ("left" in the memberlist).
Here is the full log.
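To find such gaps across many client logs without reading them by hand, a small script can compare the timestamps of consecutive log lines. This is a minimal sketch, assuming each log line starts with a UTC timestamp of the form `2021-01-01T00:00:00.000Z`; the function name `find_gaps` and the threshold are hypothetical and should be adjusted to the actual log format.

```python
import re
from datetime import datetime

# Assumption: lines start with a timestamp like 2021-01-01T00:00:00.000Z.
TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6})Z")


def find_gaps(lines, threshold_seconds=5.0):
    """Return (gap_seconds, previous_line, next_line) for every pair of
    consecutive timestamped lines that are further apart than the threshold."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = TS_PATTERN.match(line)
        if not match:
            continue  # skip lines that carry no leading timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if prev_time is not None:
            gap = (current - prev_time).total_seconds()
            if gap > threshold_seconds:
                gaps.append((gap, prev_line, line.rstrip("\n")))
        prev_time, prev_line = current, line.rstrip("\n")
    return gaps
```

Running `find_gaps(open(path))` over each client log (with `path` as a placeholder) would list where the long pauses occur, which makes it easier to see whether they always follow the same message.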
Server log for the same timeframe
This is a log excerpt of Consul server ***.**.109.107 (no messages about client ***.**.109.93 before this).
Here is a bigger log excerpt. What's surprising about this log is that there are several server rebalancings during the relatively short timeframe of around 8 minutes, and the number_of_servers=1 seems odd as well because the cluster has 3 servers. Is this normal behavior?
Example of a fast join
Shortly after the restart of the Consul servers, joins are fast, like this one:
Thank you very much in advance!