Cluster join i/o failure leading to slow client joins (over time) #11010

Open
pln-rec opened this issue Sep 10, 2021 · 2 comments
Labels
theme/ecs Related to the AWS Elastic Container Service runtime type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp

Comments


pln-rec commented Sep 10, 2021

Overview of the Issue

As part of our infrastructure, we are using a Consul cluster with three servers and a number of client agents that frequently join and leave the cluster. After (re-)starting the cluster (i.e. starting all servers), client agents join quickly, as usual. But after several hours of joining and leaving, most and eventually all join operations become slow (taking around 10 s or more), although they do eventually succeed. This hinders fast scale-out for our services, because the main service has to wait for the Consul client to join the cluster before it can use service discovery.
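For illustration, that waiting step is conceptually similar to the sketch below (simplified and hypothetical, not our exact entrypoint; it assumes the sidecar agent's HTTPS API on localhost:8500 is reachable from the main container):

#!/bin/sh
# Hypothetical startup gate: block the main service until the local Consul
# client reports a known cluster leader, i.e. the LAN join has completed.
# The agent only exposes HTTPS on port 8500 (see the ports config below),
# so curl talks TLS; -k is tolerable here because the call never leaves the task.
until curl -sk https://localhost:8500/v1/status/leader | grep -q ':8300'; do
  echo "waiting for the local Consul agent to join the cluster..."
  sleep 1
done
exec "$@"  # then hand off to the actual service process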

We are essentially seeking advice on how to narrow down and further investigate the problem. What could cause this slowly degrading behavior?

Reproduction Steps

Unfortunately, we have not yet been able to reproduce the issue in a smaller environment.

Consul info for both Client and Server

The clients showing this problem run as containers on ECS Fargate (see below), and we lack a practical way to execute consul info there.
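One option we have not tried yet would be ECS Exec, roughly along these lines (the cluster, task, and container names below are placeholders, and the service would need enableExecuteCommand plus the usual SSM prerequisites):

# Placeholder names throughout; requires ECS Exec to be enabled on the task.
aws ecs execute-command \
  --cluster my-ecs-cluster \
  --task <task-id> \
  --container consul-client \
  --interactive \
  --command "consul info"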
This is the configuration of the ECS sidecar Consul clients:

{
  "bind_addr": "{{GetPrivateIP}}",
  "ca_file": "/consul/extra/consulAgent.ca.pem",
  "cert_file": "/consul/extra/consulAgent.crt",
  "datacenter": "dcname",
  "disable_update_check": true,
  "encrypt": "*******************",
  "key_file": "/consul/extra/consulAgent.key",
  "leave_on_terminate": true,
  "ports": {
    "dns": -1,
    "http": -1,
    "https": 8500
  },
  "rejoin_after_leave": true,
  "retry_join": [
    "provider=\"aws\" region=\"us-east-1\" addr_type=\"private_v4\" tag_key=\"aws:cloudformation:stack-name\" tag_value=\"PODNAMECC001\""
  ],
  "server": false,
  "ui": false,
  "verify_outgoing": true,
  "log_level": "trace"
}
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = db839f18
        version = 1.10.1
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = true
        leader_addr = ***.**.109.107:8300
        server = true
raft:
        applied_index = 4349
        commit_index = 4349
        fsm_pending = 0
        last_contact = 0
        last_log_index = 4349
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:134c79d3-e2f9-d4e0-a547-fec5616cbdc9 Address:***.**.109.107:8300} {Suffrage:Voter ID:6cdd57c1-ca31-f0b7-2164-0cd5d623d183 Address:***.**.108.218:8300} {Suffrage:Voter ID:3eee12ed-9eae-b664-b545-90c8658d91dd Address:***.**.109.12:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 2
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 151
        max_procs = 2
        os = linux
        version = go1.16.6
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 132
        member_time = 1652
        members = 140
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 4
        members = 3
        query_queue = 0
        query_time = 1

Operating system and Environment details

Consul version: 1.10.1
Environment: AWS, Consul runs inside a VPC (in the example below: IPv4 CIDR: *..108.0/23).
Servers: 3 Consul servers running on 3 Linux (Ubuntu 18.04.5 LTS) EC2 machines (ASG, but configured stable at 3 instances).
Clients: Clients are running on other EC2 instances (Windows, static) or as a sidecar container (Linux) in ECS Fargate (based on the Consul Docker image). Because of up/down scaling of our services running in ECS, the sidecar client agents are joining and leaving frequently (around 30-250 times an hour).
Security Groups: The relevant security groups allow for TCP and UDP access (ingress and egress) on ports 8300, 8301 and 8302.
Note: Since this VPC has a rather small address space, IP addresses for the ECS client agents are eventually re-used (a quick way to check for resulting stale entries is sketched right below).
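Purely as an illustration, a check like this, run against one of the servers, would flag any LAN address that currently appears under more than one node name:

# Illustrative only: skip the header line of `consul members`, take the
# Address column, and print any address listed for more than one node
# (a possible side effect of IP re-use).
consul members | tail -n +2 | awk '{print $2}' | sort | uniq -d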

Log Fragments

In this example, the servers have the IP addresses [*..109.107 *..109.12 *..108.218]

Client joining (slow)

==> Starting Consul agent...
           Version: '1.10.1'
           Node ID: 'e7b61cf6-3a71-4a6a-e499-b46793d1affa'
         Node name: 'ip-***-**-109-93.ec2.internal'
        Datacenter: 'dcname' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: -1, HTTPS: 8500, gRPC: 8502, DNS: -1)
      Cluster Addr: ***.**.109.93 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:

2021-09-09T13:55:22.332Z [INFO]  agent: Joining cluster...: cluster=LAN
...
2021-09-09T13:55:22.674Z [INFO]  agent: (LAN) joining: lan_addresses=[***.**.109.107, ***.**.109.12, ***.**.108.218]
2021-09-09T13:55:32.674Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed to join ***.**.109.107: dial tcp ***.**.109.107:8301: i/o timeout
2021-09-09T13:55:32.676Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  ***.**.109.12:8301
...
2021-09-09T13:55:32.693Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=2

There is a 10-second gap between "Joining cluster" and the subsequent i/o timeout message (logged as DEBUG). After that, the join completes virtually immediately. Note that the problem can be observed with all three servers, depending on which server the client agent tries to join first.

After this, the client leaves again gracefully as intended (left in the memberlist).
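When the slow joins occur, a basic probe like the following (illustrative, with a placeholder server IP) could at least rule out plain TCP reachability problems on the serf LAN port that the push/pull sync uses:

# Run from an affected client host towards the server that timed out;
# memberlist's join performs a TCP push/pull sync on port 8301.
nc -vz -w 3 <server-ip> 8301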

Here is the full log.

Server log for the same timeframe

This is a log excerpt of Consul server ***.**.109.107 (no messages about client ***.**.109.93 before this).

2021-09-09T13:55:31.458Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ip-***-**-109-93.ec2.internal ***.**.109.93
2021-09-09T13:55:31.458Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.458Z [INFO]  agent.server: member joined, marking health alive: member=ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.584Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.637Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.784Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.784Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.830Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.831Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.837Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:32.031Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:32.037Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:32.039Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:32.039Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:33.252Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=2
2021-09-09T13:55:33.252Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=2
2021-09-09T13:55:33.307Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=2
2021-09-09T13:55:33.307Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=2

Here is a bigger log excerpt. What's surprising about this log is that there are several server rebalancings within the relatively short timeframe of around 8 minutes, and number_of_servers=1 also seems odd because the cluster has 3 servers. Is this normal behavior?
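If it helps, the server-side view can be cross-checked with commands like these (shown for illustration, run against any of the servers):

consul operator raft list-peers   # should list the 3 servers as voters
consul members -wan               # the WAN serf pool should also show 3 servers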

Example of a fast join

Shortly after the restart of the Consul servers, joins are fast, like this one:

==> Starting Consul agent...
           Version: '1.10.1'
           Node ID: '46f36f54-467b-0dfd-3c72-270471e12c26'
         Node name: 'ip-***-**-109-44.ec2.internal'
        Datacenter: 'dcname' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: -1, HTTPS: 8500, gRPC: 8502, DNS: -1)
      Cluster Addr: ***.**.109.44 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2021-09-09T10:02:12.689Z [INFO]  agent: Consul agent running!
...
2021-09-09T10:02:13.058Z [INFO]  agent: Discovered servers: cluster=LAN cluster=LAN servers="***.**.109.107 ***.**.109.12 ***.**.108.218"
2021-09-09T10:02:13.058Z [INFO]  agent: (LAN) joining: lan_addresses=[***.**.109.107, ***.**.109.12, ***.**.108.218]
2021-09-09T10:02:13.059Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  ***.**.109.107:8301
2021-09-09T10:02:13.061Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-***-**-109-107 ***.**.109.107
2021-09-09T10:02:13.061Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-***-**-109-12 ***.**.109.12
2021-09-09T10:02:13.061Z [INFO]  agent.client: adding server: server="ip-***-**-109-107 (Addr: tcp/***.**.109.107:8300) (DC: dcname)"
2021-09-09T10:02:13.061Z [INFO]  agent.client: adding server: server="ip-***-**-109-12 (Addr: tcp/***.**.109.12:8300) (DC: dcname)"
2021-09-09T10:02:13.061Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-***-**-108-218 ***.**.108.218
2021-09-09T10:02:13.062Z [INFO]  agent.client: adding server: server="ip-***-**-108-218 (Addr: tcp/***.**.108.218:8300) (DC: dcname)"
2021-09-09T10:02:13.062Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  ***.**.109.12:8301
2021-09-09T10:02:13.062Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {***.**.109.12:8300 0 ip-***-**-109-12 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp ***.**.109.12:8300: operation was canceled". Reconnecting...
2021-09-09T10:02:13.065Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  ***.**.108.218:8301
2021-09-09T10:02:13.078Z [INFO]  agent: (LAN) joined: number_of_nodes=3
2021-09-09T10:02:13.078Z [DEBUG] agent: systemd notify failed: error="No socket"
2021-09-09T10:02:13.078Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3

Thank you very much in advance!

@jkirschner-hashicorp jkirschner-hashicorp added type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp theme/ecs Related to the AWS Elastic Container Service runtime labels Sep 10, 2021
@erichaberkorn (Contributor)

I've been performance testing Consul on ECS Fargate over the past couple of months and have observed similar behavior. Last week we merged a PR that solved the gossip issues I observed when Consul clients continuously join and leave. We are planning to include this change in the next 1.11 patch release, and I really hope it helps resolve these issues.

pln-rec (Author) commented Jan 24, 2022

@erichaberkorn Thank you very much, this is great news. We'll be happy to test this again once it's released.
