This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

Consul Client Memory Leak #487

Closed
kpurdon opened this issue Jun 8, 2020 · 9 comments
Labels
area/connect (Related to Connect, e.g. injection), bug (Something isn't working)

Comments


kpurdon commented Jun 8, 2020

Overview of the Issue

Originally filed as hashicorp/consul#8051; I was asked to open the issue here instead.

I'm seeing patterns of memory usage on the client that indicate a memory leak.

(screenshot: client agent memory usage graph, 2020-06-08 10:41 AM)

Reproduction Steps

  1. Create a Private GKE Cluster (maybe not relevant)
  2. Deploy Consul via Helm (v 1.7.3)
  3. Register services using the injector (we currently have 2 services, ~10 pods)
  4. See client memory usage grow until it hits a limit (set in Helm chart) and restarts, dropping memory back to normal, then growing again.
values.yml
global:
  image: consul:1.7.3
  datacenter: gcp

  # enable automatic encrypted communication between consul clients and servers
  tls:
    enabled: true
    enableAutoEncrypt: true

  # enable gossip encryption
  # https://www.consul.io/docs/agent/options.html#_encrypt
  gossipEncryption:
    secretName: consul-gossip-encryption-key
    secretKey: key

server:
  # TODO: monitor these and use defaults once they exist
  # https://github.com/hashicorp/consul-helm/issues/393
  resources: |
    limits:
      cpu: 200m
      memory: 300Mi
    requests:
      cpu: 100m
      memory: 200Mi

  # set affinity to the recommended default, even though there isn't an actual default
  # https://www.consul.io/docs/platform/k8s/helm.html#v-server-affinity
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: {{ template "consul.name" . }}
              release: "{{ .Release.Name }}"
              component: server
          topologyKey: kubernetes.io/hostname

client:
  # TODO: monitor these and use defaults once they exist
  # https://github.com/hashicorp/consul-helm/issues/393
  resources: |
    limits:
      cpu: 200m
      memory: 400Mi
    requests:
      cpu: 100m
      memory: 300Mi

# TODO: disabled until we figure out how to guard it w/ authentication (like we do w/ Atlantis)
# https://github.com/hashicorp/consul-helm/issues/433
ui:
  enabled: false

# enable mesh connector injection
# https://learn.hashicorp.com/consul?track=developer-mesh#developer-mesh
connectInject:
  enabled: true
  # require services to opt-in via annotations
  # https://www.consul.io/docs/platform/k8s/connect.html#consul-hashicorp-com-connect-inject
  default: false
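
For reference, since connectInject.default is false, each service opts in to injection via a pod annotation. A minimal sketch of an opted-in pod spec (the name, image, and port are placeholders, not values from this report):

example-pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: example-service                      # placeholder name
  annotations:
    # opt this pod in to Connect sidecar injection
    consul.hashicorp.com/connect-inject: "true"
spec:
  containers:
    - name: example-service
      image: example-service:1.0.0           # placeholder image
      ports:
        - containerPort: 8080                # placeholder port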

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 6
	services = 6
build:
	prerelease =
	revision = 8b4a3d95
	version = 1.7.3
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 16
	goroutines = 347
	max_procs = 16
	os = linux
	version = go1.13.7
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 9
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 433
	members = 15
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 8b4a3d95
	version = 1.7.3
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.12.1.35:8300
	server = true
raft:
	applied_index = 2592258
	commit_index = 2592258
	fsm_pending = 0
	last_contact = 0
	last_log_index = 2592258
	last_log_term = 26
	last_snapshot_index = 2582551
	last_snapshot_term = 26
	latest_configuration = [{Suffrage:Voter ID:372100d2-9083-7de2-3d1d-ffddb0912c0a Address:10.12.5.136:8300} {Suffrage:Voter ID:e3bbb10e-6ad9-1ebf-7bff-0ddfd3cf77dc Address:10.12.1.35:8300} {Suffrage:Voter ID:50438c36-ce2d-5072-0e07-bb093b2d2a6f Address:10.12.0.62:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 26
runtime:
	arch = amd64
	cpu_count = 16
	goroutines = 239
	max_procs = 16
	os = linux
	version = go1.13.7
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 9
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 1
	member_time = 433
	members = 16
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 87
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

GCP + Private GKE (1.16.8-gke.15) + Helm3 + Consul 1.7.3

Log Fragments

logs.zip

ishustava added the needs-investigation and question (Further information is requested) labels Jun 8, 2020
@rrondeau

Hi

I'm facing the same issue.

(screenshot: client memory usage graph)
I'm using consul-helm with the consul-k8s injector, and I'm starting to think this is coming from the lifecycle mechanism, which re-registers the same service every 10s by default.

@rrondeau

I rebuilt consul-k8s without lifecycle sidecar injection.
Same graph 6 hours later:
(screenshot: client memory usage graph)
Something is going on here :-)


lkysow commented Jun 11, 2020

Thanks folks, we'll take a look ASAP.


lkysow commented Jun 11, 2020

So we've found the issue: hashicorp/consul#8092. It's a goroutine leak.

Unfortunately it's in Consul itself, which means we need a Consul release to fix it properly. The fix will be backported, but we don't have a release date yet.

We're going to look into modifying the lifecycle-sidecar so it doesn't exacerbate the memory leak.

For a current workaround you can:

  1. Increase the re-register period by annotating your pods with consul.hashicorp.com/connect-sync-period: <duration>, e.g. "1m". This will cause your pods to take longer to be re-registered if the Consul agent on that node restarts, and it only delays the memory leak rather than fixing it.
  2. Perform a write to the proxy-defaults config entry; this triggers the goroutines that are causing the memory leak to exit. So you could run something in a loop that, every n minutes, does a consul config write proxy-defaults.hcl (with whatever your proxy-defaults should be). A sketch of both workarounds in Kubernetes terms follows this list.
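
As a sketch of both workarounds (the names, schedule, and Consul HTTP address are placeholders, and the CronJob assumes you keep your proxy-defaults.hcl in a ConfigMap):

workarounds.yml
# Workaround 1: raise the re-register period via pod template annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service                        # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        # re-register every 1m instead of the 10s default
        consul.hashicorp.com/connect-sync-period: "1m"
    spec:
      containers:
        - name: example-service
          image: example-service:1.0.0         # placeholder
---
# Workaround 2: periodically re-write proxy-defaults so the leaked goroutines exit
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: rewrite-proxy-defaults                 # placeholder
spec:
  schedule: "*/30 * * * *"                     # every 30 minutes; tune as needed
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: consul-config-write
              image: consul:1.7.3
              command: ["consul", "config", "write", "/config/proxy-defaults.hcl"]
              env:
                - name: CONSUL_HTTP_ADDR
                  # placeholder; point at an agent's HTTP(S) API and adjust for your TLS setup
                  value: "http://consul-client.default.svc:8500"
              volumeMounts:
                - name: proxy-defaults
                  mountPath: /config
          volumes:
            - name: proxy-defaults
              configMap:
                name: proxy-defaults-hcl       # ConfigMap holding proxy-defaults.hcl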

lkysow added the bug (Something isn't working) and area/connect (Related to Connect, e.g. injection) labels and removed the needs-investigation and question (Further information is requested) labels Jun 11, 2020

kpurdon commented Jun 15, 2020

@lkysow thanks for the update and workarounds. I'm trying to decide if I can live w/ the issue for a bit or if I need to implement a workaround. Even though you don't have a release date, do you have a sense for how long that will be from now? A week, a month, ...?


lkysow commented Jun 17, 2020

Consul 1.8.0 should be released within a week.


lkysow commented Jun 25, 2020

Consul 1.8.0 has been released. Please upgrade to this version to fix the memory leak.
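
If you're deploying through this chart, one way to pick up the fix is to bump the image in your values.yml (a minimal sketch, assuming the global image field shown in the values.yml above):

global:
  image: consul:1.8.0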

lkysow closed this as completed Jun 25, 2020

kboek commented Jun 2, 2021

(screenshot: memory usage graph)
Do you need to upgrade all nodes in the cluster before this is solved? I have a cluster with 3 servers and 10 clients; all but one are on 1.8.10 (the last one is on 1.7.1), and I still see the memory leak.


lkysow commented Jun 4, 2021

I'm not 100% sure, but I think you should only see the memory leak on the one older node. If you're seeing memory leaks on other nodes, that sounds like a bug. If you're only seeing the leak on the older node, then yes, you need to upgrade that node.
