This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

Consul Client Memory Leak #487

Closed
kpurdon opened this issue Jun 8, 2020 · 9 comments
Labels
area/connect (Related to Connect, e.g. injection), bug (Something isn't working)

Comments


kpurdon commented Jun 8, 2020

Overview of the Issue

Originally filed as hashicorp/consul#8051; I was asked to open the issue here instead.

I'm seeing patterns of memory usage on the client that indicate a memory leak.

(screenshot: client agent memory usage graph, 2020-06-08 10:41 AM)

Reproduction Steps

  1. Create a Private GKE Cluster (maybe not relevant)
  2. Deploy Consul via Helm (v 1.7.3)
  3. Register services using the injector (we currently have 2 services, ~10 pods)
  4. See client memory usage grow until it hits a limit (set in Helm chart) and restarts, dropping memory back to normal, then growing again.
values.yml
global:
  image: consul:1.7.3
  datacenter: gcp

  # enable automatic encrypted communication between consul clients and servers
  tls:
    enabled: true
    enableAutoEncrypt: true

  # enable gossip encryption
  # https://www.consul.io/docs/agent/options.html#_encrypt
  gossipEncryption:
    secretName: consul-gossip-encryption-key
    secretKey: key

server:
  # TODO: monitor these and use defaults once they exist
  # https://github.com/hashicorp/consul-helm/issues/393
  resources: |
    limits:
      cpu: 200m
      memory: 300Mi
    requests:
      cpu: 100m
      memory: 200Mi

  # set affinity to the recommended default, even though there isn't an actual default
  # https://www.consul.io/docs/platform/k8s/helm.html#v-server-affinity
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: {{ template "consul.name" . }}
              release: "{{ .Release.Name }}"
              component: server
          topologyKey: kubernetes.io/hostname

client:
  # TODO: monitor these and use defaults once they exist
  # https://github.com/hashicorp/consul-helm/issues/393
  resources: |
    limits:
      cpu: 200m
      memory: 400Mi
    requests:
      cpu: 100m
      memory: 300Mi

# TODO: disabled until we figure out how to guard it w/ authentication (like we do w/ Atlantis)
# https://github.com/hashicorp/consul-helm/issues/433
ui:
  enabled: false

# enable mesh connector injection
# https://learn.hashicorp.com/consul?track=developer-mesh#developer-mesh
connectInject:
  enabled: true
  # require services to opt-in via annotations
  # https://www.consul.io/docs/platform/k8s/connect.html#consul-hashicorp-com-connect-inject
  default: false
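
For reference, since connectInject.default is false, each service opts in to injection via a pod annotation. A minimal sketch of an opted-in pod spec (the name, image, and port are placeholders, not values from this report):

example-pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: example-service                      # placeholder name
  annotations:
    # opt this pod in to Connect sidecar injection
    consul.hashicorp.com/connect-inject: "true"
spec:
  containers:
    - name: example-service
      image: example-service:1.0.0           # placeholder image
      ports:
        - containerPort: 8080                # placeholder port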

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 6
	services = 6
build:
	prerelease =
	revision = 8b4a3d95
	version = 1.7.3
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 16
	goroutines = 347
	max_procs = 16
	os = linux
	version = go1.13.7
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 9
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 433
	members = 15
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 8b4a3d95
	version = 1.7.3
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.12.1.35:8300
	server = true
raft:
	applied_index = 2592258
	commit_index = 2592258
	fsm_pending = 0
	last_contact = 0
	last_log_index = 2592258
	last_log_term = 26
	last_snapshot_index = 2582551
	last_snapshot_term = 26
	latest_configuration = [{Suffrage:Voter ID:372100d2-9083-7de2-3d1d-ffddb0912c0a Address:10.12.5.136:8300} {Suffrage:Voter ID:e3bbb10e-6ad9-1ebf-7bff-0ddfd3cf77dc Address:10.12.1.35:8300} {Suffrage:Voter ID:50438c36-ce2d-5072-0e07-bb093b2d2a6f Address:10.12.0.62:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 26
runtime:
	arch = amd64
	cpu_count = 16
	goroutines = 239
	max_procs = 16
	os = linux
	version = go1.13.7
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 9
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 1
	member_time = 433
	members = 16
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 87
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

GCP + Private GKE (1.16.8-gke.15) + Helm3 + Consul 1.7.3

Log Fragments

logs.zip

ishustava added the needs-investigation and question (Further information is requested) labels Jun 8, 2020
@rrondeau

Hi

I'm facing the same issue.

(screenshot: client memory usage graph)
I'm using consul-helm with the consul-k8s injector, and I'm starting to think this is coming from the lifecycle mechanism, which re-registers the same service every 10s by default.

@rrondeau

I rebuilt consul-k8s without lifecycle sidecar injection.
Same graph 6 hours later:
(screenshot: client memory usage graph)
Something is going on here :-)


lkysow commented Jun 11, 2020

Thanks folks, we'll take a look ASAP.


lkysow commented Jun 11, 2020

So we've found the issue: hashicorp/consul#8092. It's a goroutine leak.

Unfortunately it's in Consul itself, which means we need a Consul release to fix it properly. The fix will be backported, but we don't have a release date yet.

We're going to look into modifying the lifecycle-sidecar so it doesn't exacerbate the memory leak.

For a current workaround you can:

  1. Increase the re-register period by annotating your pods with consul.hashicorp.com/connect-sync-period: <duration>, e.g. "1m". This will cause your pods to take longer to be re-registered if the Consul agent on that node restarts, and it only delays the memory leak rather than fixing it.
  2. Perform a write to the proxy-defaults config entry; this triggers the goroutines that are causing the memory leak to exit. So you could run something in a loop that, every n minutes, does a consul config write proxy-defaults.hcl (with whatever your proxy-defaults should be). A sketch of both workarounds in Kubernetes terms follows this list.
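
As a sketch of both workarounds (the names, schedule, and Consul HTTP address are placeholders, and the CronJob assumes you keep your proxy-defaults.hcl in a ConfigMap):

workarounds.yml
# Workaround 1: raise the re-register period via pod template annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service                        # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        # re-register every 1m instead of the 10s default
        consul.hashicorp.com/connect-sync-period: "1m"
    spec:
      containers:
        - name: example-service
          image: example-service:1.0.0         # placeholder
---
# Workaround 2: periodically re-write proxy-defaults so the leaked goroutines exit
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: rewrite-proxy-defaults                 # placeholder
spec:
  schedule: "*/30 * * * *"                     # every 30 minutes; tune as needed
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: consul-config-write
              image: consul:1.7.3
              command: ["consul", "config", "write", "/config/proxy-defaults.hcl"]
              env:
                - name: CONSUL_HTTP_ADDR
                  # placeholder; point at an agent's HTTP(S) API and adjust for your TLS setup
                  value: "http://consul-client.default.svc:8500"
              volumeMounts:
                - name: proxy-defaults
                  mountPath: /config
          volumes:
            - name: proxy-defaults
              configMap:
                name: proxy-defaults-hcl       # ConfigMap holding proxy-defaults.hcl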

lkysow added the bug (Something isn't working) and area/connect (Related to Connect, e.g. injection) labels and removed the needs-investigation and question (Further information is requested) labels Jun 11, 2020

kpurdon commented Jun 15, 2020

@lkysow thanks for the update and workarounds. I'm trying to decide if I can live w/ the issue for a bit or if I need to implement a workaround. Even though you don't have a release date, do you have a sense for how long that will be from now? A week, a month, ...?


lkysow commented Jun 17, 2020

Consul 1.8.0 should be released within a week.


lkysow commented Jun 25, 2020

Consul 1.8.0 has been released. Please upgrade to this version to fix the memory leak.
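
If you're deploying through this chart, one way to pick up the fix is to bump the image in your values.yml (a minimal sketch, assuming the global image field shown in the values.yml above):

global:
  image: consul:1.8.0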

lkysow closed this as completed Jun 25, 2020

kboek commented Jun 2, 2021

(screenshot: memory usage graph)
Do you need to upgrade all nodes in the cluster before this is solved? I have a cluster with 3 servers and 10 clients; all but one are on 1.8.10 (the last one is on 1.7.1), and I still see the memory leak.


lkysow commented Jun 4, 2021

I'm not 100% sure, but I think you should only see the memory leak on the one older node. If you're seeing memory leaks on other nodes, that sounds like a bug. If you're only seeing the leak on the older node, then yes, you need to upgrade that node.
