Goroutine leak in Filebeat with Kubernetes autodiscovery #23658
Pinging @elastic/integrations (Team:Integrations)
Hello @crisdarocha, I've tested the same configuration with Filebeat version 7.10.2. There were no crashes due to the race condition for 15 hours, but goroutines keep piling up, so it doesn't fix the leak. I've collected some more debug information from the pod running 7.10.2 and am attaching it here, since the upload functionality on the support website didn't work for me.
Hi @spetrashov and thank you for reporting this! First of all, the fatal error reported should be fixed by #21880. Also, I don't think the fatal error is related to any possible leak or increase in the number of goroutines. From the heap graph you sent I see that the heap size is ~43MB, so I cannot see anything suspicious there. I'm putting the graphs you posted here in PNG format for quicker access. @jsoriano, I think you have dealt with some leak cases in the past; does this one look familiar?
The initially reported error is fixed by #21880, and the possible memory leak should not be related, so I'm closing this one; let's continue in a separate issue if needed.
+1, this possible leak is not related to this fatal error trace.
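For context, the reported "concurrent map read and map write" (see the trace in the issue description below) is Go's runtime detecting an unguarded map being read in GetKeystore while another goroutine writes to the same map. The sketch below is hypothetical, not the actual kubernetes_keystore.go code: the registry type and its keystores field are illustrative names, and it only shows the racy access pattern together with the sync.RWMutex guard that a fix of this kind typically adds.

package main

import "sync"

// registry is a hypothetical stand-in for a keystore registry whose map is
// accessed from several goroutines (e.g. autodiscover event handlers).
type registry struct {
	mu        sync.RWMutex      // guards keystores
	keystores map[string]string // namespace -> keystore (simplified)
}

// getKeystore reads the map under a read lock. Without mu, concurrent calls
// to getKeystore and addKeystore trigger Go's
// "fatal error: concurrent map read and map write".
func (r *registry) getKeystore(ns string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	ks, ok := r.keystores[ns]
	return ks, ok
}

// addKeystore writes the map under the write lock.
func (r *registry) addKeystore(ns, ks string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.keystores[ns] = ks
}

func main() {
	r := &registry{keystores: map[string]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); r.addKeystore("ns", "ks") }()
		go func() { defer wg.Done(); r.getKeystore("ns") }()
	}
	wg.Wait()
}

With the mutex in place this program runs cleanly; removing the lock calls reproduces the class of crash shown in the trace.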
Yes, these goroutines look related. In the metrics I see that harvesters are being started and stopped, yet the goroutine count keeps growing. So it actually seems that something is not being stopped when harvesters are. We will have to check whether this is a regression or something that escaped the fixes we did. It doesn't seem too serious if the memory usage is not very high, but we should still take a look.
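As a general way to confirm this kind of leak independently of Beats' internal metrics, any Go program can report its goroutine count and dump goroutine stacks through the standard runtime and runtime/pprof packages. The following is a minimal, self-contained sketch (not Filebeat code) that produces roughly the kind of data attached to this issue:

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	// Log the goroutine count periodically; a count that grows monotonically
	// while the workload stays steady is the usual symptom of a leak.
	go func() {
		for range time.Tick(30 * time.Second) {
			log.Printf("goroutines: %d", runtime.NumGoroutine())
		}
	}()

	// Dump all goroutine stacks (debug=2 gives full stacks), comparable to
	// the /debug/pprof/goroutine output attached above.
	time.Sleep(time.Minute)
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
}

Comparing two such dumps taken some time apart shows which goroutines are accumulating and where they were started.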
Let's keep this open; I will update the description.
I can also reproduce the goroutine leak with Docker autodiscover; it seems to have been happening since 7.8.0. What I see is that since this version, two additional …
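For illustration of the class of leak referenced here and in #12106 and #11263 (this is hypothetical code, not the Beats implementation): a per-harvester helper goroutine started without any stop signal keeps running after the harvester closes, while a variant that returns a stop function does not.

package main

import "time"

// leakyStart launches a ticker goroutine with no way to stop it: every call
// leaks one goroutine (and one ticker) after the caller is done with it.
func leakyStart(work func()) {
	go func() {
		t := time.NewTicker(time.Second)
		for range t.C {
			work()
		}
	}()
}

// start returns a stop function; the goroutine exits when stop is called,
// which is a common way to avoid this class of leak.
func start(work func()) (stop func()) {
	done := make(chan struct{})
	go func() {
		t := time.NewTicker(time.Second)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				work()
			case <-done:
				return
			}
		}
	}()
	return func() { close(done) }
}

func main() {
	stop := start(func() {})
	time.Sleep(2 * time.Second)
	stop() // without this (or when using leakyStart), the goroutine count only grows
}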
For 3 days I've been testing, in a test environment, the latest Filebeat snapshot that includes a fix for this issue, and the number of goroutines remains stable 👍🏻
Dear Beats team!
We have a possible memory leak in Filebeat (identified in 7.9.3 at least).
We are running Filebeat as a DaemonSet in a Kubernetes cluster with autodiscovery, and there is a memory leak that causes the pod to be restarted due to concurrent map read and write errors.
Update: there actually seems to be a goroutine leak in Filebeat, possibly related to the family of leaks seen in #12106 and #11263, but it is not related to the race condition originally reported.
Check if this is a regression or something that escaped #11263.
Trace originally reported:
fatal error: concurrent map read and map write
goroutine 1234 [running]:
runtime.throw()
/usr/local/go/src/runtime/panic.go:1116
runtime.mapaccess2_faststr()
/usr/local/go/src/runtime/map_faststr.go:116
github.com/elastic/beats/v7/libbeat/common/kubernetes/k8skeystore.(*KubernetesKeystoresRegistry).GetKeystore()
/go/src/github.com/elastic/beats/libbeat/common/kubernetes/k8skeystore/kubernetes_keystore.go:79
github.com/elastic/beats/v7/libbeat/autodiscover.Builders.GetConfig()
/go/src/github.com/elastic/beats/libbeat/autodiscover/builder.go:102
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*Provider).publish()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/kubernetes.go:148
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*Provider).publish-fm()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/kubernetes.go:141
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*pod).emitEvents()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/pod.go:428
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*pod).emit()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/pod.go:270
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*pod).OnUpdate.func1()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/pod.go:142
runtime.goexit()
I hope this helps with tracking down the issue!