Goroutine leak in Filebeat with Kubernetes autodiscovery #23658

Closed
crisdarocha opened this issue Jan 25, 2021 · 8 comments · Fixed by #23722
Labels
bug, Filebeat, Team:Integrations

Comments

@crisdarocha

crisdarocha commented Jan 25, 2021

Dear Beats team!

We have a possible memory leak in Filebeat (observed in 7.9.3 at least).

We run Filebeat as a DaemonSet in a Kubernetes cluster with autodiscovery. There appears to be a memory leak, and the pod gets restarted because of "concurrent map read and map write" fatal errors.

Update: there actually seems to be a goroutine leak in Filebeat, possibly related to the family of leaks seen in #12106 and #11263, but this is not related to the race condition originally reported.
We need to check whether this is a regression or something that escaped the fixes in #11263.

Trace originally reported:

fatal error: concurrent map read and map write

goroutine 1234 [running]:
runtime.throw()
/usr/local/go/src/runtime/panic.go:1116
runtime.mapaccess2_faststr()
/usr/local/go/src/runtime/map_faststr.go:116
github.com/elastic/beats/v7/libbeat/common/kubernetes/k8skeystore.(*KubernetesKeystoresRegistry).GetKeystore()
/go/src/github.com/elastic/beats/libbeat/common/kubernetes/k8skeystore/kubernetes_keystore.go:79
github.com/elastic/beats/v7/libbeat/autodiscover.Builders.GetConfig()
/go/src/github.com/elastic/beats/libbeat/autodiscover/builder.go:102
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*Provider).publish()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/kubernetes.go:148
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*Provider).publish-fm()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/kubernetes.go:141
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*pod).emitEvents()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/pod.go:428
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*pod).emit()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/pod.go:270
github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.(*pod).OnUpdate.func1()
/go/src/github.com/elastic/beats/libbeat/autodiscover/providers/kubernetes/pod.go:142
runtime.goexit()

I hope this helps in tracking down the issue!
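
For readers unfamiliar with this class of fatal error: the Go runtime aborts the process when it detects one goroutine reading a plain map while another writes to it. The trace above shows the read happening inside GetKeystore(). The following is a minimal sketch of one common way to guard such a lazily filled cache, assuming a registry keyed by namespace. The `registry` and `keystore` types are hypothetical; this is not the Beats code and not necessarily how #21880 addressed it.

```go
package main

import (
	"fmt"
	"sync"
)

// keystore is a hypothetical stand-in for a per-namespace keystore.
type keystore struct{ namespace string }

// registry caches keystores by namespace. The RWMutex makes lookups and
// lazy inserts safe when many goroutines call Get at the same time.
type registry struct {
	mu     sync.RWMutex
	stores map[string]*keystore
}

func newRegistry() *registry {
	return &registry{stores: make(map[string]*keystore)}
}

// Get returns the cached keystore for a namespace, creating it on first use.
func (r *registry) Get(namespace string) *keystore {
	// Fast path: shared lock for the common "already cached" case.
	r.mu.RLock()
	ks, ok := r.stores[namespace]
	r.mu.RUnlock()
	if ok {
		return ks
	}

	// Slow path: exclusive lock for the insert.
	r.mu.Lock()
	defer r.mu.Unlock()
	// Re-check, because another goroutine may have inserted in between.
	if existing, ok := r.stores[namespace]; ok {
		return existing
	}
	ks = &keystore{namespace: namespace}
	r.stores[namespace] = ks
	return ks
}

// Len reports how many keystores are cached.
func (r *registry) Len() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return len(r.stores)
}

func main() {
	r := newRegistry()
	var wg sync.WaitGroup
	// 100 concurrent callers spread over 5 namespaces.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			r.Get(fmt.Sprintf("ns-%d", i%5))
		}(i)
	}
	wg.Wait()
	fmt.Println("cached keystores:", r.Len()) // prints "cached keystores: 5"
}
```

Removing the locks from Get and running the program with `go run -race` is a quick way to see the data race reported.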

crisdarocha added the Filebeat label Jan 25, 2021
botelastic bot added the needs_team label Jan 25, 2021
andresrc added the Team:Integrations label Jan 25, 2021
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

botelastic bot removed the needs_team label Jan 25, 2021
@spetrashov

Hello @crisdarocha,

I've tested the same configuration with Filebeat 7.10.2. There were no crashes due to the race condition for 15 hours, but goroutines keep piling up, so the upgrade doesn't fix the leak.

I've collected some more debug information from the pod running 7.10.2.

I'm attaching this information here, since the upload functionality on the support website didn't work for me.
In the attachment you can find heap and goroutine profiles, metrics from the pod, the filebeat.yml config, and a dashboard screenshot that shows how some metrics change over time.

filebeat_7_10_2_debug_info.zip

@ChrsMark
Member

ChrsMark commented Jan 27, 2021

Hi @spetrashov and thank you for reporting this!

First of all, the fatal error reported should be fixed by #21880. Also, I don't think the fatal error is related to any possible leak or to the increase in the number of goroutines. From the heap graph you sent I see that the heap size is ~43MB, so I cannot see anything suspicious there. I'm putting the graphs you posted here in PNG format for quicker access:

[Image: profile001]
[Image: profile002]

@jsoriano I think you have dealt with some leak cases in the past. Does this one look familiar?

@ChrsMark
Member

The initial error reported is fixed by #21880, and the possible memory leak should not be related to it, so I'm closing this one. Let's continue in a separate issue if needed.

@jsoriano
Member

> First of all, the fatal error reported should be fixed by #21880. Also, I don't think the fatal error is related to any possible leak or to the increase in the number of goroutines. From the heap graph you sent I see that the heap size is ~43MB, so I cannot see anything suspicious there.

+1, this possible leak is not related to this fatal error trace.

> @jsoriano I think you have dealt with some leak cases in the past. Does this one look familiar?

Yes, CloseOnSignal and SubOutlet are the usual suspects for memory leaks in Filebeat related to autodiscover and dynamic configs in general. Several fixes landed there around 7.1. See #12106 and linked issues.

In the metrics I see that there were 3875 started harvesters, which is similar to the number of SubOutlets, and to double the number of CloseOnSignal.

    "harvester": {
      "closed": 3824,
      "open_files": 52,
      "running": 51,
      "skipped": 0,
      "started": 3875
    },

So it actually seems that something is not being stopped when harvesters are. We will have to see whether this is a regression or something that escaped the fixes we made. It doesn't seem too serious if the memory usage is not very high, but we should still take a look.
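
The arithmetic behind that reading can be checked directly: 3875 started minus 3824 closed leaves 51, which matches the reported "running" value, so the harvesters themselves are being stopped and whatever leaks tracks the "started" counter instead. A small sketch of that check follows, using only the counters quoted above. The `harvesterMetrics` type and `stillRunning` helper are made up for illustration and are not part of Filebeat.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// harvesterMetrics mirrors the "harvester" object quoted in the comment above.
// The Go type itself is hypothetical.
type harvesterMetrics struct {
	Closed    int `json:"closed"`
	OpenFiles int `json:"open_files"`
	Running   int `json:"running"`
	Skipped   int `json:"skipped"`
	Started   int `json:"started"`
}

// sample holds the values reported in the thread.
const sample = `{"closed": 3824, "open_files": 52, "running": 51, "skipped": 0, "started": 3875}`

// parse decodes a harvester metrics object from JSON.
func parse(raw string) (harvesterMetrics, error) {
	var m harvesterMetrics
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

// stillRunning is the number of harvesters the counters imply are alive.
func (m harvesterMetrics) stillRunning() int {
	return m.Started - m.Closed
}

func main() {
	m, err := parse(sample)
	if err != nil {
		panic(err)
	}
	// prints "started - closed = 51, running = 51"
	fmt.Printf("started - closed = %d, running = %d\n", m.stillRunning(), m.Running)
}
```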

@jsoriano
Member

Let's keep this open, I will update the description.

jsoriano reopened this Jan 27, 2021
jsoriano added the bug label Jan 27, 2021
@jsoriano
Member

jsoriano commented Jan 27, 2021

I can also reproduce the goroutine leak with docker autodiscover, and it seems to have been happening since 7.8.0. What I see is that since this version, two additional CloseOnSignal goroutines are created for each harvester, and they are never released.
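
To illustrate the general shape of this kind of leak, here is a minimal sketch. It assumes a helper that closes a short-lived outlet when a long-lived signal channel fires. The names echo the ones in this thread, but the `outlet` type and both functions are hypothetical; this is not the Beats implementation and not the actual fix. If the signal never fires, the leaky variant keeps one blocked goroutine per outlet, while the second variant also exits once the outlet itself is closed.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// outlet is a hypothetical stand-in for a harvester's output handle.
type outlet struct {
	once sync.Once
	done chan struct{}
}

func newOutlet() *outlet { return &outlet{done: make(chan struct{})} }

// Close marks the outlet as finished; it is safe to call more than once.
func (o *outlet) Close() { o.once.Do(func() { close(o.done) }) }

// leakyCloseOnSignal waits only on sig. If the outlet is closed some other
// way and sig never fires, the goroutine stays blocked forever.
func leakyCloseOnSignal(o *outlet, sig <-chan struct{}) {
	go func() {
		<-sig
		o.Close()
	}()
}

// closeOnSignal also watches the outlet's own done channel, so the goroutine
// is released as soon as the outlet is closed.
func closeOnSignal(o *outlet, sig <-chan struct{}) {
	go func() {
		select {
		case <-sig:
			o.Close()
		case <-o.done:
			// Outlet already closed; nothing left to wait for.
		}
	}()
}

// leaked attaches the helper to n short-lived outlets, closes them all, and
// returns how many extra goroutines are still alive afterwards.
func leaked(n int, attach func(*outlet, <-chan struct{})) int {
	before := runtime.NumGoroutine()
	sig := make(chan struct{}) // long-lived signal that never fires here
	for i := 0; i < n; i++ {
		o := newOutlet()
		attach(o, sig)
		o.Close()
	}
	time.Sleep(100 * time.Millisecond) // give exiting goroutines time to finish
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaky:", leaked(100, leakyCloseOnSignal))
	fmt.Println("fixed:", leaked(100, closeOnSignal))
}
```

With the leaky helper the count should grow by one goroutine per outlet, which is the same signature as the profiles above: a goroutine count that tracks the number of started harvesters rather than the number of running ones.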

@spetrashov

I've tested the latest Filebeat snapshot version, which includes a fix for this issue, for 3 days in a test environment, and the number of goroutines remains stable 👍🏻

6 participants