Filebeat: Error when using autodiscover + hints + templates with filebeat 8.6 #34388
Comments
Sorry for the delay on this! We have planned the story and will come back once we have an update.
Not a problem. Just let me know when/if you need any more information :-)
Spotted the same errors with our ECK (2.6.1) managed Filebeat (8.6.1) instances.
Looks like the above-mentioned error comes from `beats/filebeat/input/log/input.go`, lines 193 to 196 in 47723fe. This might indicate a race condition between stopping the harvester (`beats/filebeat/input/log/harvester.go`, lines 270 to 281 and lines 611 to 613 in 47723fe) and starting a new one. @belimawr might give more context, since he's investigated a similar issue here: #33653. In the near future (#34393) we're going to migrate this input.
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Once we solve #34393, we'll come back and verify whether that was enough to solve this problem.
I gave this a quick read; this issue and #33653 might be related or even the same. Even though #33653 mentions the Elastic Agent, the bits of code that cause the issue are related to starting/stopping inputs. Kubernetes autodiscover probably uses the same code path/logic when discovering new pods, hence I strongly believe it's the same issue. I need to do some investigation/testing to be 100% sure.
**Overview**

The log input keeps a global state for all files it harvests. There are also a number of abstractions around an input that need to be correctly shut down when the input is shut down. On a high level, the abstractions are the ones defined in lines 55 to 128 in 66755e6.
To map a runner to a config, it hashes the config (the YAML fields of an input); based on this hash it creates a list of runners (which wrap inputs) to be stopped and started. The state of a file is represented by the State struct (`beats/filebeat/input/file/state.go`, lines 28 to 42 in 66755e6). One of its attributes is the Finished flag; as the documentation states, it represents the "harvester state", in other words, whether or not the file is actively being read.
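For orientation, here is a trimmed-down Go sketch of such a per-file state; it is illustrative only and omits most of the fields of the real struct (file identity, metadata, etc.):

```go
package file

import "time"

// State is a simplified, illustrative version of the log input's per-file
// state; the real struct in filebeat/input/file/state.go carries more fields,
// e.g. file identity and metadata.
type State struct {
	Source    string        // path of the harvested file
	Offset    int64         // read offset within the file
	Finished  bool          // "harvester state": false while a harvester is actively reading the file
	Timestamp time.Time     // last time this state was updated
	TTL       time.Duration // how long the state is kept once the file is gone
}

// IsFinished reports whether the harvester that owned this state has fully
// shut down, i.e. the file is no longer actively being read.
func (s State) IsFinished() bool {
	return s.Finished
}
```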
There is a global Registrar that keeps the state for every file harvested by any instance of the log input; the registrar has a channel through which it receives state updates. A copy of the harvester state is propagated with each event published by the input, and the harvester also keeps its own copy of this struct. Whenever an event (or a batch of events) is acked, their states are sent by the acker to the registrar. The acker code responsible for those state updates is in `beats/filebeat/beater/acker.go`, lines 35 to 68 in 66755e6.

The log input uses "stateful events", so their states end up being sent to the registrar. One very important thing to notice is that the Registrar's channel for updates is a buffered channel of length 1 (`beats/filebeat/registrar/registrar.go`, line 84 in 66755e6). This means that the acker's call to publish states, as well as any other call to update states, is non-blocking.
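As a side note, the minimal Go sketch below (made-up names, not Filebeat code) shows how a capacity-1 buffered channel decouples the sender from the receiver, so state updates can be applied noticeably later than the code that produced them:

```go
package main

import (
	"fmt"
	"time"
)

// stateUpdate is a hypothetical, simplified stand-in for the registrar's
// state-update messages.
type stateUpdate struct {
	source   string
	finished bool
}

func main() {
	// A buffered channel of capacity 1, analogous to the registrar's
	// updates channel mentioned above.
	updates := make(chan stateUpdate, 1)

	// Slow consumer: drains one update at a time, like a registrar loop.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for u := range updates {
			time.Sleep(100 * time.Millisecond) // simulate slow processing
			fmt.Printf("registrar applied: %+v\n", u)
		}
	}()

	// This send returns immediately when there is room in the buffer, so the
	// producer (e.g. the acker) is not blocked on the registrar.
	updates <- stateUpdate{source: "/var/log/app.log", finished: false}
	fmt.Println("send returned without waiting for the consumer")

	// Sender and receiver are decoupled in time: the updates above may only
	// be applied much later, after the producer has long moved on.
	updates <- stateUpdate{source: "/var/log/app.log", finished: true}
	close(updates)
	<-done
}
```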
When the harvester stops, its shutdown code runs (`beats/filebeat/input/log/harvester.go`, lines 612 to 637 in 66755e6). It's interesting to notice that two state updates are involved: one goes through the harvester's publishState (`beats/filebeat/input/log/harvester.go`, line 488 in b8e0449), which is defined in `beats/filebeat/input/log/input.go`, lines 718 to 720 in 66755e6; the second one (`beats/filebeat/input/log/harvester.go`, line 491 in b8e0449) uses a synchronous path and calls directly into the States' update method.
Even with this synchronous call to update the state of a file, there is a chance that some events on the publishing queue will be acknowledged and set the harvester's state back to "not finished" before the new input starts, which leads to the (in)famous `ErrInputNotFinished` error.
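To make the race easier to picture, here is a self-contained Go sketch. It is loosely modeled on the behaviour described above, with hypothetical names rather than the actual Filebeat code: stale state copies attached to already-queued events are acknowledged after the stopping harvester has marked the file as finished, so the registry reports the file as "not finished" when the new input tries to start.

```go
package main

import (
	"fmt"
	"sync"
)

// states is an illustrative registry of per-file "harvester finished" flags,
// keyed by file path (hypothetical, heavily simplified).
type states struct {
	mu       sync.Mutex
	finished map[string]bool
}

func (s *states) update(source string, finished bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.finished[source] = finished
}

func (s *states) canStartInput(source string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	// A new input may only take over a file whose previous harvester finished.
	return s.finished[source]
}

func main() {
	reg := &states{finished: map[string]bool{}}
	src := "/var/log/containers/nginx.log"

	// Events still sitting in the publishing queue carry a copy of the state
	// taken while the harvester was running, i.e. Finished=false.
	staleAcks := []bool{false, false, false}

	// The stopping harvester synchronously marks the file as finished.
	reg.update(src, true)

	// The acker applies the queued (stale) state copies afterwards,
	// overwriting Finished=true with the older Finished=false.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for _, fin := range staleAcks {
			reg.update(src, fin)
		}
	}()
	wg.Wait()

	// Autodiscover now tries to start a new input for the same file and fails.
	if !reg.canStartInput(src) {
		fmt.Println("refusing to start input: previous harvester state is not finished")
	}
}
```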
**The current state of things**

The Kubernetes autodiscovery keeps retrying to start inputs that failed; this retrying should, eventually, manage to start the input for a file once its harvester state is finally set to "harvester finished". However, the bigger the load on Filebeat, the harder it might be to get back to a consistent state. Another problem is that the file might be deleted in the meantime, so some data can be lost. The Kubernetes autodiscovery (with hints enabled) calls into this same input start/stop code path.

**Other issues I found**

When a runner is created, there is a config-check step (`beats/filebeat/input/runnerfactory.go`, lines 61 to 72 in b8e0449). This method instantiates a runner and stops it right away to ensure the configuration is correct, and it happens to ignore common.ErrInputNotFinished (the "input not finished" error). At least a few times while Filebeat was starting I saw the log line about the runner stopping appear before the input reported it had stopped, which is definitely the wrong order. However, in my tests this did not cause the error this issue is about; it still needs to be investigated. Interestingly enough, this did not seem to happen when the Reload method from RunnersList was called.
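For reference, a rough Go sketch of that config-check pattern, with hypothetical types and names (it is not the actual beats code), could look like this:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error and types; this only sketches the pattern described above.
var errInputNotFinished = errors.New("input not finished: previous harvester state still active")

type runner interface {
	Start()
	Stop()
}

type createRunner func(cfg map[string]interface{}) (runner, error)

// checkConfig instantiates a runner and stops it right away, purely to
// validate the configuration. An "input not finished" error is treated as
// unrelated to config validity and is therefore ignored.
func checkConfig(create createRunner, cfg map[string]interface{}) error {
	r, err := create(cfg)
	if errors.Is(err, errInputNotFinished) {
		return nil // state-related, not a config problem
	}
	if err != nil {
		return err
	}
	r.Stop()
	return nil
}

type nopRunner struct{}

func (nopRunner) Start() {}
func (nopRunner) Stop()  {}

func main() {
	create := func(cfg map[string]interface{}) (runner, error) {
		// Pretend the file's previous harvester has not released its state yet.
		return nil, errInputNotFinished
	}
	err := checkConfig(create, map[string]interface{}{"type": "container"})
	fmt.Println("config considered valid, error ignored:", err) // prints <nil>
}
```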
**Testing setup**

For my tests I used a VM running Minikube. Two other important things on my setup:
Below is the `filebeat.autodiscover` configuration:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: container
        id: "runner-${data.kubernetes.container.id}"
        paths:
          - /var/log/containers/*-${data.kubernetes.container.id}.log
        fields:
          role: kubernetes

logging:
  level: debug
  selectors:
    - acker
    - autodiscover
    - harvester
    - harvester_registry
    - input
    - input.harvester
    - input_runner
    - kubernetes
    - registrar

output.elasticsearch:
  hosts:
    - https://foo.bar.cloud.elastic.co:443
  protocol: https
  username: foo
  password: bar
```

Filebeat monitoring config:

```yaml
filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    enabled: true
    paths:
      - /home/vagrant/work/logs/*
    fields:
      role: monitoring
    parsers:
      - ndjson:
          overwrite_keys: true
          expand_keys: true
          add_error_key: true
          ignore_decoding_error: true

processors:
  - timestamp:
      layouts:
        - "2006-01-02T15:04:05.999999999Z07:00"
      test:
        - "2023-03-23T12:06:36.981660099+01:00"
      field: zap-nano-timestamp

setup.template.name:
setup.template.pattern:

output.elasticsearch:
  hosts:
    - https://foo.bar.cloud.elastic.co:443
  protocol: https
  username: foo
  password: bar
  index: "filebeat-logs-rfc3339-nano-%{[agent.version]}"

setup.template.name: "filebeat-logs-rfc3339-nano-%{[agent.version]}"
setup.template.pattern: "filebeat-logs-rfc3339-nano-%{[agent.version]}"
```

**The code I used**
This seems to be the same issue but with Docker: #13607.
FYI: Not a confirmed bug; no response received on discuss.elastic.co.

Requested information: 8.6.0 / 8.6.0 / in Kubernetes.

After upgrading our filebeat kubernetes daemonset from 7.17.8 to 8.6.0 I can't seem to get rid of errors of this type. The number of errors varies depending on the number of pods deployed; in our current production cluster I'm observing roughly 60k messages per 24 hours.

Filebeat is currently deployed as a daemonset using the official helm chart version 8.5.1 and running in Azure AKS, Kubernetes version 1.25.4 (recently upgraded; the errors were also appearing on Kubernetes version 1.24.6). This is the relevant part of our current filebeat configuration (I've excluded output.* and setup.*):

I'm able to avoid the error by either removing the templates: config or by disabling the hints.default_config. Neither of these is a suitable solution for us, as they result in missing logs or logs not being parsed correctly. The error messages all refer to log files from our nginx pods; we have multiple other types of deployments with no issues indicated by filebeat. Since we are using the nginx module conditionally for these pods, this leads me to think there's some kind of race condition happening when the nginx module is applied with a templates config in combination with the default hints configuration.

We were able to achieve autodiscover with hints (including default_config) and templates using filebeat 7.17.8 just fine, without errors. This was running on the same Kubernetes version and deployed with the official helm chart version 7.17.3.

First I thought maybe I was experiencing #11834, but after reading I saw that this was fixed in filebeat 7.9.0. Reading a bit further I saw that there were a couple more issues resulting in this error, but those had also been fixed.

I have verified that we are not missing log entries, and therefore I suspect that my issue is also a "recoverable error" and that it should possibly not be logged on error level 🤷♀️

I also saw #33653; we are not using agent/fleet, and with my limited knowledge I cannot determine whether it's related or not.