Processor errors can cause the Beat pipeline to enter what appears to be an infinite loop #34792
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
The error appears to come from beats/libbeat/publisher/pipeline/client.go, lines 94 to 104 in 9cd74f9.
This doesn't look like it should do anything other than log the error.
My guess, based on what I've seen, is that some early part of the pipeline got into an infinite retry loop with a closed processor or with events associated with it. A reasonable way to troubleshoot might be to intentionally inject a closed processor into the pipeline and confirm that ingestion can continue with ~1 logged error per processor failure.
@rdner This was resolved at some point, right?
I've tried to reproduce this issue by building a custom Filebeat from 2f7ff01 with this patch:
diff --git a/libbeat/processors/safe_processor.go b/libbeat/processors/safe_processor.go
index a0bbf5824d..b32b344c0d 100644
--- a/libbeat/processors/safe_processor.go
+++ b/libbeat/processors/safe_processor.go
@@ -35,10 +35,7 @@ type SafeProcessor struct {
 // Run allows to run processor only when `Close` was not called prior
 func (p *SafeProcessor) Run(event *beat.Event) (*beat.Event, error) {
-	if atomic.LoadUint32(&p.closed) == 1 {
-		return nil, ErrClosed
-	}
-	return p.Processor.Run(event)
+	return nil, ErrClosed
 }
 // Close makes sure the underlying `Close` function is called only once.
And I ran this modified Filebeat under Elastic Agent with the configuration taken from the original issue:
outputs:
  default:
    type: elasticsearch
    log_level: debug
    enabled: true
    hosts: [https://127.0.0.1:9200]
    username: "elastic"
    password: [password]
    allow_older_versions: true
    ssl:
      verification_mode: none
    shipper:
      enabled: true
inputs:
  - type: system/metrics
    id: unique-system-metrics-input
    data_stream.namespace: default
    use_output: default
    streams:
      - metricset: cpu
        data_stream.dataset: system.cpu
      - metricset: memory
        data_stream.dataset: system.memory
      - metricset: network
        data_stream.dataset: system.network
      - metricset: filesystem
        data_stream.dataset: system.filesystem
This is what I got in Elastic Agent logs: logs.mov
The original issue description was reported as:
I've failed to reproduce this behaviour. Filebeat logs a few errors, waits for some time (~10 seconds in my case), and retries, but it never goes into the described infinite loop. Just in case, I added a test in PR #37491.
This was first observed in #34716, which describes a case of this problem triggered by the "attempt to use a closed processor" error. The trigger for this problem turned out to be unexpected processor reuse, but even with processor reuse the system should have remained functional and been able to publish events. The processor reuse should ideally result in a warning log instead of a complete failure.
It is not clear why the Beat was publishing the following log line up to 10K times per second:
The scope of this issue is to identify the mechanism that causes the Beat to loop continuously, trying to publish the failed event when a closed processor error is detected, instead of logging a warning or dropping the event and continuing normally (for example, not all events may pass through the affected processor).