
Processor errors can cause the Beat pipeline to enter what appears to be an infinite loop #34792

Closed
cmacknz opened this issue Mar 9, 2023 · 6 comments
Labels: Team:Elastic-Agent (Label for the Agent team)

cmacknz (Member) commented Mar 9, 2023

This was first observed in #34716, which describes a case of this problem triggered by an "attempt to use a closed processor" error. The trigger turned out to be unexpected processor reuse, but even with processor reuse the system should have remained functional and able to publish events. Processor reuse should ideally result in a warning log instead of a complete failure.

It is not clear why the Beat was publishing the following log line up to 10K times per second:

{"log.level":"error","@timestamp":"2023-03-02T11:59:42.394Z","message":"Failed to publish event: attempt to use a closed processor","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"publisher","log.origin":{"file.line":102,"file.name":"pipeline/client.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

The scope of this issue is to identify the mechanism that causes the Beat to loop continuously trying to publish the failed event when a closed processor error is detected, instead of logging a warning or dropping the event and continuing normally (not every event necessarily passes through the affected processor, for example).
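
For context, the "attempt to use a closed processor" text is produced by a guard that libbeat wraps around each processor so that Run is rejected once Close has been called. Below is a minimal, self-contained sketch of that guard; it assumes the shape shown in the SafeProcessor diff quoted in a later comment, and its Event type is only a stand-in for beat.Event:

package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrClosed matches the error text in the log line above.
var ErrClosed = errors.New("attempt to use a closed processor")

// Event is a stand-in for beat.Event in this sketch.
type Event struct{ Fields map[string]interface{} }

// safeProcessor sketches libbeat's SafeProcessor: after Close is
// called, every Run call is rejected instead of reaching the
// wrapped processor.
type safeProcessor struct {
	closed uint32
}

func (p *safeProcessor) Run(event *Event) (*Event, error) {
	if atomic.LoadUint32(&p.closed) == 1 {
		return nil, ErrClosed
	}
	return event, nil // the real guard delegates to the wrapped processor here
}

func (p *safeProcessor) Close() {
	atomic.StoreUint32(&p.closed, 1)
}

func main() {
	p := &safeProcessor{}
	p.Close()
	if _, err := p.Run(&Event{}); err != nil {
		fmt.Println("Failed to publish event:", err)
	}
}

The open question is why one such error per event apparently turns into a tight retry loop rather than the event simply being dropped after the error is logged.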

cmacknz added the Team:Elastic-Agent label Mar 9, 2023
elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz (Member, Author) commented Mar 9, 2023

The error appears to come from this block in pipeline/client.go:

if c.processors != nil {
	var err error
	event, err = c.processors.Run(event)
	publish = event != nil
	if err != nil {
		// If we introduce a dead-letter queue, this is where we should
		// route the event to it.
		log.Errorf("Failed to publish event: %v", err)
	}
}

This doesn't look like it should do anything other than log the error.

faec (Contributor) commented Mar 9, 2023

My guess based on what I've seen is that some early part of the pipeline got into an infinite retry loop with a closed processor or events associated with it. A reasonable way to troubleshoot might be to intentionally inject a closed processor into the pipeline and confirm that ingestion can continue with ~1 logged error per processor failure.
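
A minimal sketch of what such an injected processor could look like, assuming libbeat's beat.Processor interface (a String description plus Run); how it gets wired into the pipeline client's processor chain is left to the experiment:

package faultinject // hypothetical package for the experiment

import (
	"errors"

	"github.com/elastic/beats/v7/libbeat/beat"
)

// alwaysClosedProcessor fails every event the same way a closed
// SafeProcessor does, so we can watch whether ingestion continues
// with roughly one logged error per processor failure.
type alwaysClosedProcessor struct{}

func (alwaysClosedProcessor) String() string { return "always_closed" }

func (alwaysClosedProcessor) Run(event *beat.Event) (*beat.Event, error) {
	// Mimic the closed state: drop the event (return nil) and return
	// the same error string seen in the logs.
	return nil, errors.New("attempt to use a closed processor")
}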

rdner self-assigned this Mar 21, 2023
P1llus (Member) commented Mar 28, 2023

@rdner This was resolved at some point, right?

rdner (Member) commented Mar 28, 2023

@P1llus the error reported in the logs in the description has been resolved by #34761.

This issue is not about that error itself but about the behaviour of our processing pipeline when it faces an unrecoverable error. This should not cause an infinite loop.

Did I get your question right?

rdner (Member) commented Jan 5, 2024

I've tried to reproduce this issue by building a custom Filebeat from 2f7ff01 with this patch:

diff --git a/libbeat/processors/safe_processor.go b/libbeat/processors/safe_processor.go
index a0bbf5824d..b32b344c0d 100644
--- a/libbeat/processors/safe_processor.go
+++ b/libbeat/processors/safe_processor.go
@@ -35,10 +35,7 @@ type SafeProcessor struct {
 
 // Run allows to run processor only when `Close` was not called prior
 func (p *SafeProcessor) Run(event *beat.Event) (*beat.Event, error) {
-	if atomic.LoadUint32(&p.closed) == 1 {
-		return nil, ErrClosed
-	}
-	return p.Processor.Run(event)
+	return nil, ErrClosed
 }
 
 // Close makes sure the underlying `Close` function is called only once.

I then ran this modified Filebeat under Elastic Agent with the configuration taken from the original issue:

outputs:
  default:
    type: elasticsearch
    log_level: debug
    enabled: true
    hosts: [https://127.0.0.1:9200]
    username: "elastic"
    password: [password]
    allow_older_versions: true
    ssl:
      verification_mode: none
    shipper:
      enabled: true

inputs:
  - type: system/metrics
    id: unique-system-metrics-input
    data_stream.namespace: default
    use_output: default
    streams:
      - metricset: cpu
        data_stream.dataset: system.cpu
      - metricset: memory
        data_stream.dataset: system.memory
      - metricset: network
        data_stream.dataset: system.network
      - metricset: filesystem
        data_stream.dataset: system.filesystem

This is what I got in the Elastic Agent logs:

logs.mov (video attachment)

The original issue description reported:

Intermittently, the monitoring started by Agent enters a loop where it repeats this message ~10K times a second.

This error is inconsistent -- on some runs it begins soon after startup, on many runs it never happens at all. When it does happen, it severely degrades or blocks other ingestion. Subsequent runs using identical configurations with identical binaries are still inconsistent as to whether this bug occurs.

I've failed to reproduce this behaviour: Filebeat logs a few errors, waits for some time (~10 seconds in my case), and retries, but it never enters the described infinite loop.

Just in case, I added a test in PR #37491.

rdner closed this as completed Jan 5, 2024