
Processor errors can cause the Beat pipeline to enter what appears to be an infinite loop #34792

Closed
cmacknz opened this issue Mar 9, 2023 · 6 comments
Labels: Team:Elastic-Agent (Label for the Agent team)

cmacknz (Member) commented Mar 9, 2023

This was first observed in #34716, which describes a case of this problem triggered by an "attempt to use a closed processor" error. The trigger turned out to be unexpected processor reuse, but even with processor reuse the system should have remained functional and able to publish events. Processor reuse should ideally result in a warning log instead of a complete failure.

It is not clear why the Beat was publishing the following log line up to 10K times per second:

{"log.level":"error","@timestamp":"2023-03-02T11:59:42.394Z","message":"Failed to publish event: attempt to use a closed processor","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"publisher","log.origin":{"file.line":102,"file.name":"pipeline/client.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

The scope of this issue is to identify the mechanism that causes the Beat to loop continuously trying to publish the failed event when a closed processor error is detected, instead of logging a warning or dropping the event and continuing normally (not every event necessarily passes through the affected processor, for example).
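
For context, the "attempt to use a closed processor" text is produced by a guard that libbeat wraps around each processor so that Run is rejected once Close has been called. Below is a minimal, self-contained sketch of that guard; it assumes the shape shown in the SafeProcessor diff quoted in a later comment, and its Event type is only a stand-in for beat.Event:

package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrClosed matches the error text in the log line above.
var ErrClosed = errors.New("attempt to use a closed processor")

// Event is a stand-in for beat.Event in this sketch.
type Event struct{ Fields map[string]interface{} }

// safeProcessor sketches libbeat's SafeProcessor: after Close is
// called, every Run call is rejected instead of reaching the
// wrapped processor.
type safeProcessor struct {
	closed uint32
}

func (p *safeProcessor) Run(event *Event) (*Event, error) {
	if atomic.LoadUint32(&p.closed) == 1 {
		return nil, ErrClosed
	}
	return event, nil // the real guard delegates to the wrapped processor here
}

func (p *safeProcessor) Close() {
	atomic.StoreUint32(&p.closed, 1)
}

func main() {
	p := &safeProcessor{}
	p.Close()
	if _, err := p.Run(&Event{}); err != nil {
		fmt.Println("Failed to publish event:", err)
	}
}

The open question is why one such error per event apparently turns into a tight retry loop rather than the event simply being dropped after the error is logged.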

cmacknz added the Team:Elastic-Agent label Mar 9, 2023
elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz (Member, Author) commented Mar 9, 2023

The error appears to come from this block in pipeline/client.go:

if c.processors != nil {
	var err error
	event, err = c.processors.Run(event)
	publish = event != nil
	if err != nil {
		// If we introduce a dead-letter queue, this is where we should
		// route the event to it.
		log.Errorf("Failed to publish event: %v", err)
	}
}

This doesn't look like it should do anything other than log the error.

faec (Contributor) commented Mar 9, 2023

My guess based on what I've seen is that some early part of the pipeline got into an infinite retry loop with a closed processor or events associated with it. A reasonable way to troubleshoot might be to intentionally inject a closed processor into the pipeline and confirm that ingestion can continue with ~1 logged error per processor failure.
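
A minimal sketch of what such an injected processor could look like, assuming libbeat's beat.Processor interface (a String description plus Run); how it gets wired into the pipeline client's processor chain is left to the experiment:

package faultinject // hypothetical package for the experiment

import (
	"errors"

	"github.com/elastic/beats/v7/libbeat/beat"
)

// alwaysClosedProcessor fails every event the same way a closed
// SafeProcessor does, so we can watch whether ingestion continues
// with roughly one logged error per processor failure.
type alwaysClosedProcessor struct{}

func (alwaysClosedProcessor) String() string { return "always_closed" }

func (alwaysClosedProcessor) Run(event *beat.Event) (*beat.Event, error) {
	// Mimic the closed state: drop the event (return nil) and return
	// the same error string seen in the logs.
	return nil, errors.New("attempt to use a closed processor")
}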

rdner self-assigned this Mar 21, 2023
P1llus (Member) commented Mar 28, 2023

@rdner This was resolved at some point, right?

rdner (Member) commented Mar 28, 2023

@P1llus the error reported in the logs in the description has been resolved by #34761.

This issue is not about that error itself but about the behaviour of our processing pipeline when it faces an unrecoverable error. This should not cause an infinite loop.

Did I get your question right?

rdner (Member) commented Jan 5, 2024

I've tried to reproduce this issue by building a custom Filebeat from 2f7ff01 with this patch:

diff --git a/libbeat/processors/safe_processor.go b/libbeat/processors/safe_processor.go
index a0bbf5824d..b32b344c0d 100644
--- a/libbeat/processors/safe_processor.go
+++ b/libbeat/processors/safe_processor.go
@@ -35,10 +35,7 @@ type SafeProcessor struct {
 
 // Run allows to run processor only when `Close` was not called prior
 func (p *SafeProcessor) Run(event *beat.Event) (*beat.Event, error) {
-	if atomic.LoadUint32(&p.closed) == 1 {
-		return nil, ErrClosed
-	}
-	return p.Processor.Run(event)
+	return nil, ErrClosed
 }
 
 // Close makes sure the underlying `Close` function is called only once.

I then ran this modified Filebeat under Elastic Agent with the configuration taken from the original issue:

outputs:
  default:
    type: elasticsearch
    log_level: debug
    enabled: true
    hosts: [https://127.0.0.1:9200]
    username: "elastic"
    password: [password]
    allow_older_versions: true
    ssl:
      verification_mode: none
    shipper:
      enabled: true

inputs:
  - type: system/metrics
    id: unique-system-metrics-input
    data_stream.namespace: default
    use_output: default
    streams:
      - metricset: cpu
        data_stream.dataset: system.cpu
      - metricset: memory
        data_stream.dataset: system.memory
      - metricset: network
        data_stream.dataset: system.network
      - metricset: filesystem
        data_stream.dataset: system.filesystem

This is what I got in the Elastic Agent logs:

logs.mov (video attachment)

The original issue description reported:

Intermittently, the monitoring started by Agent enters a loop where it repeats this message ~10K times a second.

This error is inconsistent -- on some runs it begins soon after startup, on many runs it never happens at all. When it does happen, it severely degrades or blocks other ingestion. Subsequent runs using identical configurations with identical binaries are still inconsistent as to whether this bug occurs.

I've failed to reproduce this behaviour: Filebeat logs a few errors, waits for some time (~10 seconds in my case), and retries, but it never enters the described infinite loop.

Just in case, I added a test in PR #37491.

rdner closed this as completed Jan 5, 2024