
Beats no longer restart automatically when the output configuration changes. #1913

Closed
belimawr opened this issue Dec 8, 2022 · 14 comments · Fixed by elastic/beats#34066 or #1983
Labels: blocker, bug (Something isn't working), QA:Needs Validation (Needs validation by the QA Team), Team:Elastic-Agent (Label for the Agent team)

Comments

@belimawr
Contributor

belimawr commented Dec 8, 2022

Under V1 the Elastic-Agent decided when to restart a Beat; one of the situations in which it restarted Beats was a change of output. Now that this control has been handed over to the Beats, we need to ensure they still work as expected.

Here is the implementation under V1:

// IsRestartNeeded returns true if
// - spec is configured to support restart on change
// - output changes in between configs
func IsRestartNeeded(log *logger.Logger, spec program.Spec, cfgFetch configFetcher, newCfg map[string]interface{}) bool {
    if !spec.RestartOnOutputChange {
        // early exit if restart is not needed anyway
        return false
    }

    // compare outputs
    curCfgStr := cfgFetch.Config()
    if curCfgStr == "" {
        // no config currently applied
        return false
    }

    currentOutput, err := getOutputConfigFromString(curCfgStr)
    if err != nil {
        log.Errorf("failed to retrieve output config from current state: %v", err)
        return false
    }

    newOutput, err := getOutputConfigFromMap(newCfg)
    if err != nil {
        log.Errorf("failed to retrieve output config from new state: %v", err)
        return false
    }

    // restart needed only if output changed
    return currentOutput != newOutput
}

Some references:

@belimawr added the bug (Something isn't working) and Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) labels on Dec 8, 2022
@belimawr changed the title from "Test output change under V2" to "Ensure output change works as expected under V2" on Dec 8, 2022
@cmacknz
Member

cmacknz commented Dec 8, 2022

We were definitely hitting this RestartOnOutput change condition in V1:

https://github.com/elastic/elastic-agent/blob/a827e93586b119d9f12113a0bb9be37763f098e8/internal/spec/filebeat.yml#LL12C13-L12C31

artifact: beats/filebeat
restart_on_output_change: true
rules:
- fix_stream: {}

@cmacknz
Member

cmacknz commented Dec 12, 2022

restart_on_output_change: true is set for every Beat, and the underlying reasons for this have not been addressed on the Beat side.

We are going to have to handle this on the Beat side in V2. @blakerouse, if we want the Beats to initiate a restart, we will want a way for them to restart without appearing unhealthy.

Could we add a restarting observed state to the control protocol, to either tell the agent to expect a restart or maybe just have the agent execute the restart on behalf of the sub-process?

@cmacknz changed the title from "Ensure output change works as expected under V2" to "Beats no longer restart automatically when the output configuration changes." on Dec 12, 2022
@cmacknz added the Team:Elastic-Agent (Label for the Agent team) label and removed the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label on Dec 12, 2022
@blakerouse
Contributor

A quick restart would only bounce the unhealthiness for a quick moment; is that bad? Given that it is restarting, and therefore not working as it should, isn't it really unhealthy?

We do not report an unhealthy state to Fleet as soon as it changes; we only report it after the longpoll timeout. We should really look at changing that and add a debounce on status changes. So even if it did report unhealthy, that status should not be sent to Fleet unless it remains unhealthy for longer than 30 seconds or so.
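
A rough sketch of the kind of debounce being described here (the type name, the interval, and the reporting hook are all assumptions made for the sketch, not the agent's actual implementation): a status change is only forwarded if it is still the current status once the debounce interval has elapsed.

package main

import (
    "fmt"
    "sync"
    "time"
)

// statusDebouncer forwards a status only if it survives the debounce
// interval, so short-lived changes (for example an intentional restart)
// are never reported to Fleet. Illustrative sketch only.
type statusDebouncer struct {
    interval time.Duration
    report   func(status string) // e.g. send the status to Fleet Server

    mu      sync.Mutex
    current string
    seq     int
}

// Update records a new status and schedules it for reporting unless a newer
// status arrives before the interval elapses.
func (d *statusDebouncer) Update(status string) {
    d.mu.Lock()
    d.current = status
    d.seq++
    seq := d.seq
    d.mu.Unlock()

    time.AfterFunc(d.interval, func() {
        d.mu.Lock()
        defer d.mu.Unlock()
        if seq == d.seq { // no newer status arrived in the meantime
            d.report(d.current)
        }
    })
}

func main() {
    d := &statusDebouncer{
        interval: 2 * time.Second, // the thread suggests ~30s in practice
        report:   func(s string) { fmt.Println("reporting to Fleet:", s) },
    }
    d.Update("FAILED")  // never reported: superseded within the interval
    d.Update("HEALTHY") // reported after the interval if nothing changes again
    time.Sleep(3 * time.Second)
}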

@cmacknz
Member

cmacknz commented Dec 12, 2022

A quick restart would only bounce the unhealthiness for a quick moment; is that bad? Given that it is restarting, and therefore not working as it should, isn't it really unhealthy?

In this specific case, where the Beat restarting would be intentional we should consider the Beat healthy and working as it restarts. Even if the unhealthy status is brief here, I think we want to avoid showing it to users so we can avoid having to explain it in support tickets or demos.

We do not report an unhealthy state to Fleet as soon as it changes; we only report it after the longpoll timeout. We should really look at changing that and add a debounce on status changes. So even if it did report unhealthy, that status should not be sent to Fleet unless it remains unhealthy for longer than 30 seconds or so.

The status debounce is another way to solve this, but I don't know if we want to also hide unintentional restarts as well. I think we'd want to avoid reporting a process as healthy if it is restarting continuously but within the debounce interval, or at least we would want that to be obvious in the logs.

@blakerouse
Contributor

A quick restart would only bounce the unhealthiness for a quick moment; is that bad? Given that it is restarting, and therefore not working as it should, isn't it really unhealthy?

In this specific case, where the Beat restarting would be intentional we should consider the Beat healthy and working as it restarts. Even if the unhealthy status is brief here, I think we want to avoid showing it to users so we can avoid having to explain it in support tickets or demos.

I think the fact that Beats need to restart when the output changes is a problem in general, and adding some weird workaround to handle the fact that they need to restart is wrong.

We do not report an unhealthy state to Fleet as soon as it changes; we only report it after the longpoll timeout. We should really look at changing that and add a debounce on status changes. So even if it did report unhealthy, that status should not be sent to Fleet unless it remains unhealthy for longer than 30 seconds or so.

The status debounce is another way to solve this, but I don't know if we want to also hide unintentional restarts as well. I think we'd want to avoid reporting a process as healthy if it is restarting continuously but within the debounce interval, or at least we would want that to be obvious in the logs.

Status debounce is a requirement we need to have anyway. We cannot just drop and reconnect on every status change of the checkin longpoll or that would greatly increase the load on the Fleet Server for random status changes. Elastic Agent should only report to Fleet Server a status that is persistent for a period of time. The status change itself will be logged and sent to elasticsearch.

@cmacknz
Member

cmacknz commented Dec 13, 2022

I think the fact that Beats need to restart when the output changes is a problem in general, and adding some weird workaround to handle the fact that they need to restart is wrong.

Agreed in general. In this case Tiago has found a history of issues going back years, so I am mostly concerned about our ability to fix this in 8.6 given the time we have left.

Status debounce is a requirement we need to have anyway. We cannot just drop and reconnect on every status change of the checkin longpoll or that would greatly increase the load on the Fleet Server for random status changes. Elastic Agent should only report to Fleet Server a status that is persistent for a period of time. The status change itself will be logged and sent to elasticsearch.

My only requirement here is that we don't show Beats as unhealthy when they restart on purpose. I agree that status debouncing in the agent is one way to accomplish this, and that we should probably do it anyway.

I will propose that our technical approach here for 8.6 should then be:

  1. Implement status debouncing in the agent, to prevent brief and intentional restarts from causing the agent and affected units to appear unhealthy.
  2. Have the Beats restart themselves automatically whenever the output configuration changes, modifying the implementation here.

@blakerouse or @fearful-symmetry any opposition to this plan as a fix to avoid regressions around output reloading in 8.6?

Long term we will want to address the underlying issues with output reloading.

@fearful-symmetry
Contributor

Sorry, I'm just catching up to this. I was under the impression that we no longer wanted the Beats to restart on output change under V2. There have even been a few bugfixes to make this work, since the outputs had some interesting global state with monitoring variables. Are we tracking which issues require the restart?

Also, how would the Beats controlling the restart work? Would the Beat just close itself and let elastic-agent restart it?

@cmacknz
Member

cmacknz commented Dec 13, 2022

Sorry, I'm just catching up to this. I was under the impression that we no longer wanted the Beats to restart on output change under V2. There have even been a few bugfixes to make this work, since the outputs had some interesting global state with monitoring variables. Are we tracking which issues require the restart?

We do want to remove the restart eventually, but we don't believe we've actually fixed all of the bugs. In particular elastic/beats#24538 describes several known issues with output reloading, and is the original reason we had the agent automatically restart a Beat when its output configuration changed. As far as I can tell we haven't fixed any of these problems.

There is at least one known issue where the Beats need to be restarted when the output certificates change (elastic/beats#15956), and since the certificates can be managed through Fleet if we stop doing the restart this will be a regression vs 8.5 where the restart happened automatically.

Also, how would the Beats controlling the restart work? Would the Beat just close itself and let elastic-agent restart it?

@blakerouse does the agent automatically restart processes/components that have exited unexpectedly? I am assuming yes, so the Beat just needs to gracefully stop whenever an output configuration change is detected and the agent will handle the rest.
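
As a rough sketch of that flow (the type and helper names below are invented for illustration and are not the actual libbeat implementation): when a new expected configuration arrives, the Beat compares its output section with the one currently applied and, if they differ and the opt-in is enabled, stops gracefully so the agent respawns it.

package main

import (
    "fmt"
    "reflect"
)

// outputConfig stands in for the output section of a unit's expected config.
type outputConfig map[string]interface{}

// restartMonitor stops the Beat gracefully when the output configuration
// changes, so the agent respawns it with the new output. Illustrative only.
type restartMonitor struct {
    enabled bool                // the opt-in flag discussed above
    current outputConfig        // output config currently applied
    stop    func(reason string) // graceful shutdown hook
}

// OnConfigChange is called whenever a new expected configuration arrives from
// the agent. It returns true if a restart was initiated.
func (m *restartMonitor) OnConfigChange(newOutput outputConfig) bool {
    if !m.enabled || m.current == nil {
        // First config, or the opt-in is disabled: just record it.
        m.current = newOutput
        return false
    }
    if reflect.DeepEqual(m.current, newOutput) {
        return false // output unchanged, keep running
    }
    // Output changed: stop gracefully (exit code 0) and let the agent
    // restart the process with the new configuration.
    m.stop("output configuration changed")
    return true
}

func main() {
    m := &restartMonitor{
        enabled: true,
        stop:    func(reason string) { fmt.Println("stopping beat:", reason) },
    }
    m.OnConfigChange(outputConfig{"hosts": []interface{}{"https://a:9200"}})
    m.OnConfigChange(outputConfig{"hosts": []interface{}{"https://b:9200"}}) // triggers stop
}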

@cmacknz
Member

cmacknz commented Dec 14, 2022

I am going to split out the agent debounce work into a separate issue, and this will just track getting the Beats to restart when the output changes. The lack of restart on output change is the reason this is a blocker, the debounce is just a UI optimization.

@fearful-symmetry one additional complexity here is that we are not going to want the restart logic to apply when libbeat is used from APM server, so we can't just do it unconditionally. It needs to be an opt-in behaviour.

This is the list of Beats that were opting into the restart on output change behaviour prior to 8.6 (note that this list excludes APM server):

rg restart internal/spec
internal/spec/filebeat.yml
12:restart_on_output_change: true

internal/spec/metricbeat.yml
12:restart_on_output_change: true

internal/spec/auditbeat.yml
13:restart_on_output_change: true

internal/spec/cloudbeat.yml
8:restart_on_output_change: true

internal/spec/heartbeat.yml
5:restart_on_output_change: true

internal/spec/osquerybeat.yml
4:restart_on_output_change: true

internal/spec/packetbeat.yml
5:restart_on_output_change: true

@fearful-symmetry
Contributor

It needs to be an opt-in behaviour.

Was already anticipating something like that. Current plan is to have a config flag that can be set with -E.
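
For illustration only, the opt-in could then be passed when the agent spawns the Beat with something like the line below; the exact setting name is an assumption here, mirroring the old restart_on_output_change spec option, and was not settled in this thread:

filebeat -E management.restart_on_output_change=true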

@cmacknz
Member

cmacknz commented Dec 14, 2022

Separate issue for implementing the debounce logic on the agent side to avoid inadvertently reporting the agent as unhealthy when the Beats restart on purpose: #1946

@cmacknz
Member

cmacknz commented Dec 22, 2022

Reopening, requires elastic/beats#34066

@cmacknz reopened this on Dec 22, 2022
@cmacknz added the QA:Needs Validation (Needs validation by the QA Team) label on Dec 22, 2022
@cmacknz
Member

cmacknz commented Dec 22, 2022

I've added the QA:Needs Validation label. The test we want for this is to change the configuration of the output in Fleet, save it, and then ensure that each of the underlying Beats restarts. The agent may appear unhealthy for up to 5 minutes, but ideally it returns to the healthy state afterwards.

For example, we can add the worker: 2 and bulk_max_size: 100 advanced YAML settings as a test, save the configuration, and watch the logs for the Beats to restart.
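
For reference, those settings would be entered in the output's advanced YAML configuration box in Fleet roughly as follows (the values are only examples for the test):

worker: 2
bulk_max_size: 100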

[Screenshot: Screen Shot 2022-12-22 at 11 21 28 AM]

It would be best to run this test several times, changing the values of the parameters each time. For example, the bulk_max_size parameter can safely be switched from 50 to 100 and back to 50 to test this.

@cmacknz
Copy link
Member

cmacknz commented Dec 23, 2022

For the QA team, the logs when the Beats restart because of an output change look something like:

14:10:22.393 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (HEALTHY->CONFIGURING): Configuring
14:10:22.393 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (HEALTHY->CONFIGURING): Configuring
14:10:22.495 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring (HEALTHY->STOPPING): Restarting
14:10:22.498 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring (HEALTHY->STOPPING): Restarting
14:10:22.499 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->STOPPING): Stopping
14:10:22.502 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->STOPPING): Stopping
14:10:22.507 elastic_agent [elastic_agent][error] Component state changed http/metrics-monitoring (HEALTHY->FAILED): Failed: pid '39123' exited with code '0'
14:10:22.507 elastic_agent [elastic_agent][error] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (STOPPING->FAILED): Failed: pid '39123' exited with code '0'
14:10:22.507 elastic_agent [elastic_agent][error] Unit state changed http/metrics-monitoring (STOPPING->FAILED): Failed: pid '39123' exited with code '0'
14:10:22.507 elastic_agent [elastic_agent][error] Component state changed beat/metrics-monitoring (HEALTHY->FAILED): Failed: pid '39122' exited with code '0'
14:10:22.507 elastic_agent [elastic_agent][error] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (STOPPING->FAILED): Failed: pid '39122' exited with code '0'
14:10:22.508 elastic_agent [elastic_agent][error] Unit state changed beat/metrics-monitoring (STOPPING->FAILED): Failed: pid '39122' exited with code '0'
14:10:32.518 elastic_agent [elastic_agent][info] Component state changed beat/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '39249'
14:10:32.519 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (FAILED->STARTING): Starting: spawned pid '39249'
14:10:32.519 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '39249'
14:10:32.527 elastic_agent [elastic_agent][info] Component state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '39250'
14:10:32.528 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (FAILED->STARTING): Starting: spawned pid '39250'
14:10:32.528 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '39250'
14:10:32.636 elastic_agent [elastic_agent][info] Component state changed beat/metrics-monitoring (STARTING->HEALTHY): Healthy: communicating with pid '39249'
14:10:32.638 elastic_agent [elastic_agent][info] Component state changed http/metrics-monitoring (STARTING->HEALTHY): Healthy: communicating with pid '39250'
14:10:32.748 elastic_agent [elastic_agent][error] Component state changed beat/metrics-monitoring (HEALTHY->FAILED): Failed: pid '39249' exited with code '2'
