
Beats no longer restart automatically when the output configuration changes. #1913

Closed
belimawr opened this issue Dec 8, 2022 · 14 comments · Fixed by elastic/beats#34066 or #1983
Labels: blocker, bug (Something isn't working), QA:Needs Validation (Needs validation by the QA Team), Team:Elastic-Agent (Label for the Agent team)

Comments

@belimawr
Contributor

belimawr commented Dec 8, 2022

Under V1 the Elastic-Agent decided when to restart a Beat; one of the situations in which it restarted Beats was a change of output. Now that this control has been handed over to the Beats, we need to ensure they still work as expected.

Here is the implementation under V1:

// IsRestartNeeded returns true if
// - spec is configured to support restart on change
// - output changes in between configs
func IsRestartNeeded(log *logger.Logger, spec program.Spec, cfgFetch configFetcher, newCfg map[string]interface{}) bool {
    if !spec.RestartOnOutputChange {
        // early exit if restart is not needed anyway
        return false
    }

    // compare outputs
    curCfgStr := cfgFetch.Config()
    if curCfgStr == "" {
        // no config currently applied
        return false
    }

    currentOutput, err := getOutputConfigFromString(curCfgStr)
    if err != nil {
        log.Errorf("failed to retrieve output config from current state: %v", err)
        return false
    }

    newOutput, err := getOutputConfigFromMap(newCfg)
    if err != nil {
        log.Errorf("failed to retrieve output config from new state: %v", err)
        return false
    }

    // restart needed only if output changed
    return currentOutput != newOutput
}

Some references:

@belimawr added the bug (Something isn't working) and Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) labels on Dec 8, 2022
@belimawr changed the title from "Test output change under V2" to "Ensure output change works as expected under V2" on Dec 8, 2022
@cmacknz
Member

cmacknz commented Dec 8, 2022

We were definitely hitting this RestartOnOutput change condition in V1:

https://github.com/elastic/elastic-agent/blob/a827e93586b119d9f12113a0bb9be37763f098e8/internal/spec/filebeat.yml#LL12C13-L12C31

artifact: beats/filebeat
restart_on_output_change: true
rules:
- fix_stream: {}

@cmacknz
Member

cmacknz commented Dec 12, 2022

restart_on_output_change: true is set for every Beat, and the underlying reasons for this have not been addressed on the Beat side.

We are going to have to handle this on the Beat side in V2. @blakerouse, if we want the Beats to initiate a restart, we will want a way for them to restart without appearing unhealthy.

Could we add a restarting observed state to the control protocol, to either tell the agent to expect a restart or maybe just have the agent execute the restart on behalf of the sub-process?

@cmacknz changed the title from "Ensure output change works as expected under V2" to "Beats no longer restart automatically when the output configuration changes." on Dec 12, 2022
@cmacknz added the Team:Elastic-Agent (Label for the Agent team) label and removed the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label on Dec 12, 2022
@blakerouse
Contributor

A quick restart would only bounce the unhealthiness for a quick moment; is that bad? Given that it is restarting, and therefore not working as it should, isn't it really unhealthy?

We do not report an unhealthy state to Fleet as soon as it changes; we only report it after the longpoll timeout. We should really look at changing that and add a debounce on status changes. So even if it did report unhealthy, that status should not be sent to Fleet unless it remains unhealthy for longer than 30 seconds or so.
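
A rough sketch of the kind of debounce being described here (the type name, the interval, and the reporting hook are all assumptions made for the sketch, not the agent's actual implementation): a status change is only forwarded if it is still the current status once the debounce interval has elapsed.

package main

import (
    "fmt"
    "sync"
    "time"
)

// statusDebouncer forwards a status only if it survives the debounce
// interval, so short-lived changes (for example an intentional restart)
// are never reported to Fleet. Illustrative sketch only.
type statusDebouncer struct {
    interval time.Duration
    report   func(status string) // e.g. send the status to Fleet Server

    mu      sync.Mutex
    current string
    seq     int
}

// Update records a new status and schedules it for reporting unless a newer
// status arrives before the interval elapses.
func (d *statusDebouncer) Update(status string) {
    d.mu.Lock()
    d.current = status
    d.seq++
    seq := d.seq
    d.mu.Unlock()

    time.AfterFunc(d.interval, func() {
        d.mu.Lock()
        defer d.mu.Unlock()
        if seq == d.seq { // no newer status arrived in the meantime
            d.report(d.current)
        }
    })
}

func main() {
    d := &statusDebouncer{
        interval: 2 * time.Second, // the thread suggests ~30s in practice
        report:   func(s string) { fmt.Println("reporting to Fleet:", s) },
    }
    d.Update("FAILED")  // never reported: superseded within the interval
    d.Update("HEALTHY") // reported after the interval if nothing changes again
    time.Sleep(3 * time.Second)
}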

@cmacknz
Member

cmacknz commented Dec 12, 2022

A quick restart would only bounce the unhealthiness for a quick moment; is that bad? Given that it is restarting, and therefore not working as it should, isn't it really unhealthy?

In this specific case, where the Beat restarting would be intentional we should consider the Beat healthy and working as it restarts. Even if the unhealthy status is brief here, I think we want to avoid showing it to users so we can avoid having to explain it in support tickets or demos.

We do not report an unhealthy state to Fleet as soon as it changes; we only report it after the longpoll timeout. We should really look at changing that and add a debounce on status changes. So even if it did report unhealthy, that status should not be sent to Fleet unless it remains unhealthy for longer than 30 seconds or so.

The status debounce is another way to solve this, but I don't know if we want to also hide unintentional restarts as well. I think we'd want to avoid reporting a process as healthy if it is restarting continuously but within the debounce interval, or at least we would want that to be obvious in the logs.

@blakerouse
Contributor

A quick restart would only bounce the unhealthiness for a quick moment; is that bad? Given that it is restarting, and therefore not working as it should, isn't it really unhealthy?

In this specific case, where the Beat restarting would be intentional we should consider the Beat healthy and working as it restarts. Even if the unhealthy status is brief here, I think we want to avoid showing it to users so we can avoid having to explain it in support tickets or demos.

I think the fact that Beats need to restart when the output changes is a problem in general, and adding some weird workaround to handle the fact that they need to restart is wrong.

We do not report an unhealthy state to Fleet as soon as it changes; we only report it after the longpoll timeout. We should really look at changing that and add a debounce on status changes. So even if it did report unhealthy, that status should not be sent to Fleet unless it remains unhealthy for longer than 30 seconds or so.

The status debounce is another way to solve this, but I don't know if we want to also hide unintentional restarts as well. I think we'd want to avoid reporting a process as healthy if it is restarting continuously but within the debounce interval, or at least we would want that to be obvious in the logs.

Status debounce is a requirement we need to have anyway. We cannot just drop and reconnect on every status change of the checkin longpoll or that would greatly increase the load on the Fleet Server for random status changes. Elastic Agent should only report to Fleet Server a status that is persistent for a period of time. The status change itself will be logged and sent to elasticsearch.

@cmacknz
Member

cmacknz commented Dec 13, 2022

I think the fact that Beats need to restart when the output changes is a problem in general, and adding some weird workaround to handle the fact that they need to restart is wrong.

Agreed in general. In this case Tiago has found a history of issues going back years, so I am mostly concerned about our ability to fix this in 8.6 given the time we have left.

Status debounce is a requirement we need to have anyway. We cannot just drop and reconnect on every status change of the checkin longpoll or that would greatly increase the load on the Fleet Server for random status changes. Elastic Agent should only report to Fleet Server a status that is persistent for a period of time. The status change itself will be logged and sent to elasticsearch.

My only requirement here is that we don't show Beats as unhealthy when they restart on purpose. I agree that status debouncing in the agent is one way to accomplish this, and that we should probably do it anyway.

I will propose that our technical approach here for 8.6 should then be:

  1. Implement status debouncing in the agent, to prevent brief and intentional restarts from causing the agent and affected units to appear unhealthy.
  2. Have the Beats restart themselves automatically whenever the output configuration changes, modifying the implementation here.

@blakerouse or @fearful-symmetry any opposition to this plan as a fix to avoid regressions around output reloading in 8.6?

Long term we will want to address the underlying issues with output reloading.

@fearful-symmetry
Contributor

Sorry, I'm just catching up to this. I was under the impression that we no longer wanted the Beats to restart on output change under V2. There have even been a few bugfixes to make this work, since the outputs had some interesting global state with monitoring variables. Are we tracking which issues require the restart?

Also, how would the Beats controlling the restart work? Would the Beat just close itself and let elastic-agent restart it?

@cmacknz
Member

cmacknz commented Dec 13, 2022

Sorry, I'm just catching up to this. I was under the impression that we no longer wanted the Beats to restart on output change under V2. There have even been a few bugfixes to make this work, since the outputs had some interesting global state with monitoring variables. Are we tracking which issues require the restart?

We do want to remove the restart eventually, but we don't believe we've actually fixed all of the bugs. In particular elastic/beats#24538 describes several known issues with output reloading, and is the original reason we had the agent automatically restart a Beat when its output configuration changed. As far as I can tell we haven't fixed any of these problems.

There is at least one known issue where the Beats need to be restarted when the output certificates change (elastic/beats#15956), and since the certificates can be managed through Fleet if we stop doing the restart this will be a regression vs 8.5 where the restart happened automatically.

Also, how would the Beats controlling the restart work? Would the Beat just close itself and let elastic-agent restart it?

@blakerouse does the agent automatically restart processes/components that have exited unexpectedly? I am assuming yes, so the Beat just needs to gracefully stop whenever an output configuration change is detected and the agent will handle the rest.
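
As a rough sketch of that flow (the type and helper names below are invented for illustration and are not the actual libbeat implementation): when a new expected configuration arrives, the Beat compares its output section with the one currently applied and, if they differ and the opt-in is enabled, stops gracefully so the agent respawns it.

package main

import (
    "fmt"
    "reflect"
)

// outputConfig stands in for the output section of a unit's expected config.
type outputConfig map[string]interface{}

// restartMonitor stops the Beat gracefully when the output configuration
// changes, so the agent respawns it with the new output. Illustrative only.
type restartMonitor struct {
    enabled bool                // the opt-in flag discussed above
    current outputConfig        // output config currently applied
    stop    func(reason string) // graceful shutdown hook
}

// OnConfigChange is called whenever a new expected configuration arrives from
// the agent. It returns true if a restart was initiated.
func (m *restartMonitor) OnConfigChange(newOutput outputConfig) bool {
    if !m.enabled || m.current == nil {
        // First config, or the opt-in is disabled: just record it.
        m.current = newOutput
        return false
    }
    if reflect.DeepEqual(m.current, newOutput) {
        return false // output unchanged, keep running
    }
    // Output changed: stop gracefully (exit code 0) and let the agent
    // restart the process with the new configuration.
    m.stop("output configuration changed")
    return true
}

func main() {
    m := &restartMonitor{
        enabled: true,
        stop:    func(reason string) { fmt.Println("stopping beat:", reason) },
    }
    m.OnConfigChange(outputConfig{"hosts": []interface{}{"https://a:9200"}})
    m.OnConfigChange(outputConfig{"hosts": []interface{}{"https://b:9200"}}) // triggers stop
}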

@cmacknz
Member

cmacknz commented Dec 14, 2022

I am going to split out the agent debounce work into a separate issue, and this will just track getting the Beats to restart when the output changes. The lack of restart on output change is the reason this is a blocker, the debounce is just a UI optimization.

@fearful-symmetry one additional complexity here is that we are not going to want the restart logic to apply when libbeat is used from APM server, so we can't just do it unconditionally. It needs to be an opt-in behaviour.

This is the list of Beats that were opting into the restart on output change behaviour prior to 8.6 (note that this list excludes APM server):

rg restart internal/spec
internal/spec/filebeat.yml
12:restart_on_output_change: true

internal/spec/metricbeat.yml
12:restart_on_output_change: true

internal/spec/auditbeat.yml
13:restart_on_output_change: true

internal/spec/cloudbeat.yml
8:restart_on_output_change: true

internal/spec/heartbeat.yml
5:restart_on_output_change: true

internal/spec/osquerybeat.yml
4:restart_on_output_change: true

internal/spec/packetbeat.yml
5:restart_on_output_change: true

@fearful-symmetry
Contributor

It needs to be an opt-in behaviour.

Was already anticipating something like that. Current plan is to have a config flag that can be set with -E.
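
For illustration only, the opt-in could then be passed when the agent spawns the Beat with something like the line below; the exact setting name is an assumption here, mirroring the old restart_on_output_change spec option, and was not settled in this thread:

filebeat -E management.restart_on_output_change=true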

@cmacknz
Member

cmacknz commented Dec 14, 2022

Separate issue for implementing the debounce logic on the agent side to avoid inadvertently reporting the agent as unhealthy when the Beats restart on purpose: #1946

@cmacknz
Member

cmacknz commented Dec 22, 2022

Reopening, requires elastic/beats#34066

@cmacknz reopened this on Dec 22, 2022
@cmacknz added the QA:Needs Validation (Needs validation by the QA Team) label on Dec 22, 2022
@cmacknz
Member

cmacknz commented Dec 22, 2022

I've added the QA:Needs Validation label. The test we want for this is to change the configuration of the output in Fleet, save it, and then ensure that each of the underlying Beats restarts. The agent may appear unhealthy for up to 5 minutes, but ideally it returns to the healthy state afterwards.

For example, we can add the worker: 2 and bulk_max_size: 100 advanced YAML settings as a test, save the configuration, and watch the logs for the Beats to restart.
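
For reference, those settings would be entered in the output's advanced YAML configuration box in Fleet roughly as follows (the values are only examples for the test):

worker: 2
bulk_max_size: 100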

[Screenshot: Screen Shot 2022-12-22 at 11 21 28 AM]

It would be best to run this test several times, changing the values of the parameters each time. For example, the bulk_max_size parameter can safely be switched from 50 to 100 and back to 50 to test this.

@cmacknz
Copy link
Member

cmacknz commented Dec 23, 2022

For the QA team, the logs when the Beats restart because of an output change look something like:

14:10:22.393 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (HEALTHY->CONFIGURING): Configuring
14:10:22.393 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (HEALTHY->CONFIGURING): Configuring
14:10:22.495 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring (HEALTHY->STOPPING): Restarting
14:10:22.498 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring (HEALTHY->STOPPING): Restarting
14:10:22.499 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->STOPPING): Stopping
14:10:22.502 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->STOPPING): Stopping
14:10:22.507 elastic_agent [elastic_agent][error] Component state changed http/metrics-monitoring (HEALTHY->FAILED): Failed: pid '39123' exited with code '0'
14:10:22.507 elastic_agent [elastic_agent][error] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (STOPPING->FAILED): Failed: pid '39123' exited with code '0'
14:10:22.507 elastic_agent [elastic_agent][error] Unit state changed http/metrics-monitoring (STOPPING->FAILED): Failed: pid '39123' exited with code '0'
14:10:22.507 elastic_agent [elastic_agent][error] Component state changed beat/metrics-monitoring (HEALTHY->FAILED): Failed: pid '39122' exited with code '0'
14:10:22.507 elastic_agent [elastic_agent][error] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (STOPPING->FAILED): Failed: pid '39122' exited with code '0'
14:10:22.508 elastic_agent [elastic_agent][error] Unit state changed beat/metrics-monitoring (STOPPING->FAILED): Failed: pid '39122' exited with code '0'
14:10:32.518 elastic_agent [elastic_agent][info] Component state changed beat/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '39249'
14:10:32.519 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (FAILED->STARTING): Starting: spawned pid '39249'
14:10:32.519 elastic_agent [elastic_agent][info] Unit state changed beat/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '39249'
14:10:32.527 elastic_agent [elastic_agent][info] Component state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '39250'
14:10:32.528 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (FAILED->STARTING): Starting: spawned pid '39250'
14:10:32.528 elastic_agent [elastic_agent][info] Unit state changed http/metrics-monitoring (FAILED->STARTING): Starting: spawned pid '39250'
14:10:32.636 elastic_agent [elastic_agent][info] Component state changed beat/metrics-monitoring (STARTING->HEALTHY): Healthy: communicating with pid '39249'
14:10:32.638 elastic_agent [elastic_agent][info] Component state changed http/metrics-monitoring (STARTING->HEALTHY): Healthy: communicating with pid '39250'
14:10:32.748 elastic_agent [elastic_agent][error] Component state changed beat/metrics-monitoring (HEALTHY->FAILED): Failed: pid '39249' exited with code '2'
