Beats no longer restart automatically when the output configuration changes. #1913
We were definitely hitting this artifact: beats/filebeat

```yaml
restart_on_output_change: true
rules:
- fix_stream: {}
```
We are going to have to handle this on the Beat side in V2. @blakerouse if we want the Beats to initiate a restart we will want a way for them to restart without appearing unhealthy. Could we add a …
A quick restart would only bounce the unhealthiness for a brief moment, is that bad? Given that it is restarting and necessarily not working as it should, is it really not unhealthy? We do not report the unhealthy state to Fleet as soon as it changes; we actually only report it after the longpoll timeout. We should look at changing that, and add a debounce on status changes. So even if a Beat did report unhealthy, that should not be reported to Fleet unless it remains unhealthy for longer than 30 seconds or so.
In this specific case, where the Beat restarting would be intentional we should consider the Beat healthy and working as it restarts. Even if the unhealthy status is brief here, I think we want to avoid showing it to users so we can avoid having to explain it in support tickets or demos.
The status debounce is another way to solve this, but I don't know if we want to also hide unintentional restarts. I think we'd want to avoid reporting a process as healthy if it is restarting continuously within the debounce interval, or at least we would want that to be obvious in the logs.
I think the fact that beats needs to restart when the output changes is a problem in general and adding some weird workaround for handling the fact that it needs to restart is wrong.
Status debounce is a requirement we need anyway. We cannot just drop and reconnect on every status change of the checkin longpoll, or that would greatly increase the load on the Fleet Server for transient status changes. Elastic Agent should only report to Fleet Server a status that has persisted for a period of time. The status change itself will still be logged and sent to Elasticsearch.
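To make the debounce idea concrete, here is a minimal sketch of a status debouncer. The `Debouncer` type, the string status, the reporting callback, and the 30-second window are illustrative assumptions, not the agent's actual API:

```go
package debounce

import (
	"sync"
	"time"
)

// Debouncer reports a status change to Fleet only if it persists for
// the full debounce window (e.g. 30 seconds), so a brief unhealthy
// blip during an intentional restart is never surfaced. Note that a
// process flapping faster than the window keeps resetting the timer
// and is never reported, so flapping would only be visible in logs,
// which is the concern raised above.
type Debouncer struct {
	mu      sync.Mutex
	window  time.Duration
	pending *time.Timer
	report  func(status string) // e.g. send to Fleet Server
}

func NewDebouncer(window time.Duration, report func(string)) *Debouncer {
	return &Debouncer{window: window, report: report}
}

// Observe is called on every local status change. Each new change
// cancels the previous pending report and restarts the window, so
// only a status that remains stable for the whole window is reported.
func (d *Debouncer) Observe(status string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pending != nil {
		d.pending.Stop()
	}
	s := status
	d.pending = time.AfterFunc(d.window, func() { d.report(s) })
}
```

With a 30-second window, the brief unhealthy state from an intentional restart never reaches Fleet, while a persistently unhealthy component is still reported.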
Agreed in general. In this case Tiago has found a history of issues going back years, so I am mostly concerned about our ability to fix this in 8.6 given the time we have left.
My only requirement here is that we don't show Beats as unhealthy when they restart on purpose. I agree that status debouncing in the agent is one way to accomplish this, and that we should probably do it anyway. I will propose that our technical approach for 8.6 should then be:

1. Have each Beat gracefully stop itself when its output configuration changes, and let the agent restart the exited process.
2. Implement status debouncing in the agent as separate work, so the intentional restart is not reported to Fleet as unhealthy.
@blakerouse or @fearful-symmetry any opposition to this plan as a fix to avoid regressions around output reloading in 8.6? Long term we will want to address the underlying issues with output reloading.
Sorry, I'm just catching up to this. I was under the impression that we no longer wanted the Beats to restart on output change under V2. There have even been a few bugfixes to make this work, since the outputs had some interesting global state with monitoring variables. Are we tracking what issues require the restart? Also, how would Beats controlling the restart work? Would the Beat just close itself and let elastic-agent restart it?
We do want to remove the restart eventually, but we don't believe we've actually fixed all of the bugs. In particular elastic/beats#24538 describes several known issues with output reloading, and is the original reason we had the agent automatically restart a Beat when its output configuration changed. As far as I can tell we haven't fixed any of these problems. There is at least one known issue where the Beats need to be restarted when the output certificates change (elastic/beats#15956), and since the certificates can be managed through Fleet, stopping the restart would be a regression vs 8.5 where the restart happened automatically.
@blakerouse does the agent automatically restart processes/components that have exited unexpectedly? I am assuming yes, so the Beat just needs to gracefully stop whenever an output configuration change is detected and the agent will handle the rest.
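In sketch form, the Beat-side half of that could look like the following; the hook name, the map-based config, and the `os.Exit` shutdown are hypothetical stand-ins for libbeat's real shutdown path:

```go
package restart

import (
	"log"
	"os"
	"reflect"
)

// onOutputChange sketches the proposed behaviour: on receiving a new
// output configuration from the agent, the Beat stops itself cleanly
// and relies on the agent restarting the exited component. The hook
// name and map-based config are hypothetical, not the libbeat API.
func onOutputChange(oldCfg, newCfg map[string]interface{}) {
	if reflect.DeepEqual(oldCfg, newCfg) {
		return // output unchanged, keep running
	}
	log.Println("output configuration changed, stopping so the agent restarts us")
	// A real Beat would run its normal shutdown path (flush queues,
	// close inputs) before exiting.
	os.Exit(0)
}
```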
I am going to split out the agent debounce work into a separate issue, and this will just track getting the Beats to restart when the output changes. The lack of restart on output change is the reason this is a blocker; the debounce is just a UI optimization.

@fearful-symmetry one additional complexity here is that we are not going to want the restart logic to apply when libbeat is used from APM Server, so we can't just do it unconditionally. It needs to be an opt-in behaviour. This is the list of Beats that were opting into the restart on output change behaviour prior to 8.6 (note that this list excludes APM Server):
Was already anticipating something like that. The current plan is to have a config flag that can be set with …
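The flag name is cut off above, so purely as an illustration (borrowing `restart_on_output_change` from the V1 spec file quoted earlier, which is not necessarily what elastic/beats#34066 shipped), the opt-in gate could look like:

```go
package restart

// ManagementConfig is a hypothetical opt-in setting: only Beats that
// explicitly enable it restart on output change, so libbeat consumers
// such as APM Server are unaffected by default.
type ManagementConfig struct {
	RestartOnOutputChange bool `config:"restart_on_output_change"`
}

// shouldRestart gates the restart behind the opt-in flag.
func shouldRestart(cfg ManagementConfig, outputChanged bool) bool {
	return cfg.RestartOnOutputChange && outputChanged
}
```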
Separate issue for implementing the debounce logic on the agent side to avoid inadvertently reporting the agent as unhealthy when the Beats restart on purpose: #1946
Reopening, requires elastic/beats#34066
I've added the QA:Needs Validation label. The test we want for this is to change the output configuration in Fleet, save it, and then ensure that each of the underlying Beats restarts. The agent may appear unhealthy for up to 5 minutes, but it should return to the healthy state afterwards. For example we can add the …

It would be best to run this test several times, changing the values of the parameters used each time. For example, the …
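For the parameters to vary between runs, one option is the advanced YAML settings on the Fleet output. The keys below are standard Beats Elasticsearch output settings; the specific values are only examples:

```yaml
# Paste into the output's advanced YAML settings in Fleet and save;
# each save should trigger a restart of the underlying Beats.
# Change the values between test runs.
bulk_max_size: 100
worker: 2
compression_level: 3
```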
For the QA team, the logs when the Beats restart because of an output change look something like:
Under V1 the Elastic Agent would decide when to restart a Beat; one of the situations in which it restarted Beats was on output change. Now that this control has been handed over to the Beats, we need to ensure they still work as expected.
Here is the implementation under V1: elastic-agent/internal/pkg/core/plugin/common.go, lines 20 to 50 at commit a827e93.
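As a rough sketch of the decision that code makes (consult the permalink for the actual implementation; this simplifies it, assuming the agent compares the output section of the old and new configurations and honors the spec flag shown earlier):

```go
package plugin

import "reflect"

// Spec mirrors the per-Beat spec flag shown earlier in the thread.
type Spec struct {
	RestartOnOutputChange bool
}

// needsRestart sketches the V1 agent-side decision: restart only when
// the Beat opted in and the output section actually changed.
func needsRestart(spec Spec, oldCfg, newCfg map[string]interface{}) bool {
	if !spec.RestartOnOutputChange {
		return false
	}
	return !reflect.DeepEqual(oldCfg["output"], newCfg["output"])
}
```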
Some references: