Beats managed by agent can be left without a configured output #2086
I've now seen this apply to more than just the monitoring Beats. Here is an example from some recent diagnostics with several Beats in this state:

```yaml
state:
  state: 2
  message: 'Healthy: communicating with pid ''7640'''
  units:
    ? unittype: 1
      unitid: log-default
    : state: 4
      message: '[output unit has no config]'
    ? unittype: 0
      unitid: log-default-logfile-system-b856919c-ade6-4a47-9e0c-28d4822b7ada
    : state: 4
      message: '[output unit has no config]'
  version_info:
    name: beat-v2-client
    version: 8.6.0
    meta:
      build_time: 2023-01-04 01:28:12 +0000 UTC
      commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
state:
  state: 2
  message: 'Healthy: communicating with pid ''5984'''
  units:
    ? unittype: 0
      unitid: winlog-default-winlog-system-b856919c-ade6-4a47-9e0c-28d4822b7ada
    : state: 4
      message: '[output unit has no config]'
    ? unittype: 1
      unitid: winlog-default
    : state: 4
      message: '[output unit has no config]'
  version_info:
    name: beat-v2-client
    version: 8.6.0
    meta:
      build_time: 2023-01-04 01:28:12 +0000 UTC
      commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
state:
  state: 2
  message: 'Healthy: communicating with pid ''2916'''
  units:
    ? unittype: 0
      unitid: beat/metrics-monitoring-metrics-monitoring-beats
    : state: 4
      message: '[output unit has no config]'
    ? unittype: 1
      unitid: beat/metrics-monitoring
    : state: 4
      message: '[output unit has no config]'
  version_info:
    name: beat-v2-client
    version: 8.6.0
    meta:
      build_time: 2023-01-04 01:30:05 +0000 UTC
      commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
state:
  state: 2
  message: 'Healthy: communicating with pid ''7980'''
  units:
    ? unittype: 0
      unitid: filestream-monitoring-filestream-monitoring-agent
    : state: 4
      message: '[output unit has no config]'
    ? unittype: 1
      unitid: filestream-monitoring
    : state: 4
      message: '[output unit has no config]'
```
Is there any possible workaround for this issue?
We have struggled to reproduce this internally so far. Triggering a restart of the Beats started by the agent can clear it. The simplest way to do this is to change the log level of the agent.
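For a standalone agent, a minimal sketch of what that looks like (assuming the logging level is set in elastic-agent.yml; a Fleet-managed agent exposes the same setting in the agent's Logs tab in Kibana):

```yaml
# Hypothetical excerpt from a standalone elastic-agent.yml.
# Flipping the level (for example info -> debug, then back) causes the agent to
# re-apply its configuration and restart the Beats it manages.
agent.logging.level: debug
```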
Editing any output setting for a Beat right now will trigger a restart of the Beats, so you can also trigger restarts of every agent in a policy by editing the output for that policy. The most benign way to do that is to add or subtract one from the bulk_max_size parameter of the Elasticsearch output. This has to be done as an advanced YAML parameter in Fleet, in this case increasing bulk_max_size from the default 50 to 51:
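A minimal sketch of that advanced YAML setting (only the changed key is shown; the exact location of the advanced YAML box in the Fleet UI may vary by version):

```yaml
# Fleet -> Settings -> Outputs -> edit the Elasticsearch output -> advanced YAML configuration.
# Bumping bulk_max_size by one is a harmless change that still forces the Beats to restart.
bulk_max_size: 51
```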
I wrote a script to continuously update the output configuration of a standalone agent policy every few seconds, and I have been unable to reproduce this at all. In the initial report this affected a Fleet-managed agent using the Logstash output; in particular, that report was affected by a missing backport of elastic/fleet-server#2105 in 8.6.0. It may be that this is an interaction of that bug on the Fleet side with the agent. If this does affect the agent by itself, it appears to be very rare.
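For reference, the output section of a standalone policy that such a script would rewrite looks roughly like this (a sketch with placeholder hosts and credentials, not the exact policy used in the test):

```yaml
# Hypothetical outputs section of a standalone elastic-agent.yml.
# A test script can repeatedly change a value here (for example bulk_max_size)
# to force the Beats to restart and pick up the new output configuration.
outputs:
  default:
    type: elasticsearch
    hosts: ["https://localhost:9200"]
    api_key: "<redacted>"
    bulk_max_size: 51
```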
hey @cmacknz, I can see lots of those errors when using elastic-agent in standalone mode. Elastic Agent in my case is running inside k8s, but I don't think that's relevant. The elastic-agent is also running on the same node where a previous run was correctly stopped. Please let me know if I can be of help. A sample of the error:

```json
{
  "log.level": "error",
  "@timestamp": "2023-01-17T10:00:07.412Z",
  "log.origin": {
    "file.name": "coordinator/coordinator.go",
    "file.line": 833
  },
  "message": "Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->FAILED): [output unit has no config]",
  "component": {
    "id": "beat/metrics-monitoring",
    "state": "HEALTHY"
  },
  "unit": {
    "id": "beat/metrics-monitoring-metrics-monitoring-beats",
    "type": "input",
    "state": "FAILED",
    "old_state": "CONFIGURING"
  },
  "ecs.version": "1.6.0"
}
```

I get the same error when starting elastic-agent in managed mode on the same nodes where I had previous runs of elastic-agent in either managed or standalone mode.
Thanks. I found at least one other case where the agent got into this state shortly after startup when using only the Elasticsearch output, so this isn't related to the output type or Fleet at all. @gsantoro does your agent stay in this state? Does it eventually self-correct after a policy change, output change, or log level change?
hey @cmacknz, I am not sure that's the case. I first noticed this error when running elastic-agent in standalone mode for some testing. After this test, I removed the agent completely and then added it back again after a while (1h maybe), but this time in managed mode. This time it looks like it recovered from the error, since it showed some info logs:

```
2023-01-17T14:24:27.725Z  info  Unit state changed beat/metrics-monitoring (HEALTHY->STOPPING): Restarting
2023-01-17T14:24:27.733Z  info  Component state changed http/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '297498' exited with code '0'
2023-01-17T14:24:27.733Z  info  Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->STOPPED): Suppressing FAILED state due to restart for '297498' exited with code '0'
2023-01-17T14:24:27.734Z  info  Unit state changed http/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '297498' exited with code '0'
2023-01-17T14:24:27.758Z  info  Component state changed beat/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '297499' exited with code '0'
2023-01-17T14:24:27.758Z  info  Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->STOPPED): Suppressing FAILED state due to restart for '297499' exited with code '0'
2023-01-17T14:24:27.758Z  info  Unit state changed beat/metrics-monitoring (STOPPING->STOPPED): Suppressing FAILED state due to restart for '297499' exited with code '0'
```

but a couple of minutes later:

```
2023-01-17T14:29:49.277Z  error  Component state changed kubernetes/metrics-default (HEALTHY->FAILED): Failed: pid '300057' exited with code '-1'
```

All three agents are now in an Unhealthy state. I don't remember having seen this error before. I'm inclined to start with a fresh cluster to see if the error disappears.
I have reproduced this issue with an agent that is running in a k8s cluster. It seems to be happening with the agent metrics component only, and it has persistently been in this state since I upgraded last week. I can "fix" it by disabling agent metrics and then re-enabling them, and the fix seems to stick even after pod restarts or creating new pods. Here's the state being reported to Fleet for the failing component:
I also have a similar problem happening on an agent running on a macbook, but this time with the endpoint component:
@cmacknz I can DM you my diags for the macbook one if that's helpful.
I definitely want the diagnostics for the one affecting endpoint.
DM'd it to you. FYI that k8s agent also went unhealthy again. Here are the logs:
Thanks. Looking at the Endpoint logs I see the following, which looks like a symptom of the exact same problem the Beats are experiencing:

```json
{"@timestamp":"2023-01-18T09:58:10.585763Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":362,"name":"Response.cpp"}}},"message":"Response.cpp:362 Policy action configure_output: failure - Failed to read output configuration","process":{"pid":27888,"thread":{"id":30776130}}}
{"@timestamp":"2023-01-18T09:58:10.585766Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":557,"name":"Comms.cpp"}}},"message":"Comms.cpp:557 No valid comms client configured","process":{"pid":27888,"thread":{"id":30776130}}}
```

If Endpoint is experiencing this problem it must be in the agent somewhere, since it uses a completely different agent client implementation. FYI @ferullo and @softengchick if you have not seen this issue yet. It appears that under some currently unknown circumstances the agent will configure the processes it runs without an output.
I think this is a false alarm for Endpoint: it does configure the output some time later, but it never clears the status. This might be a different bug specific to Endpoint not updating its status.

```json
{"@timestamp":"2023-01-18T09:58:39.772089Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":279,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:279 Parsing checkin v2 message","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:39.775817Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":110,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:110 Successfully parsed checkin v2 input unit (idx: 1)","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:39.775933Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":142,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:142 Successfully parsed checkin v2 output unit (idx: 1)","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:39.775945Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":328,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:328 Checkin v2 policy has changed, building new config","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:40.034328Z","agent":{"id":"b95a2ca7-1550-4621-a19f-9dcccae207fb","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":362,"name":"Response.cpp"}}},"message":"Response.cpp:362 Policy action configure_output: success - Successfully read output configuration","process":{"pid":27888,"thread":{"id":30777498}}}
```

This might be an example of behaviour similar to what we saw in #1960, where Endpoint sets a "DEGRADED" status and then never clears it.
We have a lead on a fix now. We believe this is the same problem first reported in elastic/beats#34137, but our first attempt to fix it only made the problem less likely instead of preventing it completely.
@cmacknz could you clarify how to reconcile your last two comments with the PR that is up now? Based on the PR description it appears that this Agent bug may be affecting Endpoint too, rather than being an Endpoint-specific bug. Is that right?
It can technically affect endpoint as well, but to be affected the endpoint service would have to restart itself, which I'm not sure it does as part of regular operation. The bug here is effectively a race condition, not something that is guaranteed to be hit on each restart. For Beats this problem is being amplified by elastic/beats#34178, where the Beats restart any time a section of the agent policy that applies to that Beat changes. This gives us many rolls of the dice and opportunities to hit this bug, but also to correct it.
It has been hard to assess how widespread this problem truly is, because it has been a struggle to even reproduce it with targeted tests (and it made it past our pre-release testing), but people are clearly hitting it.
I have observed this as well with a standalone configuration running the containerized elastic-agent in EKS. Oddly, it only appears to impact the redis input for me.
This will be fixed in 8.6.1.
When the Beats restart after an output configuration change, we've observed that starting in 8.6 some of them can be left in a failed state with an `output unit has no config` error. Example of the unit state from a recent agent diagnostics bundle:

So far, reports of this have primarily affected the Beats started for use in internal monitoring, but I have not been able to confirm whether this bug exclusively affects the monitoring Beats.
We need to reproduce this and fix it. It appears to be triggered by the Beats restarting when the output changes. A similar problem was observed in elastic/beats#34137, which was fixed in #2036. There must be another problem with unit configurations after a Beat restarts.
Relevant log snippet of the full state transition:
Interestingly, in these logs the non-monitoring Beats did not restart; only the monitoring Beats restarted.