Beats managed by agent can be left without a configured output #2086
I've now seen this apply to more than just the monitoring Beats. Here is an example from some recent diagnostics with several Beats in this state:

```yaml
state:
  state: 2
  message: 'Healthy: communicating with pid ''7640'''
  units:
    ? unittype: 1
      unitid: log-default
    : state: 4
      message: '[output unit has no config]'
    ? unittype: 0
      unitid: log-default-logfile-system-b856919c-ade6-4a47-9e0c-28d4822b7ada
    : state: 4
      message: '[output unit has no config]'
  version_info:
    name: beat-v2-client
    version: 8.6.0
    meta:
      build_time: 2023-01-04 01:28:12 +0000 UTC
      commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
state:
  state: 2
  message: 'Healthy: communicating with pid ''5984'''
  units:
    ? unittype: 0
      unitid: winlog-default-winlog-system-b856919c-ade6-4a47-9e0c-28d4822b7ada
    : state: 4
      message: '[output unit has no config]'
    ? unittype: 1
      unitid: winlog-default
    : state: 4
      message: '[output unit has no config]'
  version_info:
    name: beat-v2-client
    version: 8.6.0
    meta:
      build_time: 2023-01-04 01:28:12 +0000 UTC
      commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
state:
  state: 2
  message: 'Healthy: communicating with pid ''2916'''
  units:
    ? unittype: 0
      unitid: beat/metrics-monitoring-metrics-monitoring-beats
    : state: 4
      message: '[output unit has no config]'
    ? unittype: 1
      unitid: beat/metrics-monitoring
    : state: 4
      message: '[output unit has no config]'
  version_info:
    name: beat-v2-client
    version: 8.6.0
    meta:
      build_time: 2023-01-04 01:30:05 +0000 UTC
      commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
state:
  state: 2
  message: 'Healthy: communicating with pid ''7980'''
  units:
    ? unittype: 0
      unitid: filestream-monitoring-filestream-monitoring-agent
    : state: 4
      message: '[output unit has no config]'
    ? unittype: 1
      unitid: filestream-monitoring
    : state: 4
      message: '[output unit has no config]'
```
Is there any possible workaround for this issue?
We have struggled to reproduce this internally so far. Triggering a restart of the Beats started by the agent can clear it. The simplest way to do this is to change the log level of the agent.
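For a standalone agent, a minimal sketch of what that looks like (assuming the logging level is set in elastic-agent.yml; a Fleet-managed agent exposes the same setting in the agent's Logs tab in Kibana):

```yaml
# Hypothetical excerpt from a standalone elastic-agent.yml.
# Flipping the level (for example info -> debug, then back) causes the agent to
# re-apply its configuration and restart the Beats it manages.
agent.logging.level: debug
```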
Editing any output setting for a Beat right now will trigger a restart of the Beats, so you can also trigger restarts of every agent in a policy by editing the output for that policy. The most benign way to do that is to add or subtract one from the bulk_max_size parameter of the Elasticsearch output. This has to be done as an advanced YAML parameter in Fleet, in this case increasing bulk_max_size from the default 50 to 51:
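A minimal sketch of that advanced YAML setting (only the changed key is shown; the exact location of the advanced YAML box in the Fleet UI may vary by version):

```yaml
# Fleet -> Settings -> Outputs -> edit the Elasticsearch output -> advanced YAML configuration.
# Bumping bulk_max_size by one is a harmless change that still forces the Beats to restart.
bulk_max_size: 51
```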
I wrote a script to continuously update the output configuration of a standalone agent policy every few seconds, and I have been unable to reproduce this at all. In the initial report this affected a Fleet-managed agent using the Logstash output; in particular, that report was affected by a missing backport of elastic/fleet-server#2105 in 8.6.0. It may be that this is an interaction of that bug on the Fleet side with the agent. If this does affect the agent by itself, it appears to be very rare.
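For reference, the output section of a standalone policy that such a script would rewrite looks roughly like this (a sketch with placeholder hosts and credentials, not the exact policy used in the test):

```yaml
# Hypothetical outputs section of a standalone elastic-agent.yml.
# A test script can repeatedly change a value here (for example bulk_max_size)
# to force the Beats to restart and pick up the new output configuration.
outputs:
  default:
    type: elasticsearch
    hosts: ["https://localhost:9200"]
    api_key: "<redacted>"
    bulk_max_size: 51
```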
hey @cmacknz, I can see lots of those errors when using elastic-agent in standalone mode. Elastic Agent in my case is running inside k8s, but I don't think that's relevant. The elastic-agent is also running on the same node where a previous run was correctly stopped. Please let me know if I can be of help. A sample of the error:

```json
{
  "log.level": "error",
  "@timestamp": "2023-01-17T10:00:07.412Z",
  "log.origin": {
    "file.name": "coordinator/coordinator.go",
    "file.line": 833
  },
  "message": "Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->FAILED): [output unit has no config]",
  "component": {
    "id": "beat/metrics-monitoring",
    "state": "HEALTHY"
  },
  "unit": {
    "id": "beat/metrics-monitoring-metrics-monitoring-beats",
    "type": "input",
    "state": "FAILED",
    "old_state": "CONFIGURING"
  },
  "ecs.version": "1.6.0"
}
```

I get the same error when starting elastic-agent in managed mode on the same nodes where I had previous runs of elastic-agent in either managed or standalone mode.
Thanks. I found at least one other case where the agent got into this state shortly after startup when using only the Elasticsearch output, so this isn't related to the output type or Fleet at all. @gsantoro does your agent stay in this state? Does it eventually self-correct after a policy change, output change, or log level change?
hey @cmacknz, I am not sure that's the case. I first noticed this error when running elastic-agent in standalone mode for some testing. After this test, I removed the agent completely and then added it back again after a while (1h maybe), but this time in managed mode. This time it looks like it recovered from the error, since it showed some info logs:

```
2023-01-17T14:24:27.725Z  info  Unit state changed beat/metrics-monitoring (HEALTHY->STOPPING): Restarting
2023-01-17T14:24:27.733Z  info  Component state changed http/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '297498' exited with code '0'
2023-01-17T14:24:27.733Z  info  Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->STOPPED): Suppressing FAILED state due to restart for '297498' exited with code '0'
2023-01-17T14:24:27.734Z  info  Unit state changed http/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '297498' exited with code '0'
2023-01-17T14:24:27.758Z  info  Component state changed beat/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '297499' exited with code '0'
2023-01-17T14:24:27.758Z  info  Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->STOPPED): Suppressing FAILED state due to restart for '297499' exited with code '0'
2023-01-17T14:24:27.758Z  info  Unit state changed beat/metrics-monitoring (STOPPING->STOPPED): Suppressing FAILED state due to restart for '297499' exited with code '0'
```

but a couple of minutes later:

```
2023-01-17T14:29:49.277Z  error  Component state changed kubernetes/metrics-default (HEALTHY->FAILED): Failed: pid '300057' exited with code '-1'
```

All three agents are now in an Unhealthy state. I don't remember having seen this error before. I'm inclined to start with a fresh cluster to see if the error disappears.
I have reproduced this issue with an agent that is running in a k8s cluster. It seems to be happening with the agent metrics component only, and it has persistently been in this state since I upgraded last week. I can "fix" it by disabling agent metrics and then re-enabling them, and the fix seems to stick even after pod restarts or creating new pods. Here's the state being reported to Fleet for the failing component:
I also have a similar problem happening on an agent running on a macbook, but this time with the endpoint component:
@cmacknz I can DM you my diags for the macbook one if that's helpful.
I definitely want the diagnostics for the one affecting endpoint.
DM'd it to you. FYI that k8s agent also went unhealthy again. Here are the logs:
Thanks. Looking at the Endpoint logs I see the following, which looks like a symptom of the exact same problem the Beats are experiencing:

```json
{"@timestamp":"2023-01-18T09:58:10.585763Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":362,"name":"Response.cpp"}}},"message":"Response.cpp:362 Policy action configure_output: failure - Failed to read output configuration","process":{"pid":27888,"thread":{"id":30776130}}}
{"@timestamp":"2023-01-18T09:58:10.585766Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":557,"name":"Comms.cpp"}}},"message":"Comms.cpp:557 No valid comms client configured","process":{"pid":27888,"thread":{"id":30776130}}}
```

If Endpoint is experiencing this problem it must be in the agent somewhere, since it uses a completely different agent client implementation. FYI @ferullo and @softengchick if you have not seen this issue yet. It appears that under some currently unknown circumstances the agent will configure the processes it runs without an output.
I think this is a false alarm for Endpoint: it does configure the output some time later, but it never clears the status. This might be a different bug specific to Endpoint not updating its status.

```json
{"@timestamp":"2023-01-18T09:58:39.772089Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":279,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:279 Parsing checkin v2 message","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:39.775817Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":110,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:110 Successfully parsed checkin v2 input unit (idx: 1)","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:39.775933Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":142,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:142 Successfully parsed checkin v2 output unit (idx: 1)","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:39.775945Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":328,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:328 Checkin v2 policy has changed, building new config","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:40.034328Z","agent":{"id":"b95a2ca7-1550-4621-a19f-9dcccae207fb","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":362,"name":"Response.cpp"}}},"message":"Response.cpp:362 Policy action configure_output: success - Successfully read output configuration","process":{"pid":27888,"thread":{"id":30777498}}}
```

This might be an example of behaviour similar to what we saw in #1960, where Endpoint sets a "DEGRADED" status and then never clears it.
We have a lead on a fix now. We believe this is the same problem first reported in elastic/beats#34137, but our first attempt to fix it only made the problem less likely instead of preventing it completely.
@cmacknz could you clarify how to reconcile your last two comments with the PR that is up now? Based on the PR description it appears that this Agent bug may be affecting Endpoint too, rather than being an Endpoint-specific bug. Is that right?
It can technically affect endpoint as well, but to be affected the endpoint service would have to restart itself, which I'm not sure it does as part of regular operation. The bug here is effectively a race condition, not something that is guaranteed to be hit on each restart. For Beats this problem is being amplified by elastic/beats#34178, where the Beats restart any time a section of the agent policy that applies to that Beat changes. This gives us many rolls of the dice and opportunities to hit this bug, but also to correct it.
It has been hard to assess how widespread this problem truly is, because it has been a struggle to even reproduce it with targeted tests (and it made it past our pre-release testing), but people are clearly hitting it.
I have observed this as well with a standalone configuration running the containerized elastic-agent in EKS. Oddly, it only appears to impact the redis input for me.
This will be fixed in 8.6.1.
When the Beats restart after an output configuration change, we've observed that starting in 8.6 some of them can be left in a failed state with an `output unit has no config` error. Example of the unit state from a recent agent diagnostics bundle:

So far, reports of this have primarily affected the Beats started for use in internal monitoring, but I have not been able to confirm whether this bug exclusively affects the monitoring Beats.
We need to reproduce this and fix it. It appears to be triggered by the Beats restarting when the output changes. A similar problem was observed in elastic/beats#34137, which was fixed in #2036. There must be another problem with unit configurations after a Beat restarts.
Relevant log snippet of the full state transition:
Interestingly, in these logs the non-monitoring Beats did not restart; only the monitoring Beats restarted.