Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beats managed by agent can be left without a configured output #2086

Closed
cmacknz opened this issue Jan 12, 2023 · 19 comments · Fixed by #2138
Closed

Beats managed by agent can be left without a configured output #2086

cmacknz opened this issue Jan 12, 2023 · 19 comments · Fixed by #2138
Assignees
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team v8.6.0

Comments

@cmacknz
Copy link
Member

cmacknz commented Jan 12, 2023

When the Beats restart after an output configuration has changed, we've observed that starting in 8.6 some of them can be left in a failed state with an output unit has no config error. Example of the unit state from a recent agent diagnostics bundle:

  state:
    state: 2
    message: 'Healthy: communicating with pid ''5565'''
    units:
      ? unittype: 0
        unitid: beat/metrics-monitoring-metrics-monitoring-beats
      : state: 4
        message: '[output unit has no config]'
      ? unittype: 1
        unitid: beat/metrics-monitoring
      : state: 4
        message: '[output unit has no config]'
    version_info:
      name: beat-v2-client
      version: 8.6.0
      meta:
        build_time: 2023-01-04 01:30:07 +0000 UTC
        commit: 561a3e1839f1a50ce832e8e114de399b2bee2542

So far reports of this have primarily affected the Beats started for use in internal monitoring, but I have not been able to confirm if this bug exclusively affects the monitoring Beats.

We need to reproduce this and fix it. It appears to be triggered by the Beats restarting when the output changes. A similar problem was observed in elastic/beats#34137 which was fixed in #2036. There must be another problem with unit configurations after a Beat restarts.

Relevant log snippet of the full state transition:

{"log.level":"info","@timestamp":"2023-01-11T19:21:11.997Z","message":"beat is restarting because output changed","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":503,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:11.998Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":835},"message":"Unit state changed beat/metrics-monitoring (HEALTHY->STOPPING): Restarting","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"STOPPING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:11.999Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":835},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->STOPPING): Stopping","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"STOPPING","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:12.004Z","message":"Uptime: 1.228625389s","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"ecs.version":"1.6.0","log.logger":"monitoring","log.origin":{"file.line":196,"file.name":"log/log.go"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:12.004Z","message":"metricbeat stopped.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"service.name":"metricbeat","ecs.version":"1.6.0","log.origin":{"file.line":491,"file.name":"instance/beat.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:12.009Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":837},"message":"Component state changed beat/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '5536' exited with code '0'","component":{"id":"beat/metrics-monitoring","state":"STOPPED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:12.009Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":837},"message":"Unit state changed beat/metrics-monitoring (STOPPING->STOPPED): Suppressing FAILED state due to restart for '5536' exited with code '0'","component":{"id":"beat/metrics-monitoring","state":"STOPPED"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:12.009Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":837},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (STOPPING->STOPPED): Suppressing FAILED state due to restart for '5536' exited with code '0'","component":{"id":"beat/metrics-monitoring","state":"STOPPED"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.011Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Spawned new component beat/metrics-monitoring: Starting: spawned pid '5565'","component":{"id":"beat/metrics-monitoring","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.011Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Spawned new unit beat/metrics-monitoring: Starting: spawned pid '5565'","component":{"id":"beat/metrics-monitoring","state":"STARTING"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.011Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":825},"message":"Spawned new unit beat/metrics-monitoring-metrics-monitoring-beats: Starting: spawned pid '5565'","component":{"id":"beat/metrics-monitoring","state":"STARTING"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.180Z","message":"Output reload is enabled, the beat will restart as needed on change of output config","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log.logger":"centralmgmt","log.origin":{"file.line":119,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.180Z","message":"running under elastic-agent, per-beat lockfiles disabled","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log.origin":{"file.line":401,"file.name":"instance/beat.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.184Z","message":"Setup Beat: metricbeat; Version: 8.6.0","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"service.name":"metricbeat","ecs.version":"1.6.0","log.origin":{"file.line":296,"file.name":"instance/beat.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.188Z","message":"Output is configured through Central Management","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"ecs.version":"1.6.0","log.origin":{"file.line":325,"file.name":"instance/beat.go"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.224Z","message":"metricbeat start running.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log.origin":{"file.line":486,"file.name":"instance/beat.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.231Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Component state changed beat/metrics-monitoring (STARTING->HEALTHY): Healthy: communicating with pid '5565'","component":{"id":"beat/metrics-monitoring","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.294Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (STARTING->CONFIGURING): Configuring","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"CONFIGURING","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-11T19:21:13.431Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed beat/metrics-monitoring (STARTING->FAILED): [output unit has no config]","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-11T19:21:13.431Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->FAILED): [output unit has no config]","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"FAILED","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-11T19:21:13.499Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (FAILED->CONFIGURING): Configuring","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"CONFIGURING","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-11T19:21:13.601Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->FAILED): [output unit has no config]","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"FAILED","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}

Interestingly in these logs the non-monitoring Beats did not restart, only the monitoring Beats restarted.

@cmacknz cmacknz added bug Something isn't working Team:Elastic-Agent Label for the Agent team v8.6.0 labels Jan 12, 2023
@cmacknz
Copy link
Member Author

cmacknz commented Jan 16, 2023

I've now seen this apply to more than just the monitoring Beats, example from some recent diagnostics with several Beats in this state.

  state:
    state: 2
    message: 'Healthy: communicating with pid ''7640'''
    units:
      ? unittype: 1
        unitid: log-default
      : state: 4
        message: '[output unit has no config]'
      ? unittype: 0
        unitid: log-default-logfile-system-b856919c-ade6-4a47-9e0c-28d4822b7ada
      : state: 4
        message: '[output unit has no config]'
    version_info:
      name: beat-v2-client
      version: 8.6.0
      meta:
        build_time: 2023-01-04 01:28:12 +0000 UTC
        commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
  state:
    state: 2
    message: 'Healthy: communicating with pid ''5984'''
    units:
      ? unittype: 0
        unitid: winlog-default-winlog-system-b856919c-ade6-4a47-9e0c-28d4822b7ada
      : state: 4
        message: '[output unit has no config]'
      ? unittype: 1
        unitid: winlog-default
      : state: 4
        message: '[output unit has no config]'
    version_info:
      name: beat-v2-client
      version: 8.6.0
      meta:
        build_time: 2023-01-04 01:28:12 +0000 UTC
        commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
  state:
    state: 2
    message: 'Healthy: communicating with pid ''2916'''
    units:
      ? unittype: 0
        unitid: beat/metrics-monitoring-metrics-monitoring-beats
      : state: 4
        message: '[output unit has no config]'
      ? unittype: 1
        unitid: beat/metrics-monitoring
      : state: 4
        message: '[output unit has no config]'
    version_info:
      name: beat-v2-client
      version: 8.6.0
      meta:
        build_time: 2023-01-04 01:30:05 +0000 UTC
        commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
  state:
    state: 2
    message: 'Healthy: communicating with pid ''7980'''
    units:
      ? unittype: 0
        unitid: filestream-monitoring-filestream-monitoring-agent
      : state: 4
        message: '[output unit has no config]'
      ? unittype: 1
        unitid: filestream-monitoring
      : state: 4
        message: '[output unit has no config]'

@gui-elastic
Copy link
Member

Is there any possible workaround for this issue?

@cmacknz
Copy link
Member Author

cmacknz commented Jan 16, 2023

We have struggled to reproduce this internally so far. Triggering a restart of the Beats started by agent can clear it. The simplest way to do this is to change the log level of the agent.

@cmacknz
Copy link
Member Author

cmacknz commented Jan 16, 2023

Editing any output setting for a Beat right now will trigger a restart of the Beats, so you can also trigger restarts of every agent in a policy by editing the output for that policy.

The most benign way to do that is to add or subtract one to the bulk_max_size parameter of the Elasticsearch output. This has to be done as an advanced YAML parameter in Fleet, in this case increasing the bulk_max_size from the default 50 to 51:

Screen Shot 2023-01-16 at 1 55 10 PM

@cmacknz
Copy link
Member Author

cmacknz commented Jan 16, 2023

I wrote a script to update the output configuration of a standalone agent policy every few seconds continuously and I have been unable to reproduce this at all.

In the initial report of this this affected a Fleet managed agent where the Logstash output was used, in particular that report was affected by a missing backport for elastic/fleet-server#2105 in 8.6.0.

It may be possible that this is an interaction of that bug on the Fleet side with the agent. If this does affect the agent by itself it appears to be very rare.

@gsantoro
Copy link
Contributor

gsantoro commented Jan 17, 2023

hey @cmacknz I can see lots of those errors when using elastic-agent in standalone mode. Elastic agent in my case is running inside k8s but I don't think that's relevant. I think also elastic-agent is running on the same node where a previous run was correctly stopped. Please let me know if I can be of help

A sample of the error

{
    "log.level": "error",
    "@timestamp": "2023-01-17T10:00:07.412Z",
    "log.origin": {
        "file.name": "coordinator/coordinator.go",
        "file.line": 833
    },
    "message": "Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->FAILED): [output unit has no config]",
    "component": {
        "id": "beat/metrics-monitoring",
        "state": "HEALTHY"
    },
    "unit": {
        "id": "beat/metrics-monitoring-metrics-monitoring-beats",
        "type": "input",
        "state": "FAILED",
        "old_state": "CONFIGURING"
    },
    "ecs.version": "1.6.0"
}

I get the same error when starting elastic-agent in managed mode on the same nodes where I had previous runs of elastic-agent either in managed or standalone mode.

@cmacknz
Copy link
Member Author

cmacknz commented Jan 17, 2023

Thanks. I found at least one other case where the agent got into this state shortly after startup when using only the Elasticsearch output so this isn't related to the output type or Fleet at all.

@gsantoro does your agent stay in this state? Does it eventually self correct after a policy change/output change/log level change?

@gsantoro
Copy link
Contributor

hey @cmacknz , I am not sure that's the case.

I noticed first this error when running elastic-agent in standalone mode for some testing. After this test, I removed completely the agent and then added back again after a while (1h maybe) but this time in managed mode.

this time it looks like it recovered from the error since it should some info logs

│ 2023-01-17T14:24:27.725Z │ info      │ Unit state changed beat/metrics-monitoring (HEALTHY->STOPPING): Restarting                                                                                                  │
│ 2023-01-17T14:24:27.733Z │ info      │ Component state changed http/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '297498' exited with code '0'                               │
│ 2023-01-17T14:24:27.733Z │ info      │ Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->STOPPED): Suppressing FAILED state due to restart for '297498' exited with code '0'       │
│ 2023-01-17T14:24:27.734Z │ info      │ Unit state changed http/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '297498' exited with code '0'                                    │
│ 2023-01-17T14:24:27.758Z │ info      │ Component state changed beat/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '297499' exited with code '0'                               │
│ 2023-01-17T14:24:27.758Z │ info      │ Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->STOPPED): Suppressing FAILED state due to restart for '297499' exited with code '0'       │
│ 2023-01-17T14:24:27.758Z │ info      │ Unit state changed beat/metrics-monitoring (STOPPING->STOPPED): Suppressing FAILED state due to restart for '297499' exited with code '0'

but a couple of minutes later

│ 2023-01-17T14:29:49.277Z │ error     │ Component state changed kubernetes/metrics-default (HEALTHY->FAILED): Failed: pid '300057' exited with code '-1'

all the three agents are now in Unhealthy state.

I don't remember to have seen this error before. I'm inclined to start with a fresh cluster to see if the error disappear.

@joshdover
Copy link
Contributor

joshdover commented Jan 18, 2023

I have reproduced this issue with an agent that is running in a k8s cluster. Seems to be happening with the agent metrics component only, and it's persistently been in this state since I upgraded last week.

I can "fix" it by disable agent metrics and then re-enabling, and it seems to stick even after pod restarts or creating new pods.

Here's the state being reported to Fleet for the failing component:

{
	"id": "beat/metrics-monitoring",
	"units": [
		{
			"id": "beat/metrics-monitoring-metrics-monitoring-beats",
			"type": "input",
			"message": "[output unit has no config]",
			"status": "FAILED"
		},
		{
			"id": "beat/metrics-monitoring",
			"type": "output",
			"message": "[output unit has no config]",
			"status": "FAILED"
		}
	],
	"type": "beat/metrics",
	"message": "Healthy: communicating with pid '3598748'",
	"status": "HEALTHY"
}

I also have a similar problem happening on an agent running on a macbook, but this time with the endpoint component:

{
	"2": {
		"id": "endpoint-default",
		"units": [
			{
				"payload": {
					"error": {
						"code": -273,
						"message": "Unable to connect to output server"
					}
				},
				"id": "endpoint-default-7cb20141-591a-408c-8fbd-931cac679147",
				"type": "input",
				"message": "Protecting with policy {7cb20141-591a-408c-8fbd-931cac679147}",
				"status": "DEGRADED"
			},
			{
				"payload": {
					"error": {
						"code": -273,
						"message": "Unable to connect to output server"
					}
				},
				"id": "endpoint-default",
				"type": "output",
				"message": "Protecting with policy {7cb20141-591a-408c-8fbd-931cac679147}",
				"status": "DEGRADED"
			}
		],
		"type": "endpoint",
		"message": "Healthy: communicating with endpoint service",
		"status": "HEALTHY"
	}
}

@cmacknz I can DM you my diags for the macbook one if that's helpful.

@cmacknz
Copy link
Member Author

cmacknz commented Jan 18, 2023

I definitely want the diagnostics for the one affecting endpoint.

@joshdover
Copy link
Contributor

DM'd it to you. FYI that k8s agent also went unhealthy again. Here's the logs:

11:33:12.833	elastic_agent	[elastic_agent][info] Updating running component model
11:33:13.032	elastic_agent	[elastic_agent][info] Unit state changed beat/metrics-monitoring (STARTING->HEALTHY): Healthy
11:33:13.032	elastic_agent	[elastic_agent][info] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (STARTING->CONFIGURING): Configuring
11:33:13.196	elastic_agent	[elastic_agent][info] Unit state changed beat/metrics-monitoring (HEALTHY->STOPPING): Restarting
11:33:16.037	elastic_agent	[elastic_agent][info] Component state changed beat/metrics-monitoring (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '2628217' exited with code '0'
11:33:16.037	elastic_agent	[elastic_agent][info] Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->STOPPED): Suppressing FAILED state due to restart for '2628217' exited with code '0'
11:33:16.037	elastic_agent	[elastic_agent][info] Unit state changed beat/metrics-monitoring (STOPPING->STOPPED): Suppressing FAILED state due to restart for '2628217' exited with code '0'
11:33:17.041	elastic_agent	[elastic_agent][info] Spawned new component beat/metrics-monitoring: Starting: spawned pid '2630916'
11:33:17.041	elastic_agent	[elastic_agent][info] Spawned new unit beat/metrics-monitoring-metrics-monitoring-beats: Starting: spawned pid '2630916'
11:33:17.042	elastic_agent	[elastic_agent][info] Spawned new unit beat/metrics-monitoring: Starting: spawned pid '2630916'
11:33:17.450	elastic_agent	[elastic_agent][info] Component state changed beat/metrics-monitoring (STARTING->HEALTHY): Healthy: communicating with pid '2630916'
11:33:24.338	elastic_agent	[elastic_agent][error] Unit state changed http/metrics-monitoring (STARTING->FAILED): [output unit has no config]
11:33:24.340	elastic_agent	[elastic_agent][info] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (STARTING->CONFIGURING): Configuring
11:33:24.390	elastic_agent	[elastic_agent][error] Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->FAILED): [output unit has no config]

@cmacknz
Copy link
Member Author

cmacknz commented Jan 18, 2023

Thanks looking at the Endpoint logs I see the following which looks like a symptom of the exact same problem the Beats are experiencing.

{"@timestamp":"2023-01-18T09:58:10.585763Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":362,"name":"Response.cpp"}}},"message":"Response.cpp:362 Policy action configure_output: failure - Failed to read output configuration","process":{"pid":27888,"thread":{"id":30776130}}}
{"@timestamp":"2023-01-18T09:58:10.585766Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":557,"name":"Comms.cpp"}}},"message":"Comms.cpp:557 No valid comms client configured","process":{"pid":27888,"thread":{"id":30776130}}}

If endpoint is experiencing this problem it must be in the agent somewhere since it uses a completely different agent client implementation.

FYI @ferullo and @softengchick if you have not seen this issue yet. It appears under some currently unknown circumstances the agent will configure the processes it runs without an output.

@cmacknz cmacknz changed the title Some Beats can be left without a configured output after a restart Sub-processes can be left without a configured output Jan 18, 2023
@cmacknz cmacknz changed the title Sub-processes can be left without a configured output Processes managed by agent can be left without a configured output Jan 18, 2023
@cmacknz
Copy link
Member Author

cmacknz commented Jan 18, 2023

I think this is a false alarm for Endpoint, it does configure the output sometime later but it never clears the status. This might be a different bug specific to Endpoint not updating its status.

{"@timestamp":"2023-01-18T09:58:39.772089Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":279,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:279 Parsing checkin v2 message","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:39.775817Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":110,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:110 Successfully parsed checkin v2 input unit (idx: 1)","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:39.775933Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":142,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:142 Successfully parsed checkin v2 output unit (idx: 1)","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:39.775945Z","agent":{"id":"00000000-0000-0000-0000-000000000000","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":328,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:328 Checkin v2 policy has changed, building new config","process":{"pid":27888,"thread":{"id":30777498}}}
{"@timestamp":"2023-01-18T09:58:40.034328Z","agent":{"id":"b95a2ca7-1550-4621-a19f-9dcccae207fb","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":362,"name":"Response.cpp"}}},"message":"Response.cpp:362 Policy action configure_output: success - Successfully read output configuration","process":{"pid":27888,"thread":{"id":30777498}}}

This might be an example of similar behaviour we saw in #1960 where endpoint sets a "DEGRADED" status and then never clears it.

@cmacknz cmacknz changed the title Processes managed by agent can be left without a configured output Beats managed by agent can be left without a configured output Jan 18, 2023
@cmacknz
Copy link
Member Author

cmacknz commented Jan 18, 2023

We have a lead on a fix now. We believe this is the same problem first reported in elastic/beats#34137, but our first attempt to fix it only made the problem less likely instead of preventing it completely.

@joshdover
Copy link
Contributor

@cmacknz could you clarify how to resolve your last two comments and the PR that is up now? Based on the PR description it appears that this Agent bug may be affecting Endpoint too and not be an Endpoint-specific bug. Is that right?

@cmacknz
Copy link
Member Author

cmacknz commented Jan 19, 2023

It can technically affect endpoint as well, but to be affected the endpoint service would have to restart itself which I'm not sure it does as part of regular operation. The bug here is effectively a race condition and not something that is guaranteed to be hit on each restart.

For Beats this problem is being amplified by elastic/beats#34178, where the Beats are restarting any time a section of the agent policy that applies to that Beat changes. This is giving us many rolls of the dice and opportunities to hit this bug, but also to correct it.

@cmacknz
Copy link
Member Author

cmacknz commented Jan 19, 2023

It has been hard to assess how widespread this problem truly is, because it has been a struggle to even reproduce it with targeted tests (and it made it past our pre-release testing) but people are clearly hitting it.

@renzedj
Copy link

renzedj commented Jan 19, 2023

I have observed this as well with a standalone configuration running the containerized elastic-agent in EKS. Oddly, it only appears to impact the redis input for me.

@cmacknz
Copy link
Member Author

cmacknz commented Jan 24, 2023

This will be fixed in 8.6.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team v8.6.0
Projects
None yet
6 participants