
Endpoint observed as degraded when the agent service is separately restarted #1960

Closed
ghost opened this issue Dec 16, 2022 · 9 comments
Labels
8.7-candidate bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent Label for the Agent team

Comments


ghost commented Dec 16, 2022

Kibana version: 8.6 BC 7 Kibana Prod environment

Host OS and Browser version: Windows, Linux, All

Describe the bug:
The agent goes to an unhealthy state when the agent service is restarted.

Build Details:

VERSION: 8.6.0-BC7
BUILD: 58773
COMMIT: 511e7c66ad8b290feb0af3ea262ab3fb08cc87be
ARTIFACTS LINK: https://staging.elastic.co/8.6.0-75d87829/summary-8.6.0.html

Preconditions:

  • 8.6 BC7 Kibana Prod environment should be available.

Steps to reproduce:

  • Navigate to the Fleet tab.
  • Install an agent.
  • Stop the Agent from Services and observe that the agent goes offline.
  • Start the Agent from Services again.
  • Observe that the agent goes to an unhealthy state after the restart.

Expected Result:
The agent should return to a healthy state after restarting, not remain unhealthy.

Screenshots :
Screenshot (1133)
Screenshot (1134)
Screenshot (1135)

Logs:
Linux:
elastic-agent-diagnostics-2022-12-16T08-58-58Z-00.zip
Windows:
elastic-agent-diagnostics-2022-12-16T09-09-37Z-00.zip

@ghost ghost added bug Something isn't working Team:Elastic-Agent Label for the Agent team impact:high Short-term priority; add to current release, or definitely next. labels Dec 16, 2022

ghost commented Dec 16, 2022

@dikshachauhan-qasource Please review.

@dikshachauhan-qasource

Secondary Review is done


cmacknz commented Dec 19, 2022

I see the endpoint component reporting itself as HEALTHY in the Ubuntu logs, but both the endpoint input and output are reporting as DEGRADED:

{"log.level":"info","@timestamp":"2022-12-16T06:50:08.329Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":325},"message":"Existing component state changed","component":{"id":"endpoint-default","state":"HEALTHY","message":"Healthy: communicating with endpoint service","inputs":[{"id":"endpoint-default-8c3407b7-1909-4ff2-a7f5-f08ea2255882","state":"DEGRADED","message":"Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}"}],"output":{"id":"endpoint-default","state":"DEGRADED","message":"Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}"}},"ecs.version":"1.6.0"}
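This component/unit state mismatch is easy to surface mechanically when triaging a diagnostics bundle. A minimal sketch (not part of the agent codebase) that scans such NDJSON log lines and flags any unit whose state differs from its parent component's state, using the field names visible in the entry above:

```python
import json

def find_degraded_units(ndjson_lines):
    """Scan Elastic Agent NDJSON log lines for "Existing component state
    changed" entries and return (component id, unit id, unit state,
    unit message) for every unit whose state differs from the
    component's own state."""
    mismatches = []
    for line in ndjson_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines in the bundle
        if entry.get("message") != "Existing component state changed":
            continue
        comp = entry.get("component") or {}
        # Units are split across "inputs" (a list) and a single "output".
        units = list(comp.get("inputs") or [])
        if comp.get("output"):
            units.append(comp["output"])
        for unit in units:
            if unit.get("state") != comp.get("state"):
                mismatches.append((comp.get("id"), unit.get("id"),
                                   unit.get("state"), unit.get("message")))
    return mismatches
```

Run against the log line above, this reports two mismatches: the input unit and the output unit, both DEGRADED under a HEALTHY component.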


cmacknz commented Dec 19, 2022

The agent shut down for the restart at 2022-12-16T06:44:03.947Z:

{"log.level":"info","@timestamp":"2022-12-16T06:44:03.947Z","log.origin":{"file.name":"cmd/run.go","file.line":242},"message":"Shutting down completed.","ecs.version":"1.6.0"}

Endpoint starts looking for agent connectivity shortly after:

{"@timestamp":"2022-12-16T06:44:11.690675625Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":1963,"name":"Config.cpp"}}},"message":"Config.cpp:1963 Checking for agent connectivity","process":{"pid":2174,"thread":{"id":2224}}}

Endpoint eventually decides it can't connect to the agent:

{"@timestamp":"2022-12-16T06:45:17.813066648Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":91,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:91 Failed to find connection to validate. Is Agent listening on 127.0.0.1:6788?","process":{"pid":2174,"thread":{"id":2212}}}
{"@timestamp":"2022-12-16T06:45:17.813115883Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":123,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:123 Agent process is not root/admin or validation failed, disconnecting","process":{"pid":2174,"thread":{"id":2212}}}
{"@timestamp":"2022-12-16T06:45:17.813146717Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"warning","origin":{"file":{"line":180,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:180 Failed to established stage 1 connection to agent","process":{"pid":2174,"thread":{"id":2212}}}

Then, a few minutes later, it succeeds, just before the 2022-12-16T06:50:08.329Z timestamp above where the agent decides the endpoint component is healthy but the units are degraded:

{"@timestamp":"2022-12-16T06:49:48.324309668Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":727,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:727 Agent connection established.","process":{"pid":2174,"thread":{"id":2212}}}
{"@timestamp":"2022-12-16T06:49:48.325473344Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":485,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:485 Successfully created reader/writer for CheckinV2Api","process":{"pid":2174,"thread":{"id":2558}}}
{"@timestamp":"2022-12-16T06:49:48.325754323Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":516,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:516 Waiting for CheckinExpected message via CheckinV2 api","process":{"pid":2174,"thread":{"id":2558}}}
{"@timestamp":"2022-12-16T06:49:48.329795433Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":522,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:522 Received CheckinExpected message from agent","process":{"pid":2174,"thread":{"id":2558}}}
{"@timestamp":"2022-12-16T06:49:48.329968084Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":279,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:279 Parsing checkin v2 message","process":{"pid":2174,"thread":{"id":2558}}}
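For a first triage of the "Is Agent listening on 127.0.0.1:6788?" failure above, a bare TCP probe against that local port can rule basic connectivity in or out. A hypothetical helper (the port number comes from the Endpoint log; note that Endpoint also validates that the listening process is the agent running as root/admin, which a plain TCP probe cannot check):

```python
import socket

def agent_port_open(host="127.0.0.1", port=6788, timeout=1.0):
    """Return True if something is accepting TCP connections on the
    agent's local control port. Triage only: a True result does not
    prove the listener is the agent, nor that Endpoint's stage 1
    validation would pass."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```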


cmacknz commented Dec 19, 2022

If I look at the current state of the endpoint service in the agent state.yml from the diagnostics bundle, it has the unit information that maps to the last received UnitObserved message:

message UnitObserved {
    // ...
    // Current state of the unit.
    State state = 4;
    // Human readable message for the state of the unit.
    // Exposed to users to provide more detail about the state for this single unit.
    string message = 5;
    // Payload for the current state.
    google.protobuf.Struct payload = 6;
}

In this case I see that endpoint seems to be reporting its units as DEGRADED, but with a message of Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882} and a payload with a zero error code.

  state:
    state: 2 # HEALTHY
    message: 'Healthy: communicating with endpoint service'
    units:
      ? unittype: 0 # INPUT
        unitid: endpoint-default-8c3407b7-1909-4ff2-a7f5-f08ea2255882
      : state: 3 # DEGRADED
        message: Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}
        payload:
          error:
            code: 0
            message: Success
      ? unittype: 1 # OUTPUT
        unitid: endpoint-default
      : state: 3 # DEGRADED
        message: Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}
        payload:
          error:
            code: 0
            message: Success

@brian-mckinney Endpoint seems to be reporting its agent input and output unit state as DEGRADED but without an error message. Any ideas on what might be causing this?

In this case the agent service restarted, but endpoint did not. My initial suspicion is that the degraded status was set when the agent restarted but then did not clear when the agent started running again.
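The stale-status pattern described here, DEGRADED unit state alongside a zero "Success" error code, can be flagged mechanically from the state.yml data. A minimal sketch: the numeric state codes come from the comments in the snippet above, and the dict layout mirrors the YAML rather than parsing it (PyYAML is not assumed):

```python
# State codes as annotated in the state.yml snippet above.
HEALTHY, DEGRADED = 2, 3

def inconsistent_units(units):
    """Return the ids of units that report DEGRADED while their payload
    carries a zero ("Success") error code -- i.e. a degraded state with
    no actual error attached, suggesting a stale status."""
    return [
        u["unitid"]
        for u in units
        if u["state"] == DEGRADED
        and u.get("payload", {}).get("error", {}).get("code", 0) == 0
    ]

# Unit states transcribed from the state.yml snippet above.
units = [
    {"unitid": "endpoint-default-8c3407b7-1909-4ff2-a7f5-f08ea2255882",
     "state": DEGRADED, "payload": {"error": {"code": 0, "message": "Success"}}},
    {"unitid": "endpoint-default",
     "state": DEGRADED, "payload": {"error": {"code": 0, "message": "Success"}}},
]
```

Both units in the bundle trip this check, consistent with the suspicion that the degraded status was set during the restart and never cleared.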

@cmacknz cmacknz changed the title Agent goes to unhealthy state on restarting the agent. Endpoint observed as degraded when the agent service is separately restarted Dec 19, 2022

ferullo commented Dec 22, 2022

In this case the agent service restarted, but endpoint did not. My initial suspicion is that the degraded status was set when the agent restarted but then did not clear when the agent started running again.

I agree. @samratbhadra-qasource can you try some other things to see what happens?

  1. After you trigger this issue can you change Endpoint's policy and see if the issue clears and Agent is now HEALTHY?
  2. Also try to reproduce this but after installing Agent wait a few minutes for Agent and Endpoint to successfully apply the policy before stopping Agent. I think Agent will remain HEALTHY.

The results from those tests will help confirm my suspicion of what's happening.

I don't think this is a new issue or related to v2. Endpoint has always required an active connection to Agent when applying policy. Without an active connection, Endpoint reports failure since Endpoint can't be sure it's doing the right thing. That "Needs attention" comes from an Endpoint document it writes to Elasticsearch directly.

Assuming this is long standing behavior, I think it's best to leave things as-is for 8.6.0 and then consider a possible change without rushing it.

@amolnater-qasource

Hi Team,
We have revalidated this issue on the latest 8.6 BC8 Kibana cloud environment and found it working fine now.

Observations:

  • Endpoint remains Healthy when the agent service is restarted.
  • However, the agent temporarily shows as Unhealthy under the Agents tab after the restart.

Build details:
BUILD: 58807
COMMIT: 335b86a633af9ce1d9c8a92187ed1c8697ab139f

Screenshots:
Screenshots 8-13
(After some time)
Screenshot 14

Agent Logs:
[Endpoint]elastic-agent-diagnostics-2022-12-23T10-53-48Z-00.zip

Please let us know if anything else is required from our end.

Thanks


jlind23 commented Dec 23, 2022

Thanks Amol! Closing this issue then.

@dikshachauhan-qasource

Bug Conversion

A test case is not required, as this particular checkpoint is already covered in the following testcase:

Thanks
