
Endpoint observed as degraded when the agent service is separately restarted #1960

Closed
ghost opened this issue Dec 16, 2022 · 9 comments
Labels
8.7-candidate bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent Label for the Agent team

Comments


ghost commented Dec 16, 2022

Kibana version: 8.6 BC 7 Kibana Prod environment

Host OS and Browser version: Windows, Linux, All

Describe the bug:
The agent goes to an unhealthy state when the agent service is restarted.

Build Details:

VERSION: 8.6.0-BC7
BUILD: 58773
COMMIT: 511e7c66ad8b290feb0af3ea262ab3fb08cc87be
ARTIFACTS LINK: https://staging.elastic.co/8.6.0-75d87829/summary-8.6.0.html

Preconditions:

  • 8.6 BC7 Kibana Prod environment should be available.

Steps to reproduce:

  • Navigate to the Fleet tab.
  • Install an agent.
  • Stop the Agent from Services and observe that the agent goes offline.
  • Start the Agent from Services again.
  • Observe that the agent goes to an unhealthy state after the restart.

Expected Result:
The agent should return to a healthy state after restarting, not remain unhealthy.

Screenshots :
Screenshot (1133)
Screenshot (1134)
Screenshot (1135)

Logs:
Linux:
elastic-agent-diagnostics-2022-12-16T08-58-58Z-00.zip
Windows:
elastic-agent-diagnostics-2022-12-16T09-09-37Z-00.zip

@ghost ghost added bug Something isn't working Team:Elastic-Agent Label for the Agent team impact:high Short-term priority; add to current release, or definitely next. labels Dec 16, 2022

ghost commented Dec 16, 2022

@dikshachauhan-qasource Please review.

@dikshachauhan-qasource

Secondary Review is done


cmacknz commented Dec 19, 2022

I see the endpoint component reporting itself as HEALTHY in the Ubuntu logs, but both the endpoint input and output are reporting as DEGRADED:

{"log.level":"info","@timestamp":"2022-12-16T06:50:08.329Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":325},"message":"Existing component state changed","component":{"id":"endpoint-default","state":"HEALTHY","message":"Healthy: communicating with endpoint service","inputs":[{"id":"endpoint-default-8c3407b7-1909-4ff2-a7f5-f08ea2255882","state":"DEGRADED","message":"Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}"}],"output":{"id":"endpoint-default","state":"DEGRADED","message":"Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}"}},"ecs.version":"1.6.0"}
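This component/unit state mismatch is easy to surface mechanically when triaging a diagnostics bundle. A minimal sketch (not part of the agent codebase) that scans such NDJSON log lines and flags any unit whose state differs from its parent component's state, using the field names visible in the entry above:

```python
import json

def find_degraded_units(ndjson_lines):
    """Scan Elastic Agent NDJSON log lines for "Existing component state
    changed" entries and return (component id, unit id, unit state,
    unit message) for every unit whose state differs from the
    component's own state."""
    mismatches = []
    for line in ndjson_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines in the bundle
        if entry.get("message") != "Existing component state changed":
            continue
        comp = entry.get("component") or {}
        # Units are split across "inputs" (a list) and a single "output".
        units = list(comp.get("inputs") or [])
        if comp.get("output"):
            units.append(comp["output"])
        for unit in units:
            if unit.get("state") != comp.get("state"):
                mismatches.append((comp.get("id"), unit.get("id"),
                                   unit.get("state"), unit.get("message")))
    return mismatches
```

Run against the log line above, this reports two mismatches: the input unit and the output unit, both DEGRADED under a HEALTHY component.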


cmacknz commented Dec 19, 2022

The agent shut down for the restart at 2022-12-16T06:44:03.947Z:

{"log.level":"info","@timestamp":"2022-12-16T06:44:03.947Z","log.origin":{"file.name":"cmd/run.go","file.line":242},"message":"Shutting down completed.","ecs.version":"1.6.0"}

Endpoint starts looking for agent connectivity shortly after:

{"@timestamp":"2022-12-16T06:44:11.690675625Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":1963,"name":"Config.cpp"}}},"message":"Config.cpp:1963 Checking for agent connectivity","process":{"pid":2174,"thread":{"id":2224}}}

Endpoint eventually decides it can't connect to the agent:

{"@timestamp":"2022-12-16T06:45:17.813066648Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":91,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:91 Failed to find connection to validate. Is Agent listening on 127.0.0.1:6788?","process":{"pid":2174,"thread":{"id":2212}}}
{"@timestamp":"2022-12-16T06:45:17.813115883Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":123,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:123 Agent process is not root/admin or validation failed, disconnecting","process":{"pid":2174,"thread":{"id":2212}}}
{"@timestamp":"2022-12-16T06:45:17.813146717Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"warning","origin":{"file":{"line":180,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:180 Failed to established stage 1 connection to agent","process":{"pid":2174,"thread":{"id":2212}}}

Then, a few minutes later, it succeeds, just before the 2022-12-16T06:50:08.329Z timestamp above where the agent decides the endpoint component is healthy but the units are degraded:

{"@timestamp":"2022-12-16T06:49:48.324309668Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":727,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:727 Agent connection established.","process":{"pid":2174,"thread":{"id":2212}}}
{"@timestamp":"2022-12-16T06:49:48.325473344Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":485,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:485 Successfully created reader/writer for CheckinV2Api","process":{"pid":2174,"thread":{"id":2558}}}
{"@timestamp":"2022-12-16T06:49:48.325754323Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":516,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:516 Waiting for CheckinExpected message via CheckinV2 api","process":{"pid":2174,"thread":{"id":2558}}}
{"@timestamp":"2022-12-16T06:49:48.329795433Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":522,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:522 Received CheckinExpected message from agent","process":{"pid":2174,"thread":{"id":2558}}}
{"@timestamp":"2022-12-16T06:49:48.329968084Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":279,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:279 Parsing checkin v2 message","process":{"pid":2174,"thread":{"id":2558}}}
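For a first triage of the "Is Agent listening on 127.0.0.1:6788?" failure above, a bare TCP probe against that local port can rule basic connectivity in or out. A hypothetical helper (the port number comes from the Endpoint log; note that Endpoint also validates that the listening process is the agent running as root/admin, which a plain TCP probe cannot check):

```python
import socket

def agent_port_open(host="127.0.0.1", port=6788, timeout=1.0):
    """Return True if something is accepting TCP connections on the
    agent's local control port. Triage only: a True result does not
    prove the listener is the agent, nor that Endpoint's stage 1
    validation would pass."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```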


cmacknz commented Dec 19, 2022

If I look at the current state of the endpoint service in the agent state.yml from the diagnostics bundle, it has the unit information that maps to the last received UnitObserved message:

message UnitObserved {
    // ...
    // Current state of the unit.
    State state = 4;
    // Human readable message for the state of the unit.
    // Exposed to users to provide more detail about the state for this single unit.
    string message = 5;
    // Payload for the current state.
    google.protobuf.Struct payload = 6;
}

In this case I see that endpoint seems to be reporting its units as DEGRADED, but with a message of Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882} and a payload with a zero error code.

  state:
    state: 2 # HEALTHY
    message: 'Healthy: communicating with endpoint service'
    units:
      ? unittype: 0 # INPUT
        unitid: endpoint-default-8c3407b7-1909-4ff2-a7f5-f08ea2255882
      : state: 3 # DEGRADED
        message: Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}
        payload:
          error:
            code: 0
            message: Success
      ? unittype: 1 # OUTPUT
        unitid: endpoint-default
      : state: 3 # DEGRADED
        message: Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}
        payload:
          error:
            code: 0
            message: Success

@brian-mckinney Endpoint seems to be reporting its agent input and output unit state as DEGRADED but without an error message. Any ideas on what might be causing this?

In this case the agent service restarted, but endpoint did not. My initial suspicion is that the degraded status was set when the agent restarted but then did not clear when the agent started running again.
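The stale-status pattern described here, DEGRADED unit state alongside a zero "Success" error code, can be flagged mechanically from the state.yml data. A minimal sketch: the numeric state codes come from the comments in the snippet above, and the dict layout mirrors the YAML rather than parsing it (PyYAML is not assumed):

```python
# State codes as annotated in the state.yml snippet above.
HEALTHY, DEGRADED = 2, 3

def inconsistent_units(units):
    """Return the ids of units that report DEGRADED while their payload
    carries a zero ("Success") error code -- i.e. a degraded state with
    no actual error attached, suggesting a stale status."""
    return [
        u["unitid"]
        for u in units
        if u["state"] == DEGRADED
        and u.get("payload", {}).get("error", {}).get("code", 0) == 0
    ]

# Unit states transcribed from the state.yml snippet above.
units = [
    {"unitid": "endpoint-default-8c3407b7-1909-4ff2-a7f5-f08ea2255882",
     "state": DEGRADED, "payload": {"error": {"code": 0, "message": "Success"}}},
    {"unitid": "endpoint-default",
     "state": DEGRADED, "payload": {"error": {"code": 0, "message": "Success"}}},
]
```

Both units in the bundle trip this check, consistent with the suspicion that the degraded status was set during the restart and never cleared.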

@cmacknz cmacknz changed the title Agent goes to unhealthy state on restarting the agent. Endpoint observed as degraded when the agent service is separately restarted Dec 19, 2022

ferullo commented Dec 22, 2022

In this case the agent service restarted, but endpoint did not. My initial suspicion is that the degraded status was set when the agent restarted but then did not clear when the agent started running again.

I agree. @samratbhadra-qasource can you try some other things to see what happens?

  1. After you trigger this issue can you change Endpoint's policy and see if the issue clears and Agent is now HEALTHY?
  2. Also try to reproduce this but after installing Agent wait a few minutes for Agent and Endpoint to successfully apply the policy before stopping Agent. I think Agent will remain HEALTHY.

The results from those tests will help confirm my suspicion of what's happening.

I don't think this is a new issue or related to v2. Endpoint has always required an active connection to Agent when applying policy. Without an active connection, Endpoint reports failure since Endpoint can't be sure it's doing the right thing. That "Needs attention" comes from an Endpoint document it writes to Elasticsearch directly.

Assuming this is long standing behavior, I think it's best to leave things as-is for 8.6.0 and then consider a possible change without rushing it.

@amolnater-qasource

Hi Team,
We have revalidated this issue on the latest 8.6 BC8 Kibana cloud environment and found it working fine now.

Observations:

  • Endpoint remains Healthy when the agent service is restarted.
  • However, the agent temporarily shows as Unhealthy under the Agents tab after the restart.

Build details:
BUILD: 58807
COMMIT: 335b86a633af9ce1d9c8a92187ed1c8697ab139f

Screenshots:
Screenshots 8-13
(After some time)
Screenshot 14

Agent Logs:
[Endpoint]elastic-agent-diagnostics-2022-12-23T10-53-48Z-00.zip

Please let us know if anything else is required from our end.

Thanks


jlind23 commented Dec 23, 2022

Thanks Amol! Closing this issue then.

@dikshachauhan-qasource

Bug Conversion

A test case is not required, as this particular checkpoint is already covered in the following testcase:

Thanks
