Endpoint observed as degraded when the agent service is separately restarted #1960
@dikshachauhan-qasource Please review.
Secondary review is done.
I see the endpoint component reporting itself as HEALTHY in the Ubuntu logs, but both the endpoint input and output are reporting as DEGRADED:

```json
{"log.level":"info","@timestamp":"2022-12-16T06:50:08.329Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":325},"message":"Existing component state changed","component":{"id":"endpoint-default","state":"HEALTHY","message":"Healthy: communicating with endpoint service","inputs":[{"id":"endpoint-default-8c3407b7-1909-4ff2-a7f5-f08ea2255882","state":"DEGRADED","message":"Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}"}],"output":{"id":"endpoint-default","state":"DEGRADED","message":"Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}"}},"ecs.version":"1.6.0"}
```
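For what it's worth, here is a minimal Go sketch (the struct and field names below are my own throwaway types, not elastic-agent source types) that unmarshals the `component` object from a log entry like the one above and flags any unit whose state disagrees with the component-level state:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal types mirroring the "component" object in the log entry above.
// Illustrative only; not the elastic-agent source types.
type unitState struct {
	ID      string `json:"id"`
	State   string `json:"state"`
	Message string `json:"message"`
}

type componentState struct {
	ID      string      `json:"id"`
	State   string      `json:"state"`
	Message string      `json:"message"`
	Inputs  []unitState `json:"inputs"`
	Output  unitState   `json:"output"`
}

func main() {
	// The "component" field from the 06:50:08 log entry above.
	raw := `{"id":"endpoint-default","state":"HEALTHY","message":"Healthy: communicating with endpoint service","inputs":[{"id":"endpoint-default-8c3407b7-1909-4ff2-a7f5-f08ea2255882","state":"DEGRADED","message":"Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}"}],"output":{"id":"endpoint-default","state":"DEGRADED","message":"Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}"}}`

	var c componentState
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		panic(err)
	}
	// Print every unit whose state does not match the component-level state.
	for _, u := range append(c.Inputs, c.Output) {
		if u.State != c.State {
			fmt.Printf("unit %s is %s but component is %s: %q\n", u.ID, u.State, c.State, u.Message)
		}
	}
}
```

Run against that entry, it prints both the input and the output unit as DEGRADED under a HEALTHY component, which is exactly the mismatch described above.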
The agent shutdown completed at:

```json
{"log.level":"info","@timestamp":"2022-12-16T06:44:03.947Z","log.origin":{"file.name":"cmd/run.go","file.line":242},"message":"Shutting down completed.","ecs.version":"1.6.0"}
```

Endpoint starts looking for agent connectivity shortly after:

```json
{"@timestamp":"2022-12-16T06:44:11.690675625Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":1963,"name":"Config.cpp"}}},"message":"Config.cpp:1963 Checking for agent connectivity","process":{"pid":2174,"thread":{"id":2224}}}
```

Endpoint eventually decides it can't connect to the agent:

```json
{"@timestamp":"2022-12-16T06:45:17.813066648Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":91,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:91 Failed to find connection to validate. Is Agent listening on 127.0.0.1:6788?","process":{"pid":2174,"thread":{"id":2212}}}
{"@timestamp":"2022-12-16T06:45:17.813115883Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":123,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:123 Agent process is not root/admin or validation failed, disconnecting","process":{"pid":2174,"thread":{"id":2212}}}
{"@timestamp":"2022-12-16T06:45:17.813146717Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"warning","origin":{"file":{"line":180,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:180 Failed to established stage 1 connection to agent","process":{"pid":2174,"thread":{"id":2212}}}
```

Then a few minutes later it succeeds, just before the component state change shown above:

```json
{"@timestamp":"2022-12-16T06:49:48.324309668Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":727,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:727 Agent connection established.","process":{"pid":2174,"thread":{"id":2212}}}
{"@timestamp":"2022-12-16T06:49:48.325473344Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":485,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:485 Successfully created reader/writer for CheckinV2Api","process":{"pid":2174,"thread":{"id":2558}}}
{"@timestamp":"2022-12-16T06:49:48.325754323Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":516,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:516 Waiting for CheckinExpected message via CheckinV2 api","process":{"pid":2174,"thread":{"id":2558}}}
{"@timestamp":"2022-12-16T06:49:48.329795433Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":522,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:522 Received CheckinExpected message from agent","process":{"pid":2174,"thread":{"id":2558}}}
{"@timestamp":"2022-12-16T06:49:48.329968084Z","agent":{"id":"24b78fc3-0ef1-4430-af65-07fef77c8e19","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":279,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:279 Parsing checkin v2 message","process":{"pid":2174,"thread":{"id":2558}}}
```
If I look at the current state of the endpoint service in the agent, the relevant part of the UnitObserved message is:

```proto
message UnitObserved {
  // ...
  // Current state of the unit.
  State state = 4;
  // Human readable message for the state of the unit.
  // Exposed to users to provide more detail about the state for this single unit.
  string message = 5;
  // Payload for the current state.
  google.protobuf.Struct payload = 6;
}
```

In this case I see that endpoint is reporting its input and output units as DEGRADED, but with a message and payload that indicate success:

```yaml
state: 2 # HEALTHY
message: 'Healthy: communicating with endpoint service'
units:
  ? unittype: 0 # INPUT
    unitid: endpoint-default-8c3407b7-1909-4ff2-a7f5-f08ea2255882
  : state: 3 # DEGRADED
    message: Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}
    payload:
      error:
        code: 0
        message: Success
  ? unittype: 1 # OUTPUT
    unitid: endpoint-default
  : state: 3 # DEGRADED
    message: Protecting with policy {8c3407b7-1909-4ff2-a7f5-f08ea2255882}
    payload:
      error:
        code: 0
        message: Success
```

@brian-mckinney Endpoint seems to be reporting its agent input and output unit state as DEGRADED but without an error message. Any ideas on what might be causing this? In this case the agent service restarted, but endpoint did not. My initial suspicion is that the degraded status was set when the agent restarted but then did not clear when the agent started running again.
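To make that suspicion concrete, here is a hypothetical Go sketch (not elastic-agent or Endpoint code; the type and method names are invented for illustration) of the failure mode where a unit is marked DEGRADED when the connection drops, but nothing resets it once the connection is restored:

```go
package main

import "fmt"

// Hypothetical unit-state tracker illustrating the suspected bug: the state
// is downgraded on disconnect, but nothing re-evaluates it when the
// connection comes back, so the last DEGRADED value sticks.
type unitTracker struct {
	state string
}

func (t *unitTracker) onAgentDisconnected() {
	t.state = "DEGRADED" // set while the agent restarts
}

func (t *unitTracker) onAgentReconnected() {
	// Suspected bug: the tracker waits for a fresh status report instead of
	// clearing the degraded flag, so the stale DEGRADED state is never reset.
}

func main() {
	t := &unitTracker{state: "HEALTHY"}
	t.onAgentDisconnected()
	t.onAgentReconnected()
	fmt.Println("unit state after agent restart:", t.state) // still DEGRADED
}
```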
I agree. @samratbhadra-qasource can you try some other things to see what happens?
The results from those tests will help confirm my suspicion of what's happening. I don't think this is a new issue or related to v2. Endpoint has always required an active connection to Agent when applying policy. Without an active connection, Endpoint reports failure since Endpoint can't be sure it's doing the right thing. That "Needs attention" status comes from a document Endpoint writes to Elasticsearch directly. Assuming this is long-standing behavior, I think it's best to leave things as-is for 8.6.0 and then consider a possible change without rushing it.
Hi Team,

Observations:
Build details:
Screenshots:
Agent Logs:

Please let us know if anything else is required from our end. Thanks
Thanks Amol! Closing this issue then.
Bug Conversion: a test case is not required, as this particular checkpoint is already covered in the following test case:

Thanks
Kibana version: 8.6 BC 7 Kibana Prod environment
Host OS and Browser version: Windows, Linux, All
Describe the bug:
Agent goes to an unhealthy state when the agent service is restarted.
Build Details:
Preconditions:
Steps to reproduce:
Expected Result:
Agent should not go to an unhealthy state and should return to a healthy state after the agent service is restarted.
Screenshots:
Logs:
Linux:
elastic-agent-diagnostics-2022-12-16T08-58-58Z-00.zip
Windows:
elastic-agent-diagnostics-2022-12-16T09-09-37Z-00.zip