Add watchdog mechanism to swss service and generate alert when swss have issue. #14686

Merged
merged 3 commits into from
Jun 6, 2023

Conversation

liuh-80
Contributor

@liuh-80 liuh-80 commented Apr 17, 2023

This PR depends on sonic-net/sonic-swss#2737 merge first.

What I did
Add an orchagent watchdog that monitors orchagent and raises an alert when it gets stuck.

Why I did it
Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process gets stuck and stops processing, monit cannot detect or report it.

How I verified it
Passed all UTs.
Added a new UT (sonic-net/sonic-mgmt#8306) to check that the watchdog works correctly.
Manually tested: after pausing orchagent with 'kill -STOP <pid>', checked that the following alert message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
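
For readers following along, here is a minimal sketch of the stuck-detection idea described above. It is not the code from this PR: the class name `HeartbeatWatchdog`, the 60-second timeout, and the helper `format_alert` are assumptions made for illustration. Only the alert wording is modeled on the log line above.

```python
import time


class HeartbeatWatchdog:
    """Tracks the last heartbeat per process and reports the ones that went quiet.

    Hypothetical helper for illustration; not the listener shipped in this PR.
    """

    def __init__(self, timeout_seconds=60.0, clock=time.monotonic):
        # timeout_seconds is an assumed threshold, chosen to match the
        # "1.0 minutes" shown in the sample log line.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def record_heartbeat(self, process_name):
        """Remember when a heartbeat from process_name was last received."""
        self.last_seen[process_name] = self.clock()

    def stuck_processes(self):
        """Return {process_name: silent_seconds} for processes past the timeout."""
        now = self.clock()
        return {
            name: now - seen
            for name, seen in self.last_seen.items()
            if now - seen >= self.timeout_seconds
        }


def format_alert(process_name, namespace, silent_seconds):
    """Build an alert string shaped like the log message in the PR description."""
    return "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
        process_name, namespace, silent_seconds / 60.0
    )
```

A process paused with SIGSTOP stops sending heartbeats, so its entry ages past the timeout and shows up in `stuck_processes()`, whereas a plain "does the process exist" check would still pass.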

Details if related
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306

@liuh-80 liuh-80 force-pushed the dev/liuh/add-heart-beat branch from d662879 to 2b05c34 Compare April 28, 2023 08:51
@liuh-80 liuh-80 changed the title [POC] Add heartbeat monitor for orchagent [POC] Add proc stuck watchdog for orchagent Apr 28, 2023
@liuh-80 liuh-80 changed the title [POC] Add proc stuck watchdog for orchagent Add watchdog mechanism to swss service and generate alert when swss have issue. May 15, 2023
@liuh-80 liuh-80 marked this pull request as ready for review May 15, 2023 02:40
@@ -19,6 +19,13 @@ autostart=true
autorestart=unexpected
buffer_size=1024

[eventlistener:supervisor-proc-watchdog-listener]
command=/usr/bin/supervisor-proc-watchdog-listener --container-name swss
Collaborator

Could you explore how much code could be reused if this listener were combined with the "supervisor-proc-exit-listener" above?

Contributor Author

Fixed: merged the code into supervisor-proc-exit-listener.

@liuh-80 liuh-80 force-pushed the dev/liuh/add-heart-beat branch from 93e8aa2 to 46cb307 Compare May 22, 2023 03:15
@@ -75,6 +75,7 @@ command=/usr/bin/orchagent.sh
priority=4
autostart=false
autorestart=false
stdout_capture_maxbytes=1MB
Collaborator

What is the reason for this change?

Contributor Author

This config enables stdout capture on orchagent, so that supervisord converts the orchagent heartbeat message into a supervisor PROCESS_COMMUNICATION_STDOUT event.
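
To illustrate what a listener receives for such an event, here is a hedged sketch that parses one event in the supervisor event listener format: a header line of `key:value` tokens, followed by a payload whose first line carries `processname`, `groupname` and `pid`, and whose remainder is the captured data. The function names are hypothetical, and the payload text `heartbeat` is an assumption; the actual message emitted by orchagent is defined in sonic-net/sonic-swss#2737.

```python
def parse_tokens(line):
    """Parse a 'key:value key:value' line into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def parse_communication_event(header_line, payload):
    """Return (process_name, data) for a PROCESS_COMMUNICATION_STDOUT event.

    Returns None for any other event type, so a caller can ignore
    events it does not care about.
    """
    headers = parse_tokens(header_line)
    if headers.get("eventname") != "PROCESS_COMMUNICATION_STDOUT":
        return None
    # The first payload line is metadata; everything after it is the
    # data the process wrote between the communication tags.
    meta_line, _, data = payload.partition("\n")
    meta = parse_tokens(meta_line)
    return meta["processname"], data
```

A real listener would also print `READY` before reading each event from stdin and acknowledge it with a `RESULT` line afterwards; that loop is omitted here to keep the sketch focused on parsing.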

Collaborator

@qiluo-msft qiluo-msft left a comment

Approved, with an open question in the comment.

liuh-80 added a commit to sonic-net/sonic-swss that referenced this pull request Jun 6, 2023
**What I did**
Improve orchagent: output a heartbeat message to systemd.

**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process gets stuck and stops processing, monit cannot detect or report it.

**How I verified it**
Passed all UTs.
Manually validated that the heartbeat message works correctly.

**Details if related**
Another in-progress PR will add a watchdog for this heartbeat message:
sonic-net/sonic-buildimage#14686

sonic-mgmt UT PR: sonic-net/sonic-mgmt#8306
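
As a rough illustration of the sending side, the sketch below wraps a message in the begin and end tokens that supervisord looks for on a captured stdout stream. This is Python for readability only; the real change lives in orchagent, and the function names and the `heartbeat` message text are assumptions, not taken from that commit.

```python
import sys

# Tokens that supervisord recognizes when stdout capture is enabled
# for a program (stdout_capture_maxbytes greater than zero).
BEGIN_TOKEN = "<!--XSUPERVISOR:BEGIN-->"
END_TOKEN = "<!--XSUPERVISOR:END-->"


def format_heartbeat(message="heartbeat"):
    """Wrap message in the supervisor process communication tokens."""
    return BEGIN_TOKEN + message + END_TOKEN


def emit_heartbeat(stream=sys.stdout, message="heartbeat"):
    """Write one heartbeat to stream and flush so it is not held in a buffer."""
    stream.write(format_heartbeat(message))
    stream.flush()
```

Flushing after each write matters here: a heartbeat that sits in a buffer looks the same to a watchdog as a process that stopped sending.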
@qiluo-msft qiluo-msft merged commit 44427a2 into sonic-net:master Jun 6, 2023
yejianquan added a commit to yejianquan/sonic-buildimage that referenced this pull request Jun 8, 2023
wangxin pushed a commit that referenced this pull request Jun 9, 2023
…n swss have issue. (#14686)" (#15390)

This reverts commit 44427a2.
The Docker image was not updated during PR validation, which caused PR check failures.
Force-merging this revert. Once the cache is updated after this revert is merged, the issue should be fixed.
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 20, 2023
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#14686)

sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…n swss have issue. (sonic-net#14686)" (sonic-net#15390)
