Add watchdog mechanism to swss service and generate alert when swss have issue. #14686
d662879 to 2b05c34
@@ -19,6 +19,13 @@ autostart=true
autorestart=unexpected
buffer_size=1024

[eventlistener:supervisor-proc-watchdog-listener]
command=/usr/bin/supervisor-proc-watchdog-listener --container-name swss
Fixed. Merged the code into supervisor-proc-exit-listener.
93e8aa2 to 46cb307
@@ -75,6 +75,7 @@ command=/usr/bin/orchagent.sh
priority=4
autostart=false
autorestart=false
stdout_capture_maxbytes=1MB
This setting enables stdout capture for orchagent, so that supervisord converts each orchagent heartbeat message into a PROCESS_COMMUNICATION_STDOUT event.
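To illustrate the mechanism described in this comment, here is a minimal sketch of how an event listener might parse a PROCESS_COMMUNICATION_STDOUT notification. It assumes the standard supervisor event listener framing: a header line of `key:value` tokens, followed by a payload whose first line describes the process and whose remainder is the captured data. The function names, the sample header values, and the `heartbeat` payload text are hypothetical and are not taken from this PR.

```python
def parse_tokens(line):
    """Parse space-separated 'key:value' tokens into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def parse_communication_event(header_line, payload):
    """Return (event name, process name, captured data) for one event.

    header_line: the header line sent by supervisord for the event.
    payload: the event body; its first line holds process metadata and
             the rest is the data captured from the process's stdout.
    """
    headers = parse_tokens(header_line)
    meta_line, _, data = payload.partition("\n")
    meta = parse_tokens(meta_line)
    return headers["eventname"], meta["processname"], data


# Hypothetical sample event; the payload text "heartbeat" is an assumption.
payload = "processname:orchagent groupname:orchagent pid:123\nheartbeat"
header = (
    "ver:3.0 server:supervisor serial:21 pool:listener poolserial:10 "
    "eventname:PROCESS_COMMUNICATION_STDOUT len:{}".format(len(payload))
)

event_name, process_name, data = parse_communication_event(header, payload)
print(event_name, process_name, data)
```

A real listener would read the header from stdin, read exactly `len` bytes of payload, and acknowledge each event before waiting for the next one; that loop is omitted here to keep the sketch short.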
Approved, with an open question in the comment.
**What I did**
Improve orchagent: output a heartbeat message to systemd.
**Why I did it**
Currently, the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.
**How I verified it**
All unit tests pass. Manually validated that the heartbeat message works correctly.
**Details if related**
Another in-progress PR will add a watchdog for this heartbeat message: sonic-net/sonic-buildimage#14686
sonic-mgmt UT PR: sonic-net/sonic-mgmt#8306
…n swss have issue. (sonic-net#14686)" This reverts commit 44427a2.
…ave issue. (sonic-net#14686)
This PR depends on sonic-net/sonic-swss#2737, which must be merged first.
**What I did**
Add an orchagent watchdog to monitor orchagent and raise an alert when it is stuck.
**Why I did it**
Currently, the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.
**How I verified it**
All unit tests pass.
Added a new UT, sonic-net/sonic-mgmt#8306, to check that the watchdog works correctly.
Manual test: after pausing orchagent with 'kill -STOP <pid>', check that the following message appears in the log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
…n swss have issue. (sonic-net#14686)" (sonic-net#15390)
This reverts commit 44427a2. The Docker image was not updated during PR validation, which caused PR check failures. This revert is being force-merged. Once the cache is updated after this PR is merged, the issue should be fixed.
This PR depends on sonic-net/sonic-swss#2737, which must be merged first.
What I did
Add an orchagent watchdog to monitor orchagent and raise an alert when it is stuck.
Why I did it
Currently, the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.
How I verified it
All unit tests pass.
Added a new UT, sonic-net/sonic-mgmt#8306, to check that the watchdog works correctly.
Manual test: after pausing orchagent with 'kill -STOP <pid>', check that the following message appears in the log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
Details if related
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
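The stuck-process check described above can be sketched as a small heartbeat tracker. This is only an illustration under stated assumptions, not the code in this PR: the class and function names are hypothetical, the 60-second timeout is assumed (chosen to match the "1.0 minutes" in the sample log line), and the injectable clock exists only so the behavior can be shown without waiting.

```python
import time


class HeartbeatWatchdog:
    """Track the last heartbeat per process and report stale ones."""

    def __init__(self, timeout_seconds=60.0, clock=time.monotonic):
        # timeout_seconds is an assumed value, not taken from the PR.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def record_heartbeat(self, process_name):
        """Remember when a heartbeat from process_name last arrived."""
        self.last_seen[process_name] = self.clock()

    def stuck_processes(self):
        """Return {process name: seconds since its last heartbeat} for
        every process whose heartbeat is at least timeout_seconds old."""
        now = self.clock()
        return {
            name: now - seen
            for name, seen in self.last_seen.items()
            if now - seen >= self.timeout_seconds
        }


def format_alert(process_name, namespace, elapsed_seconds):
    """Build an alert in the shape of the log line quoted above."""
    return "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
        process_name, namespace, elapsed_seconds / 60.0
    )


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
watchdog = HeartbeatWatchdog(timeout_seconds=60.0, clock=lambda: fake_now[0])
watchdog.record_heartbeat("orchagent")
fake_now[0] = 60.0  # simulate one minute without a heartbeat
for name, elapsed in watchdog.stuck_processes().items():
    print(format_alert(name, "host", elapsed))
```

Using a monotonic clock by default keeps the elapsed-time calculation unaffected by wall-clock adjustments on the switch.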