
[config] config reload needs to handle exceptions while stopping sonic target #1694

Conversation

rajendra-dendukuri
Contributor

The following error is seen when a 'config reload' operation is performed
while sonic.target is starting, either during system startup or as a result
of a prior config reload command.

Job for sonic.target canceled.

Fixes sonic-net/sonic-buildimage#7508

What I did

This error can be tolerated if we try to stop all the services associated
with sonic.target individually. Any services that did not stop will be
restarted as part of the "systemctl restart sonic.target" command.

How I did it

If sonic.target fails to stop, find all the individual services that are part of sonic.target and stop them. All exceptions
are caught, logged, and ignored. The later restart operation on sonic.target ensures that any services that were not
stopped earlier are stopped and started.
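A minimal sketch of this fallback (illustrative only: `stop` is injected as a stand-in for the real clicommon.run_command helper so the flow can be shown without systemd; function names mirror, but are not, the actual config/main.py code):

```python
import subprocess

def get_sonic_services():
    """List the units wanted by sonic.target (header line dropped)."""
    out = subprocess.run(
        "systemctl list-dependencies --plain sonic.target | sed '1d'",
        shell=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

def stop_all_services(services, stop):
    """Stop each service individually, logging and ignoring failures."""
    failed = []
    for svc in services:
        try:
            stop(svc)
        except Exception as err:  # broad on purpose: the reload must keep going
            failed.append(svc)
            print("Ignoring an error encountered while stopping service {}: {}"
                  .format(svc, err))
    # Anything left running is restarted later by
    # "systemctl restart sonic.target", so failures here are non-fatal.
    return failed
```

In the real flow, stop(svc) would wrap "systemctl stop {svc}"; services that fail to stop are simply collected and logged.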

How to verify it

Use ZTP to push a configuration file. Also verified using ZTP unit tests.

root@sonic:/usr/lib/ztp/tests# pytest-3 -v -x -k test_configdb-json.py
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /usr/lib/ztp/tests, inifile:
collected 199 items / 192 deselected

test_configdb-json.py::TestClass::test_constructor_wrong_type PASSED [ 14%]
test_configdb-json.py::TestClass::test_input_nfound PASSED [ 28%]
test_configdb-json.py::TestClass::test_invalid_input PASSED [ 42%]
test_configdb-json.py::TestClass::test_input_no_section PASSED [ 57%]
test_configdb-json.py::TestClass::test_json_config_nvalid PASSED [ 71%]
test_configdb-json.py::TestClass::test_json_config_valid_load PASSED [ 85%]
test_configdb-json.py::TestClass::test_json_config_valid_reload PASSED [100%]

================== 7 passed, 192 deselected in 40.48 seconds ===================
root@sonic:/usr/lib/ztp/tests#


@lgtm-com

lgtm-com bot commented Jun 25, 2021

This pull request introduces 3 alerts when merging a8229a5 into 38f8c06 - view on LGTM.com

new alerts:

  • 3 for Except block handles 'BaseException'

@vivekrnv
Contributor

Hi,

The source of this issue is not just multiple config reloads being executed at the same time; it can also happen when a conflicting job is installed for any one of the dependent stop jobs of sonic.target. I've detailed this in the original issue.

config/main.py Outdated
clicommon.run_command("sudo systemctl stop sonic.target")

try:
    clicommon.run_command("sudo systemctl stop sonic.target")
Contributor

Instead of catching just "Job cancellation", we are also handling "Job failure" here.

Since we're only dealing with job cancellation here, how about localizing the exception-catch mechanism to just the cancellation case?

Contributor Author

The intention is indeed to handle all possible failures when we are stopping services. The "config reload" operation is a system-affecting event. We need to ensure that this command performs the intended action in its entirety and also brings the system back to a well-defined state. By not handling errors, we may be leaving the system in a broken state and, in some cases, inaccessible.

The proposed changes handle errors in the stop flow. It is to be noted that we are restarting all services at a later stage even if failures were seen during stop.

It is clear that "config reload" is not robust enough. ZTP relies on "config reload" to load user-provided configuration as well as its own ZTP default configuration, so it is essential that we fix "config reload" and not depend on ZTP to perform the reload again. That said, ZTP does perform a "config reload" if the "config reload" pushed by the user fails. This is to ensure that ZTP retries a new configuration that the user might push after noticing the errors.

Contributor

@vivekrnv vivekrnv Jun 28, 2021

I agree, we are restarting these services later in the pipeline, but in the current design we are trying to stop all the dependent services before applying the configuration.

I am not entirely sure about the repercussions of keeping the services active when the config changes are applied. If that is deemed safe, we can proceed with the config reload even if all the services haven't stopped.

In any case, I have a suggestion to reduce the chance of services being left in the active state.

How about having a fixed number of retries on failed services (maybe 3)? If the stop job doesn't succeed after 3 tries, we can then proceed with whatever approach we choose to adopt (exit or move forward).

MAX_TRIES = 3
services_to_stop = _get_sonic_services()
failed_execution = []
tries = 0
while tries < MAX_TRIES:
    for s in services_to_stop:
        try:
            clicommon.run_command("systemctl stop {}".format(s))
        except SystemExit as err:
            failed_execution.append(s)
            clicommon.run_command("systemctl reset-failed {}".format(s))
    tries += 1
    if tries >= MAX_TRIES and failed_execution:
        click.echo("Error encountered while stopping these services {}. Continuing..".format(failed_execution))
    elif failed_execution:
        services_to_stop = failed_execution
        failed_execution = []
    else:
        break

Let me know if this approach has any setbacks.

Contributor Author

This issue was present even before sonic.target was introduced. We used to see it with swss.target, and its failure was handled. However, after sonic.target was introduced we are back to seeing the stop operation being canceled. From my experience with SONiC and 'config reload', there is no explicit requirement to retry stopping failed individual services. One may argue why 3 retries and not some other number. I would thoroughly test the current approach before making any further assumptions about possibilities. But that is just my personal choice.

Contributor

If the current solution is robust enough to work around the race conditions, then I guess it's okay to stick with it.

@@ -663,6 +663,17 @@ def _get_disabled_services_list(config_db):

return disabled_services_list

def _stop_all_services():
    # stop all other services
Contributor

And if we were to localize the exception to "Cancellation", we could just try running "systemctl stop sonic.target" again, since cancellation doesn't mean something is wrong with sonic.target's job.

Contributor Author

I noticed during my testing that a stop on sonic.target doesn't always succeed. The service which caused the conflict needs to be stopped first, and then sonic.target needs to be stopped. _stop_all_services() does more or less the same.

config/main.py Outdated
try:
    clicommon.run_command("sudo systemctl stop {}".format(s))
except SystemExit as err:
    click.echo("Error encountered while stopping service {}. Continuing..".format(s))
Contributor

Is it possible to capture the output returned by the subprocess and print that in the case of an exception?

Contributor Author

@rajendra-dendukuri rajendra-dendukuri Jun 29, 2021

Are you suggesting that we capture the output in order to log the error to syslog? Typically the error message goes to stderr, while click.echo() writes to stdout.

Contributor

Instead of the current message, I am suggesting making it more descriptive, like "Error encountered while stopping service swss.service. Systemd message: 'Job for swss.service canceled'. Continuing..".
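The more descriptive message could be produced by capturing the child process's stderr, for example (a sketch using plain subprocess; the actual code path goes through clicommon.run_command with return_err=True, and the service name here is illustrative):

```python
import subprocess

def stop_and_report(cmd, service):
    """Run a stop command and surface systemd's own error text on failure."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # systemd writes messages like "Job for swss.service canceled." to stderr
        msg = proc.stderr.strip() or proc.stdout.strip()
        print("Error encountered while stopping service {}. "
              "Systemd message: '{}'. Continuing..".format(service, msg))
    return proc.returncode, proc.stderr.strip()
```

For a real unit this would be called as stop_and_report(["systemctl", "stop", "swss.service"], "swss.service").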

@dgsudharsan
Collaborator

@rajendra-dendukuri Can you please update the status on this PR?

@vivekrnv
Contributor

Verified by simulating a Job cancellation scenario on sonic.target

root@sonic:~# config reload -y
Running command: rm -rf /tmp/dropstat-*
Disabling container monitoring ...
Stopping SONiC target ...
Job for sonic.target canceled.
Failed to stop sonic.target. Stopping all services individually.
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment
Restarting SONiC target ...
Enabling container monitoring ...
Reloading Monit configuration ...
Reinitializing monit daemon


In parallel:

root@r-lionfish-16:/home/admin# systemctl start dhcp_relay
Job for dhcp_relay.service canceled.

@liat-grozovik
Collaborator

@vivekreddynv and @rajendra-dendukuri, if this issue is relevant also to 202012 and the PR can be cleanly cherry-picked, please mention it in the PR so it can be further handled. If the cherry-pick is not clean once this is merged, a PR needs to be raised against 202012.
thanks.

@vivekrnv
Contributor

Changes look good to me

@dgsudharsan
Collaborator

Request for 202012 Branch

click.echo("Ignoring an error encountered while stopping service {}.".format(s))
log.log_notice("Ignoring an error encountered while stopping service {}. rv={} out={} err={}.".format(s, rv, out.rstrip('\n'), err.rstrip('\n')))

(out, err, rv) = clicommon.run_command("sudo systemctl stop swss", return_err=True)
Contributor

@qiluo-msft qiluo-msft Jul 29, 2021

swss

swss is part of _get_sonic_services(); why do we need to stop it twice? #Closed

Contributor Author

It is to doubly ensure that swss is down.

@@ -673,8 +685,12 @@ def _stop_services():
        pass

    click.echo("Stopping SONiC target ...")
    clicommon.run_command("sudo systemctl stop sonic.target")
Contributor

@qiluo-msft qiluo-msft Jul 29, 2021

stop

If we tried stopping and failed, just report the failure. Stopping all of them twice will not be a general solution, since there is a possibility of failing again. #WontFix

Contributor Author

We can't just report the failure and exit out of a 'config reload' command. The system may be down, and it is likely not reachable. So we need to take a corrective action, which is to perform the stop operation on individual services and ignore any failures. This is a general solution, as the original stop operation and the stop_all operation are different in nature.

Contributor

Adding a stop_all method will certainly reduce the possibility of failing again, because a "sonic.target" stop job is very sensitive to conflicting jobs for any of the dependent services.

On the other hand, stopping individual services significantly reduces the likelihood of the race condition.

Contributor

"sonic.target" stop job is very sensitive to the conflicting jobs for any of the dependent services.

This seems like something we need to fix first.

Contributor

@vivekrnv vivekrnv Aug 13, 2021

I'm not sure if this can be fixed. The point to note here is that only the stop job is sensitive; the start job isn't. The systemd documentation describes the startup behavior of a Wants= dependency (Link), but not the stop behavior.

click.echo("Failed to stop sonic.target rv={}.".format(rv))
log.log_notice("Failed to stop sonic.target rv={} stdout='{}' stderr='{}'.".format(rv, out.rstrip('\n'), err.rstrip('\n')))
click.echo("Stopping all services individually.")
_stop_all_services()

def _get_sonic_services():
    out = clicommon.run_command("systemctl list-dependencies --plain sonic.target | sed '1d'", return_cmd=True)
Contributor

@qiluo-msft qiluo-msft Aug 2, 2021

systemctl list-dependencies --plain

Is it a general solution? If in the future there are multiple levels of dependencies forming a tree, will this command still work?

Contributor

I think that's why sonic.target was introduced. If some docker-startup service were to be added, it should, according to the current design, have a WantedBy= dependency on sonic.target, and it would then be listed by this line.

@stepanblyschak, can you provide more info on this?

Contributor

Is it the same as running systemctl stop sonic.target twice? If yes, that may be a more general solution.

Contributor

Not exactly. As I've said earlier, stopping services individually reduces the likelihood of the race condition around conflicting job installations. "sonic.target" is very sensitive to conflicting jobs for any of its dependent services, and a stop job for sonic.target takes around 1-2 seconds; this entire period is prone to race conditions.

Stopping individual services, on the other hand, significantly reduces the likelihood of the race condition. So I believe this is more robust than just stopping sonic.target twice.

Contributor

Sorry for late comment.

Stopping services individually will reduce the likelihood of race condition around conflicting job installations. "sonic.target" is very sensitive to the conflicting jobs for any of the dependent services

If sonic.target is not a robust solution for stopping all the services, what was the point of introducing it in the first place? Should we stop using it in the config implementation completely, and use only the method of stopping services individually? And since this method is robust, we only need it once.

Contributor

Hmm, makes sense. We can in fact remove the sonic.target stop job.

@liat-grozovik
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dgsudharsan
Collaborator

@qiluo-msft I believe the outstanding comments have been responded to. Please let me know if this can be approved and merged.

@qiluo-msft qiluo-msft requested review from lguohan and yxieca August 13, 2021 07:34
@vivekrnv
Contributor

@rajendra-dendukuri,

Please add this patch to the mock; without it, the load_minigraph tests fail:

diff --git a/tests/config_test.py b/tests/config_test.py
index 21ad18a2..9cd43703 100644
--- a/tests/config_test.py
+++ b/tests/config_test.py
@@ -30,6 +30,9 @@ def mock_run_command_side_effect(*args, **kwargs):
     if kwargs.get('return_cmd'):
         return ''

+    if kwargs.get('return_err'):
+        return "", "", 0
+

@vivekrnv
Contributor

Hi @rajendra-dendukuri,

There is apparently a much more effective and simpler solution for this problem: just replace "sudo systemctl stop sonic.target" with "sudo systemctl stop sonic.target --job-mode replace-irreversibly". Thanks @stepanblyschak for pointing this out.

I've tested this change:

With this change:

root@sonic:/home/admin# config reload -y -f
Running command: rm -rf /tmp/dropstat-*
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment
Restarting SONiC target ...
Enabling container monitoring ...
Reloading Monit configuration ...
Reinitializing monit daemon

In parallel:
admin@sonic:~$ sudo systemctl start teamd.service
Failed to start teamd.service: Transaction for teamd.service/start is destructive (radv.service has 'stop' job queued, but 'start' is included in transaction).
See system logs and 'systemctl status teamd.service' for details.

Without this Change:

root@r-lionfish-16:/home/admin# config reload -y -f
Running command: rm -rf /tmp/dropstat-*
Disabling container monitoring ...
Stopping SONiC target ...
Job for sonic.target canceled.

In parallel:
admin@sonic:~$ sudo systemctl start teamd.service.

Can you test it at your end and upstream it if it works?

@gechiang
Contributor

gechiang commented Sep 8, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@gechiang
Contributor

gechiang commented Sep 8, 2021

We are seeing a similar issue in some buildout operations and need a solution to prevent further encounters of this issue in the future.
This is needed in the 202012 branch.

@gechiang
Contributor

gechiang commented Sep 8, 2021

@rajendra-dendukuri Can you please address the test run failure for the following testcases:
test_load_minigraph
test_load_minigraph_with_port_config
test_load_minigraph_with_port_config_inconsistent_port
test_load_minigraph_with_port_config_bad_format
These are all failing.

@vivekrnv
Contributor

The fix discussed is raised as a separate PR: #1814

@rajendra-dendukuri
Contributor Author

Closing this PR as a different solution #1814 has been adopted.
