
[config] config reload needs to handle exceptions while stopping sonic target #1694

Conversation

rajendra-dendukuri
Contributor

The following error is seen when a 'config reload' operation is performed
while sonic.target is starting, either during system startup or as a result
of a prior config reload command.

Job for sonic.target canceled.

Fixes sonic-net/sonic-buildimage#7508

What I did

This error can be tolerated if we try to stop all the services associated
with sonic.target individually. Any services that did not stop will be
restarted as part of the "systemctl restart sonic.target" command.

How I did it

If sonic.target fails to stop, find all the individual services that are part of sonic.target and stop them. All exceptions
are caught, logged, and ignored. The later restart operation on sonic.target ensures that any services that were not
stopped earlier are stopped and started.
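A minimal sketch of this fallback (illustrative only: `stop` is injected as a stand-in for the real clicommon.run_command helper so the flow can be shown without systemd; function names mirror, but are not, the actual config/main.py code):

```python
import subprocess

def get_sonic_services():
    """List the units wanted by sonic.target (header line dropped)."""
    out = subprocess.run(
        "systemctl list-dependencies --plain sonic.target | sed '1d'",
        shell=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

def stop_all_services(services, stop):
    """Stop each service individually, logging and ignoring failures."""
    failed = []
    for svc in services:
        try:
            stop(svc)
        except Exception as err:  # broad on purpose: the reload must keep going
            failed.append(svc)
            print("Ignoring an error encountered while stopping service {}: {}"
                  .format(svc, err))
    # Anything left running is restarted later by
    # "systemctl restart sonic.target", so failures here are non-fatal.
    return failed
```

In the real flow, stop(svc) would wrap "systemctl stop {svc}"; services that fail to stop are simply collected and logged.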

How to verify it

Use ZTP to push a configuration file. Also verified using ZTP unit tests.

root@sonic:/usr/lib/ztp/tests# pytest-3 -v -x -k test_configdb-json.py
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /usr/lib/ztp/tests, inifile:
collected 199 items / 192 deselected

test_configdb-json.py::TestClass::test_constructor_wrong_type PASSED [ 14%]
test_configdb-json.py::TestClass::test_input_nfound PASSED [ 28%]
test_configdb-json.py::TestClass::test_invalid_input PASSED [ 42%]
test_configdb-json.py::TestClass::test_input_no_section PASSED [ 57%]
test_configdb-json.py::TestClass::test_json_config_nvalid PASSED [ 71%]
test_configdb-json.py::TestClass::test_json_config_valid_load PASSED [ 85%]
test_configdb-json.py::TestClass::test_json_config_valid_reload PASSED [100%]

================== 7 passed, 192 deselected in 40.48 seconds ===================
root@sonic:/usr/lib/ztp/tests#


@lgtm-com

lgtm-com bot commented Jun 25, 2021

This pull request introduces 3 alerts when merging a8229a5 into 38f8c06 - view on LGTM.com

new alerts:

  • 3 for Except block handles 'BaseException'

@vivekrnv
Contributor

Hi,

The source of this issue is not just multiple config reloads being executed at the same time; it can also happen when a conflicting job is installed for any one of the dependent stop jobs of sonic.target. I've detailed this in the original issue.

config/main.py Outdated
clicommon.run_command("sudo systemctl stop sonic.target")

try:
    clicommon.run_command("sudo systemctl stop sonic.target")
Contributor

Instead of catching just "Job cancellation", we are also handling "Job failure" here.

Since we're only dealing with job cancellation here, how about localizing the exception-catch mechanism to just the cancellation case?

Contributor Author

The intention is indeed to handle all possible failures when we are stopping services. The "config reload" operation is a system-affecting event. We need to ensure that this command performs the intended action in its entirety and also brings the system back to a well-defined state. By not handling errors, we may be leaving the system in a broken state and, in some cases, inaccessible.

The proposed changes handle errors in the stop flow. It is to be noted that we are restarting all services at a later stage even if failures were seen during stop.

It is clear that "config reload" is not robust enough. ZTP relies on "config reload" to load user-provided configuration as well as its own ZTP default configuration, so it is essential that we fix "config reload" and not depend on ZTP to perform the reload again. That said, ZTP does perform a "config reload" if the "config reload" pushed by the user fails. This is to ensure that ZTP retries a new configuration that the user might push after noticing the errors.

Contributor

@vivekrnv vivekrnv Jun 28, 2021

I agree, we are restarting these services later in the pipeline, but in the current design we are trying to stop all the dependent services before applying the configuration.

I am not entirely sure about the repercussions of keeping the services active when the config changes are applied. If that is deemed safe, we can proceed with the config reload even if all the services haven't stopped.

In any case, I have a suggestion to reduce the chance of services being left in the active state.

How about having a fixed number of retries on failed services (maybe 3)? If the stop job doesn't succeed after 3 tries, we can then proceed with whatever approach we choose to adopt (exit or move forward).

MAX_TRIES = 3
services_to_stop = _get_sonic_services()
failed_execution = []
tries = 0
while tries < MAX_TRIES:
    for s in services_to_stop:
        try:
            clicommon.run_command("systemctl stop {}".format(s))
        except SystemExit as err:
            failed_execution.append(s)
            clicommon.run_command("systemctl reset-failed {}".format(s))
    tries += 1
    if tries >= MAX_TRIES and failed_execution:
        click.echo("Error encountered while stopping these services {}. Continuing..".format(failed_execution))
    elif failed_execution:
        services_to_stop = failed_execution
        failed_execution = []
    else:
        break

Let me know if this approach has any setbacks.

Contributor Author

This issue was present even before sonic.target was introduced. We used to see it with swss.target, and its failure was handled. However, after sonic.target was introduced we are back to seeing the stop operation being canceled. From my experience with SONiC and 'config reload', there is no explicit requirement to retry stopping failed individual services. One may argue why 3 retries and not some other number. I would thoroughly test the current approach before making any further assumptions about possibilities. But that is just my personal choice.

Contributor

If the current solution is robust enough to work around the race conditions, then I guess it's okay to stick with it.

@@ -663,6 +663,17 @@ def _get_disabled_services_list(config_db):

return disabled_services_list

def _stop_all_services():
    # stop all other services
Contributor

And if we were to localize the exception to "Cancellation", we could just try running "systemctl stop sonic.target" again, since cancellation doesn't mean something is wrong with sonic.target's job.

Contributor Author

I noticed during my testing that a stop on sonic.target doesn't always succeed. The service which caused the conflict needs to be stopped first, and then sonic.target needs to be stopped. _stop_all_services() does more or less the same.

config/main.py Outdated
try:
    clicommon.run_command("sudo systemctl stop {}".format(s))
except SystemExit as err:
    click.echo("Error encountered while stopping service {}. Continuing..".format(s))
Contributor

Is it possible to capture the output returned by the subprocess and print that in the case of an exception?

Contributor Author

@rajendra-dendukuri rajendra-dendukuri Jun 29, 2021

Are you suggesting that we capture the output in order to log the error to syslog? Typically the error message goes to stderr, while click.echo() writes to stdout.

Contributor

Instead of the current message, I am suggesting making it more descriptive, like "Error encountered while stopping service swss.service. Systemd message: 'Job for swss.service canceled'. Continuing..".
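The more descriptive message could be produced by capturing the child process's stderr, for example (a sketch using plain subprocess; the actual code path goes through clicommon.run_command with return_err=True, and the service name here is illustrative):

```python
import subprocess

def stop_and_report(cmd, service):
    """Run a stop command and surface systemd's own error text on failure."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # systemd writes messages like "Job for swss.service canceled." to stderr
        msg = proc.stderr.strip() or proc.stdout.strip()
        print("Error encountered while stopping service {}. "
              "Systemd message: '{}'. Continuing..".format(service, msg))
    return proc.returncode, proc.stderr.strip()
```

For a real unit this would be called as stop_and_report(["systemctl", "stop", "swss.service"], "swss.service").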

@dgsudharsan
Collaborator

@rajendra-dendukuri Can you please update the status on this PR?

@vivekrnv
Contributor

Verified by simulating a Job cancellation scenario on sonic.target

root@sonic:~# config reload -y
Running command: rm -rf /tmp/dropstat-*
Disabling container monitoring ...
Stopping SONiC target ...
Job for sonic.target canceled.
Failed to stop sonic.target. Stopping all services individually.
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment
Restarting SONiC target ...
Enabling container monitoring ...
Reloading Monit configuration ...
Reinitializing monit daemon


In parallel:

root@r-lionfish-16:/home/admin# systemctl start dhcp_relay
Job for dhcp_relay.service canceled.

@liat-grozovik
Collaborator

@vivekreddynv and @rajendra-dendukuri, if this issue is relevant also to 202012 and the PR can be cleanly cherry-picked, please mention it in the PR so it can be further handled. If the cherry-pick is not clean once this is merged, a PR needs to be raised against 202012.
thanks.

@vivekrnv
Contributor

Changes look good to me

@dgsudharsan
Collaborator

Request for 202012 Branch

click.echo("Ignoring an error encountered while stopping service {}.".format(s))
log.log_notice("Ignoring an error encountered while stopping service {}. rv={} out={} err={}.".format(s, rv, out.rstrip('\n'), err.rstrip('\n')))

(out, err, rv) = clicommon.run_command("sudo systemctl stop swss", return_err=True)
Contributor

@qiluo-msft qiluo-msft Jul 29, 2021

swss

swss is part of _get_sonic_services(); why do we need to stop it twice? #Closed

Contributor Author

It is to doubly ensure that swss is down.

@@ -673,8 +685,12 @@ def _stop_services():
        pass

    click.echo("Stopping SONiC target ...")
    clicommon.run_command("sudo systemctl stop sonic.target")
Contributor

@qiluo-msft qiluo-msft Jul 29, 2021

stop

If we tried stopping and failed, just report the failure. Stopping all of them twice will not be a general solution, since there is a possibility of failing again. #WontFix

Contributor Author

We can't just report the failure and exit out of a 'config reload' command. The system may be down, and it is likely not reachable. So we need to take a corrective action, which is to perform the stop operation on individual services and ignore any failures. This is a general solution, as the original stop operation and the stop_all operation are different in nature.

Contributor

Adding a stop_all method will certainly reduce the possibility of failing again, because a "sonic.target" stop job is very sensitive to conflicting jobs for any of the dependent services.

On the other hand, stopping individual services significantly reduces the likelihood of the race condition.

Contributor

"sonic.target" stop job is very sensitive to the conflicting jobs for any of the dependent services.

This seems like something we need to fix first.

Contributor

@vivekrnv vivekrnv Aug 13, 2021

I'm not sure if this can be fixed. The point to note here is that only the stop job is sensitive; the start job isn't. The systemd documentation describes the startup behavior of a Wants= dependency (Link), but not the stop behavior.

click.echo("Failed to stop sonic.target rv={}.".format(rv))
log.log_notice("Failed to stop sonic.target rv={} stdout='{}' stderr='{}'.".format(rv, out.rstrip('\n'), err.rstrip('\n')))
click.echo("Stopping all services individually.")
_stop_all_services()

def _get_sonic_services():
    out = clicommon.run_command("systemctl list-dependencies --plain sonic.target | sed '1d'", return_cmd=True)
Contributor

@qiluo-msft qiluo-msft Aug 2, 2021

systemctl list-dependencies --plain

Is it a general solution? If in the future there are multiple levels of dependencies forming a tree, will this command still work?

Contributor

I think that's why sonic.target was introduced. If some docker-startup service were to be added, it should, according to the current design, have a WantedBy= dependency on sonic.target, and it would then be listed by this line.

@stepanblyschak, can you provide more info on this?

Contributor

Is it the same as running systemctl stop sonic.target twice? If yes, that may be a more general solution.

Contributor

Not exactly. As I've said earlier, stopping services individually reduces the likelihood of the race condition around conflicting job installations. "sonic.target" is very sensitive to conflicting jobs for any of its dependent services, and a stop job for sonic.target takes around 1-2 seconds; this entire period is prone to race conditions.

Stopping individual services, on the other hand, significantly reduces the likelihood of the race condition. So I believe this is more robust than just stopping sonic.target twice.

Contributor

Sorry for late comment.

Stopping services individually will reduce the likelihood of race condition around conflicting job installations. "sonic.target" is very sensitive to the conflicting jobs for any of the dependent services

If sonic.target is not a robust solution for stopping all the services, what was the point of introducing it in the first place? Should we stop using it in the config implementation completely, and use only the method of stopping services individually? And since this method is robust, we only need it once.

Contributor

Hmm, makes sense. We can in fact remove the sonic.target stop job.

@liat-grozovik
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dgsudharsan
Collaborator

@qiluo-msft I believe the outstanding comments have been responded to. Please let me know if this can be approved and merged.

@qiluo-msft qiluo-msft requested review from lguohan and yxieca August 13, 2021 07:34
@vivekrnv
Contributor

@rajendra-dendukuri,

Please add this patch to the mock; without it, the load_minigraph tests fail:

diff --git a/tests/config_test.py b/tests/config_test.py
index 21ad18a2..9cd43703 100644
--- a/tests/config_test.py
+++ b/tests/config_test.py
@@ -30,6 +30,9 @@ def mock_run_command_side_effect(*args, **kwargs):
     if kwargs.get('return_cmd'):
         return ''

+    if kwargs.get('return_err'):
+        return "", "", 0
+

@vivekrnv
Contributor

Hi @rajendra-dendukuri,

There is apparently a much more effective and simpler solution for this problem: just replace "sudo systemctl stop sonic.target" with "sudo systemctl stop sonic.target --job-mode replace-irreversibly". Thanks @stepanblyschak for pointing this out.

I've tested this change:

With this change:

root@sonic:/home/admin# config reload -y -f
Running command: rm -rf /tmp/dropstat-*
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment
Restarting SONiC target ...
Enabling container monitoring ...
Reloading Monit configuration ...
Reinitializing monit daemon

In parallel:
admin@sonic:~$ sudo systemctl start teamd.service
Failed to start teamd.service: Transaction for teamd.service/start is destructive (radv.service has 'stop' job queued, but 'start' is included in transaction).
See system logs and 'systemctl status teamd.service' for details.

Without this Change:

root@r-lionfish-16:/home/admin# config reload -y -f
Running command: rm -rf /tmp/dropstat-*
Disabling container monitoring ...
Stopping SONiC target ...
Job for sonic.target canceled.

In parallel:
admin@sonic:~$ sudo systemctl start teamd.service.

Can you test it at your end and upstream it if it works?

@gechiang
Contributor

gechiang commented Sep 8, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@gechiang
Contributor

gechiang commented Sep 8, 2021

We are seeing a similar issue in some buildout operations and need a solution to prevent further encounters of this issue in the future.
This is needed in the 202012 branch.

@gechiang
Contributor

gechiang commented Sep 8, 2021

@rajendra-dendukuri Can you please address the test run failure for the following testcases:
test_load_minigraph
test_load_minigraph_with_port_config
test_load_minigraph_with_port_config_inconsistent_port
test_load_minigraph_with_port_config_bad_format
These are all failing.

@vivekrnv
Contributor

The fix discussed is raised as a separate PR: #1814

@rajendra-dendukuri
Contributor Author

Closing this PR as a different solution #1814 has been adopted.
