[config] config reload needs to handle exceptions while stopping sonic target #1694
Conversation
The below error is seen when a 'config reload' operation is performed while sonic.target is starting during system startup or a prior config reload command:

Job for sonic.target canceled.

This error can be ignored if we try to stop all the services associated with sonic.target individually. Any processes which did not stop will be restarted as part of the "systemctl restart sonic.target" command.
This pull request introduces 3 alerts when merging a8229a5 into 38f8c06 - view on LGTM.com

new alerts:
Hi, the source of this issue is not just multiple 'config reload' commands executing at the same time; it can also happen when a conflicting job is installed for even one of the dependent stop jobs of sonic.target. I've detailed this in the original issue.
config/main.py (Outdated)

Before:

    clicommon.run_command("sudo systemctl stop sonic.target")

After:

    try:
        clicommon.run_command("sudo systemctl stop sonic.target")
Instead of catching only "Job Cancellation", we are also handling "Job failure" here. Since we're just dealing with job cancellation here, how about localizing the exception catch to just the cancellation case?
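For illustration, localizing the catch might look something like this (a sketch only, not the PR's code: it assumes the return_err=True mode of run_command used elsewhere in this PR, and the "canceled" substring check is an assumption based on the systemd message quoted above):

    import sys

    # Sketch: only treat a canceled stop job as non-fatal; surface real failures.
    (out, err, rv) = clicommon.run_command("sudo systemctl stop sonic.target", return_err=True)
    if rv != 0:
        if "canceled" in err:
            # A conflicting job canceled our stop job; sonic.target itself is fine.
            click.echo("Stop job for sonic.target was canceled. Continuing..")
        else:
            # A genuine job failure; do not silently continue.
            sys.exit(rv)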
The intention is indeed to handle all possible failures when we are stopping services. The "config reload" operation is a system-affecting event. We need to ensure that this command performs the intended action in its entirety and also brings the system back to a well-defined state. By not handling errors, we may be leaving the system in a broken state and, in some cases, inaccessible.

The proposed changes handle errors in the stop flow. It is to be noted that we are restarting all services at a later stage even if failures were seen during stop.

It is clear that "config reload" is not robust enough. ZTP relies on "config reload" to load user-provided configuration as well as its own ZTP default configuration, so it is essential that we fix "config reload" and not depend on ZTP to perform config reload again. Having said that, ZTP does perform a "config reload" if the "config reload" pushed by the user fails. This is to ensure that ZTP retries a new configuration that the user might push after noticing the errors.
I agree, we are restarting these services later in the pipeline, but per the current design we are trying to stop all the dependent services before applying the configuration.

I am not entirely sure about the repercussions of keeping the services active when the config changes are applied. If that is deemed safe, we can proceed with the config reload even if all the services haven't stopped.

In any case, I have a suggestion to reduce the chances of services being left in an active state: how about a fixed number of retries on failed services (maybe 3)? If the stop job doesn't succeed after 3 tries, we can then proceed with whatever approach we choose to adopt (exit or move forward):
    MAX_TRIES = 3
    services_to_stop = _get_sonic_services()
    failed_execution = []
    tries = 0
    while tries < MAX_TRIES:
        for s in services_to_stop:
            try:
                clicommon.run_command("systemctl stop {}".format(s))
            except SystemExit:
                # Remember the failed unit and clear its failed state before retrying
                failed_execution.append(s)
                clicommon.run_command("systemctl reset-failed {}".format(s))
        tries += 1
        if tries >= MAX_TRIES and failed_execution:
            click.echo("Error encountered while stopping these services {}. Continuing..".format(failed_execution))
        elif failed_execution:
            # Retry only the services that failed in this pass
            services_to_stop = failed_execution
            failed_execution = []
        else:
            break
Let me know if this approach has any setbacks.
This issue was present even before sonic.target was introduced. We used to see this issue with swss.target, and its failure was handled. However, after sonic.target we are back to seeing the stop operation being canceled. From my experience with SONiC and 'config reload', there is no explicit requirement to retry stopping failed individual services. One may argue why 3 and not something else. I would thoroughly test the current approach before making any further assumptions on possibilities. But that is just my personal choice.
If the current solution is robust enough to work around the race conditions, then I guess it's okay to stick with this.
@@ -663,6 +663,17 @@ def _get_disabled_services_list(config_db):

    return disabled_services_list

def _stop_all_services():
    # stop all other services
And if we were to localize the exception to "Cancellation", we could just try running "systemctl stop sonic.target" again, since cancellation doesn't mean something is wrong with sonic.target's job.
I noticed during my testing that doing a stop on sonic.target doesn't always succeed. The service which caused the conflict needs to be stopped first, and then sonic.target needs to be stopped. _stop_all_services() does more or less the same.
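For context, a minimal sketch of what such a helper could look like, assuming the _get_sonic_services() helper and the run_command(..., return_err=True) signature shown in the diff fragments in this conversation (the merged body may differ):

    def _stop_all_services():
        # Stop every unit pulled in by sonic.target individually, logging
        # but ignoring failures so that 'config reload' can proceed.
        for s in _get_sonic_services():
            (out, err, rv) = clicommon.run_command("sudo systemctl stop {}".format(s), return_err=True)
            if rv != 0:
                click.echo("Ignoring an error encountered while stopping service {}.".format(s))
                log.log_notice("Ignoring an error encountered while stopping service {}. rv={} out={} err={}.".format(s, rv, out.rstrip('\n'), err.rstrip('\n')))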
config/main.py (Outdated)

    try:
        clicommon.run_command("sudo systemctl stop {}".format(s))
    except SystemExit as err:
        click.echo("Error encountered while stopping service {}. Continuing..".format(s))
Is it possible to capture the output returned by the subprocess and print that in the case of an exception?
Are you suggesting that we capture the output in order to log the error to syslog? Typically the error message goes to stderr, while click.echo() writes to stdout.
Instead of the current message, I am suggesting making it more descriptive, e.g. "Error encountered while stopping service swss.service, systemd message: 'Job for swss.service canceled'. Continuing..".
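For illustration, a hedged sketch of that suggestion, reusing the return_err=True mode of run_command that appears later in this PR (the exact message wording is up for discussion):

    (out, err, rv) = clicommon.run_command("sudo systemctl stop {}".format(s), return_err=True)
    if rv != 0:
        # Include the captured systemd stderr so the operator sees why the stop failed.
        click.echo("Error encountered while stopping service {}, systemd message: '{}'. Continuing..".format(s, err.rstrip('\n')))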
@rajendra-dendukuri Can you please update the status on this PR?
Verified by simulating a job cancellation scenario on sonic.target.
@vivekreddynv and @rajendra-dendukuri, if this issue is also relevant to 202012 and the PR can be cleanly cherry-picked, please mention it in the PR so it can be further handled. If the cherry-pick is not clean once this is merged, a PR needs to be raised against 202012.
Changes look good to me.

Request for 202012 branch.
click.echo("Ignoring an error encountered while stopping service {}.".format(s)) | ||
log.log_notice("Ignoring an error encountered while stopping service {}. rv={} out={} err={}.".format(s, rv, out.rstrip('\n'), err.rstrip('\n'))) | ||
|
||
(out, err, rv) = clicommon.run_command("sudo systemctl stop swss", return_err=True) |
It is to doubly ensure that swss is down.
@@ -673,8 +685,12 @@ def _stop_services():
        pass

    click.echo("Stopping SONiC target ...")
    clicommon.run_command("sudo systemctl stop sonic.target")
We can't report and exit out of a 'config reload' command. The system may be down, and it is likely not reachable. So we need to take a corrective action, which is to perform the stop operation on individual targets and ignore any failures. This is a general solution, as the original stop operation and the stop_all operation are different in nature.
Adding a stop_all method will certainly reduce the possibility of failing again, because a "sonic.target" stop job is very sensitive to conflicting jobs for any of its dependent services. On the other hand, stopping individual services means that the likelihood of the race condition is significantly reduced.
"sonic.target" stop job is very sensitive to the conflicting jobs for any of the dependent services.
This seems something we need to fix first.
I'm not sure if this can be fixed. The point to note here is that only the stop job is sensitive; the start job isn't. The systemd documentation describes the startup behavior of a Wants dependency (link), but not the stop behavior.
click.echo("Failed to stop sonic.target rv={}.".format(rv)) | ||
log.log_notice("Failed to stop sonic.target rv={} stdout='{}' stderr='{}'.".format(rv, out.rstrip('\n'), err.rstrip('\n'))) | ||
click.echo("Stopping all services individually.") | ||
_stop_all_services() | ||
|
||
def _get_sonic_services(): | ||
out = clicommon.run_command("systemctl list-dependencies --plain sonic.target | sed '1d'", return_cmd=True) |
I think that's why sonic.target was introduced: if some docker-startup service were to be added, it should, according to the current design, have a WantedBy dependency on sonic.target, and it will then be listed by this line.
@stepanblyschak, can you provide more info on this?
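For reference, a sketch of how such a helper might consume that output (the strip/split handling here is an assumption, not necessarily the PR's exact parsing):

    def _get_sonic_services():
        # 'systemctl list-dependencies --plain sonic.target' prints the target
        # itself on the first line, which 'sed 1d' removes; what remains is
        # one dependent unit name per line.
        out = clicommon.run_command("systemctl list-dependencies --plain sonic.target | sed '1d'", return_cmd=True)
        return [unit.strip() for unit in out.splitlines() if unit.strip()]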
Is it the same as running systemctl stop sonic.target twice? If yes, this may be a more general solution.
Not exactly, and as I've said earlier, stopping services individually will reduce the likelihood of the race condition around conflicting job installations. "sonic.target" is very sensitive to conflicting jobs for any of the dependent services, and a stop job for sonic.target takes around 1-2 seconds; this entire period is prone to race conditions. On the other hand, stopping individual services means the likelihood of the race condition is significantly reduced. So I believe this is more robust than just stopping sonic.target twice.
Sorry for the late comment.

Stopping services individually will reduce the likelihood of the race condition around conflicting job installations. "sonic.target" is very sensitive to conflicting jobs for any of the dependent services.

If sonic.target is not a robust solution for stopping all the services, what was the point of introducing it in the first place? Should we stop using it in the config implementation completely, and use only the method of "stopping services individually"? And since this method is robust, we would only need it once.
Hmm, makes sense; we can in fact remove the sonic.target stop job.
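If that direction were taken, the stop path might reduce to something like this hypothetical sketch (not code from the actual change):

    def _stop_services():
        # Skip the sonic.target stop job entirely; stopping each dependent
        # service individually avoids the conflicting-job race on the
        # target's own stop job.
        click.echo("Stopping SONiC services ...")
        _stop_all_services()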
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
@qiluo-msft I believe the outstanding comments have been responded to. Please let me know if this can be approved and merged.
Please add this patch to the mock; without it, the load_minigraph tests are failing.
There is apparently a much more effective and simpler solution for this problem, i.e. to just replace … I've tested this change.

With this change: …

Without this change: …

Can you test it at your end and upstream it if it works?
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
We are seeing a similar issue in some buildout operations and need a solution to prevent further encounters of this issue in the future.
@rajendra-dendukuri Can you please address the test run failure for the following testcases: |
The fix discussed has been raised as a separate PR: #1814

Closing this PR, as a different solution (#1814) has been adopted.
The below error is seen when a 'config reload' operation is performed while sonic.target is starting during system startup or a prior config reload command:

Job for sonic.target canceled.

Fixes sonic-net/sonic-buildimage#7508
What I did
This error can be ignored if we try to stop all the services associated with sonic.target individually. Any processes which did not stop will be restarted as part of the "systemctl restart sonic.target" command.
How I did it
If sonic.target fails to stop, find all individual services which are part of sonic.target and stop them. All exceptions are handled, logged, and passed. The restart operation on sonic.target will ensure that the services are stopped and started if they were not stopped earlier.
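For reference, the stop flow described above, assembled from the diff fragments shown in the conversation (the merged code may differ slightly):

    click.echo("Stopping SONiC target ...")
    (out, err, rv) = clicommon.run_command("sudo systemctl stop sonic.target", return_err=True)
    if rv != 0:
        # sonic.target could not be stopped cleanly (e.g. its stop job was
        # canceled by a conflicting job); fall back to stopping each
        # dependent service individually, ignoring per-service failures.
        click.echo("Failed to stop sonic.target rv={}.".format(rv))
        log.log_notice("Failed to stop sonic.target rv={} stdout='{}' stderr='{}'.".format(rv, out.rstrip('\n'), err.rstrip('\n')))
        click.echo("Stopping all services individually.")
        _stop_all_services()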
How to verify it
Use ZTP to push a configuration file. Also verified using ZTP unit tests.
root@sonic:/usr/lib/ztp/tests# pytest-3 -v -x -k test_configdb-json.py
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /usr/lib/ztp/tests, inifile:
collected 199 items / 192 deselected
test_configdb-json.py::TestClass::test_constructor_wrong_type PASSED [ 14%]
test_configdb-json.py::TestClass::test_input_nfound PASSED [ 28%]
test_configdb-json.py::TestClass::test_invalid_input PASSED [ 42%]
test_configdb-json.py::TestClass::test_input_no_section PASSED [ 57%]
test_configdb-json.py::TestClass::test_json_config_nvalid PASSED [ 71%]
test_configdb-json.py::TestClass::test_json_config_valid_load PASSED [ 85%]
test_configdb-json.py::TestClass::test_json_config_valid_reload PASSED [100%]
================== 7 passed, 192 deselected in 40.48 seconds ===================
root@sonic:/usr/lib/ztp/tests#