-
Notifications
You must be signed in to change notification settings - Fork 557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Nornir + NAPALM threading issue #1081
Fix Nornir + NAPALM threading issue #1081
Conversation
We ran into a threading deadlock issue on Nornir when running NAPALM tasks. Our reference code was as follows: from nornir import InitNornir
from nornir.core.filter import F
from nornir.plugins.functions.text import print_result
from nornir.plugins.tasks import networking
def configure_vlan(task, vlan_id, vlan_name):
print(f"Starting task for {task.host}")
config_string = f"""vlan {vlan_id}
name {vlan_name}"""
task.run(task=networking.napalm_configure, configuration=config_string)
print(f"Task complete for {task.host}")
def main():
VLAN_ID = "123"
VLAN_NAME = "ntp_vlan"
nr = InitNornir(config_file="config.yaml")
nr = nr.filter(F(groups__contains="eos") | F(groups__contains="nxos"))
result = nr.run(
task=configure_vlan, vlan_id=VLAN_ID, vlan_name=VLAN_NAME, num_workers=20
)
print_result(result)
if __name__ == "__main__":
main() Basically this code would relatively infrequently deadlock (one out of four times or so). We observed this with napalm_get_facts in a nornir context as well. Code had no issues when num_workers=1 (i.e. single threaded). Cntl-z on the hung process would show a threading stack trace. The python program would never complete on its own--you could wait there 5+ minutes. We tested/debugged this for a long period of time and we believe the issue is centered are the get_network_driver() importing. Code would hang here: https://github.com/napalm-automation/napalm/blob/master/napalm/base/__init__.py#L92 When importing the actual relevant module (either napalm.eos or napalm.nxos). I initially tried to implement a fix to inspect whether sys.modules already had the module loaded, but I ran into threading problems with this fix as well. As far as I can tell, I think the root cause of the issue was some sort of a timing problem with respect to importing parts of the NAPALM library. For example, we would see error messages such as the following (though this was when I was trying to implement the sys.modules fix):
The easiest fix for this was to import the constituent NAPALM packages in the main napalm As a possibly related issue I saw this warning in the napalm unit tests related to pyeapi
I looked at that |
Adding @dbarrosop, @mirceaulinic, and @carlmontanari |
Related issue: |
Should we propose a fix to pyeapi instead and update requirements to make sure we have a version with the fix? |
Okay, I dug into this a bit more and it is not due to that pyeapi code--that code-path is never used by NAPALM. So this is really the only fix I have found that works and is handled in NAPALM. An end-user can fix it by doing the same thing (i.e. by pre-loading the napalm constituent libraries in their code). We're probably at like at more than a dozen hours working on this problem between @carlmontanari and I...as we ran into it earlier and never were able to isolate it until Carl was able to isolate it to the |
These changes look good to be. Is this good to go, @ktbyers? |
Let's give @dbarrosop until Monday/Tuesday to see if he has any additional thoughts or comments on it. |
I personally think it’s a bit weird that a deadlock is solved by adding some imports. To me it feels like we are masking some issue that may manifest itself later on via other codepaths. In any case, I don’t want to block this so feel free to proceed. Good job in identifying the issue and a potential workaround, this sounds like a nightmarish problem. |
Haven't looked that close, but might be related to this https://bugs.python.org/issue38095. |
Cool...from @ogenstad. Definitely possible that is the issue...we have seen the failure on PY3.6, 3.7, and 3.8. I did see some earlier Python importlib threading issues (but they were from around ~Py3.4/3.5 so they didn't seem like a good match). |
Given what I have observed I think the underlying issue is that in certain situations we import only napalm.base and in other contexts we import napalm.nxos or napalm.eos, et cetera (which then imports/uses napalm.base) and there is some underlying race condition where the napalm.nxos driver (in one thread) is trying to use the napalm.base thread imported in a different thread (which has not been fully loaded). The import fix then...is that all of napalm gets essentially anytime get_network_driver is loaded which likely reduces the race condition possibilities (though you could still see how there might be race conditions though just rarer). I think we should go ahead and merge this. We can leave an issue open and check that Python issue 6 months from now and see if they made any updates/changes and possibly regression test undoing this change. |
More details to follow...