Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Nornir + NAPALM threading issue #1081

Merged

Conversation

ktbyers
Copy link
Contributor

@ktbyers ktbyers commented Oct 30, 2019

More details to follow...

@coveralls
Copy link

coveralls commented Oct 30, 2019

Coverage Status

Coverage increased (+0.02%) to 81.347% when pulling 6de6dd1 on ktbyers:fix_threading_issue into f3a1edc on napalm-automation:develop.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 81.348% when pulling 6de6dd1 on ktbyers:fix_threading_issue into f3a1edc on napalm-automation:develop.

@ktbyers
Copy link
Contributor Author

ktbyers commented Oct 30, 2019

We ran into a threading deadlock issue on Nornir when running NAPALM tasks. Our reference code was as follows:

from nornir import InitNornir
from nornir.core.filter import F
from nornir.plugins.functions.text import print_result
from nornir.plugins.tasks import networking


def configure_vlan(task, vlan_id, vlan_name):
    print(f"Starting task for {task.host}")
    config_string = f"""vlan {vlan_id}
  name {vlan_name}"""
    task.run(task=networking.napalm_configure, configuration=config_string)
    print(f"Task complete for {task.host}")

def main():
    VLAN_ID = "123"
    VLAN_NAME = "ntp_vlan"
    nr = InitNornir(config_file="config.yaml")
    nr = nr.filter(F(groups__contains="eos") | F(groups__contains="nxos"))
    result = nr.run(
        task=configure_vlan, vlan_id=VLAN_ID, vlan_name=VLAN_NAME, num_workers=20
    )
    print_result(result)

if __name__ == "__main__":
    main()

Basically this code would relatively infrequently deadlock (one out of four times or so). We observed this with napalm_get_facts in a nornir context as well.

Code had no issues when num_workers=1 (i.e. single threaded).

Cntl-z on the hung process would show a threading stack trace. The python program would never complete on its own--you could wait there 5+ minutes.

We tested/debugged this for a long period of time and we believe the issue is centered are the get_network_driver() importing.

Code would hang here:

https://github.com/napalm-automation/napalm/blob/master/napalm/base/__init__.py#L92

When importing the actual relevant module (either napalm.eos or napalm.nxos).

I initially tried to implement a fix to inspect whether sys.modules already had the module loaded, but I ran into threading problems with this fix as well.

As far as I can tell, I think the root cause of the issue was some sort of a timing problem with respect to importing parts of the NAPALM library. For example, we would see error messages such as the following (though this was when I was trying to implement the sys.modules fix):

Exception.... No class inheriting "napalm.base.base.NetworkDriver" found in "napalm.eos".
                Exception....? local variable 'network_driver' referenced before assignment

The easiest fix for this was to import the constituent NAPALM packages in the main napalm __init__.py. Consequently, when you import get_network_driver you also import napalm.base, napalm.eos, napalm.ios et cetera.

As a possibly related issue I saw this warning in the napalm unit tests related to pyeapi

  /home/kbyers/VENV/incendio_test/lib/python3.6/site-packages/pyeapi/utils.py:34: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

I looked at that utils.py code and noticed they had no locking around their importing which looks like a bug/issue (really two issues: 1)they should be using importlib, and 2)if they use imp library, then they need locking). I did a test where I implemented locking and it did not fix this Nornir+NAPALM deadlock.

@ktbyers
Copy link
Contributor Author

ktbyers commented Oct 30, 2019

Adding @dbarrosop, @mirceaulinic, and @carlmontanari

@ktbyers
Copy link
Contributor Author

ktbyers commented Oct 31, 2019

Related issue:

arista-eosplus/pyeapi#177

@dbarrosop
Copy link
Member

Should we propose a fix to pyeapi instead and update requirements to make sure we have a version with the fix?

@ktbyers
Copy link
Contributor Author

ktbyers commented Oct 31, 2019

Okay, I dug into this a bit more and it is not due to that pyeapi code--that code-path is never used by NAPALM.

So this is really the only fix I have found that works and is handled in NAPALM. An end-user can fix it by doing the same thing (i.e. by pre-loading the napalm constituent libraries in their code).

We're probably at like at more than a dozen hours working on this problem between @carlmontanari and I...as we ran into it earlier and never were able to isolate it until Carl was able to isolate it to the get_network_driver importing yesterday.

@mirceaulinic
Copy link
Member

These changes look good to be. Is this good to go, @ktbyers?

@ktbyers
Copy link
Contributor Author

ktbyers commented Nov 1, 2019

Let's give @dbarrosop until Monday/Tuesday to see if he has any additional thoughts or comments on it.

@mirceaulinic mirceaulinic added this to the 3.0.0 milestone Nov 1, 2019
@dbarrosop
Copy link
Member

I personally think it’s a bit weird that a deadlock is solved by adding some imports. To me it feels like we are masking some issue that may manifest itself later on via other codepaths. In any case, I don’t want to block this so feel free to proceed.

Good job in identifying the issue and a potential workaround, this sounds like a nightmarish problem.

@ogenstad
Copy link
Contributor

ogenstad commented Nov 2, 2019

Haven't looked that close, but might be related to this https://bugs.python.org/issue38095.

@ktbyers
Copy link
Contributor Author

ktbyers commented Nov 2, 2019

Cool...from @ogenstad.

Definitely possible that is the issue...we have seen the failure on PY3.6, 3.7, and 3.8.

I did see some earlier Python importlib threading issues (but they were from around ~Py3.4/3.5 so they didn't seem like a good match).

@ktbyers
Copy link
Contributor Author

ktbyers commented Nov 2, 2019

Given what I have observed I think the underlying issue is that in certain situations we import only napalm.base and in other contexts we import napalm.nxos or napalm.eos, et cetera (which then imports/uses napalm.base) and there is some underlying race condition where the napalm.nxos driver (in one thread) is trying to use the napalm.base thread imported in a different thread (which has not been fully loaded).

The import fix then...is that all of napalm gets essentially anytime get_network_driver is loaded which likely reduces the race condition possibilities (though you could still see how there might be race conditions though just rarer).

I think we should go ahead and merge this.

We can leave an issue open and check that Python issue 6 months from now and see if they made any updates/changes and possibly regression test undoing this change.

@mirceaulinic mirceaulinic merged commit 3e3dd91 into napalm-automation:develop Nov 6, 2019
ExaneServerTeam pushed a commit to ExaneServerTeam/napalm that referenced this pull request Mar 6, 2020
neelimapp pushed a commit to neelimapp/napalm that referenced this pull request Mar 20, 2020
bharath-ravindranath pushed a commit to bharath-ravindranath/napalm that referenced this pull request Apr 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants