Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: eos config_session not cleared if discard failed #1921

Conversation

kpetremann
Copy link
Contributor

@kpetremann kpetremann commented May 15, 2023

If the abort command is refused, for instance because of a TACACS issue, the config_session is never cleared because an exception is raised by self._run_commands.

This is an issue in Saltstack case:

  • a pending session could contain incomplete changes
  • this pending session is resumed on the next load_config call causing messed up configuration

This change ensure the config_session is always cleared.

If the abort command is refused, for instance because of a TACACS issue
the config_session is never cleared because an exception is raised by
self._run_commands.

This is an issue in Saltstack case:
- a pending session could contain incomplete changes
- this pending session is resumed on the next load_config call causing
  messed up configuration

This change ensure the config_session is always cleared.
@bewing
Copy link
Member

bewing commented May 15, 2023

Wouldn't the session still be in use on the switch, causing future issues due to the pending session preventing a config lock?

napalm/napalm/eos/eos.py

Lines 307 to 314 in cb4845c

def _lock(self):
sess = self._run_commands(["show configuration sessions"])[0]["sessions"]
if [
k
for k, v in sess.items()
if v["state"] == "pending" and k != self.config_session
]:
raise SessionLockedException("Session is already in use")

@kpetremann
Copy link
Contributor Author

hello @bewing,
Thanks for the quick response.

Yes totally, but "only" for 24 hours as EOS will cancel any pending changes older than that.
It would make it closer to the lock implementation for JunOS:

def _lock(self):
"""Lock the config DB."""
if not self.locked:
try:
self.device.cu.lock()
self.locked = True
except JnprLockError as jle:
raise LockError(str(jle))

It seems better to lock for 24 hours rather than having unexpected pending changes which can be applied by any future Salt state in uncontrolled manner. Knowing that such fails should not occur regularly and should be considered as incident. In our case, it was a TACACS issue which occurred during the changes in the config session (EOS checks permissions progressively for each line typed in the config session). We had multiple occurrences of this problem causing a mess in the configuration.

We could make this behavior optional adding a parameter at the init of the EOSDriver.

The other alternatives I see:

  • calling discard_commit each time we are about to do a change => not scalable as one could easily forget to do that in a Salt state and ending up with the same issue
  • add an optional alternative of the lock itself, depending on a new EOSDriver attribute, to have a kind of "private configuration" mode.

WDYT?

@ktbyers
Copy link
Contributor

ktbyers commented May 16, 2023

@kpetremann Why doesn't the exception get detected by Salt?

Also from you stated above, this implies Salt is reusing a NAPALM connection over very long periods of time--is that correct?

@kpetremann
Copy link
Contributor Author

kpetremann commented May 16, 2023

Hello @ktbyers,

Why doesn't the exception get detected by Salt?

When Salt proxy-minion are setup with Napalm as a driver, it uses the Napalm library to be vendor agnostic.
Even if Salt was catching the exception when it tried to discard the pending config, the only thing that Salt could do would be very specific to EOS, for instance by clearing manually the device.config_session.

The only generic alternative I see would be to do a change in Salt to make it completely drop the connection when there is an error, which seems overkill.

Also from you stated above, this implies Salt is reusing a NAPALM connection over very long periods of time--is that correct?

Yes, proxy-minions are running as a "daemon" and keep the connection active (by default, but there is an option reconnect at each request).

The proxy-minion keeps the Napalm Device object to reuse connection. It also keeps the config_session until the change is actually committed. This is to be able to load configuration without committing, then complete the config and then commit.
At least from what we can deduce from their documentation:

Note: we are not using commit=False, this is only to explain our understanding of Saltstack implementation of Napalm based proxy-minions.

@ktbyers
Copy link
Contributor

ktbyers commented May 16, 2023

I don't like the solution proposed as it is an unexpected thing to do: exception happens on discard() and then hide the configuration session (even though the config session still exists and is active).

Given Salt's very long running sessions (which are atypical) does it make more sense for Salt's logic to bake more checks in here (for example, call discard() and then ensure there are no config sessions at the end of that call, or call discard() and then do a compare_config() after that to ensure there are no pending changes)?

@kpetremann
Copy link
Contributor Author

From user perspective, Napalm have inconsistent behavior at least when comparing EOS and JunOS.
In JunOS, if the discard does not work, the lock is not unlocked. This is the opposite with EOS implementation.

exception happens on discard() and then hide the configuration session (even though the config session still exists and is active).

It is not really hidden as the lock would block this. Also, even if the session still exists, it is more pending than active. I mean, if you create another session, the pending changes are not resumed. It is like private config session on JunOS.

I don't like the solution proposed

What about the other propositions I made:

  • make this change an option => something like always_clear_session_on_discard (needs a better name)
  • or, add an optional alternative of the lock itself, depending on a new EOSDriver attribute, to have a kind of "private configuration" mode. Meaning having the exact same lock mechanism than JunOS: if there is a pending session => lock, even if the config_session is the one from self.config_session.

@kpetremann
Copy link
Contributor Author

kpetremann commented May 16, 2023

does it make more sense for Salt's logic to bake more checks in here (for example, call discard() and then ensure there are no config sessions at the end of that call, or call discard() and then do a compare_config() after that to ensure there are no pending changes)?

In the TACACS issue we had, the has_pending_commit/compare_config would trigger an exception as it requires permissions as well so it cannot be done at the end.

The only solution would be to add an option in Salt to check if there a pending session before calling load_config. If it fails, it would prevent Salt to call load_config. But again, this would add more calls to the device each time we update the config, only beause of EOS implementation...

@ktbyers
Copy link
Contributor

ktbyers commented May 16, 2023

@kpetremann So on Juniper we have two locking modes: a device lock (which locks/unlocks at the commit/discard level) and a session lock (session_config_lock).

On Arista we really only have a session_config_lock (i.e. the config_session is the lock). If you have this session id, then you possess the lock.

So your issue is, you are constantly reusing the config_session so you constantly have the lock. Consequently, when something fails like discard_config() you can get into potentially a bad state (i.e. there are still stale configuration changes hanging around in the candidate_config).

Is this an accurate description of the issue/problem? I am just trying to make sure I am properly understanding this.

@kpetremann
Copy link
Contributor Author

yes, exactly.

@ktbyers
Copy link
Contributor

ktbyers commented May 16, 2023

Yeah, maybe this is the best middle ground:

make this change an option => something like always_clear_session_on_discard (needs a better name)

Something like cfg_session_invalidate or force_cfg_session_invalid?

We probably should not tie it to discard (in case there are other scenarios where we need to do this invalidation)?

@kpetremann
Copy link
Contributor Author

sounds great :) I'll update the PR
thanks

@kpetremann kpetremann force-pushed the eos_always_clear_session_discard branch from 6f88980 to 08d6e51 Compare May 17, 2023 10:00
@kpetremann kpetremann force-pushed the eos_always_clear_session_discard branch from 08d6e51 to 6a4f4b7 Compare May 17, 2023 10:09
@kpetremann
Copy link
Contributor Author

@ktbyers, PR updated, please let me know if it needs further changes

@ktbyers
Copy link
Contributor

ktbyers commented May 19, 2023

Some slight changes here, but should have the same net-effect:

#1927

@ktbyers ktbyers closed this May 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants