fix: eos config_session not cleared if discard failed #1921

kpetremann · 2023-05-15T16:55:22Z

If the abort command is refused, for instance because of a TACACS issue, the config_session is never cleared because an exception is raised by self._run_commands.

This is an issue in Saltstack case:

a pending session could contain incomplete changes
this pending session is resumed on the next load_config call causing messed up configuration

This change ensure the config_session is always cleared.

If the abort command is refused, for instance because of a TACACS issue the config_session is never cleared because an exception is raised by self._run_commands. This is an issue in Saltstack case: - a pending session could contain incomplete changes - this pending session is resumed on the next load_config call causing messed up configuration This change ensure the config_session is always cleared.

bewing · 2023-05-15T20:57:54Z

Wouldn't the session still be in use on the switch, causing future issues due to the pending session preventing a config lock?

napalm/napalm/eos/eos.py

Lines 307 to 314 in cb4845c

    
           def _lock(self): 
        
               sess = self._run_commands(["show configuration sessions"])[0]["sessions"] 
        
               if [ 
        
                   k 
        
                   for k, v in sess.items() 
        
                   if v["state"] == "pending" and k != self.config_session 
        
               ]: 
        
                   raise SessionLockedException("Session is already in use")

kpetremann · 2023-05-16T08:03:34Z

hello @bewing,
Thanks for the quick response.

Yes totally, but "only" for 24 hours as EOS will cancel any pending changes older than that.
It would make it closer to the lock implementation for JunOS:

napalm/napalm/junos/junos.py

Lines 159 to 166 in cb4845c

    
           def _lock(self): 
        
               """Lock the config DB.""" 
        
               if not self.locked: 
        
                   try: 
        
                       self.device.cu.lock() 
        
                       self.locked = True 
        
                   except JnprLockError as jle: 
        
                       raise LockError(str(jle))

It seems better to lock for 24 hours rather than having unexpected pending changes which can be applied by any future Salt state in uncontrolled manner. Knowing that such fails should not occur regularly and should be considered as incident. In our case, it was a TACACS issue which occurred during the changes in the config session (EOS checks permissions progressively for each line typed in the config session). We had multiple occurrences of this problem causing a mess in the configuration.

We could make this behavior optional adding a parameter at the init of the EOSDriver.

The other alternatives I see:

calling discard_commit each time we are about to do a change => not scalable as one could easily forget to do that in a Salt state and ending up with the same issue
add an optional alternative of the lock itself, depending on a new EOSDriver attribute, to have a kind of "private configuration" mode.

WDYT?

ktbyers · 2023-05-16T17:26:51Z

@kpetremann Why doesn't the exception get detected by Salt?

Also from you stated above, this implies Salt is reusing a NAPALM connection over very long periods of time--is that correct?

kpetremann · 2023-05-16T17:41:57Z

Hello @ktbyers,

Why doesn't the exception get detected by Salt?

When Salt proxy-minion are setup with Napalm as a driver, it uses the Napalm library to be vendor agnostic.
Even if Salt was catching the exception when it tried to discard the pending config, the only thing that Salt could do would be very specific to EOS, for instance by clearing manually the device.config_session.

The only generic alternative I see would be to do a change in Salt to make it completely drop the connection when there is an error, which seems overkill.

Also from you stated above, this implies Salt is reusing a NAPALM connection over very long periods of time--is that correct?

Yes, proxy-minions are running as a "daemon" and keep the connection active (by default, but there is an option reconnect at each request).

The proxy-minion keeps the Napalm Device object to reuse connection. It also keeps the config_session until the change is actually committed. This is to be able to load configuration without committing, then complete the config and then commit.
At least from what we can deduce from their documentation:

net.load_config can have commit=False
net.commit permits to commit pending changes

Note: we are not using commit=False, this is only to explain our understanding of Saltstack implementation of Napalm based proxy-minions.

ktbyers · 2023-05-16T18:16:24Z

I don't like the solution proposed as it is an unexpected thing to do: exception happens on discard() and then hide the configuration session (even though the config session still exists and is active).

Given Salt's very long running sessions (which are atypical) does it make more sense for Salt's logic to bake more checks in here (for example, call discard() and then ensure there are no config sessions at the end of that call, or call discard() and then do a compare_config() after that to ensure there are no pending changes)?

kpetremann · 2023-05-16T18:31:05Z

From user perspective, Napalm have inconsistent behavior at least when comparing EOS and JunOS.
In JunOS, if the discard does not work, the lock is not unlocked. This is the opposite with EOS implementation.

exception happens on discard() and then hide the configuration session (even though the config session still exists and is active).

It is not really hidden as the lock would block this. Also, even if the session still exists, it is more pending than active. I mean, if you create another session, the pending changes are not resumed. It is like private config session on JunOS.

I don't like the solution proposed

What about the other propositions I made:

make this change an option => something like always_clear_session_on_discard (needs a better name)
or, add an optional alternative of the lock itself, depending on a new EOSDriver attribute, to have a kind of "private configuration" mode. Meaning having the exact same lock mechanism than JunOS: if there is a pending session => lock, even if the config_session is the one from self.config_session.

kpetremann · 2023-05-16T18:37:53Z

does it make more sense for Salt's logic to bake more checks in here (for example, call discard() and then ensure there are no config sessions at the end of that call, or call discard() and then do a compare_config() after that to ensure there are no pending changes)?

In the TACACS issue we had, the has_pending_commit/compare_config would trigger an exception as it requires permissions as well so it cannot be done at the end.

The only solution would be to add an option in Salt to check if there a pending session before calling load_config. If it fails, it would prevent Salt to call load_config. But again, this would add more calls to the device each time we update the config, only beause of EOS implementation...

ktbyers · 2023-05-16T19:27:29Z

@kpetremann So on Juniper we have two locking modes: a device lock (which locks/unlocks at the commit/discard level) and a session lock (session_config_lock).

On Arista we really only have a session_config_lock (i.e. the config_session is the lock). If you have this session id, then you possess the lock.

So your issue is, you are constantly reusing the config_session so you constantly have the lock. Consequently, when something fails like discard_config() you can get into potentially a bad state (i.e. there are still stale configuration changes hanging around in the candidate_config).

Is this an accurate description of the issue/problem? I am just trying to make sure I am properly understanding this.

kpetremann · 2023-05-16T19:42:43Z

yes, exactly.

ktbyers · 2023-05-16T19:53:24Z

Yeah, maybe this is the best middle ground:

make this change an option => something like always_clear_session_on_discard (needs a better name)

Something like cfg_session_invalidate or force_cfg_session_invalid?

We probably should not tie it to discard (in case there are other scenarios where we need to do this invalidation)?

kpetremann · 2023-05-16T20:02:55Z

sounds great :) I'll update the PR
thanks

kpetremann · 2023-05-17T16:39:56Z

@ktbyers, PR updated, please let me know if it needs further changes

ktbyers · 2023-05-19T17:49:06Z

Some slight changes here, but should have the same net-effect:

#1927

kpetremann force-pushed the eos_always_clear_session_discard branch from 6f88980 to 08d6e51 Compare May 17, 2023 10:00

feat: add force_cfg_session_invalid to force clearing config_session

6a4f4b7

kpetremann force-pushed the eos_always_clear_session_discard branch from 08d6e51 to 6a4f4b7 Compare May 17, 2023 10:09

ktbyers approved these changes May 17, 2023

View reviewed changes

ktbyers closed this May 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: eos config_session not cleared if discard failed #1921

fix: eos config_session not cleared if discard failed #1921

kpetremann commented May 15, 2023 •

edited

Loading

bewing commented May 15, 2023 •

edited

Loading

kpetremann commented May 16, 2023

ktbyers commented May 16, 2023

kpetremann commented May 16, 2023 •

edited

Loading

ktbyers commented May 16, 2023

kpetremann commented May 16, 2023

kpetremann commented May 16, 2023 •

edited

Loading

ktbyers commented May 16, 2023

kpetremann commented May 16, 2023

ktbyers commented May 16, 2023

kpetremann commented May 16, 2023

kpetremann commented May 17, 2023

ktbyers commented May 19, 2023

fix: eos config_session not cleared if discard failed #1921

fix: eos config_session not cleared if discard failed #1921

Conversation

kpetremann commented May 15, 2023 • edited Loading

bewing commented May 15, 2023 • edited Loading

kpetremann commented May 16, 2023

ktbyers commented May 16, 2023

kpetremann commented May 16, 2023 • edited Loading

ktbyers commented May 16, 2023

kpetremann commented May 16, 2023

kpetremann commented May 16, 2023 • edited Loading

ktbyers commented May 16, 2023

kpetremann commented May 16, 2023

ktbyers commented May 16, 2023

kpetremann commented May 16, 2023

kpetremann commented May 17, 2023

ktbyers commented May 19, 2023

kpetremann commented May 15, 2023 •

edited

Loading

bewing commented May 15, 2023 •

edited

Loading

kpetremann commented May 16, 2023 •

edited

Loading

kpetremann commented May 16, 2023 •

edited

Loading