-
Notifications
You must be signed in to change notification settings - Fork 563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: eos config_session not cleared if discard failed #1921
fix: eos config_session not cleared if discard failed #1921
Conversation
If the abort command is refused, for instance because of a TACACS issue the config_session is never cleared because an exception is raised by self._run_commands. This is an issue in Saltstack case: - a pending session could contain incomplete changes - this pending session is resumed on the next load_config call causing messed up configuration This change ensure the config_session is always cleared.
Wouldn't the session still be in use on the switch, causing future issues due to the pending session preventing a config lock? Lines 307 to 314 in cb4845c
|
hello @bewing, Yes totally, but "only" for 24 hours as EOS will cancel any pending changes older than that. Lines 159 to 166 in cb4845c
It seems better to lock for 24 hours rather than having unexpected pending changes which can be applied by any future Salt state in uncontrolled manner. Knowing that such fails should not occur regularly and should be considered as incident. In our case, it was a TACACS issue which occurred during the changes in the config session (EOS checks permissions progressively for each line typed in the config session). We had multiple occurrences of this problem causing a mess in the configuration. We could make this behavior optional adding a parameter at the init of the EOSDriver. The other alternatives I see:
WDYT? |
@kpetremann Why doesn't the exception get detected by Salt? Also from you stated above, this implies Salt is reusing a NAPALM connection over very long periods of time--is that correct? |
Hello @ktbyers,
When Salt proxy-minion are setup with Napalm as a driver, it uses the Napalm library to be vendor agnostic. The only generic alternative I see would be to do a change in Salt to make it completely drop the connection when there is an error, which seems overkill.
Yes, proxy-minions are running as a "daemon" and keep the connection active (by default, but there is an option reconnect at each request). The proxy-minion keeps the Napalm Device object to reuse connection. It also keeps the config_session until the change is actually committed. This is to be able to load configuration without committing, then complete the config and then commit.
Note: we are not using commit=False, this is only to explain our understanding of Saltstack implementation of Napalm based proxy-minions. |
I don't like the solution proposed as it is an unexpected thing to do: Given Salt's very long running sessions (which are atypical) does it make more sense for Salt's logic to bake more checks in here (for example, call discard() and then ensure there are no config sessions at the end of that call, or call discard() and then do a compare_config() after that to ensure there are no pending changes)? |
From user perspective, Napalm have inconsistent behavior at least when comparing EOS and JunOS.
It is not really hidden as the lock would block this. Also, even if the session still exists, it is more pending than active. I mean, if you create another session, the pending changes are not resumed. It is like private config session on JunOS.
What about the other propositions I made:
|
In the TACACS issue we had, the has_pending_commit/compare_config would trigger an exception as it requires permissions as well so it cannot be done at the end. The only solution would be to add an option in Salt to check if there a pending session before calling load_config. If it fails, it would prevent Salt to call load_config. But again, this would add more calls to the device each time we update the config, only beause of EOS implementation... |
@kpetremann So on Juniper we have two locking modes: a device lock (which locks/unlocks at the commit/discard level) and a session lock (session_config_lock). On Arista we really only have a session_config_lock (i.e. the config_session is the lock). If you have this session id, then you possess the lock. So your issue is, you are constantly reusing the config_session so you constantly have the lock. Consequently, when something fails like discard_config() you can get into potentially a bad state (i.e. there are still stale configuration changes hanging around in the candidate_config). Is this an accurate description of the issue/problem? I am just trying to make sure I am properly understanding this. |
yes, exactly. |
Yeah, maybe this is the best middle ground:
Something like We probably should not tie it to |
sounds great :) I'll update the PR |
6f88980
to
08d6e51
Compare
08d6e51
to
6a4f4b7
Compare
@ktbyers, PR updated, please let me know if it needs further changes |
Some slight changes here, but should have the same net-effect: |
If the abort command is refused, for instance because of a TACACS issue, the config_session is never cleared because an exception is raised by self._run_commands.
This is an issue in Saltstack case:
This change ensure the config_session is always cleared.