-
Notifications
You must be signed in to change notification settings - Fork 678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CLI configuration options for teamd retry count feature #2642
Add CLI configuration options for teamd retry count feature #2642
Conversation
Add a SONiC CLI to more easily configure the retry count for port channels. This effectively acts like a wrapper around the underlying teamdctl command. Also add a python script that'll be installed into /usr/local/bin/teamd_increase_retry_count.py that will detect if the peer device likely supports this teamd feature (based on LLDP neighbor info) and increases the teamd retry count to 5 in preparation for warm upgrade. This script requires sudo to run. Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
… or set Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Scapy's IPv6 support appears to have caused some issues with older versions of scapy, which may be present on older SONiC images. Signed-off-by: Saikrishna Arcot <[email protected]>
Don't fail warm reboot if teamd retry count support doesn't happen to be present. Also use fastfast-reboot for Mellanox devices. Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Also update a failure message in the warm-reboot script if the retry count script fails. Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
scripts/fast-reboot
Outdated
/usr/local/bin/teamd_increase_retry_count.py --probe-only || TEAMD_RETRY_COUNT_PROBE_RC=$? | ||
if [[ TEAMD_RETRY_COUNT_PROBE_RC -ne 0 ]]; then | ||
debug "Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available" | ||
TEAMD_INCREASE_RETRY_COUNT=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we supposed to bail out from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the images we deploy to our internal environment, this will be changed to bail out and fail the warm-reboot if SONiC neighbors are not seen. For the public version, since we can't require using SONiC-only neighbors for warm reboot, this will only be a warning.
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
…equired or not Signed-off-by: Saikrishna Arcot <[email protected]>
This is so that in the case of the teamd retry count check failing, there's fewer changes that happen on the system (it'll fail faster). Signed-off-by: Saikrishna Arcot <[email protected]>
This doesn't need to be after the point-of-no-return, since this will detach and be sending LACPDUs on its own. Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. Posted minor concerns.
scripts/fast-reboot
Outdated
if [[ "${REBOOT_TYPE}" = "warm-reboot" || "${REBOOT_TYPE}" = "fastfast-reboot" ]]; then | ||
TEAMD_RETRY_COUNT_PROBE_RC=0 | ||
/usr/local/bin/teamd_increase_retry_count.py --probe-only || TEAMD_RETRY_COUNT_PROBE_RC=$? | ||
if [[ $TEAMD_RETRY_COUNT_PROBE_RC -ne 0 ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: use ${}
syntax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
/usr/local/bin/teamd_increase_retry_count.py --probe-only || TEAMD_RETRY_COUNT_PROBE_RC=$? | ||
if [[ $TEAMD_RETRY_COUNT_PROBE_RC -ne 0 ]]; then | ||
if [[ "${REQUIRE_TEAMD_RETRY_COUNT}" = "yes" ]]; then | ||
error "Could not confirm that all neighbor devices are running SONiC with the retry count feature" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you are missing an exit w/ error code here?
Wouldn't the logic detect the error state and still continue to warm-reboot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think error
does the exit as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it doesn't, not sure why I didn't see this not failing before. Added an exit with a new error code here.
scripts/fast-reboot
Outdated
error "Could not confirm that all neighbor devices are running SONiC with the retry count feature" | ||
else | ||
debug "Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available" | ||
TEAMD_INCREASE_RETRY_COUNT=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed to be set to 0 again? The variable is initially set to 0 anyway, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, removed this line.
@@ -663,6 +688,10 @@ if [[ "$REBOOT_TYPE" = "fast-reboot" ]]; then | |||
fi | |||
fi | |||
|
|||
if [[ ( "${REBOOT_TYPE}" = "warm-reboot" || "${REBOOT_TYPE}" = "fastfast-reboot" ) && "${TEAMD_INCREASE_RETRY_COUNT}" -eq 1 ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this feature is needed even for fast-reboot?
For fast-reboot it will help avoid LAG flaps too - if the boot up is slow enough that 90s is reached before even syncd does ASIC reset call. So essentially, preventing LAG flap other than where it should really happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make decision on this later too.
if os.geteuid() != 0: | ||
log.log_error("Root privileges required for this operation", also_print_to_console=True) | ||
sys.exit(1) | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This return will never be reached after sys.exit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the extra returns.
return False | ||
portChannels = getPortChannels() | ||
if not portChannels: | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this main returning True/False? What is checking on these return values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at one point of time, I had some logic that would indicate success/failure via the True/False returns, but this isn't used at all. Removed the True/False returns.
activePortChannels = [] | ||
for portChannel in portChannels: | ||
state = portChannelTable.get(portChannel) | ||
if not state[0]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Risk of index error here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think libswsscommon returns whether the call is a success or failure in the first index of the tuple that's returned, but regardless, I added a check to see if an empty tuple/list was returned.
log.log_warning("WARNING: No peer description available via LLDP for {}; skipping".format(portName)) | ||
continue | ||
portChannelChecked = True | ||
if "SONiC" not in peerInfo["descr"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Check w/ ignore case here might be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Signed-off-by: Saikrishna Arcot <[email protected]>
Update sonic-utilities submodule pointer to include the following: * 5c9b2177 Fix issue: out of range sflow polling interval is accepted and stored in config_db ([sonic-net#2847](sonic-net/sonic-utilities#2847)) * 72ca4848 Add CLI configuration options for teamd retry count feature ([sonic-net#2642](sonic-net/sonic-utilities#2642)) * 359dfc0c [Clock] Implement clock CLI ([sonic-net#2793](sonic-net/sonic-utilities#2793)) * b316fc27 Add transceiver status CLI to show output from TRANSCEIVER_STATUS table ([sonic-net#2772](sonic-net/sonic-utilities#2772)) * dc59dbd2 Replace pickle by json ([sonic-net#2849](sonic-net/sonic-utilities#2849)) * a66f41c4 [show] replace shell=True, replace xml by lxml, replace exit by sys.exit ([sonic-net#2666](sonic-net/sonic-utilities#2666)) * 57500572 [utilities_common] replace shell=True ([sonic-net#2718](sonic-net/sonic-utilities#2718)) * 6e0ee3e7 [CRM][DASH] Extend CRM utility to support DASH resources. ([sonic-net#2800](sonic-net/sonic-utilities#2800)) * b2c29b0b [config] Generate sysinfo in single asic ([sonic-net#2856](sonic-net/sonic-utilities#2856)) Signed-off-by: dprital <[email protected]>
…nic-utilities submodule on master (#15193) Dependency: sonic-net/sonic-utilities#2718 Why I did it This PR sonic-net/sonic-utilities#2718 reduce shell=True usage in utilities_common.cli.run_command() function. Work item tracking Microsoft ADO (number only): 15022050 How I did it Replace strings commands using utilities_common.cli.run_command() function to list of strings due to circular dependency, advance sonic-utilities submodule 72ca4848 (HEAD -> master, upstream/master, upstream/HEAD) Add CLI configuration options for teamd retry count feature (sonic-net/sonic-utilities#2642) 359dfc0c [Clock] Implement clock CLI (sonic-net/sonic-utilities#2793) b316fc27 Add transceiver status CLI to show output from TRANSCEIVER_STATUS table (sonic-net/sonic-utilities#2772) dc59dbd2 Replace pickle by json (sonic-net/sonic-utilities#2849) a66f41c4 [show] replace shell=True, replace xml by lxml, replace exit by sys.exit (sonic-net/sonic-utilities#2666) 57500572 [utilities_common] replace shell=True (sonic-net/sonic-utilities#2718) 6e0ee3e7 [CRM][DASH] Extend CRM utility to support DASH resources. (sonic-net/sonic-utilities#2800) b2c29b0b [config] Generate sysinfo in single asic (sonic-net/sonic-utilities#2856)
…t#2642) * Add CLI configuration options for teamd retry count feature Add a SONiC CLI to more easily configure the retry count for port channels. This effectively acts like a wrapper around the underlying teamdctl command. Also add a python script that'll be installed into /usr/local/bin/teamd_increase_retry_count.py that will detect if the peer device likely supports this teamd feature (based on LLDP neighbor info) and increases the teamd retry count to 5 in preparation for warm upgrade. This script requires sudo to run. This is tied to sonic-net/sonic-buildimage#13453. Signed-off-by: Saikrishna Arcot <[email protected]> * Add test for error case from teamd when it's not running Signed-off-by: Saikrishna Arcot <[email protected]> * Fix up test cases Signed-off-by: Saikrishna Arcot <[email protected]> * Add some error handling if teamdctl doesn't exist Signed-off-by: Saikrishna Arcot <[email protected]> * Add probe functionality and sending current LACPDU packet functionality Signed-off-by: Saikrishna Arcot <[email protected]> * Check to see if the retry count feature is enabled before doing a get or set Signed-off-by: Saikrishna Arcot <[email protected]> * Add option to only send probe packets or only change retry count Signed-off-by: Saikrishna Arcot <[email protected]> * Call the teamd retry count script if doing a warm-reboot Signed-off-by: Saikrishna Arcot <[email protected]> * Fix pycheck errors, and disable scapy's IPv6 and verbose mode Scapy's IPv6 support appears to have caused some issues with older versions of scapy, which may be present on older SONiC images. Signed-off-by: Saikrishna Arcot <[email protected]> * Make teamd retry count support optional Don't fail warm reboot if teamd retry count support doesn't happen to be present. Also use fastfast-reboot for Mellanox devices. Signed-off-by: Saikrishna Arcot <[email protected]> * Address review comments, and restructure code to increase code coverage Signed-off-by: Saikrishna Arcot <[email protected]> * Address some review comments Signed-off-by: Saikrishna Arcot <[email protected]> * Replace tabs with spaces Signed-off-by: Saikrishna Arcot <[email protected]> * Verify that expected keys are present in the data returned from teamdctl Also update a failure message in the warm-reboot script if the retry count script fails. Signed-off-by: Saikrishna Arcot <[email protected]> * Fix TimeoutExpired undefined error Signed-off-by: Saikrishna Arcot <[email protected]> * Add ability to mock subprocess calls (at a limited level) Signed-off-by: Saikrishna Arcot <[email protected]> * Return an actual subprocess object, and add a test for checking timeout Signed-off-by: Saikrishna Arcot <[email protected]> * Change variable syntax Signed-off-by: Saikrishna Arcot <[email protected]> * Fix set being accessed with an index Signed-off-by: Saikrishna Arcot <[email protected]> * Add option to warm-reboot script to control if teamd retry count is required or not Signed-off-by: Saikrishna Arcot <[email protected]> * Move the teamd retry count check to before orchagent This is so that in the case of the teamd retry count check failing, there's fewer changes that happen on the system (it'll fail faster). Signed-off-by: Saikrishna Arcot <[email protected]> * Move retry count script start to be prior to point-of-no-return This doesn't need to be after the point-of-no-return, since this will detach and be sending LACPDUs on its own. Signed-off-by: Saikrishna Arcot <[email protected]> * Set executable bit Signed-off-by: Saikrishna Arcot <[email protected]> * Address PR comments Signed-off-by: Saikrishna Arcot <[email protected]> * Change to case-insensitive string contains check Signed-off-by: Saikrishna Arcot <[email protected]> * Make sure the global abort variable is used Signed-off-by: Saikrishna Arcot <[email protected]> --------- Signed-off-by: Saikrishna Arcot <[email protected]>
…nic-utilities submodule on master (sonic-net#15193) Dependency: sonic-net/sonic-utilities#2718 Why I did it This PR sonic-net/sonic-utilities#2718 reduce shell=True usage in utilities_common.cli.run_command() function. Work item tracking Microsoft ADO (number only): 15022050 How I did it Replace strings commands using utilities_common.cli.run_command() function to list of strings due to circular dependency, advance sonic-utilities submodule 72ca4848 (HEAD -> master, upstream/master, upstream/HEAD) Add CLI configuration options for teamd retry count feature (sonic-net/sonic-utilities#2642) 359dfc0c [Clock] Implement clock CLI (sonic-net/sonic-utilities#2793) b316fc27 Add transceiver status CLI to show output from TRANSCEIVER_STATUS table (sonic-net/sonic-utilities#2772) dc59dbd2 Replace pickle by json (sonic-net/sonic-utilities#2849) a66f41c4 [show] replace shell=True, replace xml by lxml, replace exit by sys.exit (sonic-net/sonic-utilities#2666) 57500572 [utilities_common] replace shell=True (sonic-net/sonic-utilities#2718) 6e0ee3e7 [CRM][DASH] Extend CRM utility to support DASH resources. (sonic-net/sonic-utilities#2800) b2c29b0b [config] Generate sysinfo in single asic (sonic-net/sonic-utilities#2856)
Add a SONiC CLI to more easily configure the retry count for port channels. This effectively acts like a wrapper around the underlying teamdctl command.
Also add a python script that'll be installed into /usr/local/bin/teamd_increase_retry_count.py that will detect if the peer device likely supports this teamd feature (based on LLDP neighbor info) and increases the teamd retry count to 5 in preparation for warm upgrade. This script requires sudo to run.
This is tied to sonic-net/sonic-buildimage#13453.
Signed-off-by: Saikrishna Arcot [email protected]
What I did
How I did it
How to verify it
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)