Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[crmsh-4.5] Fix: Raise an exception as a rapid return of ssh-related operations to prevent hang (bsc#1228899) #1508

Merged

Conversation

liangxin1300
Copy link
Collaborator

@liangxin1300 liangxin1300 commented Aug 7, 2024

Problem

In some environments with a network firewall dropping ssh packets, any ssh operation could hang.

Solution

  • Add the 'core.no_ssh' option in crm.conf.
    "no" is the default.
    "yes" means all ssh-related operations are disabled and will directly return on purpose.
  • Define a 'utils.NoSSHError' exception and an ssh wrapper function. Raise the exception when no_ssh is yes.
  • Catch utils.NoSSHError in main so that SSH-related commands will log a general error message: ERROR: ssh-related operations are disabled. crm report works in local mode.
  • Catch 'utils.NoSSHError' when running certain commands that call ssh, and provide a more user-friendly message.
  • upgradeutil, which is called regardless of the command being called, will return without calling ssh, when no_ssh is yes.
  • Other commands that are not related to SSH will continue to function normally.

SSH related commands

  • crm cluster stop --all
  • crm cluster start --all
  • crm report

Copy link

codecov bot commented Aug 7, 2024

Codecov Report

Attention: Patch coverage is 66.00000% with 17 lines in your changes missing coverage. Please review.

Project coverage is 66.20%. Comparing base (d44b777) to head (e87d325).
Report is 52 commits behind head on crmsh-4.5.

Files Patch % Lines
crmsh/ui_cluster.py 46.66% 8 Missing ⚠️
crmsh/report/collect.py 0.00% 4 Missing ⚠️
crmsh/main.py 33.33% 2 Missing ⚠️
crmsh/report/core.py 60.00% 2 Missing ⚠️
crmsh/utils.py 92.85% 1 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##           crmsh-4.5    #1508       +/-   ##
==============================================
+ Coverage      52.41%   66.20%   +13.78%     
==============================================
  Files             78       78               
  Lines          25123    25448      +325     
==============================================
+ Hits           13169    16848     +3679     
+ Misses         11954     8600     -3354     
Flag Coverage Δ
integration 52.44% <58.00%> (?)
unit 47.60% <40.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@liangxin1300 liangxin1300 changed the title Fix: Raise an exception if current environment disable ssh [crmsh-4.5] Fix: Raise an exception if current environment disable ssh Aug 8, 2024
@liangxin1300 liangxin1300 changed the title [crmsh-4.5] Fix: Raise an exception if current environment disable ssh [crmsh-4.5] Fix: Raise an exception if current environment disable ssh (bsc#1228899) Aug 8, 2024
@zzhou1 zzhou1 requested a review from nicholasyang2022 August 8, 2024 02:48
Copy link
Collaborator

@nicholasyang2022 nicholasyang2022 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If there are core.hosts in crm.conf, _guess_user_for_ssh will not be called. So this patch will not respect no_ssh in crm.conf if there are core.hosts. Please check no_ssh in UserOfHost.user_pair_for_ssh.
  2. ssh_command() is not used by other modules. Why is it a public function?

@liangxin1300 liangxin1300 force-pushed the 20240807_testing_for_ssh_hang branch from 34cbaf9 to 8181608 Compare August 8, 2024 05:53
Default value is 'no'. When ssh between cluster nodes is blocked,
set it to 'yes' to avoid ssh hang.
@liangxin1300 liangxin1300 force-pushed the 20240807_testing_for_ssh_hang branch from 8181608 to 700f12b Compare August 8, 2024 06:27
crmsh/utils.py Outdated Show resolved Hide resolved
crmsh/utils.py Outdated Show resolved Hide resolved
@liangxin1300 liangxin1300 changed the title [crmsh-4.5] Fix: Raise an exception if current environment disable ssh (bsc#1228899) [crmsh-4.5] Fix: Raise an exception if ssh is blocked between cluster nodes (bsc#1228899) Aug 8, 2024
@liangxin1300 liangxin1300 force-pushed the 20240807_testing_for_ssh_hang branch from 700f12b to 62d3dc2 Compare August 8, 2024 06:59
When checking the status of the cluster on peers,
catch utils.NoSSHError exception
@liangxin1300 liangxin1300 force-pushed the 20240807_testing_for_ssh_hang branch from 62d3dc2 to 0682bb1 Compare August 8, 2024 07:24
@zzhou1 zzhou1 changed the title [crmsh-4.5] Fix: Raise an exception if ssh is blocked between cluster nodes (bsc#1228899) [crmsh-4.5] Fix: Raise an exception as a rapid return of ssh-related operations to prevent hang (bsc#1228899) Aug 8, 2024
@liangxin1300 liangxin1300 force-pushed the 20240807_testing_for_ssh_hang branch 2 times, most recently from 709eb21 to 1cb9336 Compare August 9, 2024 02:33
@liangxin1300
Copy link
Collaborator Author

liangxin1300 commented Aug 9, 2024

Major use cases

15sp5-1:~ # ssh -o ConnectTimeout=5 15sp5-2
ssh: connect to host 15sp5-2 port 22: Connection timed out

15sp5-1:~ # cat /etc/crm/crm.conf |grep -E -A1 "^\[core\]"
[core]
no_ssh = yes

15sp5-1:~ # crm status
Status of pacemakerd: 'Pacemaker is running' (last updated 2024-08-09 10:34:16 +08:00)
Cluster Summary:
  * Stack: corosync
  * Current DC: 15sp5-2 (version 2.1.5+20221208.a3f44794f-150500.6.17.1-2.1.5+20221208.a3f44794f) - partition with quorum
  * Last updated: Fri Aug  9 10:34:16 2024
  * Last change:  Wed Aug  7 14:24:39 2024 by hacluster via crmd on 15sp5-1
  * 2 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ 15sp5-1 15sp5-2 ]

Full List of Resources:
  * No resources


15sp5-1:~ # crm corosync status
Printing ring status.
Local node ID 1
RING ID 0
	id	= 192.168.122.57
	status	= ring 0 active with no faults

15sp5-1:~ # crm configure show
node 1: 15sp5-1
node 2: 15sp5-2
property cib-bootstrap-options: \
	have-watchdog=false \
	dc-version="2.1.5+20221208.a3f44794f-150500.6.17.1-2.1.5+20221208.a3f44794f" \
	cluster-infrastructure=corosync \
	cluster-name=hacluster
rsc_defaults build-resource-defaults: \
	resource-stickiness=1

15sp5-1:~ # crm cluster stop --all
ERROR: ssh-related operations are disabled. crmsh works in local mode.
INFO: Please try 'crm cluster stop' on each node

15sp5-1:~ # crm report -S
15sp5-1: INFO: The report is saved in ./crm_report-Fri-09-Aug-2024.tar.gz
15sp5-1: INFO: Report timespan: 08/08/24 22:34:00 - 08/09/24 10:34:51
15sp5-1: INFO: Thank you for taking time to create this report.

15sp5-1:~ # crm report -n 15sp5-1
15sp5-1: INFO: The report is saved in ./crm_report-Fri-09-Aug-2024.tar.gz
15sp5-1: INFO: Report timespan: 08/08/24 22:34:00 - 08/09/24 10:34:59
15sp5-1: INFO: Thank you for taking time to create this report.

15sp5-1:~ # crm report
15sp5-1: ERROR: ssh-related operations are disabled. crm report works in local mode.
15sp5-1: INFO: The report is saved in ./crm_report-Fri-09-Aug-2024.tar.gz
15sp5-1: INFO: Report timespan: 08/08/24 22:35:00 - 08/09/24 10:35:12
15sp5-1: INFO: Thank you for taking time to create this report.

15sp5-1:~ # crm cluster start --all
INFO: The cluster stack already started on 15sp5-1
ERROR: ssh-related operations are disabled. crmsh works in local mode.
INFO: Please try 'crm cluster start' on each node

15sp5-1:~ # crm cluster stop
INFO: The cluster stack stopped on 15sp5-1

15sp5-1:~ # crm cluster start --all
ERROR: ssh-related operations are disabled. crmsh works in local mode.
INFO: Please try 'crm cluster start' on each node

crmsh/main.py Outdated
@@ -367,6 +367,9 @@ def run():
else:
upgradeutil.upgrade_if_needed()
return main_input_loop(context, user_args)
except utils.NoSSHError as msg:
logger.error(msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.error(msg)
logger.error('%s', msg)

liangxin1300 added a commit that referenced this pull request Aug 29, 2024
…operations to prevent hang (bsc#1228899) (#1521)

port #1508 , also add functional test for ssh blocking case
liangxin1300 added a commit that referenced this pull request Aug 29, 2024
liangxin1300 added a commit that referenced this pull request Sep 2, 2024
…o prevent hang (bsc#1228899) (#1520)

port #1508 , also add functional test for ssh blocking case
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants