Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conditional failure for 12sp5 AWS fencing #257

Merged
merged 1 commit into from
Aug 8, 2024

Conversation

BillAnastasiadis
Copy link
Collaborator

@BillAnastasiadis BillAnastasiadis commented Jul 30, 2024

This PR showcases the changes needed to make 12sp5 AWS fencing work. The problem highlighted in the initial ticket was only one of the 3 interesting findings (2 fails and one weird behaviour).

Check the comments on the code to see which of the below problems each change addresses:

  • Problem 1 (fatal - the one mentioned in the ticket): In all SPs, many ansible tasks have a warning about "lack of quorum" in stdout which is formatted like warning: .... In 12sp5, this warning is like this: ERROR: warning: ... and it is in stderr instead of stdout. RC is 0 in all cases, and the warning can be safely ignored in all cases. It seems like crm in 12sp5 has wrong behavior.

  • Problem 2 (non-fatal, but seemingly related): after fixing the first problem, in a later task, some INFO: ... messages that are appended to stdout in all other SPs, in 12sp5 they go to stderr. The RC is still 0, and the task passes because it relies on RC and not on stdout/err like the first case. But it seems weird that simple INFO messages are printed to stderr, and only in 12sp5.

  • Problem 3 (fatal - a new one, the problem was faulty qe-sap-deployment code): the task Wait for move IP move to complete has some code that is supposed to be retried. The code is like this:

crm resource locate (...)
register: reg_vip_location2
until: reg_vip_location2.stdout_lines[0] | trim | split(' ') | last == primary_hostname

This will fail if the first try doesn't get the correct result, due to not being ready. The failure is
error while evaluating conditional (reg_vip_location2.stdout_lines[0] | trim | split(' ') | last == primary_hostname): list object has no element 0

This was fixed by changing the until to
until: reg_vip_location2.stdout_lines | length > 0 and reg_vip_location2.stdout_lines[0] | trim | split(' ') | last == primary_hostname
, which is, adding an extra check for the length of the output list.

All in all, test was fixed, but the troubles with non-error output going to stderr may be a bug. You can see problem 1 in the jobs listed in the ticket, problems 2 and 3 can be seen in the ansible output of https://openqaworker15.qa.suse.cz/tests/292671 (look for TASK [Move cluster IP to primary HANA node] for problem 2 TASK [Wait for move IP move to complete] for problem 3)

failed_when: "'ERROR' in stonith_config_result.stderr"
failed_when: >
(ansible_distribution_major_version == '12' and 'ERROR' in stonith_config_result.stderr and 'ERROR: warning' not in stonith_config_result.stderr) or
(ansible_distribution_major_version != '12' and 'ERROR' in stonith_config_result.stderr)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part addresses Problem 1 (see description): it specifically checks if 12sp5 crm outputs a warning in stderr

@@ -401,7 +403,7 @@
when:
- is_primary
- reg_vip_location.stdout | trim | split(' ') | last != primary_hostname
until: reg_vip_location2.stdout_lines[0] | trim | split(' ') | last == primary_hostname
until: reg_vip_location2.stdout_lines | length > 0 and reg_vip_location2.stdout_lines[0] | trim | split(' ') | last == primary_hostname
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This addresses Problem 3: it adds a check for output length, so the task retries if the output is not yet ready instead of immediately failing

Copy link
Collaborator

@alvarocarvajald alvarocarvajald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@lpalovsky lpalovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BillAnastasiadis BillAnastasiadis force-pushed the aws_fencing branch 9 times, most recently from 3a0aa67 to 1299a60 Compare August 7, 2024 15:14
@@ -222,6 +222,16 @@
migration_threshold: "{{ (crm_rsc_options_show.stdout | regex_search('migration-threshold=([0-9]*)', '\\1'))[0] | default('false') }}"
op_default_timeout: "{{ (crm_op_options_show.stdout | regex_search('timeout=([0-9]*)', '\\1'))[0] | default('false') }}"

- name: Set corosync facts
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: Set corosync facts
- name: Compose AWS stonith command

op stop interval=0 timeout=180 \
op monitor interval=120 timeout=60 \
meta target-role=Started \
params tag={{ aws_stonith_tag}} pcmk_delay_max=15
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
params tag={{ aws_stonith_tag}} pcmk_delay_max=15
params tag={{ aws_stonith_tag }} pcmk_delay_max=15

- not (use_sbd | bool)
- ansible_distribution_major_version == '12'
register: stonith_config_result
failed_when: >
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect

Copy link
Collaborator

@mpagot mpagot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

- name: Set corosync facts
ansible.builtin.set_fact:
aws_stonith_cmd: |
crm configure primitive rsc_aws_stonith stonith:external/ec2 \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from optional make static-ansible-lint target

jinja[spacing]: Jinja2 spacing could be improved: crm configure primitive rsc_aws_stonith stonith:external/ec2 \
op start interval=0 timeout=180 \
op stop interval=0 timeout=180 \
op monitor interval=120 timeout=60 \
meta target-role=Started \
params tag={{ aws_stonith_tag}} pcmk_delay_max=15
 -> crm configure primitive rsc_aws_stonith stonith:external/ec2 \
op start interval=0 timeout=180 \
op stop interval=0 timeout=180 \
op monitor interval=120 timeout=60 \
meta target-role=Started \
params tag={{ aws_stonith_tag }} pcmk_delay_max=15
 (warning)
ansible/playbooks/tasks/cluster-bootstrap.yaml:227 Jinja2 template rewrite recommendation: `crm configure primitive rsc_aws_stonith stonith:external/ec2 \
op start interval=0 timeout=180 \
op stop interval=0 timeout=180 \
op monitor interval=120 timeout=60 \
meta target-role=Started \
params tag={{ aws_stonith_tag }} pcmk_delay_max=15

@BillAnastasiadis BillAnastasiadis changed the title WIP: Add conditional failure for 12sp5 AWS fencing Add conditional failure for 12sp5 AWS fencing Aug 8, 2024
@mpagot mpagot merged commit 439254a into SUSE:main Aug 8, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants