-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add conditional failure for 12sp5 AWS fencing #257
Conversation
failed_when: "'ERROR' in stonith_config_result.stderr" | ||
failed_when: > | ||
(ansible_distribution_major_version == '12' and 'ERROR' in stonith_config_result.stderr and 'ERROR: warning' not in stonith_config_result.stderr) or | ||
(ansible_distribution_major_version != '12' and 'ERROR' in stonith_config_result.stderr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part addresses Problem 1 (see description): it specifically checks if 12sp5 crm
outputs a warning in stderr
@@ -401,7 +403,7 @@ | |||
when: | |||
- is_primary | |||
- reg_vip_location.stdout | trim | split(' ') | last != primary_hostname | |||
until: reg_vip_location2.stdout_lines[0] | trim | split(' ') | last == primary_hostname | |||
until: reg_vip_location2.stdout_lines | length > 0 and reg_vip_location2.stdout_lines[0] | trim | split(' ') | last == primary_hostname |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This addresses Problem 3: it adds a check for output length, so the task retries if the output is not yet ready instead of immediately failing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
3a0aa67
to
1299a60
Compare
1299a60
to
90dd2d7
Compare
@@ -222,6 +222,16 @@ | |||
migration_threshold: "{{ (crm_rsc_options_show.stdout | regex_search('migration-threshold=([0-9]*)', '\\1'))[0] | default('false') }}" | |||
op_default_timeout: "{{ (crm_op_options_show.stdout | regex_search('timeout=([0-9]*)', '\\1'))[0] | default('false') }}" | |||
|
|||
- name: Set corosync facts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- name: Set corosync facts | |
- name: Compose AWS stonith command |
op stop interval=0 timeout=180 \ | ||
op monitor interval=120 timeout=60 \ | ||
meta target-role=Started \ | ||
params tag={{ aws_stonith_tag}} pcmk_delay_max=15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
params tag={{ aws_stonith_tag}} pcmk_delay_max=15 | |
params tag={{ aws_stonith_tag }} pcmk_delay_max=15 |
- not (use_sbd | bool) | ||
- ansible_distribution_major_version == '12' | ||
register: stonith_config_result | ||
failed_when: > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
- name: Set corosync facts | ||
ansible.builtin.set_fact: | ||
aws_stonith_cmd: | | ||
crm configure primitive rsc_aws_stonith stonith:external/ec2 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from optional make static-ansible-lint
target
jinja[spacing]: Jinja2 spacing could be improved: crm configure primitive rsc_aws_stonith stonith:external/ec2 \
op start interval=0 timeout=180 \
op stop interval=0 timeout=180 \
op monitor interval=120 timeout=60 \
meta target-role=Started \
params tag={{ aws_stonith_tag}} pcmk_delay_max=15
-> crm configure primitive rsc_aws_stonith stonith:external/ec2 \
op start interval=0 timeout=180 \
op stop interval=0 timeout=180 \
op monitor interval=120 timeout=60 \
meta target-role=Started \
params tag={{ aws_stonith_tag }} pcmk_delay_max=15
(warning)
ansible/playbooks/tasks/cluster-bootstrap.yaml:227 Jinja2 template rewrite recommendation: `crm configure primitive rsc_aws_stonith stonith:external/ec2 \
op start interval=0 timeout=180 \
op stop interval=0 timeout=180 \
op monitor interval=120 timeout=60 \
meta target-role=Started \
params tag={{ aws_stonith_tag }} pcmk_delay_max=15
This PR showcases the changes needed to make 12sp5 AWS fencing work. The problem highlighted in the initial ticket was only one of the 3 interesting findings (2 fails and one weird behaviour).
Check the comments on the code to see which of the below problems each change addresses:
Problem 1 (fatal - the one mentioned in the ticket): In all SPs, many ansible tasks have a warning about "lack of quorum" in stdout which is formatted like
warning: ...
. In 12sp5, this warning is like this:ERROR: warning: ...
and it is in stderr instead of stdout. RC is 0 in all cases, and the warning can be safely ignored in all cases. It seems like crm in 12sp5 has wrong behavior.Problem 2 (non-fatal, but seemingly related): after fixing the first problem, in a later task, some INFO: ... messages that are appended to stdout in all other SPs, in 12sp5 they go to stderr. The RC is still 0, and the task passes because it relies on RC and not on stdout/err like the first case. But it seems weird that simple INFO messages are printed to stderr, and only in 12sp5.
Problem 3 (fatal - a new one, the problem was faulty qe-sap-deployment code): the task
Wait for move IP move to complete
has some code that is supposed to be retried. The code is like this:This will fail if the first try doesn't get the correct result, due to not being ready. The failure is
error while evaluating conditional (reg_vip_location2.stdout_lines[0] | trim | split(' ') | last == primary_hostname): list object has no element 0
This was fixed by changing the
until
tountil: reg_vip_location2.stdout_lines | length > 0 and reg_vip_location2.stdout_lines[0] | trim | split(' ') | last == primary_hostname
, which is, adding an extra check for the length of the output list.
All in all, test was fixed, but the troubles with non-error output going to stderr may be a bug. You can see problem 1 in the jobs listed in the ticket, problems 2 and 3 can be seen in the ansible output of https://openqaworker15.qa.suse.cz/tests/292671 (look for
TASK [Move cluster IP to primary HANA node]
for problem 2TASK [Wait for move IP move to complete]
for problem 3)