
Regression in error handling when ansible-runner install fails in traceback #13494

Closed
AlanCoding opened this issue Jan 31, 2023 · 4 comments
@AlanCoding
Member

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API
SUMMARY

The setup here is a case where running any ansible-runner command results in a traceback. This is known to have happened in the past due to pip issues. When this happens, we expect the traceback to appear under result_traceback in the API details. We are not getting that now, because an unrelated (novel) error is masking it.
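To illustrate the masking in the abstract (a minimal, hypothetical sketch, not AWX code — `gather_receptor_output` and `report_failure` are names I made up): when the code that gathers error details itself raises, the original traceback never reaches result_traceback.

```python
def gather_receptor_output():
    # Hypothetical stand-in for reading receptor work results;
    # it fails the way a non-blocking socket read can.
    raise BlockingIOError("read would block")

def report_failure(original_exc):
    # We want the original traceback surfaced, but the
    # secondary error below masks it.
    try:
        return gather_receptor_output()
    except Exception as exc:
        return f"masked by secondary error: {exc!r}"

try:
    raise Exception("alan!")  # the real failure, as in the repro below
except Exception as exc:
    result_traceback = report_failure(exc)

print(result_traceback)
```

The real failure ("alan!") is gone from the output; only the secondary error survives.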

STEPS TO REPRODUCE

I made the following change to ansible-runner to simulate an install failure.

diff --git a/ansible_runner/streaming.py b/ansible_runner/streaming.py
index 1a64be7..27fe008 100644
--- a/ansible_runner/streaming.py
+++ b/ansible_runner/streaming.py
@@ -71,6 +71,7 @@ class Worker(object):
 
         self.kwargs = kwargs
         self.job_kwargs = None
+        raise Exception('alan!')
 
         private_data_dir = kwargs.get('private_data_dir')
         if private_data_dir is None:
EXPECTED RESULTS

I expect to see the same output that `receptorctl work results <unit_id>` gives:

bash-5.1$ receptorctl work results zYiQtRNL
Warning: receptorctl and receptor are different versions, they may not be compatible
Traceback (most recent call last):
  File "/awx_devel/testing/ansible-runner/ansible_runner/__main__.py", line 864, in main
    res = run(**run_options)
  File "/awx_devel/testing/ansible-runner/ansible_runner/interface.py", line 210, in run
    r = init_runner(**kwargs)
  File "/awx_devel/testing/ansible-runner/ansible_runner/interface.py", line 103, in init_runner
    stream_worker = Worker(**kwargs)
  File "/awx_devel/testing/ansible-runner/ansible_runner/streaming.py", line 74, in __init__
    raise Exception('alan!')
Exception: alan!
ERROR: Remote unit failed: exit status 1

to be shown under `result_traceback` in `/api/v2/project_updates/N/` in my case.

ACTUAL RESULTS

Job details look like this:

[Screenshot from 2023-01-31 15-34-28: job details]

ADDITIONAL INFORMATION

This appears to be a new regression from #12961.

If I revert that partially...

diff --git a/awx/main/tasks/receptor.py b/awx/main/tasks/receptor.py
index 94568ebd6c..47503314f6 100644
--- a/awx/main/tasks/receptor.py
+++ b/awx/main/tasks/receptor.py
@@ -431,10 +431,8 @@ class AWXReceptorJob:
                         # if receptor work unit failed and no events were emitted, work results may
                         # contain useful information about why the job failed. In case stdout is
                         # massive, only ask for last 1000 bytes
-                        startpos = max(stdout_size - 1000, 0)
-                        resultsock, resultfile = receptor_ctl.get_work_results(self.unit_id, startpos=startpos, return_socket=True, return_sockfile=True)
-                        resultsock.setblocking(False)  # this makes resultfile reads non blocking
-                        lines = resultfile.readlines()
+                        resultsock = receptor_ctl.get_work_results(self.unit_id, return_sockfile=True)
+                        lines = resultsock.readlines()
                         receptor_output = b"".join(lines).decode()
                     if receptor_output:
                         self.task.runner_callback.delay_update(result_traceback=receptor_output)

it gives the desired data:

[Screenshot from 2023-01-31 15-48-46: job details with result_traceback populated]

@AlanCoding
Member Author

If I just remove the line `resultsock.setblocking(False)`, it turns out that fixes it.

Of course, removing the non-blocking call would not be a valid solution in practice, since the read could then hang indefinitely. However, we could use Python signals to set an alarm, or something along those lines.
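For reference, the failure mode of a non-blocking read is easy to show in isolation (a sketch with a plain pipe, not the receptor socket itself):

```python
import os

# A non-blocking fd with no data ready fails immediately with
# BlockingIOError (EAGAIN) instead of waiting -- the same class of
# error that can fire when readlines() runs before the peer has
# written anything.
r, w = os.pipe()
os.set_blocking(r, False)
try:
    os.read(r, 1024)
    outcome = "read returned"
except BlockingIOError:
    outcome = "BlockingIOError: read would block"
os.close(r)
os.close(w)
print(outcome)
```

With no data buffered, the read raises rather than returning empty output, so any caller that doesn't handle `BlockingIOError` blows up.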

@AlanCoding
Member Author

The approach with signal.alarm seems to work:

                        def last_ditch(signum, frame):
                            logger.warning(f'Failed to read results from receptor for unit_id={self.unit_id}')

                        signal.signal(signal.SIGALRM, last_ditch)
                        signal.alarm(1)
                        lines = resultfile.readlines()
                        signal.alarm(0)
                        receptor_output = b"".join(lines).decode()

But for one, I can't yet figure out a way to test this. Secondly, this doesn't have proper error handling to restore the prior signal state.
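A sketch of what the missing cleanup could look like (`read_with_timeout` is a hypothetical helper of my own, not AWX code): save the previous SIGALRM handler, restore it in a finally block, and have the handler raise so the blocked read is actually interrupted.

```python
import os
import signal

class ReadTimeout(Exception):
    """Raised when the results read exceeds the deadline."""

def _on_alarm(signum, frame):
    # Raising here interrupts the blocked read; a handler that only
    # logs would let the system call be retried and keep blocking.
    raise ReadTimeout

def read_with_timeout(readfn, timeout=1):
    # Hypothetical helper: run readfn(), give up after `timeout`
    # seconds, and always restore the prior SIGALRM state.
    prev = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout)
    try:
        return readfn()
    except ReadTimeout:
        return []
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, prev)

# Demo: a blocking read from an empty pipe gives up after ~1 second.
r, w = os.pipe()
lines = read_with_timeout(lambda: [os.read(r, 1024)], timeout=1)
os.close(r)
os.close(w)
print(lines)
```

Note that `signal.alarm` only works in the main thread of the main interpreter, which is a real constraint to check for dispatcher code.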

@AlanCoding
Member Author

From further testing, I've got some new info on this:

The error does not seem to happen when gathering error information from container_group jobs. So it is not expected that Kubernetes-based AWX installs would face this.

Separately, I tested both local work (like project updates) and work on execution nodes. In both cases, we hit this traceback. So that appears to be the critical difference: we get a different kind of socket in the VM case versus the Kubernetes case. We still need to fix this.

@AlanCoding
Member Author

Ping @jay-steurer this is the original issue report, but we have it in Jira so I will close this as I merge the fix.
