
Regression in error handling when ansible-runner install fails in traceback #13494

Closed
AlanCoding opened this issue Jan 31, 2023 · 4 comments
@AlanCoding
Member

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API
SUMMARY

The setup here is a case where running any ansible-runner command results in a traceback. This is known to have happened in the past due to pip issues. When this happens, we expect the traceback to appear under result_traceback in the API details. We are not getting that now, because an unrelated (novel) error is masking it.
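To illustrate the masking in the abstract (a minimal, hypothetical sketch, not AWX code — `gather_receptor_output` and `report_failure` are names I made up): when the code that gathers error details itself raises, the original traceback never reaches result_traceback.

```python
def gather_receptor_output():
    # Hypothetical stand-in for reading receptor work results;
    # it fails the way a non-blocking socket read can.
    raise BlockingIOError("read would block")

def report_failure(original_exc):
    # We want the original traceback surfaced, but the
    # secondary error below masks it.
    try:
        return gather_receptor_output()
    except Exception as exc:
        return f"masked by secondary error: {exc!r}"

try:
    raise Exception("alan!")  # the real failure, as in the repro below
except Exception as exc:
    result_traceback = report_failure(exc)

print(result_traceback)
```

The real failure ("alan!") is gone from the output; only the secondary error survives.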

STEPS TO REPRODUCE

I made the following change to ansible-runner to simulate an install failure.

diff --git a/ansible_runner/streaming.py b/ansible_runner/streaming.py
index 1a64be7..27fe008 100644
--- a/ansible_runner/streaming.py
+++ b/ansible_runner/streaming.py
@@ -71,6 +71,7 @@ class Worker(object):
 
         self.kwargs = kwargs
         self.job_kwargs = None
+        raise Exception('alan!')
 
         private_data_dir = kwargs.get('private_data_dir')
         if private_data_dir is None:
EXPECTED RESULTS

I expect to see the same output that `receptorctl work results <unit_id>` gives:

bash-5.1$ receptorctl work results zYiQtRNL
Warning: receptorctl and receptor are different versions, they may not be compatible
Traceback (most recent call last):
  File "/awx_devel/testing/ansible-runner/ansible_runner/__main__.py", line 864, in main
    res = run(**run_options)
  File "/awx_devel/testing/ansible-runner/ansible_runner/interface.py", line 210, in run
    r = init_runner(**kwargs)
  File "/awx_devel/testing/ansible-runner/ansible_runner/interface.py", line 103, in init_runner
    stream_worker = Worker(**kwargs)
  File "/awx_devel/testing/ansible-runner/ansible_runner/streaming.py", line 74, in __init__
    raise Exception('alan!')
Exception: alan!
ERROR: Remote unit failed: exit status 1

to be shown under `result_traceback` in `/api/v2/project_updates/N/` in my case.

ACTUAL RESULTS

Job details look like this:

[Screenshot from 2023-01-31 15-34-28: job details]

ADDITIONAL INFORMATION

This appears to be a new regression from #12961.

If I revert that partially...

diff --git a/awx/main/tasks/receptor.py b/awx/main/tasks/receptor.py
index 94568ebd6c..47503314f6 100644
--- a/awx/main/tasks/receptor.py
+++ b/awx/main/tasks/receptor.py
@@ -431,10 +431,8 @@ class AWXReceptorJob:
                         # if receptor work unit failed and no events were emitted, work results may
                         # contain useful information about why the job failed. In case stdout is
                         # massive, only ask for last 1000 bytes
-                        startpos = max(stdout_size - 1000, 0)
-                        resultsock, resultfile = receptor_ctl.get_work_results(self.unit_id, startpos=startpos, return_socket=True, return_sockfile=True)
-                        resultsock.setblocking(False)  # this makes resultfile reads non blocking
-                        lines = resultfile.readlines()
+                        resultsock = receptor_ctl.get_work_results(self.unit_id, return_sockfile=True)
+                        lines = resultsock.readlines()
                         receptor_output = b"".join(lines).decode()
                     if receptor_output:
                         self.task.runner_callback.delay_update(result_traceback=receptor_output)

it gives the desired data:

[Screenshot from 2023-01-31 15-48-46: job details with result_traceback populated]

@AlanCoding
Member Author

If I just remove the line `resultsock.setblocking(False)`, it turns out that fixes it.

Of course, removing the non-blocking call would not be a valid solution in practice, since the read could then hang indefinitely. However, we could use Python signals to set an alarm, or something along those lines.
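For reference, the failure mode of a non-blocking read is easy to show in isolation (a sketch with a plain pipe, not the receptor socket itself):

```python
import os

# A non-blocking fd with no data ready fails immediately with
# BlockingIOError (EAGAIN) instead of waiting -- the same class of
# error that can fire when readlines() runs before the peer has
# written anything.
r, w = os.pipe()
os.set_blocking(r, False)
try:
    os.read(r, 1024)
    outcome = "read returned"
except BlockingIOError:
    outcome = "BlockingIOError: read would block"
os.close(r)
os.close(w)
print(outcome)
```

With no data buffered, the read raises rather than returning empty output, so any caller that doesn't handle `BlockingIOError` blows up.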

@AlanCoding
Member Author

The approach with signal.alarm seems to work:

                        def last_ditch(signum, frame):
                            logger.warning(f'Failed to read results from receptor for unit_id={self.unit_id}')

                        signal.signal(signal.SIGALRM, last_ditch)
                        signal.alarm(1)
                        lines = resultfile.readlines()
                        signal.alarm(0)
                        receptor_output = b"".join(lines).decode()

But for one, I can't yet figure out a way to test this. Secondly, this doesn't have proper error handling to restore the prior signal state.
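A sketch of what the missing cleanup could look like (`read_with_timeout` is a hypothetical helper of my own, not AWX code): save the previous SIGALRM handler, restore it in a finally block, and have the handler raise so the blocked read is actually interrupted.

```python
import os
import signal

class ReadTimeout(Exception):
    """Raised when the results read exceeds the deadline."""

def _on_alarm(signum, frame):
    # Raising here interrupts the blocked read; a handler that only
    # logs would let the system call be retried and keep blocking.
    raise ReadTimeout

def read_with_timeout(readfn, timeout=1):
    # Hypothetical helper: run readfn(), give up after `timeout`
    # seconds, and always restore the prior SIGALRM state.
    prev = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout)
    try:
        return readfn()
    except ReadTimeout:
        return []
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, prev)

# Demo: a blocking read from an empty pipe gives up after ~1 second.
r, w = os.pipe()
lines = read_with_timeout(lambda: [os.read(r, 1024)], timeout=1)
os.close(r)
os.close(w)
print(lines)
```

Note that `signal.alarm` only works in the main thread of the main interpreter, which is a real constraint to check for dispatcher code.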

@AlanCoding
Member Author

From further testing, I've got some new info on this:

The error does not seem to happen when gathering error information from container_group jobs. So it is not expected that Kubernetes-based AWX installs would face this.

Separately, I tested both local work (like project updates) and work on execution nodes. In both cases, we hit this traceback. So that appears to be the critical difference: we get a different kind of socket in the VM case versus the Kubernetes case. We still need to fix this.

@AlanCoding
Member Author

Ping @jay-steurer this is the original issue report, but we have it in Jira so I will close this as I merge the fix.
