-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Work results stream sometimes exits before all data has been received #597
Comments
shanemcd
added a commit
to shanemcd/receptor
that referenced
this issue
Apr 22, 2022
shanemcd
added a commit
to shanemcd/receptor
that referenced
this issue
Apr 25, 2022
I just tried to reproduce ansible/ansible-runner#998 (comment) using the latest binary |
6 tasks
related PR: #600 ? |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Background
We've seen several reports of AWX jobs entering the "error" state even though the playbook actually runs to completion.
See:
There may be more. If I find them I'll update the list above.
The problem
AWX uses Receptor to launch remote work. We utilize ansible-runner's streaming interface to convert both the job's payload and results into a format that can be delivered over a network.
The flow is something like this:
socketpair
Now with this context, let's take another look at ansible/ansible-runner#998. You can see that for some reason, the data it's trying to decode is not valid base64 data.
With that in mind, I added some debug logging to see what was going on when ansible-runner blew up. What I found was that subsequent reads from the socket were returning an empty byte string:
If you go and try it out, this is simply the behavior when trying to read from a
socket.makefile()
after the writer has closed the connection.So, why is the connection getting closed?
You can see that Receptor closes the connection when it thought all results had been written to a channel whose contents were being streamed back to the client:
receptor/pkg/workceptor/controlsvc.go
Lines 427 to 436 in 5f89364
If we look inside
GetResults
, it's a goroutine writing to the channel being read and sent to the client. The loop ejects when it sees an EOF event, if the work unit is complete, it considers the stream to finished too:receptor/pkg/workceptor/workceptor.go
Line 568 in 5f89364
Let's add some debug logging to inspect the variables when this condition hits:
For a job that blew up, I saw:
Wait... why is
filePos
less thanstdoutSize
?If you look closely, the file size is obtained directly above the suspect conditional:
receptor/pkg/workceptor/workceptor.go
Line 567 in a96b855
So now we're getting to the root of the problem:
Another potential bug that may be another race condition waiting to bite us is the fact that we make 2 calls to
unit.Status()
on the same linereceptor/pkg/workceptor/workceptor.go
Line 568 in a96b855
Putting all of these pieces together, the fix looks something like this:
The text was updated successfully, but these errors were encountered: