Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconnect support doesn't appear in logs. "Error reading from pod ... context canceled" error in control-plane-ee logs #14130

Open
5 of 11 tasks
elibogomolnyi opened this issue Jun 15, 2023 · 11 comments

Comments

@elibogomolnyi
Copy link

elibogomolnyi commented Jun 15, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

We've upgraded our dev environments to 22.3.0 on EKS 1.24.
RECEPTOR_KUBE_SUPPORT_RECONNECT is enabled. These errors don't appear when RECEPTOR_KUBE_SUPPORT_RECONNECT is disabled.

Our performance tests run without problems, and everything seems to work as expected. But in this version, we started getting the following errors in logs frequently in the control-plane-ee pods:

WARNING 2023/06/14 17:12:04 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
WARNING 2023/06/14 17:12:04 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
WARNING 2023/06/14 17:12:30 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
WARNING 2023/06/14 17:12:30 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
ERROR 2023/06/14 17:12:39 [uItraDCd] Error reading from pod awx-workers/automation-job-1498-dd8ck: context canceled
ERROR 2023/06/14 17:12:42 Exceeded retries for reading stdout /tmp/receptor/awx-task-7594db4979-qf2nc/uItraDCd/stdout
WARNING 2023/06/14 17:12:43 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
WARNING 2023/06/14 17:12:43 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
WARNING 2023/06/15 09:23:01 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
WARNING 2023/06/15 09:23:01 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
ERROR 2023/06/15 09:23:20 [meuOi8Bb] Error reading from pod awx-workers/automation-job-1504-zxwd4: context canceled
ERROR 2023/06/15 09:23:20 [fugx4MT0] Error reading from pod awx-workers/automation-job-1501-h5hzj: context canceled

Our Prod environment runs on 21.10.2 and doesn't have these log errors.

To upgrade the Prod environments, we must ensure this error won't cause any issues when we have a larger workflow load. What causes this issue, can it be critical for production, and can it be fixed in future releases?

AWX version

22.3.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

yes

Web browser

Chrome

Steps to reproduce

Run workflows on AWX 22.3.0 in EKS 1.24

Expected results

No errors in the logs

Actual results

There are errors in the logs

Additional information

We use the awx-control-plane version 22.3.0 with the reverted bug in the receptor, so our version is closer to the latest.

@elibogomolnyi elibogomolnyi changed the title "Context canceled" error in control-plane-ee logs "Error reading from pod ... context canceled" error in control-plane-ee logs Jun 15, 2023
@TheRealHaoLiu
Copy link
Member

Can u provide the SHA for the awx-ee for controlplane

@TheRealHaoLiu
Copy link
Member

Error reading from pod awx-workers/automation-job-1504-zxwd4: context canceled

messages like this is probably a red herring its not actually Error reading from pod its context canceled meaning that the ctx we pass through the goroutine has been canceled which means probably workunit was release

@TheRealHaoLiu
Copy link
Member

is there a problem that you see when u see these error?

@elibogomolnyi
Copy link
Author

elibogomolnyi commented Jun 15, 2023

Hi @TheRealHaoLiu,

Can u provide the SHA for the awx-ee for controlplane

Sure, it is ff9de65fea54b38f96c95f972e3ae0e6ecce893a3b68a5422c8cdec7bf172b1c (I could not find it for the previous image, but the same problem remains with this one, that is closer to latest).

messages like this is probably a red herring its not actually Error reading from pod its context canceled meaning that the ctx we pass through the goroutine has been canceled which means probably workunit was release

What can lead to this error and what does it mean for us?

is there a problem that you see when you see these errors?

I can't see any problems in UI—just the additional errors in logs.

But I also noticed another thing that concerns me. The RECEPTOR_KUBE_SUPPORT_RECONNECT should be enabled automatically (since it is not disabled), but for some reason, I can't see any logs related to it, while, in the previous AWX version, there were logs like this:
Kubernetes version v1.24.8 is at least v1.24.8, using reconnect support

Does it mean that there is no reconnect support anymore? I try to find any mentions of this functionality by searching "reconnect" in logs, but there are no logs like this at all =(

@elibogomolnyi
Copy link
Author

elibogomolnyi commented Jun 18, 2023

@TheRealHaoLiu,

By the way, I see that reconnect support works with EKS with version 21.10.2, according to the logs, and it does not work in the versions after task/web separation. Is there any way to downgrade to this version? The reconnect support is important for us.

@elibogomolnyi
Copy link
Author

elibogomolnyi commented Jun 19, 2023

@TheRealHaoLiu, I've run some performance tests and want to summarize the results:


When I use the control plane version 22.3.0 with the reverted receptor bug with RECEPTOR_KUBE_SUPPORT_RECONNECT enabled, I get these errors:

Error reading from pod awx-workers... client rate limiter Wait returned an error: context canceled

And I don't get the message like this:
Kubernetes version v1.24.8 is at least v1.24.8, using reconnect support


When I use the control plane version 22.3.0 with the reverted receptor bug with RECEPTOR_KUBE_SUPPORT_RECONNECT disabled, I don't get the above errors, but I get other errors like:

Error streaming pod logs to stdout for pod awx-workers/automation-job-2836-9j4cs. Error: context canceled

It seems to be related to the reconnect support functionality that stopped working in the new versions.


In version 21.10.2, it seems to be ok, and I see this message in the logs:

Kubernetes version v1.24.8 is at least v1.24.8, using reconnect support

But it doesn't contain the fix you did to reconnect the receptor when the K8s master pod is down.

@elibogomolnyi elibogomolnyi changed the title "Error reading from pod ... context canceled" error in control-plane-ee logs Reconnect support doesn't appear in logs. "Error reading from pod ... context canceled" error in control-plane-ee logs Jun 19, 2023
@TheRealHaoLiu
Copy link
Member

there's some change recently that turn debug level to INFO instead of DEBUG so some of the message that you previously see will not show up @fosterseth is working on re-enabling the ability to set log level to debug #14098 and ansible/awx-operator#1444

@elibogomolnyi
Copy link
Author

there's some change recently that turn debug level to INFO instead of DEBUG so some of the message that you previously see will not show up @fosterseth is working on re-enabling the ability to set log level to debug #14098 and ansible/awx-operator#1444

That sounds like good news because it means that reconnect support might still work. What bothers us is these logs that appear when we enable (or not disable) reconnect support:
Error reading from pod awx-workers... client rate limiter Wait returned an error: context canceled.

Can it mean that reconnect support doesn't work or there are some problems with it?

@TheRealHaoLiu
Copy link
Member

context canceled. is kinda red herring essentially we pass the ctx for the work down to kube call (which we shouldnt) and when a job is canceled/deleted the main go routine didnt catch it and stop and rather we pass it down to kube and kube caught it... ctx cancel is just means something pass the signal to cancel the goroutine the error message ahead client rate limiter Wait returned an error is just misleading error message

@elibogomolnyi
Copy link
Author

elibogomolnyi commented Jun 21, 2023

context canceled. is kinda red herring essentially we pass the ctx for the work down to kube call (which we shouldnt) and when a job is canceled/deleted the main go routine didnt catch it and stop and rather we pass it down to kube and kube caught it... ctx cancel is just means something pass the signal to cancel the goroutine the error message ahead client rate limiter Wait returned an error is just misleading error message

So most probably nothing serious if no affect on AWX functionality? I'm going to run performance tests on this version with RECONNECT_SUPPORT_ENABLED.

@TheRealHaoLiu
Copy link
Member

thanks @elibogomolnyi do you see any problem beside messages in the log?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants