Reconnect support doesn't appear in logs. "Error reading from pod ... context canceled" error in control-plane-ee logs #14130
Comments
Can you provide the SHA of the awx-ee image used for the control plane?
Messages like this are probably a red herring; it's not actually
Is there a problem that you see when you see these errors?
Hi @TheRealHaoLiu,
Sure, it is ff9de65fea54b38f96c95f972e3ae0e6ecce893a3b68a5422c8cdec7bf172b1c. I could not find it for the previous image, but the same problem remains with this one, which is closer to the latest.
What can lead to this error, and what does it mean for us?
I can't see any problems in the UI, just the additional errors in the logs. But I also noticed something else that concerns me. RECEPTOR_KUBE_SUPPORT_RECONNECT should be enabled automatically (since it is not disabled), but for some reason I can't see any logs related to it, whereas the previous AWX version produced logs like this: Does this mean that there is no reconnect support anymore? I tried to find any mention of this functionality by searching for "reconnect" in the logs, but there are no such logs at all =(
By the way, according to the logs, reconnect support works on EKS with version 21.10.2, but it does not work in the versions after the task/web separation. Is there any way to downgrade to that version? Reconnect support is important for us.
@TheRealHaoLiu, I've run some performance tests and want to summarize the results. When I use control plane version 22.3.0 with the reverted receptor bug and RECEPTOR_KUBE_SUPPORT_RECONNECT enabled, I get these errors:
And I don't get a message like this: When I use control plane version 22.3.0 with the reverted receptor bug and RECEPTOR_KUBE_SUPPORT_RECONNECT disabled, I don't get the above errors, but I get other errors like: Error streaming pod logs to stdout for pod awx-workers/automation-job-2836-9j4cs. Error: context canceled
This seems to be related to the reconnect support functionality, which stopped working in the new versions. In version 21.10.2 it seems to be OK, and I see this message in the logs:
But that version doesn't include the fix you made to reconnect receptor when the K8s master pod goes down.
There was a recent change that set the log level to INFO instead of DEBUG, so some of the messages you saw previously will no longer show up. @fosterseth is working on re-enabling the ability to set the log level to DEBUG: #14098 and ansible/awx-operator#1444
That sounds like good news, because it means reconnect support might still work. What bothers us are the logs that appear when we enable (or don't disable) reconnect support: Could they mean that reconnect support doesn't work, or that there are some problems with it?
So it's most probably nothing serious if there's no effect on AWX functionality? I'm going to run performance tests on this version with RECONNECT_SUPPORT_ENABLED.
Thanks @elibogomolnyi, do you see any problems besides the messages in the log?
Bug Summary
We've upgraded our dev environments to 22.3.0 on EKS 1.24.
RECEPTOR_KUBE_SUPPORT_RECONNECT is enabled. These errors don't appear when RECEPTOR_KUBE_SUPPORT_RECONNECT is disabled.
Our performance tests run without problems, and everything seems to work as expected. However, in this version we started frequently getting the following errors in the logs of the control-plane-ee pods:
Our Prod environment runs on 21.10.2 and doesn't have these log errors.
Before upgrading the Prod environments, we must make sure this error won't cause any issues under a larger workflow load. What causes this issue, could it be critical for production, and can it be fixed in a future release?
AWX version
22.3.0
Select the relevant components
Installation method
kubernetes
Modifications
yes
Web browser
Chrome
Steps to reproduce
Run workflows on AWX 22.3.0 in EKS 1.24
Expected results
No errors in the logs
Actual results
There are errors in the logs
Additional information
We use awx-control-plane version 22.3.0 with the receptor bug reverted, so our version is closer to the latest.