AWXBackup fails with WebSocketConnectionClosedException: Connection to remote host was lost. on AKS #1435
Comments
The reported workaround doesn't actually work: the pod no longer dies, but the job never ends.
pkill is not available in the container. I managed to make the backup complete successfully by killing the keepalive job with kill %1:
AWXBackup status:
The backup took around 16m and exited successfully, and no tasks failed.
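The snippet from this comment did not survive on this page, so here is a minimal sketch of the general keepalive pattern being described. It is not the commenter's actual code. The function name run_with_keepalive, the "keepalive" message, and the choice to write to stderr are all assumptions made for illustration. It stops the background loop by PID ($!) rather than with kill %1, so it does not depend on job specs being available in the container's shell.

```shell
#!/bin/sh
# Hypothetical helper (not from the awx-operator source): run a long command
# while a background loop emits periodic output, so an exec stream that is
# closed when idle keeps seeing traffic.
# Usage: run_with_keepalive <interval-seconds> <command> [args...]
run_with_keepalive() {
    rwk_interval="$1"
    shift

    # Background loop: print a marker to stderr every $rwk_interval seconds.
    # sleep gets its streams redirected so that, if it outlives the loop,
    # it does not hold the caller's streams open.
    (
        while true; do
            sleep "$rwk_interval" </dev/null >/dev/null 2>&1
            echo "keepalive" >&2
        done
    ) &
    rwk_pid=$!

    # Run the real command in the foreground and remember its exit status.
    "$@"
    rwk_status=$?

    # Stop the loop explicitly; if it keeps running, the calling job may
    # never finish even though the command itself has completed.
    kill "$rwk_pid" 2>/dev/null
    wait "$rwk_pid" 2>/dev/null

    return "$rwk_status"
}
```

The long-running dump command would be passed as the arguments after the interval, for example run_with_keepalive 60 followed by the command. Stopping the loop after the command returns is the important part: a keepalive loop that is left running matches the symptom reported above, where the pod stays alive but the job never ends.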
Thanks for posting your workaround!
We recently fixed a similar issue in the migration from the old database to the new one. We will try to incorporate your fix into the playbook. If you have time, would you open a PR for this?
https://github.com/ansible/awx-operator/blob/devel/roles/installer/tasks/migrate_data.yml Our implementation is here. I recall we were seeing zombie processes with something similar to your implementation.
I took another look at the backup role. It seems we use a separate db-mgmt pod to run the dump command, so I think we don't need to worry about zombie processes.
Please confirm the following
Bug Summary
When I create an awxbackup CR on AKS, the backup goes into a restart loop: the awxbackup-db-management pod dies after about 5 minutes.
To troubleshoot the problem, I created a custom version of the awx-operator and updated the task "Write pg_dump to backup on PVC" in roles/backup/tasks/postgres.yml by commenting out the failed_when statement, then created an awxbackup CR and got this in the operator logs:
The problem is very likely related to similar problems other users have on AKS, where exec connections get closed after ~5 minutes (ansible/awx#12530 (comment)). That is also the reason AWX_RUNNER_KEEPALIVE_SECONDS was implemented in AWX.
I'm testing the following workaround, and the pod is no longer dying after 5 minutes:
Is there a better way to handle this problem that I am not considering, or is there something like AWX_RUNNER_KEEPALIVE_SECONDS that we can set in the operator? Or is this actually a bug?
AWX Operator version
2.0.1
AWX version
22.1.0
Kubernetes platform
other (please specify in additional information)
Kubernetes/Platform version
AKS
Modifications
no
Steps to reproduce
Expected results
The db-management pod shouldn't restart in a loop, and the backup should succeed.
Actual results
The db-management pod dies after ~5 minutes.
Additional information
No response
Operator Logs
No response