
AWXBackup fails with WebSocketConnectionClosedException: Connection to remote host was lost. on AKS #1435

Closed
Maxltrm opened this issue Jun 1, 2023 · 6 comments · Fixed by #1580

Maxltrm commented Jun 1, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

When I create an awxbackup CR on AKS, the backup goes into a restart loop: the awxbackup-db-management pod dies after about 5 minutes.

To troubleshoot the problem, I created a custom version of the awx-operator and updated the task "Write pg_dump to backup on PVC" in roles/backup/tasks/postgres.yml by commenting out the failed_when statement, then created an awxbackup CR and got this in the operator logs:

TASK [Write pg_dump to backup on PVC] ******************************************
task path: /opt/ansible/roles/backup/tasks/postgres.yml:97
fatal: [localhost]: FAILED! => changed=false, rc=1, msg: MODULE FAILURE (see stdout/stderr for the exact error)
module_stdout: (empty)
module_stderr:
/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/__init__.py:10: DeprecationWarning: The package kubernetes.client.apis is renamed and deprecated, use kubernetes.client.api instead (please note that the trailing s was removed).
  warnings.warn(
Traceback (most recent call last):
  File "/opt/ansible/.ansible/tmp/ansible-tmp-1685621049.5945885-1007-97727903579089/AnsiballZ_k8s_exec.py", line 102, in <module>
    _ansiballz_main()
  File "/opt/ansible/.ansible/tmp/ansible-tmp-1685621049.5945885-1007-97727903579089/AnsiballZ_k8s_exec.py", line 94, in _ansiballz_main
    invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
  File "/opt/ansible/.ansible/tmp/ansible-tmp-1685621049.5945885-1007-97727903579089/AnsiballZ_k8s_exec.py", line 40, in invoke_module
    runpy.run_module(mod_name='ansible_collections.kubernetes.core.plugins.modules.k8s_exec', init_globals=None, run_name='__main__', alter_sys=True)
  File "/usr/lib64/python3.8/runpy.py", line 207, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/usr/lib64/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib64/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/ansible_k8s_exec_payload_399gjeg1/ansible_k8s_exec_payload.zip/ansible_collections/kubernetes/core/plugins/modules/k8s_exec.py", line 254, in <module>
  File "/tmp/ansible_k8s_exec_payload_399gjeg1/ansible_k8s_exec_payload.zip/ansible_collections/kubernetes/core/plugins/modules/k8s_exec.py", line 248, in main
  File "/tmp/ansible_k8s_exec_payload_399gjeg1/ansible_k8s_exec_payload.zip/ansible_collections/kubernetes/core/plugins/modules/k8s_exec.py", line 210, in execute_module
  File "/usr/local/lib/python3.8/site-packages/kubernetes/stream/ws_client.py", line 192, in update
    op_code, frame = self.sock.recv_data_frame(True)
  File "/usr/local/lib/python3.8/site-packages/websocket/_core.py", line 406, in recv_data_frame
    frame = self.recv_frame()
  File "/usr/local/lib/python3.8/site-packages/websocket/_core.py", line 445, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/usr/local/lib/python3.8/site-packages/websocket/_abnf.py", line 338, in recv_frame
    self.recv_header()
  File "/usr/local/lib/python3.8/site-packages/websocket/_abnf.py", line 294, in recv_header
    header = self.recv_strict(2)
  File "/usr/local/lib/python3.8/site-packages/websocket/_abnf.py", line 373, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "/usr/local/lib/python3.8/site-packages/websocket/_core.py", line 529, in _recv
    return recv(self.sock, bufsize)
  File "/usr/local/lib/python3.8/site-packages/websocket/_socket.py", line 122, in recv
    raise WebSocketConnectionClosedException(
websocket._exceptions.WebSocketConnectionClosedException: Connection to remote host was lost.
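For reference, the debugging change was minimal; the sketch below shows the shape of the modified task with the failed_when line commented out (the command shown is assumed from the workaround later in this issue, not copied from the upstream file):

- name: Write pg_dump to backup on PVC
  k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"
    pod: "{{ ansible_operator_meta.name }}-db-management"
    command: >-
      bash -c "set -e -o pipefail;
      PGPASSWORD='{{ awx_postgres_pass }}' {{ pgdump }} > {{ backup_dir }}/tower.db;
      echo 'Successful'"
  register: data_migration
  no_log: "{{ no_log }}"
  # Commented out so the raw module error surfaces in the operator log:
  # failed_when: "'Successful' not in data_migration.stdout"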

The problem looks closely related to other problems users have reported on AKS, where idle exec connections are closed after ~5 minutes (ansible/awx#12530 (comment)); that behavior is also the reason AWX_RUNNER_KEEPALIVE_SECONDS was implemented in AWX.

I'm testing the following workaround, and the pod is no longer dying after 5 minutes:

- name: Write pg_dump to backup on PVC
  k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"
    pod: "{{ ansible_operator_meta.name }}-db-management"
    command: |
      bash -c """
      set -e -o pipefail
      keepalive () {
        while true;do echo 'keepalive'; sleep 3 ;done
      }
      keepalive &
      PGPASSWORD='{{ awx_postgres_pass }}' {{ pgdump }} > {{ backup_dir }}/tower.db
      pkill -P $$          # kill all descendant PIDs
      echo 'Successful'
      """
  register: data_migration
  no_log: "{{ no_log }}"
  failed_when: "'Successful' not in data_migration.stdout"
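For context, the point of the keepalive loop is to write to stdout every few seconds so the exec websocket never sits idle long enough for the AKS load balancer to drop it; this is the same idea behind AWX_RUNNER_KEEPALIVE_SECONDS in AWX.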

Is there a better way to handle this problem that I am not considering, or is there something like AWX_RUNNER_KEEPALIVE_SECONDS that we can set in the operator? Or is this actually a bug?

AWX Operator version

2.0.1

AWX version

22.1.0

Kubernetes platform

other (please specify in additional information)

Kubernetes/Platform version

AKS

Modifications

no

Steps to reproduce

  1. Install the AWX Operator on AKS.
  2. Create an AWX instance with a database big enough to make the backup job last more than 5 minutes.
  3. Create an AWXBackup custom resource (a minimal manifest is sketched below).
  4. The db-management pod will restart in a loop every ~5 minutes.
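
For step 3, a minimal AWXBackup manifest looks like the sketch below (the names and namespace are illustrative; deployment_name must match the metadata.name of the AWX instance created in step 2):

apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-demo        # illustrative name
  namespace: awx              # assumed namespace of the AWX deployment
spec:
  deployment_name: awx-demo   # must match the AWX CR's metadata.name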

Expected results

The db-management pod shouldn't restart in a loop, and the backup should succeed.

Actual results

The db-management pod dies after ~5 minutes.

Additional information

No response

Operator Logs

No response


Maxltrm commented Jun 1, 2023

The workaround reported above doesn't actually work: the pod no longer dies, but the job never ends.


Maxltrm commented Jun 1, 2023

It turns out pkill is not available in the container. I managed to make the backup complete successfully by killing the keepalive job with kill %1 instead (%1 refers to the shell's first background job, i.e. the keepalive loop):

- name: Write pg_dump to backup on PVC
  k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"
    pod: "{{ ansible_operator_meta.name }}-db-management"
    command: |
      bash -c """
      set -e -o pipefail
      keepalive () {
        while true; do echo 'keepalive'; sleep 3; done
      }
      keepalive &    # background job %1: keeps the exec websocket from going idle
      PGPASSWORD='{{ awx_postgres_pass }}' {{ pgdump }} > {{ backup_dir }}/tower.db
      kill %1        # stop the keepalive loop once the dump completes
      echo 'Successful'
      """
  register: data_migration
  no_log: "{{ no_log }}"
  failed_when: "'Successful' not in data_migration.stdout"

AWXBackup status:

  Conditions:
    Last Transition Time:  2023-06-01T18:05:23Z
    Reason:                Successful
    Status:                True
    Type:                  Successful

The backup took around 16 minutes and exited successfully; no tasks failed.

fosterseth (Member) commented

Thanks for posting your workaround!

TheRealHaoLiu (Member) commented

We recently fixed a similar issue in the migration from the old database to the new one.

We will try to incorporate your fix into the playbook.

If you have time, would you open a PR for this?

TheRealHaoLiu (Member) commented

Our implementation is here: https://github.com/ansible/awx-operator/blob/devel/roles/installer/tasks/migrate_data.yml

I recall we were seeing zombie processes with something similar to your implementation.
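
For comparison, a variant of the keepalive pattern that avoids orphaned background processes (a sketch, not the actual migrate_data.yml code) saves the loop's PID and traps EXIT, so the loop is reaped even if pg_dump fails partway:

- name: Write pg_dump to backup on PVC
  k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"
    pod: "{{ ansible_operator_meta.name }}-db-management"
    command: |
      bash -c "
      set -e -o pipefail
      while true; do echo 'keepalive'; sleep 3; done &
      KEEPALIVE_PID=$!
      trap 'kill $KEEPALIVE_PID' EXIT   # reap the loop on any exit, success or failure
      PGPASSWORD='{{ awx_postgres_pass }}' {{ pgdump }} > {{ backup_dir }}/tower.db
      echo 'Successful'
      "
  register: data_migration
  no_log: "{{ no_log }}"
  failed_when: "'Successful' not in data_migration.stdout"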

TheRealHaoLiu (Member) commented

I took another look at the backup role. It seems we are using a separate db-mgmt pod to run the dump command, so I think we don't need to worry about zombie processes; that pod is torn down when the backup finishes, taking any leftover keepalive processes with it.

TheRealHaoLiu self-assigned this Oct 6, 2023
TheRealHaoLiu added a commit to TheRealHaoLiu/awx-operator that referenced this issue Oct 6, 2023