
AWXBackup fails with WebSocketConnectionClosedException: Connection to remote host was lost. on AKS #1435

Closed
Maxltrm opened this issue Jun 1, 2023 · 6 comments · Fixed by #1580

Maxltrm commented Jun 1, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

When I create an awxbackup CR on AKS, the backup goes into a restart loop: the awxbackup-db-management pod dies after about 5 minutes.

To troubleshoot the problem, I created a custom version of the awx-operator and updated the task "Write pg_dump to backup on PVC" in roles/backup/tasks/postgres.yml by commenting out the failed_when statement, then created an awxbackup CR and got this in the operator logs:

TASK [Write pg_dump to backup on PVC] ******************************************
task path: /opt/ansible/roles/backup/tasks/postgres.yml:97
fatal: [localhost]: FAILED! => changed=false, rc=1, msg: MODULE FAILURE (see stdout/stderr for the exact error)
module_stdout: (empty)
module_stderr:
/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/__init__.py:10: DeprecationWarning: The package kubernetes.client.apis is renamed and deprecated, use kubernetes.client.api instead (please note that the trailing s was removed).
  warnings.warn(
Traceback (most recent call last):
  File "/opt/ansible/.ansible/tmp/ansible-tmp-1685621049.5945885-1007-97727903579089/AnsiballZ_k8s_exec.py", line 102, in <module>
    _ansiballz_main()
  File "/opt/ansible/.ansible/tmp/ansible-tmp-1685621049.5945885-1007-97727903579089/AnsiballZ_k8s_exec.py", line 94, in _ansiballz_main
    invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
  File "/opt/ansible/.ansible/tmp/ansible-tmp-1685621049.5945885-1007-97727903579089/AnsiballZ_k8s_exec.py", line 40, in invoke_module
    runpy.run_module(mod_name='ansible_collections.kubernetes.core.plugins.modules.k8s_exec', init_globals=None, run_name='__main__', alter_sys=True)
  File "/usr/lib64/python3.8/runpy.py", line 207, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/usr/lib64/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib64/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/ansible_k8s_exec_payload_399gjeg1/ansible_k8s_exec_payload.zip/ansible_collections/kubernetes/core/plugins/modules/k8s_exec.py", line 254, in <module>
  File "/tmp/ansible_k8s_exec_payload_399gjeg1/ansible_k8s_exec_payload.zip/ansible_collections/kubernetes/core/plugins/modules/k8s_exec.py", line 248, in main
  File "/tmp/ansible_k8s_exec_payload_399gjeg1/ansible_k8s_exec_payload.zip/ansible_collections/kubernetes/core/plugins/modules/k8s_exec.py", line 210, in execute_module
  File "/usr/local/lib/python3.8/site-packages/kubernetes/stream/ws_client.py", line 192, in update
    op_code, frame = self.sock.recv_data_frame(True)
  File "/usr/local/lib/python3.8/site-packages/websocket/_core.py", line 406, in recv_data_frame
    frame = self.recv_frame()
  File "/usr/local/lib/python3.8/site-packages/websocket/_core.py", line 445, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/usr/local/lib/python3.8/site-packages/websocket/_abnf.py", line 338, in recv_frame
    self.recv_header()
  File "/usr/local/lib/python3.8/site-packages/websocket/_abnf.py", line 294, in recv_header
    header = self.recv_strict(2)
  File "/usr/local/lib/python3.8/site-packages/websocket/_abnf.py", line 373, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "/usr/local/lib/python3.8/site-packages/websocket/_core.py", line 529, in _recv
    return recv(self.sock, bufsize)
  File "/usr/local/lib/python3.8/site-packages/websocket/_socket.py", line 122, in recv
    raise WebSocketConnectionClosedException(
websocket._exceptions.WebSocketConnectionClosedException: Connection to remote host was lost.
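For reference, the debugging change was minimal; the sketch below shows the shape of the modified task with the failed_when line commented out (the command shown is assumed from the workaround later in this issue, not copied from the upstream file):

- name: Write pg_dump to backup on PVC
  k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"
    pod: "{{ ansible_operator_meta.name }}-db-management"
    command: >-
      bash -c "set -e -o pipefail;
      PGPASSWORD='{{ awx_postgres_pass }}' {{ pgdump }} > {{ backup_dir }}/tower.db;
      echo 'Successful'"
  register: data_migration
  no_log: "{{ no_log }}"
  # Commented out so the raw module error surfaces in the operator log:
  # failed_when: "'Successful' not in data_migration.stdout"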

The problem looks closely related to other problems users have reported on AKS, where idle exec connections are closed after ~5 minutes (ansible/awx#12530 (comment)); that behavior is also the reason AWX_RUNNER_KEEPALIVE_SECONDS was implemented in AWX.

I'm testing the following workaround, and the pod is no longer dying after 5 minutes:

- name: Write pg_dump to backup on PVC
  k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"
    pod: "{{ ansible_operator_meta.name }}-db-management"
    command: |
      bash -c """
      set -e -o pipefail
      keepalive () {
        while true;do echo 'keepalive'; sleep 3 ;done
      }
      keepalive &
      PGPASSWORD='{{ awx_postgres_pass }}' {{ pgdump }} > {{ backup_dir }}/tower.db
      pkill -P $$          # kill all descendant PIDs
      echo 'Successful'
      """
  register: data_migration
  no_log: "{{ no_log }}"
  failed_when: "'Successful' not in data_migration.stdout"
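For context, the point of the keepalive loop is to write to stdout every few seconds so the exec websocket never sits idle long enough for the AKS load balancer to drop it; this is the same idea behind AWX_RUNNER_KEEPALIVE_SECONDS in AWX.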

Is there a better way to handle this problem that I am not considering, or is there something like AWX_RUNNER_KEEPALIVE_SECONDS that we can set in the operator? Or is this actually a bug?

AWX Operator version

2.0.1

AWX version

22.1.0

Kubernetes platform

other (please specify in additional information)

Kubernetes/Platform version

AKS

Modifications

no

Steps to reproduce

  1. Install the AWX Operator on AKS.
  2. Create an AWX instance with a database big enough to make the backup job last more than 5 minutes.
  3. Create an AWXBackup custom resource (a minimal manifest is sketched below).
  4. The db-management pod will restart in a loop every ~5 minutes.
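
For step 3, a minimal AWXBackup manifest looks like the sketch below (the names and namespace are illustrative; deployment_name must match the metadata.name of the AWX instance created in step 2):

apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-demo        # illustrative name
  namespace: awx              # assumed namespace of the AWX deployment
spec:
  deployment_name: awx-demo   # must match the AWX CR's metadata.name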

Expected results

The db-management pod shouldn't restart in a loop, and the backup should succeed.

Actual results

The db-management pod dies after ~5 minutes.

Additional information

No response

Operator Logs

No response


Maxltrm commented Jun 1, 2023

The workaround reported above doesn't actually work: the pod no longer dies, but the job never ends.


Maxltrm commented Jun 1, 2023

It turns out pkill is not available in the container. I managed to make the backup complete successfully by killing the keepalive job with kill %1 instead (%1 refers to the shell's first background job, i.e. the keepalive loop):

- name: Write pg_dump to backup on PVC
  k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"
    pod: "{{ ansible_operator_meta.name }}-db-management"
    command: |
      bash -c """
      set -e -o pipefail
      keepalive () {
        while true; do echo 'keepalive'; sleep 3; done
      }
      keepalive &    # background job %1: keeps the exec websocket from going idle
      PGPASSWORD='{{ awx_postgres_pass }}' {{ pgdump }} > {{ backup_dir }}/tower.db
      kill %1        # stop the keepalive loop once the dump completes
      echo 'Successful'
      """
  register: data_migration
  no_log: "{{ no_log }}"
  failed_when: "'Successful' not in data_migration.stdout"

AWXBackup status:

  Conditions:
    Last Transition Time:  2023-06-01T18:05:23Z
    Reason:                Successful
    Status:                True
    Type:                  Successful

The backup took around 16 minutes and exited successfully; no tasks failed.

fosterseth (Member) commented

Thanks for posting your workaround!

TheRealHaoLiu (Member) commented

We recently fixed a similar issue in the migration from the old database to the new one.

We will try to incorporate your fix into the playbook.

If you have time, would you open a PR for this?

TheRealHaoLiu (Member) commented

Our implementation is here: https://github.com/ansible/awx-operator/blob/devel/roles/installer/tasks/migrate_data.yml

I recall we were seeing zombie processes with something similar to your implementation.
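
For comparison, a variant of the keepalive pattern that avoids orphaned background processes (a sketch, not the actual migrate_data.yml code) saves the loop's PID and traps EXIT, so the loop is reaped even if pg_dump fails partway:

- name: Write pg_dump to backup on PVC
  k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"
    pod: "{{ ansible_operator_meta.name }}-db-management"
    command: |
      bash -c "
      set -e -o pipefail
      while true; do echo 'keepalive'; sleep 3; done &
      KEEPALIVE_PID=$!
      trap 'kill $KEEPALIVE_PID' EXIT   # reap the loop on any exit, success or failure
      PGPASSWORD='{{ awx_postgres_pass }}' {{ pgdump }} > {{ backup_dir }}/tower.db
      echo 'Successful'
      "
  register: data_migration
  no_log: "{{ no_log }}"
  failed_when: "'Successful' not in data_migration.stdout"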

TheRealHaoLiu (Member) commented

I took another look at the backup role. It seems we are using a separate db-mgmt pod to run the dump command, so I think we don't need to worry about zombie processes; that pod is torn down when the backup finishes, taking any leftover keepalive processes with it.

TheRealHaoLiu self-assigned this Oct 6, 2023
TheRealHaoLiu added a commit to TheRealHaoLiu/awx-operator that referenced this issue Oct 6, 2023