Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runner seems to have difficulties to process large outputs #998

Closed
stanislav-zaprudskiy opened this issue Feb 15, 2022 · 15 comments
Closed

Comments

@stanislav-zaprudskiy
Copy link

We use ansible-runner (tried both 2.1.1 and devel) in combination with AWX (19.4) all running in Kubernetes.

Apart of normal AWX jobs, ansible-runner EEs also execute inventory sync jobs using our custom inventory plugin. The command under the hood is similar to that below:

ansible-inventory --list --export -i /runner/project/awx.sh --output /runner/artifacts/<job-id>/output.json --playbook-dir /runner/project/ -vvv

And whenever the plugin generates "large" artifacts, we run into the following errors in AWX (varying from case to case - see below 3 the distinct examples), even though the corresponding EE Pod finishes successfully (e.g. {"status": "successful", "runner_ident": "<job-id>"} message followed by base64 encoded zipfile and successful termination of the Pod).

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1499, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3029, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3081, in _run_internal
    res = list(first_future.done)[0].result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1102, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3128, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 257, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 249, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 228, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/streaming.py", line 69, in unstream_dir
    with zipfile.ZipFile(tmp.name, "r") as archive:
  File "/usr/lib64/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/usr/lib64/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1499, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3029, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3081, in _run_internal
    res = list(first_future.done)[0].result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1102, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3128, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 257, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 249, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 228, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/streaming.py", line 64, in unstream_dir
    data = source.read(chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/base64io.py", line 282, in read
    results.write(base64.b64decode(data))
  File "/usr/lib64/python3.8/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1499, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3029, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3081, in _run_internal
    res = list(first_future.done)[0].result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1102, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3128, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 257, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 249, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 228, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/streaming.py", line 86, in unstream_dir
    archive.extract(info.filename, path=target_directory)
  File "/usr/lib64/python3.8/zipfile.py", line 1630, in extract
    return self._extract_member(member, path, pwd)
  File "/usr/lib64/python3.8/zipfile.py", line 1700, in _extract_member
    with self.open(member, pwd=pwd) as source, \
  File "/usr/lib64/python3.8/zipfile.py", line 1530, in open
    fheader = zef_file.read(sizeFileHeader)
  File "/usr/lib64/python3.8/zipfile.py", line 763, in read
    self._file.seek(self._pos)
OSError: [Errno 22] Invalid argument

Failures are reproducible on /runner/artifacts/<job-id>/output.json files ranging from 80 to 400 MB (raw), for which the corresponding zipfile ranges from 5 to 11 MB (extracted from K8s logs). And no errors are detected on generated output.json inventory files of few KBs.

Are there reasonable thresholds or recommendations for artifacts size? Is there anything which we could configure for now, except of re-designing the inventories/plugin?

@github-actions github-actions bot added the needs_triage New item that needs to be triaged label Feb 15, 2022
@AlanCoding
Copy link
Member

I can't reproduce this

https://github.com/AlanCoding/test-playbooks/blob/master/inventories/dyn_inventory_large_1000.py

This creates random keys / values in hostvars to maximize the compressed size. Locally looking at the sizes:

$ zip alan/output.zip alan/output.json
updating: alan/output.json (deflated 32%)
(env) [alancoding@alan-red-hat test-playbooks]$ ls -l --block-size=M alan
total 390M
-rw-rw-r--. 1 alancoding alancoding 233M Mar 22 12:46 output.json
-rw-rw-r--. 1 alancoding alancoding 158M Mar 22 12:46 output.zip

These should exceed your values sufficiently. Yet I'm able to run this in AWX, as a container_group job without a problem.

@stanislav-zaprudskiy
Copy link
Author

I'll check your example script on our AWX deployment, and will report back with results.

@stanislav-zaprudskiy
Copy link
Author

stanislav-zaprudskiy commented Apr 4, 2022

@AlanCoding, thank you for looking into that. Below are the details of our additional testing.

We did check the inventory script you referenced, and we ran into the same issues as reported originally. The stack traces are actually produced by AWX task container, as we could now see - the jobs in AWX are just marked as in Error state w/o any details in UI/API at all (another major problem we've been observing randomly for months with any other regular jobs executed via AWX, which we were not able to troubleshoot yet).

We suspected PostgeSQL, which we run separately and as a cluster (3 nodes), so we did the similar tests on a separate fresh clean installation of AWX with PostgreSQL managed by the operator. Issue was still there.

Next, we suspected that multi-node K8s cluster may be the cause, so we made a new fresh AWX installation making sure that AWX, operator managed PostgreSQL and automation-job Pods all run on the same k8s node (though, there were other nodes in the cluster running different unrelated Pods). Issue was still there.

We also played around EEs - but at the end performed all tests using the default awx-ee:latest EE (using e.g. receptor:devel among other). Issue was still there.

Finally, we performed the same test but now using K8s from Docker for Mac (running on a laptop), and this time all worked w/o a hitch. Both your example inventory and our own inventory scripts made it to completion and registered hosts in AWX. All previous (failing) tests, however, were performed on a normal multi-node K8s installation in our OpenStack cloud.

Also, looking more thoroughly on the automation-job- Pod logs reveals that base64 encoded zip archive stream is not being streamed to "the end" on normal K8s compared to Docker for Mac K8s - I see that using kubectl logs command. It's just cut down somewhere in-between on normal K8s, while on Docker for Mac K8s I could spot eof: true message at the end of the stream. Apparently, this causes the zip archive stream to either be failing on decoding, or on subsequent unzipping (original stack traces). - I now have cases with {"eof": true} reported in kubectl logs, but Error job and errors in AWX.

Would you have any ideas to check further? Anything we could do on top to better understand the cause?

kubelet config diff docker-desktop vs cloud node
{                                                                               {
  "enableServer": true,                                                           "enableServer": true,
  "staticPodPath": "/etc/kubernetes/manifests",                               <
  "syncFrequency": "1m0s",                                                        "syncFrequency": "1m0s",
  "fileCheckFrequency": "20s",                                                    "fileCheckFrequency": "20s",
  "httpCheckFrequency": "30s",                                                |   "httpCheckFrequency": "20s",
  "address": "0.0.0.0",                                                           "address": "0.0.0.0",
  "port": 10250,                                                                  "port": 10250,
  "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",                              "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
  "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",                        "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
                                                                              >   "rotateCertificates": true,
  "authentication": {                                                             "authentication": {
    "x509": {                                                                       "x509": {
      "clientCAFile": "/run/config/pki/ca.crt"                                |       "clientCAFile": "/etc/kubernetes/certs/kubelet-clients-ca.pem"
    },                                                                              },
    "webhook": {                                                                    "webhook": {
      "enabled": true,                                                                "enabled": true,
      "cacheTTL": "2m0s"                                                              "cacheTTL": "2m0s"
    },                                                                              },
    "anonymous": {                                                                  "anonymous": {
      "enabled": false                                                        |       "enabled": true
    }                                                                               }
  },                                                                              },
  "authorization": {                                                              "authorization": {
    "mode": "Webhook",                                                              "mode": "Webhook",
    "webhook": {                                                                    "webhook": {
      "cacheAuthorizedTTL": "5m0s",                                                   "cacheAuthorizedTTL": "5m0s",
      "cacheUnauthorizedTTL": "30s"                                                   "cacheUnauthorizedTTL": "30s"
    }                                                                               }
  },                                                                              },
  "registryPullQPS": 5,                                                           "registryPullQPS": 5,
  "registryBurst": 10,                                                            "registryBurst": 10,
  "eventRecordQPS": 5,                                                            "eventRecordQPS": 5,
  "eventBurst": 10,                                                               "eventBurst": 10,
  "enableDebuggingHandlers": true,                                                "enableDebuggingHandlers": true,
  "healthzPort": 10248,                                                           "healthzPort": 10248,
  "healthzBindAddress": "127.0.0.1",                                              "healthzBindAddress": "127.0.0.1",
  "oomScoreAdj": -999,                                                            "oomScoreAdj": -999,
  "clusterDomain": "cluster.local",                                               "clusterDomain": "cluster.local",
  "clusterDNS": [                                                                 "clusterDNS": [
    "10.96.0.10"                                                              |     "198.18.128.2"
  ],                                                                              ],
  "streamingConnectionIdleTimeout": "4h0m0s",                                     "streamingConnectionIdleTimeout": "4h0m0s",
  "nodeStatusUpdateFrequency": "30s",                                         |   "nodeStatusUpdateFrequency": "10s",
  "nodeStatusReportFrequency": "30s",                                         |   "nodeStatusReportFrequency": "5m0s",
  "nodeLeaseDurationSeconds": 40,                                             |   "nodeLeaseDurationSeconds": 20,
  "imageMinimumGCAge": "2m0s",                                                    "imageMinimumGCAge": "2m0s",
  "imageGCHighThresholdPercent": 85,                                              "imageGCHighThresholdPercent": 85,
  "imageGCLowThresholdPercent": 80,                                               "imageGCLowThresholdPercent": 80,
  "volumeStatsAggPeriod": "1m0s",                                                 "volumeStatsAggPeriod": "1m0s",
  "cgroupRoot": "kubepods",                                                   <
  "cgroupsPerQOS": true,                                                          "cgroupsPerQOS": true,
  "cgroupDriver": "cgroupfs",                                                 |   "cgroupDriver": "systemd",
  "cpuManagerPolicy": "none",                                                     "cpuManagerPolicy": "none",
  "cpuManagerReconcilePeriod": "10s",                                             "cpuManagerReconcilePeriod": "10s",
  "memoryManagerPolicy": "None",                                                  "memoryManagerPolicy": "None",
  "topologyManagerPolicy": "none",                                                "topologyManagerPolicy": "none",
  "topologyManagerScope": "container",                                            "topologyManagerScope": "container",
  "runtimeRequestTimeout": "2m0s",                                                "runtimeRequestTimeout": "2m0s",
  "hairpinMode": "promiscuous-bridge",                                            "hairpinMode": "promiscuous-bridge",
  "maxPods": 110,                                                                 "maxPods": 110,
  "podPidsLimit": -1,                                                             "podPidsLimit": -1,
  "resolvConf": "/etc/resolv.conf",                                               "resolvConf": "/etc/resolv.conf",
  "cpuCFSQuota": true,                                                            "cpuCFSQuota": true,
  "cpuCFSQuotaPeriod": "100ms",                                                   "cpuCFSQuotaPeriod": "100ms",
  "nodeStatusMaxImages": 50,                                                      "nodeStatusMaxImages": 50,
  "maxOpenFiles": 1000000,                                                        "maxOpenFiles": 1000000,
  "contentType": "application/vnd.kubernetes.protobuf",                           "contentType": "application/vnd.kubernetes.protobuf",
  "kubeAPIQPS": 5,                                                                "kubeAPIQPS": 5,
  "kubeAPIBurst": 10,                                                             "kubeAPIBurst": 10,
  "serializeImagePulls": true,                                                    "serializeImagePulls": true,
  "evictionHard": {                                                               "evictionHard": {
    "imagefs.available": "15%",                                                     "imagefs.available": "15%",
    "memory.available": "100Mi",                                                    "memory.available": "100Mi",
    "nodefs.available": "10%",                                                      "nodefs.available": "10%",
    "nodefs.inodesFree": "5%"                                                       "nodefs.inodesFree": "5%"
  },                                                                              },
  "evictionPressureTransitionPeriod": "5m0s",                                     "evictionPressureTransitionPeriod": "5m0s",
  "enableControllerAttachDetach": true,                                           "enableControllerAttachDetach": true,
  "makeIPTablesUtilChains": true,                                                 "makeIPTablesUtilChains": true,
  "iptablesMasqueradeBit": 14,                                                    "iptablesMasqueradeBit": 14,
  "iptablesDropBit": 15,                                                          "iptablesDropBit": 15,
  "failSwapOn": false,                                                        |   "featureGates": {
                                                                              >     "CSIMigration": true,
                                                                              >     "CSIMigrationOpenStack": true,
                                                                              >     "ExpandCSIVolumes": true,
                                                                              >     "IPv6DualStack": false
                                                                              >   },
                                                                              >   "failSwapOn": true,
  "containerLogMaxSize": "10Mi",                                                  "containerLogMaxSize": "10Mi",
  "containerLogMaxFiles": 5,                                                      "containerLogMaxFiles": 5,
  "configMapAndSecretChangeDetectionStrategy": "Watch",                           "configMapAndSecretChangeDetectionStrategy": "Watch",
  "systemReservedCgroup": "systemreserved",                                   |   "enforceNodeAllocatable": [
  "kubeReservedCgroup": "podruntime",                                         |     "pods"
  "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",  |   ],
                                                                              >   "volumePluginDir": "/var/lib/kubelet/volumeplugins",
  "logging": {                                                                    "logging": {
    "format": "text"                                                                "format": "text"
  },                                                                              },
  "enableSystemLogHandler": true,                                                 "enableSystemLogHandler": true,
  "shutdownGracePeriod": "0s",                                                    "shutdownGracePeriod": "0s",
  "shutdownGracePeriodCriticalPods": "0s",                                        "shutdownGracePeriodCriticalPods": "0s",
  "enableProfilingHandler": true,                                                 "enableProfilingHandler": true,
  "enableDebugFlagsHandler": true,                                                "enableDebugFlagsHandler": true,
  "kind": "KubeletConfiguration",                                                 "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1"                                   "apiVersion": "kubelet.config.k8s.io/v1beta1"
}

This time we run AWX 20.0.1 (but clearly see the problem starting 19.4.0), K8s 1.21.5 (and saw the problem on 1.20.8 earlier), and for ansible-runner it's stable-2.10-latest and all other combinations (e.g. awx-ee:latest).
As for Docker for Mac K8s the version is 1.21.3.

@Shrews Shrews removed the needs_triage New item that needs to be triaged label Apr 5, 2022
@stanislav-zaprudskiy
Copy link
Author

stanislav-zaprudskiy commented Apr 5, 2022

I did some more tests using K8s cluster in GCP and AWS (managed by Gardener), and for both cases I ran into the same failures using the inventory script referenced.

Steps to reproduce:

  1. Get a K8s cluster

  2. Deploy AWX in basic configuration

    spec:
      admin_password_secret: awx-admin-password
      admin_user: admin
      annotations: 'checksum/custom-certs.yaml: 8ea379be3c8912aff3144b84eac92edd23ae36d1f2a8ba755a61b2d25d4923e9'
      bundle_cacert_secret: awx-custom-certs
      create_preload_data: true
      ee_resource_requirements: {}
      garbage_collect_secrets: false
      hostname: awx-local.some.hostname
      image_pull_policy: IfNotPresent
      ingress_type: none
      loadbalancer_port: 80
      loadbalancer_protocol: http
      nodeport_port: 30080
      postgres_configuration_secret: awx-postgres-configuration
      projects_persistence: false
      projects_storage_access_mode: ReadWriteMany
      projects_storage_size: 8Gi
      replicas: 1
      route_tls_termination_mechanism: Edge
      service_type: ClusterIP
      task_privileged: false
      task_resource_requirements: {}
      web_resource_requirements: {}
    
  3. Add new project https://github.com/AlanCoding/test-playbooks.git

  4. Add new inventory

  5. Add new inventory source for the inventory, sourced from the project created earlier, using inventory file inventories/dyn_inventory_large_1000.py

  6. Sync the inventory

  7. A new automation-job will be spawned using the default AWX EE quay.io/ansible/awx-ee:latest (and in the same namespace as AWX).

  8. Upon completion of the script run, the automation-job Pod finishes successfully. However, the sync job is in Error status w/o any errors in the output.

  9. At the same time, AWX Pod register the following messages:

    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:00,937 DEBUG    [6d2eb1fe2036428987ec6b89a6fca66b] awx.main.scheduler Running task manager.
    [pod/awx-654b597b58-jj42n/awx-ee] DEBUG 2022/04/05 15:58:00 Stdout complete - closing channel for: DavZpxzf
    [pod/awx-654b597b58-jj42n/awx-ee] WARNING 2022/04/05 15:58:00 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
    [pod/awx-654b597b58-jj42n/awx-ee] DEBUG 2022/04/05 15:58:00 Client disconnected from control service @
    [pod/awx-654b597b58-jj42n/awx-ee] WARNING 2022/04/05 15:58:00 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,069 DEBUG    [6d2eb1fe2036428987ec6b89a6fca66b] awx.main.dispatch task 3e857f4e-9626-4eda-95aa-510e20906189 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,150 DEBUG    [6d2eb1fe2036428987ec6b89a6fca66b] awx.main.scheduler Starting Scheduler
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,241 DEBUG    [6d2eb1fe2036428987ec6b89a6fca66b] awx.main.scheduler Finishing Scheduler
    [pod/awx-654b597b58-jj42n/awx-ee] DEBUG 2022/04/05 15:58:01 Client connected to control service @
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,890 ERROR    [1ab2556ef2364c7d95745517224ca0cb] awx.main.tasks.jobs inventory_update 8 (running) Exception occurred while running task
    [pod/awx-654b597b58-jj42n/awx-task] Traceback (most recent call last):
    [pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 560, in run
    [pod/awx-654b597b58-jj42n/awx-task]     res = receptor_job.run()
    [pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 269, in run
    [pod/awx-654b597b58-jj42n/awx-task]     res = self._run_internal(receptor_ctl)
    [pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 345, in _run_internal
    [pod/awx-654b597b58-jj42n/awx-task]     res = list(first_future.done)[0].result()
    [pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
    [pod/awx-654b597b58-jj42n/awx-task]     return self.__get_result()
    [pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    [pod/awx-654b597b58-jj42n/awx-task]     raise self._exception
    [pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    [pod/awx-654b597b58-jj42n/awx-task]     result = self.fn(*self.args, **self.kwargs)
    [pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1168, in wrapper_cleanup_new_process
    [pod/awx-654b597b58-jj42n/awx-task]     return func(*args, **kwargs)
    [pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 401, in processor
    [pod/awx-654b597b58-jj42n/awx-task]     return ansible_runner.interface.run(
    [pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/interface.py", line 210, in run
    [pod/awx-654b597b58-jj42n/awx-task]     r.run()
    [pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 254, in run
    [pod/awx-654b597b58-jj42n/awx-task]     self.artifacts_callback(data)
    [pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 233, in artifacts_callback
    [pod/awx-654b597b58-jj42n/awx-task]     unstream_dir(self._input, length, self.artifact_dir)
    [pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/streaming.py", line 70, in unstream_dir
    [pod/awx-654b597b58-jj42n/awx-task]     with zipfile.ZipFile(tmp.name, "r") as archive:
    [pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/zipfile.py", line 1257, in __init__
    [pod/awx-654b597b58-jj42n/awx-task]     self._RealGetContents()
    [pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/zipfile.py", line 1324, in _RealGetContents
    [pod/awx-654b597b58-jj42n/awx-task]     raise BadZipFile("File is not a zip file")
    [pod/awx-654b597b58-jj42n/awx-task] zipfile.BadZipFile: File is not a zip file
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,893 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.tasks.jobs inventory_update 8 (running) finished running, producing 12 events.
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,894 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.analytics.job_lifecycle inventoryupdate-8 post run
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,932 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.models.inventory Going to update inventory computed fields, pk=2
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,941 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.models.inventory Finished updating inventory computed fields, pk=2, in 0.008 seconds
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,943 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.analytics.job_lifecycle inventoryupdate-8 finalize run
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,956 WARNING  [1ab2556ef2364c7d95745517224ca0cb] awx.main.dispatch inventory_update 8 (error) encountered an error (rc=None), please see task stdout for details.
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,958 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.dispatch task bbf88b96-8726-4711-b356-22f9399401e6 starting awx.main.tasks.system.handle_work_error(*['bbf88b96-8726-4711-b356-22f9399401e6'])
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,959 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.tasks.system Executing error task id bbf88b96-8726-4711-b356-22f9399401e6, subtasks: [{'type': 'inventory_update', 'id': 8}]
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:02,045 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.dispatch task bbf88b96-8726-4711-b356-22f9399401e6 starting awx.main.tasks.system.handle_work_success(*[])
    [pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:02,049 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.dispatch task 0620a91e-e5c9-4ec1-8bb0-7421b5945e86 starting awx.main.scheduler.tasks.run_task_manager(*[])
    

I also tried adding && sleep 60 after ansible-runner worker ... args for the Job container, but that doesn't solve the issue.

However, I have to admit, that scripts fail not necessarily 100% of the time. For our own inventory script we had some few successful runs long in the past. And also for the newly deployed Gardner AWS cluster it worked for the first run, but all further attempts were failing (10+), as well as for the new fresh AWS cluster it was failing all the time (as well as for new GCP)

I additionally tried reserving and allocating more memory for job Pods (2GB), and using OpenStack's Cinder storage for /runner folder on the runner - all w/o any success.

@AlanCoding
Copy link
Member

Thanks for digging into this. I'm now even more convinced that this is the same thing which has been reported elsewhere in AWX. There are lots of duplicate issues that lay out the issue in an unclear way, but after searching through, I believe ansible/awx#11338 is the most accurate articulation of this bug. I suspect that there is a route to pinning down the error by combining the reproducer here with the tricks outlined there.

In my prior comment, I had deployed the operator on an OpenShift cluster, which was relatively quiet and didn't have much log history before that. I would like to get back around to this, and figure out how to force the issue with minikube.

@shanemcd
Copy link
Member

I am pretty sure this is ansible/receptor#597

@shanemcd
Copy link
Member

@stanislav-zaprudskiy The latest version of awx-ee now has a patched version of receptor in it. Can you pull the latest version and try it out? You may need to use image_pull_policy on your AWX custom resource object: https://github.com/ansible/awx-operator#deploying-a-specific-version-of-awx

@shanemcd
Copy link
Member

shanemcd commented Apr 25, 2022

Never mind, I just saw your comment on ansible/receptor#597. That is perplexing.

Can you provide some information about your workload?

  • How many hosts?
  • How many job events? (You can find this at /api/v2/jobs/<id>)
  • Any information you can provide about the facts being gathered.

@shanemcd
Copy link
Member

@stanislav-zaprudskiy Have you tried the workaround described in ansible/awx#11338 (comment)?

@stanislav-zaprudskiy
Copy link
Author

@shanemcd,

We have 2 installations ATM: one running v20.x of AWX, and another v15.x. We are in process of migration of all workloads from the older installation to the newer one (for various reasons), and this problem with inventories is currently hindering the progress.

How many hosts?

In the new installation we already have 20k hosts, but many inventories can't sync, so it's not a final number. The old installation has about 40k hosts, and that's something which we'd probably get eventually in the new set-up. Not all hosts are unique though - unique would be around 10k.

How many job events? (You can find this at /api/v2/jobs/)

Would you elaborate on job events please? I took some random successful job /api/v2/jobs/23154/job_events/, for which it reports 964 as count. Shall I take some specific job example?

Any information you can provide about the facts being gathered.

As for facts - we don't gather them, but rather all the details come from inventory source scripts. We pull the data from Netbox using our own custom plugin to populate it.

Have you tried the workaround described in ansible/awx#11338 (comment)?

Not, we haven't.

  • I was concerned that at some point in time we might get a log, which will involve rotation anyway (big inventory log; or a smaller one, but which will have to cross the border of containerLogMaxSize , etc)
  • containerLogMaxSize value is the same on a our OpenStack Kubernetes node and on a docker-desktop node (find kubelet configs diff here - Runner seems to have difficulties to process large outputs #998 (comment)). However, things work on docker-desktop - which may indicate that this parameter is unrelated.
  • I was concerned about the exact impact which increase of the parameter may cause on the nodes (performance-wise), so I was reluctant to bump it to 500MB or something which some people indicated helped them.
  • Finally, there is no direct interface interface in our cloud Kubernetes service which allows tinkering around with kubelet parameters 😄 I'd have to go a long route to verify if it solves the issue.

@Tioborto
Copy link

Tioborto commented Apr 26, 2022

Hello @shanemcd,
To complete with another use case (we are another company), we run a template on a large inventory (~ 8000 hosts). This produces a large number of events (9725) with a page count (389) on /api/v2/jobs/{id}/job_events
We use the last version of ansible-runner and the last version of AWX (21.0.0).

{
    "id": 995231,
    "type": "job",
    "summary_fields": {
        "inventory": {
            "id": 484,
            "name": "linux",
            "description": "",
            "has_active_failures": true,
            "total_hosts": 8024,
            "hosts_with_active_failures": 119,
            "total_groups": 0,
            "has_inventory_sources": false,
            "total_inventory_sources": 0,
            "inventory_sources_with_failures": 0,
            "organization_id": 2,
            "kind": ""
        },
        "execution_environment": {
            "id": 40,
            "name": "Global EE Test",
            "description": "",
            "image": "....jfrog.io/adeo-awx-ee:0.0.0-main-global"
        },
        "instance_group": {
            "id": 4,
            "name": "default",
            "is_container_group": true
        }
    },
    "name": "dxp-check-connectivity-server-ask-inventory",
    "job_type": "run",
    "inventory": 484,
    "project": 754,
    "playbook": "check-connectivity/server.yaml",
    "scm_branch": "",
    "forks": 0,
    "limit": "",
    "verbosity": 0,
    "extra_vars": "{}",
    "job_tags": "",
    "force_handlers": false,
    "skip_tags": "",
    "start_at_task": "",
    "timeout": 0,
    "use_fact_cache": false,
    "organization": 2,
    "unified_job_template": 2143,
    "launch_type": "manual",
    "status": "error",
    "execution_environment": 40,
    "failed": true,
    "started": "2022-04-26T10:07:41.381674Z",
    "finished": "2022-04-26T11:03:48.440833Z",
    "canceled_on": null,
    "elapsed": 3367.059,
    "job_args": "[\"ssh-agent\", \"sh\", \"-c\", \"trap 'rm -f /runner/artifacts/995231/ssh_key_data' EXIT && ssh-add /runner/artifacts/995231/ssh_key_data && rm -f /runner/artifacts/995231/ssh_key_data && ansible-playbook -u ansibleremoteuser -i /runner/inventory/hosts -e @/runner/env/extravars check-connectivity/server.yaml\"]",
    "job_cwd": "/runner/project",
    "job_env": {
        "AWX_SERVICE_PORT_80_TCP_ADDR": "172.x.x.x",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT": "tcp://172.x.x.x:8443",
        "HOSTNAME": "automation-job-995231-76ptk",
        "AWX_SERVICE_PORT": "tcp://172.x.x.x:80",
        "AWX_SERVICE_PORT_80_TCP_PORT": "80",
        "KUBERNETES_PORT_443_TCP_PROTO": "tcp",
        "KUBERNETES_PORT_443_TCP_ADDR": "172.x.x.x",
        "container": "oci",
        "AWX_SERVICE_PORT_80_TCP_PROTO": "tcp",
        "KUBERNETES_PORT": "tcp://172.x.x.x:443",
        "PWD": "/runner",
        "HOME": "/var/lib/awx",
        "AWX_SERVICE_PORT_80_TCP": "tcp://172.x.x.x:80",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP_PORT": "8443",
        "KUBERNETES_SERVICE_PORT_HTTPS": "443",
        "AWX_SERVICE_SERVICE_PORT": "80",
        "KUBERNETES_PORT_443_TCP_PORT": "443",
        "AWX_SERVICE_SERVICE_HOST": "172.x.x.x",
        "KUBERNETES_PORT_443_TCP": "tcp://172.x.x.x:443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP": "tcp://172.x.x.x:8443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_SERVICE_PORT_HTTPS": "8443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_SERVICE_PORT": "8443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP_PROTO": "tcp",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_SERVICE_HOST": "172.x.x.x",
        "SHLVL": "0",
        "KUBERNETES_SERVICE_PORT": "443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP_ADDR": "172.x.x.x",
        "PATH": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "AWX_SERVICE_SERVICE_PORT_HTTP": "80",
        "KUBERNETES_SERVICE_HOST": "172.x.x.x",
        "LC_CTYPE": "C.UTF-8",
        "ANSIBLE_FACT_CACHE_TIMEOUT": "0",
        "ANSIBLE_FORCE_COLOR": "True",
        "ANSIBLE_HOST_KEY_CHECKING": "False",
        "ANSIBLE_INVENTORY_UNPARSED_FAILED": "True",
        "ANSIBLE_PARAMIKO_RECORD_HOST_KEYS": "False",
        "AWX_PRIVATE_DATA_DIR": "/tmp/awx_995231_ndl6nifl",
        "JOB_ID": "995231",
        "INVENTORY_ID": "484",
        "PROJECT_REVISION": "d260ddaf2d6bff",
        "ANSIBLE_RETRY_FILES_ENABLED": "False",
        "MAX_EVENT_RES": "700000",
        "AWX_HOST": "https://.../",
        "ANSIBLE_SSH_CONTROL_PATH_DIR": "/runner/cp",
        "ANSIBLE_COLLECTIONS_PATHS": "/runner/requirements_collections:~/.ansible/collections:/usr/share/ansible/collections",
        "ANSIBLE_ROLES_PATH": "/runner/requirements_roles:~/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles",
        "PYTHONPATH": ":/usr/local/lib/python3.8/site-packages/ansible_runner/config/../callbacks",
        "ANSIBLE_CALLBACK_PLUGINS": "/usr/local/lib/python3.8/site-packages/ansible_runner/config/../callbacks",
        "ANSIBLE_STDOUT_CALLBACK": "awx_display",
        "AWX_ISOLATED_DATA_DIR": "/runner/artifacts/995231",
        "RUNNER_OMIT_EVENTS": "False",
        "RUNNER_ONLY_FAILED_EVENTS": "False"
    },
    "job_explanation": "Job terminated due to error",
    "execution_node": "",
    "controller_node": "awx-6f76575fd9-57k9r",
    "result_traceback": "",
    "event_processing_finished": true,
    "work_unit_id": "M495xlC7",
    "job_template": 2143,
    "passwords_needed_to_start": [],
    "allow_simultaneous": false,
    "artifacts": {},
    "scm_revision": "d260ddaf2d6bff9e4faa2dc375b8e1c0bb4fd7a1",
    "instance_group": 4,
    "diff_mode": false,
    "job_slice_number": 0,
    "job_slice_count": 1,
    "webhook_service": "",
    "webhook_credential": null,
    "webhook_guid": "",
    "host_status_counts": {},
    "playbook_counts": {
        "play_count": 1,
        "task_count": 1
    },
    "custom_virtualenv": null
}

For us, it's possible to try the workaround because we deployed our instance on a GKE cluster managed by another team (not doing specific).
At your disposal if you need more information

@shanemcd
Copy link
Member

Now I'm almost positive that @AlanCoding was correct - it has to be ansible/awx#11338.

Unfortunately we do not currently have the resources to rearchitect this part of our application, especially since it is not affecting our customers on OpenShift, and the fact that there is a workaround. Hopefully they will fix kubernetes/kubernetes#59902 soon.

@shanemcd
Copy link
Member

I'm going to close this as I think enough evidence has been provided to prove that the problem is not within ansible-runner.

@stanislav-zaprudskiy
Copy link
Author

@shanemcd, we did more research around receptor and made the test inventory script working using streammethod: tcp and some socat magic on the EE which makes ansible-runner to read the incoming data from the receptor and stream results back to it.

I was going to perform some more tests and then to roll out the change in our productive environment, and will be also looking for any side-effects. However, would you come up with anything inherently bad with this approach of streammethod: tcp? Reading through ansible/receptor#278, I understand that there are certain issues for the cases when receptor restarts during the communication - but for AWX workflows that would mean a failure of a job and a need to restart it manually. WDYT?

@shanemcd
Copy link
Member

@stanislav-zaprudskiy Nice digging, consider me impressed. If this helps you maybe we could look into exposing it via awx-operator. It's just more complex and we'd need to think about what the inputs look like... and certs. I used the k8s logger because we get access to it for free after the operator creates the service account. If you want to chat through anything, email is on my profile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants