Runner seems to have difficulties to process large outputs #998
Comments
I can't reproduce this. https://github.com/AlanCoding/test-playbooks/blob/master/inventories/dyn_inventory_large_1000.py This creates random keys / values in hostvars to maximize the compressed size. Locally looking at the sizes:
These should exceed your values sufficiently. Yet I'm able to run this in AWX, as a container_group job without a problem. |
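For readers who want to try a similar reproducer, here is a minimal sketch of a dynamic inventory script that fills hostvars with random keys and values. It is not the script linked above; the group name `generated`, the host naming scheme, and the `num_hosts` / `vars_per_host` parameters are assumptions made for illustration. Random strings are used because they compress poorly, which keeps the zipped artifact large.

```python
import json
import random
import string


def random_token(rng, length=16):
    """Return a random alphanumeric string; random data compresses poorly."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_inventory(num_hosts=1000, vars_per_host=50, seed=None):
    """Build an inventory dict in the JSON shape an inventory script prints."""
    rng = random.Random(seed)
    # Hypothetical host names; any unique strings would do.
    hostnames = [f"host-{i:05d}" for i in range(num_hosts)]
    hostvars = {
        name: {random_token(rng): random_token(rng) for _ in range(vars_per_host)}
        for name in hostnames
    }
    return {
        # "generated" is an assumed group name, not taken from the thread.
        "generated": {"hosts": hostnames},
        "_meta": {"hostvars": hostvars},
    }


if __name__ == "__main__":
    # Ansible calls inventory scripts with --list and reads JSON from stdout.
    print(json.dumps(build_inventory()))
```

Raising `num_hosts` or `vars_per_host` scales the output size, which makes it possible to look for the point at which a sync starts failing.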
I'll check your example script on our AWX deployment, and will report back with results. |
@AlanCoding, thank you for looking into that. Below are the details of our additional testing.
We checked the inventory script you referenced and ran into the same issues as originally reported. The stack traces are actually produced by AWX
We suspected PostgreSQL, which we run separately and as a cluster (3 nodes), so we ran similar tests on a separate, clean installation of AWX with PostgreSQL managed by the operator. The issue was still there.
Next, we suspected that the multi-node K8s cluster might be the cause, so we made another fresh AWX installation, making sure that AWX, the operator-managed PostgreSQL and the automation-job Pods all ran on the same K8s node (though there were other nodes in the cluster running different, unrelated Pods). The issue was still there.
We also experimented with EEs, but in the end performed all tests using the default
Finally, we performed the same test using K8s from Docker for Mac (running on a laptop), and this time everything worked without a hitch. Both your example inventory and our own inventory scripts ran to completion and registered hosts in AWX. All previous (failing) tests, however, were performed on a normal multi-node K8s installation in our OpenStack cloud.
Also, a closer look at the automation-job- Pod logs reveals that the base64-encoded zip archive stream is not streamed "to the end" on normal K8s, unlike on Docker for Mac K8s - I see that using
Would you have any ideas on what to check further? Is there anything else we could do to better understand the cause?
kubelet config diff docker-desktop vs cloud node
{ {
"enableServer": true, "enableServer": true,
"staticPodPath": "/etc/kubernetes/manifests", <
"syncFrequency": "1m0s", "syncFrequency": "1m0s",
"fileCheckFrequency": "20s", "fileCheckFrequency": "20s",
"httpCheckFrequency": "30s", | "httpCheckFrequency": "20s",
"address": "0.0.0.0", "address": "0.0.0.0",
"port": 10250, "port": 10250,
"tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt", "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
"tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key", "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
> "rotateCertificates": true,
"authentication": { "authentication": {
"x509": { "x509": {
"clientCAFile": "/run/config/pki/ca.crt" | "clientCAFile": "/etc/kubernetes/certs/kubelet-clients-ca.pem"
}, },
"webhook": { "webhook": {
"enabled": true, "enabled": true,
"cacheTTL": "2m0s" "cacheTTL": "2m0s"
}, },
"anonymous": { "anonymous": {
"enabled": false | "enabled": true
} }
}, },
"authorization": { "authorization": {
"mode": "Webhook", "mode": "Webhook",
"webhook": { "webhook": {
"cacheAuthorizedTTL": "5m0s", "cacheAuthorizedTTL": "5m0s",
"cacheUnauthorizedTTL": "30s" "cacheUnauthorizedTTL": "30s"
} }
}, },
"registryPullQPS": 5, "registryPullQPS": 5,
"registryBurst": 10, "registryBurst": 10,
"eventRecordQPS": 5, "eventRecordQPS": 5,
"eventBurst": 10, "eventBurst": 10,
"enableDebuggingHandlers": true, "enableDebuggingHandlers": true,
"healthzPort": 10248, "healthzPort": 10248,
"healthzBindAddress": "127.0.0.1", "healthzBindAddress": "127.0.0.1",
"oomScoreAdj": -999, "oomScoreAdj": -999,
"clusterDomain": "cluster.local", "clusterDomain": "cluster.local",
"clusterDNS": [ "clusterDNS": [
"10.96.0.10" | "198.18.128.2"
], ],
"streamingConnectionIdleTimeout": "4h0m0s", "streamingConnectionIdleTimeout": "4h0m0s",
"nodeStatusUpdateFrequency": "30s", | "nodeStatusUpdateFrequency": "10s",
"nodeStatusReportFrequency": "30s", | "nodeStatusReportFrequency": "5m0s",
"nodeLeaseDurationSeconds": 40, | "nodeLeaseDurationSeconds": 20,
"imageMinimumGCAge": "2m0s", "imageMinimumGCAge": "2m0s",
"imageGCHighThresholdPercent": 85, "imageGCHighThresholdPercent": 85,
"imageGCLowThresholdPercent": 80, "imageGCLowThresholdPercent": 80,
"volumeStatsAggPeriod": "1m0s", "volumeStatsAggPeriod": "1m0s",
"cgroupRoot": "kubepods", <
"cgroupsPerQOS": true, "cgroupsPerQOS": true,
"cgroupDriver": "cgroupfs", | "cgroupDriver": "systemd",
"cpuManagerPolicy": "none", "cpuManagerPolicy": "none",
"cpuManagerReconcilePeriod": "10s", "cpuManagerReconcilePeriod": "10s",
"memoryManagerPolicy": "None", "memoryManagerPolicy": "None",
"topologyManagerPolicy": "none", "topologyManagerPolicy": "none",
"topologyManagerScope": "container", "topologyManagerScope": "container",
"runtimeRequestTimeout": "2m0s", "runtimeRequestTimeout": "2m0s",
"hairpinMode": "promiscuous-bridge", "hairpinMode": "promiscuous-bridge",
"maxPods": 110, "maxPods": 110,
"podPidsLimit": -1, "podPidsLimit": -1,
"resolvConf": "/etc/resolv.conf", "resolvConf": "/etc/resolv.conf",
"cpuCFSQuota": true, "cpuCFSQuota": true,
"cpuCFSQuotaPeriod": "100ms", "cpuCFSQuotaPeriod": "100ms",
"nodeStatusMaxImages": 50, "nodeStatusMaxImages": 50,
"maxOpenFiles": 1000000, "maxOpenFiles": 1000000,
"contentType": "application/vnd.kubernetes.protobuf", "contentType": "application/vnd.kubernetes.protobuf",
"kubeAPIQPS": 5, "kubeAPIQPS": 5,
"kubeAPIBurst": 10, "kubeAPIBurst": 10,
"serializeImagePulls": true, "serializeImagePulls": true,
"evictionHard": { "evictionHard": {
"imagefs.available": "15%", "imagefs.available": "15%",
"memory.available": "100Mi", "memory.available": "100Mi",
"nodefs.available": "10%", "nodefs.available": "10%",
"nodefs.inodesFree": "5%" "nodefs.inodesFree": "5%"
}, },
"evictionPressureTransitionPeriod": "5m0s", "evictionPressureTransitionPeriod": "5m0s",
"enableControllerAttachDetach": true, "enableControllerAttachDetach": true,
"makeIPTablesUtilChains": true, "makeIPTablesUtilChains": true,
"iptablesMasqueradeBit": 14, "iptablesMasqueradeBit": 14,
"iptablesDropBit": 15, "iptablesDropBit": 15,
"failSwapOn": false, | "featureGates": {
> "CSIMigration": true,
> "CSIMigrationOpenStack": true,
> "ExpandCSIVolumes": true,
> "IPv6DualStack": false
> },
> "failSwapOn": true,
"containerLogMaxSize": "10Mi", "containerLogMaxSize": "10Mi",
"containerLogMaxFiles": 5, "containerLogMaxFiles": 5,
"configMapAndSecretChangeDetectionStrategy": "Watch", "configMapAndSecretChangeDetectionStrategy": "Watch",
"systemReservedCgroup": "systemreserved", | "enforceNodeAllocatable": [
"kubeReservedCgroup": "podruntime", | "pods"
"volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/", | ],
> "volumePluginDir": "/var/lib/kubelet/volumeplugins",
"logging": { "logging": {
"format": "text" "format": "text"
}, },
"enableSystemLogHandler": true, "enableSystemLogHandler": true,
"shutdownGracePeriod": "0s", "shutdownGracePeriod": "0s",
"shutdownGracePeriodCriticalPods": "0s", "shutdownGracePeriodCriticalPods": "0s",
"enableProfilingHandler": true, "enableProfilingHandler": true,
"enableDebugFlagsHandler": true, "enableDebugFlagsHandler": true,
"kind": "KubeletConfiguration", "kind": "KubeletConfiguration",
"apiVersion": "kubelet.config.k8s.io/v1beta1" "apiVersion": "kubelet.config.k8s.io/v1beta1"
}
This time we ran AWX 20.0.1 (though we have clearly seen the problem since 19.4.0) on K8s 1.21.5 (we saw the problem on 1.20.8 earlier), and for ansible-runner it's |
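The observation above, that the base64-encoded zip stream does not reach "the end", can be checked on a captured payload. The sketch below assumes you have already extracted the base64 text from the Pod logs yourself; the function name `check_zip_payload` and its return labels are hypothetical, and it makes no claim about how AWX or receptor validate the stream.

```python
import base64
import binascii
import io
import zipfile


def check_zip_payload(b64_text):
    """Classify a base64 payload as 'ok', 'bad-base64', or 'bad-zip'."""
    try:
        # Drop line breaks first, then decode strictly so stray bytes are caught.
        raw = base64.b64decode("".join(b64_text.split()), validate=True)
    except (binascii.Error, ValueError):
        return "bad-base64"
    try:
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            # testzip() returns the first corrupt member name, or None.
            return "ok" if archive.testzip() is None else "bad-zip"
    except zipfile.BadZipFile:
        # A truncated archive usually lacks its end-of-central-directory record.
        return "bad-zip"
```

A stream cut off partway typically shows up as `bad-base64` (the length is no longer a multiple of four) or `bad-zip` (the central directory at the end of the archive is missing).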
I did some more tests using K8s cluster in GCP and AWS (managed by Gardener), and for both cases I ran into the same failures using the inventory script referenced. Steps to reproduce:
I also tried adding However, I have to admit that the scripts do not necessarily fail 100% of the time. Our own inventory script had a few successful runs long ago. On the newly deployed Gardener AWS cluster the first run worked, but all further attempts (10+) failed; on the new, fresh AWS cluster it failed every time (as it did on the new GCP cluster). I additionally tried reserving and allocating more memory for job Pods (2GB), and using OpenStack's Cinder storage for |
Thanks for digging into this. I'm now even more convinced that this is the same thing which has been reported elsewhere in AWX. There are lots of duplicate issues that lay out the issue in an unclear way, but after searching through, I believe ansible/awx#11338 is the most accurate articulation of this bug. I suspect that there is a route to pinning down the error by combining the reproducer here with the tricks outlined there. In my prior comment, I had deployed the operator on an OpenShift cluster, which was relatively quiet and didn't have much log history before that. I would like to get back around to this, and figure out how to force the issue with minikube. |
I am pretty sure this is ansible/receptor#597 |
@stanislav-zaprudskiy The latest version of awx-ee now has a patched version of receptor in it. Can you pull the latest version and try it out? You may need to use |
Never mind, I just saw your comment on ansible/receptor#597. That is perplexing. Can you provide some information about your workload?
|
@stanislav-zaprudskiy Have you tried the workaround described in ansible/awx#11338 (comment)? |
We have 2 installations at the moment: one running AWX v20.x and another running v15.x. We are in the process of migrating all workloads from the older installation to the newer one (for various reasons), and this problem with inventories is currently hindering progress.
In the new installation we already have 20k hosts, but many inventories can't sync, so it's not a final number. The old installation has about 40k hosts, and that's something which we'd probably get eventually in the new set-up. Not all hosts are unique though - unique would be around 10k.
Would you elaborate on job events please? I took some random successful job
As for facts - we don't gather them, but rather all the details come from inventory source scripts. We pull the data from Netbox using our own custom plugin to populate it.
No, we haven't.
|
Hello @shanemcd,
For us, it's possible to try the workaround because we deployed our instance on a GKE cluster managed by another team (not doing specific). |
Now I'm almost positive that @AlanCoding was correct - it has to be ansible/awx#11338. Unfortunately we do not currently have the resources to rearchitect this part of our application, especially since it is not affecting our customers on OpenShift, and the fact that there is a workaround. Hopefully they will fix kubernetes/kubernetes#59902 soon. |
I'm going to close this as I think enough evidence has been provided to prove that the problem is not within ansible-runner. |
@shanemcd, we did more research around receptor and got the test inventory script working using I was going to perform some more tests and then roll out the change in our production environment, while also watching for any side effects. However, can you think of anything inherently bad with this approach of |
@stanislav-zaprudskiy Nice digging, consider me impressed. If this helps you maybe we could look into exposing it via awx-operator. It's just more complex and we'd need to think about what the inputs look like... and certs. I used the k8s logger because we get access to it for free after the operator creates the service account. If you want to chat through anything, email is on my profile. |
We use `ansible-runner` (tried both 2.1.1 and devel) in combination with AWX (19.4), all running in Kubernetes.
Apart from normal AWX jobs, `ansible-runner` EEs also execute inventory sync jobs using our custom inventory plugin. The command under the hood is similar to that below:
Whenever the plugin generates "large" artifacts, we run into the following errors in AWX (varying from case to case - see the 3 distinct examples below), even though the corresponding EE Pod finishes successfully (e.g. a `{"status": "successful", "runner_ident": "<job-id>"}` message followed by a base64-encoded zipfile and successful termination of the Pod).
Failures are reproducible on `/runner/artifacts/<job-id>/output.json` files ranging from 80 to 400 MB (raw), for which the corresponding `zipfile` ranges from 5 to 11 MB (extracted from K8s logs). No errors are detected on generated `output.json` inventory files of a few KBs.
Are there reasonable thresholds or recommendations for artifact size? Is there anything we could configure for now, short of redesigning the inventories/plugin?
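To compare raw artifact size with what actually travels through the logs, the sizes can be estimated locally. This is a rough sketch only: it assumes DEFLATE compression and an in-memory archive, which may not match how ansible-runner packs its artifacts, and `stream_sizes` is a hypothetical helper name.

```python
import base64
import io
import zipfile


def stream_sizes(files):
    """Return (raw, zipped, base64) byte counts for a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    zipped = buffer.getvalue()
    raw = sum(len(data) for data in files.values())
    # base64 expands the zipped bytes by roughly one third.
    return raw, len(zipped), len(base64.b64encode(zipped))
```

Passing the contents of an `output.json` file as the only entry gives a quick way to see whether a given inventory falls in the size range where failures were reported.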