workflow-controller memory usage increases monotonically #6532
Comments
I can see the log.
The cluster is under pretty heavy load. The workflow object is no longer present in the namespace.
@tymokvo that means the TTL controller is deleting it. Because of the big workflows and the heavy load of new workflows, the controller's informer size is growing.
Yep, I will do that and follow up here. Thanks!
Memory usage should be less than 1GB in most use cases.
Can we get a pprof dump please? This is an example: https://github.com/argoproj/argo-workflows/blob/master/hack/capture-pprof.sh
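For readers following along, a rough sketch of the kind of capture that script performs, assuming the controller runs in the argo namespace and exposes pprof on port 6060 (both assumptions about your install; graphviz is needed for PNG output):

```bash
# Sketch of a manual pprof capture from the workflow-controller.
# Assumptions: "argo" namespace, pprof served on port 6060.
kubectl -n argo port-forward deploy/workflow-controller 6060:6060 &
sleep 2

# Render the heap, allocation, and CPU profiles for attaching to the issue.
go tool pprof -png http://localhost:6060/debug/pprof/heap   > heap.png
go tool pprof -png http://localhost:6060/debug/pprof/allocs > allocs.png
go tool pprof -png -seconds 30 http://localhost:6060/debug/pprof/profile > profile.png

kill %1  # stop the port-forward
```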
We have seen it increase to ~10GB after running 215 workflows with a total of 55,492 nodes over the course of 8 hours, at which point the node hit OOM and killed the workflow-controller and a few of our other monitoring services. I will get the pprof dump ASAP.
Here's a table of how the load over time breaks down:
It might be good to set up a 30-minute call: https://bit.ly/book-30m-with-argo-team
Ok, I will try to find some time. I ran it again and captured the heap dump SVG here.
I thought I'd jump in, but I didn't see anything out of the ordinary in the pprof dumps. Can I ask you to review https://argoproj.github.io/argo-workflows/running-at-massive-scale/? Additionally, you didn't state whether this is a regression. Is this a brand-new installation?
It's hard to tell if it's a regression; it's the first time that we've experienced this specific issue. We weren't running this kind of load before upgrading to Argo 3, and we've been using 3.1.0-rc14 for a while now (though we had to fork and patch to mitigate #6276).
Re: running at massive scale, we've implemented parallelism limits, we're using the emissary executor, and the k8s cluster can handle the load (the good thing about the issue we've been having is that the workflows have all been completing successfully). But the memory consumption by the workflow controller is starving all the other infrastructure services (fluentbit, prometheus, some custom eventing services, etc.).
I will look into implementing rate limiting and the requeue time, though.
Try the re-queue setting; if that works, great. If not, let's get on a call.
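For reference, a minimal sketch of how the re-queue interval might be raised on the controller deployment, assuming it runs in the argo namespace and that this release reads the DEFAULT_REQUEUE_TIME environment variable:

```bash
# Raise the controller's default requeue interval (assumed env var and namespace).
kubectl -n argo set env deployment/workflow-controller DEFAULT_REQUEUE_TIME=30s

# The deployment rolls out a new pod; confirm the variable is set.
kubectl -n argo get deployment workflow-controller \
  -o jsonpath='{.spec.template.spec.containers[0].env}'
```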
Coda: you really should be running the latest version, which will be v3.1.5 (or v3.1.6 if that is released today). If need be, fork and re-apply your patch.
Ok, I'll start there.
One thing I forgot to mention: we are using per-user namespaces with resource limits to prevent users from starving each other of CPU/memory. This and our resource allocation settings amount to each namespace being able to run ~66 pods simultaneously.
In order to test the suggested changes appropriately, I've been trying to replicate the memory consumption behavior on our test cluster, and have gotten similar behavior by running jobs that create >200 workflows in two different user namespaces simultaneously.
In the monitoring screenshot below, the spike in activity between 12:15 and 12:30 is in the same user namespace, and the memory usage doesn't appear to increase. In the spike in activity between 12:40 and 12:55, the workflows are the same two as the previous spike, but in separate namespaces. Here, the memory consumption does seem to increase. workflow-controller logs.
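A minimal sketch of the kind of cross-namespace load test described above; the namespace names, workflow file, and pod label are placeholders rather than the actual test setup:

```bash
# Submit the same heavy workflow into two user namespaces at once
# (placeholder names) to exercise the cross-namespace case.
WORKFLOW_FILE=heavy-workflow.yaml
for ns in user-a user-b; do
  for i in $(seq 1 200); do
    argo submit -n "$ns" "$WORKFLOW_FILE" > /dev/null &
  done
done
wait

# Watch the controller's memory while the workflows run
# (label assumed to be app=workflow-controller in a default install).
kubectl -n argo top pod --containers -l app=workflow-controller
```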
Did you try the newer version? I'm not clear.
Yeah, sorry for the delay. We have been doing a lot of testing in addition to fixing our own systems. I just ran the latest test overnight. We've upgraded to 3.1.6, increased the requeue time to 30s, and reduced the TTL to 1200 seconds for all finished states. We're still seeing memory consumption by the controller exceed 10GB for some of our workflows. This is the memory consumption reported by GCP metrics for the one I ran last night.
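For anyone reproducing this, a sketch of what a 1200-second TTL on finished workflows can look like in a Workflow spec; the workflow itself and the namespace are placeholders:

```bash
# Placeholder workflow demonstrating a ttlStrategy that deletes the
# Workflow object 1200 seconds after it reaches any finished state.
cat > ttl-demo.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ttl-demo-
spec:
  entrypoint: main
  ttlStrategy:
    secondsAfterCompletion: 1200
  templates:
    - name: main
      container:
        image: alpine:3.14
        command: [sh, -c, "echo done"]
EOF

argo submit -n user-a ttl-demo.yaml
```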
So generally reduced?
Yeah, initially it seemed to be resolved until I ran this test case again last night.
So we're making progress. Can I request dumps when the memory usage is high again?
Sure thing, it's still over 10GB so I can do it now.
So this is the current kubectl top output:
POD                                    NAME                  CPU(cores)   MEMORY(bytes)
workflow-controller-7f7bd4cdc9-tf4qb   argo-dumpster         2m           127Mi
workflow-controller-7f7bd4cdc9-tf4qb   workflow-controller   89m          7733Mi
Here's the pprof allocs, heap, and profile.
Hotspots
Can you run your controller with INDEX_WORKFLOW_SEMAPHORE_KEYS=false? Note that this will disable semaphores.
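A short sketch of one way to apply that setting, assuming the controller deployment lives in the argo namespace:

```bash
# Disable the semaphore key index on the controller (this turns off semaphore support).
kubectl -n argo set env deployment/workflow-controller INDEX_WORKFLOW_SEMAPHORE_KEYS=false

# Wait for the controller pod to restart with the new environment.
kubectl -n argo rollout status deployment/workflow-controller
```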
Yep, just updated the controller. The test case takes a few hours to run though.
Ok! That seems to have done it. Memory consumption is sitting at around 260MB with peaks around 1GB. The variance between my two test runs was pretty high, but the upper bound was much more bearable. Out of curiosity, what indicated that the semaphore key index was the problem?
@tymokvo I'm not very familiar with Golang GC, but I've noted that when very large numbers of allocations happen, real memory usage grows even though the heap does not. The same thing happens with the Java Virtual Machine. It was the allocs dump that pointed to this. @sarabala1979, actions for you:
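As an aside for readers, the heap-versus-allocations distinction mentioned above can be seen by comparing pprof sample indexes; a minimal sketch, assuming the earlier dumps were saved locally as heap.out and allocs.out:

```bash
# A flat in-use profile combined with a huge cumulative-allocation profile
# points at allocation churn rather than a classic leak.
# File names are placeholders for the dumps captured earlier in the thread.
go tool pprof -top -inuse_space heap.out     # what the heap holds right now
go tool pprof -top -alloc_space allocs.out   # everything allocated since start
```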
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Summary
Opened at the request of @sarabala1979 on Slack here.
What happened/what you expected to happen?
workflow-controller's memory usage increased monotonically with each submitted workflow. I expected its memory to be freed some time after a workflow transitioned to a "finished" state.
Diagnostics
👀 Yes! We need all of your diagnostics; please make sure you add them all, otherwise we'll go around in circles asking you for them:
What Kubernetes provider are you using?
GKE
What version of Argo Workflows are you running?
3.1.0-rc14
What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Emissary
Did this work in a previous version? I.e. is it a regression?
Unknown
Are you pasting thousands of log lines? That's too much information.
Yes, kind of, but I was asked to in this case.
The logs are 5.4MB, so I put them on GCS here.
The workflows in question completed at around 06:55 UTC
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.