
workflow-controller memory usage increases monotonically #6532

Closed · tymokvo opened this issue Aug 11, 2021 · 31 comments
Labels: solution/workaround (There's a workaround, might not be great, but exists), type/bug

tymokvo commented Aug 11, 2021

Summary

Opened at the request of @sarabala1979 on Slack here.

What happened/what you expected to happen?

The workflow-controller's memory usage increased monotonically with each submitted workflow. I expected its memory to be freed some time after each workflow transitioned to a "finished" state.

Diagnostics

👀 Yes! We need all of your diagnostics, please make sure you add it all, otherwise we'll go around in circles asking you for it:

What Kubernetes provider are you using?

GKE

What version of Argo Workflows are you running?

3.1.0-rc14

What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary

Emissary

Did this work in a previous version? I.e. is it a regression?

Unknown

Are you pasting thousands of log lines? That's too much information.

Yes, kind of, but I was asked to in this case.

The logs are 5.4MB, so I put them on GCS here.

The workflows in question completed at around 06:55 UTC.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

sarabala1979 (Member) commented:

I can see the log line "Deleting TTL expired workflow", so I think TTL deletion is working. Can you provide a little more information about your environment? How busy is your cluster?
Can you also check whether the already-deleted workflow is still present in your cluster?
Deleting TTL expired workflow '793fb579-caf4-49d1-a7e6-f4bb1d246095/bf14d17b-a55f-50d3-8420-f6196563b19e'"

tymokvo commented Aug 11, 2021

The cluster is under pretty heavy load. This workflow (bf14d17b-a55f-50d3-8420-f6196563b19e) was one of 216 that a user submitted at once, which created a total of 3493 nodes.

The workflow object is no longer present in the namespace.

sarabala1979 (Member) commented:

@tymokvo That means TTL deletion is working. Because of the big workflows and the heavy load of new workflows, the controller's informer cache is growing.
Can you try reducing the TTL?
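
For reference, a minimal sketch of what "reduce the TTL" looks like on a Workflow spec, assuming the standard Argo Workflows v3 ttlStrategy field (the 1200-second values and the demo template below are illustrative, not taken from this cluster):

    # Sketch only: submit a workflow whose completed object is garbage-collected
    # 20 minutes after it finishes, which keeps the controller's informer smaller.
    kubectl create -f - <<'EOF'
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: ttl-example-
    spec:
      entrypoint: main
      ttlStrategy:
        secondsAfterCompletion: 1200
        secondsAfterSuccess: 1200
        secondsAfterFailure: 1200
      templates:
        - name: main
          container:
            image: alpine:3.14
            command: [echo, "hello"]
    EOF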


tymokvo commented Aug 11, 2021

Yep, I will do that and follow up here. Thanks!


alexec commented Aug 11, 2021

Memory usage should be less than 1GB in most use cases:

  • How many workflows are you running?
  • How many nodes per workflow on average?

Can we get a pprof dump please?

This is an example: https://github.com/argoproj/argo-workflows/blob/master/hack/capture-pprof.sh
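
For anyone unfamiliar with what that script collects, here is a rough sketch of the same capture done by hand, assuming the controller exposes Go's standard net/http/pprof endpoints on port 6060 (the port, and whether profiling must be enabled first, may differ by deployment; the capture-pprof.sh script linked above is the authoritative version):

    # Sketch only: forward the controller's pprof port and pull the three dumps.
    kubectl -n argo port-forward deploy/workflow-controller 6060:6060 &
    PF_PID=$!
    sleep 2
    curl -s -o heap    "http://localhost:6060/debug/pprof/heap"
    curl -s -o allocs  "http://localhost:6060/debug/pprof/allocs"
    curl -s -o profile "http://localhost:6060/debug/pprof/profile?seconds=30"
    kill "$PF_PID"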


tymokvo commented Aug 11, 2021

We have seen it increase to ~10GB after running 215 workflows with a total of 55,492 nodes over the course of 8 hours, at which point the node ran out of memory and the OOM killer took out the workflow-controller and a few of our other monitoring services.

I will get the pprof dump asap.


tymokvo commented Aug 11, 2021

pprof output files 1, 2, and 3. (I don't know what these are 😬 )


tymokvo commented Aug 11, 2021

Between 2021-08-10T06:50:00 and 2021-08-11T17:30:00 (~34 hrs) we ran 452 workflows. These were started in groups of related workflows that we call "jobs". The log file above covers the period from when the workflow-controller was restarted, while the 216-workflow jobs were running, until the end of that window.

Here's a table of how the load over time breaks down:

job_id    started_at           finished_at          workflow_count  nodes
9b0feb31  2021-08-10 07:10:08  2021-08-10 07:17:32               1   1102
07084fdd  2021-08-10 07:20:05  2021-08-10 07:39:21               4   4408
27bfa22a  2021-08-10 07:22:12  2021-08-10 07:25:39               1     71
8c7d1f83  2021-08-10 08:11:15  2021-08-10 08:13:41               1    122
b50d9c33  2021-08-10 08:11:16  2021-08-10 08:14:05               1    122
64c3714e  2021-08-10 08:11:17  2021-08-10 08:14:02               1    122
c8d6a157  2021-08-10 08:30:24  2021-08-10 08:31:58               1    122
5f20fee0  2021-08-10 10:13:55  2021-08-10 10:27:48               1   2602
5bc52074  2021-08-10 11:54:06  2021-08-10 12:16:37               1   2602
b5cd4ac1  2021-08-10 12:17:03  2021-08-10 12:23:20               1    202
3d92c52d  2021-08-10 17:06:29  2021-08-10 17:13:20               1    485
fd8391df  2021-08-10 17:18:27  2021-08-10 17:24:30               1    468
88f43ad7  2021-08-10 17:18:35  2021-08-10 17:22:27               1    399
db98f1a7  2021-08-10 17:20:18  2021-08-10 17:23:15               1    400
98a90a6e  2021-08-10 18:13:43  2021-08-10 18:21:42               1    468
a7db6e70  2021-08-10 18:13:53  2021-08-10 18:19:01               1    399
6776dea9  2021-08-10 18:14:04  2021-08-10 18:18:38               1    400
b4309ad5  2021-08-11 06:37:28  2021-08-11 06:39:22             216   3493
25eecca4  2021-08-11 06:45:19  2021-08-11 06:55:13             216   4321


alexec commented Aug 11, 2021

pprof output files 1, 2, and 3. (I don't know what these are 😬 )

The heap dump is missing, and allocs appears to be attached twice? It's the heap I most want to see.


alexec commented Aug 11, 2021

It might be good to set up a 30-minute call: https://bit.ly/book-30m-with-argo-team


tymokvo commented Aug 11, 2021

Ok, I will try to find some time. I ran it again and captured the heap dump svg here.


alexec commented Aug 11, 2021

I thought I'd jump in, but I didn't see anything out of the ordinary in the pprof dumps.

Can I ask you to review https://argoproj.github.io/argo-workflows/running-at-massive-scale/

Additionally, you didn't state whether or not this is a regression. Is this a brand-new installation?


tymokvo commented Aug 11, 2021

It's hard to tell if it's a regression; it's the first time we've experienced this specific issue. We weren't running this kind of load before upgrading to Argo 3, and we've been using 3.1.0-rc14 for a while now (though we had to fork and patch to mitigate #6276).

Re: running at massive scale, we've implemented parallelism limits, we're using the emissary executor, and the k8s cluster can handle the load (the good thing about this issue is that the workflows have all been completing successfully). But the workflow-controller's memory consumption is starving all of our other infrastructure services (fluentbit, prometheus, some custom eventing services, etc.).

I will look into implementing rate limiting and the requeue time, though.
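
A sketch of the two knobs being discussed, assuming the controller's DEFAULT_REQUEUE_TIME environment variable and its --qps/--burst client rate-limit flags (variable and flag names should be checked against your Argo version; the values are illustrative):

    # Slow down how often workflows are re-queued for reconciliation.
    kubectl -n argo set env deploy/workflow-controller DEFAULT_REQUEUE_TIME=30s

    # Raise the Kubernetes API client rate limits on the controller (assumed flags).
    kubectl -n argo patch deploy/workflow-controller --type=json -p='[
      {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--qps=30"},
      {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--burst=40"}
    ]'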


alexec commented Aug 11, 2021

Try the re-queue setting. If that works, great; if not, let's get on a call.


alexec commented Aug 11, 2021

Coda: you really should be running the latest version, which will be v3.1.5 (or v3.1.6 if that is released today). If need be, fork and re-apply your patch.
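
A sketch of that upgrade step, assuming the standard install.yaml manifest published with each Argo Workflows release and the default argo namespace (re-apply any local patches on top afterwards):

    kubectl apply -n argo -f \
      https://github.com/argoproj/argo-workflows/releases/download/v3.1.6/install.yaml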


tymokvo commented Aug 11, 2021

Ok, I'll start there.


tymokvo commented Aug 12, 2021

One thing I forgot to mention: we are using per-user namespaces with resource limits to prevent users from starving each other of CPU/memory. This and our resource allocation settings amount to each namespace being able to run ~66 pods simultaneously.
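
A sketch of the kind of per-user-namespace cap described above, assuming a standard Kubernetes ResourceQuota (the namespace name and the numbers are placeholders, not the actual values from this cluster):

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: user-compute-quota
      namespace: user-a          # one quota object per user namespace
    spec:
      hard:
        pods: "66"               # roughly the simultaneous-pod ceiling mentioned above
        requests.cpu: "64"
        requests.memory: 256Gi
        limits.cpu: "64"
        limits.memory: 256Gi
    EOF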

In order to test the suggested changes appropriately, I've been trying to replicate the memory consumption behavior on our test cluster and have gotten similar behavior by running jobs that create >200 workflows in two different user namespaces simultaneously.

In the monitoring screenshot below, the spike in activity between 12:15 and 12:30 comes from workflows running in the same user namespace, and the memory usage doesn't appear to increase.

In the spike in activity between 12:40 and 12:55, the workflows are the same two as in the previous spike, but run in separate namespaces. Here, the memory consumption does seem to increase.

workflow-controller-mem-increase

pprof heap dump 1, 2, 3.

workflow-controller logs.


alexec commented Aug 24, 2021

Did you try the newer version? I'm not clear.

alexec self-assigned this Aug 24, 2021

tymokvo commented Aug 24, 2021

Yeah, sorry for the delay. We have been doing a lot of testing in addition to fixing our own systems. I just ran the latest test overnight.

We've upgraded to 3.1.6, increased the requeue time to 30s, and reduced the TTL time to 1200 seconds for all finished states. We're still seeing memory consumption by the controller exceed 10GB for some of our workflows.

This is the memory consumption reported by GCP metrics for the one I ran last night:

workflow-controller-post-changes


alexec commented Aug 24, 2021

We're still seeing memory consumption

So generally reduced?


tymokvo commented Aug 24, 2021

Yeah, initially it seemed to be resolved until I ran this test case again last night.


alexec commented Aug 24, 2021

So we're making progress.

Can I request dumps when the memory usage is high again?


tymokvo commented Aug 24, 2021

Sure thing, it's still over 10GB so I can do it now.


tymokvo commented Aug 24, 2021

Dump 1, 2, 3

Ugh, hold on. It wasn't that high. I hadn't reset my monitoring timeline.


tymokvo commented Aug 24, 2021

So this is the current kubectl top output:

POD                                    NAME                  CPU(cores)   MEMORY(bytes)   
workflow-controller-7f7bd4cdc9-tf4qb   argo-dumpster         2m           127Mi           
workflow-controller-7f7bd4cdc9-tf4qb   workflow-controller   89m          7733Mi

Here's the pprof allocs, heap, and profile.

The pprof calls were taking so long over HTTP that kubectl port-forward kept timing out, so I had to make a pod in the cluster to generate them. If anyone else needs an image to do this, I made one here.
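
A rough sketch of capturing the heap dump from inside the cluster when port-forwarding keeps timing out. It assumes the controller serves the standard Go pprof endpoints on port 6060 and that pod IPs are reachable from other pods; the label selector, image, and pod name are illustrative:

    # Find the controller's pod IP (label selector may differ in your install).
    CONTROLLER_IP=$(kubectl -n argo get pod -l app=workflow-controller \
      -o jsonpath='{.items[0].status.podIP}')

    # Run a throwaway curl pod next to it and stream the profile back out.
    kubectl -n argo run pprof-capture --image=curlimages/curl --restart=Never -- sleep 3600
    kubectl -n argo wait --for=condition=Ready pod/pprof-capture
    kubectl -n argo exec pprof-capture -- \
      curl -s "http://${CONTROLLER_IP}:6060/debug/pprof/heap" > heap
    kubectl -n argo delete pod pprof-capture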


alexec commented Aug 25, 2021

Hotspots

  • util.MergeTo
  • WorkflowSemaphoreKeyIndexFunc


alexec commented Aug 25, 2021

Can you run your controller with INDEX_WORKFLOW_SEMAPHORE_KEYS=false? Note that this will disable semaphores.
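
A sketch of applying that suggestion with kubectl set env (the deployment name and namespace assume the default install):

    # Disables the semaphore key index in the controller's informer; note that,
    # per the comment above, semaphores stop working while this is set.
    kubectl -n argo set env deploy/workflow-controller INDEX_WORKFLOW_SEMAPHORE_KEYS=false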


tymokvo commented Aug 25, 2021

Yep, just updated the controller. The test case takes a few hours to run though.


tymokvo commented Aug 25, 2021

Ok! That seems to have done it. Memory consumption is sitting at around 260MB with peaks around 1GB. The variance between my two test runs was pretty high, but the upper bound was much more bearable.

Out of curiosity, what indicates that the WorkflowSemaphoreKeyIndexFunc was a problem? Hopefully we can fix issues like this in the future for ourselves.

whynowy added the solution/workaround label Aug 26, 2021
alexec added this to the v3.3 milestone Aug 26, 2021
alexec assigned sarabala1979 and unassigned alexec Aug 26, 2021

alexec commented Aug 26, 2021

@tymokvo I'm not very familiar with the Golang GC, but I've noted that when very large numbers of allocations happen, even though the heap does not grow, real memory usage grows. The same thing happens with the Java Virtual Machine. It was the allocs dump that pointed to this.
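
A sketch of the comparison that surfaces this, assuming the heap and allocs dumps captured earlier in the thread are saved locally as heap and allocs:

    # Cumulative allocation volume since process start (allocation churn).
    go tool pprof -top -sample_index=alloc_space allocs
    # Live, currently-retained heap.
    go tool pprof -top -sample_index=inuse_space heap
    # Functions that rank high in alloc_space but not in inuse_space (such as
    # util.MergeTo here) point at allocation churn rather than retained leaks.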

@sarabala1979 actions for you:

  • For v3.2, just document the workaround in the scaling.md doc.
  • For v3.3 investigate a solution.


stale bot commented Oct 27, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Oct 27, 2021
stale bot closed this as completed Nov 9, 2021