
workflow-controller memory usage increases monotonically #6532

Closed · tymokvo opened this issue Aug 11, 2021 · 31 comments
Labels: solution/workaround (There's a workaround, might not be great, but exists), type/bug

tymokvo commented Aug 11, 2021

Summary

Opened at the request of @sarabala1979 on Slack here.

What happened/what you expected to happen?

The workflow-controller's memory usage increased monotonically with each submitted workflow. I expected its memory to be freed some time after each workflow transitioned to a "finished" state.

Diagnostics

👀 Yes! We need all of your diagnostics, please make sure you add it all, otherwise we'll go around in circles asking you for it:

What Kubernetes provider are you using?

GKE

What version of Argo Workflows are you running?

3.1.0-rc14

What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary

Emissary

Did this work in a previous version? I.e. is it a regression?

Unknown

Are you pasting thousands of log lines? That's too much information.

Yes, kind of, but I was asked to in this case.

The logs are 5.4MB, so I put them on GCS here.

The workflows in question completed at around 06:55 UTC.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

sarabala1979 (Member) commented:

I can see the log line "Deleting TTL expired workflow", so I think TTL deletion is working. Can you provide a little more information about your environment? How busy is your cluster?
Can you also check whether the already-deleted workflow is still present in your cluster?
Deleting TTL expired workflow '793fb579-caf4-49d1-a7e6-f4bb1d246095/bf14d17b-a55f-50d3-8420-f6196563b19e'"

tymokvo commented Aug 11, 2021

The cluster is under pretty heavy load. This workflow (bf14d17b-a55f-50d3-8420-f6196563b19e) was one of 216 that a user submitted at once, which created a total of 3493 nodes.

The workflow object is no longer present in the namespace.

sarabala1979 (Member) commented:

@tymokvo That means TTL deletion is working. Because of the big workflows and the heavy load of new workflows, the controller's informer cache is growing.
Can you try reducing the TTL?
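
For reference, a minimal sketch of what "reduce the TTL" looks like on a Workflow spec, assuming the standard Argo Workflows v3 ttlStrategy field (the 1200-second values and the demo template below are illustrative, not taken from this cluster):

    # Sketch only: submit a workflow whose completed object is garbage-collected
    # 20 minutes after it finishes, which keeps the controller's informer smaller.
    kubectl create -f - <<'EOF'
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: ttl-example-
    spec:
      entrypoint: main
      ttlStrategy:
        secondsAfterCompletion: 1200
        secondsAfterSuccess: 1200
        secondsAfterFailure: 1200
      templates:
        - name: main
          container:
            image: alpine:3.14
            command: [echo, "hello"]
    EOF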


tymokvo commented Aug 11, 2021

Yep, I will do that and follow up here. Thanks!


alexec commented Aug 11, 2021

Memory usage should be less than 1GB in most use cases:

  • How many workflows are you running?
  • How many nodes per workflow on average?

Can we get a pprof dump please?

This is an example: https://github.com/argoproj/argo-workflows/blob/master/hack/capture-pprof.sh
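
For anyone unfamiliar with what that script collects, here is a rough sketch of the same capture done by hand, assuming the controller exposes Go's standard net/http/pprof endpoints on port 6060 (the port, and whether profiling must be enabled first, may differ by deployment; the capture-pprof.sh script linked above is the authoritative version):

    # Sketch only: forward the controller's pprof port and pull the three dumps.
    kubectl -n argo port-forward deploy/workflow-controller 6060:6060 &
    PF_PID=$!
    sleep 2
    curl -s -o heap    "http://localhost:6060/debug/pprof/heap"
    curl -s -o allocs  "http://localhost:6060/debug/pprof/allocs"
    curl -s -o profile "http://localhost:6060/debug/pprof/profile?seconds=30"
    kill "$PF_PID"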


tymokvo commented Aug 11, 2021

We have seen it increase to ~10GB after running 215 workflows with a total of 55,492 nodes over the course of 8 hours, at which point the node ran out of memory and the OOM killer took out the workflow-controller and a few of our other monitoring services.

I will get the pprof dump asap.


tymokvo commented Aug 11, 2021

pprof output files 1, 2, and 3. (I don't know what these are 😬 )


tymokvo commented Aug 11, 2021

Between 2021-08-10T06:50:00 and 2021-08-11T17:30:00 (~34 hrs) we ran 452 workflows. These were started in groups of related workflows that we call "jobs". The log file above covers the period from when the workflow-controller was restarted, while the 216-workflow jobs were running, until the end of that window.

Here's a table of how the load over time breaks down:

job_id    started_at           finished_at          workflow_count  nodes
9b0feb31  2021-08-10 07:10:08  2021-08-10 07:17:32               1   1102
07084fdd  2021-08-10 07:20:05  2021-08-10 07:39:21               4   4408
27bfa22a  2021-08-10 07:22:12  2021-08-10 07:25:39               1     71
8c7d1f83  2021-08-10 08:11:15  2021-08-10 08:13:41               1    122
b50d9c33  2021-08-10 08:11:16  2021-08-10 08:14:05               1    122
64c3714e  2021-08-10 08:11:17  2021-08-10 08:14:02               1    122
c8d6a157  2021-08-10 08:30:24  2021-08-10 08:31:58               1    122
5f20fee0  2021-08-10 10:13:55  2021-08-10 10:27:48               1   2602
5bc52074  2021-08-10 11:54:06  2021-08-10 12:16:37               1   2602
b5cd4ac1  2021-08-10 12:17:03  2021-08-10 12:23:20               1    202
3d92c52d  2021-08-10 17:06:29  2021-08-10 17:13:20               1    485
fd8391df  2021-08-10 17:18:27  2021-08-10 17:24:30               1    468
88f43ad7  2021-08-10 17:18:35  2021-08-10 17:22:27               1    399
db98f1a7  2021-08-10 17:20:18  2021-08-10 17:23:15               1    400
98a90a6e  2021-08-10 18:13:43  2021-08-10 18:21:42               1    468
a7db6e70  2021-08-10 18:13:53  2021-08-10 18:19:01               1    399
6776dea9  2021-08-10 18:14:04  2021-08-10 18:18:38               1    400
b4309ad5  2021-08-11 06:37:28  2021-08-11 06:39:22             216   3493
25eecca4  2021-08-11 06:45:19  2021-08-11 06:55:13             216   4321


alexec commented Aug 11, 2021

pprof output files 1, 2, and 3. (I don't know what these are 😬 )

The heap dump is missing, and allocs appears to be attached twice? It's the heap I most want to see.


alexec commented Aug 11, 2021

It might be good to set up a 30-minute call: https://bit.ly/book-30m-with-argo-team


tymokvo commented Aug 11, 2021

Ok, I will try to find some time. I ran it again and captured the heap dump svg here.


alexec commented Aug 11, 2021

I thought I'd jump in, but I didn't see anything out of the ordinary in the pprof dumps.

Can I ask you to review https://argoproj.github.io/argo-workflows/running-at-massive-scale/

Additionally, you didn't state whether or not this is a regression. Is this a brand-new installation?


tymokvo commented Aug 11, 2021

It's hard to tell if it's a regression; it's the first time we've experienced this specific issue. We weren't running this kind of load before upgrading to Argo 3, and we've been using 3.1.0-rc14 for a while now (though we had to fork and patch to mitigate #6276).

Re: running at massive scale, we've implemented parallelism limits, we're using the emissary executor, and the k8s cluster can handle the load (the good thing about this issue is that the workflows have all been completing successfully). But the workflow-controller's memory consumption is starving all of our other infrastructure services (fluentbit, prometheus, some custom eventing services, etc.).

I will look into implementing rate limiting and the requeue time, though.
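
A sketch of the two knobs being discussed, assuming the controller's DEFAULT_REQUEUE_TIME environment variable and its --qps/--burst client rate-limit flags (variable and flag names should be checked against your Argo version; the values are illustrative):

    # Slow down how often workflows are re-queued for reconciliation.
    kubectl -n argo set env deploy/workflow-controller DEFAULT_REQUEUE_TIME=30s

    # Raise the Kubernetes API client rate limits on the controller (assumed flags).
    kubectl -n argo patch deploy/workflow-controller --type=json -p='[
      {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--qps=30"},
      {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--burst=40"}
    ]'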


alexec commented Aug 11, 2021

Try the re-queue setting. If that works, great; if not, let's get on a call.


alexec commented Aug 11, 2021

Coda: you really should be running the latest version, which will be v3.1.5 (or v3.1.6 if that is released today). If need be, fork and re-apply your patch.
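
A sketch of that upgrade step, assuming the standard install.yaml manifest published with each Argo Workflows release and the default argo namespace (re-apply any local patches on top afterwards):

    kubectl apply -n argo -f \
      https://github.com/argoproj/argo-workflows/releases/download/v3.1.6/install.yaml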


tymokvo commented Aug 11, 2021

Ok, I'll start there.


tymokvo commented Aug 12, 2021

One thing I forgot to mention: we are using per-user namespaces with resource limits to prevent users from starving each other of CPU/memory. This and our resource allocation settings amount to each namespace being able to run ~66 pods simultaneously.
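
A sketch of the kind of per-user-namespace cap described above, assuming a standard Kubernetes ResourceQuota (the namespace name and the numbers are placeholders, not the actual values from this cluster):

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: user-compute-quota
      namespace: user-a          # one quota object per user namespace
    spec:
      hard:
        pods: "66"               # roughly the simultaneous-pod ceiling mentioned above
        requests.cpu: "64"
        requests.memory: 256Gi
        limits.cpu: "64"
        limits.memory: 256Gi
    EOF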

In order to test the suggested changes appropriately, I've been trying to replicate the memory consumption behavior on our test cluster and have gotten similar behavior by running jobs that create >200 workflows in two different user namespaces simultaneously.

In the monitoring screenshot below, the spike in activity between 12:15 and 12:30 comes from workflows running in the same user namespace, and the memory usage doesn't appear to increase.

In the spike in activity between 12:40 and 12:55, the workflows are the same two as in the previous spike, but run in separate namespaces. Here, the memory consumption does seem to increase.

workflow-controller-mem-increase

pprof heap dump 1, 2, 3.

workflow-controller logs.


alexec commented Aug 24, 2021

Did you try the newer version? I'm not clear.

alexec self-assigned this Aug 24, 2021

tymokvo commented Aug 24, 2021

Yeah, sorry for the delay. We have been doing a lot of testing in addition to fixing our own systems. I just ran the latest test overnight.

We've upgraded to 3.1.6, increased the requeue time to 30s, and reduced the TTL time to 1200 seconds for all finished states. We're still seeing memory consumption by the controller exceed 10GB for some of our workflows.

This is the memory consumption reported by GCP metrics for the one I ran last night:

workflow-controller-post-changes


alexec commented Aug 24, 2021

We're still seeing memory consumption

So generally reduced?


tymokvo commented Aug 24, 2021

Yeah, initially it seemed to be resolved until I ran this test case again last night.


alexec commented Aug 24, 2021

So we're making progress.

Can I request dumps when the memory usage is high again?


tymokvo commented Aug 24, 2021

Sure thing, it's still over 10GB so I can do it now.


tymokvo commented Aug 24, 2021

Dump 1, 2, 3

Ugh, hold on. It wasn't that high. I hadn't reset my monitoring timeline.


tymokvo commented Aug 24, 2021

So this is the current kubectl top output:

POD                                    NAME                  CPU(cores)   MEMORY(bytes)   
workflow-controller-7f7bd4cdc9-tf4qb   argo-dumpster         2m           127Mi           
workflow-controller-7f7bd4cdc9-tf4qb   workflow-controller   89m          7733Mi

Here's the pprof allocs, heap, and profile.

The pprof calls were taking so long over HTTP that kubectl port-forward kept timing out, so I had to make a pod in the cluster to generate them. If anyone else needs an image to do this, I made one here.
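
A rough sketch of capturing the heap dump from inside the cluster when port-forwarding keeps timing out. It assumes the controller serves the standard Go pprof endpoints on port 6060 and that pod IPs are reachable from other pods; the label selector, image, and pod name are illustrative:

    # Find the controller's pod IP (label selector may differ in your install).
    CONTROLLER_IP=$(kubectl -n argo get pod -l app=workflow-controller \
      -o jsonpath='{.items[0].status.podIP}')

    # Run a throwaway curl pod next to it and stream the profile back out.
    kubectl -n argo run pprof-capture --image=curlimages/curl --restart=Never -- sleep 3600
    kubectl -n argo wait --for=condition=Ready pod/pprof-capture
    kubectl -n argo exec pprof-capture -- \
      curl -s "http://${CONTROLLER_IP}:6060/debug/pprof/heap" > heap
    kubectl -n argo delete pod pprof-capture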


alexec commented Aug 25, 2021

Hotspots

  • util.MergeTo
  • WorkflowSemaphoreKeyIndexFunc


alexec commented Aug 25, 2021

Can you run your controller with INDEX_WORKFLOW_SEMAPHORE_KEYS=false? Note that this will disable semaphores.
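
A sketch of applying that suggestion with kubectl set env (the deployment name and namespace assume the default install):

    # Disables the semaphore key index in the controller's informer; note that,
    # per the comment above, semaphores stop working while this is set.
    kubectl -n argo set env deploy/workflow-controller INDEX_WORKFLOW_SEMAPHORE_KEYS=false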


tymokvo commented Aug 25, 2021

Yep, just updated the controller. The test case takes a few hours to run though.


tymokvo commented Aug 25, 2021

Ok! That seems to have done it. Memory consumption is sitting at around 260MB with peaks around 1GB. The variance between my two test runs was pretty high, but the upper bound was much more bearable.

Out of curiosity, what indicates that the WorkflowSemaphoreKeyIndexFunc was a problem? Hopefully we can fix issues like this in the future for ourselves.

whynowy added the solution/workaround label Aug 26, 2021
alexec added this to the v3.3 milestone Aug 26, 2021
alexec assigned sarabala1979 and unassigned alexec Aug 26, 2021

alexec commented Aug 26, 2021

@tymokvo I'm not very familiar with the Golang GC, but I've noted that when very large numbers of allocations happen, even though the heap does not grow, real memory usage grows. The same thing happens with the Java Virtual Machine. It was the allocs dump that pointed to this.
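
A sketch of the comparison that surfaces this, assuming the heap and allocs dumps captured earlier in the thread are saved locally as heap and allocs:

    # Cumulative allocation volume since process start (allocation churn).
    go tool pprof -top -sample_index=alloc_space allocs
    # Live, currently-retained heap.
    go tool pprof -top -sample_index=inuse_space heap
    # Functions that rank high in alloc_space but not in inuse_space (such as
    # util.MergeTo here) point at allocation churn rather than retained leaks.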

@sarabala1979 actions for you:

  • For v3.2, just document the workaround in the scaling.md doc.
  • For v3.3 investigate a solution.


stale bot commented Oct 27, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Oct 27, 2021
stale bot closed this as completed Nov 9, 2021