Large workflow performance #2247
Comments
I have pushed a development build of the controller that should/may reduce this problem.
@paveq could you please try it out? |
Can you try the prototype here: https://bit.ly/argo-wf-prototypes |
I ran my relatively large workflow with both controller images. Here is what I got (nodes in workflow: ~900): argoproj/workflow-controller:v2.7.6 at parallelism 5 / 10 / 20, and alexcollinsintuit/workflow-controller:feat-no-sleep at parallelism 5 / 10 / 20. |
Hey guys, I just stumbled upon the same problem and did some investigating I wanted to share with you. I have a situation where I also have >1k items I'd like to process with fixed parallelism with Argo. Each item gets processed by a step of two pods with some retries allowed, so there is some complexity involved. I used Argo v2.8.1. As a total non-Gopher I started adding a lot of log messages to get a glimpse of what you guys even do there. As far as I understood it, it makes sense to me, although it doesn't seem to be made for this use case (yet). Anyway, here is what I saw:

- The number of pods started is sometimes nowhere near the defined parallelism limit. I had earlier just looked at the CPU load and thought that spawning pods and cleaning them up takes too much time, but that isn't the problem. Argo wouldn't be able to help me there anyway, so I figured a better metric for success is the number of running pods.
- The CPU load on the workflow controller is high, and even scaling the deployment doesn't actually help.
- The operate function takes up to 10 seconds.
- The controller tries to schedule workflows although it already knows that the parallelism limit is hit.
- Some processing-intensive results are recomputed a lot.
- The web UI isn't helpful at that scale anymore.

So from all those things above I believe Argo could be doing better here. My little hacks to make it perform certainly aren't ideal, but maybe something like this could find its way into a release (v2.8.2?) eventually? I'd be happy about that =).

Update: While that helped for a while, skipping over all the already-done tasks now seems to be wasting time. Maybe some sort of "low water mark" would be good?

Update 2: The low water mark, i.e. the index up to which the children have already been processed, helped keep the mean CPU load below 50% =).

Update 3: Don't forget to keep updating the nodeSteps for the nodes skipped due to the low water mark :S. Otherwise you get a hard-to-debug problem with an inconsistent workflow state (a node pending while the workflow has already failed...). |
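The low-water-mark idea in the comment above can be pictured with a minimal Go sketch. The types, constants, and function below are purely illustrative and are not the actual workflow-controller code; the point is only that a persisted index lets each reconciliation pass skip the prefix of children that are already terminal.

```go
package main

import "fmt"

// Illustrative types only - not the real workflow-controller API.
type nodePhase string

const (
	nodePending   nodePhase = "Pending"
	nodeRunning   nodePhase = "Running"
	nodeSucceeded nodePhase = "Succeeded"
	nodeFailed    nodePhase = "Failed"
)

func isTerminal(p nodePhase) bool {
	return p == nodeSucceeded || p == nodeFailed
}

// reconcileChildren advances the low water mark past the prefix of children that
// are already in a terminal state, then visits only the remaining children. The
// returned mark is persisted so the next reconciliation pass can skip that prefix.
func reconcileChildren(phases []nodePhase, lowWaterMark int, visit func(i int, p nodePhase)) int {
	for lowWaterMark < len(phases) && isTerminal(phases[lowWaterMark]) {
		lowWaterMark++
	}
	for i := lowWaterMark; i < len(phases); i++ {
		visit(i, phases[i])
	}
	return lowWaterMark
}

func main() {
	phases := []nodePhase{nodeSucceeded, nodeSucceeded, nodeRunning, nodePending}
	mark := reconcileChildren(phases, 0, func(i int, p nodePhase) {
		fmt.Printf("child %d still needs attention (%s)\n", i, p)
	})
	fmt.Println("low water mark is now", mark) // prints 2: the first two children are done
}
```

As Update 3 warns, anything skipped via the mark must still be kept consistent in the persisted node statuses, otherwise the workflow can end up in a contradictory state.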
Hey @alexec, I'd love to get rid of my own patches on Argo and test the impact of the two flags on my workflow =). Could you give some guidance on how I should tune those parameters for this use case? |
Summary
What level of performance should be expected when running large workflows consisting of a lot of relatively fast tasks (in terms of execution time)?
Motivation
I'm currently running large workflows (looping through 500 items) where each item consists of 9 separate DAG tasks. One task handles the main processing; the rest are more lightweight, such as API calls, transferring small amounts of data, etc.
It seems increasing workflow-level parallelism does not improve the processing rate linearly (I increased it from 5 to 10). Currently I'm running two workflows at the same time with a parallelism of 10 each, and the workflow-controller is consuming 1200m of CPU. That feels like a lot of CPU for what appears to be creation and deletion of pods.
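One way to see why throughput can stop scaling with parallelism is a toy model in which every item needs a slice of a single, serialized "controller" (pod creation/cleanup bookkeeping) before its actual work can run in parallel. The sketch below is not Argo code and all durations are invented; it only illustrates that once the serialized per-item cost dominates, adding parallelism stops helping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// run processes `items` tasks with at most `parallelism` in flight. Each task first
// needs `controllerPerItem` of time on a single shared lock (the serialized
// "controller"), then sleeps for `workPerItem` (the pod's own, fully parallel work).
func run(parallelism, items int, controllerPerItem, workPerItem time.Duration) time.Duration {
	var controller sync.Mutex           // the one shared controller, serialized
	sem := make(chan struct{}, parallelism) // caps concurrent tasks
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < items; i++ {
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			controller.Lock() // per-item bookkeeping that cannot be parallelized
			time.Sleep(controllerPerItem)
			controller.Unlock()
			time.Sleep(workPerItem) // the item's actual work, done in parallel
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	for _, p := range []int{5, 10, 20} {
		d := run(p, 200, 2*time.Millisecond, 20*time.Millisecond)
		fmt.Printf("parallelism %2d: %v\n", p, d)
	}
}
```

Running it should show the wall-clock time dropping noticeably from parallelism 5 to 10 but much less from 10 to 20, because the serialized portion becomes the floor.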
Might the bottleneck be in the constant updating of the Workflow's nodes field, and in the field compression/decompression? Does this update happen on each pod creation? If so, the performance of the Kubernetes API / etcd on GKE might also affect this to some extent?
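To get a rough sense of the cost being asked about, the sketch below gzip-compresses and base64-encodes a JSON-serialized node-status map, which is the general shape of such a compression step; the field names here are made up, and whether and how often the real controller compresses the nodes field depends on the version and on workflow size.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// nodeStatus is a stand-in for a per-node status entry; the real structure is larger.
type nodeStatus struct {
	Name  string `json:"name"`
	Phase string `json:"phase"`
}

// compressNodes serializes the node map to JSON, gzips it, and base64-encodes the
// result - the general pattern for packing a large status map into a string field.
func compressNodes(nodes map[string]nodeStatus) (string, error) {
	raw, err := json.Marshal(nodes)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

func main() {
	nodes := make(map[string]nodeStatus)
	for i := 0; i < 900; i++ {
		name := fmt.Sprintf("wf-node-%d", i)
		nodes[name] = nodeStatus{Name: name, Phase: "Succeeded"}
	}
	compressed, err := compressNodes(nodes)
	if err != nil {
		panic(err)
	}
	fmt.Printf("900 nodes compress to %d base64 characters\n", len(compressed))
}
```

If work of this shape runs on every status update for a ~900-node workflow, the CPU cost grows with the node count, which would be consistent with the behaviour described above.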