
Large workflow performance #2247

Closed · paveq opened this issue Feb 16, 2020 · 6 comments · Fixed by #2921 or #3180

paveq (Contributor) commented Feb 16, 2020

Summary

What level of performance should be expected when running large workflows consisting of a lot of relatively fast tasks (in terms of execution time)?

Motivation

I'm currently running large workflows (looping through 500 items) where each item consists of 9 separate DAG tasks: one task handles the main processing, and the rest are more lightweight, such as API calls, transferring small amounts of data, etc.

It seems increasing workflow-level parallelism does not improve the processing rate linearly (I increased it from 5 to 10). Currently I'm running two workflows at the same time with a parallelism of 10 each, and the workflow-controller is consuming 1200m of CPU. That feels like a lot of CPU for what appears to be just the creation and deletion of pods.

Might the bottleneck be the constant updating of the Workflow's nodes field, and the field compression/decompression? Does this update happen on each creation of a pod? If so, might the performance of the K8s API / etcd on GKE also affect this to some extent?

paveq added the question label Feb 16, 2020
alexec added this to the v2.10 milestone Apr 13, 2020
alexec modified the milestones: v2.10, v2.9 Apr 27, 2020
alexec linked a pull request May 5, 2020 that will close this issue
alexec (Contributor) commented May 5, 2020

I have pushed a development build of the controller that should/may reduce this problem.

argoproj/workflow-controller:feat-no-sleep

@paveq could you please try it out?

ygapon-mio commented:
@alexec I'd like to try out this build, but I don't see argoproj/workflow-controller:feat-no-sleep here

I have a workflow with about 900 items and can compare performance vs workflow-controller:v2.7.6.

alexec (Contributor) commented May 8, 2020

Can you try the prototype here: https://bit.ly/argo-wf-prototypes

ygapon-mio commented May 12, 2020

I ran my relatively large workflow with workflow-controller:feat-no-sleep and workflow-controller:v2.7.6 with different parallelism settings. The workflow contains just over 900 nodes, all small, each with just a couple of API calls.

Here is what I got (nodes in workflow: ~900):

| Controller image | Parallelism | Duration | Avg CPU |
| --- | --- | --- | --- |
| argoproj/workflow-controller:v2.7.6 | 5 | 25:17 min | 320 mcores |
| argoproj/workflow-controller:v2.7.6 | 10 | 14:41 min | 332 mcores |
| argoproj/workflow-controller:v2.7.6 | 20 | 12:26 min | 342 mcores |
| alexcollinsintuit/workflow-controller:feat-no-sleep | 5 | 24:37 min | 358 mcores |
| alexcollinsintuit/workflow-controller:feat-no-sleep | 10 | 16:06 min | 332 mcores |
| alexcollinsintuit/workflow-controller:feat-no-sleep | 20 | 13:35 min | 429 mcores |

hiu-phail commented Jun 6, 2020

Hey guys,

I just stumbled upon the same problem and did some investigation I wanted to share with you. I'm in a situation where I also have >1k items I'd like to process with fixed parallelism in Argo. Each item gets processed with a step of two pods and some retries allowed, so there is some complexity involved. I used Argo v2.8.1.

As a total non-Gopher, I started adding a lot of log messages to get a glimpse of what is actually going on in there. As far as I understood it, the design makes sense to me, but it doesn't seem to be made for this use case (yet). Anyway, here is what I saw:

The number of pods started is sometimes nowhere near the defined parallelism limit

Earlier I just looked at the CPU load and thought spawning pods and cleaning them up takes too much time, but that isn't the problem. Argo wouldn't be able to help me there anyway, so I figured the metric for success should be the number of running pods.

The CPU load on the workflow controller is high, and scaling the deployment doesn't actually help
I found some hints suggesting the workflow controller is supposed to scale horizontally, but I actually got lower performance, as the two pods generated collisions. That's what led me to the conclusion that individual requests are simply taking very long. I first thought that parsing and compressing the state might already be an issue, but with jq and gzip one can easily see that those steps complete in well under a few milliseconds, so that couldn't be it.
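For reference, here is a minimal Go sketch of that sanity check, assuming the compressed nodes field is essentially a gzip'd JSON blob; the payload is synthetic and only meant to show the order of magnitude:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
	"time"
)

// Rough sanity check: time decompression plus parsing of a sample payload,
// assuming the compressed state is a gzip'd JSON blob (synthetic data here).
func main() {
	// build a sample payload with ~900 fake node entries
	nodes := map[string]map[string]string{}
	for i := 0; i < 900; i++ {
		nodes[fmt.Sprintf("node-%d", i)] = map[string]string{"phase": "Succeeded"}
	}
	raw, _ := json.Marshal(nodes)

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(raw)
	zw.Close()

	// time decompression plus parsing, the two steps under suspicion
	start := time.Now()
	zr, _ := gzip.NewReader(&buf)
	decompressed, _ := io.ReadAll(zr)
	var parsed map[string]map[string]string
	json.Unmarshal(decompressed, &parsed)
	fmt.Printf("decompress+parse of %d nodes took %s\n", len(parsed), time.Since(start))
}
```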

The operate function takes up to 10 seconds
So obviously the workflow controller simply isn't able to react to pods stopping, because it is busy with its own logic (why does no one care for my workloads ;<). I tried to dig a little deeper and found some potential fixes to help the situation.

The controller tries to schedule workflows although it already knows that the parallelism limit is hit
I am not sure why it does that, but at least for my use case it seemed unnecessary. The place to adapt is probably https://github.com/argoproj/argo/blob/master/workflow/controller/steps.go#L238 . I patched it roughly as follows (a sketch is shown after this list):

  1. Add a flag to record that parallelism prevented further starts
  2. Break the loop after adding the generated child
  3. Use that flag when initializing the complete flag later, to make sure the workflow isn't marked complete when things couldn't be scheduled
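A minimal, self-contained Go sketch of that idea (the names `child` and `scheduleChildren` are illustrative, not the actual code in steps.go):

```go
package main

import "fmt"

// Minimal sketch of the patch described above: stop generating further
// children once the parallelism limit is reached, and remember that fact so
// the step group is not marked complete prematurely.
type child struct {
	name    string
	started bool
}

// scheduleChildren starts children until the parallelism limit is hit,
// remembering whether the limit was what stopped it.
func scheduleChildren(children []child, running, parallelism int) (started []string, limitHit bool) {
	for _, c := range children {
		if c.started {
			continue // already running or finished
		}
		if running >= parallelism {
			limitHit = true // 1. flag that parallelism prevented further starts
			break           // 2. break the loop instead of evaluating the rest
		}
		started = append(started, c.name)
		running++
	}
	return started, limitHit
}

func main() {
	children := []child{{name: "item-0"}, {name: "item-1"}, {name: "item-2"}}
	started, limitHit := scheduleChildren(children, 1, 2)
	// 3. only consider the group complete if the limit did not block anything
	complete := !limitHit && len(started) == len(children)
	fmt.Println(started, limitHit, complete)
}
```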

Some processing-intensive results are recomputed a lot
I still noticed high fluctuation in the number of pods started, so I looked a little further and found expandStep (https://github.com/argoproj/argo/blob/master/workflow/controller/steps.go#L429) to take roughly 6-10 seconds at its peak. If I understand the method correctly, it basically generates the templates to spawn nodes from. It gets redone every time (I assume), as keeping state is evil (it indeed is :S), but sadly recomputing just takes a lot of time. To check whether it would help if that didn't happen, I introduced a map at the level of the workflow controller which caches the result of that method (a rough sketch follows below). Afterwards, I could nicely see that the duration of operate calls was indeed lower, peaking at 7 seconds in the beginning and taking 1 to 3 seconds in subsequent calls. The number of pods was still not ideal, but I got an average of roughly 80 running pods (with a limit of 90), with the minimum around 65 so far; earlier it could drop down to 20 or less. The mean CPU usage of the workflow controller is still at 60%, where earlier it was >100%.
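A rough sketch of what such a cache could look like (names like `expandCache` and `expensiveExpand` are made up, not Argo APIs):

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative sketch of the caching idea: memoise the expensive expansion of
// a looped step so repeated operate() passes over the same workflow do not
// recompute identical templates.
type expandKey struct {
	workflow string
	stepName string
}

type expandCache struct {
	mu    sync.Mutex
	items map[expandKey][]string
}

// expand returns the cached expansion for k, computing it only once.
func (c *expandCache) expand(k expandKey, compute func() []string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.items[k]; ok {
		return v // served from the cache on subsequent operate() calls
	}
	v := compute()
	c.items[k] = v
	return v
}

func main() {
	cache := &expandCache{items: map[expandKey][]string{}}
	expensiveExpand := func() []string {
		// stands in for the real template expansion measured at 6-10 seconds
		return []string{"child-0", "child-1"}
	}
	k := expandKey{workflow: "wf-1", stepName: "process-items"}
	first := cache.expand(k, expensiveExpand)
	second := cache.expand(k, expensiveExpand) // cache hit, no recomputation
	fmt.Println(first, second)
}
```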

The web UI isn't helpful at that scale anymore
That's actually a pity: showing all these nodes in the web UI doesn't really work out anymore. Maybe it would be possible to add some aggregated view when withParam or withList results are too large, basically showing the percentage of finished jobs instead?

From all the points above, I believe Argo could be doing better here. My little hacks to make it perform certainly aren't ideal, but maybe something like this could find its way into a release v2.8.2 eventually? I'd be happy about that =).

Update: While that helped for a while, skipping over all the already-finished tasks now seems to be wasting time. Maybe some sort of "low water mark" would be good?

Update 2: The low water mark, i.e. the index up to which the children have already been processed, helped keep the mean CPU load below 50% =).

Update 3: Don't forget to keep updating the nodeSteps for skipped nodes due to the low water mark :S. Otherwise you get a hard-to-debug problem with an inconsistent workflow state (node pending while the workflow has already failed...).
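For completeness, a minimal sketch of the low-water-mark idea; again, the types are illustrative rather than the real controller code:

```go
package main

import "fmt"

// Illustrative low-water-mark sketch. The caveat from Update 3 still applies:
// nodes skipped via the low water mark must still have their status
// propagated, or the workflow state becomes inconsistent.
type stepGroup struct {
	children []string
	done     []bool
	lowWater int // every child below this index is known to be finished
}

// pendingChildren advances the low water mark past the contiguous prefix of
// finished children and only scans the remainder.
func (g *stepGroup) pendingChildren() []string {
	for g.lowWater < len(g.children) && g.done[g.lowWater] {
		g.lowWater++
	}
	var pending []string
	for i := g.lowWater; i < len(g.children); i++ {
		if !g.done[i] {
			pending = append(pending, g.children[i])
		}
	}
	return pending
}

func main() {
	g := &stepGroup{
		children: []string{"c0", "c1", "c2", "c3"},
		done:     []bool{true, true, false, false},
	}
	fmt.Println(g.pendingChildren(), g.lowWater) // [c2 c3] 2
}
```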

alexec linked a pull request Jun 9, 2020 that will close this issue
hiu-phail commented:

Hey @alexec, I'd love to get rid of my own patches on Argo and test the impact of the two flags on my workflow =). Could you provide some guidance on how I should tune the parameters for that use case?
