
Large workflow performance #2247

Closed · paveq opened this issue Feb 16, 2020 · 6 comments · Fixed by #2921 or #3180

paveq (Contributor) commented Feb 16, 2020

Summary

What level of performance should be expected when running large workflows consisting of a lot of relatively fast tasks (in terms of execution time)?

Motivation

I'm currently running large workflows (looping through 500 items) where each item consists of 9 separate DAG tasks: one task handles the main processing, and the rest are more lightweight, such as API calls, transferring small amounts of data, etc.

It seems increasing workflow-level parallelism does not improve the processing rate linearly (I increased it from 5 to 10). Currently I'm running two workflows at the same time with a parallelism of 10 each, and the workflow-controller is consuming 1200m of CPU. That feels like a lot of CPU for what appears to be just the creation and deletion of pods.

Might the bottleneck be the constant updating of the Workflow's nodes field, and the field compression/decompression? Does this update happen on each creation of a pod? If so, might the performance of the K8s API / etcd on GKE also affect this to some extent?

paveq added the question label Feb 16, 2020
alexec added this to the v2.10 milestone Apr 13, 2020
alexec modified the milestones: v2.10, v2.9 Apr 27, 2020
alexec linked a pull request May 5, 2020 that will close this issue
alexec (Contributor) commented May 5, 2020

I have pushed a development build of the controller that should/may reduce this problem.

argoproj/workflow-controller:feat-no-sleep

@paveq could you please try it out?

ygapon-mio commented:
@alexec I'd like to try out this build, but I don't see argoproj/workflow-controller:feat-no-sleep here

I have a workflow with about 900 items and can compare performance vs workflow-controller:v2.7.6.

alexec (Contributor) commented May 8, 2020

Can you try the prototype here: https://bit.ly/argo-wf-prototypes

ygapon-mio commented May 12, 2020

I ran my relatively large workflow with workflow-controller:feat-no-sleep and workflow-controller:v2.7.6 with different parallelism settings. The workflow contains just over 900 nodes, all small, each with just a couple of API calls.

Here is what I got (nodes in workflow: ~900):

| Controller image | Parallelism | Duration | Avg CPU |
| --- | --- | --- | --- |
| argoproj/workflow-controller:v2.7.6 | 5 | 25:17 min | 320 mcores |
| argoproj/workflow-controller:v2.7.6 | 10 | 14:41 min | 332 mcores |
| argoproj/workflow-controller:v2.7.6 | 20 | 12:26 min | 342 mcores |
| alexcollinsintuit/workflow-controller:feat-no-sleep | 5 | 24:37 min | 358 mcores |
| alexcollinsintuit/workflow-controller:feat-no-sleep | 10 | 16:06 min | 332 mcores |
| alexcollinsintuit/workflow-controller:feat-no-sleep | 20 | 13:35 min | 429 mcores |

hiu-phail commented Jun 6, 2020

Hey guys,

I just stumbled upon the same problem and did some investigation I wanted to share with you. I'm in a situation where I also have >1k items I'd like to process with fixed parallelism in Argo. Each item gets processed with a step of two pods and some retries allowed, so there is some complexity involved. I used Argo v2.8.1.

As a total non-Gopher, I started adding a lot of log messages to get a glimpse of what is actually going on in there. As far as I understood it, the design makes sense to me, but it doesn't seem to be made for this use case (yet). Anyway, here is what I saw:

The number of pods started is sometimes nowhere near the defined parallelism limit

Earlier I just looked at the CPU load and thought spawning pods and cleaning them up takes too much time, but that isn't the problem. Argo wouldn't be able to help me there anyway, so I figured the metric for success should be the number of running pods.

The CPU load on the workflow controller is high, and scaling the deployment doesn't actually help
I found some hints suggesting the workflow controller is supposed to scale horizontally, but I actually got lower performance, as the two pods generated collisions. That's what led me to the conclusion that individual requests are simply taking very long. I first thought that parsing and compressing the state might already be an issue, but with jq and gzip one can easily see that those steps complete in well under a few milliseconds, so that couldn't be it.
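For reference, here is a minimal Go sketch of that sanity check, assuming the compressed nodes field is essentially a gzip'd JSON blob; the payload is synthetic and only meant to show the order of magnitude:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
	"time"
)

// Rough sanity check: time decompression plus parsing of a sample payload,
// assuming the compressed state is a gzip'd JSON blob (synthetic data here).
func main() {
	// build a sample payload with ~900 fake node entries
	nodes := map[string]map[string]string{}
	for i := 0; i < 900; i++ {
		nodes[fmt.Sprintf("node-%d", i)] = map[string]string{"phase": "Succeeded"}
	}
	raw, _ := json.Marshal(nodes)

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(raw)
	zw.Close()

	// time decompression plus parsing, the two steps under suspicion
	start := time.Now()
	zr, _ := gzip.NewReader(&buf)
	decompressed, _ := io.ReadAll(zr)
	var parsed map[string]map[string]string
	json.Unmarshal(decompressed, &parsed)
	fmt.Printf("decompress+parse of %d nodes took %s\n", len(parsed), time.Since(start))
}
```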

The operate function takes up to 10 seconds
So obviously the workflow controller simply isn't able to react to pods stopping, because it is busy with its own logic (why does no one care for my workloads ;<). I tried to dig a little deeper and found some potential fixes to help the situation.

The controller tries to schedule workflows although it already knows that the parallelism limit is hit
I am not sure why it does that, but at least for my use case it seemed unnecessary. The place to adapt is probably https://github.com/argoproj/argo/blob/master/workflow/controller/steps.go#L238 . I patched it roughly as follows (a sketch is shown after this list):

  1. Add a flag to record that parallelism prevented further starts
  2. Break the loop after adding the generated child
  3. Use that flag when initializing the complete flag later, to make sure the workflow isn't marked complete when things couldn't be scheduled
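A minimal, self-contained Go sketch of that idea (the names `child` and `scheduleChildren` are illustrative, not the actual code in steps.go):

```go
package main

import "fmt"

// Minimal sketch of the patch described above: stop generating further
// children once the parallelism limit is reached, and remember that fact so
// the step group is not marked complete prematurely.
type child struct {
	name    string
	started bool
}

// scheduleChildren starts children until the parallelism limit is hit,
// remembering whether the limit was what stopped it.
func scheduleChildren(children []child, running, parallelism int) (started []string, limitHit bool) {
	for _, c := range children {
		if c.started {
			continue // already running or finished
		}
		if running >= parallelism {
			limitHit = true // 1. flag that parallelism prevented further starts
			break           // 2. break the loop instead of evaluating the rest
		}
		started = append(started, c.name)
		running++
	}
	return started, limitHit
}

func main() {
	children := []child{{name: "item-0"}, {name: "item-1"}, {name: "item-2"}}
	started, limitHit := scheduleChildren(children, 1, 2)
	// 3. only consider the group complete if the limit did not block anything
	complete := !limitHit && len(started) == len(children)
	fmt.Println(started, limitHit, complete)
}
```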

Some processing-intensive results are recomputed a lot
I still noticed high fluctuation in the number of pods started, so I looked a little further and found expandStep (https://github.com/argoproj/argo/blob/master/workflow/controller/steps.go#L429) to take roughly 6-10 seconds at its peak. If I understand the method correctly, it basically generates the templates to spawn nodes from. It gets redone every time (I assume), as keeping state is evil (it indeed is :S), but sadly recomputing just takes a lot of time. To check whether it would help if that didn't happen, I introduced a map at the level of the workflow controller which caches the result of that method (a rough sketch follows below). Afterwards, I could nicely see that the duration of operate calls was indeed lower, peaking at 7 seconds in the beginning and taking 1 to 3 seconds in subsequent calls. The number of pods was still not ideal, but I got an average of roughly 80 running pods (with a limit of 90), with the minimum around 65 so far; earlier it could drop down to 20 or less. The mean CPU usage of the workflow controller is still at 60%, where earlier it was >100%.
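A rough sketch of what such a cache could look like (names like `expandCache` and `expensiveExpand` are made up, not Argo APIs):

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative sketch of the caching idea: memoise the expensive expansion of
// a looped step so repeated operate() passes over the same workflow do not
// recompute identical templates.
type expandKey struct {
	workflow string
	stepName string
}

type expandCache struct {
	mu    sync.Mutex
	items map[expandKey][]string
}

// expand returns the cached expansion for k, computing it only once.
func (c *expandCache) expand(k expandKey, compute func() []string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.items[k]; ok {
		return v // served from the cache on subsequent operate() calls
	}
	v := compute()
	c.items[k] = v
	return v
}

func main() {
	cache := &expandCache{items: map[expandKey][]string{}}
	expensiveExpand := func() []string {
		// stands in for the real template expansion measured at 6-10 seconds
		return []string{"child-0", "child-1"}
	}
	k := expandKey{workflow: "wf-1", stepName: "process-items"}
	first := cache.expand(k, expensiveExpand)
	second := cache.expand(k, expensiveExpand) // cache hit, no recomputation
	fmt.Println(first, second)
}
```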

The web UI isn't helpful at that scale anymore
That's actually a pity: showing all these nodes in the web UI doesn't really work out anymore. Maybe it would be possible to add some aggregated view when withParam or withList results are too large, basically showing the percentage of finished jobs instead?

From all the points above, I believe Argo could be doing better here. My little hacks to make it perform certainly aren't ideal, but maybe something like this could find its way into a release v2.8.2 eventually? I'd be happy about that =).

Update: While that helped for a while, skipping over all the already-finished tasks now seems to be wasting time. Maybe some sort of "low water mark" would be good?

Update 2: The low water mark, i.e. the index up to which the children have already been processed, helped keep the mean CPU load below 50% =).

Update 3: Don't forget to keep updating the nodeSteps for skipped nodes due to the low water mark :S. Otherwise you get a hard-to-debug problem with an inconsistent workflow state (node pending while the workflow has already failed...).
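For completeness, a minimal sketch of the low-water-mark idea; again, the types are illustrative rather than the real controller code:

```go
package main

import "fmt"

// Illustrative low-water-mark sketch. The caveat from Update 3 still applies:
// nodes skipped via the low water mark must still have their status
// propagated, or the workflow state becomes inconsistent.
type stepGroup struct {
	children []string
	done     []bool
	lowWater int // every child below this index is known to be finished
}

// pendingChildren advances the low water mark past the contiguous prefix of
// finished children and only scans the remainder.
func (g *stepGroup) pendingChildren() []string {
	for g.lowWater < len(g.children) && g.done[g.lowWater] {
		g.lowWater++
	}
	var pending []string
	for i := g.lowWater; i < len(g.children); i++ {
		if !g.done[i] {
			pending = append(pending, g.children[i])
		}
	}
	return pending
}

func main() {
	g := &stepGroup{
		children: []string{"c0", "c1", "c2", "c3"},
		done:     []bool{true, true, false, false},
	}
	fmt.Println(g.pendingChildren(), g.lowWater) // [c2 c3] 2
}
```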

alexec linked a pull request Jun 9, 2020 that will close this issue
hiu-phail commented:

Hey @alexec, I'd love to get rid of my own patches on Argo and test the impact of the two flags on my workflow =). Could you provide some guidance on how I should tune the parameters for that use case?
