Unexpected number of Get workflowtemplates API calls #11134
Comments
Sounds like a bug given the high number of API calls. Updated to bug so we can track and eventually diagnose the issue.
I am also facing this issue; it actually causes the UI to time out on workflow submission, so I can only trigger workflows from the CLI. Any news on this issue?
Please consider this bug a high-severity one. This phenomenon is probably stealing Kubernetes API QPS quota from the workflow controller in such a way that cron workflows are sometimes not launched. In any system, reliability problems in the cron daemon are catastrophic; the cron daemon needs to be as reliable as the sun rising in the morning.
Could you test the tag?
@terrytangyuan I'll test as well. Can I simply change the
I think so
The logs don't show much in my case; when I trigger the CronWorkflow, it times out and returns a 504 after 60s. So not much to go on.
What CronWorkflow are you referring to? Perhaps create a separate issue to discuss? This issue is for workflowtemplates API calls.
Hi @terrytangyuan, yes, sorry for the confusion. In our case we think this flooding could be related to missed workflow launches from the CronWorkflow controller, but it may not be. Just forget about it; as you stated, this is not the correct issue to discuss it.
Yes, it in fact disappeared completely. A change in logging level? logs-from-workflow-controller-in-workflow-controller-84cdcd9f75-ct8sc (2).log
What disappeared? Could you be more specific?
Hi @terrytangyuan,
As far as I can check in the changes for this image here, the logging level for that call did not change. This, together with the fact that this image tag
Kind regards.
Is everything else working as expected now? Do you see other issues?
No logging changes. Those calls may not be necessary.
Sent a PR for this: #11343. Please continue monitoring it and let us know if you see additional issues.
I have not managed to find anything weird or any errors so far after deploying that image, but do not take this as a definitive test, I guess. Thanks @terrytangyuan :)
@Guillermogsjc I believe I might be seeing the same issue as you, as far as the CronWorkflow not firing the Workflow and also a lot of calls to "GET WorkflowTemplate". But I'm actually not seeing how this fix resolves it. I see that the CronWorkflow makes that call as part of its validation logic here, but it appears to not even use the WorkflowTemplate Informer that Terry changed.
Hi @juliev0, our issue with CronWorkflow launching was assisted by @tico24 here on Slack.

At a cron where we have 200 concurrent launches, we observed that some of them were missed on launch (nothing indicating a launch failure in the logs). This only happened from the cron launch; if we manually launched 200 parallel ones, it worked OK, so we were guessing that the workflow-controller pod was being throttled by the K8s API and the workflows were missed without proper logging about it. In our guess, this throttling could be produced in part by the unnecessary and massive "Get Workflowtemplates" calls that @terrytangyuan just corrected.

After Tim's advice to follow the scaling good practices and here, everything worked nicely again, and we have seen no more missed launches from CronWorkflows so far. Workflows now launch almost 14 minutes after the cron time (mostly because the rate-creation limits and requeue times increased), but every CronWorkflow is triggered properly.

This, of course, does not mean that the "trigger missed from CronWorkflow" issue is totally resolved or has any relation to this issue. It would be nice to keep an eye on it if you have reproduced this missed-launch behaviour, as missing CronWorkflow launches is a critical situation. If I see this behaviour again, I will open a new issue with logs and information about it.
@Guillermogsjc Thanks for the response. I think the change put in by Terry only appeared to resolve this. I'm not sure which version you were running before, but it seems another change was put in recently to remove logging of all K8s API responses; at least when I ran with a version of master from early June, I no longer saw all of those API response logs like before. I think we need to use the WorkflowTemplate informer (cache, like you said) when the CronWorkflow needs to validate the WorkflowTemplate, instead of making the GET calls. I will open up an enhancement issue for that and, if I have time, work on it.
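For illustration only, here is roughly what I mean by reading WorkflowTemplates from an informer cache instead of issuing GET calls. This is a generic client-go sketch with assumed wiring (the kubeconfig loading, the `argo` namespace, and the `my-template` name are placeholders), not the controller's actual code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative wiring only: build a dynamic client from the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dynClient, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// A shared informer keeps WorkflowTemplates in a local cache,
	// maintained by a single WATCH instead of repeated GET calls.
	factory := dynamicinformer.NewDynamicSharedInformerFactory(dynClient, 10*time.Minute)
	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflowtemplates"}
	lister := factory.ForResource(gvr).Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Validation code would read from the cache instead of calling the API server.
	tmpl, err := lister.ByNamespace("argo").Get("my-template") // hypothetical namespace/name
	if err != nil {
		fmt.Println("not found in cache:", err)
		return
	}
	fmt.Println("found in cache:", tmpl.GetObjectKind().GroupVersionKind())
}
```

The point is that a single WATCH keeps the local cache current, so per-CronWorkflow validation lookups become in-memory reads.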
Also, I see this in your original log. I also see this line:
This is your Controller basically failing the health check here, because there are Workflows in the system that have never been seen by the Controller even once. Maybe the throttling (and maybe the retries caused by throttling) is somehow blocking goroutines from running to process these Workflows?
Added #11372
You are right, this commit might have slipped into the image I built: c6862e9. Could you perhaps change the log level to DEBUG and try again? Apologies for the confusion.
Hi @terrytangyuan, I tried again with debug on and the image. In the attached example there is a massive burst of Get WorkflowTemplate calls at "2023-07-18T10:14", where 93 CronWorkflows have to be launched. I guess that #11372 will greatly reduce the K8s API burden of the CronWorkflow controller's launching.

Kubernetes v1.27 (so affected by the QPS and burst bump). The workflow controller is launched as:

- workflow-controller
- '--qps=50'
- '--burst=100'
- '--workflow-ttl-workers=8'
- '--workflow-workers=100'
- '--loglevel=debug'

All 93 workflows were launched happily 5 minutes later, at 10:19 (k8s v1.27 rocks here).

logs-from-workflow-controller-in-workflow-controller-76ffd854c4-fk8lt (1).log
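For anyone else tuning these flags: as far as I understand, --qps and --burst end up as client-side rate-limiter settings on the controller's Kubernetes client. A minimal sketch of that mapping in plain client-go (illustrative only, not the controller's actual wiring; the namespace is a placeholder):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a rest.Config from the local kubeconfig (illustrative only).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// These two fields control client-side throttling: QPS is the sustained
	// request rate, Burst is the short-term allowance. Values mirror the
	// flags above (--qps=50 --burst=100).
	cfg.QPS = 50
	cfg.Burst = 100

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Requests beyond QPS/Burst are queued client-side before they ever
	// reach the API server ("Waited for ... due to client-side throttling").
	pods, err := client.CoreV1().Pods("argo").List(context.TODO(), metav1.ListOptions{Limit: 1})
	if err != nil {
		panic(err)
	}
	fmt.Println("listed", len(pods.Items), "pod(s)")
}
```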
Hi all,

Just wondering about one thing regarding the call from workflow-controller that gets workflow templates. I'll give the data first and then state the question/proposal:

gives 35 alive templates (I do not have ClusterWorkflowTemplates). Some of those 35 templates are nested, as some are DAGs calling other templates in several tasks that also have DAGs inside, with a maximum nesting depth of 2.
When I get the logs from workflow-controller, counting the Get workflowtemplates entries gives 31229 calls.

So, in about 24 minutes, there have been ~31K calls to the K8s API to get workflowtemplates (roughly 22 calls per second), in a deployment that has 35 workflow templates alive. That looks like too much for a K8s object that is typically updated over much longer periods (depending on the use case), and certainly like too many calls for 35 templates that are static throughout those 24 minutes (in my case).
Statement (in case my guesses are not wrong):

Would it be possible to cache those calls inside workflow-controller, in a goroutine that fetches the templates and keeps them in shared variables, so that the other goroutines consult this shared workflow template information instead of punishing the K8s API so much? (A rough sketch of the idea is at the end of this issue.)
A flag like `--workflow-template-cache 0s` for workflow-controller would be very nice; by having it, each deployment could decide how much of its limited K8s API call budget is spent on refreshing workflow templates.

Here are the logs -> logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log
Here is the workflow-controller configmap
Here is the workflow-controller deployment patch
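To make the proposal more concrete, here is a rough sketch of the kind of periodic cache I mean. All names, the 30-second refresh interval, and the `argo` namespace are hypothetical; this is not a patch against the real controller, which would more likely reuse its existing informers:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// templateCache holds the last fetched WorkflowTemplates, shared between goroutines.
type templateCache struct {
	mu        sync.RWMutex
	templates map[string]*unstructured.Unstructured
}

// Get serves reads from memory, so callers never hit the API server directly.
func (c *templateCache) Get(name string) (*unstructured.Unstructured, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	t, ok := c.templates[name]
	return t, ok
}

// refreshLoop re-lists the templates every interval (the hypothetical
// --workflow-template-cache value) and swaps the shared map.
func (c *templateCache) refreshLoop(ctx context.Context, client dynamic.Interface, namespace string, interval time.Duration) {
	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflowtemplates"}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		list, err := client.Resource(gvr).Namespace(namespace).List(ctx, metav1.ListOptions{})
		if err == nil {
			fresh := make(map[string]*unstructured.Unstructured, len(list.Items))
			for i := range list.Items {
				item := list.Items[i]
				fresh[item.GetName()] = &item
			}
			c.mu.Lock()
			c.templates = fresh
			c.mu.Unlock()
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	cache := &templateCache{templates: map[string]*unstructured.Unstructured{}}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go cache.refreshLoop(ctx, client, "argo", 30*time.Second)

	// Elsewhere in the process, readers just consult the cache.
	time.Sleep(2 * time.Second) // give the first refresh a chance to run
	if _, ok := cache.Get("my-template"); ok {
		fmt.Println("my-template found in cache")
	} else {
		fmt.Println("my-template not cached (yet)")
	}
}
```

With something like this, the number of workflowtemplates requests per refresh period is bounded by the number of list calls, not by the number of workflows being launched.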