Add a test to catch regressions in cli run time #32612
Conversation
threshold = 3.5
timing_result = timeit.timeit(stmt=timing_code, number=num_samples, setup=setup_code) / num_samples
# Average run time of the Airflow CLI should be within 3.5s
assert timing_result < threshold
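The diff only shows the assertion itself; for context, here is a minimal sketch of what the surrounding setup_code / timing_code / num_samples could look like. These definitions are assumptions for illustration, not the PR's actual code, but they show the idea of timing a full subprocess invocation so that interpreter startup and all imports are counted.

```python
# Assumed context for the asserted snippet above; not the PR's actual definitions.
num_samples = 3  # average over a few runs to smooth out noise
setup_code = "import subprocess"
# Running the CLI in a subprocess captures interpreter startup plus all imports.
timing_code = "subprocess.run(['airflow', '--help'], check=True, capture_output=True)"
```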
Each airflow --help currently takes more than three seconds? That's bad.
Each command must complete in less than 3.5 seconds. On my machine it usually takes 2-2.5 seconds, in rare cases spiking up to 3 if your CPU is under load for something, etc.
But yeah, the CLI module lives in the same code base as everything else, so to run it all of Airflow must be loaded (including the heavy hitters of conf/settings, etc.).
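For anyone digging into which of those heavy imports dominate startup, CPython's built-in import profiler (`python -X importtime`) is one way to break the time down. The helper below is just an illustrative way to summarise its stderr output, not part of this PR.

```python
# First: python -X importtime -c "import airflow" 2> import_times.txt
# Each line looks like: "import time:  self [us] | cumulative | imported package".
# Illustrative helper to list the slowest cumulative imports from that output.
import re

def slowest_imports(path: str, top: int = 10) -> list[tuple[int, str]]:
    rows = []
    with open(path) as f:
        for line in f:
            m = re.match(r"import time:\s+(\d+)\s+\|\s+(\d+)\s+\|\s+(.*)", line)
            if m:
                rows.append((int(m.group(2)), m.group(3).strip()))
    return sorted(rows, reverse=True)[:top]

if __name__ == "__main__":
    for cumulative_us, module in slowest_imports("import_times.txt"):
        print(f"{cumulative_us / 1e6:.3f}s  {module}")
```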
We should have different timings for different commands IMHO. And I agree with @uranusjr that if basic commands take > 3 seconds, this is wrong. Some of them ("config", which I worked on recently in #32604) will need to initialize providers once that is merged, and I deliberately introduced lazy-loading of the providers-manager for config in my change so that it only runs when actually needed. But maybe this is an indication that we still need to improve some of that. @uranusjr has done a lot of those lazy-loading things to make the time smaller for CLIs and some initialization, but 3 seconds for "most" commands seems really off.
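To make the lazy-loading idea concrete, here is a hypothetical sketch of the pattern (not the actual change from #32604): defer the expensive import into the command function, so commands that don't need it, including --help, never pay for it.

```python
# Hypothetical sketch of the lazy-loading pattern; not the code from #32604.
def config_command(args):
    # Imported inside the command instead of at module level, so the
    # ProvidersManager is only initialized when this command actually runs.
    from airflow.providers_manager import ProvidersManager

    providers_manager = ProvidersManager()
    # ... use providers_manager to look up provider-supplied config here ...
```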
@potiuk @uranusjr I agree with all of this, but we can also agree that it is out of scope for this PR, right? I just want to establish some kind of threshold now in the most basic sense (i.e. just a simple --help), and we can do follow-up work to lower the threshold once we make more optimizations.
A different threshold for each command sounds great, but it also sounds like a lot of work and way more chances to break something, for what feels like marginal benefit to me. Either way, I'd propose we get the test implemented, then work on improvements and adjust the test thresholds (hopefully down/tighter) as those improvements land, right?
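If per-command thresholds ever become worthwhile, a parametrized test is probably the lowest-effort shape for it. The commands and limits below are purely illustrative, not measured values.

```python
# Illustrative sketch of per-command thresholds; commands and limits are made up.
import subprocess
import time

import pytest

@pytest.mark.parametrize(
    "command, threshold_s",
    [
        (["airflow", "--help"], 3.5),
        (["airflow", "version"], 3.5),
        (["airflow", "config", "--help"], 3.5),
    ],
)
def test_cli_command_runtime(command, threshold_s):
    start = time.monotonic()
    subprocess.run(command, check=True, capture_output=True)
    elapsed = time.monotonic() - start
    assert elapsed < threshold_s, f"{' '.join(command)} took {elapsed:.2f}s (limit {threshold_s}s)"
```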
This is a matter of risk: how likely is it that this test will break things for people running regular PRs on public runners?
We do not know. We - for now - know that it seems to work on self-hosted. Let's try a few times on public runners. I will add the label and close/reopen to check.
Yeah. I think it's a good idea, and we already have some of that in a different form - not necessarily timing, but definitely targeted at ballooning time of certain operations. We have a dedicated "perf_kit" util package for that, and one part of it is, for example, counting the number of queries to the DB, which is another metric of "performance" (I believe those tests helped, or at least kept in check, some of the initial problems we had when adding core features caused accidental changes in the number of queries to the DB).
Those are better, because they do not depend on the environment. But we have a few pure time-based ones: https://github.com/apache/airflow/blob/main/tests/core/test_core.py#L82 And the problem there is that we have to put in high numbers to account for a "slow" environment - a public runner having other tests running in parallel - which speeds up overall test time a lot but introduces high variability in the timings. This is likely why you have 3 seconds there (answering @uranusjr's comment). The tricky part is that our tests are executed in various "environments" - i.e. they can often be run in parallel with other tests in order to speed up execution - and it might get very tricky to figure out whether the slowness is just environmental or a real regression. But that discussion actually gave me an interesting idea. Maybe we could extract all "timing" tests into a separate test type and always execute it "alone", only with sqlite - less CPU utilization but more stability of the environment. We could add a separate job for those kinds of tests and exclude them from regular tests easily - I think we should do it this way.
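Going back to the query-counting part of that comment, here is a generic sketch of that technique using SQLAlchemy's event hooks. This is not the actual perf_kit code, just an illustration of why such checks are environment-independent.

```python
# Generic sketch of the query-counting technique; not the perf_kit implementation.
from contextlib import contextmanager

from sqlalchemy import event

@contextmanager
def assert_max_queries(engine, limit):
    count = 0

    def _count(conn, cursor, statement, parameters, context, executemany):
        nonlocal count
        count += 1

    event.listen(engine, "before_cursor_execute", _count)
    try:
        yield
    finally:
        event.remove(engine, "before_cursor_execute", _count)
    assert count <= limit, f"expected at most {limit} queries, got {count}"
```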
Yupp, agreed, I think the limit now should be good enough for most cases. For example, the regression I mentioned in the description more than doubled the CLI run time (up to about 8 seconds). So it will surely catch large regressions like that, which is still helpful. But creating a new test suite as a follow-up could be a great optimization and would allow us to lower the threshold a bit.
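On the separate-suite idea, one lightweight shape for it could be a custom pytest marker. The marker name and layout below are an assumption of how it might look, not an existing convention in the repo.

```python
# conftest.py -- register a marker for time-sensitive tests (illustrative name).
def pytest_configure(config):
    config.addinivalue_line("markers", "timing: time-sensitive tests run in an isolated job")


# in a test module -- tag the timing tests.
import pytest

@pytest.mark.timing
def test_cli_help_runtime():
    ...
```

The isolated CI job would then run `pytest -m timing` alone against sqlite, while the regular jobs run `pytest -m "not timing"`.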
I am almost certainly overthinking this, but if we're considering Future Work... what about, instead of setting a hardcoded time limit, we store the previous durations somewhere and calculate an average and max based on that? Maybe run that suite against
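For what it's worth, here is a rough sketch of that "compare against recorded history" idea. The JSON file, its location, and the tolerance factor are all assumptions for illustration.

```python
# Rough sketch of comparing a measured duration against stored history; the file
# name ("cli_timings.json") and the 1.5x tolerance are assumptions.
import json
import statistics

def check_against_history(command: str, duration_s: float,
                          history_path: str = "cli_timings.json",
                          tolerance: float = 1.5) -> None:
    with open(history_path) as f:
        history = json.load(f)  # e.g. {"airflow --help": [2.1, 2.3, 2.2], ...}
    past = history.get(command, [])
    if not past:
        return  # no baseline recorded yet; nothing to compare against
    baseline = statistics.mean(past)
    assert duration_s <= baseline * tolerance, (
        f"{command} took {duration_s:.2f}s, more than {tolerance}x the historical mean of {baseline:.2f}s"
    )
```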
I think we can all agree that this is a solid first step. There is some good discussion going on, but whatever "next steps" come from that, I don't think they need to block this now.
Closed/reopened to try it on public runners (possibly we should repeat a few times).
BTW. This is a constant risk - an influx of tests that are flaky. I personally think this is something that is difficult to handle, but it's a different story when we KNOW that a test can be flaky... in such a case we should make every effort to test it in various configurations. And we have tools for that:
Those are the tools all committers have to "try" a risky PR and see whether a change that has a possible side effect is "sound enough", so that it does not introduce a problem.
That's a great idea to test on public runners; I'll keep that in my back pocket for future changes. I'll close/reopen the tests a few times on public runners. So far the test is running quite well and is not even close to the slowest in the CLI test suite 😄
Triggering another round of tests, the last set was green.
That test was green as well, just re-triggered all.
Issue created: #32620
And called for someone to implement it :). Time for me to become less of a SPOF for the whole CI :)
A lot of work has been done to optimize the load time of the Executor modules so that they can be included in the CLI code path (so that executors can vend CLI commands, part of AIP-51, see this PR). It has come up several times that we should add a test to assert the CLI runtime does not regress, and while working on this code recently I found one such regression (OTEL related, not yet present in main, which I will fix in another PR), so I was motivated to add one. I played around with several options, but none captured the full extent of any delays/regressions beyond actually calling the airflow CLI in a subprocess call, which is what this unit test does.
Timing-related tests are often troublesome because they can be flaky: if, for example, they are run on an env that is bogged down/slow, they may breach their assertions. But if we give a large enough (yet still reasonable) threshold, I think we'll get more benefit than harm. But I'm curious to hear what others think.
CC: @potiuk