
Add a test to catch regressions in cli run time #32612

Merged · 2 commits · Jul 15, 2023

Conversation

@o-nikolas (Contributor) commented Jul 14, 2023

A lot of work has been done to optimize the load time of the Executor modules so that they can be included in the CLI code path (so that executors can vend CLI commands, part of AIP-51, see this PR). It has come up several times that we should add a test asserting the CLI run time does not regress. While working on this code recently I found one such regression (OTEL related, not yet present in main, which I will fix in another PR), so I was motivated to add one. I played around with several options, but none captured the full extent of any delays/regressions except actually calling the airflow CLI in a subprocess, which is what this unit test does.

Timing-related tests are often troublesome because they can be flaky: if, for example, they are run on an environment that is bogged down or slow, they may breach their assertions. But if we give a large enough (yet still reasonable) threshold, I think we'll get more benefit than harm. I'm curious to hear what others think, though.

CC: @potiuk

@o-nikolas requested review from potiuk and ferruzzi July 14, 2023 16:42
Comment on lines +278 to +281
threshold = 3.5
timing_result = timeit.timeit(stmt=timing_code, number=num_samples, setup=setup_code) / num_samples
# Average run time of the Airflow CLI should stay under 3.5s
assert timing_result < threshold
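
The quoted lines omit the test's setup code; a self-contained sketch of the same idea (the command and sample count here are assumptions, not the exact test code) would look like:

# Self-contained sketch of the timing test idea; the command and sample count
# are assumptions, not the exact code from this PR.
import subprocess
import timeit

num_samples = 3
threshold = 3.5  # seconds; generous to tolerate slow or loaded CI environments

def run_cli_help():
    # Spawning a fresh process captures the full cost of importing Airflow.
    subprocess.run(["airflow", "--help"], check=True, capture_output=True)

timing_result = timeit.timeit(run_cli_help, number=num_samples) / num_samples
assert timing_result < threshold, f"Average CLI run time {timing_result:.2f}s exceeded {threshold}s"
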
@uranusjr (Member)

Each airflow --help currently takes more than three seconds? That’s bad.

@o-nikolas (Contributor Author)

Each command must complete in less than 3.5 seconds. On my machine it usually takes around 2-2.5 seconds, in rare cases spiking up to 3 if your CPU is under load from something else, etc.

But yeah, the CLI module lives in the same code base as everything else, so to run it, all of Airflow must be loaded (including the heavy hitters like conf/settings, etc.).
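
A quick way to see where that load time goes (a rough sketch, not part of this PR; run in a fresh interpreter where airflow has not yet been imported):

# Rough sketch (not part of this PR): measure how much of the CLI latency is
# just "import airflow". For a per-module breakdown, CPython's -X importtime
# flag (python -X importtime -c "import airflow") prints an import timing tree.
import importlib
import time

start = time.monotonic()
importlib.import_module("airflow")
print(f"import airflow: {time.monotonic() - start:.2f}s")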

@potiuk (Member) Jul 14, 2023

We should have different timings for different commands IMHO. And I agree with @uranusjr that if basic commands take > 3 seconds, this is wrong. Some of them ("config", which I worked on recently in #32604) will need to initialize providers once that PR is merged, and I deliberately introduced lazy-loading of the providers manager for config in my change so it only runs when actually needed. But maybe this is an indication that we still need to improve some of that. @uranusjr has done a lot of lazy-loading work to make the time smaller for the CLIs and some initialization, but 3 seconds for "most" commands seems really off.
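
A minimal sketch of the lazy-loading pattern being described (hypothetical command function, not the actual change from #32604):

# Minimal sketch of the lazy-loading pattern (hypothetical command function, not
# the actual #32604 change): heavy imports happen inside the command that needs
# them, so merely loading the CLI module (e.g. for --help) stays cheap.
def config_command(args):
    from airflow.providers_manager import ProvidersManager  # imported lazily

    providers_manager = ProvidersManager()  # initialized only when this command runs
    ...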

@o-nikolas (Contributor Author) Jul 14, 2023

@potiuk @uranusjr I agree with all of this, but we can also agree that it is out of scope for this PR, right? I just want to establish some kind of threshold now in the most basic sense (i.e. just a simple --help), and we can do much more follow-up work to lower the threshold once we make further optimizations.

Contributor

A different threshold for each command sounds great, but it also sounds like a lot of work and way more chances to break something, for what feels like marginal benefit to me. Either way, I'd propose we get the test implemented, then work on improvements and adjust the test thresholds (hopefully down/tighter) as improvements are implemented, right?

@potiuk (Member)

This is a matter of risk: how likely is it that this test will break things for people running regular PRs on public runners?

We do not know. For now, we know that it seems to work on self-hosted runners. Let's try a few times on public runners. I will add the label and close/reopen to check.

@potiuk (Member) commented Jul 14, 2023

Yeah. I think it's a good idea, and we already have some of that in a different form - not necessarily timing, but definitely targeted at catching ballooning time of certain operations.

We have a dedicated "perf_kit" util package for that, and one part of it is, for example, counting the number of queries to the DB, which is another metric of "performance" (I believe those tests helped, or at least kept in check, some of the initial problems we had when adding core features that caused accidental changes in the number of queries to the DB).
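
A rough sketch of that kind of query-count guard (hypothetical helper, not necessarily perf_kit's actual API):

# Rough sketch of a query-count guard (hypothetical helper, not necessarily
# perf_kit's actual API): count SQLAlchemy statements issued inside a block so a
# test can assert the number does not creep up.
from contextlib import contextmanager
from sqlalchemy import event

@contextmanager
def count_queries(engine):
    counter = {"queries": 0}

    def _before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        counter["queries"] += 1

    event.listen(engine, "before_cursor_execute", _before_cursor_execute)
    try:
        yield counter
    finally:
        event.remove(engine, "before_cursor_execute", _before_cursor_execute)

# Usage in a test (run_some_core_code is a placeholder):
#   with count_queries(session.get_bind()) as counter:
#       run_some_core_code()
#   assert counter["queries"] <= 10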

Those are better, because they do not depend on the environment. But we have a few purely time-based ones:

https://github.com/apache/airflow/blob/main/tests/core/test_core.py#L82

And the problem there is that we have to put in high numbers to account for a "slow" environment - a public runner having other tests running in parallel, which speeds up overall test time a lot but introduces high variability in the test timings. This is likely why you have 3 seconds there (answering @uranusjr's comment).

The tricky part is that our tests are executed in various "environments" - i.e. they can often be run in parallel with other tests in order to speed up the execution - and it might get very tricky to figure out whether the slowness is just environmental or a real regression.

But that discussion actually gave me an interesting idea. Maybe we could extract all "timing" tests into a separate test type and always execute it "alone", only with sqlite - less CPU utilization, but more stability of the environment.

We could easily add a separate job for those kinds of tests and exclude them from the regular tests - I think we should do it this way.
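
One possible shape for that separation (hypothetical marker name; the project's actual test-type selection mechanism may differ):

# One possible shape for that separation (hypothetical marker name; the project's
# actual test-type selection mechanism may differ): mark timing-sensitive tests
# and run them in a dedicated, non-parallel CI job.
import pytest

@pytest.mark.timing  # marker would be registered in the pytest configuration
def test_cli_run_time_does_not_regress():
    ...

# Dedicated job:     pytest -m timing
# Regular test jobs: pytest -m "not timing"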

@o-nikolas (Contributor Author)

The tricky part is that our tests are executed in various "environments" - i.e. they can often be run in parallel with other tests in order to speed up the execution - and it might get very tricky to figure out whether the slowness is just environmental or a real regression.

But that discussion actually gave me an interesting idea. Maybe we could extract all "timing" tests into a separate test type and always execute it "alone", only with sqlite - less CPU utilization, but more stability of the environment.

We could easily add a separate job for those kinds of tests and exclude them from the regular tests - I think we should do it this way.

Yupp, agreed, I think the limit now should be good enough for most cases. For example, the regression I mentioned in the description more than doubled the CLI run time (up to about 8 seconds), so the test will surely catch large regressions like that, which is still helpful. But creating a new test suite as a follow-up could be a great optimization and would allow us to lower the threshold a bit.

@ferruzzi (Contributor)

I am almost certainly overthinking this, but if we're considering future work... what about, instead of setting a hardcoded time limit, we store the previous durations somewhere and calculate an average and max based on them? Maybe run that suite against main on a schedule instead of in every CI run, and assert that the timing is 1. not more than a fixed threshold and 2. not more than some percentage above the average successful duration?
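
A minimal sketch of that rolling-baseline idea (hypothetical storage location and limits, just to illustrate the two assertions):

# Minimal sketch of the rolling-baseline idea (hypothetical storage location and
# limits): assert against both a hard ceiling and the historical average.
import json
import statistics
from pathlib import Path

HISTORY_FILE = Path("cli_timing_history.json")  # assumed location for stored durations
HARD_LIMIT_S = 3.5       # absolute ceiling, as in the current test
MAX_REGRESSION = 1.25    # allow up to 25% above the average successful duration

def assert_cli_timing(current_duration: float) -> None:
    history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else []
    assert current_duration < HARD_LIMIT_S
    if history:
        assert current_duration < statistics.mean(history) * MAX_REGRESSION
    # Record the successful run so the baseline tracks the real trend.
    HISTORY_FILE.write_text(json.dumps((history + [current_duration])[-50:]))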

@ferruzzi (Contributor) left a comment

I think we can all agree that this is a solid first step. There is some good discussion going on, but whatever "next steps" come from that, I don't think they need to block this now.

@potiuk added the "use public runners" label (makes sure that public runners are used even if committers create the PR - useful for testing) Jul 14, 2023
@potiuk closed this Jul 14, 2023
@potiuk reopened this Jul 14, 2023
@potiuk (Member) commented Jul 14, 2023

Closed/reopened to try it on public runners (possibly we should repeat this a few times).

@potiuk (Member) commented Jul 14, 2023

BTW, this is a constant risk - an influx of tests that are flaky. I personally think this is difficult to handle in general, but it's a different story when we KNOW that a test can be flaky... in such a case we should make every effort to test it in various configurations.

And we have tools for that:

  • close/reopen the PR a few times to check
  • apply "use public runners" and repeat it

Those are the tools all committers have to "try" a risky PR and see whether a change with a possible side effect is "sound enough" that it does not introduce a problem.

@o-nikolas (Contributor Author)

BTW, this is a constant risk - an influx of tests that are flaky. I personally think this is difficult to handle in general, but it's a different story when we KNOW that a test can be flaky... in such a case we should make every effort to test it in various configurations.

And we have tools for that:

* close/reopen the PR a few times to check

* apply "use public runners" and repeat it

Those are the tools all committers have to "try" a risky PR and see whether a change with a possible side effect is "sound enough" that it does not introduce a problem.

That's a great idea to test on public runners; I'll keep that in my back pocket for future changes. I'll close/reopen to run the tests a few times on public runners. So far the test is running quite well and is not even close to the slowest in the CLI test suite 😄

[Screenshot from 2023-07-14 15-45-04: CLI test suite timings]

@o-nikolas (Contributor Author)

Triggering another round of tests; the last set was green.

@ferruzzi (Contributor)

That test was green as well, just re-triggered all.

@potiuk (Member) commented Jul 15, 2023

Issue created: #32620

@potiuk (Member) commented Jul 15, 2023

And called for someone to implement it :) . Time for me to become less of a SPOF for the whole CI :)

Labels: area:CLI, use public runners