Add a test to catch regressions in cli run time #32612
Conversation
threshold = 3.5
timing_result = timeit.timeit(stmt=timing_code, number=num_samples, setup=setup_code) / num_samples
# Average run time of the Airflow CLI should be within 3.5s
assert timing_result < threshold
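The diff only shows the assertion itself; for context, here is a minimal sketch of what the surrounding setup_code / timing_code / num_samples could look like. These definitions are assumptions for illustration, not the PR's actual code, but they show the idea of timing a full subprocess invocation so that interpreter startup and all imports are counted.

```python
# Assumed context for the asserted snippet above; not the PR's actual definitions.
num_samples = 3  # average over a few runs to smooth out noise
setup_code = "import subprocess"
# Running the CLI in a subprocess captures interpreter startup plus all imports.
timing_code = "subprocess.run(['airflow', '--help'], check=True, capture_output=True)"
```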
Each airflow --help currently takes more than three seconds? That's bad.
Each command must complete in less than 3.5 seconds. On my machine it usually takes 2-2.5 seconds, in rare cases spiking up to 3 if your CPU is under load for something, etc.
But yeah, the CLI module lives in the same code base as everything else, so to run it all of Airflow must be loaded (including the heavy hitters of conf/settings, etc.).
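For anyone digging into which of those heavy imports dominate startup, CPython's built-in import profiler (`python -X importtime`) is one way to break the time down. The helper below is just an illustrative way to summarise its stderr output, not part of this PR.

```python
# First: python -X importtime -c "import airflow" 2> import_times.txt
# Each line looks like: "import time:  self [us] | cumulative | imported package".
# Illustrative helper to list the slowest cumulative imports from that output.
import re

def slowest_imports(path: str, top: int = 10) -> list[tuple[int, str]]:
    rows = []
    with open(path) as f:
        for line in f:
            m = re.match(r"import time:\s+(\d+)\s+\|\s+(\d+)\s+\|\s+(.*)", line)
            if m:
                rows.append((int(m.group(2)), m.group(3).strip()))
    return sorted(rows, reverse=True)[:top]

if __name__ == "__main__":
    for cumulative_us, module in slowest_imports("import_times.txt"):
        print(f"{cumulative_us / 1e6:.3f}s  {module}")
```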
We should have different timings for different commands IMHO. And I agree with @uranusjr that if basic commands take > 3 seconds, this is wrong. Some of them ("config", which I worked on recently in #32604) will need to initialize providers once that is merged, and I deliberately introduced lazy-loading of the providers-manager for config in my change so that it only runs when actually needed. But maybe this is an indication that we still need to improve some of that. @uranusjr has done a lot of those lazy-loading things to make the time smaller for CLIs and some initialization, but 3 seconds for "most" commands seems really off.
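To make the lazy-loading idea concrete, here is a hypothetical sketch of the pattern (not the actual change from #32604): defer the expensive import into the command function, so commands that don't need it, including --help, never pay for it.

```python
# Hypothetical sketch of the lazy-loading pattern; not the code from #32604.
def config_command(args):
    # Imported inside the command instead of at module level, so the
    # ProvidersManager is only initialized when this command actually runs.
    from airflow.providers_manager import ProvidersManager

    providers_manager = ProvidersManager()
    # ... use providers_manager to look up provider-supplied config here ...
```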
@potiuk @uranusjr I agree with all of this, but we can also agree that it is out of scope for this PR, right? I just want to establish some kind of threshold now in the most basic sense (i.e. just a simple --help), and we can do follow-up work to lower the threshold once we make more optimizations.
A different threshold for each command sounds great, but it also sounds like a lot of work and way more chances to break something, for what feels like marginal benefit to me. Either way, I'd propose we get the test implemented, then work on improvements and adjust the test thresholds (hopefully down/tighter) as those improvements land, right?
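If per-command thresholds ever become worthwhile, a parametrized test is probably the lowest-effort shape for it. The commands and limits below are purely illustrative, not measured values.

```python
# Illustrative sketch of per-command thresholds; commands and limits are made up.
import subprocess
import time

import pytest

@pytest.mark.parametrize(
    "command, threshold_s",
    [
        (["airflow", "--help"], 3.5),
        (["airflow", "version"], 3.5),
        (["airflow", "config", "--help"], 3.5),
    ],
)
def test_cli_command_runtime(command, threshold_s):
    start = time.monotonic()
    subprocess.run(command, check=True, capture_output=True)
    elapsed = time.monotonic() - start
    assert elapsed < threshold_s, f"{' '.join(command)} took {elapsed:.2f}s (limit {threshold_s}s)"
```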
This is a matter of risk: how likely is it that this test will break things for people running regular PRs on public runners?
We do not know. We - for now - know that it seems to work on self-hosted. Let's try a few times on public runners. I will add the label and close/reopen to check.
Yeah. I think it's a good idea, and we already have some of that in a different form - not necessarily timing, but definitely targeted at ballooning time of certain operations. We have a dedicated "perf_kit" util package for that, and one part of it is, for example, counting the number of queries to the DB, which is another metric of "performance" (I believe those tests helped, or at least kept in check, some of the initial problems we had when adding core features caused accidental changes in the number of queries to the DB).
Those are better, because they do not depend on the environment. But we have a few pure time-based ones: https://github.com/apache/airflow/blob/main/tests/core/test_core.py#L82 And the problem there is that we have to put in high numbers to account for a "slow" environment - a public runner having other tests running in parallel - which speeds up overall test time a lot but introduces high variability in the timings. This is likely why you have 3 seconds there (answering @uranusjr's comment). The tricky part is that our tests are executed in various "environments" - i.e. they can often be run in parallel with other tests in order to speed up execution - and it might get very tricky to figure out whether the slowness is just environmental or a real regression. But that discussion actually gave me an interesting idea. Maybe we could extract all "timing" tests into a separate test type and always execute it "alone", only with sqlite - less CPU utilization but more stability of the environment. We could add a separate job for those kinds of tests and exclude them from regular tests easily - I think we should do it this way.
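Going back to the query-counting part of that comment, here is a generic sketch of that technique using SQLAlchemy's event hooks. This is not the actual perf_kit code, just an illustration of why such checks are environment-independent.

```python
# Generic sketch of the query-counting technique; not the perf_kit implementation.
from contextlib import contextmanager

from sqlalchemy import event

@contextmanager
def assert_max_queries(engine, limit):
    count = 0

    def _count(conn, cursor, statement, parameters, context, executemany):
        nonlocal count
        count += 1

    event.listen(engine, "before_cursor_execute", _count)
    try:
        yield
    finally:
        event.remove(engine, "before_cursor_execute", _count)
    assert count <= limit, f"expected at most {limit} queries, got {count}"
```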
Yupp, agreed, I think the limit now should be good enough for most cases. For example, the regression I mentioned in the description more than doubled the CLI run time (up to about 8 seconds). So it will surely catch large regressions like that, which is still helpful. But creating a new test suite as a follow-up could be a great optimization and would allow us to lower the threshold a bit.
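On the separate-suite idea, one lightweight shape for it could be a custom pytest marker. The marker name and layout below are an assumption of how it might look, not an existing convention in the repo.

```python
# conftest.py -- register a marker for time-sensitive tests (illustrative name).
def pytest_configure(config):
    config.addinivalue_line("markers", "timing: time-sensitive tests run in an isolated job")


# in a test module -- tag the timing tests.
import pytest

@pytest.mark.timing
def test_cli_help_runtime():
    ...
```

The isolated CI job would then run `pytest -m timing` alone against sqlite, while the regular jobs run `pytest -m "not timing"`.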
I am almost certainly overthinking this, but if we're considering Future Work... what about, instead of setting a hardcoded time limit, we store the previous durations somewhere and calculate an average and max based on that? Maybe run that suite against
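For what it's worth, here is a rough sketch of that "compare against recorded history" idea. The JSON file, its location, and the tolerance factor are all assumptions for illustration.

```python
# Rough sketch of comparing a measured duration against stored history; the file
# name ("cli_timings.json") and the 1.5x tolerance are assumptions.
import json
import statistics

def check_against_history(command: str, duration_s: float,
                          history_path: str = "cli_timings.json",
                          tolerance: float = 1.5) -> None:
    with open(history_path) as f:
        history = json.load(f)  # e.g. {"airflow --help": [2.1, 2.3, 2.2], ...}
    past = history.get(command, [])
    if not past:
        return  # no baseline recorded yet; nothing to compare against
    baseline = statistics.mean(past)
    assert duration_s <= baseline * tolerance, (
        f"{command} took {duration_s:.2f}s, more than {tolerance}x the historical mean of {baseline:.2f}s"
    )
```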
I think we can all agree that this is a solid first step. There is some good discussion going on, but whatever "next steps" come from that, I don't think they need to block this now.
Closed/reopened to try it on public runners (possibly we should repeat a few times).
BTW. This is a constant risk - an influx of tests that are flaky. I personally think this is something that is difficult to handle, but it's a different story when we KNOW that a test can be flaky... in such a case we should make every effort to test it in various configurations. And we have tools for that:
Those are the tools all committers have to "try" a risky PR and see whether a change that has a possible side effect is "sound enough", so that it does not introduce a problem.
That's a great idea to test on public runners; I'll keep that in my back pocket for future changes. I'll close/reopen the tests a few times on public runners. So far the test is running quite well and is not even close to the slowest in the CLI test suite 😄
Triggering another round of tests, the last set was green.
That test was green as well, just re-triggered all.
Issue created: #32620
And called for someone to implement it :). Time for me to become less of a SPOF for the whole CI :)
A lot of work has been done to optimize the load time of the Executor modules so that they can be included in the CLI code path (so that executors can vend CLI commands, part of AIP-51, see this PR). It has come up several times that we should add a test to assert the CLI runtime does not regress, and while working on this code recently I found one such regression (OTEL related, not yet present in main, which I will fix in another PR), so I was motivated to add one. I played around with several options, but none captured the full extent of any delays/regressions beyond actually calling the airflow CLI in a subprocess call, which is what this unit test does.
Timing-related tests are often troublesome because they can be flaky: if, for example, they are run on an env that is bogged down/slow, they may breach their assertions. But if we give a large enough (yet still reasonable) threshold, I think we'll get more benefit than harm. But I'm curious to hear what others think.
CC: @potiuk