
Performance Regressions introduced in v4.51.1 and v4.55.3 #2308

Closed
jasongraham opened this issue Jan 4, 2020 · 20 comments

jasongraham commented Jan 4, 2020

Hi folks.

I maintain an internal tool at my company and have been very happily using hypothesis as part of my test suite.

Over the past month, I've noticed a dramatic increase in the duration of my test suite: well over an order of magnitude, from ~5 minutes to more than 3 hours, when running under a CI profile with max_examples=1000.
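(For context, the "CI profile" is just a Hypothesis settings profile with max_examples=1000, registered roughly like the sketch below. The profile names here are illustrative, not my exact configuration.)

```python
# Simplified sketch of a CI settings profile (profile names are illustrative).
from hypothesis import settings

settings.register_profile("ci", max_examples=1000)   # thorough runs on CI
settings.register_profile("dev", max_examples=100)   # quicker local runs
settings.load_profile("ci")  # or select one with --hypothesis-profile=ci
```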

I've put together an example that doesn't exhibit quite as severe a difference, but illustrates the issue using a simplified version of my data structures and test strategy, and I've run it across several versions of Hypothesis.

It appears that the performance regressions I'm seeing were introduced in versions 4.51.1 and 4.55.3.

My hope is to make you folks aware of this and move toward figuring out how to improve the situation. If there's anything you'd like me to try, let me know.

Environment

  • Ubuntu 16.04 VM on Windows 10
  • Python 3.8.1
  • pytest 5.3.2

Test

See the following Gist (which includes a description of how I ran the test).
https://gist.github.com/jasongraham/177e4f854c0c52a0bf032707e574d65d

Results

| Hypothesis Version | Trial 1 | Trial 2 | Trial 3 | Average |
|---|---|---|---|---|
| 4.48.1 | 39.48 | 35.39 | 40.09 | 38.32 |
| 4.51.0 | 38.51 | 47.83 | 39.29 | 41.87667 |
| 4.51.1 | 96.29 | 99.97 | 94.53 | 96.93 |
| 4.53.3 | 96.09 | 94.25 | 96.15 | 95.49667 |
| 4.54.2 | 95.28 | 91.61 | 88.14 | 91.67667 |
| 4.55.2 | 98.53 | 92.6 | 100.7 | 97.27667 |
| 4.55.3 | 260.82 | 281.43 | 263.06 | 268.4367 |
| 4.55.4 | 247.1 | 267.03 | 251.14 | 255.09 |
| 4.57.1 | 272.13 | 255.11 | 262.63 | 263.29 |
| 5.0.1 | 250.1 | 269.7 | 245.83 | 255.21 |
| 5.1.0 | 251.97 | 246.43 | 258.3 | 252.2333 |

[Plot of the timings above vs. Hypothesis version]

I see more severe increases with larger max_examples, but if you don't want to spend as much time as I did, the increase is also easily visible with max_examples=100 (though not really visible with max_examples=10).


Edit ~06 Jan 2020 09:31:12 AM PST: I must have been very tired when I wrote this. I've cleaned it up and actually finished all of my thoughts.

Zac-HD added the performance label Jan 4, 2020
Zac-HD (Member) commented Jan 4, 2020

Thanks for the fantastic report @jasongraham!

@DRMacIver - #2294 suggests that these changes have also been reducing coverage 😬
Cross-references: version 4.51.1 was #2263, and 4.55.3 was #2261.

Zac-HD added the flaky-tests and tests/build/CI labels Jan 11, 2020
Zac-HD (Member) commented Jan 11, 2020

As far as I can tell this is also the cause of our CI timeouts. While a timeout in any individual build is rare, across all the version+platform combinations we build there's a ~2/3 chance that any particular CI run will need a restart 😭

Zalathar (Contributor) commented Jan 30, 2020

It looks like 4.51.1 is actually #2263. (I've edited Zac's post to include the correct cross-reference.)

Zalathar (Contributor) commented:

I've been poking around with pytest-profiling's SVG output, mainly across the first performance cliff (4.51.0 to 4.51.1).

Nothing major to report so far, but I have noticed that the latter version makes roughly twice as many calls to ConjectureData.draw (~300k vs ~600k), for roughly the same number of test_function calls (~1k).

Zalathar (Contributor) commented:

For the first cliff, most of the performance loss seems to be caused by af1463a specifically.

Zalathar (Contributor) commented Feb 7, 2020

If I check out 4.51.0 and then alter zero_chance to artificially reduce the chosen zero-chance, the test takes longer and longer to run. If I force the zero-chance to 0.0, then it takes the same amount of time as 4.51.1.

This supports the idea that injecting fewer/no zero blocks is indeed somehow making the test slower.

(Precisely why this happens is still unclear, but it's plausible on its face, and a good lead.)
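(For anyone following along: "zero_chance" is an internal knob in the generation engine. A loose illustration of the idea, not the actual Hypothesis code and with made-up names, is something like the sketch below.)

```python
# Loose illustration only; not Hypothesis's real byte-drawing code.
# With probability `zero_chance`, hand back an all-zero block instead of
# random bytes, which makes repeated/degenerate values far more likely.
import random

def draw_block(rng: random.Random, size: int, zero_chance: float) -> bytes:
    if rng.random() < zero_chance:
        return bytes(size)  # forced all-zero block
    return bytes(rng.getrandbits(8) for _ in range(size))  # ordinary random block
```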

Zalathar (Contributor) commented Feb 7, 2020

I tried looking at the test/draw timings collected by --hypothesis-show-statistics, but (at least on my machine) it tells me that a 15–30 second test is spending 50 seconds in the test function, which is obviously bogus.

(Presumably I'm hitting some kind of precision floor in time.monotonic(), so adding up a thousand single-test times gives a meaningless result.)
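(A standalone sketch of what I mean, not Hypothesis code: compare the sum of many per-call timings against one timing of the whole loop. On platforms where time.monotonic() has coarse resolution, the two can disagree badly.)

```python
# Illustrative sketch: per-call timings near the clock's resolution are
# quantised, so their sum can differ wildly from a single outer measurement.
import time

print("monotonic resolution:", time.get_clock_info("monotonic").resolution, "s")

def tiny_test():
    sum(range(100))  # stand-in for a very fast test body

N = 100_000
summed = 0.0
outer_start = time.monotonic()
for _ in range(N):
    start = time.monotonic()
    tiny_test()
    summed += time.monotonic() - start
outer = time.monotonic() - outer_start

print(f"sum of per-call timings: {summed:.3f} s")
print(f"single outer timing:     {outer:.3f} s")
```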

Zalathar (Contributor) commented Feb 10, 2020

Looking across the second performance cliff, I found a much more dramatic difference in profile output.

Before the cliff, we spend ~99% of our time below generate_new_examples, and practically all of that is spent inside engine.test_function (mainly inside data.draw).

After the cliff, we still have ~99% in generate_new_examples, but engine.test_function is only ~32% (again, with most of that in data.draw).

I don't have a good handle on where the remaining 2/3 of profile time is spent, but it looks to be tied up in various pieces of bookkeeping in generate_new_examples.

(Ugh, GitHub makes it way too easy to unintentionally close an issue while commenting.)

Zalathar reopened this Feb 10, 2020
Zalathar (Contributor) commented:

Also, the fact that test_function drops to ~32% seems to agree with the overall increase in runtime after the cliff.

In other words, we seem to be doing roughly the same amount of test_function work as before, and the extra runtime seems to come from that mysterious new ~67%.

Zalathar (Contributor) commented:

Incidentally, I found that the targeting phase appears to make the profile results harder to read, so it helps to add phases=[Phase.generate] to turn targeting off.

(I was surprised to see the targeting code run without any target calls, but that's a mystery for another time.)
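For reference, that looks something like the sketch below; the strategy and test body are placeholders, and only the settings line matters.

```python
# Minimal sketch: run only the generate phase, so profiles aren't muddied by
# the targeting (or shrink/reuse/explicit-example) machinery.
from hypothesis import Phase, given, settings, strategies as st

@settings(phases=[Phase.generate])
@given(st.integers())
def test_generate_only(x):
    assert isinstance(x, int)
```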

Zalathar (Contributor) commented:

Looks like most of the overhead is in the part of generate_new_examples that builds groups for use by the mutator.

It should be possible to arrange for this list to be built only 0–1 times per mutation loop, instead of once per iteration.
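A hypothetical sketch of that approach (made-up names, not the real engine code) would be:

```python
# Hypothetical sketch of the caching idea, not Hypothesis's actual internals.
# Rebuild the mutator's example groups only when the underlying data changes,
# rather than once per iteration of the mutation loop.
def mutation_loop(data, mutate_one, build_groups, iterations):
    groups = None
    for _ in range(iterations):
        if groups is None:
            groups = build_groups(data)  # the expensive bookkeeping step
        new_data = mutate_one(data, groups)
        if new_data is not data:
            data = new_data
            groups = None  # invalidate only when the data actually changed
    return data
```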

Zac-HD (Member) commented Feb 10, 2020

@Zalathar - I don't have anything to add to your investigation, but wanted to say that I really appreciate your contributions and persistence in tracking down weirdness in the internals. It makes a difference!

Zalathar added a commit to Zalathar/hypothesis that referenced this issue Feb 13, 2020
If data hasn't changed since the last iteration of the mutator loop,
then the example groups can be reused, and don't need to be
recalculated.

This restores a noticeable chunk of the performance lost in HypothesisWorks#2308.
Zac-HD (Member) commented Feb 13, 2020

@jasongraham - we've just released a new version with @Zalathar's speedup patch for the mutator. If you could benchmark with 5.5.2 that would be really helpful!

jasongraham (Author) commented:

Thanks @Zalathar for your efforts looking into this!

| Version | Trial 1 | Trial 2 | Trial 3 | Average |
|---|---|---|---|---|
| 4.51.0 | 39.07 | 43.38 | 49.97 | 44.14 |
| 4.51.1 | 94.64 | 105.5 | 97.02 | 99.05333 |
| 4.55.3 | 279.99 | 263.35 | 268.99 | 270.7767 |
| 5.1.0 | 214.95 | 216.39 | 198.71 | 210.0167 |
| 5.2.0 | 178.46 | 241.71 | 216.73 | 212.3 |
| 5.5.1 | 207.1 | 198.9 | 207.75 | 204.5833 |
| 5.5.2 | 135.25 | 131.08 | 128.57 | 131.6333 |

I definitely see improvement matching the ~30s mentioned in #2353.

I'm not sure what to make of the decrease that was already visible before 5.5.2 (compared with 4.55.3), but it appears that a good fraction of the slowdown from the second performance cliff (introduced in 4.55.3) has been mitigated.

Is there more ground that can be gained on that front?

Also, if I understand correctly, the first performance cliff (introduced in 4.51.1) is due to a change in the distribution of randomly generated data, which I assume was done for good reasons. Is it reasonable for me to assume that this isn't something I should expect to change?

Are the performance problems I'm seeing being exacerbated by how my strategy is coded? Could I be mitigating some of them from my end?

Zac-HD (Member) commented Feb 14, 2020

I'm not sure what to make of the decrease that was already visible before 5.5.2 (compared with 4.55.3), but it appears that a good fraction of the slowdown from the second performance cliff (introduced in 4.55.3) has been mitigated. Is there more ground that can be gained on that front?

Maybe! Skimming through the changelog, we've had a couple of fairly routine performance fixes, plus I think a one-off improvement when we dropped some Python 2 compatibility shims. So generic as well as focussed performance gains are likely still on the table; we just need someone who can volunteer to find them.

The first performance cliff (introduced in 4.51.1) is due to a change in the distribution of randomly generated data, which I assume was done for good reasons. Is it reasonable for me to assume that this isn't something I should expect to change?

I think we are likely to keep this, yes. When generating test cases there's always a tradeoff between quality (per-test-case probability of finding a bug) and speed (test-case budget). In this case we can now generate data with correlations or repeats which would have been astronomically unlikely before, so we're basically always trading off some speed for a qualitative improvement in certain tests. But that does assume that we can make it faster again without losing the benefits!

Are the performance problems I'm seeing being exacerbated by how my strategy is coded? Could I be mitigating some of them from my end?

Maybe to both, but it's hard to tell if you can't share your strategy code 😉

Zalathar (Contributor) commented:

I'm not sure what to make of the decrease that was already visible before 5.5.2 (compared with 4.55.3), but it appears that a good fraction of the slowdown from the second performance cliff (introduced in 4.55.3) has been mitigated.

Is there more ground that can be gained on that front?

Probably!

The code we use to generate duplicate values hasn't been designed or tuned with speed in mind. I took care of some very-low-hanging fruit (and got a big boost), but what's left seems like it should still have room for noticeable improvements.

(The caveat is that this will involve hard work, so I can't make any promises about whether or when it will happen.)

Also, if I understand correctly, the first performance cliff (introduced in 4.51.1) is due to a change in the distribution of randomly generated data, which I assume was done for good reasons. Is it reasonable for me to assume that this isn't something I should expect to change?

Hmm, we probably need @DRMacIver's opinion on this one.

The impression I get is that #2263 was mostly about reducing code complexity, and the changes to the distribution might be larger than intended.

Are the performance problems I'm seeing being exacerbated by how my strategy is coded? Could I be mitigating some of them from my end?

I didn't see anything obviously bad in your sample code.

rsokl (Contributor) commented Feb 14, 2020

EDIT: I am re-running these. I realized that background processes started running during the timings, which may have introduced misleading structure into the sequence of timings.

rsokl (Contributor) commented Feb 14, 2020

Here are the timings for mygrad's test suite.
I am unable to revert to versions prior to 4.53.2 because my test suite has that as a hard minimum version, but I do believe there was a substantial slowdown in 4.51.1 (as reported above).

[Chart: mygrad test suite timings across Hypothesis versions]

Zalathar (Contributor) commented:

5.8.2 (#2390) contains another speedup that should be an improvement here.

Zac-HD (Member) commented Nov 17, 2020

It looks like we've recovered most of the performance lost in those versions, with the remainder attributable to intended changes to the distribution of inputs.

We are of course very happy to accept performance-tuning PRs, including those informed by benchmarking against previous versions, but it seems that nobody is actively working on this, and in the absence of specific plans I'm closing the issue.
