Performance Regressions introduced in v4.51.1 and v4.55.3 #2308
Thanks for the fantastic report @jasongraham! @DRMacIver - #2294 suggests that these changes have also been reducing coverage 😬
As far as I can tell this is also the cause of our CI timeouts. While individually rare, over the builds we do for version+platform combinations there's a ~2/3 chance of needing to restart any particular CI run 😭
It looks like the 4.51.1 regression is actually #2263. (I've edited Zac's post to include the correct cross-reference.)
I've been poking around with a profiler. Nothing major to report so far, but I have noticed that the latter version makes roughly twice as many calls into Hypothesis's data-drawing internals.
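For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the approach (the test body and the "hypothesis" filter below are placeholders, not the code from the original report):

```python
# Sketch: profile a Hypothesis test and restrict the report to Hypothesis's own
# frames, so call counts can be compared across installed library versions.
import cProfile
import pstats

from hypothesis import given, strategies as st


@given(st.lists(st.integers()))
def test_example(xs):
    # Placeholder test body; the real benchmark lives in the linked Gist.
    assert sorted(sorted(xs)) == sorted(xs)


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    test_example()  # calling a @given-wrapped function runs the whole example loop
    profiler.disable()

    # Sort by call count and show only frames from the hypothesis package;
    # run once per installed version and diff the output.
    pstats.Stats(profiler).sort_stats("ncalls").print_stats("hypothesis")
```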
For the first cliff, most of the slowdown seems to be caused by af1463a specifically.
If I check out 4.51.0 and then alter it locally so that fewer zero blocks are injected, the test slows down there as well. This supports the idea that injecting fewer/no zero blocks is indeed somehow making the test slower. (Precisely why this happens is still unclear, but it's plausible on its face, and a good lead.)
I tried looking at the test/draw timings collected by the statistics machinery, but they didn't tell me much. (Presumably I'm hitting some kind of precision floor in the timer.)
Looking across the second performance cliff, I found a much more dramatic difference in profile output. Before the cliff, we spend ~99% of our time below the test runner's entry point. After the cliff, we still have ~99% there, but only about a third of it appears to be the test itself. I don't have a good handle on where the remaining 2/3 of profile time is spent, but it looks to be tied up in various pieces of bookkeeping in the engine. (Ugh, GitHub makes it way too easy to unintentionally close an issue while commenting.)
Also, the other numbers I'm seeing point the same way. In other words, we seem to be doing roughly the same amount of real work per test case, just with a lot more overhead wrapped around it.
Incidentally, I found that the targeting phase appears to make the profile results harder to read, so it helps to disable it in the test's settings. (I was surprised to see the targeting code run even without any targets being registered.)
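For reference, here is a minimal sketch of one way to keep targeting out of a profiling run, assuming the settings/Phase API (the test itself is just a placeholder):

```python
# Sketch: enable every phase except targeting, so the targeted-PBT pass
# doesn't contribute noise to the profile.
from hypothesis import Phase, given, settings, strategies as st


@settings(phases=[Phase.explicit, Phase.reuse, Phase.generate, Phase.shrink])
@given(st.integers())
def test_without_targeting(x):
    assert isinstance(x, int)
```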
Looks like most of the overhead is in the part of the mutator that rebuilds the list of example groups. It should be possible to arrange for this list to be built only 0–1 times per mutation loop, instead of once per iteration.
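Roughly, the idea is something like the following sketch (the names here are illustrative, not the actual Hypothesis internals):

```python
# Sketch: recompute the derived grouping only when the underlying data has
# actually changed since the previous mutator iteration.
from collections import defaultdict


def compute_example_groups(blocks):
    # Stand-in for the expensive step: bucket block indices by their value.
    groups = defaultdict(list)
    for index, value in enumerate(blocks):
        groups[value].append(index)
    return list(groups.values())


class GroupCache:
    def __init__(self):
        self._last_blocks = None
        self._groups = None

    def groups_for(self, blocks):
        # Rebuild only when the blocks differ from the ones seen last time, so
        # an unchanged `blocks` costs a comparison instead of a full rebuild.
        if self._last_blocks != blocks:
            self._groups = compute_example_groups(blocks)
            self._last_blocks = list(blocks)
        return self._groups
```

In the mutation loop this gets called once per iteration, but the expensive rebuild only happens on the iterations where the data actually changed.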
@Zalathar - I don't have anything to add to your investigation, but wanted to say that I really appreciate your contributions and persistence in tracking down weirdness in the internals. It makes a difference!
If data hasn't changed since the last iteration of the mutator loop, then the example groups can be reused, and don't need to be recalculated. This restores a noticeable chunk of the performance lost in HypothesisWorks#2308.
@jasongraham - we've just released a new version with @Zalathar's speedup patch for the mutator. If you could benchmark with 5.5.2 that would be really helpful!
Thanks @Zalathar for your efforts looking into this!
I definitely see improvement matching the ~30s mentioned in #2353. I'm not sure what to think about the decrease prior to 5.5.2 compared with 4.55.3, but it appears that a good fraction of the performance change I saw in the second performance cliff has been mitigated. Is there more ground that can be gained on that front?

Also, if I understand correctly, the first performance cliff introduced in 4.51.1 is due to a change in the distribution of randomly generated data, which I assume was done for good reasons. Is it reasonable for me to assume that this isn't something I should expect to change?

Are the performance problems I'm seeing being exacerbated by how my strategy is coded? Could I be mitigating some of these problems from my end?
Maybe! Skimming through the changelog we had a couple of fairly routine performance fixes, plus I think a one-off improvement when we dropped some Python 2 compatibility shims. So generic as well as focussed performance gains are likely still on the table - we just need someone who can volunteer to find them.
I think we are likely to keep this, yes. When generating test cases there's always a tradeoff between quality (per-test-case probability of finding a bug) and speed (test case budget). In this case we can now generate data with correlations or repeats which would be astronomically unlikely before, so we're basically trading off some speed always for qualitative improvement in certain tests. But that does assume that we can make it faster again without losing the benefits!
Maybe to both, but it's hard to tell if you can't share your strategy code 😉
Probably! The code we use to generate duplicate values hasn't been designed or tuned with speed in mind. I took care of some very-low-hanging fruit (and got a big boost), but what's left seems like it should still have room for noticeable improvements. (The caveat is that this will involve hard work, so I can't make any promises about if or when this will happen.)
Hmm, we probably need @DRMacIver's opinion on this one. The impression I get is that #2263 was mostly about reducing code complexity, and the changes to distribution might be larger than intended.
I didn't see anything obviously bad in your sample code.
EDIT: I am re-running these. I realized that background processes started running during the timings, which may have introduced misleading structure into the sequence of timings.
Here are the timings for mygrad's test suite.
5.8.2 (#2390) contains another speedup that should be an improvement here. |
It looks like we've recouped most of the performance dropped in those versions, with the remainder attributable to intended changes to the distribution of inputs. We are of course very happy to accept performance-tuning PRs, including those informed by benchmarking previous versions, but no one is actively working on this at the moment, so in the absence of specific plans I'm closing the issue.
Hi folks.
I maintain an internal tool at my company and have been very happily using hypothesis as part of my test suite.
Over the past month, I've noticed a significant change in the duration of my test suite: a couple of orders of magnitude, from ~5 minutes to >3 hours, when I run under a CI profile with max_examples=1000.

I've made an example that doesn't exhibit quite such a severe difference, but illustrates the issue using a simplified version of my data structures and test strategy, and I've run this example across several versions of Hypothesis.
It appears that the regressions I'm seeing in the performance were introduced in versions 4.51.1 and 4.55.3.
My hope is to make you folks aware of this and move toward figuring out how to improve the situation. If there's anything you'd like me to try, let me know.
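For context, a CI profile like the one I mention can be registered along these lines (a sketch only; the profile names and deadline choice are illustrative rather than copied from my configuration):

```python
# Sketch: register Hypothesis settings profiles and pick one for CI runs.
from hypothesis import settings

settings.register_profile("ci", max_examples=1000, deadline=None)
settings.register_profile("dev", max_examples=100)

# Typically selected in conftest.py, or on the command line via
# `pytest --hypothesis-profile=ci`.
settings.load_profile("ci")
```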
Environment
Test
See the following Gist (which includes a description of how I ran the test).
https://gist.github.com/jasongraham/177e4f854c0c52a0bf032707e574d65d
Results
I see more severe increases with larger max_examples, but the increase is also easily visible with max_examples=100 (not really visible with max_examples=10) if you don't want to waste as much time as I did.

Edit ~06 Jan 2020 09:31:12 AM PST: I must have been very tired when writing this. I've cleaned it up and actually finished all of my thoughts.