
Performance Regressions introduced in v4.51.1 and v4.55.3 #2308

Closed
jasongraham opened this issue Jan 4, 2020 · 20 comments

jasongraham commented Jan 4, 2020

Hi folks.

I maintain an internal tool at my company and have been very happily using hypothesis as part of my test suite.

Over the past month, I've noticed a dramatic increase in the duration of my test suite: well over an order of magnitude, from ~5 minutes to more than 3 hours, when running under a CI profile with max_examples=1000.
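(For context, the "CI profile" is just a Hypothesis settings profile with max_examples=1000, registered roughly like the sketch below. The profile names here are illustrative, not my exact configuration.)

```python
# Simplified sketch of a CI settings profile (profile names are illustrative).
from hypothesis import settings

settings.register_profile("ci", max_examples=1000)   # thorough runs on CI
settings.register_profile("dev", max_examples=100)   # quicker local runs
settings.load_profile("ci")  # or select one with --hypothesis-profile=ci
```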

I've put together an example that doesn't exhibit quite as severe a difference, but illustrates the issue using a simplified version of my data structures and test strategy, and I've run it across several versions of Hypothesis.

It appears that the performance regressions I'm seeing were introduced in versions 4.51.1 and 4.55.3.

My hope is to make you folks aware of this and move toward figuring out how to improve the situation. If there's anything you'd like me to try, let me know.

Environment

  • Ubuntu 16.04 VM on Windows 10
  • Python 3.8.1
  • pytest 5.3.2

Test

See the following Gist (which includes a description of how I ran the test).
https://gist.github.com/jasongraham/177e4f854c0c52a0bf032707e574d65d

Results

| Hypothesis Version | Trial 1 | Trial 2 | Trial 3 | Average |
|---|---|---|---|---|
| 4.48.1 | 39.48 | 35.39 | 40.09 | 38.32 |
| 4.51.0 | 38.51 | 47.83 | 39.29 | 41.87667 |
| 4.51.1 | 96.29 | 99.97 | 94.53 | 96.93 |
| 4.53.3 | 96.09 | 94.25 | 96.15 | 95.49667 |
| 4.54.2 | 95.28 | 91.61 | 88.14 | 91.67667 |
| 4.55.2 | 98.53 | 92.6 | 100.7 | 97.27667 |
| 4.55.3 | 260.82 | 281.43 | 263.06 | 268.4367 |
| 4.55.4 | 247.1 | 267.03 | 251.14 | 255.09 |
| 4.57.1 | 272.13 | 255.11 | 262.63 | 263.29 |
| 5.0.1 | 250.1 | 269.7 | 245.83 | 255.21 |
| 5.1.0 | 251.97 | 246.43 | 258.3 | 252.2333 |

[Plot of the timings above vs. Hypothesis version]

I see more severe increases with larger max_examples, but if you don't want to spend as much time as I did, the increase is also easily visible with max_examples=100 (though not really visible with max_examples=10).


Edit ~06 Jan 2020 09:31:12 AM PST: I must have been very tired when I wrote this. I've cleaned it up and actually finished all of my thoughts.

Zac-HD added the performance label Jan 4, 2020
Zac-HD (Member) commented Jan 4, 2020

Thanks for the fantastic report @jasongraham!

@DRMacIver - #2294 suggests that these changes have also been reducing coverage 😬
Cross-references: version 4.51.1 was #2263, and 4.55.3 was #2261.

Zac-HD added the flaky-tests and tests/build/CI labels Jan 11, 2020
Zac-HD (Member) commented Jan 11, 2020

As far as I can tell this is also the cause of our CI timeouts. While a timeout in any individual build is rare, across all the version+platform combinations we build there's a ~2/3 chance that any particular CI run will need a restart 😭

Zalathar (Contributor) commented Jan 30, 2020

It looks like 4.51.1 is actually #2263. (I've edited Zac's post to include the correct cross-reference.)

Zalathar (Contributor) commented:

I've been poking around with pytest-profiling's SVG output, mainly across the first performance cliff (4.51.0 to 4.51.1).

Nothing major to report so far, but I have noticed that the latter version makes roughly twice as many calls to ConjectureData.draw (~300k vs ~600k), for roughly the same number of test_function calls (~1k).

Zalathar (Contributor) commented:

For the first cliff, most of the performance loss seems to be caused by af1463a specifically.

Zalathar (Contributor) commented Feb 7, 2020

If I check out 4.51.0 and then alter zero_chance to artificially reduce the chosen zero-chance, the test takes longer and longer to run. If I force the zero-chance to 0.0, then it takes the same amount of time as 4.51.1.

This supports the idea that injecting fewer/no zero blocks is indeed somehow making the test slower.

(Precisely why this happens is still unclear, but it's plausible on its face, and a good lead.)
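(For anyone following along: "zero_chance" is an internal knob in the generation engine. A loose illustration of the idea, not the actual Hypothesis code and with made-up names, is something like the sketch below.)

```python
# Loose illustration only; not Hypothesis's real byte-drawing code.
# With probability `zero_chance`, hand back an all-zero block instead of
# random bytes, which makes repeated/degenerate values far more likely.
import random

def draw_block(rng: random.Random, size: int, zero_chance: float) -> bytes:
    if rng.random() < zero_chance:
        return bytes(size)  # forced all-zero block
    return bytes(rng.getrandbits(8) for _ in range(size))  # ordinary random block
```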

Zalathar (Contributor) commented Feb 7, 2020

I tried looking at the test/draw timings collected by --hypothesis-show-statistics, but (at least on my machine) it tells me that a 15–30 second test is spending 50 seconds in the test function, which is obviously bogus.

(Presumably I'm hitting some kind of precision floor in time.monotonic(), so adding up a thousand single-test times gives a meaningless result.)
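(A standalone sketch of what I mean, not Hypothesis code: compare the sum of many per-call timings against one timing of the whole loop. On platforms where time.monotonic() has coarse resolution, the two can disagree badly.)

```python
# Illustrative sketch: per-call timings near the clock's resolution are
# quantised, so their sum can differ wildly from a single outer measurement.
import time

print("monotonic resolution:", time.get_clock_info("monotonic").resolution, "s")

def tiny_test():
    sum(range(100))  # stand-in for a very fast test body

N = 100_000
summed = 0.0
outer_start = time.monotonic()
for _ in range(N):
    start = time.monotonic()
    tiny_test()
    summed += time.monotonic() - start
outer = time.monotonic() - outer_start

print(f"sum of per-call timings: {summed:.3f} s")
print(f"single outer timing:     {outer:.3f} s")
```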

Zalathar (Contributor) commented Feb 10, 2020

Looking across the second performance cliff, I found a much more dramatic difference in profile output.

Before the cliff, we spend ~99% of our time below generate_new_examples, and practically all of that is spent inside engine.test_function (mainly inside data.draw).

After the cliff, we still have ~99% in generate_new_examples, but engine.test_function is only ~32% (again, with most of that in data.draw).

I don't have a good handle on where the remaining 2/3 of profile time is spent, but it looks to be tied up in various pieces of bookkeeping in generate_new_examples.

(Ugh, GitHub makes it way too easy to unintentionally close an issue while commenting.)

Zalathar reopened this Feb 10, 2020
Zalathar (Contributor) commented:

Also, the fact that test_function drops to ~32% seems to agree with the overall increase in runtime after the cliff.

In other words, we seem to be doing roughly the same amount of test_function work as before, and the extra runtime seems to come from that mysterious new ~67%.

Zalathar (Contributor) commented:

Incidentally, I found that the targeting phase appears to make the profile results harder to read, so it helps to add phases=[Phase.generate] to turn targeting off.

(I was surprised to see the targeting code run without any target calls, but that's a mystery for another time.)
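For reference, that looks something like the sketch below; the strategy and test body are placeholders, and only the settings line matters.

```python
# Minimal sketch: run only the generate phase, so profiles aren't muddied by
# the targeting (or shrink/reuse/explicit-example) machinery.
from hypothesis import Phase, given, settings, strategies as st

@settings(phases=[Phase.generate])
@given(st.integers())
def test_generate_only(x):
    assert isinstance(x, int)
```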

Zalathar (Contributor) commented:

Looks like most of the overhead is in the part of generate_new_examples that builds groups for use by the mutator.

It should be possible to arrange for this list to be built only 0–1 times per mutation loop, instead of once per iteration.
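A hypothetical sketch of that approach (made-up names, not the real engine code) would be:

```python
# Hypothetical sketch of the caching idea, not Hypothesis's actual internals.
# Rebuild the mutator's example groups only when the underlying data changes,
# rather than once per iteration of the mutation loop.
def mutation_loop(data, mutate_one, build_groups, iterations):
    groups = None
    for _ in range(iterations):
        if groups is None:
            groups = build_groups(data)  # the expensive bookkeeping step
        new_data = mutate_one(data, groups)
        if new_data is not data:
            data = new_data
            groups = None  # invalidate only when the data actually changed
    return data
```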

Zac-HD (Member) commented Feb 10, 2020

@Zalathar - I don't have anything to add to your investigation, but wanted to say that I really appreciate your contributions and persistence in tracking down weirdness in the internals. It makes a difference!

Zalathar added a commit to Zalathar/hypothesis that referenced this issue Feb 13, 2020
If data hasn't changed since the last iteration of the mutator loop,
then the example groups can be reused, and don't need to be
recalculated.

This restores a noticeable chunk of the performance lost in HypothesisWorks#2308.
Zac-HD (Member) commented Feb 13, 2020

@jasongraham - we've just released a new version with @Zalathar's speedup patch for the mutator. If you could benchmark with 5.5.2 that would be really helpful!

jasongraham (Author) commented:

Thanks @Zalathar for your efforts looking into this!

| Version | Trial 1 | Trial 2 | Trial 3 | Average |
|---|---|---|---|---|
| 4.51.0 | 39.07 | 43.38 | 49.97 | 44.14 |
| 4.51.1 | 94.64 | 105.5 | 97.02 | 99.05333 |
| 4.55.3 | 279.99 | 263.35 | 268.99 | 270.7767 |
| 5.1.0 | 214.95 | 216.39 | 198.71 | 210.0167 |
| 5.2.0 | 178.46 | 241.71 | 216.73 | 212.3 |
| 5.5.1 | 207.1 | 198.9 | 207.75 | 204.5833 |
| 5.5.2 | 135.25 | 131.08 | 128.57 | 131.6333 |

I definitely see improvement matching the ~30s mentioned in #2353.

I'm not sure what to make of the decrease that was already visible before 5.5.2 (compared with 4.55.3), but it appears that a good fraction of the slowdown from the second performance cliff (introduced in 4.55.3) has been mitigated.

Is there more ground that can be gained on that front?

Also, if I understand correctly, the first performance cliff (introduced in 4.51.1) is due to a change in the distribution of randomly generated data, which I assume was done for good reasons. Is it reasonable for me to assume that this isn't something I should expect to change?

Are the performance problems I'm seeing being exacerbated by how my strategy is coded? Could I be mitigating some of them from my end?

Zac-HD (Member) commented Feb 14, 2020

I'm not sure what to make of the decrease that was already visible before 5.5.2 (compared with 4.55.3), but it appears that a good fraction of the slowdown from the second performance cliff (introduced in 4.55.3) has been mitigated. Is there more ground that can be gained on that front?

Maybe! Skimming through the changelog, we've had a couple of fairly routine performance fixes, plus I think a one-off improvement when we dropped some Python 2 compatibility shims. So generic as well as focussed performance gains are likely still on the table; we just need someone who can volunteer to find them.

The first performance cliff (introduced in 4.51.1) is due to a change in the distribution of randomly generated data, which I assume was done for good reasons. Is it reasonable for me to assume that this isn't something I should expect to change?

I think we are likely to keep this, yes. When generating test cases there's always a tradeoff between quality (per-test-case probability of finding a bug) and speed (test-case budget). In this case we can now generate data with correlations or repeats which would have been astronomically unlikely before, so we're basically always trading off some speed for a qualitative improvement in certain tests. But that does assume that we can make it faster again without losing the benefits!

Are the performance problems I'm seeing being exacerbated by how my strategy is coded? Could I be mitigating some of them from my end?

Maybe to both, but it's hard to tell if you can't share your strategy code 😉

Zalathar (Contributor) commented:

I'm not sure what to make of the decrease that was already visible before 5.5.2 (compared with 4.55.3), but it appears that a good fraction of the slowdown from the second performance cliff (introduced in 4.55.3) has been mitigated.

Is there more ground that can be gained on that front?

Probably!

The code we use to generate duplicate values hasn't been designed or tuned with speed in mind. I took care of some very-low-hanging fruit (and got a big boost), but what's left seems like it should still have room for noticeable improvements.

(The caveat is that this will involve hard work, so I can't make any promises about whether or when it will happen.)

Also, if I understand correctly, the first performance cliff (introduced in 4.51.1) is due to a change in the distribution of randomly generated data, which I assume was done for good reasons. Is it reasonable for me to assume that this isn't something I should expect to change?

Hmm, we probably need @DRMacIver's opinion on this one.

The impression I get is that #2263 was mostly about reducing code complexity, and the changes to the distribution might be larger than intended.

Are the performance problems I'm seeing being exacerbated by how my strategy is coded? Could I be mitigating some of them from my end?

I didn't see anything obviously bad in your sample code.

rsokl (Contributor) commented Feb 14, 2020

EDIT: I am re-running these. I realized that background processes started running during the timings, which may have introduced misleading structure into the sequence of timings.

rsokl (Contributor) commented Feb 14, 2020

Here are the timings for mygrad's test suite.
I am unable to revert to versions prior to 4.53.2 because my test suite has that as a hard minimum version, but I do believe there was a substantial slowdown in 4.51.1 (as reported above).

[Chart: mygrad test suite timings across Hypothesis versions]

Zalathar (Contributor) commented:

5.8.2 (#2390) contains another speedup that should be an improvement here.

Zac-HD (Member) commented Nov 17, 2020

It looks like we've recovered most of the performance lost in those versions, with the remainder attributable to intended changes to the distribution of inputs.

We are of course very happy to accept performance-tuning PRs, including those informed by benchmarking against previous versions, but it seems that nobody is actively working on this, and in the absence of specific plans I'm closing the issue.
