
Smart example selection based on coverage #880

Merged 57 commits into master on Sep 27, 2017

Conversation

DRMacIver (Member) commented Sep 20, 2017

This significantly ups Hypothesis's data generation game based on coverage information.

The basic idea is that we should look where we haven't before. This uses the coverage information to formalise that idea, and implements a randomized greedy algorithm to select mutation seeds that cover some previously underexplored area.
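To make the shape of that idea concrete, here is a minimal sketch of a randomized greedy selector over coverage tags. It is not the PR's actual TargetSelector - names like Seed and pick_mutation_seed are hypothetical, and the real implementation does much more bookkeeping - but it shows the core move: find an under-witnessed coverage target, then mutate an example that hit it.

```python
import random
from collections import defaultdict


class Seed(object):
    """A previously-run example, remembered with the coverage tags it hit."""

    def __init__(self, buffer, tags):
        self.buffer = buffer          # bytes that produced this example
        self.tags = frozenset(tags)   # coverage targets exhibited by this example


def pick_mutation_seed(seeds, rng=random):
    """Randomized greedy choice: favour seeds exhibiting a rarely-seen target."""
    by_tag = defaultdict(list)
    for seed in seeds:
        for tag in seed.tags:
            by_tag[tag].append(seed)
    if not by_tag:
        # No coverage information at all: fall back to a uniform choice.
        return rng.choice(seeds)
    # Pick one of the least-witnessed coverage targets (ties broken randomly),
    # then mutate a random example that exhibits it.
    rarest = min(by_tag, key=lambda tag: (len(by_tag[tag]), rng.random()))
    return rng.choice(by_tag[rarest])
```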

It also updates the default max_examples to be lower - in light of #879, we should do something to offset the now-slower tests, and I am reasonably confident at this point that we can drop the number of tests by half without a reduction in quality.

(This is particularly true because it brings Hypothesis into line with normal defaults - most other property-based testing libraries are only running 200 examples anyway).

It also removes branch tracking from our use of coverage - our example selection uses both positive and negative coverage targets, which should give us the same functionality as tracking arcs but with significantly faster tracing.
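For a rough picture of what "positive and negative coverage targets" means here, the sketch below derives both kinds of tag from plain line coverage. The function name and data shapes are illustrative only, not the PR's actual internals.

```python
def coverage_tags(covered_lines, all_seen_lines):
    """Turn raw line coverage into tags, including negated targets.

    covered_lines: set of (filename, lineno) pairs hit by this test run.
    all_seen_lines: set of (filename, lineno) pairs hit by any run so far.
    """
    positive = {("line", f, n) for (f, n) in covered_lines}
    # Negated targets: lines we know exist but that this run did not hit.  An
    # example that skips one side of an `if` gets a distinct tag for doing so,
    # which recovers much of what arc (branch) tracking would have told us.
    negative = {("not-line", f, n) for (f, n) in all_seen_lines - covered_lines}
    return positive | negative
```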

FAO @agroce - this is the algorithm I emailed you about, with a bunch of updates.

Zac-HD (Member) commented Sep 21, 2017

  • I really like the TargetSelector docstring - it explains the logic beautifully.
  • I still think branch coverage is better than line coverage. What's the speedup actually like for this? Either way, we should probably give the coverage setting branch/line/none options rather than forcing a particular type.
  • What happens for, e.g., C extensions that we can't get coverage information for?

DRMacIver (Member Author)

I still think branch coverage is better than line coverage.

It is. I was thinking that we weren't getting much benefit from it because the negated line tracking gives us mostly similar functionality, but this actually isn't true, so I've added it back in.

(For example discovery it's mostly similar, but without it, people tracing branch coverage in an enclosing Coverage instance wouldn't get that data, and there are some other differences.)

What's the speedup actually like for this?

Hard to say. Depends on the workload. Can be pretty significant, though I'm working on some patches to coverage to improve performance that might mostly kill the difference if we're lucky anyway.

Either way, we should probably give the coverage setting branch/line/none options rather than forcing a particular type.

Why? Forcing a particular type seems like exactly what we should do - the Coverage instance is under our control, so we should tune it to get the results we want (and the default behaviour isn't anything clever; it's just line coverage).
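As a sketch of what "Coverage is under our control, so we should tune it" can look like in practice - the exact options are illustrative, and run_one_test_case is a hypothetical stand-in for the measured work:

```python
import coverage

cov = coverage.Coverage(
    data_file=None,      # keep measurement in memory rather than writing .coverage
    config_file=False,   # ignore any user .coveragerc; we choose the settings
    branch=True,         # collect arc information as well as line information
)
cov.start()
try:
    run_one_test_case()  # hypothetical: whatever work we want traced
finally:
    cov.stop()

data = cov.get_data()
for filename in data.measured_files():
    print(filename, sorted(data.lines(filename) or []))
```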

What happens for eg C extensions that we can't get coverage information for?

You're mostly SOL. But we have a universal catch-all tag, which means that in the limit where we get no coverage information at all, this algorithm degrades to something approximately the same as the old one.

In theory we could use something like https://clang.llvm.org/docs/SanitizerCoverage.html to support tracing inside C extensions too - the logic is fully generic - but I think that's something to worry about later. :-)

Zac-HD (Member) commented Sep 21, 2017

Why? Forcing a particular type seems like exactly what we should do

Some combination of allowing users to force branch coverage even if it's expensive, and not having confirmed that running in branch mode would interact well with users getting line coverage reports. The revert and some digging into the coverage source have reassured me on both counts.

in the limit where we get no coverage information at all this algorithm degrades to something that's approximately the same as the old one.

👍

DRMacIver (Member Author)

not having confirmed that running in branch mode would interact well with users getting line coverage reports

Note that this is tested: the timid=True versions of the tests in test_coverage.py cover that case.

(though now that I think about it we don't test that users do get branch information! I'll fix that)

Zac-HD (Member) commented Sep 21, 2017

Hmm, if the tests are non-obvious it might be worth adding a docstring or at least comments explaining what each is actually testing.

DRMacIver (Member Author)

Hmm, if the tests are non-obvious it might be worth adding a docstring or at least comments explaining what each is actually testing.

I don't think the tests are non-obvious, TBH - they're pretty explicit about what they're doing! (However, I said the wrong thing above: it's branch=True/False, not timid=True/False, that affects whether branch coverage is being used in the enclosing Coverage instance. timid switches between the C and Python tracers.)
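For anyone skimming: branch and timid are independent coverage.py constructor arguments, roughly as sketched below.

```python
import coverage

# branch controls *what* is measured: line coverage only, or lines plus
# branch (arc) coverage.
line_only = coverage.Coverage(branch=False)
with_branches = coverage.Coverage(branch=True)

# timid controls *how* it is measured: timid=True forces the pure-Python
# tracer, timid=False allows the faster C tracer when it is available.
python_tracer = coverage.Coverage(timid=True)
c_tracer = coverage.Coverage(timid=False)
```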

DRMacIver (Member Author)

The test failure on CI is weird. It reproduces reliably on CI but not when I run just that one test file. It's also a very strange test to be failing if nothing else is - AFAICT sometimes it's just not running the shrinker, but I have no idea at all why that might be. 😞

DRMacIver (Member Author)

Ugh, right, yes, because we're no longer assigning last_data in mutation. FFS. I'm astonished and a little concerned that so many tests passed.

More evidence that last_data must die, but for now I've just removed this usage of it.

DRMacIver (Member Author)

For posterity, I have just rebuilt the benchmark data and this is the result:

Different means for ints-valid=always-interesting=never: 3200.00 -> 1600.00. p=0.00000
Different means for ints-valid=always-interesting=size_lower_bound: 2222.88 -> 1114.22. p=0.00000
Different means for ints-valid=usually-interesting=size_lower_bound: 2525.70 -> 1228.54. p=0.00000
Different means for intlists-valid=always-interesting=never: 16613.63 -> 7474.76. p=0.00000
Different means for intlists-valid=always-interesting=lower_bound: 636.42 -> 751.96. p=0.00000
Different means for sizedintlists-valid=always-interesting=never: 13373.43 -> 6395.73. p=0.00000
Different means for sizedintlists-valid=always-interesting=lower_bound: 12765.75 -> 6239.59. p=0.00000
Different means for text-valid=always-interesting=never: 4692.06 -> 2125.17. p=0.00000
Different means for text-valid=always-interesting=lower_bound: 185.67 -> 212.57. p=0.00000
Different means for text5-valid=always-interesting=never: 11073.66 -> 4721.25. p=0.00000
Different means for text5-valid=always-interesting=lower_bound: 11018.45 -> 4713.03. p=0.00000
Different means for arrays10-valid=always-interesting=never: 2436.41 -> 1060.62. p=0.00000
Different means for arrays10-valid=always-interesting=lower_bound: 206.90 -> 244.44. p=0.00000
Different means for arrays10-valid=always-interesting=size_lower_bound: 9196.01 -> 12214.55. p=0.00000
Different means for arraysvar-valid=always-interesting=never: 1877.05 -> 835.11. p=0.00000
Different means for arraysvar-valid=always-interesting=lower_bound: 1758.11 -> 788.40. p=0.00000
Different means for arraysvar-valid=always-interesting=size_lower_bound: 11488.94 -> 8607.71. p=0.00000
Different means for arraysvar-valid=usually-interesting=size_lower_bound: 12978.72 -> 9130.25. p=0.00000
Different means for intlists-valid=always-interesting=minsum: 1445.76 -> 1323.22. p=0.00000
Different means for intlists-valid=always-interesting=has_duplicates: 2191.79 -> 2894.26. p=0.00000
Different means for intlists-valid=has_duplicates-interesting=minsum: 6924.44 -> 8593.41. p=0.00000
Different means for arrays10-valid=always-interesting=array_average: 2371.94 -> 1036.38. p=0.00000
Different means for arraysvar-valid=always-interesting=array_average: 1835.96 -> 817.12. p=0.00000
Different means for arrays10-valid=usually-interesting=array_average: 2548.95 -> 1131.49. p=0.00000
Different means for arraysvar-valid=usually-interesting=array_average: 1958.75 -> 904.74. p=0.00000

DRMacIver (Member Author)

OK. I haven't seen a full green build yet, but I think this is probably ready for review now.

DRMacIver (Member Author)

OK. It does indeed look like this test has become flaky on pypy. I'll look into why tomorrow. I still think this is mostly ready for review though.

Zac-HD (Member) left a comment

With a few little nitpicks, this looks good to me.

I'm not confident I'd spot any subtle problems in this kind of change to the internals, but at least there are no really obvious problems 😌

)
data.tags = self.tag_intern_table.setdefault(tags, tags)

if data.status == Status.VALID:
Zac-HD (Member)

I don't follow why this changed from >= to == - is this just a matter of style and future proofing, or has the Status enum actually changed?

DRMacIver (Member Author)

I confess I don't remember for sure. It's not a question of the enum having changed though - the idea would be to exclude interesting examples from this bit.

Attempting to backfill the reasoning: I think the idea is that we don't actually want to count interesting examples in our coverage tracking, because interesting examples may hit weird and surprising corners of coverage by virtue of their error, so they don't work well as representative examples.
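A small sketch of the distinction being made - the Status values mirror Hypothesis's internal enum, but the helper here is purely illustrative:

```python
from enum import IntEnum


class Status(IntEnum):
    OVERRUN = 0
    INVALID = 1
    VALID = 2
    INTERESTING = 3


def should_track_coverage(status):
    # `== Status.VALID` deliberately excludes INTERESTING examples: a failing
    # example may reach unusual code purely because of its error, so it makes
    # a poor representative of the coverage targets it happens to hit.
    return status == Status.VALID
```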

multiple lines in a conditional so that either all or none of them will
be covered in a conditional) they are naturally deduplicated.
2. Popular coverage targets will largely be ignored for considering what
test ot run - if every example exhibits a coverage target, picking an
Zac-HD (Member)

"test ot run"

RELEASE.rst Outdated

1. It significantly improves Hypothesis's ability to use coverage information
to find interesting examples.
2. It reduces the default ``max_examples`` setting to 100. This is intended to
Zac-HD (Member)

On the basis that most readers know very little about Hypothesis internals, I'd add some context and take out the caveat:

It reduces the default max_examples setting from 200 to 100. This offsets ...

DRMacIver (Member Author)

I'm not confident I'd spot any subtle problems in this kind of change to the internals

TBH, this change has somewhat reduced my confidence in our build's ability to detect subtle problems too. A scary amount of our test suite passed with the flaky-test logic being totally broken, and I spotted that my previous discovery_queue logic was egregiously wrong (it queued every example in proportion to the amount of coverage it got, rather than once).

The nice thing about having released egregiously wrong example-selection logic is that now we know our users don't notice such problems either. 😉 The less nice thing is that it turns out that maybe a lot of the current logic isn't actually doing very much. The mixed thing is "Hey, at least this PR can't make it any worse...?". But this needs some thinking about.

DRMacIver (Member Author)

I ended up having to make a change to the algorithm to always try a fully zeroed example first and wow did that make a big difference to the benchmarks:

Different means for ints-valid=always-interesting=never: 3200.00 -> 1600.00. p=0.00000
Different means for ints-valid=always-interesting=always: 48.00 -> 32.00. p=0.00000
Different means for ints-valid=always-interesting=usually: 52.51 -> 38.70. p=0.00000
Different means for ints-valid=always-interesting=size_lower_bound: 2222.88 -> 1084.13. p=0.00000
Different means for ints-valid=usually-interesting=size_lower_bound: 2525.70 -> 1230.80. p=0.00000
Different means for intlists-valid=always-interesting=never: 16613.63 -> 8641.70. p=0.00000
Different means for intlists-valid=always-interesting=always: 180.62 -> 2.00. p=0.00000
Different means for intlists-valid=always-interesting=usually: 186.74 -> 30.12. p=0.00000
Different means for sizedintlists-valid=always-interesting=never: 13373.43 -> 6261.73. p=0.00000
Different means for sizedintlists-valid=always-interesting=always: 161.86 -> 2.00. p=0.00000
Different means for sizedintlists-valid=always-interesting=usually: 311.50 -> 120.77. p=0.00000
Different means for sizedintlists-valid=always-interesting=lower_bound: 12765.75 -> 6214.69. p=0.00000
Different means for text-valid=always-interesting=never: 4692.06 -> 2339.28. p=0.00000
Different means for text-valid=always-interesting=always: 57.44 -> 2.00. p=0.00000
Different means for text-valid=always-interesting=usually: 58.93 -> 8.09. p=0.00000
Different means for text-valid=always-interesting=lower_bound: 185.67 -> 217.54. p=0.00000
Different means for text5-valid=always-interesting=never: 11073.66 -> 5312.99. p=0.00000
Different means for text5-valid=always-interesting=always: 129.19 -> 459.00. p=0.00000
Different means for text5-valid=always-interesting=usually: 287.47 -> 571.32. p=0.00000
Different means for text5-valid=always-interesting=lower_bound: 11018.45 -> 5262.69. p=0.00000
Different means for arrays10-valid=always-interesting=never: 2436.41 -> 1133.34. p=0.00000
Different means for arrays10-valid=always-interesting=always: 38.16 -> 6.00. p=0.00000
Different means for arrays10-valid=always-interesting=usually: 40.14 -> 11.35. p=0.00000
Different means for arrays10-valid=always-interesting=lower_bound: 206.90 -> 242.84. p=0.00000
Different means for arrays10-valid=always-interesting=size_lower_bound: 9196.01 -> 12486.45. p=0.00000
Different means for arraysvar-valid=always-interesting=never: 1877.05 -> 871.30. p=0.00000
Different means for arraysvar-valid=always-interesting=always: 25.83 -> 2.00. p=0.00000
Different means for arraysvar-valid=always-interesting=usually: 27.78 -> 7.27. p=0.00000
Different means for arraysvar-valid=always-interesting=lower_bound: 1758.11 -> 822.96. p=0.00000
Different means for arraysvar-valid=always-interesting=size_lower_bound: 11488.94 -> 9187.57. p=0.00000
Different means for arraysvar-valid=usually-interesting=size_lower_bound: 12978.72 -> 10343.95. p=0.00000
Different means for intlists-valid=always-interesting=minsum: 1445.76 -> 1322.71. p=0.00000
Different means for intlists-valid=always-interesting=has_duplicates: 2191.79 -> 2913.62. p=0.00000
Different means for intlists-valid=has_duplicates-interesting=minsum: 6924.44 -> 7759.47. p=0.00000
Different means for arrays10-valid=always-interesting=array_average: 2371.94 -> 1098.88. p=0.00000
Different means for arraysvar-valid=always-interesting=array_average: 1835.96 -> 850.91. p=0.00000
Different means for arrays10-valid=usually-interesting=array_average: 2548.95 -> 1193.01. p=0.00000
Different means for arraysvar-valid=usually-interesting=array_average: 1958.75 -> 943.73. p=0.00000

This is from master, not on top of my previous change.
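The "fully zeroed example first" change can be pictured roughly as below; the helper names and buffer size are hypothetical, not the engine's actual entry points.

```python
BUFFER_SIZE = 8 * 1024  # illustrative size only


def generate_examples(test_function, random_buffer):
    # Hypothesis draws all data from a byte buffer, and shrinking moves towards
    # buffers that are shorter and lexicographically smaller, so the all-zero
    # buffer is the simplest possible input.  Trying it before anything else
    # means trivially-failing properties are detected almost immediately.
    yield test_function(bytes(BUFFER_SIZE))   # the fully zeroed example
    while True:
        yield test_function(random_buffer())  # then the usual generation loop
```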

@DRMacIver DRMacIver merged commit 9120b33 into master Sep 27, 2017
@DRMacIver DRMacIver deleted the DRMacIver/coverage-direction branch September 27, 2017 08:52