
Smart example selection based on coverage #880

Merged 57 commits into master on Sep 27, 2017

Conversation

DRMacIver (Member) commented Sep 20, 2017

This significantly ups Hypothesis's data generation game based on coverage information.

The basic idea is that we should look where we haven't before. This uses the coverage information to formalise that idea, and implements a randomized greedy algorithm to select mutation seeds that cover some previously underexplored area.
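To make the shape of that idea concrete, here is a minimal sketch of a randomized greedy selector over coverage tags. It is not the PR's actual TargetSelector - names like Seed and pick_mutation_seed are hypothetical, and the real implementation does much more bookkeeping - but it shows the core move: find an under-witnessed coverage target, then mutate an example that hit it.

```python
import random
from collections import defaultdict


class Seed(object):
    """A previously-run example, remembered with the coverage tags it hit."""

    def __init__(self, buffer, tags):
        self.buffer = buffer          # bytes that produced this example
        self.tags = frozenset(tags)   # coverage targets exhibited by this example


def pick_mutation_seed(seeds, rng=random):
    """Randomized greedy choice: favour seeds exhibiting a rarely-seen target."""
    by_tag = defaultdict(list)
    for seed in seeds:
        for tag in seed.tags:
            by_tag[tag].append(seed)
    if not by_tag:
        # No coverage information at all: fall back to a uniform choice.
        return rng.choice(seeds)
    # Pick one of the least-witnessed coverage targets (ties broken randomly),
    # then mutate a random example that exhibits it.
    rarest = min(by_tag, key=lambda tag: (len(by_tag[tag]), rng.random()))
    return rng.choice(by_tag[rarest])
```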

It also updates the default max_examples to be lower - in light of #879, we should do something to offset the now-slower tests, and I am reasonably confident at this point that we can drop the number of tests by half without a reduction in quality.

(This is particularly true because it brings Hypothesis into line with normal defaults - most other property-based testing libraries are only running 200 examples anyway).

It also removes branch tracking from our use of coverage - our example selection uses both positive and negative coverage targets, which should give us the same functionality as tracking arcs but with significantly faster tracing.
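For a rough picture of what "positive and negative coverage targets" means here, the sketch below derives both kinds of tag from plain line coverage. The function name and data shapes are illustrative only, not the PR's actual internals.

```python
def coverage_tags(covered_lines, all_seen_lines):
    """Turn raw line coverage into tags, including negated targets.

    covered_lines: set of (filename, lineno) pairs hit by this test run.
    all_seen_lines: set of (filename, lineno) pairs hit by any run so far.
    """
    positive = {("line", f, n) for (f, n) in covered_lines}
    # Negated targets: lines we know exist but that this run did not hit.  An
    # example that skips one side of an `if` gets a distinct tag for doing so,
    # which recovers much of what arc (branch) tracking would have told us.
    negative = {("not-line", f, n) for (f, n) in all_seen_lines - covered_lines}
    return positive | negative
```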

FAO @agroce - this is the algorithm I emailed you about, with a bunch of updates.

Zac-HD (Member) commented Sep 21, 2017

  • I really like the TargetSelector docstring - it explains the logic beautifully.
  • I still think branch coverage is better than line coverage. What's the speedup actually like for this? Either way, we should probably give the coverage setting branch/line/none options rather than forcing a particular type.
  • What happens for, e.g., C extensions that we can't get coverage information for?

DRMacIver (Member Author)

I still think branch coverage is better than line coverage.

It is. I was thinking that we weren't getting much benefit from it because the negated line tracking gives us mostly similar functionality, but this actually isn't true, so I've added it back in.

(For example discovery it's mostly similar, but without it, people tracing branch coverage in an enclosing Coverage instance wouldn't get that data, and there are some other differences.)

What's the speedup actually like for this?

Hard to say. Depends on the workload. Can be pretty significant, though I'm working on some patches to coverage to improve performance that might mostly kill the difference if we're lucky anyway.

Either way, we should probably give the coverage setting branch/line/none options rather than forcing a particular type.

Why? Forcing a particular type seems like exactly what we should do - the Coverage instance is under our control, so we should tune it to get the results we want (and the default behaviour isn't anything clever; it's just line coverage).
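As a sketch of what "Coverage is under our control, so we should tune it" can look like in practice - the exact options are illustrative, and run_one_test_case is a hypothetical stand-in for the measured work:

```python
import coverage

cov = coverage.Coverage(
    data_file=None,      # keep measurement in memory rather than writing .coverage
    config_file=False,   # ignore any user .coveragerc; we choose the settings
    branch=True,         # collect arc information as well as line information
)
cov.start()
try:
    run_one_test_case()  # hypothetical: whatever work we want traced
finally:
    cov.stop()

data = cov.get_data()
for filename in data.measured_files():
    print(filename, sorted(data.lines(filename) or []))
```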

What happens for eg C extensions that we can't get coverage information for?

You're mostly SOL. But we have a universal catch-all tag, which means that in the limit where we get no coverage information at all, this algorithm degrades to something approximately the same as the old one.

In theory we could use something like https://clang.llvm.org/docs/SanitizerCoverage.html to support tracing inside C extensions too - the logic is fully generic - but I think that's something to worry about later. :-)

Zac-HD (Member) commented Sep 21, 2017

Why? Forcing a particular type seems like exactly what we should do

Some combination of allowing users to force branch coverage even if it's expensive, and not having confirmed that running in branch mode would interact well with users getting line coverage reports. The revert and some digging into the coverage source have reassured me on both counts.

in the limit where we get no coverage information at all this algorithm degrades to something that's approximately the same as the old one.

👍

DRMacIver (Member Author)

not having confirmed that running in branch mode would interact well with users getting line coverage reports

Note that this is tested: the timid=True versions of the tests in test_coverage.py cover that case.

(though now that I think about it we don't test that users do get branch information! I'll fix that)

Zac-HD (Member) commented Sep 21, 2017

Hmm, if the tests are non-obvious it might be worth adding a docstring or at least comments explaining what each is actually testing.

DRMacIver (Member Author)

Hmm, if the tests are non-obvious it might be worth adding a docstring or at least comments explaining what each is actually testing.

I don't think the tests are non-obvious, TBH - they're pretty explicit about what they're doing! (However, I said the wrong thing above: it's branch=True/False, not timid=True/False, that affects whether branch coverage is being used in the enclosing Coverage instance. timid switches between the C and Python tracers.)
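For anyone skimming: branch and timid are independent coverage.py constructor arguments, roughly as sketched below.

```python
import coverage

# branch controls *what* is measured: line coverage only, or lines plus
# branch (arc) coverage.
line_only = coverage.Coverage(branch=False)
with_branches = coverage.Coverage(branch=True)

# timid controls *how* it is measured: timid=True forces the pure-Python
# tracer, timid=False allows the faster C tracer when it is available.
python_tracer = coverage.Coverage(timid=True)
c_tracer = coverage.Coverage(timid=False)
```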

DRMacIver (Member Author)

The test failure on CI is weird. It reproduces reliably on CI but not when I run just that one test file. It's also a very strange test to be failing if nothing else is - AFAICT sometimes it's just not running the shrinker, but I have no idea at all why that might be. 😞

DRMacIver (Member Author)

Ugh, right, yes, because we're no longer assigning last_data in mutation. FFS. I'm astonished and a little concerned that so many tests passed.

More evidence that last_data must die, but for now I've just removed this usage of it.

DRMacIver (Member Author)

For posterity, I have just rebuilt the benchmark data and this is the result:

Different means for ints-valid=always-interesting=never: 3200.00 -> 1600.00. p=0.00000
Different means for ints-valid=always-interesting=size_lower_bound: 2222.88 -> 1114.22. p=0.00000
Different means for ints-valid=usually-interesting=size_lower_bound: 2525.70 -> 1228.54. p=0.00000
Different means for intlists-valid=always-interesting=never: 16613.63 -> 7474.76. p=0.00000
Different means for intlists-valid=always-interesting=lower_bound: 636.42 -> 751.96. p=0.00000
Different means for sizedintlists-valid=always-interesting=never: 13373.43 -> 6395.73. p=0.00000
Different means for sizedintlists-valid=always-interesting=lower_bound: 12765.75 -> 6239.59. p=0.00000
Different means for text-valid=always-interesting=never: 4692.06 -> 2125.17. p=0.00000
Different means for text-valid=always-interesting=lower_bound: 185.67 -> 212.57. p=0.00000
Different means for text5-valid=always-interesting=never: 11073.66 -> 4721.25. p=0.00000
Different means for text5-valid=always-interesting=lower_bound: 11018.45 -> 4713.03. p=0.00000
Different means for arrays10-valid=always-interesting=never: 2436.41 -> 1060.62. p=0.00000
Different means for arrays10-valid=always-interesting=lower_bound: 206.90 -> 244.44. p=0.00000
Different means for arrays10-valid=always-interesting=size_lower_bound: 9196.01 -> 12214.55. p=0.00000
Different means for arraysvar-valid=always-interesting=never: 1877.05 -> 835.11. p=0.00000
Different means for arraysvar-valid=always-interesting=lower_bound: 1758.11 -> 788.40. p=0.00000
Different means for arraysvar-valid=always-interesting=size_lower_bound: 11488.94 -> 8607.71. p=0.00000
Different means for arraysvar-valid=usually-interesting=size_lower_bound: 12978.72 -> 9130.25. p=0.00000
Different means for intlists-valid=always-interesting=minsum: 1445.76 -> 1323.22. p=0.00000
Different means for intlists-valid=always-interesting=has_duplicates: 2191.79 -> 2894.26. p=0.00000
Different means for intlists-valid=has_duplicates-interesting=minsum: 6924.44 -> 8593.41. p=0.00000
Different means for arrays10-valid=always-interesting=array_average: 2371.94 -> 1036.38. p=0.00000
Different means for arraysvar-valid=always-interesting=array_average: 1835.96 -> 817.12. p=0.00000
Different means for arrays10-valid=usually-interesting=array_average: 2548.95 -> 1131.49. p=0.00000
Different means for arraysvar-valid=usually-interesting=array_average: 1958.75 -> 904.74. p=0.00000

DRMacIver (Member Author)

OK. I haven't seen a full green build yet, but I think this is probably ready for review now.

DRMacIver (Member Author)

OK. It does indeed look like this test has become flaky on pypy. I'll look into why tomorrow. I still think this is mostly ready for review though.

Zac-HD (Member) left a comment

With a few little nitpicks, this looks good to me.

I'm not confident I'd spot any subtle problems in this kind of change to the internals, but at least there are no really obvious problems 😌

)
data.tags = self.tag_intern_table.setdefault(tags, tags)

if data.status == Status.VALID:
Zac-HD (Member)

I don't follow why this changed from >= to == - is this just a matter of style and future proofing, or has the Status enum actually changed?

DRMacIver (Member Author)

I confess I don't remember for sure. It's not a question of the enum having changed though - the idea would be to exclude interesting examples from this bit.

Attempting to backfill the reasoning: I think the idea is that we don't actually want to count interesting examples in our coverage tracking, because interesting examples may hit weird and surprising corners of coverage by virtue of their error, so they don't work well as representative examples.
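A small sketch of the distinction being made - the Status values mirror Hypothesis's internal enum, but the helper here is purely illustrative:

```python
from enum import IntEnum


class Status(IntEnum):
    OVERRUN = 0
    INVALID = 1
    VALID = 2
    INTERESTING = 3


def should_track_coverage(status):
    # `== Status.VALID` deliberately excludes INTERESTING examples: a failing
    # example may reach unusual code purely because of its error, so it makes
    # a poor representative of the coverage targets it happens to hit.
    return status == Status.VALID
```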

multiple lines in a conditional so that either all or none of them will
be covered in a conditional) they are naturally deduplicated.
2. Popular coverage targets will largely be ignored for considering what
test ot run - if every example exhibits a coverage target, picking an
Zac-HD (Member)

"test ot run"

RELEASE.rst Outdated

1. It significantly improves Hypothesis's ability to use coverage information
to find interesting examples.
2. It reduces the default ``max_examples`` setting to 100. This is intended to
Zac-HD (Member)

On the basis that most readers know very little about Hypothesis internals, I'd add some context and take out the caveat:

It reduces the default max_examples setting from 200 to 100. This offsets ...

DRMacIver (Member Author)

I'm not confident I'd spot any subtle problems in this kind of change to the internals

TBH, this change has somewhat reduced my confidence in our build's ability to detect subtle problems too. A scary amount of our test suite passed with the flaky-test logic being totally broken, and I spotted that my previous discovery_queue logic was egregiously wrong (it queued every example in proportion to the amount of coverage it got, rather than once).

The nice thing about having released egregiously wrong example-selection logic is that now we know our users don't notice such problems either. 😉 The less nice thing is that it turns out that maybe a lot of the current logic isn't actually doing very much. The mixed thing is "Hey, at least this PR can't make it any worse...?". But this needs some thinking about.

DRMacIver (Member Author)

I ended up having to make a change to the algorithm to always try a fully zeroed example first and wow did that make a big difference to the benchmarks:

Different means for ints-valid=always-interesting=never: 3200.00 -> 1600.00. p=0.00000
Different means for ints-valid=always-interesting=always: 48.00 -> 32.00. p=0.00000
Different means for ints-valid=always-interesting=usually: 52.51 -> 38.70. p=0.00000
Different means for ints-valid=always-interesting=size_lower_bound: 2222.88 -> 1084.13. p=0.00000
Different means for ints-valid=usually-interesting=size_lower_bound: 2525.70 -> 1230.80. p=0.00000
Different means for intlists-valid=always-interesting=never: 16613.63 -> 8641.70. p=0.00000
Different means for intlists-valid=always-interesting=always: 180.62 -> 2.00. p=0.00000
Different means for intlists-valid=always-interesting=usually: 186.74 -> 30.12. p=0.00000
Different means for sizedintlists-valid=always-interesting=never: 13373.43 -> 6261.73. p=0.00000
Different means for sizedintlists-valid=always-interesting=always: 161.86 -> 2.00. p=0.00000
Different means for sizedintlists-valid=always-interesting=usually: 311.50 -> 120.77. p=0.00000
Different means for sizedintlists-valid=always-interesting=lower_bound: 12765.75 -> 6214.69. p=0.00000
Different means for text-valid=always-interesting=never: 4692.06 -> 2339.28. p=0.00000
Different means for text-valid=always-interesting=always: 57.44 -> 2.00. p=0.00000
Different means for text-valid=always-interesting=usually: 58.93 -> 8.09. p=0.00000
Different means for text-valid=always-interesting=lower_bound: 185.67 -> 217.54. p=0.00000
Different means for text5-valid=always-interesting=never: 11073.66 -> 5312.99. p=0.00000
Different means for text5-valid=always-interesting=always: 129.19 -> 459.00. p=0.00000
Different means for text5-valid=always-interesting=usually: 287.47 -> 571.32. p=0.00000
Different means for text5-valid=always-interesting=lower_bound: 11018.45 -> 5262.69. p=0.00000
Different means for arrays10-valid=always-interesting=never: 2436.41 -> 1133.34. p=0.00000
Different means for arrays10-valid=always-interesting=always: 38.16 -> 6.00. p=0.00000
Different means for arrays10-valid=always-interesting=usually: 40.14 -> 11.35. p=0.00000
Different means for arrays10-valid=always-interesting=lower_bound: 206.90 -> 242.84. p=0.00000
Different means for arrays10-valid=always-interesting=size_lower_bound: 9196.01 -> 12486.45. p=0.00000
Different means for arraysvar-valid=always-interesting=never: 1877.05 -> 871.30. p=0.00000
Different means for arraysvar-valid=always-interesting=always: 25.83 -> 2.00. p=0.00000
Different means for arraysvar-valid=always-interesting=usually: 27.78 -> 7.27. p=0.00000
Different means for arraysvar-valid=always-interesting=lower_bound: 1758.11 -> 822.96. p=0.00000
Different means for arraysvar-valid=always-interesting=size_lower_bound: 11488.94 -> 9187.57. p=0.00000
Different means for arraysvar-valid=usually-interesting=size_lower_bound: 12978.72 -> 10343.95. p=0.00000
Different means for intlists-valid=always-interesting=minsum: 1445.76 -> 1322.71. p=0.00000
Different means for intlists-valid=always-interesting=has_duplicates: 2191.79 -> 2913.62. p=0.00000
Different means for intlists-valid=has_duplicates-interesting=minsum: 6924.44 -> 7759.47. p=0.00000
Different means for arrays10-valid=always-interesting=array_average: 2371.94 -> 1098.88. p=0.00000
Different means for arraysvar-valid=always-interesting=array_average: 1835.96 -> 850.91. p=0.00000
Different means for arrays10-valid=usually-interesting=array_average: 2548.95 -> 1193.01. p=0.00000
Different means for arraysvar-valid=usually-interesting=array_average: 1958.75 -> 943.73. p=0.00000

This is from master, not on top of my previous change.
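The "fully zeroed example first" change can be pictured roughly as below; the helper names and buffer size are hypothetical, not the engine's actual entry points.

```python
BUFFER_SIZE = 8 * 1024  # illustrative size only


def generate_examples(test_function, random_buffer):
    # Hypothesis draws all data from a byte buffer, and shrinking moves towards
    # buffers that are shorter and lexicographically smaller, so the all-zero
    # buffer is the simplest possible input.  Trying it before anything else
    # means trivially-failing properties are detected almost immediately.
    yield test_function(bytes(BUFFER_SIZE))   # the fully zeroed example
    while True:
        yield test_function(random_buffer())  # then the usual generation loop
```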

@DRMacIver DRMacIver merged commit 9120b33 into master Sep 27, 2017
@DRMacIver DRMacIver deleted the DRMacIver/coverage-direction branch September 27, 2017 08:52