Smart example selection based on coverage #880
Conversation
It is. I was thinking that we weren't getting much benefit from it because the negated line tracking gives us mostly similar functionality, but this actually isn't true, so I've added it back in. (For example discovery it's mostly similar, but it means that if people are tracing branch coverage in an enclosing Coverage instance they don't get the data, and there are some other differences.)
Hard to say. Depends on the workload. Can be pretty significant, though I'm working on some patches to coverage to improve performance that might mostly kill the difference if we're lucky anyway.
Why? Forcing a particular type seems like exactly what we should do - Coverage is under our control, so we should tune it to get the results we want (also the default behaviour isn't anything clever, it's just line coverage).
You're mostly SOL. In theory we could use something like https://clang.llvm.org/docs/SanitizerCoverage.html to support tracing inside C extensions too - the logic is fully generic - but I think that's something to worry about later. :-)
Some combination of allowing users to force branch coverage even if it's expensive, and not having confirmed that running in branch mode would interact well with users getting line coverage reports. The revert and digging into the coverage source have reassured me on both counts.
👍
Note that this is tested. The timid=True versions of the tests in test_coverage.py cover that case (though now that I think about it we don't test that users do get branch information! I'll fix that)
Hmm, if the tests are non-obvious it might be worth adding a docstring or at least comments explaining what each is actually testing.
I don't think the tests are non-obvious TBH. They're pretty explicit about what they're doing! (However I said the wrong thing above. It's branch=True/False not timid=True/False that affects whether branch coverage is being used in the enclosing Coverage instance. timid switches between the C and Python tracer)
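(For readers who don't have coverage.py's API in their head: a rough sketch of how those two constructor options differ - the helper below is illustrative, not our actual test code.)

```python
import coverage

def run_traced(fn, branch=False, timid=False):
    # branch=True records arcs (branch coverage) rather than just lines;
    # timid=True forces coverage.py's pure-Python tracer instead of the C one.
    cov = coverage.Coverage(branch=branch, timid=timid)
    cov.start()
    try:
        fn()
    finally:
        cov.stop()
    return cov.get_data()
```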
The test failure on CI is weird. It reproduces reliably on CI but not when I run just that one test file. It's also a very strange test to be failing if nothing else is - AFAICT sometimes it's just not running the shrinker, but I have no idea at all why that might be. 😞
Ugh, right, yes, because we're no longer assigning last_data in mutation. FFS. I'm astonished and a little concerned that so many tests passed. More evidence that last_data must die, but for now I've just removed this usage of it.
For posterity I have just rebuilt the benchmark data and this is the result:
OK. I haven't seen a full green build yet, but I think this is probably ready for review now. |
OK. It does indeed look like this test has become flaky on pypy. I'll look into why tomorrow. I still think this is mostly ready for review though.
With a few little nitpicks, this looks good to me.
I'm not confident I'd spot any subtle problems in this kind of change to the internals, but at least there are no really obvious problems 😌
)
data.tags = self.tag_intern_table.setdefault(tags, tags)

if data.status == Status.VALID:
I don't follow why this changed from >= to == - is this just a matter of style and future proofing, or has the Status enum actually changed?
I confess I don't remember for sure. It's not a question of the enum having changed though - the idea would be to exclude interesting examples from this bit.
Attempting to backfill the reasoning, I think the idea is that we don't actually want to count interesting examples for our coverage tracking, because interesting examples may hit weird and surprising corners of coverage by virtue of their error, so they don't work well as representative examples.
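(To make the >= vs == distinction concrete - this is an approximate reconstruction of the internal Status enum from memory, not copied from the diff:)

```python
from enum import IntEnum

class Status(IntEnum):   # approximate, for illustration only
    OVERRUN = 0          # ran out of input data
    INVALID = 1          # rejected, e.g. by a failed assume()
    VALID = 2            # completed normally
    INTERESTING = 3      # failed - the examples we actually report

# data.status >= Status.VALID  counts VALID *and* INTERESTING examples;
# data.status == Status.VALID  counts only VALID ones, so failing examples
# no longer feed the coverage bookkeeping.
```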
multiple lines in a conditional so that either all or none of them will
be covered in a conditional) they are naturally deduplicated.
2. Popular coverage targets will largely be ignored for considering what
test ot run - if every example exhibits a coverage target, picking an
"test ot run"
RELEASE.rst (Outdated)
1. It significantly improves Hypothesis's ability to use coverage information
to find interesting examples.
2. It reduces the default ``max_examples`` setting to 100. This is intended to
On the basis that most readers know very little about Hypothesis internals, I'd add some context and take out the caveat:
It reduces the default ``max_examples`` setting from 200 to 100. This offsets ...
TBH this change has somewhat reduced my confidence in our build's ability to detect subtle problems too. A scary amount of our test suite passed with the flaky-test logic being totally broken, and I only spotted that my previously released example selection logic was egregiously wrong while working on this. The nice thing about having released egregiously wrong example logic is that now we know our users don't notice it either. 😉 The less nice thing is that it turns out that maybe a lot of the current logic isn't actually doing very much. The mixed thing is "Hey, at least this PR can't make it any worse...?". But this needs some thinking about.
I ended up having to make a change to the algorithm to always try a fully zeroed example first and wow did that make a big difference to the benchmarks:
This is from master, not on top of my previous change.
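(A minimal sketch of the "zero example first" change - the function and constant below are made up for illustration, not the engine's real code:)

```python
def candidate_buffers(draw_buffer, n_examples, buffer_size=8 * 1024):
    """Yield the byte buffers to run: the fully zeroed buffer is always tried
    first, since it tends to give the simplest possible example, and only then
    do we fall back to normal (mutation / coverage-guided) generation."""
    yield bytes(buffer_size)          # all-zero buffer, tried unconditionally
    for _ in range(n_examples - 1):
        yield draw_buffer()           # whatever the usual generation produces
```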
This significantly ups Hypothesis's data generation game based on coverage information.
The basic idea is that we should look where we haven't before. This uses the coverage information to formalise that idea, and implements a randomized greedy algorithm to select mutation seeds that cover some previously underexplored area.
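Roughly, the selection step is a randomized greedy loop over coverage targets - something like the following sketch (names and structure here are illustrative, not the actual engine internals):

```python
import random
from collections import defaultdict

class SeedSelector:
    """Sketch: prefer mutating examples that hit rarely-seen coverage targets,
    so generation keeps being pushed towards underexplored parts of the code."""

    def __init__(self, rnd=None):
        self.rnd = rnd or random.Random()
        self.examples_by_target = defaultdict(list)  # target -> buffers that hit it
        self.hit_counts = defaultdict(int)           # target -> how often we've seen it

    def record(self, buffer, targets):
        for target in targets:
            self.hit_counts[target] += 1
            self.examples_by_target[target].append(buffer)

    def select_seed(self):
        # Greedy: pick the least-often-hit target (random tie-break), then a
        # random example known to reach it, and use that as the mutation seed.
        if not self.hit_counts:
            return None
        rarest = min(self.hit_counts,
                     key=lambda t: (self.hit_counts[t], self.rnd.random()))
        return self.rnd.choice(self.examples_by_target[rarest])
```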
It also updates the default `max_examples` to be lower - in the light of #879, we should do something to offset the now slower tests, and I am reasonably confident at this point that we can drop the number of tests by half without a reduction in quality. (This is particularly true because it brings Hypothesis into line with normal defaults - most other property-based testing libraries only run 100 examples anyway.)
It also removes branch tracking from our use of coverage - our example selection uses both positive and negative coverage targets, which should give us the same functionality as tracking arcs but with significantly faster tracing.
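(Sketching the positive/negative target idea - illustrative code, not the actual tracer hook:)

```python
def coverage_targets(lines_run, known_lines):
    """Turn plain line coverage into targets: one positive target per executed
    line, plus one negated target per known line that did *not* execute. Two
    runs that take different sides of a branch then yield different target
    sets, without paying for coverage.py's arc (branch) tracing."""
    ran = set(lines_run)
    positive = {("line", line) for line in ran}
    negative = {("not-line", line) for line in known_lines if line not in ran}
    return positive | negative
```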
FAO @agroce - this is the algorithm I emailed you about, with a bunch of updates.