
Discuss adding labels for fuzzed values #94

Closed
mgold opened this issue May 10, 2019 · 15 comments · Fixed by #187
Labels
Design Question (Needs design discussion) · fuzzers (Concerns randomness or simplifiers) · major release blocker (Issue to be resolved before next major version bump)

Comments

@mgold (Collaborator) commented May 10, 2019

There's a good discussion happening on Discourse about adding labels and coverage requirements to elm-test. Don't do what I did: watch the 50-minute talk linked in the first post before trying to say something intelligent.

There may be breaking changes that come out of the discussion, so I want to hold open the opportunity to make them.

@mgold added the Design Question, fuzzers, and major release blocker labels on May 10, 2019
@Janiczek (Collaborator) commented Jul 26, 2022

I'd like to revive this discussion, as I'd be happy to work on this and we're closing in on releasing 2.0.0.

Sorry for the pings in case they're unwanted; I'm basing them on a) who participated in the Discourse discussion, and b) who I usually use as my sounding board for testing/fuzzing ideas. In no particular order: @BrianHicks @gampleman @harrysarson @drathier @ianmackenzie @rtfeldman @jonathanjouty

There are, IMHO, a few separate ideas; I'll try to summarize them (paraphrasing @mgold's summary on Discourse):

  1. Report the count and/or percentage of rejected values when a Fuzz.invalid has been triggered in any way (Fuzz.filter, Fuzz.invalid by itself or inside Fuzz.andThen or any other way, etc.), so as to warn about inefficient fuzzers. This likely doesn't need to be configured in any way. (Perhaps it also belongs in a totally separate ticket, but I feel it's related.)

  2. Add a way to categorize the values and report the counts and/or percentages of these categories (to test the fuzzers). Something like Fuzz.label : (a -> String) -> Fuzzer a -> Fuzzer a might be a simple API, but by virtue of being an opaque function we don't know all the possible strings that might fall out of it. Representations like List (String, a -> Bool) are more amenable to exhaustive checking in this sense, see point 4.

  3. Be able to show examples of the categories (to test the classifying function / list of predicates). Unsure which module this should live in, but it might best live as its own function similar to Fuzz.examples, usable from the REPL and perhaps connected to the CLI tool. Something like showLabelExamples : List (String, a -> Bool) -> Fuzzer a -> List (String, Maybe a) perhaps, or if it's baked into the fuzzer, then just showLabelExamples : Fuzzer a -> List (String, Maybe a)? (Maybe with an integer specifying the number of tries; also maybe parameterized by seed or returning a Generator.)

  4. Be able to expect/demand certain coverage of those categories. A few axes here also: Nonzero vs certain percentage; just reporting a warning if not met vs running enough tests (ignoring runs : Int config) to be statistically confident ™️ [1] [2] [3] that it is/is not met. This requires the List variant of the classification API, to be able to list all possible classes. (It's incompatible with the (a -> String) way to classify fuzzers.)
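To make the "statistically confident" idea in point 4 concrete, here is a minimal Python sketch (not Elm, and not elm-test's actual algorithm). It uses a Wilson score lower bound, in the spirit of QuickCheck's checkCoverage, to decide whether an observed hit count is confidently at or above a required percentage; the function names are hypothetical:

```python
import math

def wilson_lower_bound(successes: int, n: int, z: float = 2.576) -> float:
    """Lower bound of the Wilson score interval for the true proportion
    (z = 2.576 is roughly a 99% confidence level)."""
    if n == 0:
        return 0.0
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

def coverage_confidently_met(successes: int, n: int, required_pct: float) -> bool:
    """True when the observed hits make us statistically confident that the
    label's true frequency is at least required_pct percent."""
    return wilson_lower_bound(successes, n) >= required_pct / 100.0

# 600 hits out of 1000 runs: confidently at least 50% coverage.
# 510 hits out of 1000 runs: too close to 50% to be confident either way,
# so a real runner would keep generating more values before deciding.
print(coverage_confidently_met(600, 1000, 50), coverage_confidently_met(510, 1000, 50))
```

The "run enough tests, ignoring runs" behaviour would then amount to generating values until the lower bound clears the threshold (coverage met) or the symmetric upper bound falls below it (coverage failed).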


First of all, is there something I'm missing? The Discourse post is long, John Hughes' talk has many ideas, there are various implementations in various PBT libraries...


Other than that, would you folks be OK with me working on these? Currently I favor something like

myFuzzer : Fuzzer Int
myFuzzer =
    Fuzz.intRange 1 20

justReportFuzzer : Fuzzer Int
justReportFuzzer =
    myFuzzer
        |> Fuzz.reportCoverage
            [ ( "Lower boundary (1)", \n -> n == 1 )
            , ( "Upper boundary (20)", \n -> n == 20 )
            , ( "In the middle (2..19)", \n -> n > 1 && n < 20 )
            , ( "Outside boundaries??", \n -> n < 1 || n > 20 )
            ]

expectCoverageFuzzer : Fuzzer Int
expectCoverageFuzzer =
    myFuzzer
        |> Fuzz.expectCoverage
            [ ( 10, "Lower boundary (1)", \n -> n == 1 )
            , ( 10, "Upper boundary (20)", \n -> n == 20 )
            , ( 50, "In the middle (2..19)", \n -> n > 1 && n < 20 )
            , ( 0, "Outside boundaries??", \n -> n < 1 || n > 20 )
            ]

mandateCoverageFuzzer : Fuzzer Int
mandateCoverageFuzzer =
    myFuzzer
        |> Fuzz.mandateCoverage
            [ ( 10, "Lower boundary (1)", \n -> n == 1 )
            , ( 10, "Upper boundary (20)", \n -> n == 20 )
            , ( 50, "In the middle (2..19)", \n -> n > 1 && n < 20 )
            , ( 0, "Outside boundaries??", \n -> n < 1 || n > 20 )
            ]

-- Usage in tests doesn't change, there is no Test/Expect API change:

justReport : Test
justReport =
    Test.fuzz justReportFuzzer "justReport" <| \_ -> Expect.pass

expectCoverage : Test
expectCoverage =
    Test.fuzz expectCoverageFuzzer "expectCoverage" <| \_ -> Expect.pass

mandateCoverage : Test
mandateCoverage =
    Test.fuzz mandateCoverageFuzzer "mandateCoverage" <| \_ -> Expect.pass

-- Possible to test the labels out in REPL:

> Fuzz.labelExamples { tries = 100, seed = 1 } mandateCoverageFuzzer
[ ( "Lower boundary (1)", Just 1 )
, ( "Upper boundary (20)", Just 20 )
, ( "In the middle (2..19)", Just 5 )
, ( "Outside boundaries??", Nothing )
]
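Since Fuzz.labelExamples is only a proposal, here is a rough Python model of the intended behaviour (hypothetical names; a real implementation would live inside the fuzzer runner): draw a batch of values from the generator and pick, for each label, the first value that matches its predicate.

```python
import random

def label_examples(labels, generate, tries=100, seed=1):
    """Rough model of the proposed Fuzz.labelExamples: draw `tries` values
    and return, per (name, predicate) pair, the first matching value or
    None when no generated value satisfies the predicate."""
    rng = random.Random(seed)
    samples = [generate(rng) for _ in range(tries)]
    return [
        (name, next((s for s in samples if pred(s)), None))
        for name, pred in labels
    ]

labels = [
    ("Lower boundary (1)",    lambda n: n == 1),
    ("Upper boundary (20)",   lambda n: n == 20),
    ("In the middle (2..19)", lambda n: 1 < n < 20),
    ("Outside boundaries??",  lambda n: n < 1 or n > 20),
]

# Stand-in for Fuzz.intRange 1 20: uniform integers in [1, 20], so the
# "Outside boundaries??" label can never produce an example (-> None).
examples = label_examples(labels, lambda rng: rng.randint(1, 20))
```

Returning Maybe/None for an unmatched label is exactly what makes this useful for testing the predicates themselves: a None where you expected an example signals a dead or mistyped predicate.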

@Janiczek (Collaborator) commented Jul 27, 2022

A few clarifications on my current intentions with the above:

  • 0 would be special in that any occurrence of the label would be counted as a failure (as opposed to other values meaning "at least that percentage")
  • with the List based approach a given value could accumulate multiple labels, like with Html.classList. For a list like
[ ( "green", isGreen )
, ( "red", isRed )
, ( "big", isBig )
]

we'd report all the occurring overlaps:

green, big: 8% (23x)
green: 21% (128x)
red: 32% (211x)

(e.g. red, big == 0x; red, green == 0x; red, green, big == 0x; big == 23x)
( ⚠️ the numbers are bogus )

What I'm not sure about is whether green should also count the occurrences from green, big, or whether they'd be separate. I think the latter makes more sense but it might be nice to spell it out for the user.
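The "each combination is its own bucket" interpretation (the latter option above) can be sketched in Python; the predicates and counts are made up to mirror the green/red/big example:

```python
from collections import Counter

# Hypothetical stand-ins for the (label, predicate) list above.
labels = [
    ("green", lambda v: v["color"] == "green"),
    ("red",   lambda v: v["color"] == "red"),
    ("big",   lambda v: v["size"] > 10),
]

def classify(value):
    """A value accumulates every matching label (like Html.classList);
    the full combination of labels is the reporting bucket."""
    return tuple(name for name, pred in labels if pred(value))

def coverage_report(values):
    counts = Counter(classify(v) for v in values)
    total = len(values)
    return {", ".join(combo): (n, round(100 * n / total, 1))
            for combo, n in counts.items()}

values = ([{"color": "green", "size": 20}] * 23     # green AND big
          + [{"color": "green", "size": 1}] * 128   # green only
          + [{"color": "red", "size": 1}] * 211)    # red only

report = coverage_report(values)
# "green, big" is a separate bucket from "green": overlaps don't double-count.
```

Under these semantics the "green" bucket excludes the "green, big" values, which is the behaviour that probably needs spelling out for the user.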

@gampleman (Contributor) commented

I'd like to reiterate from that discussion that I strongly believe the classification makes a lot more sense in the Expect than in the Fuzzer. What counts as what condition depends on the code under test, not on the fuzzer being used. Putting it in the fuzzer makes fuzzers less re-usable, which I think is a bad thing to encourage.

For instance, I love hacking on elm-geometry: it has such a well-developed library of reusable fuzzers that expressing new tests is typically super easy. I think we should try to guide users into that kind of design.

@Janiczek (Collaborator) commented

@gampleman In addition, I've realized during implementation that making classification live in Fuzzer has the additional problem of having to be done as a last step.

bad : Fuzzer Bool
bad =
  Fuzz.intRange 0 10
  |> Fuzz.reportCoverage [...]
  |> Fuzz.map ((==) 5)

can't retain the coverage request from the first step.
So I like that making the API Expect-centric makes the problem go away entirely.

@jonathanjouty commented Jul 29, 2022

@Janiczek This looks really nice.
A couple of thoughts below:

  • Fuzz.expectCoverage and Fuzz.mandateCoverage: (a bit late as you've already commented on this in the PR 😄 but posting here for posterity) I feel like we should aim to give users the "right" default option, and perhaps this means only providing mandateCoverage and reportCoverage (i.e. do not include expectCoverage).
    If a user wants to see the categories, then reportCoverage is sufficient. What would a warning look like? What extra information would a warning give the user? With a warning, I feel like once a user is done writing the fuzzer, the warning output will be relegated to the depths of CI runs and never looked at properly, and hence a failure is more useful.
  • Special case for 0 coverage: This might have a problem of discoverability. Perhaps a custom type could force the user to be explicit about their expectations? (say, something like NotPresent and Coverage Int).
  • Output numbers (as seen in Fuzzer classification and coverage #187) Personally the total number of runs and percentages are enough, the number of runs for each category adds noise to the output IMO (Haskell QuickCheck seems to include only total and percentages).

@Janiczek (Collaborator) commented Jul 29, 2022

I feel like we should aim to give users the "right" default option, and perhaps this means only providing mandateCoverage and reportCoverage (i.e. do not include expectCoverage).

For posterity (I'm editing the PR description a bunch): I went with reportCoverage for printing it out, and expectCoverage as the "mandate + fail if not satisfied" option. The option that only warned if not satisfied is gone.

Special case for 0 coverage

The current API I'm trying out is

type ExpectedCoverage
  = Zero -- "this variant is never generated!"
  | AtLeast Int -- the bread and butter option that most users will use
  | MoreThan Int

The MoreThan option is debatable, as I think it will only be useful for "more than zero" and at that point the user could probably choose an arbitrary percentage, like AtLeast 1. Should we have it at all?
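For illustration, here is a naive, non-statistical Python sketch of how the three variants might be checked (hypothetical names and semantics, mirroring the ExpectedCoverage type; everything is a percentage, never an absolute count):

```python
def check_coverage(requirements, counts, total):
    """requirements: (label, kind, pct) triples with kind in
    {"zero", "at_least", "more_than"}, mirroring the ExpectedCoverage
    variants. counts maps label -> occurrences over `total` runs.
    Returns failure messages; an empty list means coverage was satisfied."""
    failures = []
    for label, kind, pct in requirements:
        observed = 100 * counts.get(label, 0) / total
        if kind == "zero" and observed > 0:
            failures.append(f"{label}: expected 0%, saw {observed:.1f}%")
        elif kind == "at_least" and observed < pct:
            failures.append(f"{label}: expected at least {pct}%, saw {observed:.1f}%")
        elif kind == "more_than" and observed <= pct:
            failures.append(f"{label}: expected more than {pct}%, saw {observed:.1f}%")
    return failures

counts = {"Lower boundary (1)": 6, "In the middle (2..19)": 90}
failures = check_coverage(
    [("Lower boundary (1)", "at_least", 10),
     ("Outside boundaries??", "zero", 0)],
    counts,
    total=100,
)
# Only the under-covered lower boundary fails; the zero requirement holds.
```

Note how "more_than 0" and "at_least 1" differ only in the boundary case, which is the reason MoreThan's usefulness is debatable above.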

Personally the total number of runs and percentages are enough, the number of runs for each category adds noise to the output IMO

Fair, I can experiment and see how it looks without the totals!

@gampleman (Contributor) commented

What's the point of Zero? Also are counts better than percentages?

@Janiczek (Collaborator) commented

@gampleman The ExpectedCoverage values would all be percentages. I think it doesn't make sense to put absolute counts there, as those would need to change in your test source code every time you changed --runs.

@Janiczek (Collaborator) commented Jul 29, 2022

For @jonathanjouty: Here are the two versions (with and without counts) together:

[Screenshot, 2022-07-29 11:33: coverage report, first version]

[Screenshot, 2022-07-29 14:45: coverage report, second version]

I personally like the version with counts better, I'll leave this detail up for discussion for now.

@gampleman (Contributor) commented

Maybe we could add little bar charts? 😍

@Janiczek (Collaborator) commented

@gampleman What about these?

[Screenshot, 2022-07-29 21:19: coverage report with bar charts]

@Janiczek (Collaborator) commented Jul 29, 2022

Also I figured I never answered this.

What's the point of Zero?

So, I can imagine a situation where you're trying to fuzz a specific variant of your data - perhaps a balanced binary tree that isn't full.

So your fuzzer hopefully generates only a subset of all possible trees (it never generates full trees), and you'd want to check that fuzzer. In the particular test that needs non-full trees, to test something on them (perhaps some kind of insert operation), you would mandate this distribution of trees from your fuzzer:

Test.fuzzWith
  { runs = 10000
  , coverage = Test.expectCoverage [ (Test.zero, "full", Tree.isFull) ]
  }
  nonFullTreeFuzzer
  "insert does XYZ"
  <| \tree -> ...

Is it too convoluted as an example usecase for Zero?

@gampleman (Contributor) commented Jul 29, 2022

I'm not sure why you wouldn't just use Expect.fail...

Why not put the categorisation into Expect? Right now it's in Test, which is kind of weird since it's only accessible through fuzzWith, and that makes it more awkward to use with fuzz2 and friends (of course the more knowledgeable might work out that fuzz2 can be achieved with Fuzz.tuple, but it's an extra conceptual step).

The API, I would think, would just be Expect.label : String -> Expectation -> Expectation; the asserting version could then be the same, but without the need to have the classification functions.

I think this is sensible, since the main usage would anyway be:

myTest = Test.fuzz myfuzzer "custom type" <|
   \data ->
        case data of
            Foo xyzzy ->
                foo xyzzy
                       |> Expect.greaterThan 30
                       |> Expect.label "foo"
            Bar [single] ->
                    ...
                       |> Expect.label "single bar"

etc.

Edit: then reportCoverage could just be kind of like skip, i.e. a little toggleable thingy you can tack on to a test when you're working on it...

@Janiczek (Collaborator) commented Jul 29, 2022

With the Expect.label : String -> Expectation -> Expectation version, I wonder whether there are cases where you'd want to label a value conditionally in a way that doesn't fully mirror the conditions the expectation itself branches on. Would we be locking ourselves out of some potentially useful expressiveness?

I guess you could still do

myTest = Test.fuzz myfuzzer "custom type" <|
   \data ->
       (case data of
            Foo xyzzy ->
                foo xyzzy
                       |> Expect.greaterThan 30
                       |> Expect.label "foo"
            Bar [single] ->
                    ...
                       |> Expect.label "single bar"
        )
        |> Expect.labelIf isBleh "bleh"
        |> Expect.labelIf isBlah "blah"

at the end of the test; it just isn't very nice with the need to parenthesize.
The other way round, with the labelIf at the beginning and <|, it adds indentation, which is also not very beautiful.

I'm guessing the Test and Expect ways to do this are roughly equivalent, but:

  • Expect
    ➕ allows you to label "contextually" in the condition paths created for the purpose of the expectation itself
    ➕ reportCoverage is simpler: Test.reportCoverage <| Test.fuzz ...
    ➖ conditional labeling that doesn't map to expectation conditions needs parentheses or indentation
  • Test
    ➖ touches the FuzzOptions record (more things to upgrade for folks)
    ➕ conditional labeling doesn't get in your way syntax-wise

Anything that I'm missing?

@gampleman (Contributor) commented

I'm wondering if we don't need to experiment with this feature in practice to get enough perspective to decide this correctly. I'm not confident there's a good use case for labelling "independently", since the purpose is to explore different code paths, but perhaps the conditionals are inside the code you are testing and it's more convenient to do the categorisation outside.

I think another disadvantage of the way it's set up in Test is that it doesn't naturally lead to exhaustiveness checking by the compiler, unlike a more standard implementation using a case statement.

Furthermore I think that ergonomically I’d like to be able to leave the logic of labelling in the code, but have an easy way to turn on/off reporting.
