Skip values (moreso, RandomRuns) we have already tested #207

Draft · wants to merge 8 commits into base: master
Conversation

@Janiczek (Collaborator) commented Oct 13, 2022:

WIP

Closes #11 without needing a major version bump, since we don't add a new Failure variant.

Allows FuzzOptions.runs * 2 skips of values already seen.
This, ironically, speeds up our fuzzer distribution machinery as well.

Exhaustive checking (#188) would help us save even more unneeded work (we'd know we can stop skipping) -- and I want to do that -- but I'm not trying that in this PR. The skipping works reasonably well even without it.


Should be working already! Now just measuring whether this has any noticeable impact, with tests/randomized-tests.sh.
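To make the skipping idea concrete, here is a rough sketch of the loop in Python. This is purely illustrative -- the actual implementation is in Elm, and the names `fuzz_loop`, `generate`, `run_test`, and `skip_multiplier` are all made up for this sketch:

```python
def fuzz_loop(generate, run_test, runs, skip_multiplier=2):
    """Run `runs` test executions, skipping inputs we've already tested.

    `generate` returns a (random_run, value) pair. At most
    runs * skip_multiplier duplicates are skipped; once the skip
    budget is exhausted, duplicates are run anyway (mirroring the
    FuzzOptions.runs * 2 budget described above).
    """
    seen = set()  # RandomRuns we've already run the test function on
    skips_allowed = runs * skip_multiplier
    runs_elapsed = 0
    runs_skipped = 0
    while runs_elapsed < runs:
        random_run, value = generate()
        key = tuple(random_run)
        if key in seen and runs_skipped < skips_allowed:
            runs_skipped += 1  # skip only the test function call
            continue
        seen.add(key)
        run_test(value)
        runs_elapsed += 1
    return runs_elapsed, runs_skipped
```

Note that only the call to the test function is skipped; generation still happens for every iteration, which is why skipping trades a bit of generation overhead for more diverse inputs.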

@gampleman (Contributor) left a comment:

Do we have any concerns about memory use, particularly for large run count scenarios?

Comment on lines 22 to 28
gen : List Int -> Fuzzer a -> GenResult a
gen randomRun fuzzer =
generate (PRNG.hardcoded (RandomRun.fromList randomRun)) fuzzer


genR : Int -> Fuzzer a -> GenResult a
genR seed fuzzer =
@gampleman (Contributor):

I think these could use a more descriptive name 😛

@Janiczek (Collaborator, author):

Sorry, these shouldn't be in this PR 😅 I just wanted to check stuff in my REPL, and forgot about them when `git add .`-ing.

@Janiczek (Collaborator, author):

Hopefully they're now all gone!

@@ -282,7 +309,7 @@ allSufficientlyCovered c state normalizedDistributionCount =
True

AtLeast n ->
-            Test.Distribution.Internal.sufficientlyCovered state.runsElapsed count (n / 100)
+            Test.Distribution.Internal.sufficientlyCovered (state.runsElapsed + state.runsSkipped) count (n / 100)
@gampleman (Contributor):

Does it make sense to include the skipped runs in the distribution coverage? I'm not saying no, but I think it's a question with a non-obvious answer...

@Janiczek (Collaborator, author):

I think the answer should be yes: we already know what the fuzzed value was when skipping, so we can count it into the fuzzer distribution. What we skipped was running the test function.

@gampleman (Contributor):

But is it valuable information? Imagine that I want to classify even/odd, and my generator generates the following:

0, 1, 2, 4, 1, 6, 1, 8, 1, 10, 1, 12, 1

Then I'll see that I had a 50%/50% distribution of even/odd, but does it give me any more confidence about what I'm testing?
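For what it's worth, the counts in that example sequence can be checked directly (plain Python, just illustrating the point being made):

```python
# The generated sequence from the even/odd example above.
values = [0, 1, 2, 4, 1, 6, 1, 8, 1, 10, 1, 12, 1]

# Counting every generated value: looks roughly balanced.
even = sum(1 for v in values if v % 2 == 0)  # 7
odd = len(values) - even                     # 6

# Counting unique values only: heavily skewed towards even.
unique = set(values)
even_unique = sum(1 for v in unique if v % 2 == 0)  # 7
odd_unique = len(unique) - even_unique              # 1
```

So the all-values view reports roughly even/odd parity, while the unique-values view reveals a 7:1 skew -- which is the crux of the disagreement below.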

@Janiczek (Collaborator, author):

I believe fuzzer distribution is about classifying what the fuzzer gives you, regardless of the test function (`... -> Expectation`). I.e. you could just as well run `Fuzz.examples 10000 myFuzzer` in the REPL, do some statistics on that, and get the same result.

Test skipping only deals with skipping the test function, not skipping the fuzzing. This PR shouldn't change the behaviour of the fuzzer distribution reporting.

I believe I understand your issue, but it seems to me it's unrelated to this PR.

@gampleman (Contributor) commented Oct 14, 2022:

> it seems to me it's unrelated to this PR

That's fair enough. I think the main relation here is that the implementation here could implement both concerns by simultaneously being simpler. However, if you don't feel like wading into those waters in this PR, that's fine.

@Janiczek (Collaborator, author) commented Oct 14, 2022:

> I think the main relation here is that the implementation here could implement both concerns by simultaneously being simpler.

I think you're suggesting we remove skipped values from the distribution counts. But wouldn't that invalidate the claim that the distribution counts deal with what the fuzzer returns (again, thinking about Fuzz.examples 10000 myFuzzer)? That would change it into "unique values the fuzzer returns". I don't know if that's helpful.

It would basically change

Distribution report:
====================
  2-19:         90.1%  (27007x)  ███████████████████████████░░░
  1:               5%   (1485x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  20:            4.9%   (1468x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  <1, >20:         0%      (0x)  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

into

Distribution report:
====================
  2-19:           90%  (18x)  ███████████████████████████░░░
  1:               5%   (1x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  20:              5%   (1x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  <1, >20:         0%   (0x)  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

That to me is a different feature 🤷

@gampleman (Contributor):

Unique values is what we care about though isn't it? So distribution of unique values is the valuable thing that we should be highlighting to developers.

Also, I disagree that distributions are just about the Fuzzer -- they are about giving you confidence that your fuzz test is adequately sampling the problem space and not just focusing on some particular corner of it.


Additionally there could be some nice highlighting of overproduction of duplicates:

Distribution report:
====================
  2-19:           90%  (18x)  ███████████████████████████░░░
  1:               5%   (1x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ⚠️ Fuzzer produces 1480 duplicated values
  20:              5%   (1x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  <1, >20:         0%   (0x)  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

or some such, but that's perhaps just overcomplicating things for little benefit.

@Janiczek (Collaborator, author):

> Unique values is what we care about though isn't it? So distribution of unique values is the valuable thing that we should be highlighting to developers.

That's where our understanding differs currently -- I'll think about it and give the idea a chance! Perhaps you're right.

}
, failure = Just failure
}

Nothing ->
if state.runsElapsed < c.runsNeeded then
if state.runsElapsed < c.runsNeeded && state.runsSkipped < c.skipsAllowed then
@gampleman (Contributor):

Shouldn't there be some sort of distinct failure when the skipsAllowed is exceeded? That basically means your Fuzzer is kind of rubbish, no?

@Janiczek (Collaborator, author):

skipsAllowed exceeded could mean you've exhausted all possible values (or at least, finding new values is improbable). Consider Fuzz.bool or Fuzz.intRange 1 10: over 100 runs those could skip 98 times and 90 times respectively (depending on your settings).
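Those numbers are easy to sanity-check with a quick simulation (plain Python; `count_skips` is a made-up helper for illustration, not part of the library):

```python
import random

def count_skips(domain_size, runs, rng):
    """How many of `runs` uniform draws repeat an already-seen value."""
    seen = set()
    skips = 0
    for _ in range(runs):
        v = rng.randrange(domain_size)
        if v in seen:
            skips += 1
        else:
            seen.add(v)
    return skips

# A Fuzz.bool-like fuzzer: only 2 possible values, so ~98 of 100 runs repeat.
bool_skips = count_skips(2, 100, random.Random(0))

# A Fuzz.intRange 1 10-like fuzzer: 10 possible values, so ~90 of 100 repeat.
int_skips = count_skips(10, 100, random.Random(0))
```

Since there are at most `domain_size` unique values, the repeat count is always at least `runs - domain_size`, which is exactly why a tiny domain exhausting its skip budget is expected rather than a sign of a broken fuzzer.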

@Janiczek (Collaborator, author) commented Oct 14, 2022:

For this reason I don't think we should fail. If anything, we could stop early, but let's do that when implementing exhaustiveness. This way, skipping basically only lets you fuzz more diverse inputs and avoids wasting time on things you already know the answer to.

@gampleman (Contributor):

I guess we can punt on this question until we implement exhaustiveness. Because in those cases the test runner should just switch to exhaustive checking, so this kind of failure should be reserved for more pathological fuzzers:

Fuzz.floatRange 0 1
    |> Fuzz.andThen
        (\p ->
            if p < 0.9 then
                Fuzz.constant 0

            else
                Fuzz.intRange 0 100
        )

which would just be kind of bad, but not really exhaustive...

@@ -75,25 +75,19 @@ fromExpectation labels expectation summary =
expectation
|> Test.Runner.getDistributionReport
|> Runner.String.Distribution.report labels

summaryWithDistribution : Summary
@Janiczek (Collaborator, author):

Changes in this module improve the visuals of how we present the fuzzer distribution tables in our homebrew test runner. (This is not exposed to users in any way.)

@Janiczek (Collaborator, author):

> Do we have any concerns about memory use, particularly for large run count scenarios?

I can try to measure that.

For now, here are runtime counts. I've tried this over multiple repos and the result is roughly the same everywhere: skipping doesn't give any performance benefit (it adds overhead, if anything); all it does is give you more diverse inputs.

[Screenshots: runtime measurement tables, 2022-10-14]

@gampleman (Contributor):

What number of runs are those charts for?

@Janiczek (Collaborator, author) commented Oct 14, 2022:

@gampleman 50 runs per configuration

@gampleman (Contributor):

I wonder if 50 is too low to:

a) manifest all that much duplication
b) exert significant memory pressure from potentially large sets of generated values

Could we also test with say 10,000 runs?

@Janiczek (Collaborator, author) commented Oct 14, 2022:

> Could we also test with say 10,000 runs?

Each run of the elm-test test suite (where various fuzzers differ; some have runs=100, some runs=10000) is about 10s on average. Getting 10k test-suite runs would take me 27 hours straight (unless somehow parallelized).
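The 27-hour figure follows directly from the averages just mentioned (plain arithmetic):

```python
# Back-of-the-envelope check of the 27-hour estimate above.
suite_runs = 10_000       # desired number of full test-suite runs
seconds_per_run = 10      # observed average per suite run
hours = suite_runs * seconds_per_run / 3600
# hours is roughly 27.8
```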

EDIT: sorry if I previously led you to believe the number 50 was for the runs=... configuration!
Basically, what the charts above show is: I run something akin to `for SEED in {1..50}; do time elm-test --seed=$SEED; done` and collect the numbers into a table. I do this for various configurations (master, this PR with multiplier 1, 2, 5, 10) to compare the effect on runtime across a large test suite.

@gampleman (Contributor):

Ah OK, that makes sense. I was a bit surprised you were doing fewer than the default 100...

@Janiczek (Collaborator, author):

Out of curiosity I'll try another set of 50 seeds × {master, skip01, skip02, skip05, skip10} with the --fuzz=10000 option active, to see whether perhaps there is some difference there. But I'd only expect performance savings where the overhead of generating and skipping extra values is lower than that of running the test on the would-be-skipped values -- this PR is not about short-circuiting 🙂

@Janiczek (Collaborator, author):

Tried the same thing with defaultRuns = 10000, as mentioned above.

[Screenshots: runtime measurement tables for the defaultRuns = 10000 experiment]

So e.g. in test-suite runs taking ~18s, the skip x2 approach added ~1.5s to the runtime.

@gampleman (Contributor):

Yeah, I think we would need some nice test for the F-metric to be able to also see the upside of a PR like this. FWIW I don't think that's too bad a perf degradation, and it ultimately helps with the mission of actually finding bugs, so 👍

@Janiczek (Collaborator, author):

> F-metric

I think I vaguely recall this being mentioned in our discussion of quasirandom numbers? I'm a stats noob -- could you please share some Wikipedia/paper links for its definition? My googling returns F-score (not sure that's what you're talking about) and F-metric as some kind of web-development term (that definitely isn't it) :)

@gampleman (Contributor) commented Oct 17, 2022:

It's a horrible name, since it's almost impossible to google for. I mentioned it in here:

> [F-metric is] defined as the number of test cases the test system needs to generate before a defect is uncovered

So I suspect we could build a benchmark of fuzz tests that fail in various circumstances, run them with a high number of runs, and report how many runs it took to find each failure condition. I suspect this PR would then show a potentially substantial improvement on such a benchmark.

You could also report not the number of runs but the time it took to find each bug. That would be a fairly sensible performance metric, since it's in some ways more meaningful than how long it takes to generate and run some arbitrary number of cases.
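That benchmark idea could be sketched like this (Python, purely illustrative; `f_metric` and the toy failing property are made up for this sketch):

```python
def f_metric(generate, property_fails, max_runs=10_000):
    """Number of generated cases before the first failure, or None
    if no failure is found within max_runs."""
    for n in range(1, max_runs + 1):
        if property_fails(generate()):
            return n
    return None

# Toy example: the fourth generated case is the first one that
# violates the property, so the F-metric here is 4.
cases = iter([3, 7, 12, 1000, 5])
n = f_metric(lambda: next(cases), lambda v: v % 1000 == 0)
# n == 4
```

A skipping fuzzer should tend to lower this number, since duplicated cases that cannot reveal a new defect no longer consume runs.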

Successfully merging this pull request may close these issues: "Hash fuzzed input to detect duplicates?"